Imbalanced datasets are a common and annoying problem you encounter in almost every machine learning project. An imbalanced dataset occurs when the classes in your target variable are not represented equally. For instance, in a binary classification problem, you might have 90% of the data belonging to one class and only 10% to the other. This can cause your model to perform poorly, especially on the minority class.
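Before we dive into the story, here is a quick way to see class imbalance in code. This is only a minimal sketch, assuming scikit-learn is installed; it builds a toy 90/10 dataset and simply counts the labels.

```python
# Minimal sketch (assumes scikit-learn): build a toy 90/10 imbalanced
# dataset and inspect the class distribution.
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_classes=2,
    weights=[0.9, 0.1],   # roughly 90% class 0, 10% class 1
    random_state=42,
)
print(Counter(y))  # e.g. Counter({0: 897, 1: 103})
```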
Let me explain this with a real-world example that everyone can easily understand.
Scene: A Regular Market Visit
You and your mother step into a vegetable market. She has a shopping list with four items:
✅ Tomatoes — 10 kg
✅ Onions — 8 kg
✅ Potatoes — 6 kg
✅ Green Chilies — 500g
But when you enter the shop, there is an issue:
🍅 Tomatoes are everywhere — huge piles!
🧅 Onions and Potatoes are in good quantity — easy to find.
🌶️ Green Chilies are barely there — only a handful left.
Your mother sighs, “Why are there so many tomatoes and so few green chilies?”
You smile and say, “This is exactly like data imbalance in machine learning!”
Data imbalance happens when one category (class) has a lot of samples while another has very few.
Here, tomatoes represent a majority class (most frequent data points).
Green chilies represent a minority class (rare data points).
If a machine learning model were trained on this data, it would mostly learn about tomatoes and onions but struggle to recognize green chilies.
Your mother nods, “That makes sense! But how do we fix this problem?”
Now, let us explore some techniques that help balance the dataset, just like how we solve the green chili shortage while shopping.
Over-Sampling (Buying More from Another Shop)
Problem: The store does not have enough green chilies.
Solution: You go to another store and buy more green chilies.
This is like over-sampling, where we artificially increase the number of minority class samples by duplicating them or generating synthetic ones.
Example in ML: If a dataset has very few fraudulent transactions (rare class), we can generate more similar fraud cases to balance the data.
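Here is a minimal sketch of random over-sampling, assuming the imbalanced-learn package and the toy `X`, `y` from earlier; `RandomOverSampler` simply duplicates minority samples until the two classes match in size.

```python
# Minimal sketch of random over-sampling (assumes imbalanced-learn).
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)   # duplicates minority samples at random
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y_res))                      # both classes now have equal counts
```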
Under-Sampling (Buying Fewer Majority Items)
Problem: There are too many tomatoes, but you need a balanced shopping bag.
Solution: Instead of buying 10 kg tomatoes, you buy only 5 kg.
This is like under-sampling, where we remove some samples from the majority class to make it more balanced with the minority class.
Example in ML: If there are 10,000 spam emails and only 500 non-spam emails, we can randomly drop some of the spam emails (the majority class here) to balance the dataset.
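A minimal sketch of random under-sampling, again assuming imbalanced-learn and the toy `X`, `y` from earlier; you could just as easily subsample the majority class yourself with pandas or NumPy.

```python
# Minimal sketch of random under-sampling (assumes imbalanced-learn).
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)  # drops majority samples at random
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))                      # majority shrunk to match the minority
```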
SMOTE (Synthetic Minority Over-sampling Technique)
Creating Artificial Green Chilies!
Your mother frowns, “What if we cannot find more green chilies anywhere?”
Problem: Green chilies are rare, and there are no other stores.
Solution: The shopkeeper suggests an alternative — he gives you dried green chilies or makes a chili paste, simulating fresh ones.
This is like SMOTE (Synthetic Minority Over-sampling Technique), where instead of duplicating existing data, we generate synthetic samples based on existing ones.
Example in ML: If a dataset lacks enough fraud cases, SMOTE can create new synthetic fraud patterns based on the existing ones.
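Here is a minimal sketch of SMOTE, assuming imbalanced-learn and the toy `X`, `y` from earlier; the new samples are interpolated between existing minority points and their nearest neighbours rather than copied.

```python
# Minimal sketch of SMOTE (assumes imbalanced-learn): synthetic minority
# samples are interpolated between existing minority points and neighbours.
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))                      # synthetic samples fill the gap
```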
Class Weighting (Giving More Importance to Green Chilies)
Your mother says, “But green chilies are very important in our cooking. Even if they are fewer, they matter a lot!”
Solution: The shopkeeper marks green chilies as premium items, making them more valuable despite being fewer.
This is like class weighting, where we give higher importance to the minority class so that the model pays extra attention to it.
Example in ML: When training a classifier, we assign higher weights to rare diseases so that the model does not ignore them.
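A minimal sketch of class weighting with scikit-learn, using the toy `X`, `y` from earlier; `class_weight="balanced"` scales each class's contribution to the loss inversely to its frequency.

```python
# Minimal sketch of class weighting (assumes scikit-learn).
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)   # misclassifying a minority sample now costs more

# You can also pass explicit weights, e.g. {0: 1, 1: 9} for a 90/10 split.
```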
Choosing the Right Algorithm (Smart Shopping Strategy)
Your mother realizes that this store always runs out of green chilies.
Solution: Next time, she chooses a different shop where all vegetables are stocked properly.
This is like choosing better ML algorithms that handle data imbalance well, such as:
✔ Decision Trees
✔ Random Forest
✔ XGBoost
Example in ML: Instead of relying on plain Logistic Regression, which tends to favor the majority class on imbalanced data, we can use Random Forest or XGBoost, which usually handle the skew better, especially when combined with class weights.
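As a minimal sketch (assuming scikit-learn and the toy `X`, `y` from earlier), here is a Random Forest with balanced class weights; the same idea applies to XGBoost through its `scale_pos_weight` parameter.

```python
# Minimal sketch: a Random Forest with balanced class weights, which tends
# to cope better with skewed classes than an unweighted linear model.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced",   # pair a robust algorithm with class weighting
    random_state=42,
)
rf.fit(X, y)
# With XGBoost, you would set scale_pos_weight to roughly
# (number of majority samples) / (number of minority samples).
```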
A Balanced Shopping Bag = A Balanced Dataset!
By carefully handling the imbalance, your mother gets all items for cooking. Similarly, by applying these techniques, a machine learning model learns from both common and rare data points.
Your mother smiles, “So, just like I need balanced groceries, machine learning needs balanced data!”