Unbalanced data handling in machine learning
Unbalanced data
Unbalanced data in this context refers to classification problems where we have unequal numbers of examples for the different classes. When working with disease data, where we typically have many more healthy control samples than disease cases, unbalanced data is extremely likely. In fraud detection the asymmetry is even more extreme: the vast majority of credit card transactions are legitimate and only a very small percentage are fraudulent.
Why does machine learning have a problem with unbalanced data?
Most machine learning classification algorithms are sensitive to imbalance in the target classes. Suppose we have 180 benign samples but only 20 malignant ones. A model trained and tested on such a dataset could predict “benign” for every sample and still achieve 90% accuracy. An unbalanced dataset therefore biases the prediction model towards the more prevalent class.
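As a minimal illustration in R (with simulated labels, purely for demonstration), a classifier that ignores its inputs and always predicts the majority class already reaches 90% accuracy on such a split:

```r
# Simulated target: 180 benign and 20 malignant cases
labels <- factor(c(rep("benign", 180), rep("malignant", 20)))

# A "model" that always predicts the majority class
predictions <- factor(rep("benign", length(labels)), levels = levels(labels))

# Overall accuracy looks high (0.9) even though no malignant case is detected
mean(predictions == labels)
```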
Balancing data for modeling
The theoretical principles behind over- and under-sampling are relatively straightforward:
· Under-sampling: We randomly select a subset of samples from the class with more instances to match the number of samples in the smaller class. For example, if we have 20 malignant cases, we would randomly choose 20 of the 180 benign cases. The fundamental drawback of under-sampling is that we lose potentially important information from the samples that are left out.
· Over-sampling: To equalize the number of samples in each class, we randomly duplicate samples from the class with fewer instances, or generate additional instances based on the data we already have. With this approach we avoid losing information, but we run the risk of overfitting the model, because duplicated samples are likely to appear in both the training and test data, so the test data is no longer independent of the training data. This would lead us to overestimate the model’s performance and generalizability. A code sketch of both approaches follows this list.
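As a sketch of both ideas in R, caret’s downSample() and upSample() helpers can be applied to a hypothetical training set train_data with a factor column classes (the data frame and column names are assumptions for illustration):

```r
library(caret)

set.seed(42)

# Hypothetical imbalanced training data: 180 benign vs. 20 malignant cases
train_data <- data.frame(
  x1      = rnorm(200),
  x2      = rnorm(200),
  classes = factor(c(rep("benign", 180), rep("malignant", 20)))
)

# Under-sampling: randomly drop benign cases until both classes have 20 samples
under <- downSample(x = train_data[, c("x1", "x2")],
                    y = train_data$classes, yname = "classes")
table(under$classes)

# Over-sampling: randomly duplicate malignant cases until both classes have 180 samples
over <- upSample(x = train_data[, c("x1", "x2")],
                 y = train_data$classes, yname = "classes")
table(over$classes)
```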
SMOTE or ROSE
In addition to plain over- and under-sampling, hybrid approaches exist that combine under-sampling with the generation of additional synthetic data. The two best-known ones are SMOTE (Synthetic Minority Over-sampling Technique) and ROSE (Random Over-Sampling Examples, a package for binary imbalanced learning). In caret, for example, they can be applied during resampling by setting sampling = "smote" or sampling = "rose" in trainControl().
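A sketch of how this could look with caret, reusing the hypothetical train_data from above (the sampling options in trainControl() require the corresponding packages, e.g. ROSE, to be installed):

```r
library(caret)

set.seed(42)

# Cross-validation settings; the sampling step is applied inside each resampling fold
ctrl_rose <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                          sampling = "rose")

model_rose <- train(classes ~ ., data = train_data,
                    method = "glm",
                    trControl = ctrl_rose)

# The same idea with SMOTE instead of ROSE
ctrl_smote <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                           sampling = "smote")

model_smote <- train(classes ~ ., data = train_data,
                     method = "glm",
                     trControl = ctrl_smote)
```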
WEIGHT_COLUMN
The column that, if present, specifies the observation weight for each row. The values in this column must be greater than or equal to 0, and rows with higher weights are considered more important. The weights affect both model training and model scoring through the use of weighted metrics.
The weight column is not used when producing test set predictions, although scoring of the test set predictions can use the weights.
The weight column gives the minority class more influence without increasing the amount of data: if the prediction for a positive row is wrong, that row incurs a higher loss, which can be preferable to over- or under-sampling. A common heuristic for choosing the weight is to take the square root of the class ratio. For instance, if the ratio of class 0 to class 1 is 30 to 1 (a/b), then sqrt(30) ≈ 5.5, i.e. roughly 5 to 1: a weight of 5 for each class-1 row and a weight of 1 for each class-0 row.
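A minimal sketch of the same idea in R, applying the square-root heuristic to the hypothetical train_data from above (the weight vector name and the use of caret’s weights argument are illustrative; not every model type supports case weights):

```r
library(caret)

# Ratio of majority (benign) to minority (malignant) cases: 180 / 20 = 9
ratio <- sum(train_data$classes == "benign") / sum(train_data$classes == "malignant")

# Square-root heuristic for the minority-class weight: sqrt(9) = 3
minority_weight <- sqrt(ratio)

# One weight per row: minority rows get the larger weight, majority rows get 1
model_weights <- ifelse(train_data$classes == "malignant", minority_weight, 1)

# Models that support case weights can take them via caret's weights argument
model_weighted <- train(classes ~ ., data = train_data,
                        method = "glm",
                        weights = model_weights,
                        trControl = trainControl(method = "repeatedcv",
                                                 number = 10, repeats = 5))
```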
We have already seen how these different strategies can affect model performance. Sensitivity (or recall) describes the proportion of benign cases that have been correctly predicted, whereas specificity describes the proportion of malignant cases that have been correctly predicted. Precision describes the true positives, i.e. the proportion of benign predictions that actually came from benign samples. F1 is the harmonic mean of sensitivity/recall and precision.
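These metrics can be read off a caret confusion matrix, for example (model_rose comes from the sketch above; the held-out test_data with its classes column is an assumed, hypothetical test set):

```r
library(caret)

# Predict on held-out data and compare against the true classes
preds <- predict(model_rose, newdata = test_data)

cm <- confusionMatrix(data = preds,
                      reference = test_data$classes,
                      positive = "benign",
                      mode = "prec_recall")

# Sensitivity/recall, specificity, precision and F1 for the positive ("benign") class
cm$byClass[c("Recall", "Specificity", "Precision", "F1")]
```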
Here, compared to the original model, all five techniques increased specificity and precision, and the weight column and ROSE sampling also improved the F1 score.