6/12/2023

Smote data creator

There are some extreme cases in which the class ratio is just wrong, for example, a dataset where 95% of the labels belong to class A while the remaining 5% fall under class B, a ratio not so rare in use cases such as fraud detection. In these extreme cases, the ideal course of action would be to collect more data. However, this is typically not feasible; in fact, it’s costly, time-consuming and, in most cases, impossible. Luckily for us, there’s an alternative known as oversampling. Oversampling involves using the data we currently have to create more of it.

Data oversampling is a technique applied to generate data in such a way that it resembles the underlying distribution of the real data. In this article, I explain how we can use an oversampling technique called Synthetic Minority Over-Sampling Technique, or SMOTE, to balance out our dataset. SMOTE is an oversampling algorithm that relies on the concept of nearest neighbors to create its synthetic data. Since its introduction by Chawla et al., SMOTE has become one of the most popular algorithms for oversampling.

The simplest case of oversampling is simply called oversampling or upsampling: a method that duplicates randomly selected observations from the outnumbered class. Oversampling’s purpose is for us to feel confident that the data we generate are real examples of already existing data. This inherently comes with the issue of creating more of the same data we currently have, without adding any diversity to our dataset, which can produce effects such as overfitting. Hence, if overfitting affects our training due to randomly generated, upsampled data, or if plain oversampling is not suitable for the task at hand, we could resort to another, smarter oversampling technique known as synthetic data generation.

Synthetic data is intelligently generated artificial data that resembles the shape or values of the data it is intended to enhance. Instead of merely making new examples by copying the data we already have (as explained in the last paragraph), a synthetic data generator creates data that is similar to the existing data. Creating synthetic data is where SMOTE shines. To show how SMOTE works, suppose we have an imbalanced two-dimensional dataset, such as the one in the next image, and we want to use SMOTE to create new data points.

In this tutorial, I explain how to balance an imbalanced dataset using the package imbalanced-learn. First, I’ll create a perfectly balanced dataset and train a machine learning model with it, which I’ll call our “base model”. Then, I’ll unbalance the dataset and train a second system, which I’ll call an “imbalanced model”. Finally, I’ll use SMOTE to balance out the dataset and fit a third model with it, which I’ll name the “SMOTE’d” model. By training a new model at each step, we’ll be able to better understand how an imbalanced dataset can affect a machine learning system. Example code for this article may be found at the Kite Blog repository.

For the initial task, I’ll fit a support-vector machine (SVM) model using a created, perfectly balanced dataset. I chose this kind of model because of how easy it is to visualize and understand its decision boundary, namely, the hyperplane that separates one class from the other.

To generate a balanced dataset, I’ll use scikit-learn’s make_classification function, which creates n clusters of normally distributed points suitable for a classification problem. My fake dataset consists of 700 sample points, two features, and two classes. To make sure each class is one blob of data, I’ll set the parameter n_clusters_per_class to 1. To simplify it, I’ll remove the redundant features and set the number of informative features to 2. Lastly, I’ll set flip_y=0.06 to control the amount of label noise. The dataset can then be created with these settings and plotted using Python’s Matplotlib.

Figure: Balanced model and SMOTE’d model hyperplanes. The left image shows the decision boundary of the original model, while the right one displays that of the SMOTE’d model.
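The balanced dataset and base model described in this tutorial can be sketched roughly as follows. The parameter values (700 points, two informative features, no redundant features, one cluster per class, flip_y=0.06) come from the text; the fixed random_state, the linear kernel, and the variable names are my own assumptions, so treat this as an illustration rather than the article’s original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Values taken from the text: 700 points, two features (both informative),
# no redundant features, two classes, one cluster per class, flip_y=0.06.
X, y = make_classification(
    n_samples=700,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=1,
    flip_y=0.06,
    random_state=0,  # assumption: fixed seed so the sketch is reproducible
)

# Assumption: a linear kernel, so the decision boundary is a straight line
# that is easy to draw on a two-dimensional scatter plot.
base_model = SVC(kernel="linear").fit(X, y)

# Class counts are close to 350 each; flip_y reassigns a few labels at random.
print(X.shape, np.bincount(y))
```

To plot the points, a Matplotlib call such as plt.scatter(X[:, 0], X[:, 1], c=y) is enough for a first look at the two blobs.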
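To make the nearest-neighbor idea behind SMOTE more concrete, here is a minimal NumPy sketch of the interpolation step: pick a minority sample, pick one of its k nearest minority neighbors, and place a new point somewhere on the segment between them. The function name smote_sketch, its parameters, and the toy 95/5 dataset are hypothetical and only for illustration; this is a simplified approximation, not the imbalanced-learn implementation.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest per sample

    base = rng.integers(0, n, size=n_new)  # random minority sample
    pick = rng.integers(0, k, size=n_new)  # one of its k neighbors
    neighbor = neighbors[base, pick]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[neighbor] - X_min[base])

# Toy data with the 95%/5% ratio mentioned in the text (values are made up).
rng = np.random.default_rng(1)
X_major = rng.normal(loc=0.0, scale=1.0, size=(95, 2))
X_minor = rng.normal(loc=3.0, scale=0.5, size=(5, 2))

# Generate enough synthetic minority points to match the majority class.
X_new = smote_sketch(X_minor, n_new=len(X_major) - len(X_minor), k=3)
print(X_new.shape)  # (90, 2)
```

Because every synthetic point is a blend of two real minority points, the new data stays inside the region the minority class already occupies instead of duplicating existing rows. In practice, the imbalanced-learn package provides this technique as imblearn.over_sampling.SMOTE, whose fit_resample(X, y) method returns the resampled features and labels.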