Prediction Model for an unbalanced Dataset

Please share this blog post
Share on Facebook
Tweet about this on Twitter
Share on LinkedIn


Brake test is a time consuming and costly operation for the factories. The plant needs to shut down the robots to perform the test. It worth exploring to see if it is possible to predict the test results based on historical data. At the time of this writing, we approach this problem as a binary classification problem.

Data Preprocessing

The receiving data needs to be cleaned and wrangled to be ready to be fed to a machine learning model. In this section, the necessary steps are explained.

Feature Engineering

Brake test data is in the form of a time series as it records several features for each robot during the brake test over time. There are limited tools and methods suitable for working with time series so that would be helpful if we could transform the problem to classification problem which is easier to solve. To do that, a moving window can be used. For example, averaging `torque_peak` for the past 5 days (excluding the current day) and assign it to a variable in the current day, see the table below for more clarity. Please note we do that for each axis and individual robots separately, so in the below example, all other axis and robots are not being shown.

row no. torque_peak torque_peak_avg_5 axis_test_date robot_serial axis_number
1 19.0936 Avg of row 2 to 7 2017-04-27 737411 1
2 19.0939 Avg of row 3 to 8 2017-04-26 737411 1
3 19.0877 Avg of row 4 to 9 2017-04-25 737411 1
4 19.0981 Avg of row 5 to 10 2017-04-24 737411 1
5 19.0932 2017-04-23 737411 1
6 19.0888 2017-04-22 737411 1

The following metrics were calculated:Average, Standard Deviation, and Correlation (of the variable and timestamp) for all variables of interest for a 3-day and 5-day time window: [‘avgMotTemp’, ‘observed_brake_torque’, ‘observed_static_friction’, ‘torque_const_move’, ‘torque_limit_cmd’, ‘torque_peak’] => [‘avgMotTemp_avg_5’, ‘avgMotTemp_stddev_5’, ‘avgMotTemp_corr_5’, ‘observed_brake_torque_avg_5’, …]. We add an additional 3-day time window because we have a few robots with an available 5-day time window.

Another variable that was added is `test_timestamp` which basically is the result of `axis_test_date` to seconds.

Output variables

Dealing with binary class output variables is easier and more efficient than working with multiclass ones. Furthermore, labels other than `BT_BRAKE_TORQUE_OK` and `BT_BRAKE_NOT_OK` are so rare. So we decided to fold other labels into these two.

Old overall_test_result New Count
BT_INCOMPLETE removed 15


An n-day time window could result in Null values if the rows in the window are less than n. Furthermore, the Corr function could produce Null if one of the variables has zero variance. As a result, we need to impute the Null values:

  1. Considering 3-day time window instead of a 5-day time window as most of the robots have no data for 5 consecutive days
  2. Fill N/A using mean with the same Robot Type and overall_test_result
  3. Then fill N/A using mean with the just same overall_test_resut
  4. Then drop rows with N/A values as Scikit learn can’t handle N/As
  5. Remove all the robots without any recorded failure to make the dataset more balance

Here is the result:

overall_test_result Count Ratio
BT_BRAKE_NOT_OK 2781 0.0060005178
BT_BRAKE_TORQUE_OK 460679 0.9939994822


Predicting brake test is a classification problem and because only %0.6 of observations are related to brake failure we are dealing with a highly imbalanced dataset. Base on my initial investigations Random Forest looks like a good choice for this problem for the following reason:

  • It’s highly parallelizable in compare to Boosting based models
  • In Scikit Learn the Random Forest function has a large set of parameters including the ones to deal with imbalanced datasets
    Note: As mentioned before, the dataset is highly imbalanced as less than 0.1% of the observations are labeled ‘BT_BRAKE_NOT_OK’ so it makes sense to use some advanced methods such as undersampling ‘BT_BRAKE_TORQUE_OK’. Spoiler alert: It turned out the Random Forest can manage an imbalanced dataset even without using any balancing technic.
  • The ensembling nature of RF reduce the possibility of overfitting

Why undersampling methods would not work

With only around 2000 failed brake tests, undersampling provide us a too small dataset to train a model with. Some of my colleagues did try undersampling and got a great F1-score of 0.92 but there is a major flaw in their approach. They test the model with undersampled data which is wrong because do not represent real-world data. I reproduced their work and got a high F1-score of 0.96 but after testing it with the imbalanced test dataset, the F1-score dropped to 0.2.

The main reason for this phenomenon is that even if the model misclassifies %0.5 of the passed tests to fail, the Precision score falls below 0.5 and result in a low F1-Score.

Even more complicated methods such as Instance Hardness Threshold do not do better.

Trying oversampling methods

Oversampling means generating more observations of the minority class. Just duplicating current observation will not work so we need to use more advanced methods such as the Synthetic Minority Oversampling Technique (SMOTE) [CBHK2002] and the Adaptive Synthetic (ADASYN) [HBGL2008]. All of which perform perfectly on 5-fold cross-validation but poorly on unseen imbalanced data. Here is the Recall-Precision threshold for a Random Forest model fitted the oversampled data by ADASYC. Applying SMOTE results in a similar performance.

Leveraging Ensemble Learning

The Scikit-learn RandomForest supports three types of weighted sampling: ‘balanced’, ‘balanced subsample’, and No weight assignment. We will see if these methods help us to build a better prediction model.

I tried a randomized hyperparameter grid search to have an idea of the range of hyperparameters values.

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 600, num = 10)]
# Number of features to consider at every split
max_features = ['sqrt', 'log2', None]
# Maximum number of levels in tree
max_depth = [50, 100]
# Minimum number of samples required to split a node
min_samples_split = [10, 20, 40, 80]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4, 8]
# Method of selecting samples for training each tree
bootstrap = [True, False]
criterion = ['gini']
class_weight=['balanced','balanced_subsample', None]

This is how the best performing model trained with all the features looks like:

Model with rank: 1
Mean validation score: 0.734 (std: 0.010)
Parameters: {'random_state': 42, 'n_estimators': 100, 'min_samples_split': 80, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 50, 'criterion': 'gini', 'class_weight': None, 'bootstrap': False}

Model with rank: 2
Mean validation score: 0.734 (std: 0.013)
Parameters: {'random_state': 42, 'n_estimators': 377, 'min_samples_split': 40, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 100, 'criterion': 'gini', 'class_weight': None, 'bootstrap': True}

Model with rank: 3
Mean validation score: 0.715 (std: 0.015)
Parameters: {'random_state': 42, 'n_estimators': 211, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': None, 'criterion': 'gini', 'class_weight': None, 'bootstrap': False}

The performance of the model is shown in the form of a learning curve diagram. It helps us check the bias-variance tradeoff and check the progress of the model. The current diagram shows that we cannot do much better with the current model and data.

Interestingly “class_weight” is None in the top 3 best performing model. So, the best model is achieved when we ignore the imbalance problem altogether.

Feature Importance

Before talking about feature selection it could be interesting to look at the correlation matrix (for a 3-day time window) to see if variables are correlated or not:

Using the sk-learn “SelectFromModel” function we can find the most important feature in which their score is the above median of all scores. Please bear in mind that because of the random nature of ensemble methods the order of the list is not deterministic. I also ran this for the combination of 3-day and 5-day time window but as expected the first 5-day feature (‘observed_static_friction_stddev_5’) appears at the 16th place. Here the correlation matrix of top features for a 3-day time window.

Note: “torque_limit_cmd_stddev_3” added later based on a model with a 0.3 threshold. See the Threshold Tuning section.

Fit the model to the data with selected features

I used the same hyperparameter grid as before to narrow the exhaustive search space, here are the top results:

Model with rank: 1
Mean validation score: 0.744 (std: 0.010)
Parameters: {'random_state': 42, 'n_estimators': 377, 'min_samples_split': 40, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 100, 'criterion': 'gini', 'class_weight': None, 'bootstrap': True}

Model with rank: 2
Mean validation score: 0.743 (std: 0.013)
Parameters: {'random_state': 42, 'n_estimators': 100, 'min_samples_split': 80, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 50, 'criterion': 'gini', 'class_weight': None, 'bootstrap': False}

Model with rank: 3
Mean validation score: 0.719 (std: 0.014)
Parameters: {'random_state': 42, 'n_estimators': 211, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': None, 'criterion': 'gini', 'class_weight': None, 'bootstrap': False}

And the learning curve for the top-performing model. We improved the result slightly:

Threshold tuning

Considering that “class_weight” remained None in the top model performers, it is worth checking if changing the decision threshold improves the results. Here is the Precision-Recall curve:

Normally, in binary class datasets, sk-learn considers 0.5 as the threshold to decide which class to choose. The above shows that the 0.5 threshold results in low Recall. Move the threshold to 0.3 to help us achieve the highest F1 score possible.

By using the original hyperparameter grid and apply the 0.3 thresholds, the F score increased to 0.751

Model with rank: 1
Mean validation score: 0.751 (std: 0.015)
Parameters: {'random_state': 42, 'n_estimators': 400, 'min_samples_split': 40, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': None, 'criterion': 'gini', 'class_weight': None, 'bootstrap': True}

Model with rank: 2
Mean validation score: 0.750 (std: 0.014)
Parameters: {'random_state': 42, 'n_estimators': 366, 'min_samples_split': 40, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'criterion': 'gini', 'class_weight': None, 'bootstrap': True}

Model with rank: 3
Mean validation score: 0.750 (std: 0.016)
Parameters: {'random_state': 42, 'n_estimators': 100, 'min_samples_split': 80, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 100, 'criterion': 'gini', 'class_weight': None, 'bootstrap': True}

Exhaustive Hyperparameter search

After finding out the important features, general values of hyperparameters, and the decision threshold, now it’s time to use an exhaustive search:

Expand source

{'n_estimators': [395, 400, 410],
               'criterion': ['gini'],
               'max_features': ['sqrt'],
               'max_depth': [50, None],
               'class_weight' : [None],
               'min_samples_split': [37,38, 39],
               'bootstrap': [True],
                'random_state': [42],
              'n_jobs' : [10]}


Model with rank: 1
Mean validation score: 0.754 (std: 0.016)
Parameters: {'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': 50, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 38, 'n_estimators': 400, 'n_jobs': 10, 'random_state': 42}

Model with rank: 1
Mean validation score: 0.754 (std: 0.016)
Parameters: {'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 38, 'n_estimators': 400, 'n_jobs': 10, 'random_state': 42}

Model with rank: 3
Mean validation score: 0.753 (std: 0.016)
Parameters: {'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': 50, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 38, 'n_estimators': 410, 'n_jobs': 10, 'random_state': 42}


Despite all our effort, we could not reach F1-Score above 0.76. I believe that’s because of the following reasons:

  • The data do not cover robots and dates consistently. That results in a low learning rate as the model does not have enough data for a given robot to find the patterns.
  • Too many brake tests with inconsistent results are being run on each robot per day. That makes building a classification model harder.
  • The low rate of brake test failure result is a highly imbalanced dataset which is hard to approach with common methods.

What to do next?

The possible solutions to the mentioned problems could be:

  • Trying to clean data by applying more domain knowledge. i.e. removing or merging multiple test results per day per robot into one observation
  • Acquiring more consistent brake tests from manufactures
  • Extracting more variables/information from the brake test logs currently we use “[‘avgMotTemp’, ‘observed_brake_torque’, ‘observed_static_friction’, ‘torque_const_move’, ‘torque_limit_cmd’, ‘torque_peak’]”, there are as many more.
  • Trying new ways to deal with the time series problem other than folding it by aggregate functions. e.g. Recurrent Neural Networks
Please share this blog post
Share on Facebook
Tweet about this on Twitter
Share on LinkedIn
I am an experienced software developer who loves to solve problems that matter. Now, as a graduate student, I study Big Data and Data Science to explore new ways to make the world a better place.

Leave a Reply

Your email address will not be published. Required fields are marked *