Short Python code for Backward elimination with detailed explanation

Jatin Grover
6 min read · Mar 28, 2021


Backward elimination is a feature selection technique for arriving at an optimal set of features. Using every available feature can cause slowness or other performance issues in your machine learning model.

Introduction to Backward Elimination in Machine Learning

If your model has several features, it is likely that not all of them are equally important. Some features can even be derived from other features. So, to improve performance or accuracy, you can drop a few features.

Sometimes you have to make a judgement call on whether to keep a derived feature or drop it. For instance, the total land area of your house is derived from the length and breadth of the plot. So can we safely remove the total land area feature from a machine learning algorithm that predicts house prices?

Think about it this way: would you prefer a house with a wider front face, or one with a narrower front that goes deeper into the alley? Since the answer affects the price, in this case we have to retain at least one of the two underlying features (length or breadth) as well as the total land area feature.

To be sure that you have the optimal number of features, you can follow a dimensionality reduction technique such as lasso regression (shrinking large regression coefficients in order to reduce overfitting), Principal Component Analysis (PCA), or Backward Elimination.

To start using the backward elimination code in Python, you first need to prepare your data. The first step is to add an array of ones (every element is "1") for this regression algorithm to work; the column of 1's represents the constant (intercept) assigned to the first dimension of the independent variable X, generally called x0.
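As a minimal sketch of that preparation step (assuming X_train holds the training features; X_with_const is just an illustrative name), the column of ones can be prepended manually with NumPy, or with statsmodels' add_constant helper:

import numpy as np
import statsmodels.api as sm

# Manually prepend a column of ones so OLS fits an intercept term (x0)
X_with_const = np.append(arr=np.ones((len(X_train), 1)).astype(int), values=X_train, axis=1)

# Equivalent helper from statsmodels:
# X_with_const = sm.add_constant(X_train)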

The code shown here follows on from the code written in the Car mileage prediction article. In other words, the Python code for backward elimination is PART 2 of the Car mileage prediction article.

A quick rundown of steps for the Backward Elimination python code is as follows:


1. Select a P-value level

Generally, a 5% significance level is suitable for normal circumstances, so set the P-value threshold at 0.05.

2. Fit the model with all features

Now fit your machine learning model with all features. If you have 50 features, fit the model on your training dataset with all of them.

import numpy as np
import statsmodels.api as sm

# Prepend a column of ones (intercept term x0) to the 274 training rows
X_train_opt = np.append(arr=np.ones((274, 1)).astype(int), values=X_train, axis=1)
# Start with all columns: intercept (0) plus features 1 to 7
X_train_opt = X_train_opt[:, [0, 1, 2, 3, 4, 5, 6, 7]]
regressor_OLS = sm.OLS(endog=y_train, exog=X_train_opt).fit()
regressor_OLS.summary()

The output is a large table of regression statistics. Note that we are interested only in the P-value column (P>|t|) of the coefficient table.

3. Which feature has highest P-value?

Note the P-values of all features, then find the feature with the highest P-value. Proceed only if its P-value is greater than the chosen significance level, e.g. 0.05. Otherwise, treat this as the final list of features.

In the summary output, the P-values of x1 and x2 are greater than the significance level, and the highest of the two belongs to x2.
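If you prefer to read the P-values from code instead of the printed summary table, a small sketch using the fitted regressor_OLS from step 2 looks like this:

pvalues = np.asarray(regressor_OLS.pvalues)   # P-value of each column in X_train_opt
worst_idx = int(np.argmax(pvalues))           # column with the highest P-value
print('Highest P-value:', pvalues[worst_idx], 'at column index', worst_idx)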

4. Remove the feature with highest P-value

Modify the set of features to contain all features apart from the one identified in the last step. In our case, it's x2.

5. Fit the model again (Step 2) and stop once the P-values of all remaining features are below the significance level

Use the OLS function from the statsmodels.api library again for this step of the Backward Elimination code.

Now fit the model without x2.
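The article does not print this intermediate block, but following the pattern of the code in steps 2 and 6, refitting without x2 (column index 2) would look roughly like this:

# Refit with every column except index 2 (feature x2 removed)
X_train_opt = np.append(arr=np.ones((274, 1)).astype(int), values=X_train, axis=1)
X_train_opt = X_train_opt[:, [0, 1, 3, 4, 5, 6, 7]]
regressor_OLS = sm.OLS(endog=y_train, exog=X_train_opt).fit()
regressor_OLS.summary()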

After refitting, the P-value of x1 is still greater than the significance level.

As explained earlier, repeat the Backward Elimination steps until no feature is left with a P-value higher than the significance level, i.e. 0.05. (A loop version of this procedure is sketched after step 6.)

6. Now remove x1 and fit the model again

# Note that the model is now run without the 1st and 2nd features
X_train_opt = np.append(arr=np.ones((274, 1)).astype(int), values=X_train, axis=1)
X_train_opt = X_train_opt[:, [0, 3, 4, 5, 6, 7]]
regressor_OLS = sm.OLS(endog=y_train, exog=X_train_opt).fit()
regressor_OLS.summary()

Now you can see that all features have a P-value less than the significance level.
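If you would rather not remove columns by hand, the whole procedure can be wrapped in a loop. This is a sketch of my own rather than code from the original article; it assumes X is a NumPy array that already contains the column of ones:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    # Column 0 is assumed to be the constant term and is never removed
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        model = sm.OLS(endog=y, exog=X[:, cols]).fit()
        pvalues = np.asarray(model.pvalues)
        worst = 1 + int(np.argmax(pvalues[1:]))          # worst feature, ignoring the constant
        if pvalues[worst] <= significance_level:
            return cols, model                           # every remaining feature is significant
        del cols[worst]                                  # drop it and refit
    return cols, sm.OLS(endog=y, exog=X[:, cols]).fit()  # only the constant is left

Calling backward_elimination(X_train_opt, y_train) on the full matrix prepared in step 2 should end up with the same [0, 3, 4, 5, 6, 7] column set that we reached manually.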

Test ML model performance with reduced feature set

Now we know that the optimal feature set for our algorithm consists of features 3 to 7 only. So we create another X_train and X_test containing only those columns.

X_train2 = X_train.iloc[:, [2, 3, 4, 5, 6]]
X_test2 = X_test.iloc[:, [2, 3, 4, 5, 6]]

Try Random forest machine learning on reduced features

Now let's check the performance using a Random Forest model.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

# Fit a Random Forest on the reduced feature set
rf = RandomForestRegressor(n_estimators=10)
rf.fit(X_train2, y_train)
y_pred = rf.predict(X_test2)

# Plot predicted vs. actual MPG, in log scale (left) and in real units (right)
fig = plt.figure(figsize=(12, 5))
grid = gridspec.GridSpec(ncols=2, nrows=1, figure=fig)
ax1 = fig.add_subplot(grid[0, 0])
ax2 = fig.add_subplot(grid[0, 1])
sns.scatterplot(x=y_test['mpg'], y=y_pred, ax=ax1)
sns.regplot(x=y_test['mpg'], y=y_pred, ax=ax1)
ax1.set_title("Log of predictions vs. actuals")
ax1.set_xlabel('Actual MPG')
ax1.set_ylabel('Predicted MPG')
sns.scatterplot(x=np.exp(y_test['mpg']), y=np.exp(y_pred), ax=ax2)
sns.regplot(x=np.exp(y_test['mpg']), y=np.exp(y_pred), ax=ax2)
ax2.set_title("Real values of predictions vs. actuals")
ax2.set_xlabel('Actual MPG')
ax2.set_ylabel('Predicted MPG')

# Error metrics on the predictions
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output →
MAE: 0.07700061119950408
MSE: 0.010134333336198278
RMSE: 0.1006694260249768

For larger datasets with many features, you will find a stark difference in MAE and similar metrics between the feature-full and feature-reduced Random Forest. However, this dataset has very few features, so we don't see much difference in the performance or accuracy of the predictions.
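If you want to quantify that difference yourself, one way (my own sketch, assuming the full X_train and X_test from the earlier article are still available) is to fit the same Random Forest on all features and compare the MAE:

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

# Same model, all features vs. the reduced set
rf_full = RandomForestRegressor(n_estimators=10)
rf_full.fit(X_train, y_train)
y_pred_full = rf_full.predict(X_test)

print('MAE, all features:   ', metrics.mean_absolute_error(y_test, y_pred_full))
print('MAE, reduced features:', metrics.mean_absolute_error(y_test, y_pred))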

Thank you for reading this article. I have written several other articles on Machine Learning topics, especially Machine Learning basics. You may like to click the “Data Science” category and read those articles.

FAQ: What are the different methods of feature selection?

There are several dimensionality reduction or feature selection techniques:
– Lasso regression: shrinks large regression coefficients in order to reduce overfitting
– Principal Component Analysis (PCA)
– Discard correlated variables to create a reduced-feature dataset (see the sketch after this list)
– Discard derived features. This is a judgement call.
– Eliminate features that a Random Forest ranks as unimportant, e.g. by plotting its feature importances
– Use Linear Regression to select features based on their p-values
– Forward selection
– Backward selection
– Stepwise selection
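For example, discarding highly correlated variables can be sketched with pandas as below (a sketch of my own, assuming X_train is a DataFrame; the 0.9 threshold is an arbitrary choice):

import numpy as np

corr = X_train.corr().abs()
# Keep only the upper triangle so each pair of features is checked once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X_train.drop(columns=to_drop)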

FAQ: What is the “Curse of Dimensionality”?

It signifies that the underlying dataset has more features than are really required.
Additionally, if you have more features than observations, you run the risk of overfitting. Observations also become harder to cluster: with too many dimensions, every observation can appear roughly equidistant from every other.
PCA is the most popular dimensionality reduction technique; a minimal sketch follows below.
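As a quick illustration of PCA with scikit-learn (assuming X_train and X_test from earlier; the number of components kept here is an arbitrary choice):

from sklearn.decomposition import PCA

pca = PCA(n_components=5)                     # keep the first 5 principal components
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
print(pca.explained_variance_ratio_)          # variance retained by each component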
