Predict car mileage — Machine Learning Regression using auto-mpg dataset

Step-by-step instructions along with backward elimination, cross_val_score and KFold explanation

Jatin Grover
13 min read · Dec 23, 2019

The mission is to predict the mileage of a particular car in city driving, given data of some parameters (features) for hundreds of cars.

This project uses the UCI auto-mpg dataset of almost 400 cars with values of the following parameters:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

The idea is to train a machine learning model to learn the relationship (weights for regression equation) between dependent variable (y) and independent variables or features (x1, x2, x3 etc).
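For example, with a plain linear regression the learned relationship would take the form below (purely illustrative; the model actually used later is a random forest):

$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$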

It’s obvious that the mileage of a vehicle doesn’t depend only on these parameters. There are several other factors at play, like direction and strength of wind, city roads, city traffic, weather, driver experience and ability, etc.

The steps I have followed are more or less the steps commonly followed for all regression machine learning problems. However, I have also demonstrated and explained two other key aspects of approaching and solving a regression machine learning problem:

  1. Using cross_val_score to choose the best ML algo
  2. Feature Selection using backward elimination

Importing the libraries and creating a Pandas DataFrame

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn import metrics

Next, read UCI data and create a Pandas DataFrame.

df = pd.read_csv('auto-mpg.data', sep=r'\s+', header=None,
                 names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'modelyear', 'origin', 'carname'])
display(df.head(3))
  • '\s+' is a regex matching one or more whitespace characters, i.e. [ \t\n\r\f\v]
  • a *.data/*.dat file can be read with the pandas.read_csv function by specifying the delimiter as whitespace, tab, etc.

The output is like this:

Data pre-processing and visualization

Next, we check for any null, missing, incomplete, or inappropriate values using the following code:

df.isnull().sum()
df.info()

The first command tells you whether there are any missing values in the numerical columns. It does not catch problems in string columns, since a string can simply be blank and this command doesn't capture that.

The second command tells you whether the datatype of every feature is as expected, i.e. we expect displacement, horsepower, mpg etc. to be numerical (float/int). df.info() will help check whether each data type is exactly what we are expecting. The output will look like this:

We see that the horsepower column is perceived as object datatype by Pandas, whereas we would expect a floating-point value. That means there is a string somewhere. Our goal now is to find that string value(s) and decide what to do with the corrupt data.

Generally, the steps are to check for null, missing, incomplete or inappropriate values and then clean the data by converting columns to appropriate data types, filling missing values, normalizing, etc.

We follow the steps below to clean the corrupt data in the horsepower column.

  1. Display unique values in horsepower column.
str(set(df['horsepower']))

Output:
"{'?', '160.0', '65.00', '129.0', '167.0', '66.00', '208.0', '103.0', '116.0', '60.00', '52.00', '92.00', '115.0', '139.0', '90.00', '180...... <truncated output>

We can see that the only string present in the entire column is '?', which appears in a few rows.

2. Find percentage of non-numeric data in horsepower column

def removenotnum(list1):
    notnum = []
    for x in list1:
        try:
            float(x)
        except:
            notnum.append(x)
    return notnum

notnumtable = removenotnum(df['horsepower'])
print('all rubbish values -->', set(notnumtable))
print('Percent of identified rubbish data in Table -->', len(notnumtable) / len(df['horsepower']) * 100)

OUTPUT -->
all rubbish values --> {'?'}
Percent of identified rubbish data in Table --> 1.507537688442211

It turns out that only 1.5% of the data is corrupt. Identify the index of the rows containing the rubbish value in the horsepower column and remove those rows.

indexnames = df[(df['horsepower'] == '?')].index
df.drop(axis=0, index=indexnames, inplace=True)

Now convert the remaining clean data in the horsepower column to float and check the data types again.

df['horsepower'] = df['horsepower'].astype(float)
df.info()

Now we use some data visualisation techniques to view the data on charts, histograms etc. and decide whether any further data pre-processing or feature engineering is required. In short, we plot to check for anomalies, outliers, distributions, ranges of values etc.

  • pairplot of dependent variable (y) with respect to every independent variable or feature (x1, x2, x3 etc) except car name
sns.pairplot(df, x_vars=df.drop(['carname', 'mpg'], axis=1, inplace=False).columns, y_vars=['mpg'])
  • histogram of dependent variable (y) and every independent variable or feature (x1, x2, x3 etc) except car name. Define and describe a histplot() function and later call it to plot all histograms
def histplot(df, listvar):
    fig, axes = plt.subplots(nrows=1, ncols=len(listvar), figsize=(20, 3))
    counter = 0
    for ax in axes:
        df.hist(column=listvar[counter], bins=20, ax=axes[counter])
        plt.ylabel('Frequency')  # y-axis of a histogram shows counts
        plt.xlabel(listvar[counter])
        counter = counter + 1
    plt.show()

histplot(df, df.drop(['carname'], axis=1, inplace=False).columns)
  • To see if there are any outliers, plot a boxplot of every independent variable or feature (x1, x2, x3 etc) except car name. Define and describe a dfboxplot() function
  • Define list of continuous variables and call dfboxplot() for only those to detect outliers
def dfboxplot(df, listvars):
    fig, axes = plt.subplots(nrows=1, ncols=len(listvars), figsize=(20, 3))
    counter = 0
    for ax in axes:
        df.boxplot(column=listvars[counter], ax=axes[counter])
        plt.xlabel(listvars[counter])
        counter = counter + 1
    plt.show()

# Create a list of continuous variables
linear_vars = df.select_dtypes(include=[np.number]).columns

# call dfboxplot() for only linear_vars to detect outliers
dfboxplot(df, linear_vars)

Lastly, remove outliers using the z-score. Generally, a z-score threshold of 3 is considered practically useful to detect and remove outliers.

# returns a copy of the dataframe with outliers removed
def removeoutliers(df, listvars, z):
    from scipy import stats
    for var in listvars:
        # keep only rows whose z-score for this variable is below the threshold
        df = df[np.abs(stats.zscore(df[var])) < z]
    return df

# remove outliers where z score > 3
df = removeoutliers(df, linear_vars, 3)

Set up Machine learning model

1. Set up X and y dataframes

— for dependent variable (y) and independent features (x1, x2, x3 etc)

X = df.drop(['carname', 'mpg'], axis=1, inplace=False)
y = df[['mpg']]

Two square brackets [[ … ]] are needed to create a DataFrame; a single [ ] would create a Series.
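A quick way to see the difference:

print(type(df[['mpg']]))  # <class 'pandas.core.frame.DataFrame'>
print(type(df['mpg']))    # <class 'pandas.core.series.Series'>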

2. Convert features to log

— Since we saw in the histogram plots that the features are not normally distributed, we will convert them to log.

Many machine learning models assume, or work better when, the underlying data is roughly normally distributed. I have elaborated on this in my post on Diamond Price prediction using machine learning (search the page for the string 'Convert to log'):

https://fivestepguide.com/technology/machine-learning/diamond-price-prediction-using-machine-learning/#convert-to-log
def convertfeatures2log(df, listvars):
    for var in listvars:
        df[var] = np.log(df[var])

convertfeatures2log(X, X.columns)
convertfeatures2log(y, y.columns)
histplot(X, X.columns)
y.hist(bins=20)

3. Test Train Split

Data scientists generally split the data for machine learning into either two or three subsets: two subsets for training and testing, or three for training, validation and testing. I have elaborated on this in my earlier post (search the page for the string 'Train Test Split').

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

4. Screen test for ML applicability for this use case — using Random Forest

Calling a machine learning algo nowadays is a piece of cake, since there are several libraries which help call and run an algo with just a few lines of code.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=300)
rf.fit(X_train, y_train.values.ravel())  # ravel() flattens the single-column y DataFrame into a 1-D array
y_pred = rf.predict(X_test)

Explanation of the code above — after the import, the first line creates an instance of the RandomForestRegressor class, the second line fits the training data to that regressor, and the last line predicts y values based on the X_test data and stores them in y_pred.

Now we plot scatter plots of predictions vs. actuals:

  • ML algo log results vs. log of mpg in dataset
  • exponent of predicted values vs. real mpg in dataset
import matplotlib.gridspec as gridspec

fig = plt.figure(figsize=(12, 5))
grid = gridspec.GridSpec(ncols=2, nrows=1, figure=fig)
ax1 = fig.add_subplot(grid[0, 0])
ax2 = fig.add_subplot(grid[0, 1])
sns.scatterplot(x = y_test['mpg'], y = y_pred, ax=ax1)
sns.regplot(x = y_test['mpg'], y=y_pred, ax=ax1)
ax1.set_title("Log of Predictions vs. actuals")
ax1.set_xlabel('Actual MPG')
ax1.set_ylabel('predicted MPG')
sns.scatterplot(x = np.exp(y_test['mpg']), y = np.exp(y_pred), ax=ax2,)
sns.regplot(x = np.exp(y_test['mpg']), y=np.exp(y_pred), ax=ax2)
ax2.set_title("Real values of Predictions vs. actuals")
ax2.set_xlabel('Actual MPG')
ax2.set_ylabel('predicted MPG')

Now check the metrics to test whether it's worthwhile:

print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output →
MAE: 0.07270868117493247
MSE: 0.009611197970657645
RMSE: 0.0980367174616615

After inspecting the scatter plots, we see that the predictions are quite good. Now I will try to improve the model to get even more accurate predictions.

In the next section, I will talk more about the 2 other key aspects of approaching and solving a regression machine learning problem that I mentioned earlier:

  1. Using cross_val_score to choose the best ML algo
  2. Feature Selection using backward elimination

Eyeing the best Machine Learning lens

How do you select the best machine learning algorithm out of the set of algorithms you have identified for a particular problem? There are many ways to do this, and one of the most well-known uses cross_val_score.

  • First, we run the KFold() function. K-fold cross-validation takes a given dataset and splits it into K folds, where each fold is used once as the testing set while the other K-1 folds form the training set. For example, for 10-fold cross-validation (K=10), the dataset is split into 10 folds. In the first iteration, the first fold is used for validation while the 9 remaining folds form the training set. In the second iteration, the 2nd fold is used as the testing set while the rest serve as the training set. This process is repeated until each of the 10 folds has been used as the testing set.
  • Subsequently, the cross_val_score function takes the model, X and y, and KFold's result as inputs and returns a list of regression metric scores, one per fold. It splits the data using KFold as described above, trains on each combination of K-1 folds and reports the metric of the model on the held-out fold.

KFold splits the data into n_splits folds, so the dataset is randomly split into train and test sets n_splits times. cross_val_score returns an array with the estimator's score for each run of the cross-validation.


The following code uses KFold to get 10 splits of the training data, and then cross_val_score to score multiple regression algorithms on all 10 folds. The mean and standard deviation of the scores across the 10 folds is then displayed for every regression algorithm used.
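A minimal sketch of that loop is below; I am assuming the abbreviations in the output stand for LinearRegression, RandomForestRegressor, KNeighborsRegressor, DecisionTreeRegressor (CART) and SVR, mostly with default parameters, so the exact numbers may differ slightly.

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import KFold, cross_val_score

models = [('LR', LinearRegression()),
          ('RF', RandomForestRegressor(n_estimators=300)),
          ('KNN', KNeighborsRegressor()),
          ('CART', DecisionTreeRegressor()),
          ('SVR', SVR())]

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models:
    # score each model on the 10 folds; R^2 is the default metric for regressors
    scores = cross_val_score(model, X_train, y_train.values.ravel(), cv=kfold, scoring='r2')
    print('%s: %f (%f)' % (name, scores.mean(), scores.std()))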

The output turns out to be:

model: mean of score across 10 folds (std dev of score)
LR: 0.878770 (0.042622)
RF: 0.880276 (0.053347)
KNN: 0.796948 (0.092200)
CART: 0.820365 (0.054995)
SVR: 0.840068 (0.066258)

It seems the best algorithm for our use case is Linear Regression or Random Forest.

Lots of people wonder what the difference is between cross_val_score and cross_val_predict and which one to use when. In short, cross_val_score returns one score per fold of the cross-validation, while cross_val_predict returns an out-of-fold prediction for every sample.
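A small sketch of that difference, assuming the same X_train and y_train as above:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_val_predict

lr = LinearRegression()

# one R^2 score per fold -> array of length 5
fold_scores = cross_val_score(lr, X_train, y_train.values.ravel(), cv=5)

# one out-of-fold prediction per training sample -> array of length len(X_train)
oof_preds = cross_val_predict(lr, X_train, y_train.values.ravel(), cv=5)

print(fold_scores.shape, oof_preds.shape)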

Which features to select? — Feature selection

Backward Elimination to select optimal number of features (feature selection)

There are a few techniques for feature selection (or dimension reduction in this case) that apply when you have hundreds of features for your model and you suspect that not all of them are equally important. Dropping a few features can then improve performance or accuracy.

To be confident that you are choosing the optimal number of features, neither more nor fewer, you have to follow some dimensionality reduction technique. There are many such techniques, like lasso regression (shrinking large regression coefficients in order to reduce overfitting), Principal Component Analysis (PCA) and so on.

The technique I will explain here is Backward Elimination.

This exercise is really useful for reducing the number of features when there are hundreds or thousands of them; in the current use case we have fewer than 10 features, so it is not strictly needed. However…

The steps to be followed here are:

Backward elimination ->

  1. Add a column of 1's for this regression algorithm to work (the column of 1's represents the constant term assigned to x0)
  2. Select a significance level, e.g. P-value = 0.05
  3. Fit the model with all predictors (features)
  4. Consider the predictor with the highest P-value. If its P-value > significance level, go to step 5, else end
  5. Remove that predictor and fit the model without it
  6. Go to step 4

Start with the first iteration, using all predictors.

import statsmodels.api as sm

# add a column of 1's representing the constant (intercept) term x0
X_train_opt = np.append(arr = np.ones((X_train.shape[0], 1)).astype(int), values = X_train, axis = 1)
X_train_opt = X_train_opt[:,[0, 1, 2, 3, 4, 5, 6, 7]]
regressor_OLS = sm.OLS(endog = y_train, exog = X_train_opt).fit()
regressor_OLS.summary()

The output is a large table of statistics. Note that we are interested only in the P-value column (P>|t|).

We consider the predictor, apart from the constant, with the highest P-value. In this case, the P-values of x1 and x2 are both greater than the significance level, and the higher of the two belongs to x2.
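If the summary table is hard to read, the same P-values can also be printed directly from the fitted results; a small convenience, not part of the original walkthrough:

# P-value of every column of X_train_opt; position 0 is the constant
print(regressor_OLS.pvalues)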

Run the model again without x2 feature

# Note that the second feature is not used now
X_train_opt = X_train_opt[:, [0, 1, 3, 4, 5, 6, 7]]
regressor_OLS = sm.OLS(endog = y_train, exog = X_train_opt).fit()
regressor_OLS.summary()

The output this time is:

Now, the P-value of x1 is greater than the significance level.

Run the model again without x1 feature

# Note that now the model is run without the 1st and 2nd features
X_train_opt = np.append(arr = np.ones((X_train.shape[0], 1)).astype(int), values = X_train, axis = 1)
X_train_opt = X_train_opt[:, [0, 3, 4, 5, 6, 7]]
regressor_OLS = sm.OLS(endog = y_train, exog = X_train_opt).fit()
regressor_OLS.summary()

Now you can see that all features have a P-value less than the significance level. This ends our exercise of backward elimination. We now know that the optimal feature set for our algorithm is features 3 to 7 only, so we create another X_train and X_test with just those columns:

X_train2 = X_train.iloc[:, [2, 3, 4, 5, 6]]
X_test2 = X_test.iloc[:, [2, 3, 4, 5, 6]]
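The same procedure can be automated in a short loop. Below is a minimal sketch, assuming the same X_train and y_train as above; backward_elimination is my own helper name, not something from the original code.

import numpy as np
import statsmodels.api as sm

# drop the predictor with the highest P-value until every remaining
# P-value is below the significance level
def backward_elimination(X, y, significance_level=0.05):
    # X already contains the column of 1's for the intercept at position 0
    cols = list(range(X.shape[1]))
    while True:
        model = sm.OLS(endog=y, exog=X[:, cols]).fit()
        pvals = np.asarray(model.pvalues)[1:]  # skip the constant
        if pvals.size == 0 or pvals.max() <= significance_level:
            return cols, model
        del cols[int(pvals.argmax()) + 1]  # +1 because position 0 is the constant

X_opt = np.append(arr=np.ones((X_train.shape[0], 1)).astype(int), values=X_train, axis=1)
kept_cols, final_model = backward_elimination(X_opt, y_train)
print('columns kept (0 is the constant):', kept_cols)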

Random forest algo on reduced number of features

rf = RandomForestRegressor(n_estimators = 10)
rf.fit(X_train2,y_train)
y_pred = rf.predict(X_test2)
import matplotlib.gridspec as gridspec

fig = plt.figure(figsize=(12, 5))
grid = gridspec.GridSpec(ncols=2, nrows=1, figure=fig)
ax1 = fig.add_subplot(grid[0, 0])
ax2 = fig.add_subplot(grid[0, 1])
sns.scatterplot(x = y_test['mpg'], y = y_pred, ax=ax1)
sns.regplot(x = y_test['mpg'], y=y_pred, ax=ax1)
ax1.set_title("Log of Predictions vs. actuals")
ax1.set_xlabel('Actual MPG')
ax1.set_ylabel('predicted MPG')
sns.scatterplot(x = np.exp(y_test['mpg']), y = np.exp(y_pred), ax=ax2,)
sns.regplot(x = np.exp(y_test['mpg']), y=np.exp(y_pred), ax=ax2)
ax2.set_title("Real values of Predictions vs. actuals")
ax2.set_xlabel('Actual MPG')
ax2.set_ylabel('predicted MPG')
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output →
MAE: 0.07700061119950408
MSE: 0.010134333336198278
RMSE: 0.1006694260249768

We see that there is not much difference in the performance or accuracy of the predictions since, as explained, this is a small dataset with very few features.

I hope this gave you a first step-by-step implementation of a regression machine learning algorithm to predict a number, along with some flavour of the cross_val_score and backward elimination techniques.

I created this as part of a series of posts on Machine Learning projects & examples. A much more detailed version of this post is available at https://fivestepguide.com/technology/machine-learning/machine-learning-model-predict-car-mileage/

References

For this particular post, I have referred to several websites, including Machine Learning Mastery, and a few other posts and websites mentioned in this article itself.

I have also referred to the backward elimination code from the Udemy course by Kirill Eremenko.

About Me

I work on the IT projects side of an investment bank. I have 15 years of experience building production-ready applications in front-office trading and, for the last 3 years, in Machine Learning, NLP, NER, anomaly detection etc.

Feel free to connect with me on LinkedIn or follow me on Medium
