“Diamonds are forever” — price prediction using Machine Learning regression models and neural networks.

It’s the holiday season, and you can use this algorithm to predict the price of the diamond you desire before purchasing it from a local retailer — handy for a little help in negotiation.

Import all basic libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import preprocessing

Getting the data

df = pd.read_csv("diamonds.csv")
df.drop('Unnamed: 0', axis=1, inplace=True)
Top 3 rows of Diamonds dataset downloaded from Kaggle

Data pre-processing

# plot price vs. carat
sns.pairplot(df, x_vars=['carat'], y_vars=['price'])
# plot carat vs. the other Cs
sns.pairplot(df, x_vars=['cut', 'clarity', 'color'], y_vars=['carat'])
linear_vars = df.select_dtypes(include=[np.number]).columns

Convert the features to log scale

  1. Check for ZERO values among the continuous features, namely table, depth, l, w and d. log(0) is undefined, so any zero value would break the log conversion. To guard against this, shift every numeric column by a tiny constant (0.01) before taking logs.
print('0 values -->', 0 in df.values)
df[linear_vars] = df[linear_vars] + 0.01
print('Filled all 0 values with 0.01. Now any 0 values? -->', 0 in df.values)
0 values --> True
Filled all 0 values with 0.01. Now any 0 values? --> False
sorted by carat --> [5.02, 4.51, 4.14, 4.02, 4.02]
sorted by depth --> [79.01, 79.01, 78.21, 73.61, 72.91]
sorted by table --> [95.01, 79.01, 76.01, 73.01, 73.01]
sorted by price --> [21646.46, 21640.71, 21626.91, 21624.61, 21623.46]
sorted by l --> [10.75, 10.24, 10.15, 10.03, 10.02]
sorted by w --> [58.91, 31.81, 10.55, 10.17, 10.11]
sorted by d --> [31.81, 8.07, 6.99, 6.73, 6.44]
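The calls below use two plotting helpers, dfboxplot and histplot, whose definitions were not shown in the post. A minimal sketch of what they might look like (names and signatures taken from the calls; layout details are assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd

def dfboxplot(df, cols):
    # One box plot per numeric column, useful for spotting outliers
    df[list(cols)].plot(kind="box", subplots=True, figsize=(3 * len(cols), 3))
    plt.tight_layout()
    plt.show()

def histplot(df, cols):
    # One histogram per numeric column, useful for checking skew before/after the log transform
    df[list(cols)].hist(bins=30, figsize=(3 * len(cols), 3))
    plt.tight_layout()
    plt.show()
```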
dfboxplot(df, linear_vars)

# this function log-converts the dataframe's features in place
def convertfeatures2log(df, listvars):
    for var in listvars:
        df[var] = np.log(df[var])

convertfeatures2log(df, linear_vars)
histplot(df, linear_vars)

Convert categorical columns to numerical columns using LabelEncoder
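The post does not show the encoding code itself; a minimal sketch, assuming the three categorical columns of the diamonds dataset and a toy slice standing in for the real dataframe:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy slice standing in for the real diamonds dataframe
df = pd.DataFrame({"cut": ["Ideal", "Premium", "Ideal"],
                   "color": ["E", "E", "J"],
                   "clarity": ["SI2", "VS1", "SI2"]})

le = LabelEncoder()
for col in ["cut", "color", "clarity"]:
    df[col] = le.fit_transform(df[col])  # each category becomes an integer code

print(df["cut"].tolist())  # → [0, 1, 0] (codes follow alphabetical order of the labels)
```

Note that LabelEncoder assigns codes alphabetically, so the integers do not follow the quality order of cut, color or clarity; an explicit ordinal mapping would preserve it.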


Now it’s time to start running machine learning algorithms on this data

X_df = df.drop(['price', 'l', 'w', 'd'], axis=1)
y_df = df[['price']]  # double brackets to get a DataFrame, not a Series
  • I will combine X (categorical features already converted to numerical) and y into a new dataframe for the correlation check
df_le = X_df.copy()
# add a new column to the dataframe, i.e. join the two dataframes column-wise
df_le['price'] = y_df['price'].values
  • It seems price is highly correlated with carat, fairly correlated with table, color and clarity, and not much with cut
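That observation can be checked with a correlation heatmap over df_le; a sketch with a toy frame standing in for the real one:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for df_le: encoded features plus (log) price
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.0, 200)
df_le = pd.DataFrame({"carat": carat,
                      "cut": rng.integers(0, 5, 200),
                      "price": np.log(3000 * carat + rng.normal(0, 50, 200))})

corr = df_le.corr()  # pairwise Pearson correlations, including the price column
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
print(corr["price"].sort_values(ascending=False))
```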
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_df = sc_X.fit_transform(X_df)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3, random_state=42)
  • split the data into training set and test set.
  • Train the algorithm on training set data
  • Use the trained algorithm (or trained ML model) to predict prices from diamond properties in test data.
  • Verify / visualize / measure the differences between predicted prices and actual prices using scatterplots, histograms, accuracy metrics etc.

Linear regression

  1. Import LinearRegression class from Sci-kit learn
  2. Create an object of LinearRegression model
  3. Fit the model to X_train and y_train
  4. Make predictions
# Import the class from Sci-kit learn
from sklearn.linear_model import LinearRegression
# Create an object of LinearRegression model
reg_all = LinearRegression()
# Fit the model to X_train and y_train
reg_all.fit(X_train, y_train)
# Make predictions
y_pred = reg_all.predict(X_test)
import matplotlib.pyplot as plt
import seaborn as sns
# convert prices and predictions back from log scale with exp
y_pred2 = np.exp(y_pred)
y_test2 = np.exp(y_test)
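A quick way to visualize the fit is a scatter of predicted vs. actual prices; a sketch using toy arrays in place of y_test2 and y_pred2:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy stand-ins for actual and predicted prices (already back on the dollar scale)
y_test2 = np.array([500.0, 1200.0, 3400.0, 8000.0])
y_pred2 = y_test2 * (1 + np.random.default_rng(1).normal(0, 0.05, 4))

plt.scatter(y_test2, y_pred2, alpha=0.6)
lims = [y_test2.min(), y_test2.max()]
plt.plot(lims, lims, "r--")  # points on this line are perfect predictions
plt.xlabel("Actual price")
plt.ylabel("Predicted price")
plt.show()
```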

K-nearest neighbors (KNN)

from sklearn.neighbors import KNeighborsRegressor
reg_all = KNeighborsRegressor(n_neighbors=8, metric='minkowski', p=2)
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)

Support vector machines (SVM)

from sklearn.svm import SVR
regressor = SVR(kernel='rbf')
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

Regression Evaluation Metrics

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
MAE: 0.08815796872409032
MSE: 0.012811091991743056
RMSE: 0.11318609451581522
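Since prices were log-transformed, the numbers above are in log units. To report errors in dollars, exponentiate before computing the metric; a sketch with toy values standing in for y_test and y_pred:

```python
import numpy as np
from sklearn import metrics

# Toy log-scale actuals and predictions
y_test = np.log([500.0, 1200.0, 3400.0])
y_pred = np.log([520.0, 1150.0, 3500.0])

mae_dollars = metrics.mean_absolute_error(np.exp(y_test), np.exp(y_pred))
print("MAE ($):", round(mae_dollars, 2))  # → MAE ($): 56.67
```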

Random Forest

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=10)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
MAE: 0.08413000449351056
MSE: 0.012880789432555585
RMSE: 0.1134935655997977

The big daddy — Artificial neural networks

  1. Construct, compile and return a Keras model, which will then be used to fit/predict.
  2. Predict diamond prices
  3. Evaluate the model using metrics between ypred vs. ytest
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
  • Then, we run the KerasRegressor() function, which returns a baseline ANN model built in Keras. It takes as input the model-building function, the number of epochs and the batch size. An epoch is one full pass of the learning algorithm through the entire training dataset; the batch size is the number of samples processed before the model's weights are updated. When one epoch consists of a single batch — all training samples passing through the learning algorithm before the weights are updated — the procedure is called batch gradient descent.
  • More information on KerasRegressor can be found on the TensorFlow website.
  • Thereafter, we run the KFold() function. K-Fold Cross Validation splits a given dataset into K folds, where each fold in turn serves as the test set while the other K-1 form the training set. For example, with 10-fold cross validation (K=10), the dataset is split into 10 folds: in the first iteration, the first fold is used for validation while the remaining 9 folds form the training set; in the second iteration, the 2nd fold is the test set while the rest serve as the training set. This is repeated until each of the 10 folds has been used as the test set.
  • Subsequently, the cross_val_score function takes the model, X and y, and kfold’s result as inputs and outputs multiple results — a list of regression model metrics scores. The cross_val_score function splits the data, using KFold as described above, into K pieces, trains on each combination of K-1 folds and gives back the metrics of the model.
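The estimator below expects a build_fn called baseline_model, which the post does not show. A minimal sketch of such a builder — the layer sizes and the 6-feature input are assumptions based on X_df above:

```python
from keras.models import Sequential
from keras.layers import Dense

def baseline_model():
    # Small fully connected regressor: 6 features in, one (log) price out
    model = Sequential()
    model.add(Dense(64, activation="relu", input_shape=(6,)))
    model.add(Dense(32, activation="relu"))
    model.add(Dense(1))  # linear output for regression
    model.compile(loss="mean_squared_error", optimizer="rmsprop")
    return model
```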
estimator = KerasRegressor(build_fn=baseline_model, epochs=10, batch_size=5)
kf = KFold(n_splits=5)
results = cross_val_score(estimator, X_train, y_train, cv=kf)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
estimator.fit(X_train, y_train)
y_pred = estimator.predict(X_test)
# Plot a scatter plot like above to see prediction perfection

Predictions on Pricescope data


Get Pricescope data

Pricescope is the premier diamond and jewelry community on the Internet. Visitors to Pricescope find clear and concise tutorials from industry experts and have their questions answered by knowledgeable forum members. The majority of consumers go online to learn about diamonds, while 90% buy diamonds from brick and mortar shops. Pricescope exists to help consumers get the best value online or in-store.


About Me




Data Engineer. Investor, blogger (https://fivestepguide.com) quadragenarian, father of 4 year old. I like to share knowledge & experience gained over 20+ years
