“Diamonds are forever” — price prediction using Machine Learning regression models and neural networks.

Jatin Grover
16 min read · Nov 26, 2019

Using PriceScope and CaratLane Diamond listing

Amongst the variety of items that buyers do not value quantitatively or statistically, diamonds are possibly the most valuable. The purchase is far from rational, with a heavy bent towards emotional ties. Jewelers entice every man (and woman) by marketing the diamond as a necessity for the occasion and as a status symbol, and by calling this pricey, often unaffordable item priceless.

The actual value of a diamond, however, is determined by a gemologist after inspecting its various "features" (let's start using the proper machine learning vocabulary, since this article is about predicting diamond prices using machine learning) and applying a relative valuation principle of "compare and price".

But in recent years, perhaps the last two decades, valuation and pricing have become more or less quantitative, i.e. calculations based on the values of many properties, not limited to the 4Cs (carat, cut, colour, clarity). Properties like culet, pavilion, crown, girdle, girdle thickness, polish, symmetry, fluorescence, table and depth are the most easily identifiable and recordable features while the diamond is actually being cut.

The following is a no-brainer, easy-to-follow, quick article for newbie machine learning learners on a regression problem, i.e. how to predict a number from a minimal dataset with fairly good accuracy.

Let’s begin with predicting diamond prices now.

A dataset of roughly 1,500 rows is generally not very helpful when there are tens of features; the algorithm would need a lot of training data to reach acceptable accuracy metrics.

However, I will show very simple approaches to a complex problem which have yielded relatively good accuracy.

It's the holiday season, and you can use this algorithm to predict the price of the diamond you really desire before purchasing it from your local retailer, which is great for a little help in negotiation.

Let's begin:

Import all basic libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import preprocessing

Getting the data

I will begin with training and predicting prices based on a dataset from Kaggle, which has basic data on more than 54,000 diamonds.

Later, we will train the same model and predict prices based on fresh data from pricescope.com. By fresh data, I mean data I downloaded just a few days back.

df = pd.read_csv("diamonds.csv")
df.drop('Unnamed: 0', axis=1, inplace=True)
display(df.head(3))

The output is something like the following:

Top 3 rows of Diamonds dataset downloaded from Kaggle

Data pre-processing

We now begin with basic data preprocessing. We first check whether any null values or unexpected datatypes are present.

df.isnull().sum()
df.info()

Now we plot a Pair-plot of Price vs. 4 Cs (Carat, Cut, Color, Clarity) — the most popular and marketed properties of a diamond.

Read more about 4Cs at the following links:

# plot price vs. carat
sns.pairplot(df, x_vars=['carat'], y_vars=['price'])
# plot carat vs. other Cs
sns.pairplot(df, x_vars=['cut', 'clarity', 'color'], y_vars=['carat'])
plt.show()

The output is something like the following:

We can see that these charts reveal a lot about where the bulk of diamonds fall under each property value, e.g. most of the bigger diamonds (higher carat) fall in Fair cut, I1 clarity and H-I color. These are poor (commercial grade) diamonds, generally sold by retail jewelry shops across the world to attract consumers with advertisements like 'Diamonds at 50% off'.

The price vs. carat chart also shows that there are some outliers in the dataset, i.e. a few diamonds that are really overpriced!

Now we need to see the distribution of the dataset. We will create a histogram plot for this. First we define the histplot function.
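The helper itself is not reproduced on Medium, so here is a minimal sketch of what histplot could look like, assuming it simply draws one histogram per listed column with matplotlib:

# A minimal sketch of a histplot helper (assumed implementation):
# one histogram per listed column, to eyeball each distribution.
def histplot(df, listvars, bins=50):
    fig, axes = plt.subplots(1, len(listvars), figsize=(4 * len(listvars), 3))
    for ax, var in zip(np.atleast_1d(axes), listvars):
        ax.hist(df[var].dropna(), bins=bins)
        ax.set_title(var)
    plt.tight_layout()
    plt.show()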

Now we list the continuous variables and leave out the categorical ones. A continuous variable, e.g. carat, is one that has numerical values, whereas a categorical variable is one with alphanumeric values as categories, e.g. clarity.

linear_vars = df.select_dtypes(include=[np.number]).columns
display(list(linear_vars))

Output is [‘carat’, ‘depth’, ‘table’, ‘price’, ‘l’, ‘w’, ‘d’]

Now we plot the histogram

histplot(df,linear_vars)

This reveals the distribution of each property. As expected, we see that the data is not normally distributed. After all, how can you expect a 1-carat diamond to be priced at just twice the price of a half-carat one (all other properties remaining the same), when a 1-carat diamond looks much bigger to the eye in a ring, or earrings for that matter, than a half-carat one?

Convert the features to log scale

  1. Check for any ZERO values amongst the features, namely table, depth, l, w and d. A zero in any continuous variable would cause a problem when converting to the log scale (the log of zero is undefined), so we add a tiny number, 0.01, to handle zero values. (The code below simply adds 0.01 to every continuous column, which also shifts the non-zero values slightly.)
print('0 values -->', 0 in df.values)
df[linear_vars] = df[linear_vars] + 0.01
print('Filled all 0 values with 0.01. Now any 0 values? -->', 0 in df.values)

The output is:

0 values --> True
Filled all 0 values with 0.01. Now any 0 values? --> False

2. View and remove outliers using z-score

Since we could sense some outliers in the pairplot charts, let's dig deeper and see whether there genuinely are any.

Let's begin by printing the top few values of each diamond property.
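The snippet that produced the listing below is not shown in the post; a minimal sketch, assuming we simply sort each continuous column and print its five largest values, could be:

# Sketch: print the five largest values of each continuous property
# to spot suspiciously large entries before drawing boxplots.
for var in linear_vars:
    top5 = sorted(df[var].values, reverse=True)[:5]
    print('sorted by {} --> {}'.format(var, top5))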

Output is like:

sorted by carat --> [5.02, 4.51, 4.14, 4.02, 4.02]
sorted by depth --> [79.01, 79.01, 78.21000000000001, 73.61, 72.91000000000001]
sorted by table --> [95.01, 79.01, 76.01, 73.01, 73.01]
sorted by price --> [21646.459999999995, 21640.709999999995, 21626.909999999996, 21624.609999999997, 21623.459999999995]
sorted by l --> [10.75, 10.24, 10.15, 10.03, 10.02]
sorted by w --> [58.91, 31.810000000000002, 10.549999999999999, 10.17, 10.11]
sorted by d --> [31.810000000000002, 8.07, 6.99, 6.7299999999999995, 6.4399999999999995]

From this list itself, we see that there are some outliers for w and d. Let's visualize those using boxplots.

Create a boxplot function:
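The dfboxplot helper is not listed in the post either; a minimal sketch, assuming one seaborn boxplot per column, might look like this:

# A possible dfboxplot helper (assumed): one boxplot per column,
# so points far beyond the whiskers stand out as outliers.
def dfboxplot(df, listvars):
    fig, axes = plt.subplots(1, len(listvars), figsize=(3 * len(listvars), 4))
    for ax, var in zip(np.atleast_1d(axes), listvars):
        sns.boxplot(y=df[var], ax=ax)
        ax.set_title(var)
    plt.tight_layout()
    plt.show()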

Call dfboxplot to view outliers for all properties

dfboxplot(df, linear_vars)

We can now clearly see that there are outliers in the table, w and d properties. Let's call the removeoutliers() function to remove them based on z-score. There are several methods of removing outliers, but I am going to follow the z-score approach here since it is the easiest to implement and delivers good results. After all, this is a no-brainer quick article for newbie users.
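The removeoutliers() function is not reproduced here; a simple z-score based version, assuming we drop any row whose value lies more than 3 standard deviations from its column mean (the threshold of 3 is my assumption), could be:

from scipy import stats

# Sketch of a z-score based outlier filter (the threshold of 3 is an assumption):
# keep only rows whose value in every listed column lies within
# 3 standard deviations of that column's mean.
def removeoutliers(df, listvars, threshold=3):
    z = np.abs(stats.zscore(df[listvars]))
    return df[(z < threshold).all(axis=1)]

df = removeoutliers(df, linear_vars)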

For more on other algorithms to remove outliers, check these out:

Now, after calling dfboxplot again to view outliers for all properties, we see the output as:

3. Convert to log

We saw earlier that most features (or properties of a diamond) are not normally distributed, and one of the most favored approaches, if not a prerequisite, is to use Gaussian-distributed (another name for normally distributed) data. So we convert the features to the log scale.

# this function log-converts the dataframe's features in place
def convertfeatures2log(df, listvars):
    for var in listvars:
        df[var] = np.log(df[var])

convertfeatures2log(df, linear_vars)
histplot(df, linear_vars)

The output now is:

Convert categorical columns to numerical columns using LabelEncoder

We now have to convert all categorical columns to numerical columns using LabelEncoder. You may read more about LabelEncoder at the following webpage; there is a quick and easy-to-understand description there.

First we define the convert_catg() function to convert categorical columns to numerical columns
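The function body is not shown in the post; a minimal sketch using scikit-learn's LabelEncoder, assuming it encodes every object-typed column in place, could be:

from sklearn.preprocessing import LabelEncoder

# Sketch: label-encode every categorical (object-typed) column in place,
# so cut / color / clarity become small integer codes.
def convert_catg(df):
    le = LabelEncoder()
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = le.fit_transform(df[col].astype(str))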

Next, we run the function and see the head of the dataframe.

convert_catg(df)
df.head(3)

Now is the time to start coding machine learning algorithms on this data

Divide the data into X and y

First we set X and y, where X is the matrix (or DataFrame) of all the properties (independent features) and y is the vector of outputs (the dependent variable), i.e. the diamond price.

X_df = df.drop(['price', 'l', 'w', 'd'], axis=1)
y_df = df[['price']]  # double [[ to create a DataFrame

Now, we determine correlations between price and all other attributes.

  • I will be combining both X (already converted categorical to numerical) and y to form a new dataframe for correlation
df_le = X_df.copy()
# add a new column to the dataframe, i.e. join the 2 dataframes column-wise
df_le['price'] = y_df['price'].values
df_le.corr()

Note: df_le = X_df would make df_le behave like a reference to X_df, so any change made to df_le would actually change X_df. That is why df_le = X_df.copy() is better.

  • It seems price is highly correlated with carat, fairly correlated with table, color and clarity, and not much with cut.

Note on feature scaling: it seems it is not needed here, since we have already log-transformed the properties. Nevertheless, if I had applied feature scaling, the code would have been as written below:

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_df = sc_X.fit_transform(X_df)
X_df[0:3]

Train Test Split

Data scientists generally split the data for machine learning into either two or three subsets: two for training and testing, or three for training, validation and testing. I will talk about this in detail a bit later.

This split helps us detect, and guard against, overfitting and underfitting.

I have explained overfitting and underfitting briefly in my other post here:

The code for train_test_split is as follows:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3, random_state=42)

Now we will run different algorithms. Once you are really ready with your data, coding a simple Machine Learning algorithm is a cakewalk.

Let me demonstrate a few with code, output and charts.

What we will do here is:

  • split the data into training set and test set.
  • Train the algorithm on training set data
  • Use the trained algorithm (or trained ML model) to predict prices from diamond properties in test data.
  • Verify / visualize / measure the differences between predicted prices and actual prices using scatterplots, histograms, accuracy metrics etc.

Linear regression

Lets start with the simplest of all, the ubiquitous linear regression model.

  1. Import LinearRegression class from Sci-kit learn
  2. Create an object of LinearRegression model
  3. Fit the model to X_train and y_train
  4. Make predictions
# Import the class from Sci-kit learn
from sklearn.linear_model import LinearRegression
# Create an object of LinearRegression model
reg_all = LinearRegression()
# Fit the model to X_train and y_train
reg_all.fit(X_train,y_train)
# Make predictions
y_pred=reg_all.predict(X_test)

Now visualize the discrepancy between predictions and actual prices using a scatterplot and a histogram.

import matplotlib.pyplot as plt
plt.scatter(y_test,y_pred)
import seaborn as sns
sns.distplot((y_test-y_pred),bins=50);

Note that this is a comparison between the logarithms of prices and predictions. See the code again: it is written as plt.scatter(y_test, y_pred) after all features were converted to the log scale using convertfeatures2log().

This means that to see the actual discrepancies, we should undo the log transform, i.e. take the exponential of every price and prediction, and then plot. The following code does this:

# convert prices and predictions back to exp
y_pred2 = np.exp(y_pred)
y_test2 = np.exp(y_test)

Now see the scatterplot and histogram again:
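The plotting calls are the same as before, just applied to the exponentiated values; here is a sketch (flattening to 1-D to keep matplotlib and seaborn happy):

# Sketch: same plots as before, now on the original price scale
plt.scatter(np.ravel(y_test2), np.ravel(y_pred2))
plt.show()
sns.distplot(np.ravel(y_test2) - np.ravel(y_pred2), bins=50)
plt.show()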

K-nearest neighbors (KNN)

Because so many API libraries exist for several machine learning algorithms, the code for all simple machine learning algorithms is straightforward.

Along the lines of the linear regression code, we use the sklearn library for the KNN algorithm as well.

from sklearn.neighbors import KNeighborsRegressor

reg_all = KNeighborsRegressor(n_neighbors=8, metric='minkowski', p=2)
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)

The distplot and scatterplot for logarithmic features are as follows. You can confirm from values in x-axis and y-axis that y_test and y_pred are log here.

The distplot and scatterplot for absolute values of features are as follows. You can confirm from values in x-axis and y-axis that y_test and y_pred are NOT log here.

Support vector machines (SVM)

from sklearn.svm import SVR

regressor = SVR(kernel='rbf')
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

Below is a comparison of scatterplots of log values of features (left plot) and absolute values (right plot) of features.

As of now, we can deduce that SVM is a better option, since it gives better metric scores and a better scatterplot than Linear Regression and KNN.

Regression Evaluation Metrics

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute values of the errors. It is the average error and the easiest to understand.

Mean Squared Error (MSE) is the mean of the squared errors. It "punishes" larger errors, which is often more useful in the real world.

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors. It is popular because it is interpretable in the units of y.
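For clarity, here is what these three metrics compute, written out with numpy; this sketch is numerically equivalent to the sklearn calls below:

# What the three metrics compute, written out with numpy (sketch)
errors = np.ravel(y_test) - np.ravel(y_pred)
mae = np.mean(np.abs(errors))           # Mean Absolute Error
mse = np.mean(errors ** 2)              # Mean Squared Error
rmse = np.sqrt(np.mean(errors ** 2))    # Root Mean Squared Error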

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The output for SVM is:

MAE: 0.08815796872409032
MSE: 0.012811091991743056
RMSE: 0.11318609451581522

Random Forest

Random forest is one of the most popular algorithms across use cases and industries. It is fast, easy to implement, needs less data, does not require extensive tuning and produces almost equally good results.

Again, because so many API libraries exist for several machine learning algorithms, the code for all simple machine learning algorithms is straightforward.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=10)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

Now the metrics and their outputs

MAE: 0.08413000449351056
MSE: 0.012880789432555585
RMSE: 0.1134935655997977

It turns out that Random Forest performs similarly to the far slower SVM.

The big daddy — Artificial neural networks

Generally, neural networks (ANNs) are better suited to classification problems requiring lots of complex decision making and heavy computation. They need larger datasets to optimize well and to benefit from their capacity for generalization and nonlinear mapping. If there is not enough data, a plain regression model may be better suited, even in the presence of a few nonlinearities.

However, just for sake of completeness, I will show you how to predict diamond price (a regression problem) using ANNs.

You can read more about whether Neural networks are really needed for regression problems or not in the following webpage.

And for ANN, the best, fastest and easiest to use and code library is Keras. Read more about Keras at their official website:

The step-wise process for creating an ANN is:

  1. Construct, compile and return a Keras model, which will then be used to fit/predict.
  2. Predict diamond prices
  3. Evaluate the model using metrics between ypred vs. ytest

First, import the libraries for our project:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Next, we construct a baseline_model() function to create and return a Keras model
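The exact architecture used in the original notebook is not shown; a minimal Keras sketch, assuming one hidden ReLU layer, a single linear output neuron and an MSE loss, could be:

# Sketch of a baseline Keras regressor (the architecture is an assumption):
# one hidden ReLU layer, a single linear output neuron, MSE loss with RMSprop.
def baseline_model():
    model = Sequential()
    model.add(Dense(16, input_dim=X_train.shape[1], activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer=RMSprop())
    return model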

  • Then, we wrap the model with KerasRegressor(), which makes the baseline Keras model usable like a scikit-learn estimator. It takes the model-building function, the number of epochs and the batch size as inputs. An epoch is one complete pass of the learning algorithm through the entire training dataset, during which the network's weights are updated. The batch size is the number of samples processed before the model's weights are updated; when one epoch consists of a single batch, so that all training samples pass through the learning algorithm before the weights are updated, this is called batch gradient descent.
  • More information on KerasRegressor can be found on the TensorFlow website:
  • Thereafter, we run the KFold() function. K-fold cross-validation takes a given dataset and splits it into K folds, where each fold is used once as the testing set while the other K-1 folds form the training set. For example, for 10-fold cross-validation (K=10), the dataset is split into 10 folds; in the first iteration the first fold is used for validation while the remaining 9 folds form the training set, in the second iteration the 2nd fold is used as the testing set while the rest serve as the training set, and this process is repeated until each of the 10 folds has been used as the testing set.
  • Subsequently, the cross_val_score function takes the model, X and y, and the KFold splitter as inputs and outputs a list of regression metric scores. It splits the data into K pieces using KFold as described above, trains on each combination of K-1 folds and returns the metric of the model on each held-out fold.
estimator = KerasRegressor(build_fn=baseline_model, epochs=10, batch_size=5)
kf = KFold(n_splits=5)
results = cross_val_score(estimator, X_train, y_train, cv=kf)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))

Brief information and steps on KFold and cross_val_score can be found in my other post; link below.

Finally, we fit the estimator to our training data to get predictions.

estimator.fit(X_train, y_train)
y_pred = estimator.predict(X_test)
# Plot a scatter plot like above to see prediction perfection
plt.scatter(y_test,y_pred)

The scatterplot output for the log features is below. It turns out the result is not as bad as you might have expected from my earlier remark that ANNs are mostly used for classification, not regression.

Predictions on Pricescope data


Now we start the process of diamond price predictions using roughly 1500 rows of Pricescope diamonds data.

I assume all the libraries are already imported, as shown at the beginning of this article.

Get Pricescope data

An excerpt from the website itself:

Pricescope is the premier diamond and jewelry community on the Internet. Visitors to Pricescope find clear and concise tutorials from industry experts and have their questions answered by knowledgeable forum members. The majority of consumers go online to learn about diamonds, while 90% buy diamonds from brick and mortar shops. Pricescope exists to help consumers get the best value online or in-store.

Take your time to digest all that is mentioned above. Copy the code and run it on your machine or in the cloud. The rest of the story, specific to generating predictions using PriceScope and CaratLane data, is shown in the following post.

References

For this particular post, I have referred to several websites, including Beyond4Cs, Machine Learning Mastery and a few other posts and websites on the internet.

About Me

I work on the IT projects side of an investment bank. I have 15 years of experience building production-ready applications in front-office trading, and for the last 3 years in Machine Learning, NLP, NER, anomaly detection, etc.

Over the course of my career in Artificial Intelligence and Machine Learning, I studied various medium / hackernoon / kdnuggets posts; took various courses on Coursera, Udemy, edX; watched numerous YouTube videos.
I believe gaining all this knowledge gives you a real boost in confidence and gets you ready for the next challenge in life.

However, even with years of study and project experience, one can hardly know even 1% of the field of AI/ML.
It is possibly as vast as the universe, ever-evolving, and humans have explored barely the tip of the iceberg.

Feel free to connect with me on LinkedIn or follow me on Medium
