Diamond price prediction using Python on PriceScope and CaratLane diamond listings

Continuation story of Diamond Price Prediction using ML

This post is more of a continuation of the Diamond Price Prediction using Machine Learning story.

The first post laid the groundwork for using Machine Learning regression algorithms to predict the price of a high-value item, using a publicly available dataset of roughly 54,000 diamonds on Kaggle.

This particular post talks about downloading data from PriceScope and CaratLane and using it to predict diamond prices.

I have downloaded the data for roughly 1500 diamonds for my own learning and not for any commercial purpose. The data is in the public domain on pricescope.com and is viewable and downloadable by anyone.

import pandas as pd

df = pd.read_csv("pricescope1.csv")
df.drop('Unnamed: 0', axis=1, inplace=True)

p.s. All functions called in the code in this post, such as convertfeatures2log(), are either defined / described here or in my earlier post.

Courtesy: whiteflash

Data pre-processing

Now we perform all necessary steps to process data, just like we did earlier.

Convert price in dollars to integers

As we see, the prices in the pricescope dataset are stored as currency strings, i.e. dollar amounts with thousands separators. Since a machine learning algorithm needs numerical data, we need to convert these dollar values to integers. We do it as follows:

  1. Replace $ with blank using regex
  2. Convert to int using astype()
  3. Convert array / series to dataframe using pd.DataFrame
priceint = pd.DataFrame(df['price'].replace(r'[\$,]', '', regex=True).astype(int))
df.drop(['price'], axis=1, inplace=True)
df['price'] = priceint['price'].values

Check if any null values are present — the same way we did earlier.
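The null check itself isn't reproduced in this post; a minimal version of the usual pandas idiom, run here on a small hypothetical frame mirroring two of the pricescope columns, could look like this:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the pricescope download
df = pd.DataFrame({
    "carat": [0.30, 0.51, 0.70],
    "flr": ["N", None, "M"],
})

# Null count per column, as a quick health check
null_counts = df.isnull().sum()
print(null_counts)
```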


Note that Fluorescence has empty values. Let’s see how many rows have empty Fluorescence, and get rid of that data if such rows are less than 0.5% of total rows.

df['flr'].isnull().sum() / len(df['flr']) * 100

Output is: 0.4240282685512367

We can see that null values make up less than 0.5% of the data. So, let’s delete those rows.

indexnames = df[df['flr'].isnull()].index
df.drop(indexnames, inplace=True)

Now check whether all columns have the expected datatypes, using the same function as used earlier.
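The datatype check from the earlier post boils down to pandas' `df.info()`. A sketch on a toy frame that reproduces the problem described below (depth and table arriving as strings):

```python
import pandas as pd

# Toy frame: depth and table come in as strings,
# so pandas reports them as 'object' dtype
df = pd.DataFrame({
    "carat": [0.30, 0.51],
    "depth": ["61.5", "62.0"],
    "table": ["57", "-"],
})

df.info()  # prints one line per column with its non-null count and dtype
```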


The output is as follows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1409 entries, 6 to 1414
Data columns (total 12 columns):
carat 1409 non-null float64
cut 1409 non-null object
color 1409 non-null object
clarity 1409 non-null object
depth 1409 non-null object
table 1409 non-null object
lab 1409 non-null object
sym 1409 non-null object
pol 1409 non-null object
flr 1409 non-null object
hna 1409 non-null object
price 1409 non-null int32
dtypes: float64(1), int32(1), object(10)
memory usage: 137.6+ KB

We see that depth and table, the only 2 continuous features apart from carat, are of object datatype, whereas they should have been float or int. This means there are some string values present. Let’s examine and convert them to numeric.

# Note: str.isnumeric() is False for strings containing a decimal point,
# so floats stored as strings are also counted as "non-numeric" here
lenT = len([x for x in df['table'] if not x.isnumeric()]) / len(df['table']) * 100
print('Percent of non-numeric data in Table -->', lenT)
lenD = len([x for x in df['depth'] if not x.isnumeric()]) / len(df['depth']) * 100
print('Percent of non-numeric data in Depth -->', lenD)

Percent of non-numeric data in Table → 5.25195173882186
Percent of non-numeric data in Depth → 92.902767920511

Now let’s see if these are floats stored as string or some rubbish values.

Create a removenotnum() function and call it to see values
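The exact removenotnum() implementation isn't shown in the post; a possible sketch, assuming it collects values that cannot be parsed as numbers at all and reports what fraction of the column they occupy (the helper and sample data here are illustrative):

```python
import pandas as pd

def removenotnum(series, name):
    # Values that can't be parsed as a number at all are "rubbish"
    rubbish = set()
    for v in series:
        try:
            float(v)
        except ValueError:
            rubbish.add(v)
    print('all rubish values -->', rubbish)
    pct = series.isin(list(rubbish)).sum() / len(series) * 100
    print(f'Percent of identified rubbish data in {name} -->', pct)
    return rubbish, pct

# Toy column: three parseable values and one '-' placeholder
table = pd.Series(['57', '58.5', '-', '56'])
rubbish, pct = removenotnum(table, 'Table')
```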

The output is:

all rubish values --> {'-'}
Percent of identified rubbish data in Table --> 0.07097232079488999
all rubish values --> {'-'}
Percent of identified rubbish data in Depth --> 0.07097232079488999

It seems only a single type of rubbish character, ‘-’, is present, at well under 0.1% of the data. Since it’s a minuscule part of the entire dataset, we can safely remove those entire rows for now. Drop rows with rubbish values for Table and Depth.

indexnames = df[(df['table'] == '-') | (df['depth'] == '-')].index
df.drop(indexnames, inplace=True)

Now convert the object-typed table and depth columns to float.

df['table'] = df['table'].astype(float)
df['depth'] = df['depth'].astype(float)

Next, we plot a pair-plot of price vs. the 4 Cs.

sns.pairplot(df, x_vars=['carat', 'cut', 'clarity', 'color'], y_vars=['price'])

Now, run the convert_catg() function on the pricescope dataset to convert categorical columns to numerical columns.
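The convert_catg() helper is defined in the earlier post; a minimal sketch of one common approach, assuming it label-encodes every object-dtype column (the sample frame is illustrative):

```python
import pandas as pd

def convert_catg(df):
    # Replace every object-dtype column with integer category codes
    out = df.copy()
    for col in out.select_dtypes(include='object').columns:
        out[col] = out[col].astype('category').cat.codes
    return out

df = pd.DataFrame({'cut': ['Ideal', 'Premium', 'Ideal'],
                   'carat': [0.3, 0.5, 0.7]})
df_num = convert_catg(df)
```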

Now, we check for any outliers by calling the dfboxplot() function to plot boxplots of all properties. We will again remove the rows containing outlier data for any property if such rows are less than 0.5% of the dataset.
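The outlier rule behind a boxplot can be sketched as follows — this is the standard 1.5 × IQR whisker criterion, not necessarily the exact dfboxplot() implementation from the earlier post:

```python
import pandas as pd

def outlier_rows(series):
    # Standard boxplot rule: flag points beyond 1.5 * IQR of the quartiles
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

table = pd.Series([55, 56, 57, 57, 58, 90])  # 90 is an obvious outlier
outliers = outlier_rows(table)
print(outliers)
```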

Since I downloaded the data from a currently running website, it is unlikely that we will find much irregular data.

As expected, we see there are no outliers.

Now we convert values of all continuous features (‘carat’, ‘depth’, ‘table’, ‘price’) to log by calling convertfeatures2log() for better use by machine learning algorithms.
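A minimal sketch of what convertfeatures2log() likely does, assuming a plain natural-log transform of the listed columns (the sample values are chosen so the expected logs are obvious):

```python
import numpy as np
import pandas as pd

def convertfeatures2log(df, cols):
    # Replace each continuous column with its natural log
    out = df.copy()
    for col in cols:
        out[col] = np.log(out[col])
    return out

df = pd.DataFrame({'carat': [1.0, np.e], 'price': [np.e, np.e ** 2]})
df_log = convertfeatures2log(df, ['carat', 'price'])
```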

Run ML algorithms

First we set X and y, as previously.

X_df = df.drop(['price'], axis=1)
y_df = df[['price']]

Now, we see the correlation between price vs all other attributes as follows:
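A quick way to see that correlation with pandas, shown here on a small illustrative frame rather than the actual pricescope data:

```python
import pandas as pd

df = pd.DataFrame({
    'carat': [0.3, 0.5, 0.7, 1.0],
    'table': [57, 58, 56, 59],
    'price': [400, 900, 1800, 4000],
})

# Pearson correlation of every numeric column against price, sorted
corr = df.corr()['price'].sort_values(ascending=False)
print(corr)
```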

Train Test Split again:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3, random_state=42)

Linear model

We run the Linear regression model first.

from sklearn.linear_model import LinearRegression

reg_all = LinearRegression()

The histplot and scatterplot suggest this simple linear regression algorithm is a really good prediction algorithm.

The metrics however tell a different story:

MAE: 0.11652136924155168
MSE: 0.02292105733384392
RMSE: 0.15139701890672722
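The fit-predict-evaluate loop behind these metrics can be sketched end to end on synthetic data (the single feature and its coefficient below are stand-ins, not the actual diamond features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(0.2, 2.0, size=(200, 1))           # stand-in for log(carat)
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, size=200)   # stand-in for log(price)

reg_all = LinearRegression().fit(X, y)
y_pred = reg_all.predict(X)

mae = mean_absolute_error(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse)
```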

Random forest

As everyone knows, and as we just saw when predicting diamond prices from the Kaggle dataset, Random Forest is a dependable algorithm.

MAE: 0.06791631018129721
MSE: 0.010798208786551588
RMSE: 0.10391443011705154
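A minimal sketch of the Random Forest step, again on synthetic data with a deliberately nonlinear target to show why a forest can beat the linear model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0.2, 2.0, size=(300, 1))
y = 3.0 * X[:, 0] ** 2 + rng.normal(0, 0.1, size=300)  # nonlinear target

# Fit a forest and measure its (in-sample) mean absolute error
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
mae = mean_absolute_error(y, rf.predict(X))
print('MAE:', mae)
```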

Neural network

We tweak the definition of the baseline model, providing a different number of layers and activation functions.

Note that we give the first layer 11 input features and 18 hidden nodes, and the next layer 12 output nodes. I tried several combinations, and these turned out to be acceptable ones.

The output for ANN, after running KerasRegressor, KFold, and cross_val_score functions as we did earlier is just a tad better than Linear regression but not as good as random forest.

To estimate the price of a diamond you see in the diamond district, call the predict function of this model as follows. The diamond in consideration is this.

newdiamond = ['0.3', 'Premium', 'G', 'VS1', 57, 516, 'GIA', 'X', 'X', 'N', 'N']
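Before calling predict, the new diamond's categorical fields must go through the same encoding and log transforms as the training data. A toy single-feature stand-in (log price vs. log carat on illustrative numbers, not the fitted ANN) shows the shape of that call and the inversion of the log transform:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy fit on log(price) ~ log(carat); the real model uses all 11
# (encoded) features of the new diamond
X = np.log([[0.3], [0.5], [0.7], [1.0]])
y = np.log([400, 900, 1800, 4000])
model = LinearRegression().fit(X, y)

new_x = np.log([[0.3]])                       # encoded features of the new diamond
pred_price = np.exp(model.predict(new_x))[0]  # invert the log transform to dollars
print(pred_price)
```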


For this particular post, I have referred to several websites including Beyond4Cs, Machine Learning Mastery, and a few other posts and websites on the internet.

About Me

I work in IT projects side of an Investment bank. I have 15 years of experience in building production ready applications in Front-office Trading and for the last 3 years in Machine Learning, NLP, NER, Anomaly detection etc.

Over the course of my career in Artificial Intelligence and Machine Learning, I studied various medium / hackernoon / kdnuggets posts; took various courses on Coursera, Udemy, edX; watched numerous YouTube videos.
I believe gaining all this knowledge gives you a real boost in confidence and gets you ready for the next challenge in life.

However, even with years of study and experience (projects), one can hardly know even 1% of the field of AI and ML.
It is possibly as vast as the universe, ever-evolving and explored by humans not even at the level of an iceberg tip.

Feel free to connect with me on LinkedIn or follow me on Medium

Data Engineer. Investor, blogger (https://fivestepguide.com), quadragenarian, father of a 4-year-old. I like to share knowledge & experience gained over 20+ years.
