# Diamond price prediction using Python on PriceScope and CaratLane diamond listings

Continuation story of Diamond Price Prediction using ML

This post is more of a continuation of the Diamond Price Prediction using Machine Learning story.

The first post sowed the seed: it used Machine Learning regression algorithms to predict the price of a high-value item, using a publicly available dataset of roughly 54,000 diamonds on Kaggle.

I have downloaded the data for roughly 1500 diamonds for my own learning and not for any commercial purpose. The data is available in the public domain on pricescope.com and is viewable and downloadable by anyone.

```python
df = pd.read_csv("pricescope1.csv")
df.drop('Unnamed: 0', axis=1, inplace=True)
display(df.head(3))
```

p.s. all functions called in the code in this post, such as `convertfeatures2log()`, are either defined / described here or in my earlier post.

## Data pre-processing

Now we perform all necessary steps to process data, just like we did earlier.

Convert price in dollars to integers

As we see, the prices in the pricescope dataset are stored as currency strings, i.e. prices are shown in dollars with thousands separators. Since a machine learning algorithm needs numerical data, we need to convert the dollar values to integers. We do it as follows.

1. Replace the `$` sign and thousands-separator commas with blanks using regex
2. Convert to int using `astype()`
3. Wrap the array / series in a dataframe using `pd.DataFrame`
```python
priceint = pd.DataFrame(df['price'].replace('[\\$,]', '', regex=True).astype(int))
df.drop(['price'], axis=1, inplace=True)
df['price'] = priceint['price'].values
```

Check if any null values present — the same way we did earlier.

`df.isnull().sum()`

Note that Fluorescence has empty values. Let's see how many rows have empty Fluorescence, and get rid of that data if such rows make up less than 0.5% of total rows.

```python
df['flr'].isnull().sum() / len(df['flr']) * 100
```

Output is: `0.4240282685512367`

We can see that null values account for less than 0.5% of the data. So, let's delete those rows.

```python
indexnames = df[df['flr'].isnull()].index
df.drop(axis=0, index=indexnames, inplace=True)
```

Now check that all columns have the expected datatypes, using the same function as earlier.

```python
df.info()
```

The output is as follows:

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1409 entries, 6 to 1414
Data columns (total 12 columns):
carat      1409 non-null float64
cut        1409 non-null object
color      1409 non-null object
clarity    1409 non-null object
depth      1409 non-null object
table      1409 non-null object
lab        1409 non-null object
sym        1409 non-null object
pol        1409 non-null object
flr        1409 non-null object
hna        1409 non-null object
price      1409 non-null int32
dtypes: float64(1), int32(1), object(10)
memory usage: 137.6+ KB
```

We see that depth and table, the only 2 continuous features apart from carat, are of object datatype, whereas they should have been float or int. This means there are some string values present. Let's examine them and convert to numeric.

```python
# Note: str.isnumeric() is False for decimal strings like '61.5',
# which is why Depth shows such a large "non-numeric" percentage below.
lenT = len([x for x in df['table'] if not x.isnumeric()]) / len(df['table']) * 100
print('Percent of non-numeric data in Table -->', lenT)
lenD = len([x for x in df['depth'] if not x.isnumeric()]) / len(df['depth']) * 100
print('Percent of non-numeric data in Depth -->', lenD)
```

```
Percent of non-numeric data in Table --> 5.25195173882186
Percent of non-numeric data in Depth --> 92.902767920511
```

Now let’s see if these are floats stored as string or some rubbish values.

Create a `removenotnum()` function and call it to see values
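The original `removenotnum()` is defined in my earlier post; a minimal sketch of what such a helper could look like, assuming it simply collects values that cannot be parsed as numbers and reports their share:

```python
import pandas as pd

def removenotnum(series, label):
    """Collect values that are neither ints nor floats and report their share.
    (A sketch; the original helper from the earlier post may differ.)"""
    def is_number(x):
        try:
            float(x)
            return True
        except (TypeError, ValueError):
            return False

    rubbish = {x for x in series if not is_number(x)}
    print('all rubbish values -->', rubbish)
    pct = sum(not is_number(x) for x in series) / len(series) * 100
    print(f'Percent of identified rubbish data in {label} -->', pct)
    return rubbish

# toy stand-in series instead of df['table'] / df['depth']
removenotnum(pd.Series(['57', '58.5', '-', '61']), 'Table')
```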

The output is:

```
all rubbish values --> {'-'}
Percent of identified rubbish data in Table --> 0.07097232079488999
all rubbish values --> {'-'}
Percent of identified rubbish data in Depth --> 0.07097232079488999
```

It seems only a single type of rubbish character, '-', is stored, in about 0.07% of the data. Since it's a minuscule part of the entire dataset, we can safely remove those entire rows for now. Drop the rows with rubbish values for Table and Depth.

```python
indexnames = df[(df['table'] == '-') | (df['depth'] == '-')].index
df.drop(axis=0, index=indexnames, inplace=True)
```

Now convert the object-datatyped table and depth columns to float.

```python
df['table'] = df['table'].astype(float)
df['depth'] = df['depth'].astype(float)
```

Next, we plot a Pair-Plot of Price vs. 4 Cs

```python
sns.pairplot(df, x_vars=['carat', 'cut', 'clarity', 'color'], y_vars=['price'])
plt.show()
```

Now, run the `convert_catg()` function on the pricescope dataset to convert categorical columns to numerical columns.
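`convert_catg()` is defined in the earlier post; one plausible sketch of such a conversion, assuming plain integer coding (the original may instead use explicit, quality-ordered mappings for cut, color, etc.):

```python
import pandas as pd

def convert_catg(df):
    """Replace every object-dtype column with integer codes, in order of
    first appearance. (A sketch, not the original helper.)"""
    for col in df.select_dtypes(include='object').columns:
        df[col] = pd.factorize(df[col])[0]
    return df

# toy stand-in frame instead of the pricescope data
demo = pd.DataFrame({'cut': ['Ideal', 'Premium', 'Ideal'], 'carat': [0.3, 0.4, 0.5]})
convert_catg(demo)
print(demo['cut'].tolist())  # [0, 1, 0]
```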

Now, we check for any outliers by calling the `dfboxplot()` function to plot boxplots of all properties. We will again remove the rows containing outlier data for any property if such rows are less than 0.5% of the dataset.
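The 0.5% removal rule described above can be sketched as follows; `drop_rare_outliers` is a hypothetical helper (the post's own `dfboxplot()` only plots), using the standard 1.5×IQR whisker definition of an outlier:

```python
import pandas as pd

def drop_rare_outliers(df, col, max_pct=0.5):
    """Drop rows whose value in `col` lies outside the 1.5*IQR whiskers,
    but only if such rows are less than max_pct percent of the data.
    (A sketch of the rule in the text, not a helper from the post.)"""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    if 0 < mask.mean() * 100 < max_pct:
        df = df[~mask]
    return df

# toy example: one extreme value among 1000 ordinary ones (~0.1% of rows)
demo = pd.DataFrame({'v': [1.0] * 1000 + [100.0]})
print(len(drop_rare_outliers(demo, 'v')))  # 1000
```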

Since I downloaded the data from a currently running website, it is not quite possible that we will get any irregular data.

As expected, we see there are no outliers.

Now we convert the values of all continuous features (`carat`, `depth`, `table`, `price`) to log by calling `convertfeatures2log()`, for better use by machine learning algorithms.
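`convertfeatures2log()` comes from the earlier post; a minimal sketch, assuming it simply takes the natural log of each listed column in place:

```python
import numpy as np
import pandas as pd

def convertfeatures2log(df, cols):
    """Replace each listed continuous column with its natural log.
    (A sketch of the helper from the earlier post.)"""
    for col in cols:
        df[col] = np.log(df[col])
    return df

# toy stand-in values: log(1) = 0 and log(e) = 1
demo = pd.DataFrame({'carat': [1.0, np.e], 'price': [np.e, 1.0]})
convertfeatures2log(demo, ['carat', 'price'])
print(demo['carat'].tolist())
```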

## Run ML algorithms

First we set X and y, as previously.

```python
X_df = df.drop(['price'], axis=1)
y_df = df[['price']]
```

Now, let's look at the correlation between price and all the other attributes.
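One way to inspect that correlation is with pandas' built-in `DataFrame.corr()`; the frame below is a toy stand-in for the (already fully numeric) pricescope data:

```python
import pandas as pd

# small stand-in frame; in the post this would be the processed df
demo = pd.DataFrame({
    'carat': [0.3, 0.5, 0.7, 1.0],
    'depth': [61.0, 62.5, 60.0, 61.5],
    'price': [500, 1500, 2500, 5000],
})

# correlation of every feature with price, strongest first
corr_with_price = demo.corr()['price'].sort_values(ascending=False)
print(corr_with_price)
```

As expected for diamonds, carat dominates the correlation with price.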

Train Test Split again:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3, random_state=42)
```

### Linear model

We run the Linear regression model first.

```python
from sklearn.linear_model import LinearRegression

reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
```

The histplot and scatterplot suggest that this simple linear regression algorithm is a really good predictor.

The metrics however tell a different story:

```
MAE: 0.11652136924155168
MSE: 0.02292105733384392
RMSE: 0.15139701890672722
```
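These three metrics are typically computed with scikit-learn's metrics module; shown here on small stand-in arrays rather than the actual `y_test` / `y_pred`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# toy stand-ins for y_test and y_pred
y_true = np.array([2.0, 3.0, 4.0])
y_hat = np.array([2.1, 2.9, 4.2])

mae = mean_absolute_error(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # RMSE is just the square root of MSE
print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse)
```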

### Random forest

As everyone knows, and as we just verified while predicting diamond prices on the Kaggle dataset, Random Forest is a dependable algorithm.
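The random forest fit itself is not shown in the post; a sketch of the typical scikit-learn setup, on stand-in data and with assumed default-ish hyperparameters (the metrics quoted below come from the real dataset, not this snippet):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# stand-in data; in the post this would be X_train / y_train from above
rng = np.random.default_rng(42)
X = rng.random((200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.01, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)
score = rf.score(X_te, y_te)  # R^2 on held-out data
print('R^2:', score)
```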

```
MAE: 0.06791631018129721
MSE: 0.010798208786551588
RMSE: 0.10391443011705154
```

### Neural network

We tweak the definition of the baseline model and provide different number of layers and activation functions.

Note that we give the first layer 11 input features and 18 hidden nodes, and the next layer 12 nodes. I experimented with several combinations, and these turned out to be acceptable ones.
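A sketch of that tweaked baseline model in Keras: 11 input features, a hidden layer of 18 nodes, then 12 nodes, then a single output for (log) price. The activation functions and compile settings here are assumptions; the post does not spell them out.

```python
from tensorflow import keras

def baseline_model():
    """Build the tweaked baseline regressor described above (a sketch)."""
    model = keras.Sequential([
        keras.layers.Input(shape=(11,)),            # 11 input features
        keras.layers.Dense(18, activation='relu'),  # 18 hidden nodes
        keras.layers.Dense(12, activation='relu'),  # next layer: 12 nodes
        keras.layers.Dense(1),                      # linear output for regression
    ])
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

model = baseline_model()
print(model.count_params())  # 216 + 228 + 13 = 457 trainable weights
```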

The output for the ANN, after running the `KerasRegressor`, `KFold`, and `cross_val_score` functions as we did earlier, is just a tad better than linear regression but not as good as random forest.

To estimate the price of a diamond you see in the diamond district, you call the predict function of this model as follows. The diamond in consideration is this.

```python
# The raw attributes must first be encoded and log-transformed exactly as the
# training data was; predict() also expects a 2-D array (one row per diamond).
newdiamond = [['0.3', 'Premium', 'G', 'VS1', 57, 516, 'GIA', 'X', 'X', 'N', 'N']]
rf.predict(newdiamond)
```

## References

For this particular post, I referred to several websites, including Beyond4Cs, Machine Learning Mastery, and a few other posts and websites on the internet.

I work on the IT projects side of an investment bank. I have 15 years of experience building production-ready applications in front-office trading, and for the last 3 years in Machine Learning, NLP, NER, anomaly detection, etc.

Over the course of my career in Artificial Intelligence and Machine Learning, I have studied various Medium / Hackernoon / KDnuggets posts; taken various courses on Coursera, Udemy, and edX; and watched numerous YouTube videos.
I believe gaining all this knowledge gives you a real boost in confidence and gets you ready for the next challenge in life.

However, even with years of study and hands-on projects, one can hardly know even 1% of the field of AI/ML.
It is possibly as vast as the universe, ever-evolving, and humans have explored barely the tip of the iceberg.

Feel free to connect with me on LinkedIn or follow me on Medium.

Data Engineer. Investor, blogger (https://fivestepguide.com) quadragenarian, father of 4 year old. I like to share knowledge & experience gained over 20+ years
