Diamond price prediction using Python on PriceScope and CaratLane diamond listings

I downloaded data for roughly 1,500 diamonds for my own learning, not for any commercial purpose. The data is in the public domain on pricescope.com and is viewable and downloadable by anyone.

import pandas as pd

df = pd.read_csv("pricescope1.csv")
df.drop('Unnamed: 0', axis=1, inplace=True)  # drop the stray index column from the CSV export
Courtesy: whiteflash

Data pre-processing

  1. Replace $ with blank using regex
  2. Convert to int using astype()
  3. Convert array / series to dataframe using pd.DataFrame
priceint = pd.DataFrame(df['price'].replace(r'[\$,]', '', regex=True).astype(int))
df.drop(['price'], axis=1, inplace=True)
df['price'] = priceint['price'].values
df['flr'].isnull().sum() / len(df['flr']) * 100  # percent of nulls in 'flr'
indexnames = df[df['flr'].isnull()].index
df.drop(indexnames, inplace=True)  # drop rows with a null 'flr' (leaving the 1409 rows below)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1409 entries, 6 to 1414
Data columns (total 12 columns):
carat 1409 non-null float64
cut 1409 non-null object
color 1409 non-null object
clarity 1409 non-null object
depth 1409 non-null object
table 1409 non-null object
lab 1409 non-null object
sym 1409 non-null object
pol 1409 non-null object
flr 1409 non-null object
hna 1409 non-null object
price 1409 non-null int32
dtypes: float64(1), int32(1), object(10)
memory usage: 137.6+ KB
lenT = len([x for x in df['table'] if not x.isnumeric()]) / len(df['table'])*100
print('Percent of non-numeric data in Table -->', lenT)
lenD = len([x for x in df['depth'] if not x.isnumeric()]) / len(df['depth'])*100
print('Percent of non-numeric data in Depth -->', lenD)
all rubbish values --> {'-'}
Percent of identified rubbish data in Table --> 0.07097232079488999
all rubbish values --> {'-'}
Percent of identified rubbish data in Depth --> 0.07097232079488999
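The code that produced the "rubbish values" output above isn't shown in the article; a minimal sketch of one way to collect the non-numeric entries in a column (the toy values and the `rubbish_values` helper are assumptions for illustration):

```python
import pandas as pd

# Toy frame standing in for the article's data (values assumed)
df = pd.DataFrame({"table": ["57", "58", "-", "56"],
                   "depth": ["62", "-", "61", "60"]})

def rubbish_values(series):
    """Collect entries that are not plain numeric strings (one '.' allowed)."""
    return {x for x in series if not x.replace(".", "", 1).isnumeric()}

for col in ("table", "depth"):
    bad = rubbish_values(df[col])
    print("all rubbish values -->", bad)
    print(f"Percent of identified rubbish data in {col} -->",
          len(df[df[col].isin(bad)]) / len(df) * 100)
```

Stripping one decimal point before `isnumeric()` lets values like "61.5" pass while still catching placeholders like "-".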
indexnames = df[(df['table'] == '-') | (df['depth'] == '-')].index
df.drop(indexnames, inplace=True)  # remove the '-' rows, otherwise astype(float) fails
df['table'] = df['table'].astype(float)
df['depth'] = df['depth'].astype(float)
import seaborn as sns
sns.pairplot(df, x_vars=['carat', 'cut', 'clarity', 'color'], y_vars=['price'])
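One step the article doesn't show: LinearRegression cannot consume object-typed columns such as cut, color, and clarity directly, so they need a numeric encoding before modelling. A sketch assuming one-hot encoding via pd.get_dummies (the toy values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy slice with a couple of the article's columns (values assumed)
df = pd.DataFrame({
    "carat": [0.3, 0.5],
    "cut": ["Premium", "Ideal"],
    "color": ["G", "H"],
    "price": [516, 1200],
})

# One-hot encode the categorical columns; drop_first avoids redundant dummies
encoded = pd.get_dummies(df, columns=["cut", "color"], drop_first=True)
print(encoded.columns.tolist())
```

Each categorical column becomes a set of 0/1 indicator columns (e.g. cut_Premium), which a linear model can weight directly.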

Run ML algorithms

X_df = df.drop(['price'], axis=1)
y_df = df[['price']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3, random_state=42)
from sklearn.linear_model import LinearRegression
reg_all = LinearRegression()
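The fitting and scoring code behind the metrics below isn't shown; presumably it follows the standard sklearn fit/predict pattern with the usual metric functions. A self-contained sketch on synthetic data (the data and coefficients are made up, so the numbers won't match the article's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded diamond features
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

reg_all = LinearRegression()
reg_all.fit(X_train, y_train)          # fit on the training split
y_pred = reg_all.predict(X_test)       # predict on the held-out split

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

The sub-1 error magnitudes reported below suggest the target was scaled or log-transformed before fitting, though the article doesn't show that step.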
MAE: 0.11652136924155168
MSE: 0.02292105733384392
RMSE: 0.15139701890672722
MAE: 0.06791631018129721
MSE: 0.010798208786551588
RMSE: 0.10391443011705154
newdiamond = ['0.3', 'Premium', 'G', 'VS1', 57, 516, 'GIA', 'X', 'X', 'N', 'N']
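To score a new listing, it has to pass through the same encoding as the training frame so its columns line up with the fitted model. A sketch assuming one-hot encoding with a reindex against the training columns (the `train_columns` list here is a hypothetical stand-in for the columns of the encoded training set):

```python
import pandas as pd

feature_cols = ["carat", "cut", "color", "clarity", "depth", "table",
                "lab", "sym", "pol", "flr", "hna"]

# The article's new listing, with numeric fields given as numbers
newdiamond = [0.3, "Premium", "G", "VS1", 57, 516, "GIA", "X", "X", "N", "N"]
new_df = pd.DataFrame([newdiamond], columns=feature_cols)

# Align to the training design matrix's columns (list assumed for illustration);
# reindex fills any dummy column absent from this single row with 0
train_columns = ["carat", "depth", "table", "cut_Premium", "color_G",
                 "clarity_VS1", "lab_GIA", "sym_X", "pol_X", "flr_N", "hna_N"]
new_encoded = pd.get_dummies(new_df).reindex(columns=train_columns, fill_value=0)
print(new_encoded.shape)
```

With the row encoded this way, `reg_all.predict(new_encoded)` would return the model's price estimate for the listing.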


About Me

Data Engineer, investor, blogger (https://fivestepguide.com), quadragenarian, father of a 4-year-old. I like to share the knowledge and experience gained over 20+ years.
