# “Diamonds are forever” — price prediction using Machine Learning regression models and neural networks.

It’s the holiday season, and you can use this algorithm to predict the price of the diamond you really desire before purchasing it from the local retailer, which is handy for a little negotiating leverage.

## Import all basic libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import preprocessing
```

## Getting the data

```python
df = pd.read_csv("diamonds.csv")
df.drop('Unnamed: 0', axis=1, inplace=True)
display(df.head(3))
```

## Data pre-processing

```python
df.isnull().sum()
df.info()
```

```python
# plot price vs. carat
sns.pairplot(df, x_vars=['carat'], y_vars=['price'])
# plot carat vs. the other Cs
sns.pairplot(df, x_vars=['cut', 'clarity', 'color'], y_vars=['carat'])
plt.show()
```

```python
linear_vars = df.select_dtypes(include=[np.number]).columns
display(list(linear_vars))
```

```python
histplot(df, linear_vars)
```
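The `histplot` call above refers to a helper that is not shown in the post. A minimal sketch of what it likely does, with the signature inferred from the call site and the subplot layout an assumption, is:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def histplot(df, listvars):
    """Plot a histogram for each column named in listvars, one subplot per column."""
    fig, axes = plt.subplots(1, len(listvars), figsize=(4 * len(listvars), 3), squeeze=False)
    for ax, var in zip(axes[0], listvars):
        ax.hist(df[var].dropna(), bins=30)
        ax.set_title(var)
    fig.tight_layout()
    plt.show()
```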

## Convert the features to log scale

1. Check for any ZERO value amongst the features, namely table, depth, l, w and d, and in every other continuous variable. A zero would be a problem when converting to log scale, because log(0) is undefined. The simplest fix is to add a tiny constant, 0.01, to those columns.
```python
print('0 values -->', 0 in df.values)
df[linear_vars] = df[linear_vars] + 0.01
print('Filled all 0 values with 0.01. Now any 0 values? -->', 0 in df.values)
```
```
0 values --> True
Filled all 0 values with 0.01. Now any 0 values? --> False
```
```
'sorted by carat --> [5.02, 4.51, 4.14, 4.02, 4.02]'
'sorted by depth --> [79.01, 79.01, 78.21000000000001, 73.61, 72.91000000000001]'
'sorted by table --> [95.01, 79.01, 76.01, 73.01, 73.01]'
'sorted by price --> [21646.459999999995, 21640.709999999995, 21626.909999999996, 21624.609999999997, 21623.459999999995]'
'sorted by l --> [10.75, 10.24, 10.15, 10.03, 10.02]'
'sorted by w --> [58.91, 31.810000000000002, 10.549999999999999, 10.17, 10.11]'
'sorted by d --> [31.810000000000002, 8.07, 6.99, 6.7299999999999995, 6.4399999999999995]'
```
```python
dfboxplot(df, linear_vars)
```
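`dfboxplot` is another helper defined elsewhere in the post. The sketch below assumes it draws one box plot per numeric column, which would surface the outliers listed above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def dfboxplot(df, listvars):
    """Draw a box plot for each column named in listvars, one subplot per column."""
    fig, axes = plt.subplots(1, len(listvars), figsize=(3 * len(listvars), 3), squeeze=False)
    for ax, var in zip(axes[0], listvars):
        ax.boxplot(df[var].dropna())
        ax.set_title(var)
    fig.tight_layout()
    plt.show()
```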
```python
# this converts the dataframe's features to log scale in place
def convertfeatures2log(df, listvars):
    for var in listvars:
        df[var] = np.log(df[var])

convertfeatures2log(df, linear_vars)
histplot(df, linear_vars)
```

## Convert categorical columns to numerical columns using LabelEncoder

```python
convert_catg(df)
df.head(3)
```
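`convert_catg` is also a helper whose definition is not shown. A plausible sketch using scikit-learn's `LabelEncoder`, with the exact implementation an assumption, is:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def convert_catg(df):
    """Label-encode every object-dtype (categorical) column of df in place."""
    le = LabelEncoder()
    for col in df.select_dtypes(include="object").columns:
        df[col] = le.fit_transform(df[col])
```

Note that `LabelEncoder` assigns integers in alphabetical order, so the encoding does not respect the natural quality ordering of cut, color or clarity.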

# Now is the time to start coding machine learning algorithms on this data

```python
X_df = df.drop(['price', 'l', 'w', 'd'], axis=1)
y_df = df[['price']]  # double [[ ]] to create a DataFrame
```
• I will combine X (with categoricals already converted to numerical) and y into a new dataframe for correlation analysis
```python
df_le = X_df.copy()
# add a new column to the dataframe (join the two dataframes column-wise)
df_le['price'] = y_df['price'].values
df_le.corr()
```
• It seems price is highly correlated with carat, fairly with table, color and clarity, and not much with cut
```python
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_df = sc_X.fit_transform(X_df)
X_df[0:3]
```
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3, random_state=42)
```
• Split the data into a training set and a test set.
• Train the algorithm on the training-set data.
• Use the trained algorithm (the trained ML model) to predict prices from diamond properties in the test data.
• Verify, visualize and measure the differences between predicted prices and actual prices using scatter plots, histograms, error metrics, etc.

## Linear regression

1. Import the `LinearRegression` class from scikit-learn
2. Create a `LinearRegression` model object
3. Fit the model to X_train and y_train
4. Make predictions
```python
# Import the class from scikit-learn
from sklearn.linear_model import LinearRegression

# Create a LinearRegression model object
reg_all = LinearRegression()

# Fit the model to X_train and y_train
reg_all.fit(X_train, y_train)

# Make predictions
y_pred = reg_all.predict(X_test)
```
```python
import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred)
```
```python
import seaborn as sns
sns.distplot((y_test - y_pred), bins=50);
```
```python
# convert prices and predictions back from log scale with exp
y_pred2 = np.exp(y_pred)
y_test2 = np.exp(y_test)
```
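Because the model was trained on log prices, any metric computed on the raw predictions is in log units; exponentiating first gives errors in actual dollars. A toy illustration (the numbers below are made up, not model output):

```python
import numpy as np

# toy log-scale prices and predictions (illustrative numbers only)
y_test_log = np.log(np.array([500.0, 1500.0, 3000.0]))
y_pred_log = y_test_log + np.array([0.05, -0.02, 0.01])  # small log-scale errors

# exponentiate to get back to dollar-scale prices
y_test_usd = np.exp(y_test_log)
y_pred_usd = np.exp(y_pred_log)

# a metric on the exponentiated values is now an error in dollars
mae_usd = np.abs(y_test_usd - y_pred_usd).mean()
```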

## K-nearest neighbors (KNN)

```python
from sklearn.neighbors import KNeighborsRegressor
reg_all = KNeighborsRegressor(n_neighbors=8, metric='minkowski', p=2)
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
```

## Support vector machines (SVM)

```python
from sklearn.svm import SVR
regressor = SVR(kernel='rbf')
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
```

## Regression Evaluation Metrics

```python
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```
```
MAE: 0.08815796872409032
MSE: 0.012811091991743056
RMSE: 0.11318609451581522
```

## Random Forest

```python
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=10)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
```
```
MAE: 0.08413000449351056
MSE: 0.012880789432555585
RMSE: 0.1134935655997977
```

# The big daddy — Artificial neural networks

1. Construct, compile and return a Keras model, which will then be used to fit/predict.
2. Predict diamond prices
3. Evaluate the model using metrics comparing `y_pred` with `y_test`
```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
```
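The `baseline_model` function passed to `KerasRegressor` below is not shown in the post. A minimal sketch of step 1, where the layer sizes and hidden-layer count are assumptions and the input dimension is 6 because `price`, `l`, `w` and `d` were dropped, is:

```python
from keras.models import Sequential
from keras.layers import Dense

def baseline_model():
    """Construct, compile and return a small fully-connected regression network."""
    model = Sequential()
    model.add(Dense(16, input_dim=6, activation="relu"))  # 6 input features
    model.add(Dense(8, activation="relu"))
    model.add(Dense(1))  # single linear output: the (log) price
    model.compile(loss="mean_squared_error", optimizer="rmsprop")
    return model
```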
• Then, we run the `KerasRegressor()` function, which wraps a baseline ANN built in Keras so that it can be used like a scikit-learn estimator. It takes as input the model-building function, the number of epochs and the batch size. An epoch is one complete pass of the learning algorithm through the entire training dataset, updating the neural network's weights along the way; the batch size is the number of samples processed before the model's weights are updated. When an epoch consists of a single batch, so all training samples pass through the learning algorithm before the weights are updated, this is called batch gradient descent.
• More information on `KerasRegressor` can be found on the TensorFlow website.
• Thereafter, we run the `KFold()` function. K-fold cross-validation splits a dataset into K folds; each fold in turn is used as the test set while the other K-1 form the training set. For example, with 10-fold cross-validation (K=10) the dataset is split into 10 folds: in the first iteration the first fold is used for validation while the remaining 9 folds form the training set, in the second iteration the 2nd fold is the test set while the rest serve as the training set, and so on until each of the 10 folds has been used as the test set.
• Subsequently, the `cross_val_score` function takes the model, X and y, and the `KFold` splitter as inputs and outputs a list of regression metric scores: it splits the data into K pieces as described above, trains on each combination of K-1 folds, and reports the model's score on the held-out fold each time.
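The fold mechanics described above can be seen with a tiny example (the 10-sample array is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)
kf = KFold(n_splits=5)
# each of the 5 folds holds out 2 samples for testing and trains on the other 8
folds = [(list(train), list(test)) for train, test in kf.split(X)]
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```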
```python
estimator = KerasRegressor(build_fn=baseline_model, epochs=10, batch_size=5)
kf = KFold(n_splits=5)
results = cross_val_score(estimator, X_train, y_train, cv=kf)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
```
```python
estimator.fit(X_train, y_train)
y_pred = estimator.predict(X_test)
# plot a scatter plot, as above, to inspect prediction quality
plt.scatter(y_test, y_pred)
```

# Predictions on Pricescope data

## Get Pricescope data

Pricescope is the premier diamond and jewelry community on the Internet. Visitors to Pricescope find clear and concise tutorials from industry experts and have their questions answered by knowledgeable forum members. The majority of consumers go online to learn about diamonds, while 90% buy diamonds from brick-and-mortar shops. Pricescope exists to help consumers get the best value online or in-store.


## More from Jatin Grover

Data Engineer. Investor, blogger (https://fivestepguide.com) quadragenarian, father of 4 year old. I like to share knowledge & experience gained over 20+ years
