Other posts in this series:
Diamonds - Part 1 - In the rough - An Exploratory Data Analysis
Diamonds - Part 2 - A cut above - Building Linear Models
In a couple of previous posts, we tried to understand which attributes of diamonds are important in determining their prices. We showed that carat, clarity and color are the most important predictors of price. We arrived at this conclusion after doing a detailed exploratory data analysis.
In a previous post in this series, we did an exploratory data analysis of the diamonds dataset and found that carat, x, y and z were strongly correlated with price. Clarity also appeared to provide some predictive ability.
In this post, we will build linear models and see how well they predict the price of diamonds.
Before we do any transformations, feature engineering or feature selection for our model, let’s see what kind of results we get from a base linear model that uses all the features to predict price.
In this case study, we will explore the diamonds dataset, then build linear and non-linear regression models to predict the price of diamonds.
Data Description
The diamonds dataset contains the prices, in 2008 US dollars, and other attributes of almost 54,000 diamonds.
| Attribute | Description |
| --- | --- |
| price | price in 2008 USD |
| carat | weight of a diamond (1 carat = 0.2 g) |
| cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
| color | diamond color, from D (best) to J (worst) |
| clarity | a measurement of how clear the diamond is: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best) |
| x | length in mm |
| y | width in mm |
| z | depth in mm |
| depth | total depth percentage = z / mean(x, y) |
| table | width of the top of the diamond relative to its widest point |
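The derived quantities in the attribute descriptions above can be checked by hand. The following is a minimal Python sketch, not code from the original analysis: the function and constant names are hypothetical, the sample measurements are illustrative, and the cut levels are assumed to be listed from worst to best. The depth ratio is multiplied by 100 to express it as a percentage.

```python
# Ordered levels, worst to best. Color and clarity follow the stated
# best/worst labels; the cut order is an assumption based on list order.
CUT_LEVELS = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_LEVELS = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_LEVELS = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

GRAMS_PER_CARAT = 0.2  # 1 carat = 0.2 g


def carat_to_grams(carat):
    """Convert a weight in carats to grams."""
    return carat * GRAMS_PER_CARAT


def depth_percentage(x, y, z):
    """Total depth percentage: z / mean(x, y), scaled to percent."""
    return 100.0 * z / ((x + y) / 2.0)


def level_rank(level, levels):
    """Position of a level in its ordered list (0 = worst)."""
    return levels.index(level)


if __name__ == "__main__":
    # Illustrative measurements in mm (hypothetical example stone).
    x, y, z = 3.95, 3.98, 2.43
    print(round(depth_percentage(x, y, z), 2))
    print(carat_to_grams(0.23))
    print(level_rank("SI2", CLARITY_LEVELS))
```

Encoding cut, color and clarity by their rank is one simple way to preserve their ordering when they are later used as model inputs; whether that encoding suits a given model is a separate choice.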
Data Summaries
A preliminary visual summary of the whole dataset shows all the features and their types.