In this case study, we will use the Ames Housing dataset to explore regression techniques and predict the sale price of houses.
Data Summaries
The Ames Housing dataset contains the sale prices of properties in Ames, Iowa, along with 79 explanatory features and an Id column that identifies each property, for 81 columns in total.
Here are the dimensions of the training and testing sets respectively:
[1] "Dimensions of the training set"
[1] 1460 81
[1] "Dimensions of the testing set"
[1] 1459 81
Now, let’s combine training and testing into a single dataset and take a look at the count of missing values:
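The combine-and-count step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code behind the original output. The rows, column names, and values below are hypothetical stand-ins, and it assumes that test rows have no SalePrice, so the target is filled in as missing when the sets are stacked.

```python
# Hypothetical miniature stand-ins for the training and testing sets;
# the columns and values are illustrative, not taken from the real data.
train = [
    {"Id": 1, "LotFrontage": 60.0, "Alley": None, "SalePrice": 100000},
    {"Id": 2, "LotFrontage": None, "Alley": None, "SalePrice": 150000},
]
test = [
    {"Id": 3, "LotFrontage": 80.0, "Alley": "Pave"},
]


def combine(train_rows, test_rows, target="SalePrice"):
    """Stack both sets; rows lacking the target get it filled with None."""
    combined = []
    for row in train_rows + test_rows:
        filled = dict(row)               # copy so the inputs stay untouched
        filled.setdefault(target, None)  # assumption: test rows have no target
        combined.append(filled)
    return combined


def missing_counts(rows):
    """Return {column: count of None values}, largest count first."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] = counts.get(column, 0) + 1
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))


full = combine(train, test)
missing = missing_counts(full)
print(len(full), "rows in the combined dataset")
print(missing)
```

Note that under this assumption the target column itself shows up as missing for every test row, so it should be set aside before deciding how to impute the real gaps.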
In this case study, we will explore the diamonds dataset, then build linear and non-linear regression models to predict the price of diamonds.
Data Description
The diamonds dataset contains the prices (in 2008 USD) and other attributes of almost 54,000 diamonds.
Attribute   Description
price       price in 2008 USD
carat       weight of a diamond (1 carat = 0.2 g)
cut         quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color       diamond color from D (best) to J (worst)
clarity     a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x           length in mm
y           width in mm
z           depth in mm
depth       total depth percentage = z / mean(x, y)
table       width of the top of diamond relative to widest point
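A short Python sketch can make the derived and ordered attributes concrete. It computes the depth percentage from the formula in the table, converts carats to grams, and maps the ordered categories to integer ranks. The function names and the measurements are hypothetical, and multiplying by 100 to express depth as a percentage is an assumption about how the formula is meant to be read.

```python
# Ordered levels taken from the attribute descriptions, listed worst to best.
CUT_LEVELS = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_LEVELS = ["J", "I", "H", "G", "F", "E", "D"]  # D is best, J is worst
CLARITY_LEVELS = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def carat_to_grams(carat):
    """Convert carat weight to grams (1 carat = 0.2 g)."""
    return carat * 0.2


def total_depth_percentage(x, y, z):
    """Depth z divided by the mean of x and y, scaled to a percentage.

    The factor of 100 is an assumption based on the word "percentage".
    """
    return 100 * z / ((x + y) / 2)


def ordinal_rank(value, levels):
    """Map a category to its 0-based position in a worst-to-best list."""
    return levels.index(value)


# Hypothetical diamond: 0.5 carat, 4.0 mm x 4.0 mm x 2.5 mm.
weight_g = carat_to_grams(0.5)
depth = total_depth_percentage(4.0, 4.0, 2.5)
cut_rank = ordinal_rank("Premium", CUT_LEVELS)
color_rank = ordinal_rank("D", COLOR_LEVELS)
clarity_rank = ordinal_rank("I1", CLARITY_LEVELS)
print(weight_g, depth, cut_rank, color_rank, clarity_rank)
```

Integer ranks like these are one simple way to feed ordered categories into a regression model; treating them as unordered dummy variables is an equally reasonable choice.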
Data Summaries
A preliminary visual summary of the whole dataset shows all the features and their types.