Iowa Housing: Exploration
My latest project (which I started talking about in my last post here) is related to Kaggle's ML competition to predict housing prices given 81 features of homes in Ames, Iowa. Let's jump back in!
I want to visualize some of the data and see what relations exist between the features and our target output, SalePrice.
After taking a look at the feature descriptions, I'm going to generate a few plots to check out:
- OverallQual: Rates the overall material and finish of the house
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- Utilities: Type of utilities available
- GrLivArea: Above grade (ground) living area square feet
- GarageType: Garage location
- GarageArea: Size of garage in square feet
- MoSold: Month Sold
- SaleType: Type of sale (Warranty deed, contract, estate, etc.)
For this project, I used Seaborn (a library that works on top of matplotlib) to generate a few quick plots to give me an idea of what my dataset looks like.
EDIT: After exploring the dataset further during my next phase, I realized I had overlooked a key feature: OverallQual. If we sort SalePrice by OverallQual, we see something quite obvious, but nonetheless telling:
As OverallQual increases, it appears SalePrice does too. There are more ways of testing this hypothesis, which I'll explore in another blog post. For now, the takeaway is that our dataset mostly consists of homes rated between 4 and 8 inclusive, with a slightly right-skewed distribution.
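One quick way to put numbers next to that plot is to count the homes at each rating and take the median SalePrice per rating. This is a minimal sketch, not the code behind the plot above: the helper name price_by_quality is hypothetical, and the rows are toy values invented for illustration (swap in the Kaggle training file to get real figures).

```python
import pandas as pd


def price_by_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Count of homes and median SalePrice for each OverallQual rating."""
    return (
        df.groupby("OverallQual")["SalePrice"]
        .agg(count="count", median="median")
        .sort_index()
    )


# Toy rows, invented for illustration only (NOT from the Kaggle data).
toy = pd.DataFrame({
    "OverallQual": [4, 5, 5, 6, 7, 7, 8],
    "SalePrice": [100000, 120000, 130000, 150000, 190000, 210000, 270000],
})

summary = price_by_quality(toy)
print(summary)
```

If the median column rises steadily with the rating, that supports the trend in the plot; the count column shows how many homes back each rating.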
Next up, I wanted to explore how SalePrice and BldgType are related. I created these plots to see the spread of SalePrice for each BldgType, and the count of each BldgType in our dataset.
(Thank you to this poster for helping me out with the labels on the second plot!)
As you can see, there's an overwhelming number of Single-family Detached homes for our algorithm to train on, but not many of the other types. The spread of SalePrice is also pretty large for all BldgTypes. Hopefully there are other features that will help divvy up the Single-family Detached category!
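A pair of plots like the ones described here can be sketched with Seaborn's boxplot and countplot. This is an illustrative sketch under assumptions, not the code from explorationGraphsPost.py: the function name plot_bldgtype is hypothetical, the rows are toy values, and the count labels are drawn with plain matplotlib text calls, which is just one way to label bars.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_bldgtype(df: pd.DataFrame):
    """Box plot of SalePrice per BldgType next to a labeled count plot."""
    counts = df["BldgType"].value_counts()  # sorted, most common first
    order = list(counts.index)
    fig, (ax_box, ax_count) = plt.subplots(1, 2, figsize=(12, 4))
    sns.boxplot(data=df, x="BldgType", y="SalePrice", order=order, ax=ax_box)
    sns.countplot(data=df, x="BldgType", order=order, ax=ax_count)
    # Categorical bars sit at x = 0, 1, 2, ... so each count goes above its bar.
    for position, n in enumerate(counts):
        ax_count.text(position, n, str(n), ha="center", va="bottom")
    return fig


# Toy rows, invented for illustration only (NOT from the Kaggle data).
toy = pd.DataFrame({
    "BldgType": ["1Fam", "1Fam", "1Fam", "TwnhsE", "TwnhsE", "Duplex"],
    "SalePrice": [150000, 180000, 240000, 160000, 175000, 130000],
})
fig = plot_bldgtype(toy)
fig.savefig("bldgtype.png")
```

Passing the same order to both plots keeps the categories lined up, so the count under each box is easy to read off.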
Here are some more graphs I generated from the data - there are interesting things happening with GarageArea, GarageType, and SalePrice:
If you squint and ignore the strong underfitting, it appears that the price for each GarageType may scale differently with GarageArea. This could be useful for our algorithm to pick up on, and may also be explained by each GarageType being associated with a different HouseStyle.
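To go beyond squinting, one rough check is to fit a separate straight line per GarageType and compare the slopes (dollars per extra square foot of garage). The sketch below assumes a simple least-squares line via numpy's polyfit; the helper slope_per_group is hypothetical and the rows are toy values, so the printed slopes say nothing about the real data.

```python
import numpy as np
import pandas as pd


def slope_per_group(df, group_col, x_col, y_col="SalePrice"):
    """Least-squares slope of y_col on x_col, fitted separately per group."""
    slopes = {}
    for name, grp in df.groupby(group_col):
        if grp[x_col].nunique() < 2:
            continue  # a line cannot be fitted through a single x value
        x = grp[x_col].to_numpy(dtype=float)
        y = grp[y_col].to_numpy(dtype=float)
        slope, _intercept = np.polyfit(x, y, deg=1)
        slopes[name] = slope
    return pd.Series(slopes, name=f"{y_col}_per_{x_col}")


# Toy rows, invented for illustration only (NOT from the Kaggle data).
toy = pd.DataFrame({
    "GarageType": ["Attchd"] * 3 + ["Detchd"] * 3,
    "GarageArea": [200, 400, 600, 200, 400, 600],
    "SalePrice": [100000, 150000, 200000, 90000, 110000, 130000],
})
slopes = slope_per_group(toy, "GarageType", "GarageArea")
print(slopes)
```

Clearly different slopes across groups would back up the idea that price scales differently per GarageType, though a straight line is a crude model and small groups will give noisy estimates.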
Also, check out what happens when we look at GrLivArea (above-ground living area in sq. feet) vs SalePrice when grouped by HouseStyle. Some styles are well represented in our dataset, others not so much, but we do see a different sliding price scale for each HouseStyle.
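A plot along these lines can be sketched with Seaborn's lmplot, which draws a scatter plus one regression line per hue level. This is an assumption about how such a graph could be produced, not the post's actual plotting code; the rows are toy values and the axis label text is my own wording.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop in a notebook
import pandas as pd
import seaborn as sns

# Toy rows, invented for illustration only (NOT from the Kaggle data).
toy = pd.DataFrame({
    "GrLivArea": [900, 1200, 1500, 1000, 1600, 2200],
    "SalePrice": [110000, 135000, 160000, 120000, 190000, 260000],
    "HouseStyle": ["1Story"] * 3 + ["2Story"] * 3,
})

# One scatter and one fitted line per HouseStyle; ci=None skips the
# bootstrapped confidence band, which is meaningless on a handful of points.
grid = sns.lmplot(data=toy, x="GrLivArea", y="SalePrice", hue="HouseStyle", ci=None)
grid.set_axis_labels("Above-ground living area (sq. ft.)", "Sale price ($)")
grid.savefig("grlivarea_by_housestyle.png")
```

On the real data, sparsely represented styles will get lines fitted through very few points, so those lines deserve much less trust than the ones for common styles.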
In my exploration phase, other helpful bits of information I found:
- Utilities: only 1 house was listed with no sewer or water hookups; all 1459 other homes have all public utilities (not very helpful for dividing up our dataset into meaningful sub-trees).
- Street: only 6 homes are listed with gravel road access to the property, but this appears to greatly decrease the average SalePrice by around $50,000 (although this could be due to the extremely small comparative sample size).
- MoSold: more home sales happen in the summer months than in the winter (typical of the market in the Northern hemisphere, no one wants to move when it's 40º below freezing), but it doesn't look like there's a strong relationship between MoSold and SalePrice.
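Findings like the Utilities and Street ones can be spotted quickly by asking what share of rows falls into each column's most common value. The sketch below is one possible way to do that with pandas; the helper name dominant_share is hypothetical and the rows are toy values, not the real dataset.

```python
import pandas as pd


def dominant_share(df: pd.DataFrame, columns) -> pd.Series:
    """Fraction of rows taken by the most common value of each column."""
    shares = {
        col: df[col].value_counts(normalize=True).iloc[0]
        for col in columns
    }
    return pd.Series(shares).sort_values(ascending=False)


# Toy rows, invented for illustration only (NOT from the Kaggle data).
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 7 + ["NoSeWa"],
    "Street": ["Pave"] * 6 + ["Grvl"] * 2,
    "MoSold": [6, 7, 6, 1, 5, 7, 6, 12],
})
report = dominant_share(toy, ["Utilities", "Street", "MoSold"])
print(report)
```

A share close to 1.0 means the column is nearly constant, so a tree has very few rows to split off on it.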
We can also see that the type of sale (SaleType) can put a home in a different price bracket, which is good for our algorithm to recognize as well!
SaleType: Type of sale
    WD      Warranty Deed - Conventional
    CWD     Warranty Deed - Cash
    VWD     Warranty Deed - VA Loan
    New     Home just constructed and sold
    COD     Court Officer Deed/Estate
    Con     Contract 15% Down payment regular terms
    ConLw   Contract Low Down payment and low interest
    ConLI   Contract Low Interest
    ConLD   Contract Low Down
    Oth     Other
Also good to note are the top 3 SaleType values:
- Conventional Warranty Deed - 1267
- New - 122
- Court Officer Deed / Estate - 43
This means the number of homes in each of the remaining SaleType categories is quite small in this dataset, which may also affect how accurate our algorithm is.
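To make "quite small" concrete, here is the arithmetic using only numbers already quoted in this post: the three counts above, and a total of 1460 homes taken from the Utilities note (1 home plus 1459 others). The variable names are just for illustration.

```python
# Counts quoted in the post for the three most common SaleType values.
top_three = {"WD": 1267, "New": 122, "COD": 43}

# Total homes, from the Utilities note: 1 home + 1459 other homes.
total_homes = 1 + 1459

covered = sum(top_three.values())        # homes in the top three categories
remaining = total_homes - covered        # homes spread over all other categories
remaining_share = remaining / total_homes

print(f"top three cover {covered} of {total_homes} homes")
print(f"{remaining} homes ({remaining_share:.1%}) are left for every other SaleType")
```

That leaves only 28 homes, under 2% of the dataset, shared between all of the other sale types.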
Now that I've had a good look at the dataset, I'm going to start reading this blog post to give me a better idea of how to debug the Random Forest algorithm, and how I can interpret my results.
If you would like any of the code I used to generate my graphs, I've posted it on my github under explorationGraphsPost.py.
Thanks for reading! I'll leave you with a few pictures of my cats. This time, unrelated to RF or any ML algorithm - unless one comes out soon that deals with bread and/or toast.


Until next time!