Iowa Housing: Cleaning House
In my last post about the Iowa housing data set, I spent some time working with a specific logical flow for filling categorical NA values - you can find that post here. In the post before that, I visualized some features using a lot of box / violin plots, count plots, and some linear regression scatter plots comparing the distribution of SalePrice when sorted by specific features. If you would like a refresher, check that post out here.
In this post, I’m going to walk through my process of cleaning the data in order to retain as much information as possible. Thanks to Kaggle users Ertuğrul Demir and Golden for posting their in-depth notebooks. Both contain a wealth of information and provide a great starting base for other novice data analysts.
Examining the Missing Data
By now, I’m quite familiar with this data set - but it won’t hurt us to generate a table of values specifically to highlight our missing data!
The first thing I want to do is examine the null / NA values in both the train and test datasets. To do this, we'll have three datasets: `train_data`, `test_data`, and `merged_data`. The `merged_data` dataset has all of our samples from both the train and test datasets without the `SalePrice` feature, which will make it easier for us to look at the aggregated data.
```python
import pandas as pd

# create feature dataframes, dropping
# SalePrice from train_features
train_features = train_data.drop(['SalePrice'], axis=1)
test_features = test_data

# merge the two datasets for analysis, and reset the index
merged_data = pd.concat(
    [train_features, test_features]).reset_index(drop=True)

datasets = [train_data, test_data, merged_data]
```
Having an array of datasets will come in handy later when we need to access each dataset to make any adjustments.
Using `merged_data`, I created a summary table with three purposes:
(1) Calculate how many samples have NAs for each feature,
(2) Describe the data type of each feature,
(3) Give a list of unique values for each categorical feature, and a min / max / mean / median summary for each numerical feature.
This is easily visualized in the table below:
Feature | NA Count | Fraction Missing | Data Type | Overview |
---|---|---|---|---|
PoolQC | 2909 | 0.99657 | object | [nan, Ex, Fa, Gd] |
MiscFeature | 2814 | 0.96403 | object | [nan, Shed, Gar2, Othr, TenC] |
Alley | 2721 | 0.93217 | object | [nan, Grvl, Pave] |
Fence | 2348 | 0.80439 | object | [nan, MnPrv, GdWo, GdPrv, MnWw] |
FireplaceQu | 1420 | 0.48647 | object | [nan, TA, Gd, Fa, Ex, Po] |
LotFrontage | 486 | 0.16650 | float64 | [21, 313, 69.0, 68] |
... | ... | ... | ... | ... |
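If you'd like to reproduce a summary table like this one, here's a minimal sketch of the idea - the `summarize` helper and its formatting are my own, not the exact script used for these tables:

```python
import pandas as pd

def summarize(df):
    """Build a per-feature summary of NA counts, dtypes,
    and a short overview of each feature's values."""
    rows = []
    for col in df.columns:
        na_count = df[col].isna().sum()
        if df[col].dtype == 'object':
            # first few unique values for categorical features
            overview = list(df[col].unique())[:5]
        else:
            # min / max / mean / median for numerical features
            overview = [df[col].min(), df[col].max(),
                        df[col].mean(), df[col].median()]
        rows.append({'Feature': col,
                     'NA Count': na_count,
                     'Fraction Missing': na_count / len(df),
                     'Data Type': df[col].dtype,
                     'Overview': overview})
    return pd.DataFrame(rows).sort_values(
        'NA Count', ascending=False)

summarize(merged_data).head(10)
```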
We’ve got our work laid out, time to start cleaning!
Filling the Missing Data
Categorical Features with “NA / nan” which mean “None”
For many of the categorical features, `NA` is used if the sample doesn't contain the given feature. For these samples, we will change the feature value to `None` instead.
```python
none_cols = [
    'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
    'GarageFinish', 'GarageQual', 'GarageCond', 'GarageType',
    'BsmtCond', 'BsmtExposure', 'BsmtQual',
    'BsmtFinType1', 'BsmtFinType2', 'MasVnrType'
]

# replace NA with the string "None" in every dataset
for col in none_cols:
    for ds in datasets:
        ds[col] = ds[col].fillna("None")
```
Numerical features with “NA / nan” which mean 0
These numerical features just need 0 instead of `NA`, since for these samples the feature doesn't exist.
```python
zero_cols = [
    'GarageYrBlt', 'MasVnrArea', 'BsmtFullBath', 'BsmtHalfBath',
    'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF'
]

# note: fillna(0, inplace=True) returns None, so don't
# combine inplace=True with an assignment back to ds[col]
for col in zero_cols:
    for ds in datasets:
        ds[col] = ds[col].fillna(0)
```
Categorical features with “NA / nan” which need examination
These are categorical features which are each missing from only 1-4 samples.
MSZoning / Exterior1st / Exterior2nd
Filling `MSZoning` is the subject of another post, which can be found here!
Since the distribution of values for `MSZoning`, `Exterior1st`, and `Exterior2nd` appears to depend on which `Neighborhood` the house is located in, we need to group our samples by `Neighborhood`, then use the mode of the given feature to fill in the NAs.
There is only 1 sample that is missing both the `Exterior1st` and `Exterior2nd` features, and it is in the Edwards `Neighborhood`:
```python
mask1 = merged_data['Exterior1st'].isna()
mask2 = merged_data['Exterior2nd'].isna()

# note the double brackets - we're selecting a list of columns
merged_data[mask1 & mask2][
    ['ExterQual', 'ExterCond', 'Neighborhood']]
```

```
      EQ  EC  Neighborhood
2151  TA  TA       Edwards
```
To make an accurate prediction, we can examine other houses in the Edwards `Neighborhood` with identical `ExterQual` and `ExterCond` values to see what the distribution looks like:
```python
mask3 = merged_data['Neighborhood'] == 'Edwards'
mask4 = merged_data['ExterQual'] == 'TA'
mask5 = merged_data['ExterCond'] == 'TA'

merged_data[mask3 & mask4 & mask5][
    ['Exterior1st', 'Exterior2nd']].value_counts()
```
```
Exterior1st  Exterior2nd
Wd Sdng      Wd Sdng        36
MetalSd      MetalSd        28
VinylSd      VinylSd        22
Plywood      Plywood        12
HdBoard      HdBoard         9
WdShing      Wd Shng         9
HdBoard      Plywood         5
Stucco       Stucco          4
WdShing      Plywood         4
BrkFace      Wd Sdng         3
AsbShng      Plywood         2
             AsbShng         2
Plywood      Wd Shng         1
VinylSd      Wd Sdng         1
MetalSd      Wd Sdng         1
             HdBoard         1
Wd Sdng      Plywood         1
BrkFace      Plywood         1
BrkComm      Brk Cmn         1
AsphShn      AsphShn         1
```
From the above result, we can see Wd Sdng is the most common `Exterior1st` and `Exterior2nd` value for homes in the Edwards `Neighborhood` with TA for both `ExterQual` and `ExterCond`. This is the value we want assigned to both `Exterior1st` and `Exterior2nd` for the sample with the missing values.
```python
nbh_mode_list = ['MSZoning', 'Exterior1st', 'Exterior2nd']

# fill each feature with the mode of that feature
# within the sample's Neighborhood
for ds in datasets:
    for col in nbh_mode_list:
        ds[col] = ds[col].fillna(
            ds.groupby('Neighborhood')[col].transform(
                lambda x: x.mode().iloc[0]))
```
Functional
It appears `Functional` measures how many safety deductions the house has / how much overall damage there is. I started this investigation by examining 4 features: `OverallCond`, `BsmtCond`, `ExterCond`, and `GarageCond`. These features could give us an idea of any damage to the home, and whether `Functional` is affected.
```python
condFeatures = [
    'OverallCond', 'BsmtCond',
    'ExterCond', 'GarageCond']
```
First, only 2 samples from the test set are missing this feature:
```python
mask4 = merged_data['Functional'].isna()

merged_data[mask4][['Id', 'Functional'] + condFeatures]
```

```
  Id  Func  OvC    BC  EC  GC
2217   NaN    5  None  Po  Po
2474   NaN    1    Fa  Fa  Fa
```
Something a little funky to notice here: the house with `Id` 2217 has no basement, poor exterior condition, and poor garage condition, yet it still has an overall condition score of 5, while the house with `Id` 2474, which has a fair basement, fair exterior condition, and fair garage condition, only has an overall condition rating of 1. This doesn't really make sense under my hypothesis that a higher condition rating correlates with a higher functionality rating, so we had better keep investigating.
We need to get the data looking like something we can numerically manipulate. This will take two steps:
(1) Pop out all of these features into a separate, deep-copied dataframe, so none of my fiddling will affect the underlying data.
```python
mask5 = merged_data['Functional'].isna()

condfeatdf = merged_data[~mask5][
    ['Id', 'Functional'] + condFeatures].copy(deep=True)
```
```
Id  Func  OC  BC  EC  GC
 1  Typ    5  TA  TA  TA
 2  Typ    8  TA  TA  TA
 3  Typ    5  TA  TA  TA
 4  Typ    5  Gd  TA  TA
 5  Typ    5  TA  TA  TA
..  ...   ..  ..  ..  ..
```
(2) Encode these variables as integers instead of categorical variables, despite my lingering doubts about the linearity of the condition scale (we don't know the exact difference between a 'Po' house and a 'Fa' house), so we can actually perform a correlation test between all of the selected features.
```python
cond_map = {
    'None': 0, 'Po': 1, 'Fa': 2,
    'TA': 3, 'Gd': 4, 'Ex': 5
}
func_map = {
    'Sal': 0, 'Sev': 1, 'Maj2': 2, 'Maj1': 3,
    'Mod': 4, 'Min2': 5, 'Min1': 6, 'Typ': 7
}

# skip OverallCond (already numeric), and map the rest
for feat in condFeatures[1:]:
    condfeatdf[feat] = condfeatdf[feat].map(
        cond_map).astype('int')

condfeatdf['Functional'] = condfeatdf['Functional'].map(
    func_map).astype('int')
```
```
Id  Func  OC  BC  EC  GC
 1     7   5   3   3   3
 2     7   8   3   3   3
 3     7   5   3   3   3
 4     7   5   4   3   3
 5     7   5   3   3   3
```
That looks better! Now we can attempt some numerical manipulations:
```python
condfeatdf.corr()[1:][
    ['Functional', 'OverallCond', 'BsmtCond',
     'ExterCond', 'GarageCond']]
```

```
         OC     BC     EC     GC
Func  0.118  0.190  0.074  0.090
OC           0.090  0.403  0.045
BC                  0.096  0.140
EC                         0.093
```
This doesn't bode well for my hypothesis at all - if we consider the condition scales linear, then the feature with the highest correlation to `Functional` is `BsmtCond`, with a value of only 0.19.
Another issue is that many homes have a `BsmtCond` value of 0 under the new mapping, and it is not at all obvious that a house with a `BsmtCond` of 0 (no basement) is worse off than a home with a `BsmtCond` of 1 (a poor basement). There are a lot of assumptions here, and frankly, for only 2 missing values, this entire expedition may have been overkill. Ultimately, we will fill the two `NA` values with the mode, and fondly remember how to use `map` for the next time we need to change a categorical variable into a numerical one. Good thing I made a deep copy to play around with instead!
```python
for ds in datasets:
    ds['Functional'] = ds['Functional'].fillna(
        ds['Functional'].mode()[0])
```
Electrical / KitchenQual / SaleType / Utilities
For these variables, there are only 1-2 samples missing each feature, so we will fill them with the mode of the feature from the dataset.
```python
for col in ('Electrical', 'KitchenQual',
            'SaleType', 'Utilities'):
    for ds in datasets:
        ds[col] = ds[col].fillna(ds[col].mode()[0])
```
Numerical features with “NA / nan” which need examination
There's a little more happening here than appears at surface level, so let's take a deeper dive into these features:
LotFrontage
This is one of the features for which a significant number of samples (486 / 2919) are missing values. Lot frontage is defined as "linear feet of street connected to property." We can certainly draw the conclusion that `LotArea` might be correlated with `LotFrontage`, since one is used to calculate the other, but there are a few other features that can help us interpolate.
First, let's see the correlation between `LotArea` and `LotFrontage`.

```python
merged_data[['LotArea', 'LotFrontage']].corr()
```

```
     LotArea  LotFrontage
LA  1.000000     0.489896
LF  0.489896     1.000000
```
0.49 isn't a very strong correlation value - there isn't a clear linear relationship between our two features. However, we're going to examine another feature, `LotConfig`, which can tell us more about the relationship between `LotFrontage` and `LotArea`.
```
LotConfig: Lot configuration

    Inside   Inside lot
    Corner   Corner lot
    CulDSac  Cul-de-sac
    FR2      Frontage on 2 sides of property
    FR3      Frontage on 3 sides of property
```
Considering how `Corner` lots may have twice as much `LotFrontage` as `Inside` lots, and `FR2` and `FR3` lots should have comparatively more as well, sorting by `LotConfig` first might help us calculate a more accurate prediction.
First, we can examine the distribution of the `LotConfig` feature for the samples which are missing `LotFrontage`:
```python
mask6 = merged_data['LotFrontage'].isna()

merged_data[mask6]['LotConfig'].value_counts()
```

```
Inside     271
Corner     104
CulDSac     87
FR2         20
FR3          4
```
~77% (375/486) of our samples with a missing `LotFrontage` value have either Inside or Corner as a `LotConfig`. To get a better idea of the correlation between these variables, we can look at how a simple linear model lines up on subplots of each configuration:
Important to notice - I limited the plots to `LotArea <= 80000`, since there is one outlier at over 200,000 square feet which skews all of the figures off to the right (200 when scaled, as seen here).
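The original figures aren't reproduced here, but a minimal sketch of how subplots along these lines could be generated (assuming seaborn; the styling choices are mine, not the originals):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# one regression subplot per LotConfig value,
# with the LotArea outlier excluded
plot_data = merged_data[merged_data['LotArea'] <= 80000]
sns.lmplot(data=plot_data, x='LotArea', y='LotFrontage',
           col='LotConfig', col_wrap=3,
           scatter_kws={'alpha': 0.3})
plt.show()
```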
You can also see there is an issue with `CulDSac` properties - the linear model does not fit the data at all. Below, you can see this reflected in the correlations between `LotArea` and `LotFrontage` when grouped by `LotConfig`.
```python
mask6 = merged_data['LotArea'] <= 80000

merged_data[mask6].groupby('LotConfig')[
    ['LotArea', 'LotFrontage']].corr()
```

```
                    LA        LF
LotConfig
Inside    LA  1.000000  0.630001
          LF  0.630001  1.000000
Corner    LA  1.000000  0.787955
          LF  0.787955  1.000000
CulDSac   LA  1.000000  0.195327
          LF  0.195327  1.000000
FR2       LA  1.000000  0.827626
          LF  0.827626  1.000000
FR3       LA  1.000000  0.835891
          LF  0.835891  1.000000
```
Comparing this to our previous correlation calculation, there is a strict improvement in correlation for every `LotConfig` value except `CulDSac`, which could be due to the irregular shape of `CulDSac` lots and their disproportionately small `LotFrontage` measure. I predict that, even with the poor predictions for `CulDSac` lots (which make up 87 of our 486 missing samples), we will have success using the linear relationship between `LotArea` and `LotFrontage`, grouped by `LotConfig`, to fill in our missing `LotFrontage` values.
To do this, I will separate `merged_data` into 5 subsets (one for each possible value of `LotConfig`), remove the `NA`s from each subset, and then use the resulting subset to train a linear model.
The code used to do this is much easier to view on github, since I used a custom function which, when applied to the data, replaces an `NA` `LotFrontage` value with a predicted one.
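For reference, here is a minimal sketch of the same idea - the `fill_lot_frontage` name and structure are mine (assuming scikit-learn's `LinearRegression`), not the exact function from the repo:

```python
from sklearn.linear_model import LinearRegression

def fill_lot_frontage(df):
    """Predict missing LotFrontage from LotArea,
    fitting one linear model per LotConfig value."""
    df = df.copy()
    for config in df['LotConfig'].unique():
        subset = df[df['LotConfig'] == config]
        known = subset[subset['LotFrontage'].notna()]
        missing = subset[subset['LotFrontage'].isna()]
        if missing.empty:
            continue
        # fit on the samples with LotFrontage present,
        # then predict the missing ones
        model = LinearRegression()
        model.fit(known[['LotArea']], known['LotFrontage'])
        df.loc[missing.index, 'LotFrontage'] = model.predict(
            missing[['LotArea']])
    return df

merged_data = fill_lot_frontage(merged_data)
```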
To see how this affected our data, we can compare the previous graphs to updated ones - we should be able to see new data points running along our previously displayed line of best fit for each `LotConfig`.
It is not perfect, but it does let us keep `LotFrontage` as a feature, and this might help boost the performance of our algorithm. Another way we can visualize how the data was changed is by using `.describe()` on `merged_data['LotFrontage']` before and after filling.
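One way to line those up side by side (a minimal sketch; the `prefill` / `postfill` variable names are hypothetical):

```python
# snapshot the distribution before and after the fill
prefill = merged_data['LotFrontage'].describe()   # run before filling
# ... apply the LotFrontage fill here ...
postfill = merged_data['LotFrontage'].describe()  # run after filling

print(pd.concat([prefill, postfill], axis=1,
                keys=['Prefill', 'Postfill']))
```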
```
        Prefill  Postfill
count   2433.00   2919.00
mean      69.30     70.17
std       23.34     27.03
min       21.00     21.00
25%       59.00     58.32
50%       68.00     68.00
75%       80.00     80.00
max      313.00    806.26
```
GarageArea / GarageCars
First, we should confirm that the single sample missing both `GarageArea` and `GarageCars` has a `GarageType` other than `None` - otherwise, we would simply fill these in with `0`.
```python
merged_data[merged_data['GarageCars'].isna()][
    ['GarageType', 'GarageArea', 'GarageCars']]
```

```
       GType  GArea  GCars
2576  Detchd    NaN    NaN
```
Now, we can fill these based on the average `GarageArea` of `Detchd` garages, and the most common value for `GarageCars` of `Detchd` garages.
```python
for ds in datasets:
    # most common GarageCars value within each GarageType
    ds['GarageCars'] = ds['GarageCars'].fillna(
        ds.groupby('GarageType')['GarageCars'].transform(
            lambda x: x.mode().iloc[0]))
    # mean GarageArea within each GarageType - mean() returns
    # a scalar per group, so no .iloc is needed
    ds['GarageArea'] = ds['GarageArea'].fillna(
        ds.groupby('GarageType')['GarageArea'].transform('mean'))
```
Reviewing the Cleaned Data
We can now run the same script to generate our summary of the data - we should see there are no more `NA` values!
Feature | NA Count | Fraction Missing | Data Type | Overview |
---|---|---|---|---|
Id | 0 | 0.0 | int64 | [1, 2919, 1460, 1460] |
MSSubClass | 0 | 0.0 | int64 | [20, 190, 57, 50] |
MSZoning | 0 | 0.0 | object | [RL, RM, C (all), FV, RH] |
LotFrontage | 0 | 0.0 | float64 | [21, 806, 70, 68] |
... | ... | ... | ... | ... |
No more NAs!!
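As one last sanity check, we can also confirm this programmatically (a one-line sketch):

```python
# total count of NA values across the merged dataset - should print 0
print(merged_data.isna().sum().sum())
```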
This has been a beast of a post, both in length and content! Now that we have a squeaky clean dataset, the next step will be some feature engineering, and possibly some exploration into feature selection using a Random Forest Regressor.
Thank you for reading, until next time!