Iowa Housing: Cleaning House
In my last post about the Iowa housing data set, I spent some time working with a specific logical flow for filling categorical NA values - you can find that post here. In the post before that, I visualized some features using a lot of box / violin plots, count plots, and some linear regression scatter plots comparing the distribution of SalePrice when sorted by specific features. If you would like a refresher, check that post out here.
In this post, I’m going to walk through my process of cleaning the data in order to retain as much information as possible. Thanks to Kaggle users Ertuğrul Demir and Golden for posting their in-depth notebooks. Both contain a wealth of information and provide a great starting base for other novice data analysts.
Examining the Missing Data
By now, I’m quite familiar with this data set - but it won’t hurt us to generate a table of values specifically to highlight our missing data!
The first thing I want to do is examine the null / NA values in both the train and test datasets. To do this, we'll have three datasets: `train_data`, `test_data`, and `merged_data`. The `merged_data` dataset has all of our samples from both the train and test datasets without the `SalePrice` feature, which will make it easier for us to look at the aggregated data.
```python
import pandas as pd

# create feature dataframes, dropping
# SalePrice from train_features
train_features = train_data.drop(['SalePrice'], axis=1)
test_features = test_data

# merge the two datasets for analysis, and reset the index
merged_data = pd.concat(
    [train_features, test_features]).reset_index(drop=True)

datasets = [train_data, test_data, merged_data]
```
Having an array of datasets will come in handy later when we need to access each dataset to make any adjustments.
Using `merged_data`, I created a summary table with three purposes:
(1) Calculate how many samples have NAs for each feature,
(2) Describe the data type of each feature,
(3) Give a list of unique values for each categorical feature, and a min / max / mean / median summary for each numerical feature.
This is easily visualized in the table below:
Feature | NA Count | Fraction Missing | Data Type | Overview |
---|---|---|---|---|
PoolQC | 2909 | 0.99657 | object | [nan, Ex, Fa, Gd] |
MiscFeature | 2814 | 0.96403 | object | [nan, Shed, Gar2, Othr, TenC] |
Alley | 2721 | 0.93217 | object | [nan, Grvl, Pave] |
Fence | 2348 | 0.80439 | object | [nan, MnPrv, GdWo, GdPrv, MnWw] |
FireplaceQu | 1420 | 0.48647 | object | [nan, TA, Gd, Fa, Ex, Po] |
LotFrontage | 486 | 0.16650 | float64 | [21, 313, 69.0, 68] |
... | ... | ... | ... | ... |
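If you'd like to reproduce a summary table like this one, here's a minimal sketch of the idea - the `summarize` helper and its formatting are my own, not the exact script used for these tables:

```python
import pandas as pd

def summarize(df):
    """Build a per-feature summary of NA counts, dtypes,
    and a short overview of each feature's values."""
    rows = []
    for col in df.columns:
        na_count = df[col].isna().sum()
        if df[col].dtype == 'object':
            # first few unique values for categorical features
            overview = list(df[col].unique())[:5]
        else:
            # min / max / mean / median for numerical features
            overview = [df[col].min(), df[col].max(),
                        df[col].mean(), df[col].median()]
        rows.append({'Feature': col,
                     'NA Count': na_count,
                     'Fraction Missing': na_count / len(df),
                     'Data Type': df[col].dtype,
                     'Overview': overview})
    return pd.DataFrame(rows).sort_values(
        'NA Count', ascending=False)

summarize(merged_data).head(10)
```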
We’ve got our work laid out, time to start cleaning!
Filling the Missing Data
Categorical Features with “NA / nan” which mean “None”
For many of the categorical features, `NA` is used if the sample doesn't contain the given feature. For these samples, we will change the feature value to `None` instead.
```python
none_cols = [
    'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
    'GarageFinish', 'GarageQual', 'GarageCond', 'GarageType',
    'BsmtCond', 'BsmtExposure', 'BsmtQual',
    'BsmtFinType1', 'BsmtFinType2', 'MasVnrType'
]

# replace NA with the string "None" in every dataset
for col in none_cols:
    for ds in datasets:
        ds[col] = ds[col].fillna("None")
```
Numerical features with “NA / nan” which mean 0
These numerical features just need 0 instead of `NA`, since for these samples the feature doesn't exist.
```python
zero_cols = [
    'GarageYrBlt', 'MasVnrArea', 'BsmtFullBath', 'BsmtHalfBath',
    'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF'
]

# note: fillna(0, inplace=True) returns None, so don't
# combine inplace=True with an assignment back to ds[col]
for col in zero_cols:
    for ds in datasets:
        ds[col] = ds[col].fillna(0)
```
Categorical features with “NA / nan” which need examination
These are categorical features which are each missing from only 1-4 samples.
MSZoning / Exterior1st / Exterior2nd
Filling `MSZoning` is the subject of another post, which can be found here!
Since the distribution of values for `MSZoning`, `Exterior1st`, and `Exterior2nd` appears to depend on which `Neighborhood` the house is located in, we need to group our samples by `Neighborhood`, then use the mode of the given feature to fill in the NAs.
There is only 1 sample that is missing both the `Exterior1st` and `Exterior2nd` features, and it is in the Edwards `Neighborhood`:
```python
mask1 = merged_data['Exterior1st'].isna()
mask2 = merged_data['Exterior2nd'].isna()

# note the double brackets - we're selecting a list of columns
merged_data[mask1 & mask2][
    ['ExterQual', 'ExterCond', 'Neighborhood']]
```

```
      EQ  EC  Neighborhood
2151  TA  TA       Edwards
```
To make an accurate prediction, we can examine other houses in the Edwards `Neighborhood` with identical `ExterQual` and `ExterCond` values to see what the distribution looks like:
```python
mask3 = merged_data['Neighborhood'] == 'Edwards'
mask4 = merged_data['ExterQual'] == 'TA'
mask5 = merged_data['ExterCond'] == 'TA'

merged_data[mask3 & mask4 & mask5][
    ['Exterior1st', 'Exterior2nd']].value_counts()
```
```
Exterior1st  Exterior2nd
Wd Sdng      Wd Sdng        36
MetalSd      MetalSd        28
VinylSd      VinylSd        22
Plywood      Plywood        12
HdBoard      HdBoard         9
WdShing      Wd Shng         9
HdBoard      Plywood         5
Stucco       Stucco          4
WdShing      Plywood         4
BrkFace      Wd Sdng         3
AsbShng      Plywood         2
             AsbShng         2
Plywood      Wd Shng         1
VinylSd      Wd Sdng         1
MetalSd      Wd Sdng         1
             HdBoard         1
Wd Sdng      Plywood         1
BrkFace      Plywood         1
BrkComm      Brk Cmn         1
AsphShn      AsphShn         1
```
From the above result, we can see Wd Sdng is the most common `Exterior1st` and `Exterior2nd` value for homes in the Edwards `Neighborhood` with TA for both `ExterQual` and `ExterCond`. This is the value we want assigned to both `Exterior1st` and `Exterior2nd` for the sample with the missing values.
```python
nbh_mode_list = ['MSZoning', 'Exterior1st', 'Exterior2nd']

# fill each feature with the mode of that feature
# within the sample's Neighborhood
for ds in datasets:
    for col in nbh_mode_list:
        ds[col] = ds[col].fillna(
            ds.groupby('Neighborhood')[col].transform(
                lambda x: x.mode().iloc[0]))
```
Functional
It appears `Functional` measures how many safety deductions the house has / how much overall damage there is. I started this investigation by examining 4 features: `OverallCond`, `BsmtCond`, `ExterCond`, and `GarageCond`. These features could give us an idea of any damage to the home, and whether `Functional` is affected.
```python
condFeatures = [
    'OverallCond', 'BsmtCond',
    'ExterCond', 'GarageCond']
```
First, only 2 samples from the test set are missing this feature:
```python
mask4 = merged_data['Functional'].isna()

merged_data[mask4][['Id', 'Functional'] + condFeatures]
```

```
  Id  Func  OvC    BC  EC  GC
2217   NaN    5  None  Po  Po
2474   NaN    1    Fa  Fa  Fa
```
Something a little funky to notice here: the house with `Id` 2217 has no basement, poor exterior condition, and poor garage condition, yet it still has an overall condition score of 5, while the house with `Id` 2474, which has a fair basement, fair exterior condition, and fair garage condition, only has an overall condition rating of 1. This doesn't really make sense under my hypothesis that a higher condition rating correlates with a higher functionality rating, so we had better keep investigating.
We need to get the data looking like something we can numerically manipulate. This will take two steps:
(1) Pop out all of these features into a separate, deep-copied dataframe, so none of my fiddling will affect the underlying data.
```python
mask5 = merged_data['Functional'].isna()

condfeatdf = merged_data[~mask5][
    ['Id', 'Functional'] + condFeatures].copy(deep=True)
```
```
Id  Func  OC  BC  EC  GC
 1  Typ    5  TA  TA  TA
 2  Typ    8  TA  TA  TA
 3  Typ    5  TA  TA  TA
 4  Typ    5  Gd  TA  TA
 5  Typ    5  TA  TA  TA
..  ...   ..  ..  ..  ..
```
(2) Encode these variables as integers instead of categorical variables, despite my lingering doubts about the linearity of the condition scale (we don't know the exact difference between a 'Po' house and a 'Fa' house), so we can actually perform a correlation test between all of the selected features.
```python
cond_map = {
    'None': 0, 'Po': 1, 'Fa': 2,
    'TA': 3, 'Gd': 4, 'Ex': 5
}
func_map = {
    'Sal': 0, 'Sev': 1, 'Maj2': 2, 'Maj1': 3,
    'Mod': 4, 'Min2': 5, 'Min1': 6, 'Typ': 7
}

# skip OverallCond (already numeric), and map the rest
for feat in condFeatures[1:]:
    condfeatdf[feat] = condfeatdf[feat].map(
        cond_map).astype('int')

condfeatdf['Functional'] = condfeatdf['Functional'].map(
    func_map).astype('int')
```
```
Id  Func  OC  BC  EC  GC
 1     7   5   3   3   3
 2     7   8   3   3   3
 3     7   5   3   3   3
 4     7   5   4   3   3
 5     7   5   3   3   3
```
That looks better! Now we can attempt some numerical manipulations:
```python
condfeatdf.corr()[1:][
    ['Functional', 'OverallCond', 'BsmtCond',
     'ExterCond', 'GarageCond']]
```

```
         OC     BC     EC     GC
Func  0.118  0.190  0.074  0.090
OC           0.090  0.403  0.045
BC                  0.096  0.140
EC                         0.093
```
This doesn't bode well for my hypothesis at all - if we consider the condition scales linear, then the feature with the highest correlation to `Functional` is `BsmtCond`, with a value of only 0.19.
Another issue is that many homes have a `BsmtCond` value of 0 under the new mapping, and it is not at all obvious that a house with a `BsmtCond` of 0 (no basement) is worse off than a home with a `BsmtCond` of 1 (a poor basement). There are a lot of assumptions here, and frankly, for only 2 missing values, this entire expedition may have been overkill. Ultimately, we will fill the two `NA` values with the mode, and fondly remember how to use `map` for the next time we need to change a categorical variable into a numerical one. Good thing I made a deep copy to play around with instead!
```python
for ds in datasets:
    ds['Functional'] = ds['Functional'].fillna(
        ds['Functional'].mode()[0])
```
Electrical / KitchenQual / SaleType / Utilities
For these variables, there are only 1-2 samples missing each feature, so we will fill them with the mode of the feature from the dataset.
```python
for col in ('Electrical', 'KitchenQual',
            'SaleType', 'Utilities'):
    for ds in datasets:
        ds[col] = ds[col].fillna(ds[col].mode()[0])
```
Numerical features with “NA / nan” which need examination
There's a little more happening here than appears at surface level, so let's take a deeper dive into these features:
LotFrontage
This is one of the features for which a significant number of samples (486 / 2919) are missing values. Lot frontage is defined as "linear feet of street connected to property." We can certainly draw the conclusion that `LotArea` might be correlated with `LotFrontage`, since one is used to calculate the other, but there are a few other features that can help us interpolate.
First, let's see the correlation between `LotArea` and `LotFrontage`.

```python
merged_data[['LotArea', 'LotFrontage']].corr()
```

```
     LotArea  LotFrontage
LA  1.000000     0.489896
LF  0.489896     1.000000
```
0.49 isn't a very strong correlation value - there isn't a clear linear relationship between our two features. However, we're going to examine another feature, `LotConfig`, which can tell us more about the relationship between `LotFrontage` and `LotArea`.
```
LotConfig: Lot configuration

    Inside   Inside lot
    Corner   Corner lot
    CulDSac  Cul-de-sac
    FR2      Frontage on 2 sides of property
    FR3      Frontage on 3 sides of property
```
Considering how `Corner` lots may have twice as much `LotFrontage` as `Inside` lots, and `FR2` and `FR3` lots should have comparatively more as well, sorting by `LotConfig` first might help us calculate a more accurate prediction.
First, we can examine the distribution of the `LotConfig` feature for the samples which are missing `LotFrontage`:
```python
mask6 = merged_data['LotFrontage'].isna()

merged_data[mask6]['LotConfig'].value_counts()
```

```
Inside     271
Corner     104
CulDSac     87
FR2         20
FR3          4
```
~77% (375/486) of our samples with a missing `LotFrontage` value have either Inside or Corner as a `LotConfig`. To get a better idea of the correlation between these variables, we can look at how a simple linear model lines up on subplots of each configuration:
Important to notice - I limited the plots to `LotArea <= 80000`, since there is one outlier at over 200,000 square feet which skews all of the figures off to the right (200 when scaled, as seen here).
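The original figures aren't reproduced here, but a minimal sketch of how subplots along these lines could be generated (assuming seaborn; the styling choices are mine, not the originals):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# one regression subplot per LotConfig value,
# with the LotArea outlier excluded
plot_data = merged_data[merged_data['LotArea'] <= 80000]
sns.lmplot(data=plot_data, x='LotArea', y='LotFrontage',
           col='LotConfig', col_wrap=3,
           scatter_kws={'alpha': 0.3})
plt.show()
```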
You can also see there is an issue with `CulDSac` properties - the linear model does not fit the data at all. Below, you can see this reflected in the correlations between `LotArea` and `LotFrontage` when grouped by `LotConfig`.
```python
mask6 = merged_data['LotArea'] <= 80000

merged_data[mask6].groupby('LotConfig')[
    ['LotArea', 'LotFrontage']].corr()
```

```
                    LA        LF
LotConfig
Inside    LA  1.000000  0.630001
          LF  0.630001  1.000000
Corner    LA  1.000000  0.787955
          LF  0.787955  1.000000
CulDSac   LA  1.000000  0.195327
          LF  0.195327  1.000000
FR2       LA  1.000000  0.827626
          LF  0.827626  1.000000
FR3       LA  1.000000  0.835891
          LF  0.835891  1.000000
```
Comparing this to our previous correlation calculation, there is a strict improvement in correlation for every `LotConfig` value except `CulDSac`, which could be due to the irregular shape of `CulDSac` lots and their disproportionately small `LotFrontage` measure. I predict that, even with the poor predictions for `CulDSac` lots (which make up 87 of our 486 missing samples), we will have success using the linear relationship between `LotArea` and `LotFrontage`, grouped by `LotConfig`, to fill in our missing `LotFrontage` values.
To do this, I will separate `merged_data` into 5 subsets (one for each possible value of `LotConfig`), remove the `NA`s from each subset, and then use the resulting subset to train a linear model.
The code used to do this is much easier to view on github, since I used a custom function which, when applied to the data, replaces an `NA` `LotFrontage` value with a predicted one.
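For reference, here is a minimal sketch of the same idea - the `fill_lot_frontage` name and structure are mine (assuming scikit-learn's `LinearRegression`), not the exact function from the repo:

```python
from sklearn.linear_model import LinearRegression

def fill_lot_frontage(df):
    """Predict missing LotFrontage from LotArea,
    fitting one linear model per LotConfig value."""
    df = df.copy()
    for config in df['LotConfig'].unique():
        subset = df[df['LotConfig'] == config]
        known = subset[subset['LotFrontage'].notna()]
        missing = subset[subset['LotFrontage'].isna()]
        if missing.empty:
            continue
        # fit on the samples with LotFrontage present,
        # then predict the missing ones
        model = LinearRegression()
        model.fit(known[['LotArea']], known['LotFrontage'])
        df.loc[missing.index, 'LotFrontage'] = model.predict(
            missing[['LotArea']])
    return df

merged_data = fill_lot_frontage(merged_data)
```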
To see how this affected our data, we can compare the previous graphs to updated ones - we should be able to see new data points running along our previously displayed line of best fit for each `LotConfig`.
It is not perfect, but it does let us keep `LotFrontage` as a feature, and this might help boost the performance of our algorithm. Another way we can visualize how the data was changed is by using `.describe()` on `merged_data['LotFrontage']` before and after filling.
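One way to line those up side by side (a minimal sketch; the `prefill` / `postfill` variable names are hypothetical):

```python
# snapshot the distribution before and after the fill
prefill = merged_data['LotFrontage'].describe()   # run before filling
# ... apply the LotFrontage fill here ...
postfill = merged_data['LotFrontage'].describe()  # run after filling

print(pd.concat([prefill, postfill], axis=1,
                keys=['Prefill', 'Postfill']))
```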
```
        Prefill  Postfill
count   2433.00   2919.00
mean      69.30     70.17
std       23.34     27.03
min       21.00     21.00
25%       59.00     58.32
50%       68.00     68.00
75%       80.00     80.00
max      313.00    806.26
```
GarageArea / GarageCars
First, we should confirm that the single sample missing both `GarageArea` and `GarageCars` has a `GarageType` other than `None` - otherwise, we would simply fill these in with `0`.
```python
merged_data[merged_data['GarageCars'].isna()][
    ['GarageType', 'GarageArea', 'GarageCars']]
```

```
       GType  GArea  GCars
2576  Detchd    NaN    NaN
```
Now, we can fill these based on the average `GarageArea` of `Detchd` garages, and the most common value for `GarageCars` of `Detchd` garages.
```python
for ds in datasets:
    # most common GarageCars value within each GarageType
    ds['GarageCars'] = ds['GarageCars'].fillna(
        ds.groupby('GarageType')['GarageCars'].transform(
            lambda x: x.mode().iloc[0]))
    # mean GarageArea within each GarageType - mean() returns
    # a scalar per group, so no .iloc is needed
    ds['GarageArea'] = ds['GarageArea'].fillna(
        ds.groupby('GarageType')['GarageArea'].transform('mean'))
```
Reviewing the Cleaned Data
We can now run the same script to generate our summary of the data - we should see there are no more `NA` values!
Feature | NA Count | Fraction Missing | Data Type | Overview |
---|---|---|---|---|
Id | 0 | 0.0 | int64 | [1, 2919, 1460, 1460] |
MSSubClass | 0 | 0.0 | int64 | [20, 190, 57, 50] |
MSZoning | 0 | 0.0 | object | [RL, RM, C (all), FV, RH] |
LotFrontage | 0 | 0.0 | float64 | [21, 806, 70, 68] |
... | ... | ... | ... | ... |
No more NAs!!
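As one last sanity check, we can also confirm this programmatically (a one-line sketch):

```python
# total count of NA values across the merged dataset - should print 0
print(merged_data.isna().sum().sum())
```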
This has been a beast of a post, both in length and content! Now that we have a squeaky clean dataset, the next step will be some feature engineering, and possibly some exploration into feature selection using a Random Forest Regressor.
Thank you for reading, until next time!