Iowa Housing: Continued Cleaning
Welcome back! In my last post about the Iowa Housing dataset, I covered the process of filling categorical and numerical NAs to keep as much data as possible. Somewhat naively, I assumed we'd be ready for modelling; in reality, there is still a lot of work to be done.
To start with, I found more information about the dataset itself, and how others have approached this analysis task, through the following links:
- Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project | Dean De Cock, Truman State University | Journal of Statistics Education
- Ames, Iowa Real Estate Analysis | Mark Carthon
- Prediction Model of Ames Real Estate Prices | Authors not listed
The most important discovery was the full, unaltered dataset. When I started this project, the data provided by the Kaggle competition was incomplete: the target variable `SalePrice` had been removed from the `test_set`, and some features in the `train_set` had missing values. Now, thanks to Dean De Cock's article about the dataset, all of the data is available, including `SalePrice` for each sample.
In my last post about this dataset, I had to use 3 datasets when cleaning: `train_data`, `test_data`, and `merge_data`. Each cleaning step had to iterate over these three datasets and treat them separately. Now, I will use `fulldatadf` to refer to all training examples, and will divide `fulldatadf` into training and testing subsets during modelling.
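As a rough illustration of that last step, the split could look something like the sketch below, assuming scikit-learn's `train_test_split` and a hypothetical 80/20 ratio (the actual ratio isn't decided here):

```python
from sklearn.model_selection import train_test_split

# hypothetical 80/20 split; the real ratio will be chosen during modelling
train_set, test_set = train_test_split(
    fulldatadf, test_size=0.2, random_state=42)
```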
You can see the changes to missing values in this table, generated by this script:
Feature | NA Count | Percentage | Data Type | Overview |
---|---|---|---|---|
PoolQC | 2917 | 0.99556 | object | [nan, Ex, Gd, TA, Fa] |
MiscFeature | 2824 | 0.96382 | object | [nan, Gar2, Shed, Othr, Elev, TenC] |
Alley | 2732 | 0.93242 | object | [nan, Pave, Grvl] |
Fence | 2358 | 0.80478 | object | [nan, MnPrv, GdPrv, GdWo, MnWw] |
FireplaceQu | 1422 | 0.48532 | object | [Gd, nan, TA, Po, Ex, Fa] |
LotFrontage | 490 | 0.16724 | float64 | [21, 313, 69, 68] |
... | ... | ... | ... | ... |
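The linked script is the source of truth; as a minimal sketch, a summary like the table above could be produced with something along these lines (note that the `Percentage` column holds a fraction of the dataset):

```python
import pandas as pd

def na_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize features with missing values, most-missing first."""
    rows = []
    for col in df.columns:
        na_count = df[col].isna().sum()
        if na_count == 0:
            continue
        rows.append({
            'Feature': col,
            'NA Count': na_count,
            'Percentage': round(na_count / len(df), 5),
            'Data Type': df[col].dtype,
            'Overview': list(df[col].unique()[:6]),
        })
    return pd.DataFrame(rows).sort_values('NA Count', ascending=False)
```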
While looking at how others approached this analysis task, I made a list of important approaches others took that I missed:
Statistical approaches
- Filtering outliers using either a graphical approach or the IQR approach
- Recategorizing incorrectly labelled numerical features as categorical features
- Mapping ordinal (ordered) categorical features to integer values (features that measure condition, such as `Functional` and `OverallCond`)
- Merging levels within categories with few instances into those with more, and discarding features with too few instances
Domain knowledge approaches
- Using map data to determine distances to places of interest in Ames (Iowa State University, Airport, Downtown)
- Using `YearBuilt` and `Neighborhood` as filters to view their relationship to `SalePrice`
- Knowing a lot's `LotFrontage` can be `0` linear feet, so a value of `NA` does not mean we need to interpolate a value
As I walk through cleaning the dataset's features, it may be helpful to have the dataset’s documentation open to check definitions and values.
Filtering Outliers
There are some outliers in the important features `LotArea` and `GrLivArea`. First, I wanted to filter the data using the 1.5 x interquartile range (IQR) rule. Before filtering, there were 2930 samples; after filtering, this method left 2745 samples. Losing 185 samples, or 6.3% of our dataset, is not trivial. The IQR filters approximately marked the following as outliers: `GrLivArea >= 2750` and `LotArea >= 24000`.
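For reference, the rule boils down to something like the sketch below (the standard 1.5 x IQR definition; the cutoffs above came out of the full script, which may differ in detail):

```python
# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
def iqr_outliers(series, k=1.5):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# combined mask over the two features of interest
iqr_mask = (iqr_outliers(fulldatadf['GrLivArea'])
            | iqr_outliers(fulldatadf['LotArea']))
```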
I then made another filter, based only on the graphs I had seen, with these marked as outliers: `GrLivArea >= 4000` and `LotArea >= 80000`. Below are the two features mapped against `SalePrice`, where non-outliers are marked in green, IQR-determined outliers are marked in orange, and outliers determined by my graphical filter are marked with a star.
I generated the same graphs, using the logarithm of the features to see the distribution without the clumping around the origin in the second graph:
These graphs helped me make the decision not to use the IQR filtering rule, because it removes too many data points. The only points I feel confident removing based on `GrLivArea` are the three points in the lower right corner of the `GrLivArea` graph, as the other two marked by my filter still appear to follow the trend. The other points I feel confident removing based on `LotArea` are the four on the right side of the `LotArea` graph, since these lot sizes are far larger than any other lot seen in our dataset.
```python
# drop the three large-but-cheap GrLivArea points and the oversized lots
glafil = fulldatadf['GrLivArea'] >= 4000
spfil = fulldatadf['SalePrice'] <= 200000
lafil = fulldatadf['LotArea'] >= 80000
fulldatadf = fulldatadf[~((glafil & spfil) | lafil)]
```
Improperly encoded features
There are 3 features that are supposed to be nominal but have been encoded using integers: `PID`, `MoSold`, and `MSSubClass`. Leaving these as numerical treats their codes as magnitudes, implying an ordering and weight that should not be present.
As discussed in the data's documentation, `PID` needs a leading `0` in the string when looking up additional information on the assessor's website.
```python
# recast nominal integer codes as strings
fulldatadf['MoSold'] = fulldatadf['MoSold'].map(str)
fulldatadf['MSSubClass'] = fulldatadf['MSSubClass'].map(str)
# restore the leading zero used by the assessor's website
fulldatadf['PID'] = '0' + fulldatadf['PID'].map(str)
```
Note that I will not be including `PID` in the model.
Ordinal features
Of the 82 features, 46 are considered categorical - 23 ordinal, 23 nominal. In this section, I will recategorize some features, and map ordinal features to discrete numerics.
Remapping ordinal features to integer scale
`LotShape`, `Utilities`, `LandSlope`, `OverallQual`, `OverallCond`, `ExterQual`, `ExterCond`, `BsmtQual`, `BsmtCond`, `BsmtExposure`, `BsmtFinType1`, `BsmtFinType2`, `HeatingQC`, `Electrical`, `KitchenQual`, `Functional`, `FireplaceQu`, `GarageFinish`, `GarageQual`, `GarageCond`, `PavedDrive`, `PoolQC`, and `Fence` all have some implicit ordering associated with them.
This step takes advantage of the ordered quality of these features and maps each one to a linear scale. Defining a map, then applying said map, for 20+ variables is very repetitive, so the full script (without graphing code) is available here on my github.
As a short example, here is the process for `Functional`:
```python
# convert Functional to numerical
functional_map = {
    'Sal': 1,
    'Sev': 2,
    'Maj2': 3,
    'Maj1': 4,
    'Mod': 5,
    'Min2': 6,
    'Min1': 7,
    'Typ': 8,
}
remap(fulldatadf, 'Functional', functional_map)
```
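The `remap` helper lives in the full script linked above; a minimal sketch of one possible implementation:

```python
def remap(df, column, mapping):
    # replace ordered string levels with their integer ranks, in place;
    # levels missing from the mapping become NaN, which makes typos easy to spot
    df[column] = df[column].map(mapping)
```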
Nominal features
These are all categorical features that have no implicit ordering, so we will try to either merge low-cardinality levels into a broader bin, or create new features using the levels within a feature.
`PID`, `MSSubClass`, `MSZoning`, `Street`, `Alley`, `LandContour`, `LotConfig`, `Neighborhood`, `Condition1`, `Condition2`, `BldgType`, `HouseStyle`, `RoofStyle`, `RoofMatl`, `Exterior1st`, `Exterior2nd`, `MasVnrType`, `Foundation`, `Heating`, `CentralAir`, `GarageType`, `MiscFeature`, `SaleType`, and `SaleCondition`
MSSubClass
By the documentation, `MSSubClass` is a combination of 3 features: `HouseStyle`, `BldgType`, and `YearBuilt`. In my opinion, the only important aspect of this feature is determining whether a home is part of a PUD (Planned Unit Development), as all other information is available in another feature. Below are the levels for `MSSubClass` with PUD in the description, and the levels for the equivalent non-PUD home.
Non-PUD Code | Description | PUD Code | Description |
---|---|---|---|
020 | 1 Story, 1946 and newer | 120 | 1 Story PUD, 1946 and newer |
045, 050 | 1 1/2 Story, all ages | 150 | 1 1/2 Story PUD, all ages |
060 | 2 Story, 1946 and newer | 160 | 2 Story PUD, 1946 and newer |
080, 085 | Split, multilevel, or split foyer | 180 | Multilevel PUD |
Note the different scales for the y-axis `SalePrice`
These graphs tell me that although PUD homes make up a smaller subset of the data and have fewer outliers, there are differences in mean price between PUD and non-PUD homes. From this, I will create a new binary column based on `MSSubClass` indicating whether the home is part of a PUD, and drop `MSSubClass`, as the remaining information can be found in `HouseStyle`, `BldgType`, and `YearBuilt`.
```python
# PUD codes from the table above (strings, since MSSubClass was cast to str)
PUD_list = ['120', '150', '160', '180']
fulldatadf['isPUD'] = fulldatadf['MSSubClass'].apply(
    lambda x: 1 if x in PUD_list else 0)
fulldatadf = fulldatadf.drop(columns=['MSSubClass'])
```
MSZoning
This feature identifies the general zoning classification of the sale; since the goal is to predict a home's `SalePrice` for a potential homeowner, I will not include properties whose zoning is `A` (Agriculture), `C` (Commercial), or `I` (Industrial) in the model, under the assumption that a potential homeowner will not be purchasing these kinds of property (this excludes 29 properties in total).
```
# pre filtering
fulldatadf['MSZoning'].value_counts()

RL         2266
RM          462
FV          139
RH           27
C (all)      25
A (agr)       2
I (all)       2
```
```python
# filter out commercial, industrial,
# and agricultural properties
MSZoning_cols = ['RL', 'RM', 'FV', 'RH']
MSZoning_mask = fulldatadf['MSZoning'].isin(MSZoning_cols)
fulldatadf = fulldatadf[MSZoning_mask]
```
Condition1 + Condition2
There are 9 conditions detailed here, including adjacency to an arterial street (a high-capacity urban road), adjacency to a feeder street (a secondary road used to bring traffic to a major road), location within 200' of a railroad, adjacency to a railroad, and adjacency or proximity to a positive off-site feature, like a park or greenbelt. Proximity to railroads usually means a higher frequency of loud noises and possible traffic delays, while proximity to larger roads can mean faster commute times, or a higher frequency of loud noises. These features can also be confusing, as there are both positive and negative levels.
```
fulldatadf[['Condition1', 'Condition2']].value_counts()

Cond1   Cond2
Norm    Norm     2522
Feedr   Norm      155
Artery  Norm       89
RRAn    Norm       41
PosN    Norm       35
RRAe    Norm       28
PosA    Norm       17
RRAn    Feedr       8
RRNn    Norm        7
RRNe    Norm        6
Feedr   Feedr       4
PosN    PosN        4
PosA    PosA        3
Feedr   RRNn        2
Artery  Artery      2
Feedr   RRAn        1
RRAn    Artery      1
Feedr   RRAe        1
Artery              1
Artery  PosA        1
RRNn    Artery      1
Feedr               1
```
Based on the numbers here, I will create 4 new binary features:
New column name | Old levels | Description |
---|---|---|
RRAdj | RRAn, RRAe | Adjacent to any railroad (N-S or E-W) |
RRNear | RRNn, RRNe | Near any railroad (N-S or E-W), excluding adjacent |
RoadAdj | Artery, Feedr | Adjacent to arterial or feeder street |
PosNear | PosN, PosA | Adjacent to or near positive off-site feature |
I believe this creation of simple features, although losing a degree of granularity, will cut down on computation and improve performance, especially if our pipeline involves one-hot encoding categorical features.
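A minimal sketch of how these flags could be derived (the groupings come straight from the table above, though the exact code in the full script may differ):

```python
# build each binary flag from either condition column
cond_groups = {
    'RRAdj':   ['RRAn', 'RRAe'],
    'RRNear':  ['RRNn', 'RRNe'],
    'RoadAdj': ['Artery', 'Feedr'],
    'PosNear': ['PosN', 'PosA'],
}
for new_col, levels in cond_groups.items():
    fulldatadf[new_col] = (fulldatadf['Condition1'].isin(levels)
                           | fulldatadf['Condition2'].isin(levels)).astype(int)

# the original columns get dropped (see the recap table at the end of the post)
fulldatadf = fulldatadf.drop(columns=['Condition1', 'Condition2'])
```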
HouseStyle
This feature has 8 levels: 2 for 1.5-story homes (finished and unfinished), 2 for 2.5-story homes (finished and unfinished), and 2 for split homes. I will merge the finished and unfinished levels for each story count, as their IQRs and means are not too far off, and there are relatively few instances of each; a sketch of the merge follows the graphs below.
Before merge
After merge
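The merge itself can be a simple `replace`. Here is a sketch, assuming the unfinished level folds into the finished one; the merged level names are my shorthand, so check the full script for the real ones:

```python
# fold the sparse 'unfinished' levels into their 'finished' counterparts
fulldatadf['HouseStyle'] = fulldatadf['HouseStyle'].replace(
    {'1.5Unf': '1.5Fin', '2.5Unf': '2.5Fin'})
```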
RoofStyle
I will merge `Flat` with `Shed` to decrease underpopulated levels. (Additionally, I'm of the mind that a flat roof is a type of shed roof; according to one of the definitions I found, a roof can have up to a 10 degree slope in one direction and still be considered flat.)
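Something like the following does the trick; the same `replace` pattern covers the later exterior, `MasVnrType`, `Foundation`, and `Heating` merges, so I won't repeat it there:

```python
# merge Flat into Shed
fulldatadf['RoofStyle'] = fulldatadf['RoofStyle'].replace({'Flat': 'Shed'})
```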
```
# after merge
fulldatadf['RoofStyle'].value_counts()

Gable      2318
Hip         547
Shed         25
Gambrel      22
Mansard      11
```
RoofMatl
A lot of these have single samples - we're going to group everything besides `CompShg` under a generic `Other`.
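The catch-all grouping can be written with `where`, as a sketch:

```python
# everything that isn't CompShg becomes Other
fulldatadf['RoofMatl'] = fulldatadf['RoofMatl'].where(
    fulldatadf['RoofMatl'] == 'CompShg', 'Other')
```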
```
# after merge
fulldatadf['RoofMatl'].value_counts()

CompShg    2881
Other        42
```
Exterior1st + Exterior2nd
First, there are some typos in `Exterior2nd`: `CmentBd` is used instead of `CemntBd`, and `Wd Shng` is used instead of `WdShing`. After cleaning these up, only 66 samples do not have the same values for `Exterior1st` and `Exterior2nd`, over half with only 1 sample per combination. To decrease the number of combinations left, I will merge `Stone`, `CBlock`, `ImStucc`, `PreCast`, and `AsphShn` into `Other`, and `BrkComm` with `BrkFace` to make `Brick`.
One other issue is alphabetical ordering: `Exterior1st` and `Exterior2nd` do not imply an ordering; they are just 2 different exterior materials used on the house. So I rearranged the pairs so that `[Plywood, Brick]` has the same arrangement as `[Brick, Plywood]`.
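One way to canonicalize the pairs is to sort each row alphabetically; the full script may do this differently, but as a sketch:

```python
import numpy as np

# sort each (Exterior1st, Exterior2nd) pair alphabetically, row by row
ext = fulldatadf[['Exterior1st', 'Exterior2nd']].to_numpy()
fulldatadf[['Exterior1st', 'Exterior2nd']] = np.sort(ext, axis=1)
```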
```
# post merge, last 10 groups
fulldatadf.groupby('Exterior1st')['Exterior2nd'].value_counts()[-10:]

Ext1     Ext2
Stucco   Stucco      32
         Wd Sdng      5
         WdShing      5
         VinylSd      1
VinylSd  VinylSd   1006
         WdShing      9
         Wd Sdng      5
Wd Sdng  Wd Sdng    358
         WdShing     20
WdShing  WdShing     41
```
MasVnrType
As done earlier, I combined `BrkComm` with `BrkFace` to make `Brick`, and merged the single `CBlock` sample into `Stone`.
```
# post merge
fulldatadf['MasVnrType'].value_counts()

None     1773
Brick     904
Stone     246
```
Foundation
Combine `Stone` and `Wood` into `Other` due to low cardinality.
```
# post merge
fulldatadf['Foundation'].value_counts()

PConc     1307
CBlock    1240
BrkTil     311
Slab        49
Other       16
```
Heating
Combine all levels except for `GasA` and `GasW` into `Other`.
```
# post merge
fulldatadf['Heating'].value_counts()

GasA     2879
GasW       26
Other      18
```
CentralAir
This feature will go from a ‘Y’ / ‘N’ column to a binary 1 / 0 column.
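That mapping is a one-liner:

```python
# Y -> 1, N -> 0
fulldatadf['CentralAir'] = (fulldatadf['CentralAir'] == 'Y').astype(int)
```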
```
# post mapping
fulldatadf['CentralAir'].value_counts()

1    2727
0     196
```
MiscFeature
I will not be merging anything here, despite the low cardinality, due to the large differences in `MiscVal` when grouped by level, as seen below:
```
fulldatadf.groupby('MiscFeature')['MiscVal'].mean()

MiscFeature
Gar2    8760.00
TenC    2000.00
Othr    3250.00
Shed     767.32
None       0.00
```
SaleType
Similar to `HouseStyle`, I will merge these based on their similar categories (a sketch of the mapping follows the graphs below):
Old SaleType level | Description | New SaleType level |
---|---|---|
WD | Warranty Deed - Conventional | WD |
CWD | Warranty Deed - Cash | WD |
VWD | Warranty Deed - VA loan | WD |
New | Home just constructed and sold | New |
COD | Court Officer Deed / Estate | Other |
Con | Contract 15% Down payment regular terms | Con |
ConLw | Contract Low Down payment and low interest | Con |
ConLI | Contract Low Interest | Con |
ConLD | Contract Low Down | Con |
Oth | Other | Other |
Before merge
After merge
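Expressed as a mapping, taken directly from the table above (the full script may structure it differently):

```python
# collapse SaleType levels per the table
saletype_map = {
    'WD': 'WD', 'CWD': 'WD', 'VWD': 'WD',
    'New': 'New',
    'COD': 'Other', 'Oth': 'Other',
    'Con': 'Con', 'ConLw': 'Con', 'ConLI': 'Con', 'ConLD': 'Con',
}
fulldatadf['SaleType'] = fulldatadf['SaleType'].map(saletype_map)
```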
SaleCondition
First, we can look at the distribution of `SalePrice` when sorted by `SaleCondition`:
What we do now depends entirely on what kinds of homes we want to predict the `SalePrice` of. If the goal is to predict the price of an average home for a potential home buyer, the model should look only at `Normal` sales, since other types are not representative of the prices seen. If the goal is to predict the price of any real estate transaction in Ames, Iowa, all data points should be kept.
In this case, even though `Abnorml`, `Family`, `Alloca`, and `AdjLand` make up 9.2% of our dataset, I will not keep these samples in the dataset. Because my goal is to predict the price of a home for a potential home buyer, the model should only see prices associated with typical sales.
```python
# filter 'SaleCondition' by including
# Normal and Partial sales only
salecondition_cols = ['Normal', 'Partial']
salecondition_mask = fulldatadf['SaleCondition'].isin(salecondition_cols)
fulldatadf = fulldatadf[salecondition_mask]
```
Final comments
A reminder that all cleaning code can be found on my github. If you want any of the graphing code, check out the full script here. Here's a quick recap of what shape the data is in after cleaning and filtering:
```
fulldatadf.shape

(2635, 83)
```
Action | Columns |
---|---|
Dropped | Order, MSSubClass, Condition1, Condition2 |
Added | RRAdj, RRNear, RoadAdj, PosNear, isPUD |
What's up next
- In one of my next posts, I want to talk about building my python module `cleaning.py` with all of my cleaning functions inside, so my `working.py` script remains readable, and functions are easily repeatable for my next data cleaning project
More samples and additional information for this dataset are difficult to come by, so any feature engineering will come from currently available data, and testing will occur only on the samples in the available file.
If you see anything in my posts that you want to comment on, send me an email and let me know the issue!
Until next time!