Pipeline for predicting house prices

6 minute read

The Jupyter notebook can be accessed here

This is a simple example of how to: take a modest dataset, distill the feature set through exploratory data analysis, and then scale down features and train a variety of regression models via a pipeline. In the end, I select the random forest regression model, which typically yields very good results, and apply a grid search to tune model parameters, before predicting prices.

Loading in data

In this case we are using the standard Scikit-learn dataset called ‘Boston House Prices’, which contains features about houses (e.g. location, age, number of rooms, basement area), where the median values of the properties are the target values.

Notes about the data

Data Set Characteristics:

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's

I will now load the data into a Pandas dataframe by concatenating the data and the targets from the Boston dataset.

# Load dataset
data = datasets.load_boston()
# Extract feature names, predictors and targets from sklearn dictionary
names = data.feature_names
predictors = data.data
targets = data.target
# Concatenate predictors and targets for easier processing later on. Let's also name the columns.
df = pd.concat([pd.DataFrame(predictors, columns=names), pd.DataFrame(targets, columns=['MEDV'])], axis=1)

Inspecting data

We have everything in one table with all columns labelled, as intended.

df.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0	0.00632	18.00000	2.31000	0.53800	6.57500	65.20000	4.09000	1.00000	296.00000	15.30000	396.90000	4.98000	24.00000
1	0.02731	0.00000	7.07000	0.46900	6.42100	78.90000	4.96710	2.00000	242.00000	17.80000	396.90000	9.14000	21.60000
2	0.02729	0.00000	7.07000	0.46900	7.18500	61.10000	4.96710	2.00000	242.00000	17.80000	392.83000	4.03000	34.70000
3	0.03237	0.00000	2.18000	0.45800	6.99800	45.80000	6.06220	3.00000	222.00000	18.70000	394.63000	2.94000	33.40000
4	0.06905	0.00000	2.18000	0.45800	7.14700	54.20000	6.06220	3.00000	222.00000	18.70000	396.90000	5.33000	36.20000

Lets confirm what type of values we have (should all be floating values).

df.dtypes.value_counts()

float64    14
dtype: int64

Lets summarize each attribute. From a glance we could expect some of the data to be quite skewed.

df.describe()

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
count	506.00000	506.00000	506.00000	506.00000	506.00000	506.00000	506.00000	506.00000	506.00000	506.00000	506.00000	506.00000	506.00000	506.00000
mean	3.59376	11.36364	11.13678	0.06917	0.55470	6.28463	68.57490	3.79504	9.54941	408.23715	18.45553	356.67403	12.65306	22.53281
std	8.59678	23.32245	6.86035	0.25399	0.11588	0.70262	28.14886	2.10571	8.70726	168.53712	2.16495	91.29486	7.14106	9.19710
min	0.00632	0.00000	0.46000	0.00000	0.38500	3.56100	2.90000	1.12960	1.00000	187.00000	12.60000	0.32000	1.73000	5.00000
25%	0.08204	0.00000	5.19000	0.00000	0.44900	5.88550	45.02500	2.10018	4.00000	279.00000	17.40000	375.37750	6.95000	17.02500
50%	0.25651	0.00000	9.69000	0.00000	0.53800	6.20850	77.50000	3.20745	5.00000	330.00000	19.05000	391.44000	11.36000	21.20000
75%	3.64742	12.50000	18.10000	0.00000	0.62400	6.62350	94.07500	5.18843	24.00000	666.00000	20.20000	396.22500	16.95500	25.00000
max	88.97620	100.00000	27.74000	1.00000	0.87100	8.78000	100.00000	12.12650	24.00000	711.00000	22.00000	396.90000	37.97000	50.00000

We can also plot histograms for each column to get a clear idea of the shape of the data. It might be a good idea to redistribute some of the data. I will skip over it in this case.

df.hist(figsize=(10, 10), bins=20)
plt.show()

png

We need to make sure we don’t have any Nan values in our dataset, otherwise the ML algorithms will fail.

df.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

Good, now let’s see if we can find strong correlations between columns.

df_corr = df.corr().abs()
filter = (df_corr == 1) | (df_corr < 0.75)
df_corr[filter] = 0
df_corr

f, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(df_corr, annot=True, annot_kws={'size': 14},
            cmap=sns.diverging_palette(220, 20, n=11))

png

Dimensionality reduction

From the previous plot, we were able to ascertain that the ‘NOX’ and the ‘RAD’ columns are strongly correlated with other columns. Therefore, let’s go ahead and drop them from our dataframe.

cols_corr_manual = ['NOX', 'RAD']
df = df.drop(columns=cols_corr_manual)

Setting up train and test datasets

Now we are ready to peel of the last column and assign it to the target (y) variable and the rest of the data can be assigned to the predictor (X) variable. After that, we use the sklearn train_test_split function to create train and test datasets for both X and y.

# Train-set predictors/targets
X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

Initialising models, setting up a pipeline, training and predicting

For this part, I would like a list of regression models, then I will create a pipeline that scales the data, reduces feature dimensionality (via PCA) and trains the model, all in one go. We can loop through each model, inserting it into the pipeline, and then finally we can see how the results for each of the models compare.

# Lets append tuples to the list that contain both the name of the model and the model itself
# This is what the pipeline expects
models = []
models.append( ('LR', LinearRegression()) )
models.append( ('Lasso', Lasso()) )
models.append( ('ElasticNet', ElasticNet()) )
models.append( ('KNN', KNeighborsRegressor()) )
models.append( ('CART', DecisionTreeRegressor()) )
models.append( ('SVR', SVR()) )

# Now we can loop through the models and run the pipeline each time
for name, model in models:
    pipelined_model = Pipeline([('minmax', MinMaxScaler()),
                     ('pca', PCA(n_components = 3)),
                     (name, model)])

    # Train model
    pipelined_model.fit(X_train, y_train)

    # Make predictions on the test-set
    y_hat = pipelined_model.predict(X_test)

    # Calculate error
    RMSE = np.sqrt(mean_squared_error(y_test, y_hat))
    print('Model: ', name, ' | RMSE: ', RMSE)
    print('----------------')

    # I also like to save the models each time as a matter of habit
    joblib.dump(pipelined_model, '{}_model.pkl'.format(name))

Model:  LR  | RMSE:  7.15970109322
----------------
Model:  Lasso  | RMSE:  7.57962692126
----------------
Model:  ElasticNet  | RMSE:  8.1579157098
----------------
Model:  KNN  | RMSE:  6.36588579546
----------------
Model:  CART  | RMSE:  9.45087919969
----------------
Model:  SVR  | RMSE:  7.60464954267
----------------

We will now run the pipeline just once with yet another model - the RandomForest regressor - and incorporate a gridsearch to tune the model parameters.

# The pipeline as before, but this time the model type is static
pipelined_model = Pipeline([('minmax', MinMaxScaler()),
                 ('pca', PCA(n_components = 3)),
                 ('RF', RandomForestRegressor())])

grid = {
        'RF__n_estimators': [100, 200, 300, 400, 500],
        'RF__max_depth': [10, 20, 30, 40, 50],
        'RF__min_samples_leaf': [10, 15, 20, 25, 30]
        }

# Run cross-validation & discover best combination of params, with 10 folds of cross-validation
grid_search_cv = GridSearchCV(pipelined_model,
                              grid,
                              cv=KFold(n_splits=10, random_state=7),
                              scoring='neg_mean_squared_error',
                              n_jobs = -1)

results = grid_search_cv.fit(X_train, y_train)

print(results.best_score_)
print(results.best_params_)

Now we can create a new instance of the pipeline and pass the ‘best_params_’ of the grid search directly into the regressor.

pipelined_model = Pipeline([('minmax', MinMaxScaler()),
                 ('pca', PCA(n_components = 3)),
                 ('RF', RandomForestRegressor(n_estimators=results.best_params_['RF__n_estimators'],
                                              max_depth=results.best_params_['RF__max_depth'],
                                              min_samples_leaf=results.best_params_['RF__min_samples_leaf']))])

pipelined_model.fit(X_train, y_train)

Once more, we make predictions using the test set.

# Lets make predictions on the test-set
y_hat = pipelined_model.predict(X_test)

# Calc error
RMSE = np.sqrt(mean_squared_error(y_test, y_hat))
print('Model: Random forest', ' | ', 'RMSE: ', RMSE)

Model: Random forest  |  RMSE:  6.31066783749

Turns out that our RMSE score wins out by a very tight margin. Let’s bare in mind that we off by about $6000 US here, because the target values are counted in the 1000s.

Twitter Facebook Google+ LinkedIn

George Burry