Estimated time needed: 20 minutes
In this lab you will learn how to implement regression trees using Scikit-Learn. We will show which parameters are important, how to train a regression tree, and finally how to evaluate a regression tree's accuracy.
After completing this lab you will be able to:
- Train a Regression Tree
- Evaluate a Regression Tree's Performance
For this lab, we are going to be using Python and several Python libraries. Some may already be installed in your lab environment or in SN Labs; the cells below will install the rest when executed.
# Fetch the dataset over HTTP (this pattern is specific to JupyterLite/Pyodide)
from js import fetch
import io

URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/real_estate_data.csv"
resp = await fetch(URL)
# Read the response body into an in-memory buffer that pandas can read from
regression_tree_data = io.BytesIO((await resp.arrayBuffer()).to_py())
# Install the libraries required by this lab into the JupyterLite environment
import piplite
await piplite.install(['pandas'])
await piplite.install(['numpy'])
await piplite.install(['scikit-learn'])
# Pandas will allow us to create a dataframe of the data so it can be used and manipulated
import pandas as pd
# Regression Tree Algorithm
from sklearn.tree import DecisionTreeRegressor
# Split our data into training and testing sets
from sklearn.model_selection import train_test_split
Imagine you are a data scientist working for a real estate company that is planning to invest in Boston real estate. You have collected information about various areas of Boston and are tasked with creating a model that can predict the median price of houses in each area so it can be used to make offers.
The dataset has information on areas/towns, not individual houses. The features are:
- CRIM: Per capita crime rate by town
- ZN: Proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: Proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: Nitric oxides concentration (parts per 10 million)
- RM: Average number of rooms per dwelling
- AGE: Proportion of owner-occupied units built prior to 1940
- DIS: Weighted distances to five Boston employment centers
- RAD: Index of accessibility to radial highways
- TAX: Full-value property-tax rate per $10,000
- PTRATIO: Pupil-teacher ratio by town
- LSTAT: Percent lower status of the population
- MEDV: Median value of owner-occupied homes in $1000s
Let's read in the data we downloaded:
data = pd.read_csv(regression_tree_data)
data.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | NaN | 36.2 |
Now let's look at the size of our data: there are 506 rows and 13 columns.
data.shape
(506, 13)
Most of the data is valid, but there are rows with missing values, which we will deal with in pre-processing.
data.isna().sum()
CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
LSTAT      20
MEDV        0
dtype: int64
First, let's drop the rows with missing values, since we have enough data left in our dataset.
data.dropna(inplace=True)
Now we can see that our dataset has no missing values:
data.isna().sum()
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
LSTAT      0
MEDV       0
dtype: int64
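Dropping rows is what this lab does, but it is not the only option. As a minimal alternative sketch (not used in the rest of the lab), the missing values could instead be imputed with each column's median:

```python
# Alternative (illustration only): impute missing values instead of dropping rows.
# Rewind the in-memory buffer so we can re-read the raw data.
regression_tree_data.seek(0)
raw = pd.read_csv(regression_tree_data)
imputed = raw.fillna(raw.median())     # fill each column's NaNs with its median
print(imputed.isna().sum().sum())      # 0 -> no missing values remain
```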
Let's split the dataset into our features and our target (what we are predicting):
X = data.drop(columns=["MEDV"])
Y = data["MEDV"]
X.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 5.21 |
Y.head()
0    24.0
1    21.6
2    34.7
3    33.4
5    28.7
Name: MEDV, dtype: float64
Finally, let's split our data into training and testing datasets using `train_test_split` from `sklearn.model_selection` (imported above):
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=1)
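As a quick sanity check (an addition, not part of the original flow), we can confirm that roughly 80% of the rows went to training and 20% to testing:

```python
# Confirm the 80/20 split produced by test_size=.2
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)
```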
Regression Trees are implemented using `DecisionTreeRegressor` from `sklearn.tree` (imported above).
The important parameters of `DecisionTreeRegressor` are:

- `criterion`: {"squared_error", "friedman_mse", "absolute_error", "poisson"} - the function used to measure the quality of a split ("squared_error" and "absolute_error" replace the deprecated names "mse" and "mae")
- `max_depth` - the maximum depth the tree is allowed to grow to
- `min_samples_split` - the minimum number of samples required to split a node
- `min_samples_leaf` - the minimum number of samples a leaf can contain
- `max_features`: {"sqrt", "log2"}, an int, or a float - the number of features examined when looking for the best split, used to speed up training

A sketch that sets several of these parameters together follows below.
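The values in this sketch are arbitrary and untuned; they only illustrate how the parameters are passed:

```python
# Illustration only: arbitrary, untuned parameter values
example_tree = DecisionTreeRegressor(
    criterion="squared_error",  # measure split quality with mean squared error
    max_depth=5,                # grow at most 5 levels deep
    min_samples_split=10,       # a node needs at least 10 samples to be split
    min_samples_leaf=5,         # every leaf must keep at least 5 samples
    max_features="sqrt",        # consider sqrt(n_features) features per split
)
```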
First, let's create a `DecisionTreeRegressor` object, setting the `criterion` parameter to `squared_error` for Mean Squared Error (the older name `mse` is deprecated):
regression_tree = DecisionTreeRegressor(criterion = "mse")
Now let's train our model using the `fit` method on the `DecisionTreeRegressor` object, providing our training data:
regression_tree.fit(X_train, Y_train)
DecisionTreeRegressor()
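Once fitted, we can inspect the learned tree with the standard `get_depth` and `get_n_leaves` methods; the exact numbers depend on the training split:

```python
# Inspect the structure of the fitted tree
print("depth:", regression_tree.get_depth())
print("leaves:", regression_tree.get_n_leaves())
```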
To evaluate our model we will use the `score` method of the `DecisionTreeRegressor` object, providing our testing data. The returned number is the $R^2$ value, which indicates the coefficient of determination:
regression_tree.score(X_test, Y_test)
0.852006811553053
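For reference, `score` is equivalent to computing $R^2$ on the predictions with `sklearn.metrics.r2_score`:

```python
from sklearn.metrics import r2_score

# Equivalent to regression_tree.score(X_test, Y_test)
print(r2_score(Y_test, regression_tree.predict(X_test)))
```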
We can also find the average error on our testing set, which is the average error in the predicted median home value:
prediction = regression_tree.predict(X_test)
print("$",(prediction - Y_test).abs().mean()*1000)
$ 2715.189873417721
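The same figure can be computed with `sklearn.metrics.mean_absolute_error`; a minimal equivalent:

```python
from sklearn.metrics import mean_absolute_error

# Mean absolute error in dollars (MEDV is recorded in $1000s)
print("$", mean_absolute_error(Y_test, prediction) * 1000)
```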
Train a regression tree using the criterion `absolute_error` (the replacement for the deprecated `mae`), then report its $R^2$ value and average error.
regression_tree = DecisionTreeRegressor(criterion = "mae")
regression_tree.fit(X_train, Y_train)
print(regression_tree.score(X_test, Y_test))
prediction = regression_tree.predict(X_test)
print("$",(prediction - Y_test).abs().mean()*1000)
0.8720206502582719
$ 2537.9746835443034
Click here for the solution
```python
regression_tree = DecisionTreeRegressor(criterion="absolute_error")
regression_tree.fit(X_train, Y_train)
print(regression_tree.score(X_test, Y_test))
prediction = regression_tree.predict(X_test)
print("$", (prediction - Y_test).abs().mean() * 1000)
```
Azim Hirjani
| Date (YYYY-MM-DD) | Version | Changed By | Change Description |
|---|---|---|---|
| 2020-07-20 | 0.2 | Azim | Modified Multiple Areas |
| 2020-07-17 | 0.1 | Azim | Created Lab Template |
Copyright © 2020 IBM Corporation. All rights reserved.