Tree-based methods involve stratifying or segmenting the predictor space into a number of simple regions.
Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.
Bagging, random forests, and boosting extend this idea: each of these approaches involves producing multiple trees which are then combined to yield a single consensus prediction.
recursive binary splitting: a top-down, greedy approach that, at each step, picks the predictor and cutpoint whose split gives the greatest reduction in RSS, without looking ahead (sketched below)
tree pruning: cost-complexity (weakest-link) pruning then cuts the large tree back to a subtree, trading off training fit against the number of terminal nodes
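A minimal sketch of the greedy split search that recursive binary splitting repeats in each region (the function best_split() and its argument names are illustrative, not from the text): for every predictor \(j\) and cutpoint \(s\) it computes the total RSS of the two resulting half-planes and keeps the best pair.
# greedy search for the single split (predictor, cutpoint) that minimizes the total RSS
# X: data frame of numeric predictors, y: numeric response
best_split = function(X, y) {
  best = list(rss = Inf)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[[j]]))) {
      left  = y[X[[j]] <  s]
      right = y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss = sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best = list(var = names(X)[j], cut = s, rss = rss)
    }
  }
  best
}
# growing the tree applies best_split() again to each of the two new regions,
# top-down and greedily, until a stopping rule (e.g. a minimum node size) is met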
If the relationship between the features and the response is well approximated by a linear model, then an approach such as linear regression will likely work well, and will outperform a method such as a regression tree that does not exploit this linear structure.
If instead there is a highly non-linear and complex relationship between the features and the response, then decision trees may outperform classical approaches.
\[ \hat f_{bag}(x)=\frac{1}{B}\sum\limits_{b=1}^B\hat f^{*b}(x) \]
averaging the predictions from many bootstrapped training sets reduces the variance
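A minimal sketch of bagging regression trees by hand, assuming the tree package and a training data frame train.df whose response column is named y (all of these names are illustrative):
library(tree)
bagged_predict = function(train.df, newdata, B = 100) {
  preds = replicate(B, {
    idx = sample(nrow(train.df), replace = TRUE)  # draw a bootstrap sample
    fit = tree(y ~ ., data = train.df[idx, ])     # fit a regression tree to it
    predict(fit, newdata = newdata)               # predict on the new observations
  })
  rowMeans(preds)                                 # average the B sets of predictions
}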
As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors.
Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other, and the predictions from the bagged trees will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities, so bagging will not lead to a substantial reduction in variance over a single tree in this setting. Random forests overcome this problem by forcing each split to consider only a random subset of the predictors, typically \(m \approx \sqrt{p}\), which decorrelates the trees.
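In the randomForest package (used in the lab below), this subset size is set with the mtry argument. A minimal sketch, assuming a data frame dat whose factor response is named y:
library(randomForest)
p = ncol(dat) - 1                                             # number of predictors
rf.fit = randomForest(y ~ ., data = dat, mtry = floor(sqrt(p)))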
Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.
unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, boosting learns slowly.
Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome \(Y\), as the response.
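A minimal sketch of this procedure for regression with squared-error loss, assuming the tree package; the function boost_by_hand() and its arguments are illustrative (x is a data frame of predictors, y the response, lambda the shrinkage, B the number of trees, d the number of splits per tree):
library(tree)
boost_by_hand = function(x, y, B = 1000, lambda = 0.01, d = 1) {
  r = y                                   # the model starts at 0, so the residuals equal y
  fits = vector("list", B)
  for (b in seq_len(B)) {
    dat = data.frame(x, r = r)
    fit = tree(r ~ ., data = dat)         # fit a tree to the current residuals
    fit = prune.tree(fit, best = d + 1)   # keep it small: d splits = d + 1 terminal nodes
    r = r - lambda * predict(fit, dat)    # add a shrunken version of the tree, update residuals
    fits[[b]] = fit
  }
  fits  # a prediction sums lambda * predict(fits[[b]], newdata) over b = 1, ..., B
}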
Boosting has three tuning parameters: the number of trees \(B\), the shrinkage parameter \(\lambda\) (which controls the rate at which boosting learns), and the number of splits \(d\) in each tree (which controls the interaction depth).
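# classification trees: the Carseats data, predicting High (whether Sales exceeds 8, i.e. 8,000 units sold)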
library(tree)
library(ISLR)
attach(Carseats)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1 9.50 138 73 11 276 120 Bad 42
## 2 11.22 111 48 16 260 83 Good 65
## 3 10.06 113 35 10 269 80 Medium 59
## 4 7.40 117 100 4 466 97 Medium 55
## 5 4.15 141 64 3 340 128 Bad 38
## 6 10.81 124 113 13 501 72 Bad 78
## Education Urban US High High.1 High.2
## 1 17 Yes Yes Yes Yes Yes
## 2 10 Yes Yes Yes Yes Yes
## 3 12 Yes Yes Yes Yes Yes
## 4 14 Yes Yes No No No
## 5 13 Yes No No No No
## 6 16 No Yes Yes Yes Yes
# store High as a factor so that tree() fits a classification tree (data.frame() no longer converts strings to factors in R >= 4.0)
High = as.factor(ifelse(Sales <= 8, "No", "Yes"))
Carseats = data.frame(Carseats, High)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1 9.50 138 73 11 276 120 Bad 42
## 2 11.22 111 48 16 260 83 Good 65
## 3 10.06 113 35 10 269 80 Medium 59
## 4 7.40 117 100 4 466 97 Medium 55
## 5 4.15 141 64 3 340 128 Bad 38
## 6 10.81 124 113 13 501 72 Bad 78
## Education Urban US High High.1 High.2 High.3
## 1 17 Yes Yes Yes Yes Yes Yes
## 2 10 Yes Yes Yes Yes Yes Yes
## 3 12 Yes Yes Yes Yes Yes Yes
## 4 14 Yes Yes No No No No
## 5 13 Yes No No No No No
## 6 16 No Yes Yes Yes Yes Yes
tree.carseats = tree(High~.-Sales, Carseats)
summary(tree.carseats)
##
## Classification tree:
## tree(formula = High ~ . - Sales, data = Carseats)
## Variables actually used in tree construction:
## [1] "High.1"
## Number of terminal nodes: 2
## Residual mean deviance: 0 = 0 / 398
## Misclassification error rate: 0 = 0 / 400
plot(tree.carseats)
text(tree.carseats, pretty = 0)
In order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error.
set.seed(2)
train = sample(1:nrow(Carseats), 200)
Carseats.test = Carseats[-train, ]
High.test = High[-train]
tree.carseats = tree(High~.-Sales, Carseats, subset = train)
tree.pred = predict(tree.carseats, Carseats.test, type = "class")
table(tree.pred, High.test)
## High.test
## tree.pred No Yes
## No 116 0
## Yes 0 84
Use the argument FUN = prune.misclass in order to indicate that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance.
set.seed(3)
cv.carseats = cv.tree(tree.carseats, FUN = prune.misclass)
names(cv.carseats)
## [1] "size" "dev" "k" "method"
# size: the number of terminal nodes of each tree considered
# dev: corresponds to the cross-validation error rate in this instance.
# k: the value of the cost-complexity pruning parameter used (corresponds to alpha)
cv.carseats
## $size
## [1] 2 1
##
## $dev
## [1] 0 80
##
## $k
## [1] -Inf 80
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
par(mfrow = c(1, 2))
plot(cv.carseats$size, cv.carseats$dev, type = "b")
plot(cv.carseats$k, cv.carseats$dev, type = "b")
apply the prune.misclass() function in order to prune the tree to obtain the nine-node tree.
prune.carseats = prune.misclass(tree.carseats, best = 9)
## Warning in prune.tree(tree = tree.carseats, best = 9, method = "misclass"):
## best is bigger than tree size
plot(prune.carseats)
text(prune.carseats, pretty = 0)
tree.pred = predict(prune.carseats, Carseats.test, type = "class")
table(tree.pred, High.test)
## High.test
## tree.pred No Yes
## No 116 0
## Yes 0 84
If we increase the value of best, we obtain a larger pruned tree with lower classification accuracy:
prune.carseats = prune.misclass(tree.carseats, best = 15)
## Warning in prune.tree(tree = tree.carseats, best = 15, method =
## "misclass"): best is bigger than tree size
plot(prune.carseats)
text(prune.carseats, pretty = 0)
tree.pred = predict(prune.carseats, Carseats.test, type = "class")
table(tree.pred, High.test)
## High.test
## tree.pred No Yes
## No 116 0
## Yes 0 84
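# regression trees: the Boston housing data from the MASS package, predicting medv (median home value in $1000s)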
library(MASS)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston = tree(medv~., Boston, subset = train)
summary(tree.boston)
##
## Regression tree:
## tree(formula = medv ~ ., data = Boston, subset = train)
## Variables actually used in tree construction:
## [1] "lstat" "rm" "dis"
## Number of terminal nodes: 8
## Residual mean deviance: 12.65 = 3099 / 245
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -14.10000 -2.04200 -0.05357 0.00000 1.96000 12.60000
plot(tree.boston)
text(tree.boston, pretty = 0)
cv.boston = cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type = "b")
prune.boston = prune.tree(tree.boston, best = 5)
plot(prune.boston)
text(prune.boston, pretty = 0)
yhat = predict(tree.boston, newdata = Boston[-train, ])
boston.test = Boston[-train, "medv"]
plot(yhat, boston.test)
abline(0, 1)
mean((yhat-boston.test)^2)
## [1] 25.04559
library(randomForest)
set.seed(1)
ncol(Boston)
## [1] 14
# p = 14-1 = 13
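# bagging is a special case of a random forest with m = p, so all 13 predictors are considered at each split (mtry = 13)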
bag.boston=randomForest(medv~., data = Boston, subset = train, mtry=13, importance = TRUE)
# change the number of trees grown by randomForest() using the ntree argument
bag.boston
##
## Call:
## randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 11.15723
## % Var explained: 86.49
How well does this bagged model perform on the test set?
yhat.bag = predict(bag.boston, newdata = Boston[-train, ])
plot(yhat.bag, boston.test)
abline(0, 1)
mean((yhat.bag-boston.test)^2)
## [1] 13.50808
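As noted in the comment above, the number of trees grown by randomForest() can be changed with the ntree argument; a quick sketch with a smaller ensemble of 25 trees (bag.boston25 and yhat.bag25 are names introduced here; output omitted):
bag.boston25 = randomForest(medv~., data = Boston, subset = train, mtry = 13, ntree = 25)
yhat.bag25 = predict(bag.boston25, newdata = Boston[-train, ])
mean((yhat.bag25 - boston.test)^2)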
set.seed(1)
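# by default randomForest() uses mtry = p/3 variables for regression trees and mtry = sqrt(p) for classification trees; here we try mtry = 6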
rf.boston = randomForest(medv~., data = Boston, subset = train, mtry = 6, importance = TRUE)
yhat.rf = predict(rf.boston, newdata = Boston[-train, ])
mean((yhat.rf-boston.test)^2)
## [1] 11.66454
importance(rf.boston)
## %IncMSE IncNodePurity
## crim 12.132320 986.50338
## zn 1.955579 57.96945
## indus 9.069302 882.78261
## chas 2.210835 45.22941
## nox 11.104823 1044.33776
## rm 31.784033 6359.31971
## age 10.962684 516.82969
## dis 15.015236 1224.11605
## rad 4.118011 95.94586
## tax 8.587932 502.96719
## ptratio 12.503896 830.77523
## black 6.702609 341.30361
## lstat 30.695224 7505.73936
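# %IncMSE: mean increase in prediction MSE on the out-of-bag samples when the variable is permuted
# IncNodePurity: total decrease in node impurity (training RSS for regression) from splits on the variable, averaged over all trees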
varImpPlot(rf.boston)
library(gbm)
set.seed(1)
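# distribution = "gaussian" since this is a regression problem ("bernoulli" would be used for binary classification);
# n.trees = 5000 grows 5000 trees and interaction.depth = 4 limits the depth of each tree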
boost.boston = gbm(medv~., data = Boston[train, ], distribution = "gaussian", n.trees = 5000, interaction.depth = 4)
summary(boost.boston)
## var rel.inf
## lstat lstat 37.0661275
## rm rm 25.3533123
## dis dis 11.7903016
## crim crim 8.0388750
## black black 4.2531659
## nox nox 3.5058570
## age age 3.4868724
## ptratio ptratio 2.2500385
## indus indus 1.7725070
## tax tax 1.1836592
## chas chas 0.7441319
## rad rad 0.4274311
## zn zn 0.1277206
par(mfrow=c(1,2))
plot(boost.boston, i="rm")
plot(boost.boston, i="lstat")
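These partial dependence plots show the marginal effect of rm and lstat on medv after integrating out the other variables: median house prices increase with rm and decrease with lstat.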
use the boosted model to predict medv on the test set:
yhat.boost = predict(boost.boston, newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - boston.test)^2)
## [1] 10.81479
We can also perform boosting with a different value of the shrinkage parameter \(\lambda\), here \(\lambda = 0.2\):
boost.boston = gbm(medv~., data = Boston[train, ], distribution = "gaussian", n.trees = 5000, interaction.depth = 4, shrinkage = 0.2, verbose = F)
yhat.boost = predict(boost.boston, newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - boston.test)^2)
## [1] 11.51109