# Simulation: Fig. 18.1

| R Notebook | Simulation: Fig. 18.1 |
| --- | --- |
| Author | szcf-weiya |
| Date | 2017-12-29 |
| Updated | 2018-02-04 |
This notebook reproduces the simulation from the example in Section 18.1 of ESL.

## Generate the data

Generate $N$ samples of $p$ standard Gaussian features $X$ with pairwise correlation 0.2.
```r
library(MASS)

genX <- function(p, N = 100) {
  # mean vector: all zeros
  mu = numeric(p)
  # since var(x_j) = 1, the covariance matrix equals the correlation matrix
  Sigma = matrix(0.2, nrow = p, ncol = p)
  diag(Sigma) = 1
  X = mvrnorm(N, mu, Sigma)
  # mvrnorm returns a plain vector when N = 1; coerce it back to a 1 x p matrix
  if (is.null(dim(X)))
    X = matrix(X, ncol = p)
  return(X)
}
```
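As a quick sanity check (the seed and variable names below are arbitrary), the average off-diagonal entry of the empirical correlation matrix should be close to 0.2:

```r
# sanity check: average empirical pairwise correlation should be close to 0.2
set.seed(123)
X = genX(20, N = 1000)
R = cor(X)
mean(R[upper.tri(R)])
```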
To generate $Y$, we use the model

$$Y = \sum_{j=1}^{p} X_j \beta_j + \sigma\varepsilon\,, \qquad \varepsilon \sim N(0, 1)\,,$$

so we first draw standard Gaussian coefficients $\beta_j$, and then determine $\sigma$ from the signal-to-noise ratio $\mathrm{Var}(f(X))/\sigma^2 = 2$, which fixes the error term.
```r
genY <- function(X) {
  N = nrow(X)
  p = ncol(X)
  # standard Gaussian coefficients
  beta = rnorm(p, 0, 1)
  # f(X) = X beta, kept as a plain vector so that lm() later sees a
  # single-response model rather than a one-column matrix ("mlm")
  fX = as.vector(X %*% beta)
  # noise variance chosen so that the signal-to-noise ratio Var(f(X)) / sigma^2 = 2
  # (computed from the training fX only, not from a test set)
  epsi.var = var(fX) / 2
  epsi = rnorm(N, 0, sqrt(epsi.var))
  Y = fX + epsi
  res = list(Y = Y, epsi.var = epsi.var)
  return(res)
}
```
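As a quick check of this construction (variable names are arbitrary), $\mathrm{Var}(Y) = \mathrm{Var}(f(X)) + \sigma^2 = 3\sigma^2$, so the ratio below should come out roughly 3:

```r
# Var(Y) / sigma^2 should be roughly 3, since Var(f(X)) = 2 * sigma^2 by construction
set.seed(123)
X = genX(20)
res = genY(X)
var(res$Y) / res$epsi.var
```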
## Linear regression on all features
```r
for (p in c(20, 100, 1000)) {
  X = genX(p)
  Y = genY(X)$Y
  model = lm(Y ~ X)
  # count the number of significant coefficients
  coef.mat = coef(summary(model))
  sig.level = 0.05
  coef.p = coef.mat[, 4]
  # remove the intercept
  if (names(coef.p)[1] == "(Intercept)")
    coef.p = coef.p[-1]
  # when p >= N the standard errors and p-values are NaN;
  # treat those coefficients as not significant
  coef.p[which(is.nan(coef.p))] = 1
  sig.num = sum(coef.p < sig.level)
  cat("p = ", p, " sig.num = ", sig.num, "\n")
}
```
In fact, when $p = 100$ or $p = 1000$ (with $N = 100$), the regression on all features interpolates the data: the residuals are essentially zero and the estimated standard errors are all NaN, so counting "significant" coefficients is not very meaningful here. This suggests that the book is actually referring to univariate regression models.
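To see this directly (an illustrative check; the seed is arbitrary), fit the full regression with $p = N = 100$ and inspect the residual degrees of freedom, the residuals, and the residual standard error:

```r
# with p >= N the least-squares fit interpolates the data
set.seed(123)
X = genX(100)
Y = genY(X)$Y
fit = lm(Y ~ X)
df.residual(fit)          # 0 residual degrees of freedom
max(abs(residuals(fit)))  # numerically zero residuals
summary(fit)$sigma        # NaN residual standard error
```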
## Univariate linear regression
```r
for (p in c(20, 100, 1000)) {
  num = 0
  # repeat the simulation 100 times
  for (j in 1:100) {
    X = genX(p)
    Y = genY(X)$Y
    # regress Y on each feature separately
    for (i in 1:p) {
      model = lm(Y ~ 1 + X[, i])
      coef.mat = coef(summary(model))
      if (coef.mat[2, 4] < 0.05)
        num = num + 1
    }
  }
  # average number of significant univariate coefficients
  num = num / 100
  cat("p = ", p, " num of significant terms: ", num, "\n")
}
```
Averaged over 100 simulations, the numbers of significant univariate regression coefficients are 9.55, 29.82 and 290.9, respectively, which are quite close to the values 9, 33 and 331 reported in the book.
## Ridge regression
```r
test.err.full = matrix(0, nrow = 0, ncol = 3)
epsi.var.full = c()
for (p in c(20, 100, 1000)) {
  X = genX(p)
  Y.res = genY(X)
  Y = Y.res$Y
  epsi.var = Y.res$epsi.var
  epsi.var.full = c(epsi.var.full, epsi.var)
  model = lm.ridge(Y ~ X, lambda = c(0.001, 100, 1000))
  # calculate test error
  # (an earlier attempt on a batch test set)
  #X.test = genX(p)
  #Y.test = genY(X.test)
  #pred = cbind(1, X.test) %*% t(coef(model))
  #test.err = numeric(3)
  #for (i in 1:ncol(pred))
  #{
  #  test.err[i] = sum((Y.test - pred[,i])^2)/nrow(pred)
  #}
  #test.err.full = rbind(test.err.full, test.err)
  # evaluate on 100 freshly generated test points
  for (j in 1:100) {
    x.test = genX(p, N = 1)
    # NOTE: a fresh coefficient vector rnorm(p) is drawn here, not the beta
    # used inside genY to generate the training data
    y.test = sum(x.test * rnorm(p)) + rnorm(1, 0, sqrt(epsi.var))
    pred = as.matrix(cbind(1, x.test)) %*% t(coef(model))
    test.err = (pred - y.test)^2
    test.err.full = rbind(test.err.full, test.err)
  }
}
# test errors for p = 20, 100, 1000, respectively
t1 = test.err.full[1:100, ]
t2 = test.err.full[101:200, ]
t3 = test.err.full[201:300, ]
```
Plot the test errors relative to the noise variance:
```r
boxplot(t1 / epsi.var.full[1], main = "20 features", col = "green")
boxplot(t2 / epsi.var.full[2], main = "100 features", col = "green")
boxplot(t3 / epsi.var.full[3], main = "1000 features", col = "green")
```
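As a numerical complement to the boxplots, the median relative test error for each $\lambda$ (columns correspond to $\lambda = 0.001, 100, 1000$) can be read off directly:

```r
# median relative test error per lambda, for p = 20, 100, 1000 respectively
apply(t1 / epsi.var.full[1], 2, median)
apply(t2 / epsi.var.full[2], 2, median)
apply(t3 / epsi.var.full[3], 2, median)
```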
The results differ noticeably from Figure 18.1 in the book; the source of the discrepancy still needs to be tracked down.
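One possible contributor, flagged in the NOTE comment above, is that each test response is generated with a fresh coefficient vector `rnorm(p)` rather than the $\beta$ drawn inside `genY` for the training data. The following is only a sketch of a consistent version under that hypothesis (the seed and names such as `sigma2` are ad hoc); it reuses a single $\beta$ for both training and test data instead of calling `genY`:

```r
# sketch for p = 20: training and test data share the same beta
set.seed(123)
p = 20
X = genX(p)
beta = rnorm(p)
fX = as.vector(X %*% beta)
sigma2 = var(fX) / 2                      # signal-to-noise ratio 2, as in genY
Y = fX + rnorm(nrow(X), 0, sqrt(sigma2))
model = lm.ridge(Y ~ X, lambda = c(0.001, 100, 1000))
x.test = genX(p)                          # 100 test points
y.test = as.vector(x.test %*% beta) + rnorm(nrow(x.test), 0, sqrt(sigma2))
pred = cbind(1, x.test) %*% t(coef(model))
# relative test error for each lambda
colMeans((pred - y.test)^2) / sigma2
```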