We use USArrests, a dataset built into R, to demonstrate principal component analysis (PCA).
The rows of the data are the 50 US states:
states = row.names(USArrests)
states
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
There are four features:
names(USArrests)
## [1] "Murder" "Assault" "UrbanPop" "Rape"
First compute the mean and variance of each of the four variables; they differ greatly across variables:
apply(USArrests, 2, mean)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
apply(USArrests, 2, var)
## Murder Assault UrbanPop Rape
## 18.97047 6945.16571 209.51878 87.72916
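Scaling the data is therefore necessary; otherwise most of the principal components would be dominated by Assault, whose variance is by far the largest.
Run PCA with the following command: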
pr.out = prcomp(USArrests, scale = TRUE)
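To see why scaling matters, one can fit the same model without scaling and inspect the first loading vector. This is a minimal sketch for comparison only; the name pr.raw is introduced here for illustration and is not part of the original lab.

```r
# Fit PCA on the raw (centered but unscaled) data, for comparison only
pr.raw = prcomp(USArrests, scale = FALSE)

# Loadings of the first principal component; the entry with the largest
# absolute value shows which variable drives this component
round(pr.raw$rotation[, 1], 3)

# Proportion of variance captured by the first unscaled component
pr.raw$sdev[1]^2 / sum(pr.raw$sdev^2)
```

Without scaling, the first component should load almost entirely on Assault, the variable with the largest variance.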
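By default, prcomp() centers the data, but the variables are scaled to standard deviation 1 only when scale=TRUE is specified.
The returned object contains the following components: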
names(pr.out)
## [1] "sdev" "rotation" "center" "scale" "x"
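Here center and scale hold the variable means and standard deviations that were used for centering and scaling.
rotation holds the principal component loadings. Note that in general there are \(\min(n-1,p)\) principal components.
We do not need to compute the scores separately, because pr.out$x already contains them.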
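The relationship between center, scale, rotation, and x can be checked by hand. The sketch below standardizes the data with the stored means and standard deviations, multiplies by the loading matrix, and compares the result with pr.out$x; the names X and scores are introduced here for illustration only.

```r
pr.out = prcomp(USArrests, scale = TRUE)

# center and scale are the column means and standard deviations
all.equal(pr.out$center, colMeans(USArrests))
all.equal(pr.out$scale, apply(USArrests, 2, sd))

# Standardize by hand, then project onto the loading vectors
X = scale(USArrests, center = pr.out$center, scale = pr.out$scale)
scores = X %*% pr.out$rotation

# The hand-computed scores should match pr.out$x
all.equal(unname(scores), unname(pr.out$x))

# Here n = 50 and p = 4, so there are min(n - 1, p) = 4 components
dim(pr.out$rotation)
```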
The following command plots the first two principal components; scale = 0 makes the arrows represent the loadings:
biplot(pr.out, scale = 0)
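Principal components are determined only up to a sign flip, so the biplot may appear mirrored relative to plots from other sources. As a sketch, assuming a mirrored picture is wanted, one can negate both the loadings and the scores on a copy of the fit; pr.flip is a name introduced here for illustration.

```r
pr.out = prcomp(USArrests, scale = TRUE)

# Copy the fit and flip the sign of loadings and scores together
pr.flip = pr.out
pr.flip$rotation = -pr.out$rotation
pr.flip$x = -pr.out$x

# Same components, mirrored through the origin
biplot(pr.flip, scale = 0)
```

Flipping the signs changes neither the variance of each component nor the directions it spans, only the orientation of the plot.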
Next compute the proportion of variance explained by each principal component:
pr.var = pr.out$sdev^2
pve = pr.var / sum(pr.var)
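As a sanity check on these quantities: because the data were scaled, the component variances should coincide with the eigenvalues of the sample correlation matrix, and they should sum to the number of variables. The sketch below verifies this; the name ev is introduced here for illustration.

```r
pr.out = prcomp(USArrests, scale = TRUE)
pr.var = pr.out$sdev^2
pve = pr.var / sum(pr.var)

# Eigenvalues of the correlation matrix, in decreasing order
ev = eigen(cor(USArrests))$values
all.equal(pr.var, ev)

# Each scaled variable has variance 1, so the total variance is p = 4
sum(pr.var)

# The proportions sum to 1 by construction
sum(pve)
```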
Then plot the proportion of variance explained (PVE) and the cumulative proportion of variance explained (cumulative PVE):
plot(pve, xlab = "PC", ylab = "PVE", ylim = c(0, 1), type = 'b')
plot(cumsum(pve), xlab = "PC", ylab = "CPVE", ylim = c(0,1), type = 'b')
Copyright © 2016-2019 weiya