# 8.8 Model Averaging and Stacking

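In Section 8.4 we viewed the bootstrap values of an estimator as approximate posterior values of the corresponding parameter, from a kind of nonparametric Bayesian analysis. Viewed in this way, the bagged estimate \eqref{8.51} is an approximate posterior Bayesian mean. In contrast, the training sample estimate $\hat f(x)$ corresponds to the mode of the posterior. Since the posterior mean (not the mode) minimizes squared-error loss, it is not surprising that bagging can often reduce mean squared error.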
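The claim that the posterior mean minimizes squared-error loss can be checked with a short decomposition. The symbols below are introduced here for illustration only and are not taken from the text: $\theta$ stands for the unknown quantity (for example $f(x)$ at a fixed $x$), $\mathbf Z$ for the training data, and $a$ for any candidate point estimate.

$$
\begin{aligned}
\mathrm{E}\big[(\theta - a)^2 \mid \mathbf Z\big]
&= \mathrm{E}\big[(\theta - \mathrm{E}[\theta\mid\mathbf Z])^2 \mid \mathbf Z\big]
 + \big(\mathrm{E}[\theta\mid\mathbf Z] - a\big)^2 \\
&= \mathrm{Var}(\theta\mid\mathbf Z) + \big(\mathrm{E}[\theta\mid\mathbf Z] - a\big)^2 .
\end{aligned}
$$

The cross term vanishes because $\mathrm{E}\big[\theta - \mathrm{E}[\theta\mid\mathbf Z] \mid \mathbf Z\big] = 0$. The first term does not depend on $a$, so the expected loss is smallest at $a = \mathrm{E}[\theta\mid\mathbf Z]$. Any other choice, including the posterior mode when it differs from the mean, pays the extra squared distance to the mean.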

weiya's note: Stacked Generalization

Abstract from the seminal paper Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259:

This paper introduces stacked generalization, a scheme for minimizing the generalization error rate of one or more generalizers. Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set (here, $\hat f_m^{-i}(x_i)$) and trying to guess the rest of it (here, $y_i$), and whose output is (for example) the correct guess. When used with multiple generalizers, stacked generalization can be seen as a more sophisticated version of cross-validation, exploiting a strategy more sophisticated than cross-validation’s crude winner-takes-all for combining the individual generalizers (cross-validation selects the model with the lowest error, whereas here the models are weighted, but is interpretability lost?). When used with a single generalizer, stacked generalization is a scheme for estimating (and then correcting for) the error of a generalizer which has been trained on a particular learning set and then asked a particular question. After introducing stacked generalization and justifying its use, this paper presents two numerical experiments. The first demonstrates how stacked generalization improves upon a set of separate generalizers for the NETtalk task of translating text to phonemes. The second demonstrates how stacked generalization improves the performance of a single surface-fitter. With the other experimental evidence in the literature, the usual arguments supporting cross-validation, and the abstract justifications presented in this paper, the conclusion is that for almost any real-world generalization problem one should use some version of stacked generalization to minimize the generalization error rate. This paper ends by discussing some of the variations of stacked generalization, and how it touches on other fields like chaos theory.
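To make the roles of $\hat f_m^{-i}(x_i)$ and $y_i$ concrete, here is a minimal sketch of stacking with leave-one-out predictions. Several choices are assumptions made only for illustration and do not come from the abstract: the base models are polynomial fits of different degrees, the data are simulated, the second-level learner is an unconstrained least-squares regression of $y_i$ on the leave-one-out predictions, and all function names (`fit_poly`, `loo_predictions`, `stacking_weights`, `stacked_predict`) are hypothetical.

```python
import numpy as np

def fit_poly(x, y, degree):
    """Fit a least-squares polynomial and return a prediction function."""
    coefs = np.polyfit(x, y, degree)
    return lambda x_new: np.polyval(coefs, x_new)

def loo_predictions(x, y, degrees):
    """Entry (i, m) is f_m^{-i}(x_i): model m fit without point i, evaluated at x_i."""
    n = len(x)
    preds = np.empty((n, len(degrees)))
    for i in range(n):
        keep = np.arange(n) != i  # drop the i-th observation
        for m, d in enumerate(degrees):
            preds[i, m] = fit_poly(x[keep], y[keep], d)(x[i])
    return preds

def stacking_weights(loo_preds, y):
    """Regress y_i on the leave-one-out predictions (no constraints on the weights)."""
    w, *_ = np.linalg.lstsq(loo_preds, y, rcond=None)
    return w

# Simulated data (an assumption for this sketch).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(3.0 * x) + 0.3 * rng.standard_normal(x.size)

degrees = [0, 1, 3]                 # three illustrative base models
F = loo_predictions(x, y, degrees)  # first-level "guesses" on held-out points
w = stacking_weights(F, y)          # second-level weights

# Final predictor: base models refit on all data, combined with the stacking weights.
models = [fit_poly(x, y, d) for d in degrees]

def stacked_predict(x_new):
    return sum(w_m * f(x_new) for w_m, f in zip(w, models))

loo_sse_stacked = np.sum((y - F @ w) ** 2)
loo_sse_single = [np.sum((y - F[:, m]) ** 2) for m in range(len(degrees))]
```

Because putting all the weight on one model is a special case of the weighted combination, `loo_sse_stacked` can never exceed the smallest entry of `loo_sse_single`. This is the sense in which stacking refines the winner-takes-all choice made by cross-validation. In practice the weights are often constrained (for example to be nonnegative), which this sketch does not do.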