An average of several
measurements is often more accurate than a single measurement. This
happens when the errors of individual measurements more often cancel
each other than reinforce each other. An average is also more stable
than an individual measurement: if several sets of measurements were
made on the same object, their averages would resemble one another more
closely than the individual measurements within a single set do.
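The effect is easy to see in a small simulation. The sketch below (assuming NumPy; the true value, noise level, and set sizes are made-up choices for illustration) measures one object many times and compares the spread of individual measurements with the spread of the set averages.

    import numpy as np

    rng = np.random.default_rng(0)
    true_value = 10.0                     # the quantity being measured
    n_sets, n_per_set = 1000, 5           # many sets of five measurements each

    # Each measurement is the true value plus independent noise,
    # so errors cancel more often than they reinforce in an average.
    measurements = true_value + rng.normal(0.0, 1.0, size=(n_sets, n_per_set))
    set_averages = measurements.mean(axis=1)

    print("spread of individual measurements:", measurements.std())
    print("spread of set averages:           ", set_averages.std())

With independent noise, the averages of five measurements vary only about 1/sqrt(5) as much as the individual measurements, which is the stability described above.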
A similar phenomenon
exists for predictive models: a weighted average of predictions is
often more accurate and more stable than an individual model prediction.
Though similar to what happens with measurements, this phenomenon is
less common and more surprising. A model expresses a relationship
between the inputs and a target, so it seems surprising that a better
relationship exists than any single model can capture: combining the
models must produce a relationship that no individual model can express.
An algorithm for training
a model assumes some form of the relationship between the inputs and
the target. Linear regression assumes a linear relation. Tree-based
models assume a constant relation within ranges of the inputs. Neural
networks assume a nonlinear relationship that depends on the architecture
and activation functions chosen for the network.
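To make the differing assumptions concrete, the sketch below (assuming scikit-learn and an invented one-dimensional dataset) fits a linear model, a regression tree, and a small neural network to the same data and prints their predictions on a grid: the first traces a line, the second a step function that is constant within ranges of the input, and the third a smooth nonlinear curve.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 6.0, 200)).reshape(-1, 1)
    y = np.sin(x).ravel() + rng.normal(0.0, 0.2, 200)   # a nonlinear target with noise

    models = {
        "linear regression": LinearRegression(),
        "regression tree  ": DecisionTreeRegressor(max_depth=3),
        "neural network   ": MLPRegressor(hidden_layer_sizes=(20,),
                                          max_iter=5000, random_state=0),
    }

    grid = np.linspace(0.0, 6.0, 13).reshape(-1, 1)
    for name, model in models.items():
        model.fit(x, y)
        # Each fitted model imposes its own assumed form on the relationship.
        print(name, np.round(model.predict(grid), 2))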
Combining predictions
from two different algorithms might produce a relationship of a different
form than either algorithm assumes. If two models specify different
relationships and fit the data well, their average is apt to fit the
data better. If not, an individual model is apt to be adequate. In
practice, the best way to know is to combine some models and compare
the results.
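As one way to run that comparison, the sketch below (assuming scikit-learn; the data and the two chosen models are illustrative stand-ins) averages the held-out predictions of a linear model and a tree and compares the error of the average with the error of each model alone.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(1)
    X = rng.uniform(-3.0, 3.0, size=(400, 2))
    y = X[:, 0] + np.sin(2.0 * X[:, 1]) + rng.normal(0.0, 0.3, 400)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    linear = LinearRegression().fit(X_train, y_train)
    tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

    pred_linear = linear.predict(X_test)
    pred_tree = tree.predict(X_test)
    pred_average = 0.5 * pred_linear + 0.5 * pred_tree   # equal-weight average

    for name, pred in [("linear", pred_linear), ("tree", pred_tree),
                       ("average", pred_average)]:
        print(name, round(mean_squared_error(y_test, pred), 3))

Whether the average wins depends on the data; the point is that the comparison itself is cheap to run.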
For neural networks,
applying the same algorithm several times to the same data might produce
different models, especially when early stopping is used, because the
fitted network can be sensitive to the random initial weights. Averaging
the predictions of several networks trained with early stopping often
improves predictive accuracy.
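The sketch below (assuming scikit-learn's MLPRegressor; the architecture, data, and number of networks are arbitrary choices for illustration) trains five networks that differ only in their random initial weights, each with early stopping, and averages their held-out predictions.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(2)
    X = rng.uniform(-3.0, 3.0, size=(600, 2))
    y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + rng.normal(0.0, 0.1, 600)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    predictions = []
    for seed in range(5):        # identical networks except for the random seed
        net = MLPRegressor(hidden_layer_sizes=(30,), early_stopping=True,
                           max_iter=2000, random_state=seed)
        net.fit(X_train, y_train)
        pred = net.predict(X_test)
        predictions.append(pred)
        print("network", seed, "MSE:", round(mean_squared_error(y_test, pred), 4))

    ensemble = np.mean(predictions, axis=0)   # unweighted average of the five networks
    print("averaged networks MSE:", round(mean_squared_error(y_test, ensemble), 4))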