9.5 Building and Training Models
9.5.1 Model Selection: Parsimony and Generalizability
Two principles guide model selection:
Parsimony. Prefer the simplest model that adequately captures the patterns in the data. Simpler models are easier to interpret, less prone to overfitting, and more likely to generalize to new data. The goal is not to oversimplify but to avoid unnecessary complexity — a linear regression that explains 85% of variance is often more useful than a neural network that explains 88% but cannot be explained to stakeholders (Molnar 2022).
Generalizability. A model’s true test is its performance on data it has never seen. A model that fits the training data perfectly but fails on new data has overfit — it learned the noise in the training data rather than the underlying signal. The tension between fitting the data well and generalizing to new data is the central challenge of model building.
9.5.2 The Training Process
Model training is the process of fitting a model to historical data so it can learn the underlying patterns (Kelleher et al. 2015). The key steps are:
- Split the data into training and test sets. A common split is 80% training, 20% test. The test set is held back and never used during training.
- Select features — the input variables the model will use. Feature selection and engineering can significantly impact performance.
- Choose an algorithm appropriate to the problem: linear regression for continuous outcomes, logistic regression for binary outcomes, decision trees or random forests for complex nonlinear patterns.
- Tune hyperparameters — settings that control the learning process (e.g., the regularization strength in a regression, the depth of a decision tree). This is typically done using cross-validation on the training set.
- Fit the model to the training data, allowing it to adjust its parameters to minimize prediction error.
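The cross-validation step mentioned above can be sketched in base R. The example below is a minimal illustration, not part of the chapter's own code: it treats the polynomial degree of a single predictor (`wt` in mtcars) as the hyperparameter and selects the degree with the lowest average 5-fold RMSE on the training portion only, so the test set stays untouched.

```r
# Hypothetical sketch: tune a polynomial degree via 5-fold cross-validation
# on the training set, keeping the test set held out entirely.
set.seed(42)
n <- nrow(mtcars)
train_idx <- sample(1:n, size = floor(0.8 * n))
train_data <- mtcars[train_idx, ]

k <- 5
# Assign each training row to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(train_data)))

cv_rmse <- sapply(1:3, function(degree) {
  fold_err <- sapply(1:k, function(f) {
    # Fit on all folds except f, evaluate on fold f
    fit  <- lm(mpg ~ poly(wt, degree), data = train_data[folds != f, ])
    pred <- predict(fit, newdata = train_data[folds == f, ])
    sqrt(mean((train_data$mpg[folds == f] - pred)^2))
  })
  mean(fold_err)
})

best_degree <- which.min(cv_rmse)
```

The degree with the lowest cross-validated RMSE would then be used to refit the model on the full training set before the final test-set evaluation.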
9.5.3 Overfitting and Underfitting
Overfitting occurs when a model is too complex for the data — it memorizes the training set, including its noise, and performs poorly on new data. Underfitting occurs when a model is too simple — it fails to capture the real patterns. The goal is the sweet spot between these extremes.
A practical demonstration with mtcars:
```r
# Split data into training (80%) and test (20%)
set.seed(42)
n <- nrow(mtcars)
train_idx <- sample(1:n, size = floor(0.8 * n))
train_data <- mtcars[train_idx, ]
test_data <- mtcars[-train_idx, ]

# Fit a simple linear model: MPG predicted by weight
simple_model <- lm(mpg ~ wt, data = train_data)

# Fit a more complex model: MPG predicted by all variables
complex_model <- lm(mpg ~ ., data = train_data)

# Compare model performance on the held-out test set
simple_pred <- predict(simple_model, test_data)
complex_pred <- predict(complex_model, test_data)

data.frame(
  Model = c("Simple (mpg ~ wt)", "Complex (mpg ~ all)"),
  R2_Training = c(summary(simple_model)$r.squared,
                  summary(complex_model)$r.squared) |> round(3),
  RMSE_Test = c(sqrt(mean((test_data$mpg - simple_pred)^2)),
                sqrt(mean((test_data$mpg - complex_pred)^2))) |> round(2)
)
```

| Model | R2_Training | RMSE_Test |
|---|---|---|
| Simple (mpg ~ wt) | 0.776 | 4.08 |
| Complex (mpg ~ all) | 0.902 | 4.88 |
The complex model achieves a higher R² on the training data (0.902 vs. 0.776) yet a worse RMSE on the held-out test set (4.88 vs. 4.08). It fits the training data more closely but generalizes less well, a classic illustration of why parsimony matters.
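The demonstration above shows the overfitting side of the spectrum; the underfitting side can be shown with the same split. The sketch below is a hypothetical addition: an intercept-only model (`mpg ~ 1`) ignores every predictor and simply predicts the training mean, so it underfits by construction, with a training R² of exactly zero and a poor test RMSE.

```r
# Hypothetical sketch: an intercept-only model underfits by construction.
set.seed(42)
n <- nrow(mtcars)
train_idx <- sample(1:n, size = floor(0.8 * n))
train_data <- mtcars[train_idx, ]
test_data <- mtcars[-train_idx, ]

# mpg ~ 1 fits only an intercept, i.e. the mean of mpg in the training set
underfit_model <- lm(mpg ~ 1, data = train_data)
underfit_pred <- predict(underfit_model, test_data)

underfit_r2 <- summary(underfit_model)$r.squared           # 0 by construction
underfit_rmse <- sqrt(mean((test_data$mpg - underfit_pred)^2))
```

Comparing this model's test RMSE against the simple and complex models above locates the "sweet spot" between the two extremes.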