11.2 Data Mining vs. Statistical Modeling

Data mining and standard statistical modeling are complementary but distinct approaches to data analysis. Understanding when to use each is essential for BI practitioners.

| Aspect | Statistical Modeling | Data Mining |
|---|---|---|
| Starting point | A specific hypothesis to test | Exploration without a predetermined hypothesis |
| Approach | Hypothesis-driven, confirmatory | Data-driven, exploratory |
| Data requirements | Structured, cleaned, often smaller | Can handle large, messy, unstructured data |
| Output | Parameter estimates, confidence intervals, p-values | Patterns, clusters, rules, predictions |
| Strength | Inferential rigor: quantifies uncertainty | Discovery: finds what you did not know to look for |
| Risk | May miss patterns not included in the hypothesis | May find spurious patterns that do not generalize |
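
The risk of spurious patterns can be made concrete with a small simulation. The sketch below is illustrative only: the data are synthetic noise, and the names (`pearson`, `n_features`, `cutoff`) and the choice of 200 candidate features are assumptions made for the example. It searches many unrelated features for a correlation with an outcome, then re-checks the strongest "discovery" on fresh data.

```python
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(7)
n_obs, n_features = 100, 200

# The outcome and every candidate feature are independent noise,
# so no real relationship exists to be found.
outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
features = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_features)]

# Approximate 5% two-sided cutoff for |r| when the true correlation is zero.
cutoff = 1.96 / n_obs ** 0.5

correlations = [pearson(f, outcome) for f in features]
hits = [j for j, r in enumerate(correlations) if abs(r) > cutoff]
best = max(range(n_features), key=lambda j: abs(correlations[j]))

# Re-check the strongest "pattern" on a fresh sample generated the same way.
fresh_outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
fresh_feature = [rng.gauss(0, 1) for _ in range(n_obs)]
fresh_r = pearson(fresh_feature, fresh_outcome)

print(f"{len(hits)} of {n_features} noise features pass the 5% cutoff")
print(f"strongest in-sample r = {correlations[best]:.2f}, fresh-sample r = {fresh_r:.2f}")
```

With a 5% cutoff and 200 independent tries, roughly ten false positives are expected on average, which is why patterns found by search usually need confirmation on data that played no part in the search.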

In practice, the two approaches work together. Data mining can uncover patterns and generate new hypotheses from large-scale data, which can then be tested and validated through statistical modeling. A clustering algorithm might reveal an unexpected customer segment; a regression model can then quantify how that segment differs from others and test whether the difference is statistically significant (Hastie et al. 2009).
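
The mine-then-test workflow can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the customers are synthetic, the helper names (`kmeans`, `nearest`, `ols_on_indicator`) are hypothetical, the p-value uses a normal approximation to the t distribution, and features are clustered unscaled for brevity. Splitting the data so that the test runs on customers not used for clustering is a design choice of this sketch, not something the text prescribes.

```python
import random
from statistics import NormalDist, mean

def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [tuple(mean(dim) for dim in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

def ols_on_indicator(x, y):
    """OLS of y on a 0/1 indicator x; returns slope, standard error, p-value."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    # Normal approximation to the t distribution; reasonable for large n.
    p = 2 * (1 - NormalDist().cdf(abs(slope / se)))
    return slope, se, p

# Synthetic customers (hypothetical):
# ((visits per month, average basket), annual spend).
rng = random.Random(42)
customers = []
for _ in range(200):
    customers.append(((rng.gauss(2, 0.5), rng.gauss(80, 10)), rng.gauss(500, 100)))
    customers.append(((rng.gauss(8, 1.0), rng.gauss(20, 5)), rng.gauss(650, 100)))
rng.shuffle(customers)
explore, holdout = customers[:200], customers[200:]

# Step 1 (data mining): discover segments on the exploration half only.
centroids = kmeans([features for features, _ in explore], k=2)

# Step 2 (statistical modeling): test the spend difference on held-out customers.
segment = [nearest(features, centroids) for features, _ in holdout]
spend = [s for _, s in holdout]
slope, se, p_value = ols_on_indicator(segment, spend)

print(f"spend difference = {slope:.1f} "
      f"(95% CI half-width {1.96 * se:.1f}), p = {p_value:.3g}")
```

Because the regressor is a single 0/1 indicator, the slope equals the difference in mean spend between the two discovered segments, and its standard error yields the confidence interval and p-value that clustering alone does not provide. The sign of the slope depends on which cluster happens to be labeled 1.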

This textbook uses both approaches. The regression models in Chapter 10 are hypothesis-driven statistical models. The anomaly detection techniques in Chapter 12 are data mining techniques: they search for unusual observations without specifying in advance what “unusual” looks like.