11.2 Data Mining vs. Statistical Modeling
Data mining and standard statistical modeling are complementary but distinct approaches to data analysis. Understanding when to use each is essential for BI practitioners.
| | Statistical Modeling | Data Mining |
|---|---|---|
| Starting point | A specific hypothesis to test | Exploration without a predetermined hypothesis |
| Approach | Hypothesis-driven, confirmatory | Data-driven, exploratory |
| Data requirements | Structured, cleaned, often smaller | Can handle large, messy, unstructured data |
| Output | Parameter estimates, confidence intervals, p-values | Patterns, clusters, rules, predictions |
| Strength | Inferential rigor — quantifies uncertainty | Discovery — finds what you did not know to look for |
| Risk | May miss patterns not included in the hypothesis | May find spurious patterns that do not generalize |
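The data mining risk in the last row can be illustrated with a small simulation. The sketch below is illustrative only: the data are pure random noise, and the names `pearson`, `n_rows`, and `n_features` are assumptions introduced here, not part of any dataset used in this textbook. It screens many unrelated features against a target, keeps the strongest correlation found in one half of the rows, and then re-measures that same feature on the other half.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)  # fixed seed so the run is repeatable
n_rows, n_features = 60, 200  # hypothetical sizes, chosen for illustration

# Target and features are independent noise: no real relationship exists.
target = [random.gauss(0, 1) for _ in range(n_rows)]
features = [[random.gauss(0, 1) for _ in range(n_rows)]
            for _ in range(n_features)]

# Exploratory step: search the first half of the rows for the best feature.
half = n_rows // 2
train_r = [pearson(f[:half], target[:half]) for f in features]
best = max(range(n_features), key=lambda j: abs(train_r[j]))

# Validation step: re-measure the selected feature on the untouched half.
holdout_r = pearson(features[best][half:], target[half:])

print(f"best feature on exploration half: r = {train_r[best]:.2f}")
print(f"same feature on holdout half:     r = {holdout_r:.2f}")
```

Because the search looks at 200 candidates, some feature will usually show a sizeable correlation on the exploration half by chance alone, and that correlation typically shrinks toward zero on the holdout half. Holding data back in this way is one common guard against patterns that do not generalize.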
In practice, the two approaches work together. Data mining can uncover patterns and generate new hypotheses from large-scale data, which can then be tested and validated through statistical modeling. A clustering algorithm might reveal an unexpected customer segment; a regression model can then quantify how that segment differs from others and test whether the difference is statistically significant (Hastie et al. 2009).
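The workflow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed method: the customer data are synthetic, the helper names `two_means_1d` and `segment_effect` are hypothetical, the clustering is a bare-bones one-dimensional k-means with two clusters, and the p-value uses a normal approximation to the t distribution, which is reasonable only for samples of this size or larger.

```python
import random
from statistics import NormalDist, mean

def two_means_1d(values, iters=20):
    """Tiny 1-D k-means with k=2; returns the (low, high) centroids."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        low_group = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high_group = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not low_group or not high_group:
            break
        lo, hi = mean(low_group), mean(high_group)
    return lo, hi

def segment_effect(indicator, outcome):
    """OLS of outcome on a 0/1 indicator: slope, standard error, approx. p-value."""
    g0 = [y for d, y in zip(indicator, outcome) if d == 0]
    g1 = [y for d, y in zip(indicator, outcome) if d == 1]
    m0, m1 = mean(g0), mean(g1)
    slope = m1 - m0  # with a binary regressor, the OLS slope is the mean difference
    ssr = sum((y - m0) ** 2 for y in g0) + sum((y - m1) ** 2 for y in g1)
    resid_var = ssr / (len(outcome) - 2)
    se = (resid_var * (1 / len(g0) + 1 / len(g1))) ** 0.5
    z = slope / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # normal approximation (assumption)
    return slope, se, p

# Synthetic customers (assumed for illustration): a minority visits often
# and also spends more per order.
random.seed(0)
customers = []
for _ in range(400):
    frequent = random.random() < 0.3
    visits = random.gauss(12 if frequent else 3, 1.5)
    spend = random.gauss(80 if frequent else 50, 15)
    customers.append((visits, spend))

# Step 1 (data mining): discover segments from visits on an exploration sample.
explore, confirm = customers[:200], customers[200:]
lo, hi = two_means_1d([v for v, _ in explore])

# Step 2 (statistical modeling): on separate customers, assign each to the
# nearest centroid and test whether spend differs between the segments.
indicator = [1 if abs(v - hi) < abs(v - lo) else 0 for v, _ in confirm]
slope, se, p = segment_effect(indicator, [s for _, s in confirm])

print(f"centroids: low = {lo:.1f}, high = {hi:.1f} visits")
print(f"spend difference = {slope:.1f} (SE {se:.1f}), approx. p = {p:.3g}")
```

The segments are found on one sample and the difference is tested on another. Testing on the same rows that produced the clusters would tend to overstate significance, since the clusters were chosen to separate those rows in the first place.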
This textbook uses both approaches. The regression models in Chapter 10 are hypothesis-driven statistical models. The anomaly detection techniques in Chapter 12 are data mining — searching for unusual observations without specifying in advance what “unusual” looks like.