11.6 Summary

This chapter introduced data mining as the exploratory counterpart to the hypothesis-driven statistical modeling covered in Chapter 9. We distinguished five core techniques (classification, clustering, association rule learning, regression, and anomaly detection), each suited to a different type of business question. We examined the critical differences between data mining and statistical modeling, emphasizing that the two approaches are complementary: data mining discovers candidate patterns, and statistical modeling tests whether those patterns are real or merely artifacts of chance.

We also addressed the challenges that accompany data mining’s power: dependence on data quality, privacy obligations, algorithmic bias, the risk of overfitting, and the growing demand for transparency. AI and AutoML are making these techniques more accessible, but the ethical and interpretive responsibilities remain firmly with the analyst.

In the next chapter, we apply anomaly detection to the Absenteeism at Work dataset, using linear regression residuals, k-nearest neighbors, and random forests to identify employees with unusual absence patterns.