12 Case Study: Anomaly Detection

PRELIMINARY AND INCOMPLETE

Understanding patterns in employee behavior and identifying anomalies can help organizations operate more efficiently and manage their workforce. Anomalies, or outliers, often reveal significant but previously unnoticed events, such as data entry errors, fraud, or unusual employee behavior. This chapter applies anomaly detection techniques to the “Absenteeism at Work” dataset, which records employee absences within a workplace. Using statistical and machine learning methods, we aim to uncover unusual patterns and better understand the factors driving employee absenteeism.

In this case study, we examine three anomaly detection methods, each taking a different approach to identifying outliers:

  1. Linear Regression Model with Examination of Residuals: We begin with a linear regression that predicts absenteeism hours from predictors such as the reason for absence, time of year, employee workload, and personal or family responsibilities. The residuals from this model (the differences between observed and predicted values) identify data points that diverge substantially from the predicted pattern and are therefore potential anomalies.

  2. K-Nearest Neighbors (k-NN): The k-NN method detects anomalies by examining the local neighborhood of each data point. In this approach, an anomaly is an observation that lies unusually far from its nearest neighbors. Parameters of the algorithm, such as the number of neighbors, determine how local the comparison is and therefore what counts as normal or abnormal absenteeism behavior.

  3. Random Forest: Random Forest, an ensemble method, will be used to classify observations and detect outliers in the absenteeism data. By building many decision trees and aggregating their predictions about whether each data point is normal or abnormal, Random Forest offers a consensus approach that can make anomaly detection more reliable.
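
To make the first two ideas concrete, here is a minimal sketch of a residual-based score and a k-NN distance score. It rests on several assumptions: the data are synthetic stand-ins rather than the “Absenteeism at Work” dataset, the function names `residual_scores` and `knn_scores` are hypothetical, and the scoring rules (absolute standardized residual, distance to the k-th nearest neighbor) are common choices rather than necessarily the ones used later in this chapter. The random forest approach is not sketched here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the Absenteeism at Work dataset):
# two hypothetical predictors and an outcome measured in hours.
n = 200
X = rng.normal(size=(n, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Plant one obvious anomaly in row 0 so both scores have something to find.
y[0] += 30.0


def residual_scores(X, y):
    """Absolute standardized residuals from an ordinary least squares fit."""
    A = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    # Scale by the residual standard error (ddof = number of coefficients).
    return np.abs(resid) / resid.std(ddof=A.shape[1])


def knn_scores(Z, k=5):
    """Euclidean distance from each row of Z to its k-th nearest neighbor."""
    diffs = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    dist_sorted = np.sort(dist, axis=1)  # column 0 is the distance to itself
    return dist_sorted[:, k]


res = residual_scores(X, y)
# The k-NN score is computed on predictors and outcome jointly.
# Real data would usually be standardized first so no column dominates.
knn = knn_scores(np.column_stack([X, y]), k=5)

print("row with largest residual score:", int(np.argmax(res)))
print("row with largest k-NN score:", int(np.argmax(knn)))
```

In this synthetic setup the planted observation in row 0 should receive the highest score under both measures. With real data the two scores can disagree, since the residual score measures departure from a fitted global trend while the k-NN score measures isolation from nearby observations.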