2.2 The Dataset
The Absenteeism at Work dataset is publicly available from the UCI Machine Learning Repository (Martiniano et al. 2018). It contains 740 records and 21 attributes (variables).
Each record represents a single absence event: one instance of one employee missing work. This distinction matters, because the dataset has one row per absence, not one row per employee. An employee who was absent several times during the three-year period covered by the data appears in several rows. This structure allows us to analyze individual absence events in detail, including their timing, duration, and stated reason.
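The one-row-per-absence structure can be checked by counting rows per employee identifier. The sketch below is illustrative only: the rows in SAMPLE_CSV are invented stand-ins rather than records from the dataset, and the column names ("ID", "Reason for absence", "Absenteeism time in hours") and the semicolon delimiter are assumptions about how the UCI file is laid out.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the real file: invented rows, NOT actual dataset records.
# Column names and the ';' delimiter are assumptions about the UCI CSV layout.
SAMPLE_CSV = """ID;Reason for absence;Absenteeism time in hours
1;23;4
2;19;8
1;28;2
3;13;16
1;23;1
"""


def absences_per_employee(csv_file):
    """Count how many rows (absence events) belong to each employee ID."""
    reader = csv.DictReader(csv_file, delimiter=";")
    return Counter(row["ID"] for row in reader)


counts = absences_per_employee(io.StringIO(SAMPLE_CSV))
n_rows = sum(counts.values())   # total absence events
n_employees = len(counts)       # distinct employees

print(f"{n_rows} absence events across {n_employees} employees")
print("Most frequent employee ID and row count:", counts.most_common(1))
```

In the toy sample, more rows than distinct IDs means some employees appear more than once. Running the same function on the real file (opened with open(path, newline="")) would show whether the same holds there.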
The 21 attributes include an employee identifier; the remaining 20 capture information across several dimensions:
- The outcome we want to understand: how many hours the employee was absent
- Employee demographics: age, education level, number of children, number of pets
- Health indicators: body mass index, height, weight, whether the employee is a social drinker or smoker
- Work factors: distance to work, transportation expense, years of service, daily workload, whether the employee hit performance targets, disciplinary history
- Context of the absence: the stated reason (using medical classification codes), the month, day of the week, and season
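The grouping above can be written down as a small lookup table, which is handy later for selecting related columns together. This is a sketch under assumptions: the column names are a guess at the headers in the UCI file (exact spelling and whitespace may differ), and the separate "ID" column is assumed to be the employee identifier that accounts for the 21st attribute.

```python
# Hypothetical mapping from the dimensions listed above to column names.
# The names are assumptions about the UCI CSV headers, not confirmed by the text.
ATTRIBUTE_GROUPS = {
    "outcome": ["Absenteeism time in hours"],
    "demographics": ["Age", "Education", "Son", "Pet"],
    "health": [
        "Body mass index", "Height", "Weight",
        "Social drinker", "Social smoker",
    ],
    "work": [
        "Distance from Residence to Work", "Transportation expense",
        "Service time", "Work load Average/day",
        "Hit target", "Disciplinary failure",
    ],
    "context": [
        "Reason for absence", "Month of absence",
        "Day of the week", "Seasons",
    ],
}
IDENTIFIER = "ID"  # assumed name of the employee identifier column


def all_columns(groups, identifier):
    """Flatten the groups into one list, with the identifier first."""
    return [identifier] + [col for cols in groups.values() for col in cols]


columns = all_columns(ATTRIBUTE_GROUPS, IDENTIFIER)
grouped_count = len(columns) - 1  # attributes covered by the five dimensions

print(f"{grouped_count} grouped attributes + 1 identifier = {len(columns)}")
```

Under these assumptions, the five dimensions cover 20 attributes, and the identifier brings the total to the 21 reported for the dataset.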