2 Case Study: Introducing the Data
PRELIMINARY AND INCOMPLETE
This book comprises a series of chapters on the concepts, techniques, and strategies essential for understanding and improving business operations. A central element of our exploration is the “Absenteeism at Work Data Set,” which serves as an ongoing case study throughout the book [1]. The dataset offers a practical context for applying business intelligence methods and a real-world backdrop to the theoretical discussions.
Absenteeism in the workplace, in which employees habitually miss work without valid reasons, is a significant challenge for businesses. Its implications extend to operational efficiency, productivity, and team morale. The “Absenteeism at Work Data Set” comes from a courier company in Brazil and contains records of absences for 36 employees over three years, beginning in July 2007. With 740 instances and 21 attributes, the dataset offers a broad overview of factors that might influence absenteeism, making it a good basis for applying and understanding business intelligence tools and methods.
Throughout the book, this case study will be revisited in different chapters to demonstrate the application of analytical concepts. We will explore a range of topics, from data visualization techniques to the intricacies of predictive modeling. The insights gleaned and the challenges encountered with this dataset will serve as practical examples to deepen our understanding of business intelligence principles and their real-world applications.
2.1 Data Description
The “Absenteeism at Work Data Set” contains a diverse range of attributes covering the personal, work-related, and health-related factors that may influence workplace absence. Understanding what each attribute measures, and what type of data it holds, is a prerequisite for analyzing absenteeism patterns.
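As a first look at the data, the sketch below reads a delimited text file into a list of records and reports its dimensions. It uses only the Python standard library. The function name `load_absenteeism`, the semicolon delimiter, and the inline sample rows are assumptions made for illustration; the sample rows are invented and are not taken from the dataset.

```python
import csv
import io


def load_absenteeism(stream, delimiter=";"):
    """Read delimited rows from a text stream into a list of dicts.

    The semicolon default is an assumption about how the file is stored.
    """
    reader = csv.DictReader(stream, delimiter=delimiter)
    rows = list(reader)
    return rows, reader.fieldnames


# Invented sample with only four of the columns, for illustration only.
SAMPLE = io.StringIO(
    "ID;Reason for absence;Month of absence;Absenteeism time in hours\n"
    "1;23;7;2\n"
    "2;26;7;8\n"
    "1;28;8;3\n"
)

rows, columns = load_absenteeism(SAMPLE)
n_rows = len(rows)
n_cols = len(columns)
n_employees = len({row["ID"] for row in rows})
print(f"{n_rows} rows, {n_cols} columns, {n_employees} distinct employees")
```

Passing an open file handle for the downloaded dataset instead of `SAMPLE` should report 740 rows, 21 columns, and 36 distinct employees if the file matches the description in this chapter.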
2.1.1 Types of Data in the Dataset
The dataset is characterized by several distinct types of data, each offering unique insights into the dynamics of absenteeism:
Categorical Data includes variables that represent categories or types, such as “Reason for absence,” “Day of the week,” and “Seasons.” These attributes are essential for grouping the data into meaningful segments, facilitating the identification of patterns and trends within specific categories.
Numerical Data represents quantitative measurements and comes in two forms: continuous and discrete. Continuous data, like “Transportation expense” and “Workload Average/day,” can take on any value within a range and are vital for understanding the quantitative nuances of absenteeism, such as the cost of commuting or the average workload. Discrete data, such as “Son” and “Pet,” count occurrences and help quantify aspects of employees’ personal lives that may impact their attendance.
Ordinal Data is a type of categorical data but with an inherent order or ranking. The “Education” attribute is ordinal, as educational levels follow a natural progression. This data type is crucial for analyzing absenteeism in relation to hierarchical factors, such as education levels, which may influence employees’ behavior and attitudes towards work.
Binary Data is a simplified form of categorical data with only two categories, such as “Yes” and “No” (coded as 1 and 0 in this dataset). Attributes like “Social Drinker” and “Social Smoker” fall into this category and are useful for comparing two distinct groups of employees.
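The classification above can be written down as a small lookup table, which makes it easy to select all columns of one type before choosing a chart or a summary statistic. This is a minimal sketch: the `DataType` enum and the helper names are hypothetical, and only the attributes named as examples in this section are included.

```python
from enum import Enum


class DataType(Enum):
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"
    DISCRETE = "discrete"
    ORDINAL = "ordinal"
    BINARY = "binary"


# Only the attributes used as examples in the text are listed here.
ATTRIBUTE_TYPES = {
    "Reason for absence": DataType.CATEGORICAL,
    "Day of the week": DataType.CATEGORICAL,
    "Seasons": DataType.CATEGORICAL,
    "Transportation expense": DataType.CONTINUOUS,
    "Workload Average/day": DataType.CONTINUOUS,
    "Son": DataType.DISCRETE,
    "Pet": DataType.DISCRETE,
    "Education": DataType.ORDINAL,
    "Social Drinker": DataType.BINARY,
    "Social Smoker": DataType.BINARY,
}


def columns_of_type(data_type):
    """Return the attribute names of one data type, in alphabetical order."""
    return sorted(
        name for name, kind in ATTRIBUTE_TYPES.items() if kind is data_type
    )


def is_numerical(name):
    """Numerical data covers both the continuous and the discrete form."""
    return ATTRIBUTE_TYPES[name] in (DataType.CONTINUOUS, DataType.DISCRETE)


print(columns_of_type(DataType.BINARY))
print(is_numerical("Pet"), is_numerical("Education"))
```

Keeping ordinal and binary data as separate entries, rather than folding them into the categorical type, preserves the distinctions drawn in the text.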
2.1.2 Attribute Descriptions
Each attribute in the dataset contributes to a comprehensive picture of absenteeism:
- ID: A unique identifier for each employee, vital for data management but not directly relevant for absenteeism analysis.
- Reason for Absence (Categorical): Provides insights into the myriad reasons employees might be absent, aiding in identifying common causes.
- Month of Absence: Important for detecting patterns or seasonal trends in absenteeism rates.
- Day of the Week (Categorical): Identifies if certain days (2 for Monday, 3 for Tuesday, etc.) see higher absenteeism, possibly linked to the structure of the workweek.
- Seasons (Categorical): Evaluates the effect of seasonal shifts (1 for summer, 2 for autumn, etc.) on absenteeism.
- Transportation Expense (Numerical): Examines the impact of commuting costs on absenteeism.
- Distance from Residence to Work (Numerical, in kilometers): Investigates the relationship between commute distance and absenteeism rates.
- Service Time (Numerical): Looks at the duration of employment and its potential correlation with absenteeism.
- Age (Numerical): Considers how age-related factors might influence absenteeism.
- Workload Average/Day (Numerical): Explores the connection between high workloads and their potential to increase stress and absenteeism.
- Hit Target (Numerical): Assesses how achieving performance targets might relate to absenteeism.
- Disciplinary Failure (Binary, 1 for yes; 0 for no): Considers the role of disciplinary actions in absenteeism.
- Education (Ordinal, ranging from 1 for high school to 4 for master’s and doctorate levels): Investigates how different education levels might correlate with absenteeism, reflecting varied job roles and responsibilities.
- Son (Numerical, number of children), Social Drinker (Binary, 1 for yes; 0 for no), and Social Smoker (Binary, 1 for yes; 0 for no): Personal and lifestyle factors that could influence absenteeism.
- Pet (Numerical, number of pets): Like family responsibilities, pet care can create demands that lead to absence.
- Weight and Height (Numerical): Offer context for assessing health-related reasons for absenteeism.
- Body Mass Index (Numerical): Provides additional health-related context.
- Absenteeism Time in Hours (Numerical, target variable): The primary metric of interest, representing the total hours of absence.
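Several of the attributes above are stored as integer codes, so a common first step is to translate them into readable labels. The sketch below shows one way to do this with plain dictionaries. The text gives only the first codes for the day of the week and the season, and only the end points for education; the remaining labels follow the dataset description in the UCI repository and should be treated as assumptions to verify against the data. The helper name `decode_record` and the example record are invented for illustration.

```python
# Code-to-label tables. Entries not spelled out in the attribute list
# (days 4 to 6, seasons 3 and 4, education 2 and 3) are assumptions based
# on the dataset description in the UCI repository.
DAY_OF_WEEK = {2: "Monday", 3: "Tuesday", 4: "Wednesday", 5: "Thursday", 6: "Friday"}
SEASONS = {1: "summer", 2: "autumn", 3: "winter", 4: "spring"}
EDUCATION = {1: "high school", 2: "graduate", 3: "postgraduate", 4: "master's and doctorate"}
YES_NO = {0: "no", 1: "yes"}

DECODERS = {
    "Day of the week": DAY_OF_WEEK,
    "Seasons": SEASONS,
    "Education": EDUCATION,
    "Disciplinary Failure": YES_NO,
    "Social Drinker": YES_NO,
    "Social Smoker": YES_NO,
}


def decode_record(record):
    """Replace coded values with labels; leave all other columns unchanged."""
    decoded = {}
    for column, value in record.items():
        labels = DECODERS.get(column)
        if labels is None:
            decoded[column] = value
        else:
            # Unknown codes fall back to the raw value instead of failing.
            decoded[column] = labels.get(int(value), value)
    return decoded


# Invented record, using string values as they would come from a text file.
example = {
    "Day of the week": "3",
    "Seasons": "1",
    "Education": "1",
    "Social Drinker": "1",
    "Age": "33",
}
decoded = decode_record(example)
print(decoded)
```

Decoding only for display, and keeping the original integer codes for modeling, avoids losing the ordering that makes “Education” an ordinal attribute.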
2.2 Reason for Absence
Absences attested under the International Classification of Diseases (ICD) are grouped into 21 categories (codes 1 through 21). The dataset adds seven further categories (codes 22 through 28) for reasons that fall outside the ICD, such as medical consultations and unjustified absences. The full list of categories, in code order, is:
1. Certain infectious and parasitic diseases
2. Neoplasms
3. Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism
4. Endocrine, nutritional, and metabolic diseases
5. Mental and behavioural disorders
6. Diseases of the nervous system
7. Diseases of the eye and adnexa
8. Diseases of the ear and mastoid process
9. Diseases of the circulatory system
10. Diseases of the respiratory system
11. Diseases of the digestive system
12. Diseases of the skin and subcutaneous tissue
13. Diseases of the musculoskeletal system and connective tissue
14. Diseases of the genitourinary system
15. Pregnancy, childbirth, and the puerperium
16. Certain conditions originating in the perinatal period
17. Congenital malformations, deformations, and chromosomal abnormalities
18. Symptoms, signs, and abnormal clinical and laboratory findings, not elsewhere classified
19. Injury, poisoning, and certain other consequences of external causes
20. External causes of morbidity and mortality
21. Factors influencing health status and contact with health services
22. Patient follow-up
23. Medical consultation
24. Blood donation
25. Laboratory examination
26. Unjustified absence
27. Physiotherapy
28. Dental consultation
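Because the reason codes fall into two groups, it is often useful to separate them before counting. The sketch below assumes that the codes follow the order of the list above, with 1 to 21 for the ICD categories and 22 to 28 for the additional reasons. The function names and the example codes are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

ICD_CODES = range(1, 22)  # codes 1 through 21

# Codes 22 through 28, assuming they follow the order of the list above.
NON_ICD_REASONS = {
    22: "Patient follow-up",
    23: "Medical consultation",
    24: "Blood donation",
    25: "Laboratory examination",
    26: "Unjustified absence",
    27: "Physiotherapy",
    28: "Dental consultation",
}


def reason_group(code):
    """Classify a reason code as 'ICD', 'non-ICD', or 'unknown'."""
    if code in ICD_CODES:
        return "ICD"
    if code in NON_ICD_REASONS:
        return "non-ICD"
    return "unknown"


def count_reason_groups(codes):
    """Count how many absences fall into each group."""
    return Counter(reason_group(code) for code in codes)


# Invented codes for illustration only.
example_codes = [23, 26, 13, 28, 23, 99]
group_counts = count_reason_groups(example_codes)
print(dict(group_counts))
```

Any code outside 1 to 28 is reported as “unknown” rather than silently dropped, so unexpected values in the data remain visible in the counts.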
2.3 Conclusion
The “Absenteeism at Work Data Set” is not just a focal point for our case study but also a lens through which we can examine the complex issue of workplace absenteeism. By analyzing this dataset’s attributes and deriving insights, we aim to demonstrate how business intelligence can be applied to tackle real-world challenges, enhancing workplace efficiency and employee well-being. Through this practical application, readers will gain valuable skills and knowledge applicable to various business intelligence contexts.
[1] Martiniano, Andrea and Ferreira, Ricardo. (2018). Absenteeism at work. UCI Machine Learning Repository. https://doi.org/10.24432/C5X882.