10.1 Data Loading and Preparation
10.1.1 Load Data
We begin by loading the dataset titled “Absenteeism at work” from an online CSV file. The code snippet below imports the data into our R environment, utilizing a semicolon as the delimiter to ensure accurate column separation. This step allows us to have the necessary data ready for detailed analysis and modeling in our study on absenteeism.
```r
absenteeism <- read.csv(
  "https://ljkelly3141.github.io/datasets/bi-book/Absenteeism_at_work.csv",
  sep = ";"
)
```

Now, let's take a quick look at the data.
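The preview table that follows can be produced with a simple `head()` call. This is a sketch of one way to generate it; the book's own rendering may instead use a table-formatting helper such as `knitr::kable()`.

```r
# Show the first six rows of the freshly loaded data frame
head(absenteeism)
```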
| ID | Reason.for.absence | Month.of.absence | Day.of.the.week | Seasons | Transportation.expense | Distance.from.Residence.to.Work | Service.time | Age | Work.load.Average.day | Hit.target | Disciplinary.failure | Education | Son | Social.drinker | Social.smoker | Pet | Weight | Height | Body.mass.index | Absenteeism.time.in.hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 240 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 4 |
| 36 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 240 | 97 | 1 | 1 | 1 | 1 | 0 | 0 | 98 | 178 | 31 | 0 |
| 3 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 240 | 97 | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
| 7 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 240 | 97 | 0 | 1 | 2 | 1 | 1 | 0 | 68 | 168 | 24 | 4 |
| 11 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 240 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 2 |
| 3 | 23 | 7 | 6 | 1 | 179 | 51 | 18 | 38 | 240 | 97 | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
10.1.2 Prepare the data
The process of analyzing absenteeism within an organization requires precise data handling and transformation to ensure meaningful insights. This section outlines how we manipulate the dataset titled absenteeism using R, employing various data manipulation techniques to prepare it for deeper analysis. Each step in the process builds on the previous one, refining and enriching the dataset to enhance its analytical value. Here is a detailed breakdown:
Step 1. Data Filtering

The initial stage of our analysis involves filtering the dataset to focus on employees with a substantial duration of service. This ensures that the data reflects stable employment patterns, which is crucial for reliable analysis.
```r
# Load dplyr for the data-manipulation verbs used throughout this section
library(dplyr)

# Filter the data for employees with a Service time of at least 6
filtered_data <- absenteeism |>
  filter(Service.time >= 6)
```

Service.time measures the duration of employment in months, and we examine how it potentially correlates with absenteeism. We filter the dataset to include only employees who have been with the company for at least six months, ensuring that our analysis is based on substantial and reliable absenteeism data.
Step 2. Grouping and Mutation
Next, we group the data by employee ID to analyze absenteeism patterns individually and adjust absenteeism metrics according to each employee’s service duration. This normalization is crucial as it accounts for varying employment periods among employees, ensuring a fair comparison across all data points.
```r
# Group by employee ID and calculate new variables
grouped_data <- filtered_data |>
  group_by(ID) |>
  mutate(Absenteeism.time.in.hours = sum(Absenteeism.time.in.hours) / Service.time,
         Number.of.absence = n() / Service.time)
```

Breaking Down the Code
- Grouping by `ID`: The data is first grouped by the employee ID. This allows us to aggregate and analyze data specific to each individual, which is essential for identifying unique absenteeism patterns.
- Calculating `Absenteeism.time.in.hours`: We calculate the total hours of absenteeism per employee and then divide this by their total service time. This results in the average hours of absenteeism per unit of service time, providing a normalized view of absenteeism that can be compared across employees regardless of how long they have been with the company.
- Calculating `Number.of.absence`: Similarly, `n()` counts the number of absentee instances for each employee, and we divide this count by the service time. This calculation offers another perspective on absenteeism, focusing on the frequency of absences rather than the total duration, adjusted for the length of service.
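To see how the grouped `mutate()` behaves, here is a minimal sketch on a toy data frame; the employee IDs, service times, and hours below are invented purely for illustration.

```r
library(dplyr)

# Toy data (invented values): two employees with repeated absence records
toy <- data.frame(
  ID = c(1, 1, 2),
  Service.time = c(10, 10, 20),
  Absenteeism.time.in.hours = c(4, 6, 8)
)

normalized <- toy |>
  group_by(ID) |>
  mutate(Absenteeism.time.in.hours = sum(Absenteeism.time.in.hours) / Service.time,
         Number.of.absence = n() / Service.time)

normalized
# ID 1: (4 + 6) / 10 = 1 hour per unit of service time, 2 / 10 = 0.2 absences
# ID 2: 8 / 20 = 0.4 hours, 1 / 20 = 0.05 absences
```

Note that `mutate()` keeps one row per original record; the per-employee collapse happens in the summarization step that follows.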
Step 3. Summarization
After calculating key absenteeism metrics, we proceed to aggregate the data by employee, averaging each metric for every individual. This step simplifies the data into a more manageable form for each employee, making it easier to analyze overall trends and patterns.
```r
# Summarize the grouped data by taking the mean of all columns
summarized_data <- grouped_data |>
  summarise(across(everything(), mean))
```

Breaking Down the Code
- `grouped_data`: This is the dataset that was grouped by employee `ID` (using `group_by()`) in the previous step.
- `summarise()`: This function creates a new data frame containing summary statistics of the grouped data. It applies the specified summary function to each group.
- `across(everything(), mean)`: Inside `summarise()`, the `across()` function targets all columns (`everything()`) in the grouped data and applies `mean()` to each column within each group. The result is that, for each group, the mean value of every column is calculated.
- `summarized_data`: This is the new data frame created from the summarization process. Each row in `summarized_data` represents a group from `grouped_data`, with each column containing the mean value of the corresponding original column for that group.
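The `summarise(across(...))` pattern can be tried on a toy example; the small data frame below (IDs, hours, ages) is invented for illustration.

```r
library(dplyr)

# Toy data (invented values) to show across(everything(), mean)
toy <- data.frame(
  ID = c(1, 1, 2),
  Hours = c(2, 4, 6),
  Age = c(30, 30, 40)
)

means <- toy |>
  group_by(ID) |>
  summarise(across(everything(), mean))

means
# One row per ID; the grouping column itself is not averaged:
#   ID 1 -> Hours = 3, Age = 30
#   ID 2 -> Hours = 6, Age = 40
```

In current dplyr, `everything()` inside `summarise()` excludes the grouping variables, so `ID` is carried through unchanged rather than averaged.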
Step 4. Selection and Rearrangement
We then select specific columns crucial for understanding the context of absenteeism, such as demographic factors and lifestyle choices.
```r
# Select specific columns for further analysis
selected_data <- summarized_data |>
  select(ID, Number.of.absence, Absenteeism.time.in.hours, Body.mass.index,
         Age, Social.smoker, Social.drinker, Son, Pet)
```

Breaking Down the Code
- `selected_data`: This is the new data frame created by selecting certain columns from `summarized_data`.
- `select()`: This function is used here to choose specific columns from the summarized dataset. The columns chosen are:
  - `ID`: Employee identification number.
  - `Number.of.absence`: The average number of times an employee was absent.
  - `Absenteeism.time.in.hours`: The average total hours of absenteeism per employee.
  - `Body.mass.index`: Average body mass index, which can correlate with health-related absenteeism.
  - `Age`: Average age of employees, as age can affect absenteeism patterns.
  - `Social.smoker`: Smoking status, which may influence absenteeism due to health issues.
  - `Social.drinker`: Drinking status, potentially relevant to social habits affecting work attendance.
  - `Son`: Number of children, as family responsibilities can impact absenteeism.
  - `Pet`: Pet ownership, which might affect time off needed for pet care.
This focused selection of variables allows for a deeper analysis of factors most likely to influence absenteeism patterns, helping to pinpoint specific areas for intervention or further study.
Step 5. Data Recoding
To enhance the interpretability of the data, we convert numeric indicator variables into categorical ones, making the data clearer and easier to analyze.
```r
# Recode the selected data
recoded_data <- selected_data |>
  mutate(Social.smoker = ifelse(Social.smoker == 0, "Non-smoker", "Smoker"),
         Social.drinker = ifelse(Social.drinker == 0, "Non-drinker", "Social drinker"),
         Children = ifelse(Son == 0, "Non-parent", "Parent"),
         Pet = ifelse(Pet == 0, "No Pet(s)", "Pet(s)"))
```

Breaking Down the Code
- `recoded_data`: This new data frame is created by modifying specific columns in `selected_data`.
- `mutate()`: This function is used to alter or create columns in the data frame. Each specified column is transformed based on the provided conditions:
  - `Social.smoker`: Transforms the `Social.smoker` column from a binary indicator (0 or 1) to a categorical label, where `0` becomes "Non-smoker" and any non-zero value becomes "Smoker".
  - `Social.drinker`: Similar to `Social.smoker`, this recodes the `Social.drinker` column, where `0` is recoded as "Non-drinker" and any non-zero value as "Social drinker".
  - `Children`: The `Son` column is recoded into "Non-parent" for `0` and "Parent" for any non-zero value, providing clear labels about parental status.
  - `Pet`: The `Pet` column changes from a numeric indicator to a categorical description, where `0` becomes "No Pet(s)" and any non-zero value becomes "Pet(s)".
This recoding process clarifies the impact of lifestyle factors on absenteeism by providing descriptive categories that reflect the characteristics of each employee.
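The `ifelse()` recoding pattern can be checked on a toy vector; the values below are invented for illustration.

```r
# Toy check (invented values) of the ifelse() recoding used above
son <- c(0, 2, 1, 0)
children <- ifelse(son == 0, "Non-parent", "Parent")

children
# "Non-parent" "Parent" "Parent" "Non-parent"
```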
Step 6. Final Selection and Cleanup
Finally, we refine the dataset by eliminating columns that are not essential for our analysis or presentation. This step ensures that the dataset is streamlined, focusing only on the most relevant information.
```r
# Final selection of the data, removing unnecessary columns
Absenteeism.by.employee <- recoded_data |>
  select(-Son, -ID)
```

Breaking Down the Code
- `Absenteeism.by.employee`: This is the new data frame created by removing specific columns from `recoded_data`.
- `select(-Son, -ID)`: The `select()` function is used here with a minus sign (`-`) to exclude columns from the data frame. In this case:
  - `-Son`: Removes the `Son` column, which has already been transformed into the more informative `Children` category.
  - `-ID`: Removes the `ID` column to de-personalize the data, focusing solely on statistical patterns and trends without individual identifiers.
10.1.3 Define High absence
To conclude the preparation, we create a new variable that categorizes employees based on their absenteeism rates. This classification divides employees into "High Absenteeism" or "Low Absenteeism" groups, using the median of absenteeism hours per month of service as the threshold. Employees whose absenteeism time is at or above this median are classified as having "High Absenteeism."
```r
# Set threshold of 0.5 (the median)
High.Absence <- 0.5

Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(
    High.absenteeism = ifelse(
      Absenteeism.time.in.hours >=
        quantile(Absenteeism.time.in.hours, High.Absence),
      "High Absenteeism",
      "Low Absenteeism") |> as.factor())
```

Breaking Down the Code
- `High.Absence`: This variable is set at 0.5, representing the 50th percentile, commonly known as the median.
- `quantile(Absenteeism.time.in.hours, High.Absence)`: This function calculates the value at the 50th percentile of the `Absenteeism.time.in.hours` distribution within the data. It determines the threshold between "High Absenteeism" and "Low Absenteeism."
- `ifelse()`: This function checks whether each employee's absenteeism time is greater than or equal to the median value. If true, it assigns "High Absenteeism"; otherwise, it assigns "Low Absenteeism."
- `as.factor()`: This function converts the resulting labels ("High Absenteeism" or "Low Absenteeism") into a factor, a categorical variable suitable for analysis in R.
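The quantile-based split can be illustrated on a toy vector; the hours below are invented for illustration.

```r
# Toy vector (invented values) illustrating the median-based split
hours <- c(0.2, 0.5, 1.0, 2.0, 4.0)

threshold <- quantile(hours, 0.5)   # median of these five values is 1.0
labels <- ifelse(hours >= threshold, "High Absenteeism", "Low Absenteeism")

labels
# "Low Absenteeism" "Low Absenteeism" "High Absenteeism"
# "High Absenteeism" "High Absenteeism"
```

Because the comparison uses `>=`, an employee exactly at the median lands in the "High Absenteeism" group.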
This step categorizes employees, allowing for detailed analyses and comparisons based on their absenteeism levels, enhancing our understanding of absenteeism patterns within the organization. We will use this classification to conduct a logistic regression.
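As a preview, the logistic regression mentioned above could be fit with `glm(..., family = binomial)`. The sketch below is an assumption, not the chapter's final model: it uses a small invented data frame in place of `Absenteeism.by.employee` and only two predictors. In the real analysis, `Absenteeism.time.in.hours` should be excluded as a predictor, since `High.absenteeism` is defined directly from it.

```r
# Minimal sketch (invented data, assumed model form) of a logistic
# regression on a high-absence indicator
toy <- data.frame(
  High.absenteeism = factor(c("Low", "Low", "High", "High",
                              "Low", "High", "Low", "High")),
  Age = c(28, 50, 44, 30, 35, 47, 41, 52),
  Body.mass.index = c(22, 31, 29, 23, 24, 30, 28, 26)
)

model <- glm(High.absenteeism ~ Age + Body.mass.index,
             data = toy, family = binomial)
coef(model)  # intercept plus one coefficient per predictor
```

Note that `glm()` models the probability of the second factor level, so checking (and if needed releveling) the factor order of `High.absenteeism` is worthwhile before interpreting coefficients.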