6.5 Conclusion

This chapter walked through a complete data cleaning pipeline — from raw event-level data to a polished employee-level data frame. Along the way, we:

Filtered the data to focus on employees with sufficient service history
Aggregated 740 absence events into per-employee summaries
Recoded numeric variables into readable categorical labels
Created new derived variables (College, High absenteeism)
Converted character columns to factors for modeling

The cleaned dataset is now saved and ready for use. In the next chapters, we will visualize these employee-level patterns (Chapters 7–8) and build models to predict high absenteeism (Chapters 9–10).

6.5.1 Complete Code Listing

Below is the complete R code used in this chapter as a single script. You can copy this code into a Quarto document (.qmd file), render it, and reproduce all of the data cleaning from this chapter.

Complete Code

# -------------------------------------------
# 1. Load packages and data
# -------------------------------------------
if(!require("pacman")) install.packages("pacman")
pacman::p_load("tidyverse", "psych", "lubridate")

# Load the raw dataset
absenteeism <- read_delim(
  "https://ljkelly3141.github.io/datasets/bi-book/Absenteeism_at_work.csv",
  delim = ";"
)

# -------------------------------------------
# 2. Create a custom MODE function
# -------------------------------------------
# Returns the most frequently occurring value
MODE <- function(x) {
  names(sort(-table(x)))[1] |> as.numeric()
}

# -------------------------------------------
# 3. Filter by service time
# -------------------------------------------
# Keep only employees with 6+ months of service
Absenteeism.by.employee <- absenteeism |>
  filter(`Service time` >= 6)

# -------------------------------------------
# 4. Aggregate data by employee
# -------------------------------------------
# Compute per-employee summaries within groups
Absenteeism.by.employee <- Absenteeism.by.employee |>
  group_by(ID) |>
  mutate(`Most common reason for absence` = MODE(`Reason for absence`),
         `Month of absence` = MODE(`Month of absence`),
         `Day of the week` = MODE(`Day of the week`),
         Education = max(Education),
         `Absenteeism time in hours` =
           sum(`Absenteeism time in hours`)/`Service time`,
         `Number of absence` = n()/`Service time`)

# -------------------------------------------
# 5. Collapse to one row per employee
# -------------------------------------------
Absenteeism.by.employee <- Absenteeism.by.employee |>
  group_by(ID) |>
  summarise(across(everything(), mean))

# -------------------------------------------
# 6. Convert day/month to readable labels
# -------------------------------------------
Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(
    `Day of the week`  =
      wday(`Day of the week` + 1, label = TRUE) |> as.character(),
    `Month of absence` =
      month(
        ymd(010101) + months(`Month of absence` - 1),
        label = TRUE, abbr = TRUE
      ) |> as.character()
  )

# -------------------------------------------
# 7. Select key variables
# -------------------------------------------
Absenteeism.by.employee <- Absenteeism.by.employee |>
  select(ID, `Number of absence`, `Absenteeism time in hours`,
         `Most common reason for absence`, `Month of absence`,
         `Day of the week`, `Body mass index`, Age,
         `Social smoker`, `Social drinker`, Son, Pet, Education)

# -------------------------------------------
# 8. Recode binary/count variables
# -------------------------------------------
Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(`Social smoker` = ifelse(`Social smoker` == 0,
                         "Non-smoker", "Smoker"),
         `Social drinker` = ifelse(`Social drinker` == 0,
                         "Non-drinker", "Social drinker"),
         Children = ifelse(Son == 0, "Non-parent", "Parent"),
         Pet = ifelse(Pet == 0, "No Pet(s)", "Pet(s)")) |>
  select(-Son, -ID)

# -------------------------------------------
# 9. Simplify education levels
# -------------------------------------------
Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(
    College = ifelse(Education >= 2, "college", "high school"),
    Education = factor(Education) |>
      fct_recode("high school" = "1", "graduate" = "2",
                 "postgraduate" = "3", "master and doctor" = "4")
  )

# -------------------------------------------
# 10. Categorize absenteeism (high vs. low)
# -------------------------------------------
Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(`High absenteeism` =
    ifelse(`Absenteeism time in hours` >= median(`Absenteeism time in hours`),
           "High Absenteeism", "Low Absenteeism"))

# -------------------------------------------
# 11. Convert characters to factors
# -------------------------------------------
Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(across(where(is_character), as_factor))

# -------------------------------------------
# 12. Save the cleaned data
# -------------------------------------------
write.csv(Absenteeism.by.employee,
          file = "data/Absenteeism_by_employee_1.csv",
          row.names = FALSE)