12.1 Data Loading and Preparation

We begin our analysis by importing the “Absenteeism at Work” dataset, which is hosted online as a CSV file. Because the file separates fields with semicolons rather than commas, the R code below passes sep = ";" to read.csv() so that the columns are split correctly. The resulting data frame, absenteeism, is the starting point for the preparation steps that follow.

absenteeism <- read.csv(
    "https://ljkelly3141.github.io/datasets/bi-book/Absenteeism_at_work.csv",
    sep = ";"
)
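To see why the delimiter matters, the following sketch parses a small semicolon-delimited string with and without sep = ";". The values are made up and merely stand in for the real file, and the names csv_text, wrong, and right are illustrative only.

```r
# Toy semicolon-delimited text standing in for the real file (values are made up).
csv_text <- "ID;Age;Absenteeism.time.in.hours\n11;33;4\n36;50;0"

# With the default comma separator, each line is read as a single field.
wrong <- read.csv(text = csv_text)

# With sep = ";", the three fields are split into separate columns.
right <- read.csv(text = csv_text, sep = ";")

ncol(wrong)   # one column
ncol(right)   # three columns
names(right)
```

Base R's read.csv2() also defaults to a semicolon separator, but it additionally treats a comma as the decimal mark, so whether it suits a given file depends on how that file writes decimals.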

The head() function previews the first six rows of the dataset, letting us confirm that it loaded correctly.

head(absenteeism)
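| ID | Reason.for.absence | Month.of.absence | Day.of.the.week | Seasons | Transportation.expense | Distance.from.Residence.to.Work | Service.time | Age | Work.load.Average.day | Hit.target | Disciplinary.failure | Education | Son | Social.drinker | Social.smoker | Pet | Weight | Height | Body.mass.index | Absenteeism.time.in.hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 240 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 4 |
| 36 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 240 | 97 | 1 | 1 | 1 | 1 | 0 | 0 | 98 | 178 | 31 | 0 |
| 3 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 240 | 97 | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
| 7 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 240 | 97 | 0 | 1 | 2 | 1 | 1 | 0 | 68 | 168 | 24 | 4 |
| 11 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 240 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 2 |
| 3 | 23 | 7 | 6 | 1 | 179 | 51 | 18 | 38 | 240 | 97 | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |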

The next block of R code relabels the day-of-week column to make it more interpretable. We use wday() from the lubridate package, inside dplyr's mutate(), to convert the numeric day codes (offset by one) into abbreviated weekday names stored as character strings.

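library(dplyr)      # mutate(), select(), filter(), across()
library(lubridate)  # wday()

absenteeism <- absenteeism |>
  mutate(
    # Shift the stored day code by one, then convert it to an abbreviated weekday name
    Day.of.the.week =
      wday(Day.of.the.week + 1, label = TRUE) |> as.character()
  )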
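As a quick check on this mapping, the sketch below applies the same expression to the integer codes 1 through 6 and places the shifted and unshifted labels side by side. It assumes lubridate's default week start, in which 1 corresponds to Sunday. The names codes and lookup are illustrative, and the labels printed depend on the session locale.

```r
library(lubridate)

# Hypothetical day codes to inspect (not read from the dataset).
codes <- 1:6

lookup <- data.frame(
  code      = codes,
  # Label the code as-is: with the default week start, 1 = Sunday.
  unshifted = as.character(wday(codes, label = TRUE)),
  # Label the code after adding 1, as in the mutate() call above.
  shifted   = as.character(wday(codes + 1, label = TRUE))
)

lookup
```

Comparing the two label columns against the dataset's documentation of its day coding is a simple way to confirm that the offset produces the intended weekday names.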

We then narrow down the dataset to focus on specific attributes that are key for further analyses, effectively shaping our dataset to include variables that are most relevant.

absenteeism <- absenteeism |>
  select(
    Absenteeism.time.in.hours,
    Day.of.the.week,
    Body.mass.index,
    Age,
    Social.smoker,
    Social.drinker,
    Son,
    Pet,
    Education
  )

Following this, we recode several categorical variables to make their values self-explanatory. Smoker and drinker status become text labels, and the counts of children and pets are collapsed into binary indicators (any versus none). The new Children column replaces Son, which is then dropped as redundant.

absenteeism <- absenteeism |>
  mutate(
    Social.smoker = ifelse(Social.smoker == 0, "Non-smoker", "Smoker"),
    Social.drinker = ifelse(Social.drinker == 0, "Non-drinker", "Social drinker"),
    Children = ifelse(Son == 0, "Non-parent", "Parent"),
    Pet = ifelse(Pet == 0, "No Pet(s)", "Pet(s)")
  ) |>
  select(-Son)
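The recoding pattern can be tried on a small made-up data frame before running it on the full dataset. In the sketch below, toy and its values are hypothetical. The mutate() call mirrors the one above, and count() tabulates the new labels to confirm that every nonzero value falls into the second category.

```r
library(dplyr)

# Hypothetical rows covering zero and nonzero values for each variable.
toy <- data.frame(
  Social.smoker  = c(0, 1, 0),
  Social.drinker = c(1, 1, 0),
  Son            = c(0, 2, 4),
  Pet            = c(0, 1, 8)
)

recoded <- toy |>
  mutate(
    Social.smoker  = ifelse(Social.smoker == 0, "Non-smoker", "Smoker"),
    Social.drinker = ifelse(Social.drinker == 0, "Non-drinker", "Social drinker"),
    Children       = ifelse(Son == 0, "Non-parent", "Parent"),
    Pet            = ifelse(Pet == 0, "No Pet(s)", "Pet(s)")
  ) |>
  select(-Son)  # Son is now summarized by Children

# Tabulate the binary parent indicator.
count(recoded, Children)
```

Note that ifelse() returns NA when its test is NA, so missing values in the original columns would remain missing after recoding rather than being assigned to either label.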

Next, the education variable is simplified into two broad categories to facilitate analysis and interpretation: codes of 2 or higher are labeled “College”, and all remaining rows are labeled “High School”.

absenteeism <- absenteeism |>
  mutate(
    College = ifelse(Education >= 2, "College", "High School")
  ) |>
  select(-Education)
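A cross-tabulation makes the effect of the threshold explicit. The sketch below uses hypothetical education codes 1 through 4 (an assumption about the coding range, which should be checked against the dataset's documentation) and shows which codes land in each group.

```r
library(dplyr)

# Hypothetical education codes; the range 1 to 4 is assumed for illustration.
toy <- data.frame(Education = c(1, 1, 2, 3, 4))

toy <- toy |>
  mutate(College = ifelse(Education >= 2, "College", "High School"))

# Rows are the original codes, columns are the new two-level grouping.
table(toy$Education, toy$College)
```

Running the same table() call on the real data before dropping Education is one way to verify that no code was grouped unexpectedly.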

Finally, we convert all character columns to factors and keep only the records with a positive number of absence hours. Declaring categorical variables as factors is important for statistical models and analyses that require categorical data to be explicitly typed.

absenteeism <- absenteeism |>
  mutate(across(where(is.character), as.factor)) |>
  filter(Absenteeism.time.in.hours > 0)
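The same two steps can be illustrated on a small made-up data frame. In this sketch, toy and prepared are hypothetical names. The pipeline mirrors the one above, and the calls that follow inspect the column classes, the factor levels, and the number of rows kept.

```r
library(dplyr)

# Hypothetical data: one numeric column and one character column.
toy <- data.frame(
  Absenteeism.time.in.hours = c(4, 0, 2),
  Pet = c("Pet(s)", "No Pet(s)", "Pet(s)"),
  stringsAsFactors = FALSE
)

prepared <- toy |>
  mutate(across(where(is.character), as.factor)) |>  # character -> factor
  filter(Absenteeism.time.in.hours > 0)              # keep rows with recorded absence

sapply(prepared, class)  # class of each column
levels(prepared$Pet)     # factor levels after filtering
nrow(prepared)           # number of rows kept
```

Because the conversion happens before the filter, a factor keeps any level whose rows were all filtered out. If unused levels matter for a later model, droplevels() can remove them.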

After these transformations, a second call to head() provides a snapshot of the updated dataset.

head(absenteeism)
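| Absenteeism.time.in.hours | Day.of.the.week | Body.mass.index | Age | Social.smoker | Social.drinker | Pet | Children | College |
|---|---|---|---|---|---|---|---|---|
| 4 | Wed | 30 | 33 | Non-smoker | Social drinker | Pet(s) | Parent | High School |
| 2 | Thu | 31 | 38 | Non-smoker | Social drinker | No Pet(s) | Non-parent | High School |
| 4 | Fri | 24 | 39 | Smoker | Social drinker | No Pet(s) | Parent | High School |
| 2 | Fri | 30 | 33 | Non-smoker | Social drinker | Pet(s) | Parent | High School |
| 2 | Sat | 31 | 38 | Non-smoker | Social drinker | No Pet(s) | Non-parent | High School |
| 8 | Sat | 27 | 28 | Non-smoker | Social drinker | Pet(s) | Parent | High School |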

This refined dataset is now ready to be used in various methods of anomaly detection.