12.1 Data Loading and Preparation
We begin our analysis by importing the “Absenteeism at Work” dataset, which is hosted online in CSV format. The R code below loads the data with read.csv, setting the delimiter to a semicolon because the file separates its fields with semicolons rather than commas. Getting the delimiter right ensures that each variable lands in its own column before we start cleaning and modeling patterns of absenteeism.
absenteeism <- read.csv(
  "https://ljkelly3141.github.io/datasets/bi-book/Absenteeism_at_work.csv",
  sep = ";"
)

The head function allows us to preview the first few rows of the dataset, ensuring it is loaded correctly.

| ID | Reason.for.absence | Month.of.absence | Day.of.the.week | Seasons | Transportation.expense | Distance.from.Residence.to.Work | Service.time | Age | Work.load.Average.day | Hit.target | Disciplinary.failure | Education | Son | Social.drinker | Social.smoker | Pet | Weight | Height | Body.mass.index | Absenteeism.time.in.hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 240 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 4 |
| 36 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 240 | 97 | 1 | 1 | 1 | 1 | 0 | 0 | 98 | 178 | 31 | 0 |
| 3 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 240 | 97 | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
| 7 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 240 | 97 | 0 | 1 | 2 | 1 | 1 | 0 | 68 | 168 | 24 | 4 |
| 11 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 240 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 2 |
| 3 | 23 | 7 | 6 | 1 | 179 | 51 | 18 | 38 | 240 | 97 | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
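
To see why the delimiter matters, the sketch below parses a small inline sample with and without `sep = ";"`. The three-column sample string is made up for illustration (its values echo the preview above), and it assumes that the raw header contains spaces, which `read.csv` rewrites to dots through its default `check.names = TRUE`.

```r
# Hypothetical three-column excerpt in the same semicolon-delimited layout.
sample_text <- paste(
  "ID;Reason for absence;Absenteeism time in hours",
  "11;26;4",
  "36;0;0",
  sep = "\n"
)

# Default comma delimiter: each line is read as a single field.
wrong <- read.csv(text = sample_text)

# Semicolon delimiter: fields are split as intended.
right <- read.csv(text = sample_text, sep = ";")

ncol(wrong)   # 1
ncol(right)   # 3
names(right)  # "ID" "Reason.for.absence" "Absenteeism.time.in.hours"
```

A column count of 1 after loading is a quick signal that the delimiter was not recognized.
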
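The next block of R code relabels the day-of-week column to make it more interpretable. We use the wday function from the lubridate package to convert the numeric day codes into abbreviated weekday names. The code adds 1 to each code before passing it to wday, which by default numbers the days starting from 1 for Sunday. The month column is left untouched because it is dropped in the next step.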
library(dplyr)      # mutate(), select(), filter(), across()
library(lubridate)  # wday()

absenteeism <- absenteeism |>
  mutate(
    Day.of.the.week =
      wday(Day.of.the.week + 1, label = TRUE) |> as.character()
  )

We then narrow the dataset to the variables most relevant to the analyses that follow.
absenteeism <- absenteeism |>
  select(
    Absenteeism.time.in.hours,
    Day.of.the.week,
    Body.mass.index,
    Age,
    Social.smoker,
    Social.drinker,
    Son,
    Pet,
    Education
  )

Following this, we recode several numeric indicators as labeled categories and remove the variable this makes redundant. Smoker and drinker status become text labels, and the counts of children and pets are collapsed into two-level categories (none versus at least one). The original Son column is then dropped in favor of the new Children column.
absenteeism <- absenteeism |>
  mutate(
    Social.smoker = ifelse(Social.smoker == 0, "Non-smoker", "Smoker"),
    Social.drinker = ifelse(Social.drinker == 0, "Non-drinker", "Social drinker"),
    Children = ifelse(Son == 0, "Non-parent", "Parent"),
    Pet = ifelse(Pet == 0, "No Pet(s)", "Pet(s)")
  ) |>
  select(-Son)

Next, the education variable is collapsed into two broader categories to simplify analysis and interpretation: a code of 2 or higher is labeled College, and anything lower is labeled High School.
absenteeism <- absenteeism |>
  mutate(
    College = ifelse(Education >= 2, "College", "High School")
  ) |>
  select(-Education)

Finally, we convert all character columns to factors, since many statistical models and analyses require categorical data to be explicitly declared. The same step also removes records with zero hours of absence, so only actual absences remain.
absenteeism <- absenteeism |>
  mutate(across(where(is.character), as.factor)) |>
  filter(Absenteeism.time.in.hours > 0)

After these transformations, a second head function call provides a snapshot of the updated dataset.

| Absenteeism.time.in.hours | Day.of.the.week | Body.mass.index | Age | Social.smoker | Social.drinker | Pet | Children | College |
|---|---|---|---|---|---|---|---|---|
| 4 | Wed | 30 | 33 | Non-smoker | Social drinker | Pet(s) | Parent | High School |
| 2 | Thu | 31 | 38 | Non-smoker | Social drinker | No Pet(s) | Non-parent | High School |
| 4 | Fri | 24 | 39 | Smoker | Social drinker | No Pet(s) | Parent | High School |
| 2 | Fri | 30 | 33 | Non-smoker | Social drinker | Pet(s) | Parent | High School |
| 2 | Sat | 31 | 38 | Non-smoker | Social drinker | No Pet(s) | Non-parent | High School |
| 8 | Sat | 27 | 28 | Non-smoker | Social drinker | Pet(s) | Parent | High School |
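
The day labels in this preview depend on how `wday()` treats plain numbers. As a minimal sketch, the lines below compare the mapping with and without the `+ 1` offset used in the preparation code. With lubridate's default `week_start` of 7, the number 1 corresponds to Sunday. The vector `codes` is a hypothetical set of day codes chosen for illustration, not a column read from the file.

```r
library(lubridate)

# Hypothetical day codes; the raw preview above shows values from 3 to 6.
codes <- 2:6

# With label = TRUE, wday() returns an ordered factor of abbreviated day names.
plain   <- as.character(wday(codes, label = TRUE))      # no offset
shifted <- as.character(wday(codes + 1, label = TRUE))  # offset used in the preparation step

print(plain)
print(shifted)
```

In an English locale the first vector runs from Mon to Fri and the second from Tue to Sat, so it is worth confirming the offset against the dataset's own documentation of how days are coded.
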
This refined dataset is now ready to be used in various methods of anomaly detection.
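
As a quick sanity check of the recoding, factor conversion, and filtering pattern, the sketch below applies the same steps to a three-row toy data frame. The data frame `toy` and its values are invented for illustration; only the column names and the dplyr calls mirror the preparation code in this section.

```r
library(dplyr)

# Hypothetical toy data: one zero-hour record and two positive-hour records.
toy <- data.frame(
  Absenteeism.time.in.hours = c(4, 0, 2),
  Social.smoker = c(0, 1, 0),
  Education = c(1, 1, 3)
)

prepared <- toy |>
  mutate(
    Social.smoker = ifelse(Social.smoker == 0, "Non-smoker", "Smoker"),
    College = ifelse(Education >= 2, "College", "High School")
  ) |>
  select(-Education) |>
  mutate(across(where(is.character), as.factor)) |>  # text labels become factors
  filter(Absenteeism.time.in.hours > 0)              # drop zero-hour records

nrow(prepared)                  # 2
sapply(prepared, class)         # numeric, factor, factor
levels(prepared$Social.smoker)  # "Non-smoker" "Smoker"
```

Because the factor conversion runs before the filter, a level that no longer occurs after filtering (here, "Smoker") stays in the factor. Calling `droplevels()` on the result removes such unused levels if a later model is sensitive to them.
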