12.1 Data Loading and Preparation

We begin our analysis by importing the “Absenteeism at Work” dataset, which is hosted online in CSV format. The file separates fields with semicolons rather than commas, so the R code snippet below passes sep = ";" to read.csv to split the columns correctly. With the data loaded into our R environment, we can prepare it for analysis and modeling of absenteeism patterns.

absenteeism <- read.csv(
    "https://ljkelly3141.github.io/datasets/bi-book/Absenteeism_at_work.csv",
    sep = ";"
)
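
To see why the delimiter matters, the sketch below parses a tiny semicolon-delimited string with and without sep = ";". The sample_text values are made up for illustration and are not taken from the dataset.

```r
# Hypothetical three-column sample; values are illustrative only
sample_text <- "ID;Age;Absenteeism time in hours\n1;33;4\n2;50;0"

# With the default comma separator, each line is read as one field
wrong <- read.csv(text = sample_text)
ncol(wrong)

# With sep = ";", the fields are split into separate columns
right <- read.csv(text = sample_text, sep = ";")
ncol(right)

# read.csv also replaces spaces in headers with dots (check.names = TRUE)
names(right)
```

The last call shows where the dotted column names in the preview, such as Absenteeism.time.in.hours, come from.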

The head function allows us to preview the first few rows of the dataset, ensuring it is loaded correctly.

head(absenteeism)

ID	Reason.for.absence	Month.of.absence	Day.of.the.week	Seasons	Transportation.expense	Distance.from.Residence.to.Work	Service.time	Age	Work.load.Average.day	Hit.target	Disciplinary.failure	Education	Son	Social.drinker	Social.smoker	Pet	Weight	Height	Body.mass.index	Absenteeism.time.in.hours
11	26	7	3	1	289	36	13	33	240	97	0	1	2	1	0	1	90	172	30	4
36	0	7	3	1	118	13	18	50	240	97	1	1	1	1	0	0	98	178	31	0
3	23	7	4	1	179	51	18	38	240	97	0	1	0	1	0	0	89	170	31	2
7	7	7	5	1	279	5	14	39	240	97	0	1	2	1	1	0	68	168	24	4
11	23	7	5	1	289	36	13	33	240	97	0	1	2	1	0	1	90	172	30	2
3	23	7	6	1	179	51	18	38	240	97	0	1	0	1	0	0	89	170	31	2

The next block of R code relabels the day-of-week column to make it more interpretable. We use the wday function from the lubridate package to convert the numeric day codes into abbreviated weekday names such as Wed, and then store them as character strings.

library(dplyr)      # mutate(), select(), filter(), across()
library(lubridate)  # wday()

absenteeism <- absenteeism |>
  mutate(
    Day.of.the.week =
      wday(Day.of.the.week + 1, label = TRUE) |> as.character()
  )
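
The relabeling can be illustrated without lubridate. The sketch below assumes lubridate's default numbering for numeric input, where 1 corresponds to Sunday, and uses a hypothetical lookup vector day_labels to mimic what wday(x + 1, label = TRUE) returns in an English locale for the day codes shown in the preview.

```r
# Hypothetical lookup vector: position 1 = Sunday, matching the assumed default
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Day codes that appear in the first rows of the dataset
codes <- c(3, 4, 5, 6)

# Shift each code by one, then look up its label
relabeled <- day_labels[codes + 1]
relabeled
```

Because of the + 1 shift, a stored code of 3 is displayed as Wed. Whether that shift matches the coding used by the data provider is worth confirming against the dataset's documentation.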

We then narrow the dataset to the variables used in the analyses that follow: absence hours, day of the week, body mass index, age, smoking and drinking status, number of children, pets, and education.

absenteeism <- absenteeism |>
  select(
    Absenteeism.time.in.hours,
    Day.of.the.week,
    Body.mass.index,
    Age,
    Social.smoker,
    Social.drinker,
    Son,
    Pet,
    Education
  )

Following this, we recode several numeric codes as labeled categories. Smoker and drinker status receive descriptive labels, while the counts of children and pets are collapsed into two-level indicators (none versus at least one). The original Son column is then dropped because the new Children column replaces it.

absenteeism <- absenteeism |>
  mutate(
    Social.smoker = ifelse(Social.smoker == 0, "Non-smoker", "Smoker"),
    Social.drinker = ifelse(Social.drinker == 0, "Non-drinker", "Social drinker"),
    Children = ifelse(Son == 0, "Non-parent", "Parent"),
    Pet = ifelse(Pet == 0, "No Pet(s)", "Pet(s)")
  ) |>
  select(-Son)
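
One way to check a recode like this is to cross-tabulate the original codes against the new labels. The sketch below does so in base R on a small made-up data frame; toy and its values are hypothetical and only stand in for the real columns.

```r
# Hypothetical counts of children and pets for five employees
toy <- data.frame(
  Son = c(0, 1, 2, 0, 4),
  Pet = c(0, 0, 1, 2, 0)
)

# Same ifelse() logic as in the chapter: zero versus at least one
toy$Children  <- ifelse(toy$Son == 0, "Non-parent", "Parent")
toy$Pet.label <- ifelse(toy$Pet == 0, "No Pet(s)", "Pet(s)")

# Each original count should fall under exactly one label
check <- table(toy$Son, toy$Children)
check
```

Every row of the resulting table should have a single non-zero cell, which confirms that no count was assigned to both categories.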

Next, the education variable is simplified into two categories: education codes of 2 or higher are labeled College, and all others are labeled High School. The original Education column is then removed.

absenteeism <- absenteeism |>
  mutate(
    College = ifelse(Education >= 2, "College", "High School")
  ) |>
  select(-Education)

Finally, we convert all character columns to factors so that statistical models treat them as categorical variables rather than plain text. The same step also removes records with zero hours of absence, so the remaining rows all represent actual absences.

absenteeism <- absenteeism |>
  mutate(across(where(is.character), as.factor)) |>
  filter(Absenteeism.time.in.hours > 0)
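
The effect of this step can be seen on a small example. The sketch below applies the same two operations to a made-up data frame; toy, hours, and smoker are hypothetical names used only for illustration.

```r
library(dplyr)

# Hypothetical data: one numeric column and one character column
toy <- data.frame(
  hours  = c(4, 0, 2),
  smoker = c("Non-smoker", "Smoker", "Non-smoker"),
  stringsAsFactors = FALSE
)

# Convert character columns to factors, then drop zero-hour rows
prepared <- toy |>
  mutate(across(where(is.character), as.factor)) |>
  filter(hours > 0)

sapply(prepared, class)   # column types after conversion
nrow(prepared)            # rows remaining after the filter
levels(prepared$smoker)   # factor levels are kept even if no row uses them
```

Note that filtering does not remove factor levels that no longer occur in the data. If that matters for a later model or plot, droplevels() can be applied afterwards.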

After these transformations, a second head function call provides a snapshot of the updated dataset.

head(absenteeism)

Absenteeism.time.in.hours	Day.of.the.week	Body.mass.index	Age	Social.smoker	Social.drinker	Pet	Children	College
4	Wed	30	33	Non-smoker	Social drinker	Pet(s)	Parent	High School
2	Thu	31	38	Non-smoker	Social drinker	No Pet(s)	Non-parent	High School
4	Fri	24	39	Smoker	Social drinker	No Pet(s)	Parent	High School
2	Fri	30	33	Non-smoker	Social drinker	Pet(s)	Parent	High School
2	Sat	31	38	Non-smoker	Social drinker	No Pet(s)	Non-parent	High School
8	Sat	27	28	Non-smoker	Social drinker	Pet(s)	Parent	High School

This refined dataset is now ready to be used in various methods of anomaly detection.
