6.2 Processing the Data

6.2.1 Creating a Mode Function

R does not have a built-in function for the statistical mode, the most frequently occurring value (base R’s mode() reports an object’s storage type instead). We need such a function because we want to find each employee’s most common reason for absence, most common day, and most common month. Here is a custom MODE function:

MODE <- function(x) {
  names(sort(-table(x)))[1] |> as.numeric()
}

Breaking Down the Code

  1. table(x): Creates a frequency table — each unique value in x is paired with the number of times it appears.

  2. sort(-table(x)): The negation (-) reverses the sort order, so the most frequent values come first.

  3. names(...)[1]: Extracts the name (i.e., the original value) of the first entry in the sorted table — the most frequent value.

  4. |> as.numeric(): The name is returned as a character string, so we convert it to a number.

Note that if multiple values tie for the highest frequency, this function returns only one of them: the smallest, because table() lists values in ascending order and sorting keeps tied entries in that order. For our purposes, this is sufficient.
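To see the function's behavior, including how it resolves a tie, here is a quick check on made-up vectors (the values are illustrative and not taken from the absenteeism data):

```r
# Same definition as above, repeated so this snippet runs on its own
MODE <- function(x) {
  names(sort(-table(x)))[1] |> as.numeric()
}

# 23 appears three times, more than any other value
clear_winner <- MODE(c(23, 7, 23, 11, 23, 7))

# 7 and 23 both appear twice; 7 is listed first in the
# frequency table, so it is the value returned
tied <- MODE(c(23, 7, 23, 7, 11))

clear_winner  # 23
tied          # 7
```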

6.2.2 Filtering by Service Time

We remove employees with fewer than six years of service. Six years provides enough absence history for meaningful per-employee summaries, while shorter tenures may not yet reflect established patterns:

Absenteeism.by.employee <- absenteeism |>
  filter(`Service time` >= 6)

The filter() function keeps only the rows where Service time is 6 or greater. The result is stored in a new variable, Absenteeism.by.employee, which we will continue to transform in the steps that follow. Note that Absenteeism.by.employee is the name of our R variable (which we chose ourselves), not a column name from the data file. Ordinary R variable names cannot contain spaces, so we separate the words with dots. Column names, which come from the data file, do contain spaces and must therefore be wrapped in backticks.
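As a small illustration of filtering on a backticked column, the sketch below uses a hypothetical four-row table named toy; only the column name Service time matches the real data.

```r
library(dplyr)

# Hypothetical rows; only the column name matches the real data
toy <- tibble(
  ID = c(1, 1, 2, 3),
  `Service time` = c(12, 12, 3, 6)
)

# Backticks let us refer to a column whose name contains a space
kept <- toy |>
  filter(`Service time` >= 6)

kept$ID  # 1 1 3 (the employee with Service time 3 is dropped)
```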

6.2.3 Aggregating Data by Employee

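Next, we group the data by employee ID and compute summary statistics for each employee. This is the most important step in the chapter: it produces the per-employee values that let us move from one row per absence to one row per employee. Because we use mutate(), every original row is still present afterwards; the collapse to one row per employee happens in the next section: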

Absenteeism.by.employee <- Absenteeism.by.employee |>
  group_by(ID) |>
  mutate(`Most common reason for absence` = MODE(`Reason for absence`),
         `Month of absence` = MODE(`Month of absence`),
         `Day of the week` = MODE(`Day of the week`),
         Education = max(Education),
         `Absenteeism time in hours` =
           sum(`Absenteeism time in hours`)/`Service time`,
         `Number of absence` = n()/`Service time`)

Breaking Down the Code

  1. group_by(ID): Groups the data by employee ID so that all subsequent operations are performed separately for each employee.

  2. MODE(`Reason for absence`): Finds the most common absence reason for each employee using our custom function.

  3. MODE(`Month of absence`) and MODE(`Day of the week`): Similarly find the most common month and day of the week for each employee’s absences.

  4. max(Education): Takes the highest education level recorded for each employee (education is encoded numerically, with higher numbers representing higher levels).

  5. sum(`Absenteeism time in hours`)/`Service time`: Calculates the total hours absent divided by years of service, giving an adjusted absenteeism rate that accounts for tenure. Since every row for a given employee has the same Service time value, dividing by it within mutate() produces the correct per-year rate.

  6. n()/`Service time`: Counts the number of absence events for each employee and divides by service time, giving the average number of absences per year of service.
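The behavior of a grouped mutate() is easier to see on a tiny table. The sketch below uses made-up records for two hypothetical employees (only the column names are borrowed from the dataset) and shows that each group's result is repeated on every one of its rows:

```r
library(dplyr)

# Hypothetical absence records: employee 1 has three, employee 2 has one
toy <- tibble(
  ID = c(1, 1, 1, 2),
  `Service time` = c(10, 10, 10, 8),
  `Absenteeism time in hours` = c(4, 8, 8, 24)
)

rates <- toy |>
  group_by(ID) |>
  mutate(
    # total hours per employee, divided by that employee's service time
    `Absenteeism time in hours` =
      sum(`Absenteeism time in hours`) / `Service time`,
    # number of rows (absences) per employee, divided by service time
    `Number of absence` = n() / `Service time`
  )

nrow(rates)                        # 4, no rows were removed
rates$`Absenteeism time in hours`  # 2 2 2 3
rates$`Number of absence`          # 0.3 0.3 0.3 0.125
```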

6.2.4 Summarizing Data

The previous step computed new values within each group but kept all the original rows. Now we collapse the data to one row per employee by taking the mean of each variable:

Absenteeism.by.employee <- Absenteeism.by.employee |>
  group_by(ID) |>
  summarise(across(everything(), mean))

Breaking Down the Code

  1. group_by(ID): Groups by employee ID again to ensure the summary is per-employee.

  2. summarise(across(everything(), mean)): For each employee, computes the mean of every column. Since the mutate() step already set many columns to the same value within each group (e.g., Most common reason for absence is the same for all of an employee’s rows), the mean simply returns that value. For variables like Age and Body mass index, which are constant per employee, the mean is the original value.

The result is a data frame with one row per employee — exactly what we need. After filtering and summarizing, we have 33 employees in our cleaned dataset.
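A minimal sketch of the collapse, again on a hypothetical table whose values already repeat within each ID (as they would after a grouped mutate()):

```r
library(dplyr)

# Hypothetical input: values are constant within each ID
toy <- tibble(
  ID = c(1, 1, 1, 2),
  Age = c(33, 33, 33, 50),
  `Number of absence` = c(0.25, 0.25, 0.25, 0.125)
)

per_employee <- toy |>
  group_by(ID) |>
  summarise(across(everything(), mean))

nrow(per_employee)                 # 2, one row per ID
per_employee$Age                   # 33 50
per_employee$`Number of absence`   # 0.25 0.125
```

Because the values within each group are identical, taking the mean acts as a convenient way of picking that single value.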

6.2.5 Enhancing Date and Time Variables

The Month of absence and Day of the week columns are currently stored as numbers. We convert them to readable labels using the lubridate package:

Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(
    # The data already uses wday()'s numbering (2 = Monday, ..., 6 = Friday)
    `Day of the week`  =
      wday(`Day of the week`, label = TRUE) |> as.character(),
    `Month of absence` =
      month(
        ymd(010101) +
          months(`Month of absence` - 1),
        label = TRUE,
        abbr = TRUE
      ) |> as.character()
  )

Breaking Down the Code

  1. wday(`Day of the week`, label = TRUE): The wday() function converts a numeric day to a weekday name. No offset is needed, because the dataset uses 2 for Monday through 6 for Friday, which matches wday()'s default numbering of 1 for Sunday through 7 for Saturday. The label = TRUE argument returns the name (e.g., “Mon”) instead of a number.

  2. month(ymd(010101) + months(`Month of absence` - 1), label = TRUE, abbr = TRUE): This creates a dummy date and adds months to it, then extracts the abbreviated month name. For example, if Month of absence is 3, it adds 2 months to January 1, arriving at March, and returns “Mar”.

  3. |> as.character(): Converts the labeled factor output to plain character strings.
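A quick way to check how these lubridate helpers behave is to call them on plain numbers. The sketch below assumes lubridate's default week start (Sunday) and an English locale for the labels; the dummy date is written as a string here purely for readability, and the variable names are illustrative.

```r
library(lubridate)

# Numbers 1-7 map to Sunday through Saturday by default
# (labels shown assume an English locale)
day_labels <- wday(1:7, label = TRUE) |> as.character()

# Start from 1 January of a dummy year and step forward 2 months
dummy_start <- ymd("2001-01-01")
third_month <- month(dummy_start + months(3 - 1),
                     label = TRUE, abbr = TRUE) |>
  as.character()

day_labels    # "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"
third_month   # "Mar"
```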

A Simpler Alternative

A lookup vector would also work for converting month numbers to names:

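month_names <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# Index the lookup vector by month number, in place of the month(...) call
Absenteeism.by.employee |>
  mutate(`Month of absence` = month_names[`Month of absence`])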
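Base R also provides a built-in constant, month.abb, holding the same twelve English abbreviations, so the lookup vector does not have to be typed by hand. A minimal sketch with made-up month numbers:

```r
# month.abb is a base R constant: "Jan", "Feb", ..., "Dec"
month_numbers <- c(3, 7, 12)   # hypothetical values
month_labels <- month.abb[month_numbers]

month_labels  # "Mar" "Jul" "Dec"
```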

We use lubridate here because it is a powerful package for working with dates and times that you will encounter in later chapters. This is a good introduction to its capabilities.

6.2.6 Refining the Data Set

We select only the columns needed for subsequent analyses, dropping variables like Seasons, Transportation expense, and others that we won’t use in the modeling chapters:

Absenteeism.by.employee <- Absenteeism.by.employee |>
  select(
    ID,
    `Number of absence`,
    `Absenteeism time in hours`,
    `Most common reason for absence`,
    `Month of absence`,
    `Day of the week`,
    `Body mass index`,
    Age,
    `Social smoker`,
    `Social drinker`,
    Son,
    Pet,
    Education
  )

The select() function keeps only the named columns and drops everything else. The 13 variables retained include the outcome measures (Number of absence, Absenteeism time in hours), employee demographics (Age, Education, Son, Pet), health indicators (Body mass index, Social smoker, Social drinker), and the most common absence context (Most common reason for absence, Month of absence, Day of the week).

6.2.7 Recoding Variables

Several binary and count variables are stored as numbers (0/1 or counts). We recode them into descriptive labels to make the data more readable:

Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(`Social smoker` = ifelse(`Social smoker` == 0,
                         "Non-smoker",
                         "Smoker"),
         `Social drinker` = ifelse(`Social drinker` == 0,
                         "Non-drinker",
                         "Social drinker"),
         Children = ifelse(Son == 0,
                         "Non-parent",
                         "Parent"),
         Pet = ifelse(Pet == 0,
                         "No Pet(s)",
                         "Pet(s)"))|>
  select(-Son, -ID)

Breaking Down the Code

  1. Social smoker and Social drinker: The ifelse() function checks each value — if it is 0, it assigns a descriptive label (“Non-smoker” or “Non-drinker”); otherwise, it assigns the alternative label (“Smoker” or “Social drinker”).

  2. Children: A new column is created from the Son column (number of children). Employees with 0 children are labeled “Non-parent”; those with 1 or more are labeled “Parent”.

  3. Pet: Similarly, employees with 0 pets get “No Pet(s)” and those with pets get “Pet(s)”.

  4. select(-Son, -ID): Removes the Son column (now replaced by Children) and the ID column (no longer needed for analysis).
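The same recoding pattern can be tried on a small hypothetical table (three invented employees; only the column names follow the dataset):

```r
library(dplyr)

# Hypothetical employees; Son is the number of children
toy <- tibble(
  ID = c(1, 2, 3),
  `Social smoker` = c(0, 1, 0),
  Son = c(0, 2, 1)
)

recoded <- toy |>
  mutate(
    `Social smoker` = ifelse(`Social smoker` == 0, "Non-smoker", "Smoker"),
    Children = ifelse(Son == 0, "Non-parent", "Parent")
  ) |>
  select(-Son, -ID)   # drop the columns we no longer need

recoded$`Social smoker`  # "Non-smoker" "Smoker" "Non-smoker"
recoded$Children         # "Non-parent" "Parent" "Parent"
names(recoded)           # "Social smoker" "Children"
```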

6.2.8 Simplifying Education Levels

The Education column is encoded as numbers (1–4). We convert it to a factor with descriptive labels, and also create a simplified binary College variable:

Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(
    College = ifelse(Education >= 2, "college", "high school"),
    Education = factor(Education) |>
      fct_recode(
        "high school" = "1",
        "graduate" = "2",
        "postgraduate" = "3",
        "master and doctor" = "4"
      )
  )

Breaking Down the Code

  1. College: A new binary variable. Employees with Education >= 2 (graduate or above) are labeled “college”; those with level 1 are labeled “high school”. This simplified grouping is useful for analyses where we want to compare employees with and without college education.

  2. factor(Education) |> fct_recode(...): First converts Education to a factor, then uses fct_recode() from the forcats package to replace the numeric codes with descriptive labels: 1 becomes “high school”, 2 becomes “graduate”, 3 becomes “postgraduate”, and 4 becomes “master and doctor”.
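To illustrate the new_label = "old_level" form that fct_recode() expects, here is a sketch on a made-up vector of education codes. Level 4 is left out of the recoding because it does not occur in this toy vector, and fct_recode() warns about levels it cannot find.

```r
library(forcats)

# Hypothetical education codes for five employees
education_codes <- c(1, 3, 1, 2, 1)

# Binary grouping: anything at level 2 or above counts as college
college <- ifelse(education_codes >= 2, "college", "high school")

# fct_recode() takes new_label = "old_level" pairs
education <- factor(education_codes) |>
  fct_recode(
    "high school" = "1",
    "graduate" = "2",
    "postgraduate" = "3"
  )

college            # "high school" "college" "high school" "college" "high school"
levels(education)  # "high school" "graduate" "postgraduate"
```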

6.2.9 Categorizing Absenteeism

We create a binary outcome variable that classifies each employee as having either high or low absenteeism, using the median as the threshold:

Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(`High absenteeism` =
    ifelse(`Absenteeism time in hours` >= median(`Absenteeism time in hours`),
           "High Absenteeism", "Low Absenteeism"))

The median() function calculates the midpoint of the Absenteeism time in hours distribution. Employees at or above the median are labeled “High Absenteeism”; those below are labeled “Low Absenteeism”. This binary classification will be useful in later chapters when we build predictive models.
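A median split can be checked by hand on a few invented values. In the sketch below, which uses base R only, note that the employee sitting exactly at the median falls into the high group because the comparison is >=:

```r
# Hypothetical per-employee absenteeism rates
hours <- c(1.5, 4.0, 2.5, 9.0, 3.0)

threshold <- median(hours)   # 3, the middle of the five sorted values

label <- ifelse(hours >= threshold, "High Absenteeism", "Low Absenteeism")

table(label)
# High Absenteeism: 3 (4.0, 9.0, and the value equal to the median)
# Low Absenteeism:  2
```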

6.2.10 Final Touches

The last step converts all remaining character columns to factors, the form that many R modeling and plotting functions expect for categorical data:

Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(across(where(is_character), as_factor))

The across(where(is_character), as_factor) expression finds every column that is currently stored as character data and converts it to a factor. Factors are R’s way of representing categorical data — they store both the values and the set of possible categories (called “levels”), which is important for statistical modeling and for controlling the order of categories in plots.
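The effect is easy to inspect on a two-column hypothetical table. This sketch swaps in base R's is.character() as the predicate, which performs the same check for this purpose, and shows that as_factor() orders levels by first appearance:

```r
library(dplyr)
library(forcats)

# Hypothetical data with one character column and one numeric column
toy <- tibble(
  Pet = c("Pet(s)", "No Pet(s)", "Pet(s)"),
  Age = c(33, 50, 41)
)

as_factors <- toy |>
  mutate(across(where(is.character), as_factor))

class(as_factors$Pet)    # "factor"
levels(as_factors$Pet)   # "Pet(s)" "No Pet(s)" (order of first appearance)
class(as_factors$Age)    # "numeric", left unchanged
```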