8.2 Preparing the Dataset

With the dataset loaded, our analysis turns to seasonal patterns: does absenteeism vary from month to month over the year? To answer this, we organize the data by month, which lets us assess each month and compare it with the others.

We calculate two key metrics for each month: the number of absence records and the total absenteeism time in hours. Because the dataset spans a complete set of months within its recorded timeframe, aggregate totals give a consistent basis for comparison across all months and make it clear when absenteeism peaks or declines.
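
The assumption that every month is represented can be checked before aggregating. The following is a minimal sketch in base R; the data frame toy is invented for illustration and stands in for the real data, and months_present and months_missing are hypothetical names.

```r
# Toy stand-in for the absenteeism data; the values are invented for illustration.
toy <- data.frame(
  Month.of.absence = c(0, 1, 1, 2, 3, 3, 3),
  Absenteeism.time.in.hours = c(0, 4, 8, 2, 3, 1, 16)
)

# Month codes that actually occur, ignoring the placeholder code 0.
months_present <- sort(unique(toy$Month.of.absence[toy$Month.of.absence > 0]))

# Calendar months with no records at all; an empty result means full coverage.
months_missing <- setdiff(1:12, months_present)
months_missing
```

With this toy data only months 1 to 3 occur, so months_missing holds 4 through 12. On the real dataset, an empty result would support comparing raw monthly totals.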

The following R code carries out these steps: it groups the data by month and then summarizes the two metrics, giving us a quantitative basis for comparing absenteeism across the year.

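library(dplyr)  # provides filter(), group_by(), summarise(), mutate(), n()

# Monthly absence counts and total absenteeism hours
Absenteeism.Month <- absenteeism |>
  filter(Month.of.absence > 0) |>   # drop records whose month is coded as 0
  group_by(Month.of.absence) |>
  summarise(
    Number.of.absence = n(),        # number of absence records in the month
    Absenteeism.time.in.hours = sum(Absenteeism.time.in.hours)
  ) |>
  mutate(
    # replace month numbers 1-12 with "Jan", ..., "Dec"
    Month.of.absence = as.factor(month.abb[Month.of.absence])
  )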
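
To see what the pipeline produces, it can be run on a few rows. The sketch below uses an invented data frame (toy) with the same two column names; the values and the name toy_by_month are made up for illustration and are not taken from the real dataset.

```r
library(dplyr)

# Toy stand-in for the absenteeism data; the values are invented for illustration.
toy <- data.frame(
  Month.of.absence = c(0, 1, 1, 2, 3, 3, 3),
  Absenteeism.time.in.hours = c(0, 4, 8, 2, 3, 1, 16)
)

toy_by_month <- toy |>
  filter(Month.of.absence > 0) |>   # removes the single month-0 row
  group_by(Month.of.absence) |>
  summarise(
    Number.of.absence = n(),
    Absenteeism.time.in.hours = sum(Absenteeism.time.in.hours)
  ) |>
  mutate(Month.of.absence = as.factor(month.abb[Month.of.absence]))

toy_by_month
```

The result has one row per month: Jan with 2 absences and 12 hours, Feb with 1 absence and 2 hours, and Mar with 3 absences and 20 hours. The month-0 row contributes to none of them.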

Breaking Down the Code

  1. Start Processing with absenteeism Dataframe: The code takes the absenteeism data frame and passes it from one step to the next with R's native pipe operator (|>), available since R 4.1.

  2. Filter Out Records Without a Valid Month: Applies filter(Month.of.absence > 0) to exclude rows where Month.of.absence is zero or missing, keeping only records with a defined month of absence.

  3. Group Data by Month of Absence: Uses group_by(Month.of.absence) to split the data into one group per month (1 for January, 2 for February, and so on).

  4. Summarize Data Within Each Group: Implements summarise(Number.of.absence = n(), Absenteeism.time.in.hours = sum(Absenteeism.time.in.hours)) to compute two key statistics for each month:

    • Number.of.absence: Counts the total number of absence entries using n(), which returns the number of rows in each group.
    • Absenteeism.time.in.hours: Sums up the total hours of absenteeism.

  5. Convert Numeric Month to Abbreviated Month Name and Factorize It: Uses mutate(Month.of.absence = as.factor(month.abb[Month.of.absence])) to transform Month.of.absence from numeric values to the corresponding abbreviated names (e.g., “Jan” for January) using R's built-in month.abb vector, and then converts these names into a factor.
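
One point about step 5 may matter for later plots or tables: as.factor() orders factor levels alphabetically, not by calendar. If calendar order is wanted, one option is to build the factor with explicit levels. This is an alternative to the code above, not part of it, and the names month_codes, alphabetical and calendar are illustrative.

```r
# A few month codes in calendar order (made-up illustration).
month_codes <- c(1, 2, 3, 4)

# as.factor() sorts the levels alphabetically.
alphabetical <- as.factor(month.abb[month_codes])
levels(alphabetical)   # "Apr" "Feb" "Jan" "Mar"

# factor() with explicit levels keeps the calendar order of month.abb.
calendar <- factor(month.abb[month_codes], levels = month.abb)
levels(calendar)       # all twelve abbreviations, "Jan" through "Dec"
```

In the pipeline, this would mean replacing the as.factor() call inside mutate() with factor(month.abb[Month.of.absence], levels = month.abb). Whether that change is needed depends on how the summarized data is used afterwards.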