8.2 Preparing the Dataset
After loading the dataset, our analysis focuses on identifying seasonal patterns in absenteeism over the year. We aim to determine whether absenteeism fluctuates across different months. To do this, we organize the data by month, which allows us to assess and compare absenteeism for each period.
We calculate two key metrics for each month: the total number of absences and the total absenteeism time in hours. Because the dataset spans a complete set of months within its recorded timeframe, aggregate totals give every month the same basis for comparison and make it clear when absenteeism peaks or declines.
The specific data processing steps are outlined in the following R code, which groups the data by month before summarizing the key metrics of interest. This method helps us quantitatively measure and compare absenteeism across different times of the year.
Absenteeism.Month <- absenteeism |>
  filter(Month.of.absence > 0) |>
  group_by(Month.of.absence) |>
  summarise(
    Number.of.absence = n(),
    Absenteeism.time.in.hours = sum(Absenteeism.time.in.hours)
  ) |>
  mutate(
    Month.of.absence = as.factor(month.abb[Month.of.absence])
  )
Breaking Down the Code
1. Start Processing with the absenteeism Data Frame: The code takes the absenteeism dataset and begins data manipulation with the pipe operator (|>).
2. Filter Out Records with No Absences: Applies filter(Month.of.absence > 0) to exclude rows where Month.of.absence is zero or unspecified, keeping only records with a defined month of absence.
3. Group Data by Month of Absence: Uses group_by(Month.of.absence) to split the data into one group per month (e.g., 1 for January, 2 for February).
4. Summarize Data Within Each Group: Uses summarise(Number.of.absence = n(), Absenteeism.time.in.hours = sum(Absenteeism.time.in.hours)) to compute two key statistics for each month:
   - Number.of.absence: counts the absence entries using n(), which returns the number of rows in each group.
   - Absenteeism.time.in.hours: sums the hours of absenteeism in each group.
5. Convert Numeric Month to Abbreviated Month Name and Factorize It: Uses mutate(Month.of.absence = as.factor(month.abb[Month.of.absence])) to replace the numeric values of Month.of.absence with their abbreviated names (e.g., “Jan” for January) by indexing R's built-in month.abb character vector, then converts these names into a factor.
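To see the steps above in action, the following minimal sketch runs the same pipeline on a small made-up data frame. The toy values and the names toy, toy_month, Month.alpha and Month.calendar are hypothetical and are not part of the absenteeism dataset. The sketch also shows one detail worth knowing: as.factor() orders factor levels alphabetically, so a later plot or table may not follow calendar order. Passing levels = month.abb to factor() is one possible way to keep calendar order, shown here as an alternative rather than as the method used in this chapter.

```r
library(dplyr)

# Toy stand-in for the absenteeism data (values invented for illustration).
# The first row has month 0 and should be dropped by the filter.
toy <- data.frame(
  Month.of.absence          = c(0, 1, 1, 4, 4, 4, 8),
  Absenteeism.time.in.hours = c(0, 8, 2, 4, 16, 1, 3)
)

toy_month <- toy |>
  filter(Month.of.absence > 0) |>
  group_by(Month.of.absence) |>
  summarise(
    Number.of.absence         = n(),
    Absenteeism.time.in.hours = sum(Absenteeism.time.in.hours)
  ) |>
  mutate(
    # as.factor() sorts the levels alphabetically
    Month.alpha    = as.factor(month.abb[Month.of.absence]),
    # factor() with explicit levels keeps calendar order (assumed alternative)
    Month.calendar = factor(month.abb[Month.of.absence], levels = month.abb)
  )

toy_month$Number.of.absence          # 2 3 1
toy_month$Absenteeism.time.in.hours  # 10 21 3
levels(toy_month$Month.alpha)        # "Apr" "Aug" "Jan"
levels(toy_month$Month.calendar)     # "Jan" "Feb" ... "Dec"
```

With explicit levels, months that have no records still appear as (empty) factor levels, which can be useful when every month should show up on a plot axis.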