10.1 Data Loading and Preparation

10.1.1 Load Data

We begin by loading the dataset titled “Absenteeism at work” from an online CSV file. The code snippet below imports the data into our R environment, utilizing a semicolon as the delimiter to ensure accurate column separation. This step allows us to have the necessary data ready for detailed analysis and modeling in our study on absenteeism.

absenteeism <- read.csv(
    "https://ljkelly3141.github.io/datasets/bi-book/Absenteeism_at_work.csv",
    sep = ";"
)

Now, let's take a quick look at the data.

head(absenteeism)
(Output: the first six rows of the data frame, with columns ID, Reason.for.absence, Month.of.absence, Day.of.the.week, Seasons, Transportation.expense, Distance.from.Residence.to.Work, Service.time, Age, Work.load.Average.day, Hit.target, Disciplinary.failure, Education, Son, Social.drinker, Social.smoker, Pet, Weight, Height, Body.mass.index, and Absenteeism.time.in.hours.)

10.1.2 Prepare the data

The process of analyzing absenteeism within an organization requires precise data handling and transformation to ensure meaningful insights. This section outlines how we manipulate the dataset titled absenteeism using R, employing various data manipulation techniques to prepare it for deeper analysis. Each step in the process builds on the previous one, refining and enriching the dataset to enhance its analytical value. Here is a detailed breakdown:

Step 1. Data Filtering

The initial stage of our analysis involves filtering the dataset to focus on employees with a substantial duration of service. This ensures that the data reflects stable employment patterns, which is crucial for reliable analysis.

# Load dplyr for the data-manipulation verbs used below
library(dplyr)

# Filter the data for employees with a Service time of at least 6
filtered_data <- absenteeism |>
  filter(Service.time >= 6)

Service.time measures the duration of employment in months; we examine how it potentially correlates with absenteeism. We filter the dataset to include only employees who have been with the company for at least six months, so that our analysis rests on substantial and reliable absenteeism data.
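To see the filter mechanics in isolation, here is a minimal base-R sketch on a hypothetical toy data frame (the column names mirror the real dataset, but the values are invented):

```r
# Hypothetical toy data frame mirroring the Service.time column
toy <- data.frame(ID = 1:4, Service.time = c(3, 6, 12, 5))

# Keep only rows with Service.time of at least 6,
# the base-R equivalent of filter(Service.time >= 6)
kept <- toy[toy$Service.time >= 6, ]

nrow(kept)  # two employees remain
```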

Step 2. Grouping and Mutation

Next, we group the data by employee ID to analyze absenteeism patterns individually and adjust absenteeism metrics according to each employee’s service duration. This normalization is crucial as it accounts for varying employment periods among employees, ensuring a fair comparison across all data points.

# Group by employee ID and calculate new variables
grouped_data <- filtered_data |>
  group_by(ID) |>
  mutate(Absenteeism.time.in.hours = sum(Absenteeism.time.in.hours) / Service.time,
         Number.of.absence = n() / Service.time)

Breaking Down the Code

  1. Grouping by ID: The data is first grouped by the employee ID. This allows us to aggregate and analyze data specific to each individual, which is essential for identifying unique absenteeism patterns.
  2. Calculating Absenteeism.time.in.hours: We calculate the total hours of absenteeism per employee and then divide this by their total service time. This results in the average hours of absenteeism per unit of service time, providing a normalized view of absenteeism that can be compared across employees regardless of how long they have been with the company.
  3. Calculating Number.of.absence: Similarly, n() counts the number of absentee instances for each employee and divides this number by the service time. This calculation offers another perspective on absenteeism, focusing on the frequency of absences rather than the total duration, adjusted for the length of service.
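The group-wise arithmetic described above can be sketched in base R on toy values (the IDs, service times, and hours here are invented for illustration; `ave()` plays the role of `group_by()` + `mutate()`):

```r
# Hypothetical toy data: two records for employee 1, one for employee 2.
# Service.time is constant within each employee, as in the real data.
toy <- data.frame(
  ID = c(1, 1, 2),
  Service.time = c(10, 10, 20),
  Absenteeism.time.in.hours = c(4, 6, 8)
)

# Per-employee total hours and record counts, computed group-wise
totals <- ave(toy$Absenteeism.time.in.hours, toy$ID, FUN = sum)
counts <- ave(toy$Absenteeism.time.in.hours, toy$ID, FUN = length)

# Normalize both by service time, mirroring the mutate() step
toy$Absenteeism.time.in.hours <- totals / toy$Service.time
toy$Number.of.absence <- counts / toy$Service.time
```

For employee 1 this yields (4 + 6) / 10 = 1 hour and 2 / 10 = 0.2 absences per unit of service time.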

Step 3. Summarization

After calculating key absenteeism metrics, we proceed to aggregate the data by employee, averaging each metric for every individual. This step simplifies the data into a more manageable form for each employee, making it easier to analyze overall trends and patterns.

# Summarize the grouped data by taking the mean of all columns
summarized_data <- grouped_data |>
  summarise(across(everything(), mean))

Breaking Down the Code

  1. grouped_data: This is the dataset that has previously been grouped by one or more variables (using group_by()). In this context, it is the data grouped by employee ID.
  2. summarise(): This function is used to create a new data frame that contains summary statistics of the grouped data. It applies the specified summary function to each group.
  3. across(everything(), mean): Inside the summarise() function, the across() function targets all columns (everything()) in the grouped data. It then applies the mean() function to each column within each group. The result is that for each group, the mean value of every column is calculated.
  4. summarized_data: This is the new data frame created from the summarization process. Each row in summarized_data represents a group from grouped_data, with each column containing the mean value of the corresponding original column for that group.
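A minimal base-R analogue of this per-group averaging, using invented values (the formula method of `aggregate()` computes the mean of every other column within each ID group, much like `summarise(across(everything(), mean))`):

```r
# Hypothetical toy data: two records per employee
toy <- data.frame(
  ID = c(1, 1, 2, 2),
  Age = c(30, 30, 40, 40),
  Hours = c(2, 4, 6, 10)
)

# Mean of every non-grouping column within each ID group
means <- aggregate(. ~ ID, data = toy, FUN = mean)

means  # one row per employee: Hours is 3 for ID 1 and 8 for ID 2
```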

Step 4. Selection and Rearrangement

We then select specific columns crucial for understanding the context of absenteeism, such as demographic factors and lifestyle choices.

# Select specific columns for further analysis
selected_data <- summarized_data |>
  select(ID, Number.of.absence, Absenteeism.time.in.hours, Body.mass.index, Age, Social.smoker, Social.drinker, Son, Pet)

Breaking Down the Code

  1. selected_data: This is the new data frame created by selecting certain columns from summarized_data.
  2. select(): This function is used here to choose specific columns from the summarized dataset. The columns chosen are:
  • ID: Employee identification number.
  • Number.of.absence: The average number of times an employee was absent.
  • Absenteeism.time.in.hours: The average total hours of absenteeism per employee.
  • Body.mass.index: Average body mass index, which can correlate with health-related absenteeism.
  • Age: Average age of employees, as age can affect absenteeism patterns.
  • Social.smoker: Smoking status, which may influence absenteeism due to health issues.
  • Social.drinker: Drinking status, potentially relevant to social habits affecting work attendance.
  • Son: Number of children, as family responsibilities can impact absenteeism.
  • Pet: Pet ownership, which might affect time off needed for pet care.

This focused selection of variables allows for a deeper analysis of factors most likely to influence absenteeism patterns, helping to pinpoint specific areas for intervention or further study.

Step 5. Data Recoding

To enhance the interpretability of the data, we convert numeric indicator variables into categorical ones, making the data clearer and easier to analyze.

# Recode the selected data
recoded_data <- selected_data |>
  mutate(Social.smoker = ifelse(Social.smoker == 0, "Non-smoker", "Smoker"),
         Social.drinker = ifelse(Social.drinker == 0, "Non-drinker", "Social drinker"),
         Children = ifelse(Son == 0, "Non-parent", "Parent"),
         Pet = ifelse(Pet == 0, "No Pet(s)", "Pet(s)"))

Breaking Down the Code

  1. recoded_data: This new data frame is created by modifying specific columns in selected_data.
  2. mutate(): This function is used to alter or create new columns in the data frame. Each specified column is transformed based on the provided conditions:
  • Social.smoker: Transforms the Social.smoker column from a binary (0 or 1) to a categorical label, where 0 becomes “Non-smoker” and any non-zero value becomes “Smoker”.
  • Social.drinker: Similar to Social.smoker, this recodes the Social.drinker column where 0 is recoded as “Non-drinker” and any non-zero value as “Social drinker”.
  • Children: The Son column is recoded into “Non-parent” for 0 and “Parent” for any non-zero values, providing clear labels about parental status.
  • Pet: The Pet column changes from a numeric indicator to a categorical description, where 0 becomes “No Pet(s)” and any non-zero value becomes “Pet(s)”.

This recoding process clarifies the impact of lifestyle factors on absenteeism by providing descriptive categories that reflect the characteristics of each employee.
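The recoding pattern is the same for each column; here is a minimal sketch on an invented indicator vector:

```r
# Hypothetical 0/1 smoker indicator, as stored in the raw data
smoker_flag <- c(0, 1, 0, 1)

# 0 becomes "Non-smoker", any non-zero value becomes "Smoker",
# mirroring the ifelse() calls inside mutate()
smoker_label <- ifelse(smoker_flag == 0, "Non-smoker", "Smoker")

smoker_label  # "Non-smoker" "Smoker" "Non-smoker" "Smoker"
```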

Step 6. Final Selection and Cleanup

Finally, we refine the dataset by eliminating columns that are not essential for our analysis or presentation. This step ensures that the dataset is streamlined, focusing only on the most relevant information.

# Final selection of the data, removing unnecessary columns
Absenteeism.by.employee <- recoded_data |>
  select(-Son, -ID)

Breaking Down the Code

  1. Absenteeism.by.employee: This is the new data frame created by removing specific columns from recoded_data.
  2. select(-Son, -ID): The select() function is used here with a minus sign (-) to exclude columns from the data frame. In this case:
  • -Son: Removes the Son column, which has already been transformed into a more informative Children category.
  • -ID: Removes the ID column to de-personalize the data, focusing solely on the statistical patterns and trends without individual identifiers.

10.1.3 Define High absence

To complete the preparation, we create a new variable that categorizes employees based on their absenteeism rates. This classification divides employees into “High Absenteeism” or “Low Absenteeism” groups, using the median of absenteeism hours per month of service as the threshold. Employees whose absenteeism time meets or exceeds this median are classified as having “High Absenteeism.”

# Set the threshold at the median (50th percentile)
High.Absence <- 0.5

Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(
    High.absenteeism = ifelse(
      Absenteeism.time.in.hours >=
        quantile(Absenteeism.time.in.hours, High.Absence),
      "High Absenteeism",
      "Low Absenteeism") |> as.factor())

Breaking Down the Code

  • High.Absence: This variable is set at 0.5, representing the 50th percentile, commonly known as the median.
  • quantile(Absenteeism.time.in.hours, High.Absence): This function calculates the value at the 50th percentile of the Absenteeism.time.in.hours distribution within the data. It determines the threshold between “High Absenteeism” and “Low Absenteeism.”
  • ifelse(): This function checks whether each employee’s absenteeism time is greater than or equal to the median value. If true, it assigns “High Absenteeism”; otherwise, it assigns “Low Absenteeism.”
  • as.factor(): This function is used to convert the resulting labels (“High Absenteeism” or “Low Absenteeism”) into a factor, which is a categorical variable suitable for analysis in R.
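The same threshold logic can be demonstrated on a small invented vector:

```r
# Hypothetical normalized absenteeism hours for four employees
hours <- c(0.2, 0.5, 0.9, 1.4)

# Median via quantile() at the 0.5 probability (0.7 for these values)
cutoff <- quantile(hours, 0.5)

# Classify each employee relative to the median, then convert to a factor
label <- as.factor(
  ifelse(hours >= cutoff, "High Absenteeism", "Low Absenteeism")
)
```

With an even split like this, half the employees land in each category, which is what a median threshold guarantees in the absence of ties.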

This step categorizes employees, allowing for detailed analyses and comparisons based on their absenteeism levels, enhancing our understanding of absenteeism patterns within the organization. We will use this classification to conduct a logistic regression.