10  Case Study: Regression Models

PRELIMINARY AND INCOMPLETE

In this case study, we will continue to analyze the “Absenteeism at Work Data Set.” We aim to predict absenteeism using a linear regression model. This statistical approach will allow us to explore and quantify the relationships between absenteeism and various explanatory variables. The selection of these variables will be guided by both the theoretical frameworks relevant to our research and the availability of data. By integrating theory with practical data considerations, our model will provide insights into the factors that influence absenteeism, thus facilitating more informed decision-making.

10.1 Data Loading and Preparation

10.1.1 Load Data

We begin by loading the dataset titled “Absenteeism at work” from an online CSV file. The code snippet below imports the data into our R environment, utilizing a semicolon as the delimiter to ensure accurate column separation. This step allows us to have the necessary data ready for detailed analysis and modeling in our study on absenteeism.

absenteeism <- read.csv(
    "https://ljkelly3141.github.io/datasets/bi-book/Absenteeism_at_work.csv",
    sep = ";"
)
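The remainder of this case study relies on the dplyr verbs (filter(), group_by(), mutate(), summarise(), select(), and the %>% pipe) and on summ() from the jtools package for model summaries. If these packages have not already been loaded in an earlier chapter, a minimal setup chunk (assuming both packages are installed) would be:

# Packages assumed throughout this case study
library(tidyverse)   # dplyr verbs and the %>% pipe
library(jtools)      # summ() for compact model summaries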

Now, let's take a quick look at the data.

head(absenteeism)
ID Reason.for.absence Month.of.absence Day.of.the.week Seasons Transportation.expense Distance.from.Residence.to.Work Service.time Age Work.load.Average.day Hit.target Disciplinary.failure Education Son Social.drinker Social.smoker Pet Weight Height Body.mass.index Absenteeism.time.in.hours
11 26 7 3 1 289 36 13 33 240 97 0 1 2 1 0 1 90 172 30 4
36 0 7 3 1 118 13 18 50 240 97 1 1 1 1 0 0 98 178 31 0
3 23 7 4 1 179 51 18 38 240 97 0 1 0 1 0 0 89 170 31 2
7 7 7 5 1 279 5 14 39 240 97 0 1 2 1 1 0 68 168 24 4
11 23 7 5 1 289 36 13 33 240 97 0 1 2 1 0 1 90 172 30 2
3 23 7 6 1 179 51 18 38 240 97 0 1 0 1 0 0 89 170 31 2

10.1.2 Prepare the Data

The process of analyzing absenteeism within an organization requires precise data handling and transformation to ensure meaningful insights. This section outlines how we manipulate the dataset titled absenteeism using R, employing various data manipulation techniques to prepare it for deeper analysis. Each step in the process builds on the previous one, refining and enriching the dataset to enhance its analytical value. Here is a detailed breakdown:

Step 1. Data Filtering

The initial stage of our analysis involves filtering the dataset to focus on employees with a substantial duration of service. This ensures that the data reflect stable employment patterns, which is crucial for reliable analysis.

# Filter the data for employees with a Service time of at least 6
filtered_data <- absenteeism %>%
  filter(Service.time >= 6)

Service.time measures the duration of employment in months. By restricting the dataset to employees who have been with the company for at least six months, we ensure that our analysis is based on substantial and reliable absenteeism data.
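As a quick, illustrative sanity check (not part of the original pipeline), we can compare how many records remain after applying the filter:

# Compare record counts before and after the service-time filter
nrow(absenteeism)
nrow(filtered_data)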

Step 2. Grouping and Mutation

Next, we group the data by employee ID to analyze absenteeism patterns individually and adjust absenteeism metrics according to each employee’s service duration. This normalization is crucial as it accounts for varying employment periods among employees, ensuring a fair comparison across all data points.

# Group by employee ID and calculate new variables
grouped_data <- filtered_data %>%
  group_by(ID) %>%
  mutate(Absenteeism.time.in.hours = sum(Absenteeism.time.in.hours) / Service.time,
         Number.of.absence = n() / Service.time)

Breaking Down the Code

  1. Grouping by ID: The data is first grouped by the employee ID. This allows us to aggregate and analyze data specific to each individual, which is essential for identifying unique absenteeism patterns.
  2. Calculating Absenteeism.time.in.hours: We calculate the total hours of absenteeism per employee and then divide this by their total service time. This results in the average hours of absenteeism per unit of service time, providing a normalized view of absenteeism that can be compared across employees regardless of how long they have been with the company.
  3. Calculating Number.of.absence: Similarly, n() counts the number of absentee instances for each employee and divides this number by the service time. This calculation offers another perspective on absenteeism, focusing on the frequency of absences rather than the total duration, adjusted for the length of service.
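To make the effect of these grouped calculations concrete, here is a small toy example with hypothetical values (two employees, fixed service times). Note that mutate() keeps one row per original record, so the normalized values repeat within each employee until we summarize in the next step:

# Toy illustration of grouped normalization (hypothetical data)
library(dplyr)

toy <- tibble(
  ID = c(1, 1, 2),
  Service.time = c(10, 10, 20),
  Absenteeism.time.in.hours = c(4, 6, 8)
)

toy %>%
  group_by(ID) %>%
  mutate(Absenteeism.time.in.hours = sum(Absenteeism.time.in.hours) / Service.time,
         Number.of.absence = n() / Service.time)
# Employee 1: 10 total hours / 10 = 1 hour, and 2 absences / 10 = 0.2, per unit of service time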

Step 3. Summarization

After calculating key absenteeism metrics, we proceed to aggregate the data by employee, averaging each metric for every individual. This step simplifies the data into a more manageable form for each employee, making it easier to analyze overall trends and patterns.

# Summarize the grouped data by taking the mean of all columns
summarized_data <- grouped_data %>%
  summarise(across(everything(), mean))

Breaking Down the Code

  1. grouped_data: This is the dataset created in Step 2, grouped by employee ID with group_by(ID), so each group corresponds to a single employee.
  2. summarise(): This function is used to create a new data frame that contains summary statistics of the grouped data. It applies the specified summary function to each group.
  3. across(everything(), mean): Inside the summarise() function, the across() function targets all columns (everything()) in the grouped data. It then applies the mean() function to each column within each group. The result is that for each group, the mean value of every column is calculated.
  4. summarized_data: This is the new data frame created from the summarization process. Each row in summarized_data represents a group from grouped_data, with each column containing the mean value of the corresponding original column for that group.

Step 4. Selection and Rearrangement

We then select specific columns crucial for understanding the context of absenteeism, such as demographic factors and lifestyle choices.

# Select specific columns for further analysis
selected_data <- summarized_data %>%
  select(ID, Number.of.absence, Absenteeism.time.in.hours, Body.mass.index, Age, Social.smoker, Social.drinker, Son, Pet)

Breaking Down the Code

  1. selected_data: This is the new data frame created by selecting certain columns from summarized_data.
  2. select(): This function is used here to choose specific columns from the summarized dataset. The columns chosen are:
  • ID: Employee identification number.
  • Number.of.absence: The number of absence records per unit of service time, as calculated in Step 2.
  • Absenteeism.time.in.hours: The hours of absenteeism per unit of service time, as calculated in Step 2.
  • Body.mass.index: Average body mass index, which can correlate with health-related absenteeism.
  • Age: Average age of employees, as age can affect absenteeism patterns.
  • Social.smoker: Smoking status, which may influence absenteeism due to health issues.
  • Social.drinker: Drinking status, potentially relevant to social habits affecting work attendance.
  • Son: Number of children, as family responsibilities can impact absenteeism.
  • Pet: Pet ownership, which might affect time off needed for pet care.

This focused selection of variables allows for a deeper analysis of factors most likely to influence absenteeism patterns, helping to pinpoint specific areas for intervention or further study.

Step 5. Data Recoding

To enhance the interpretability of the data, we convert numeric indicator variables into categorical ones, making the data clearer and easier to analyze.

# Recode the selected data
recoded_data <- selected_data %>%
  mutate(Social.smoker = ifelse(Social.smoker == 0, "Non-smoker", "Smoker"),
         Social.drinker = ifelse(Social.drinker == 0, "Non-drinker", "Social drinker"),
         Children = ifelse(Son == 0, "Non-parent", "Parent"),
         Pet = ifelse(Pet == 0, "No Pet(s)", "Pet(s)"))

Breaking Down the Code

  1. recoded_data: This new data frame is created by modifying specific columns in selected_data.
  2. mutate(): This function is used to alter or create new columns in the data frame. Each specified column is transformed based on the provided conditions:
  • Social.smoker: Transforms the Social.smoker column from a binary (0 or 1) to a categorical label, where 0 becomes “Non-smoker” and any non-zero value becomes “Smoker”.
  • Social.drinker: Similar to Social.smoker, this recodes the Social.drinker column where 0 is recoded as “Non-drinker” and any non-zero value as “Social drinker”.
  • Children: The Son column is recoded into “Non-parent” for 0 and “Parent” for any non-zero values, providing clear labels about parental status.
  • Pet: The Pet column changes from a numeric indicator to a categorical description, where 0 becomes “No Pet(s)” and any non-zero value becomes “Pet(s)”.

This recoding process clarifies the impact of lifestyle factors on absenteeism by providing descriptive categories that reflect the characteristics of each employee.
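A quick frequency check (purely illustrative) confirms that each recoded variable now contains only the intended labels:

# Tabulate the recoded categorical variables
recoded_data %>% count(Social.smoker)
recoded_data %>% count(Social.drinker, Children, Pet)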

Step 6. Final Selection and Cleanup

Finally, we refine the dataset by eliminating columns that are not essential for our analysis or presentation. This step ensures that the dataset is streamlined, focusing only on the most relevant information.

# Final selection of the data, removing unnecessary columns
Absenteeism.by.employee <- recoded_data %>%
  select(-Son, -ID)

Breaking Down the Code

  1. Absenteeism.by.employee: This is the new data frame created by removing specific columns from recoded_data.
  2. select(-Son, -ID): The select() function is used here with a minus sign (-) to exclude columns from the data frame. In this case:
  • -Son: Removes the Son column, which has already been transformed into a more informative Children category.
  • -ID: Removes the ID column to de-personalize the data, focusing solely on the statistical patterns and trends without individual identifiers.

10.1.3 Define High Absence

Finally, we create a new variable that categorizes employees based on their absenteeism rates. This classification divides employees into “High Absenteeism” or “Low Absenteeism” groups, using the median of absenteeism hours per month of service as the threshold. Employees whose absenteeism time meets or exceeds this median are classified as having “High Absenteeism.”

# Set the quantile threshold: 0.5 corresponds to the median
High.Absence <- 0.5

Absenteeism.by.employee <- Absenteeism.by.employee %>%
  mutate(
    High.absenteeism = ifelse(
      Absenteeism.time.in.hours >= 
        quantile(Absenteeism.time.in.hours, High.Absence),
      "High Absenteeism",
      "Low Absenteeism") |> as.factor())

Breaking Down the Code

  • High.Absence: This variable is set at 0.5, representing the 50th percentile, commonly known as the median.
  • quantile(Absenteeism.time.in.hours, High.Absence): This function calculates the value at the 50th percentile of the Absenteeism.time.in.hours distribution within the data. It determines the threshold between “High Absenteeism” and “Low Absenteeism.”
  • ifelse(): This function checks whether each employee’s absenteeism time is greater than or equal to the median value. If true, it assigns “High Absenteeism”; otherwise, it assigns “Low Absenteeism.”
  • as.factor(): This function is used to convert the resulting labels (“High Absenteeism” or “Low Absenteeism”) into a factor, which is a categorical variable suitable for analysis in R.

This step categorizes employees, allowing for detailed analyses and comparisons based on their absenteeism levels and enhancing our understanding of absenteeism patterns within the organization. We will use this new variable later to fit a logistic regression model.
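Before moving on to modeling, it is worth confirming the resulting class balance; with a median split we expect the two groups to be roughly equal in size. A quick check (illustrative):

# How many employees fall into each absenteeism category?
table(Absenteeism.by.employee$High.absenteeism)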

10.2 Modeling Absenteeism

10.2.1 Linear Model Prediction of Absenteeism Time in Hours

10.2.1.1 Model with all predictors

To predict Absenteeism.time.in.hours and understand its underlying factors, we use a linear regression model. The model identifies key variables that influence absenteeism by assessing their statistical significance and effect sizes.

# Initial linear model excluding some predictors
Absenteeism.lm <- lm(Absenteeism.time.in.hours ~ . 
                     - Number.of.absence 
                     - High.absenteeism,
                     data = Absenteeism.by.employee)
summ(Absenteeism.lm)
MODEL INFO:
Observations: 33
Dependent Variable: Absenteeism.time.in.hours
Type: OLS linear regression 

MODEL FIT:
F(6,26) = 1.54, p = 0.21
R² = 0.26
Adj. R² = 0.09 

Standard errors: OLS
----------------------------------------------------------
                              Est.    S.E.   t val.      p
------------------------- -------- ------- -------- ------
(Intercept)                  36.69   14.18     2.59   0.02
Body.mass.index              -0.44    0.52    -0.84   0.41
Age                          -0.37    0.30    -1.22   0.23
Social.smokerSmoker         -12.21    5.36    -2.28   0.03
Social.drinkerSocial          1.05    4.78     0.22   0.83
drinker                                                   
PetPet(s)                    -4.12    5.49    -0.75   0.46
ChildrenParent                7.11    5.74     1.24   0.23
----------------------------------------------------------

Breaking Down the Code

  • The lm() function fits a linear regression model, predicting Absenteeism.time.in.hours using all available variables in Absenteeism.by.employee except Number.of.absence and High.absenteeism.
  • summ() from the jtools package provides an enhanced summary of the linear model, showing estimates, standard errors, t-values, and p-values, making interpretation straightforward.

Results Explanation:

  • Model Info: The analysis includes 33 observations.
  • Model Fit: The overall fit of the model is weak (F(6,26) = 1.54, p = 0.21, R² = 0.26, Adj. R² = 0.09), indicating that the variables explain only 26% of the variability in absenteeism hours. Adjustments for the number of predictors lead to a lower adjusted R² of 0.09.
  • Coefficients: The intercept and Social.smokerSmoker are statistically significant (p < 0.05), suggesting a baseline absenteeism time when all predictors are zero and a notable decrease in absenteeism for smokers, respectively.

10.2.1.2 Refine the model variable selection

Next, we refine the model using stepwise regression based on the AIC criterion to select the most informative variables. Stepwise regression iteratively adds or removes predictors to minimize the AIC; lower AIC values indicate a better balance between model fit and the number of parameters.

Absenteeism.lm.step <- step(Absenteeism.lm) 
Start:  AIC=168.35
Absenteeism.time.in.hours ~ (Number.of.absence + Body.mass.index + 
    Age + Social.smoker + Social.drinker + Pet + Children + High.absenteeism) - 
    Number.of.absence - High.absenteeism

                  Df Sum of Sq    RSS    AIC
- Social.drinker   1      6.56 3553.7 166.41
- Pet              1     76.61 3623.8 167.06
- Body.mass.index  1     97.36 3644.5 167.25
- Age              1    201.96 3749.1 168.18
- Children         1    209.17 3756.3 168.25
<none>                         3547.2 168.35
- Social.smoker    1    706.50 4253.7 172.35

Step:  AIC=166.42
Absenteeism.time.in.hours ~ Body.mass.index + Age + Social.smoker + 
    Pet + Children

                  Df Sum of Sq    RSS    AIC
- Body.mass.index  1     91.08 3644.8 165.25
- Pet              1    117.44 3671.2 165.49
<none>                         3553.7 166.41
- Age              1    225.80 3779.5 166.45
- Children         1    281.88 3835.6 166.93
- Social.smoker    1    702.21 4256.0 170.37

Step:  AIC=165.25
Absenteeism.time.in.hours ~ Age + Social.smoker + Pet + Children

                Df Sum of Sq    RSS    AIC
- Pet            1    158.56 3803.4 164.66
<none>                       3644.8 165.25
- Children       1    256.68 3901.5 165.50
- Age            1    348.79 3993.6 166.27
- Social.smoker  1    617.23 4262.0 168.41

Step:  AIC=164.66
Absenteeism.time.in.hours ~ Age + Social.smoker + Children

                Df Sum of Sq    RSS    AIC
- Children       1    126.79 3930.2 163.74
<none>                       3803.4 164.66
- Age            1    455.58 4259.0 166.39
- Social.smoker  1    532.00 4335.4 166.98

Step:  AIC=163.74
Absenteeism.time.in.hours ~ Age + Social.smoker

                Df Sum of Sq    RSS    AIC
<none>                       3930.2 163.74
- Age            1    362.84 4293.0 164.65
- Social.smoker  1    451.76 4381.9 165.33

Stepwise Regression Results Explanation:

  • The model is simplified by sequentially removing predictors that do not improve the AIC, such as Social.drinker and Pet.
  • The final model retains Age and Social.smoker, indicating these are the most informative predictors under the AIC criterion; a quick AIC comparison of the initial and refined models follows below.
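To verify that the stepwise search improved the criterion it optimizes, we can compare the initial and refined models directly. Note that AIC() reports values that differ from those printed by step() by an additive constant (step() uses extractAIC()), but the ranking of the models is the same:

# Compare the full and stepwise-refined linear models on AIC
AIC(Absenteeism.lm, Absenteeism.lm.step)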

Refined Model Output:

summ(Absenteeism.lm.step)
MODEL INFO:
Observations: 33
Dependent Variable: Absenteeism.time.in.hours
Type: OLS linear regression 

MODEL FIT:
F(2,30) = 3.35, p = 0.05
R² = 0.18
Adj. R² = 0.13 

Standard errors: OLS
---------------------------------------------------------
                             Est.    S.E.   t val.      p
------------------------- ------- ------- -------- ------
(Intercept)                 30.59   10.46     2.92   0.01
Age                         -0.44    0.26    -1.66   0.11
Social.smokerSmoker         -9.07    4.89    -1.86   0.07
---------------------------------------------------------
  • Model Info: The final model also consists of 33 observations.
  • Model Fit: This refined model has a slightly improved fit (F(2,30) = 3.35, p = 0.05), with an R² of 0.18 and an adjusted R² of 0.13, indicating modest explanatory power.
  • Coefficients: Age and Social.smokerSmoker are the remaining predictors. Age shows a negative association with absenteeism time (less absenteeism with increasing age), and smoking status is associated with lower absenteeism; both effects are only marginally significant. A prediction example using this refined model follows below.
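The refined model can also be used for prediction. As a purely hypothetical illustration (the values below are made up), we can predict absenteeism hours per unit of service time for a 40-year-old smoker, along with a prediction interval:

# Prediction for a hypothetical employee using the refined model
new_employee <- data.frame(Age = 40, Social.smoker = "Smoker")
predict(Absenteeism.lm.step, newdata = new_employee, interval = "prediction")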

10.2.1.3 Brief Model Diagnostics

Finally, we assess the quality of the model by examining the residuals versus fitted values.

plot(Absenteeism.lm.step, which = 1)

Breaking Down the Code

This plot visualizes the residuals (differences between observed and predicted values) against the fitted values (predicted values). It is used to check for patterns that could suggest issues such as heteroscedasticity.

Analysis of Results:

  • The plot indicates potential heteroscedasticity, where the variance of the residuals changes with the level of the fitted values. This violates the constant-variance assumption of standard linear regression.
  • Combined with the modest R² values, these issues suggest that linear regression might not be the most reliable method for analyzing this data, and that alternative methods or transformations may be needed to better capture the underlying patterns.
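For a formal check of the constant-variance assumption, a common follow-up is the Breusch-Pagan test. The sketch below assumes the lmtest package is installed; it is not used elsewhere in this case study:

# Breusch-Pagan test: a small p-value suggests heteroscedasticity
library(lmtest)
bptest(Absenteeism.lm.step)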

10.2.2 Logistic Model Prediction of Absenteeism

We employ logistic regression to model the probability that employees fall into categories of “High Absenteeism” or “Low Absenteeism” based on various predictors. Logistic regression is useful for binary outcomes and provides probabilities as outputs, which can be thresholded at a particular value (commonly 0.5) to classify observations.

10.2.2.1 Model with All Predictors

We employ a logistic regression model, using the glm() function with a binomial family, to predict High.absenteeism. In this model, we exclude the variables Absenteeism.time.in.hours and Number.of.absence from the predictors. These exclusions are made because High.absenteeism is derived directly from Absenteeism.time.in.hours, and Number.of.absence measures the same underlying outcome; including either would leak the outcome into the predictors and undermine the model’s accuracy and interpretability.

Absenteeism.Logit <- glm(High.absenteeism ~ . 
            - Absenteeism.time.in.hours
            - Number.of.absence, 
            data = Absenteeism.by.employee, 
            family = "binomial")

Breaking Down the Code

This code fits a logistic regression model to predict a binary outcome, High.absenteeism, which categorizes employees based on their absenteeism rates. Here is a breakdown of each component of the call:
  • glm(): This function is used to fit generalized linear models, a class of models that includes logistic regression. The type of model is specified by the family parameter.
  • High.absenteeism ~ .: This part of the formula specifies that High.absenteeism is the dependent variable we are trying to predict. The ~ . indicates that all other variables in the dataset should be considered as predictors.
  • - Absenteeism.time.in.hours - Number.of.absence: These terms are preceded by a minus sign, which means they are explicitly excluded from the set of predictor variables.
  • data = Absenteeism.by.employee: Specifies the dataset from which variables are taken. In this case, it’s Absenteeism.by.employee, which presumably contains the employee attendance data.
  • family = "binomial": Indicates that the model should be fit using the binomial family, which is used for binary outcomes. In logistic regression, this setting specifies that the link function (the function that relates the linear predictor to the mean of the distribution function) is the logit function, suitable for binary data.

This model is structured to examine the influence of various predictors on the likelihood of an employee having high absenteeism, while ensuring that the analysis does not include redundant or inappropriate predictors.

summ(Absenteeism.Logit)
MODEL INFO:
Observations: 33
Dependent Variable: High.absenteeism
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(6) = 9.79, p = 0.13
Pseudo-R² (Cragg-Uhler) = 0.34
Pseudo-R² (McFadden) = 0.21
AIC = 49.92, BIC = 60.40 

Standard errors: MLE
--------------------------------------------------------
                             Est.   S.E.   z val.      p
------------------------- ------- ------ -------- ------
(Intercept)                 -3.79   3.28    -1.15   0.25
Body.mass.index              0.13   0.11     1.13   0.26
Age                          0.03   0.06     0.50   0.61
Social.smokerSmoker          2.77   1.41     1.97   0.05
Social.drinkerSocial        -0.34   0.93    -0.36   0.72
drinker                                                 
PetPet(s)                    1.80   1.31     1.37   0.17
ChildrenParent              -3.11   1.55    -2.00   0.05
--------------------------------------------------------

Model Information:

  • Observations: 33
  • Dependent Variable: High.absenteeism
  • Type: Generalized linear model with a logit link function (logistic regression).

Model Fit:

  • χ²(6) = 9.79, p = 0.13: The chi-square test indicates that the model as a whole is not statistically significant at conventional levels.
  • Pseudo-R² (Cragg-Uhler) = 0.34 and Pseudo-R² (McFadden) = 0.21: These values suggest a moderate fit, showing that the model explains some, but not all, of the variability in absenteeism classification.
  • AIC = 49.92, BIC = 60.40: These information criteria summarize the model’s relative quality of fit; lower values are better.

Coefficients Analysis:

  • Most variables are not statistically significant, indicating a weak association with the classification of absenteeism. Notable exceptions are Social.smokerSmoker and ChildrenParent, both at the boundary of significance, suggesting potential areas for further investigation. Exponentiating the coefficients, as shown below, makes their magnitudes easier to interpret.
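Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to communicate. Two quick ways to see them (illustrative; the exp argument assumes a reasonably recent version of jtools):

# Convert log-odds coefficients to odds ratios
exp(coef(Absenteeism.Logit))

# jtools can report exponentiated coefficients with confidence intervals
summ(Absenteeism.Logit, exp = TRUE)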

10.2.2.2 Refine the Model: Variable Selection

We again use stepwise regression, an automated procedure that adds and removes predictors to minimize the AIC.

x <- step(Absenteeism.Logit)
Start:  AIC=49.92
High.absenteeism ~ (Number.of.absence + Absenteeism.time.in.hours + 
    Body.mass.index + Age + Social.smoker + Social.drinker + 
    Pet + Children) - Absenteeism.time.in.hours - Number.of.absence

                  Df Deviance    AIC
- Social.drinker   1   36.054 48.054
- Age              1   36.179 48.179
- Body.mass.index  1   37.265 49.265
<none>                 35.924 49.924
- Pet              1   38.179 50.179
- Social.smoker    1   41.436 53.436
- Children         1   41.574 53.574

Step:  AIC=48.05
High.absenteeism ~ Body.mass.index + Age + Social.smoker + Pet + 
    Children

                  Df Deviance    AIC
- Age              1   36.363 46.363
- Body.mass.index  1   37.272 47.272
<none>                 36.054 48.054
- Pet              1   38.958 48.958
- Social.smoker    1   41.457 51.457
- Children         1   42.951 52.951

Step:  AIC=46.36
High.absenteeism ~ Body.mass.index + Social.smoker + Pet + Children

                  Df Deviance    AIC
- Body.mass.index  1   37.764 45.764
<none>                 36.363 46.363
- Pet              1   39.297 47.297
- Social.smoker    1   41.938 49.938
- Children         1   42.974 50.974

Step:  AIC=45.76
High.absenteeism ~ Social.smoker + Pet + Children

                Df Deviance    AIC
<none>               37.764 45.764
- Pet            1   41.233 47.233
- Social.smoker  1   42.229 48.229
- Children       1   43.485 49.485

Stepwise Regression Results:

The results detail a stepwise regression process aimed at optimizing the logistic regression model by minimizing the Akaike Information Criterion (AIC). The model starts with several predictors and an AIC of 49.92. Predictors that contribute the least to the model’s explanatory power are then removed one at a time: dropping Social.drinker lowers the AIC to 48.054, dropping Age reduces it to 46.363, and dropping Body.mass.index brings it to 45.764. The final selection keeps Social.smoker, Pet, and Children, because removing any of these would raise the AIC. The process concludes with a refined model that balances simplicity and explanatory power, retaining only the predictors that contribute most under the AIC criterion, with a final AIC of 45.76.

Refined Model Output:

summ(x)
MODEL INFO:
Observations: 33
Dependent Variable: High.absenteeism
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(3) = 7.95, p = 0.05
Pseudo-R² (Cragg-Uhler) = 0.29
Pseudo-R² (McFadden) = 0.17
AIC = 45.76, BIC = 51.75 

Standard errors: MLE
--------------------------------------------------------
                             Est.   S.E.   z val.      p
------------------------- ------- ------ -------- ------
(Intercept)                  0.30   0.66     0.45   0.65
Social.smokerSmoker          2.28   1.24     1.83   0.07
PetPet(s)                    1.95   1.21     1.62   0.11
ChildrenParent              -2.63   1.30    -2.03   0.04
--------------------------------------------------------

Model Fit and Statistical Significance: The model’s goodness-of-fit, as indicated by a chi-square value of 7.95 with a p-value of 0.05, suggests that the model as a whole is statistically significant at the conventional 0.05 level. This indicates that the model provides a better fit to the data than a model with no predictors.

Coefficient Analysis: A note on coding first: because “High Absenteeism” is alphabetically the first level of the outcome factor, glm() models the probability of the second level, “Low Absenteeism.” Positive coefficients therefore raise the odds of being in the low-absenteeism group, which keeps the signs consistent with the linear model results above.

  • Intercept: The estimate is 0.30 with a standard error of 0.66, giving a z-value of 0.45 and a p-value of 0.65. The baseline log odds (at the reference categories of the predictors) are not significantly different from zero.
  • Social.smokerSmoker: The coefficient of 2.28 (S.E. 1.24, z = 1.83) suggests that smokers have higher odds of falling in the low-absenteeism group, that is, lower odds of high absenteeism, although the result is only marginally significant (p = 0.07). This mirrors the negative smoker coefficient in the linear model.
  • PetPet(s): With a coefficient of 1.95 (S.E. 1.21, z = 1.62), pet ownership points in the same direction but is not statistically significant (p = 0.11).
  • ChildrenParent: The coefficient of -2.63 (S.E. 1.30, z = -2.03) indicates that parents have lower odds of being in the low-absenteeism group, i.e., they are more likely to be classified as high absenteeism; this effect is statistically significant (p = 0.04) and mirrors the positive parent coefficient in the linear model.

Pseudo-R² Values:

  • The Cragg-Uhler Pseudo-R² is 0.29 and McFadden’s Pseudo-R² is 0.17, both of which are modest. These values indicate that while the model explains some of the variability in absenteeism classification, a substantial portion remains unexplained, suggesting that absenteeism is influenced by other, unmeasured factors or that the logistic specification does not fully capture the relationship.

Overall, the model suggests that certain lifestyle factors like smoking and having children are important in predicting high absenteeism, though the moderate Pseudo-R² values and the marginal significance of some predictors call for cautious interpretation of the results.
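As noted at the start of this subsection, the fitted model returns probabilities that can be thresholded (commonly at 0.5) to classify employees. A minimal sketch using the refined model stored in x, assuming no rows were dropped for missing values:

# Predicted probabilities: because "High Absenteeism" is the first factor
# level, these are probabilities of "Low Absenteeism"
p_low <- predict(x, type = "response")

# Threshold at 0.5 and compare predicted vs. observed categories
predicted_class <- ifelse(p_low >= 0.5, "Low Absenteeism", "High Absenteeism")
table(Predicted = predicted_class,
      Observed = Absenteeism.by.employee$High.absenteeism)

# To model P(High Absenteeism) directly instead, relevel the outcome first:
# Absenteeism.by.employee$High.absenteeism <-
#   relevel(Absenteeism.by.employee$High.absenteeism, ref = "Low Absenteeism")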

10.3 Homework Assignment: Regression Models

In this case study, you are assigned to analyze the “Absenteeism at Work Data Set” using regression models, with a specific focus on predicting absenteeism. You will replicate the analysis from the previous case but with a key difference: use Number.of.absence as the dependent variable for your models. This will allow you to concentrate specifically on the factors influencing the frequency of absenteeism.

Assignment Instructions

You are to produce a comprehensive Quarto document, rendered to a DOCX file formatted as a memo to your boss. This document should cover the following steps, providing a thorough analysis from data preparation to modeling and diagnostics:

  1. Data Preparation: Refer to the earlier section of the case for detailed steps on data loading, filtering, grouping, summarizing, selecting relevant columns, recoding, and cleaning to prepare the dataset for analysis.

  2. Defining High Absence: Establish a new variable categorizing employees as ‘High Absenteeism’ or ‘Low Absenteeism’ based on the median number of absences per month of service.

  3. Linear Model Prediction: Use a linear regression model to predict the number of absences (Number.of.absence). Identify and evaluate the significance of the variables affecting absenteeism frequency.

  4. Refine the Model: Apply stepwise regression to refine the model based on the AIC criterion, which may involve adding or removing variables to enhance model accuracy.

  5. Logistic Model Prediction: Implement logistic regression to assess the probability of employees falling into either the ‘High Absenteeism’ or ‘Low Absenteeism’ categories.

  6. Model Diagnostics: Conduct diagnostics to evaluate the fit of the logistic regression model and identify any issues.

Your analysis must clearly explain the results, detailing the significance of predictors and their relationships with absenteeism. Make sure your report includes all necessary steps from data preparation to final diagnostics, offering a comprehensive view of the factors affecting absenteeism in the workplace. This document should serve as a detailed memo to your boss, summarizing your findings and methodologies in a clear and professional manner.
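As a starting point only (a sketch, not a required solution), the homework models might be specified along the following lines, assuming the prepared Absenteeism.by.employee data frame from Section 10.1 and redefining the high-absence indicator from the number of absences; the hw.* object names are placeholders:

# Redefine the high-absence indicator using the median number of absences
Absenteeism.by.employee <- Absenteeism.by.employee %>%
  mutate(High.absenteeism = as.factor(
    ifelse(Number.of.absence >= quantile(Number.of.absence, 0.5),
           "High Absenteeism", "Low Absenteeism")))

# Linear model: number of absences as the outcome
hw.lm <- lm(Number.of.absence ~ . - Absenteeism.time.in.hours - High.absenteeism,
            data = Absenteeism.by.employee)
hw.lm.step <- step(hw.lm)

# Logistic model: exclude the two outcome-derived measures
hw.logit <- glm(High.absenteeism ~ . - Absenteeism.time.in.hours - Number.of.absence,
                data = Absenteeism.by.employee,
                family = "binomial")
hw.logit.step <- step(hw.logit)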