4.3 Summary Statistics

The describe() function from the psych package produces a compact table of summary statistics for the numeric variables in the dataset (omit = TRUE drops any non-numeric columns):

describe(work, skew = FALSE, ranges = FALSE, omit = TRUE) |>
  kable(digits = 2) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)
Variable                        vars    n    mean     sd    se
ID                                 1  740   18.02  11.02  0.41
Reason for absence                 2  740   19.22   8.43  0.31
Month of absence                   3  740    6.32   3.44  0.13
Day of the week                    4  740    3.91   1.42  0.05
Seasons                            5  740    2.54   1.11  0.04
Transportation expense             6  740  221.33  66.95  2.46
Distance from Residence to Work    7  740   29.63  14.84  0.55
Service time                       8  740   12.55   4.38  0.16
Age                                9  740   36.45   6.48  0.24
Work load Average/day             10  740  271.49  39.06  1.44
Hit target                        11  740   94.59   3.78  0.14
Disciplinary failure              12  740    0.05   0.23  0.01
Education                         13  740    1.29   0.67  0.02
Son                               14  740    1.02   1.10  0.04
Social drinker                    15  740    0.57   0.50  0.02
Social smoker                     16  740    0.07   0.26  0.01
Pet                               17  740    0.75   1.32  0.05
Weight                            18  740   79.04  12.88  0.47
Height                            19  740  172.11   6.03  0.22
Body mass index                   20  740   26.68   4.29  0.16
Absenteeism time in hours         21  740    6.92  13.33  0.49

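This gives us the variable number (vars), count (n), mean, standard deviation (sd), and standard error of the mean (se) for each variable. Because the call sets ranges = FALSE and skew = FALSE, the median, minimum, maximum, skewness, and kurtosis are left out of the table, so the medians and ranges quoted below do not appear in it. Let’s highlight a few key observations: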
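As a quick sanity check on the se column, the standard error of the mean is the standard deviation divided by the square root of the sample size. The sketch below recomputes it for "Absenteeism time in hours" using only the n and sd values printed in the table; the variable names n_obs, sd_hours, and se_hours are illustrative and not part of the original analysis.

```r
# Values copied from the summary table for "Absenteeism time in hours"
n_obs    <- 740
sd_hours <- 13.33

# Standard error of the mean: sd / sqrt(n)
se_hours <- sd_hours / sqrt(n_obs)

round(se_hours, 2)
# 0.49, matching the se column
```

To see the median, minimum, and maximum in the table itself, one option is to leave ranges at its default of TRUE in the describe() call, at the cost of a wider table.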

Absenteeism time in hours: This is our target variable, the outcome we ultimately want to understand. The median is 3 hours, while the mean is 6.92 hours, more than double the median. A mean well above the median indicates a right-skewed distribution: most absences are relatively short, but a few are much longer and pull the average up. The standard deviation (13.33 hours) is almost twice the mean, which points the same way. The maximum is 120 hours, an unusually long absence.

Age: Employees in the dataset range from 27 to 58 years old, with a median age of 37 (mean 36.45).

Distance from Residence to Work: Commute distances range from 5 to 52 kilometers, with a median of 26 km (mean 29.63 km).
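The mean-versus-median reasoning used for absence hours can be illustrated with a small example. The sketch below uses a made-up vector, toy_hours, which is not taken from the dataset; it only shows how a single long absence raises the mean while leaving the median unchanged.

```r
# Hypothetical absence durations in hours (toy data, NOT from the dataset)
toy_hours <- c(1, 2, 2, 3, 3, 3, 4, 8, 8, 16, 120)

toy_median <- median(toy_hours)  # middle value, robust to the 120-hour absence
toy_mean   <- mean(toy_hours)    # pulled upward by the 120-hour absence
toy_range  <- range(toy_hours)   # c(minimum, maximum)

# Dropping the longest absence lowers the mean sharply but not the median
trimmed        <- toy_hours[toy_hours < 120]
trimmed_mean   <- mean(trimmed)
trimmed_median <- median(trimmed)

c(median = toy_median, mean = toy_mean,
  trimmed_median = trimmed_median, trimmed_mean = trimmed_mean)
```

On the real data, the same base R functions could be applied to a column of work, for example median(work[["Absenteeism time in hours"]]), assuming the column is named as it appears in the table.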

These numbers give us a first sense of the data’s shape and scale. But numbers alone can be hard to interpret — visualizations often reveal patterns more clearly.