4.3 Summary Statistics

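The describe() function from the psych package provides a compact summary of the numeric variables in the dataset. Here skew = FALSE and ranges = FALSE keep the table short by dropping the skewness and range statistics, and omit = TRUE leaves out any non-numeric columns: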

describe(work, skew = FALSE, ranges = FALSE, omit = TRUE) |>
  kable(digits = 2) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)

	vars	n	mean	sd	se
ID	1	740	18.02	11.02	0.41
Reason for absence	2	740	19.22	8.43	0.31
Month of absence	3	740	6.32	3.44	0.13
Day of the week	4	740	3.91	1.42	0.05
Seasons	5	740	2.54	1.11	0.04
Transportation expense	6	740	221.33	66.95	2.46
Distance from Residence to Work	7	740	29.63	14.84	0.55
Service time	8	740	12.55	4.38	0.16
Age	9	740	36.45	6.48	0.24
Work load Average/day	10	740	271.49	39.06	1.44
Hit target	11	740	94.59	3.78	0.14
Disciplinary failure	12	740	0.05	0.23	0.01
Education	13	740	1.29	0.67	0.02
Son	14	740	1.02	1.10	0.04
Social drinker	15	740	0.57	0.50	0.02
Social smoker	16	740	0.07	0.26	0.01
Pet	17	740	0.75	1.32	0.05
Weight	18	740	79.04	12.88	0.47
Height	19	740	172.11	6.03	0.22
Body mass index	20	740	26.68	4.29	0.16
Absenteeism time in hours	21	740	6.92	13.33	0.49
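
As a quick sanity check on the table above, the se column should equal sd / sqrt(n), the standard error of the mean. The sketch below recomputes it for three rows, using values copied by hand from the table; the vector names are only labels chosen for this illustration.

```r
# Values copied by hand from the describe() table (n = 740 for every row).
n <- 740
sd_values <- c(
  "Age" = 6.48,
  "Weight" = 12.88,
  "Absenteeism time in hours" = 13.33
)

# Standard error of the mean: sd / sqrt(n), rounded to 2 digits like the table.
se <- round(sd_values / sqrt(n), 2)
print(se)
```

The three results should agree with the se column for those rows. Because the sd values are themselves rounded, a recomputed se can occasionally differ from the table in the last digit for other rows.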

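The table reports the variable number (vars), count (n), mean, standard deviation (sd), and standard error (se) for each variable. Because the call sets skew = FALSE and ranges = FALSE, the median, minimum, and maximum are not shown here, although several of them are quoted in the observations that follow. Let’s highlight a few key observations: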

Absenteeism time in hours — This is our target variable, the outcome we ultimately want to understand. The median is 3 hours and the mean is 6.9 hours. The fact that the mean is higher than the median tells us the distribution is right-skewed — most absences are relatively short, but some are much longer, pulling the average up. The maximum is 120 hours, which represents an unusually long absence.

Age — Employees in the dataset range from 27 to 58 years old, with a median age of 37.

Distance from Residence to Work — Commute distances range from 5 to 52 kilometers. The median is 26 km.
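
The medians, minima, and maxima quoted above do not appear in the describe() output as called, so they have to be computed separately. The following is a minimal sketch of one way to do that in base R. It runs on a small made-up data frame called toy (a hypothetical stand-in, not the real data) so that it is self-contained; it also assumes the column names match those in the table.

```r
# Toy stand-in for the `work` data frame; these six rows are invented
# purely for illustration and are NOT taken from the dataset.
toy <- data.frame(
  "Absenteeism time in hours" = c(1, 2, 2, 3, 8, 40),
  "Age" = c(28, 33, 37, 37, 41, 50),
  check.names = FALSE  # keep the spaces in the column names
)

# Median, minimum, and maximum for each column, one row per variable.
extra <- t(sapply(toy, function(x) {
  c(median = median(x), min = min(x), max = max(x))
}))
print(extra)

# Right-skew check: a mean above the median suggests a long right tail.
hours <- toy[["Absenteeism time in hours"]]
print(mean(hours) > median(hours))
```

Swapping toy for work should produce the corresponding figures for the real data, provided every selected column is numeric. Note that a mean above the median is only a hint of right skew; a histogram is a more reliable check.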

These numbers give us a first sense of the data’s shape and scale. But numbers alone can be hard to interpret — visualizations often reveal patterns more clearly.