6.3 Examining the Final Data Frame

Let’s inspect the cleaned dataset to verify our work. Running head(Absenteeism.by.employee) in your console prints the first six rows:

| Number of absence | Absenteeism time in hours | Most common reason for absence | Month of absence | Day of the week | Body mass index | Age | Social smoker | Social drinker | Pet | Education | Children | College | High absenteeism |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.64 | 8.64 | 22 | Aug | Tue | 29 | 37 | Non-smoker | Non-drinker | Pet(s) | postgraduate | Parent | college | High Absenteeism |
| 0.50 | 2.08 | 0 | Aug | Tue | 33 | 48 | Smoker | Non-drinker | Pet(s) | high school | Parent | high school | Low Absenteeism |
| 6.28 | 26.78 | 27 | Feb | Thu | 31 | 38 | Non-smoker | Social drinker | No Pet(s) | high school | Non-parent | high school | High Absenteeism |
| 0.08 | 0.00 | 0 | Dec | Wed | 34 | 40 | Non-smoker | Social drinker | Pet(s) | high school | Parent | high school | Low Absenteeism |
| 1.46 | 8.00 | 26 | Sep | Tue | 38 | 43 | Non-smoker | Social drinker | No Pet(s) | high school | Parent | high school | High Absenteeism |
| 0.62 | 5.54 | 22 | Feb | Fri | 25 | 33 | Non-smoker | Non-drinker | Pet(s) | high school | Parent | high school | High Absenteeism |

The data frame now has one row per employee, with cleaned and labeled variables, and is ready for the visualization and modeling work in the chapters ahead.
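Beyond head(), a few base R calls help confirm the structure of a cleaned data frame. The sketch below is only an illustration: toy is a hypothetical stand-in built from three of the rows shown above with a handful of columns, and storing the labeled variables as factors is an assumption, not something the output confirms.

```r
# Hypothetical stand-in for Absenteeism.by.employee: three rows and four
# columns copied from the head() output above. Factor storage is assumed.
toy <- data.frame(
  "Number of absence"         = c(1.64, 0.50, 6.28),
  "Absenteeism time in hours" = c(8.64, 2.08, 26.78),
  "Month of absence"          = factor(c("Aug", "Aug", "Feb")),
  "High absenteeism"          = factor(c("High Absenteeism",
                                         "Low Absenteeism",
                                         "High Absenteeism")),
  check.names = FALSE  # keep the spaces in the column names
)

head(toy)                 # first rows (at most six)
str(toy)                  # type of each column
dim(toy)                  # number of rows and columns
na_counts <- colSums(is.na(toy))  # missing values per column
print(na_counts)
```

Running the same calls on the real data frame would show whether every column has the expected type and whether any missing values remain.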

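We can also review the summary statistics of the cleaned data. Running describe(Absenteeism.by.employee, skew = FALSE, ranges = FALSE, omit = TRUE) produces the following summary. Because omit = TRUE drops the non-numeric columns, only five variables appear; the vars column gives each variable's position in the data frame.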

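| Variable | vars | n | mean | sd | se |
|---|---|---|---|---|---|
| Number of absence | 1 | 33 | 1.72 | 2.02 | 0.35 |
| Absenteeism time in hours | 2 | 33 | 11.53 | 12.26 | 2.13 |
| Most common reason for absence | 3 | 33 | 14.48 | 11.52 | 2.01 |
| Body mass index | 6 | 33 | 26.58 | 4.80 | 0.84 |
| Age | 7 | 33 | 39.10 | 7.71 | 1.34 |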
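To see where these columns come from, the same quantities can be computed in base R. The se column is the standard error of the mean, sd / sqrt(n); for Number of absence that is 2.02 / sqrt(33) ≈ 0.35, which matches the summary. The sketch below is a rough base R equivalent, not the describe() implementation, and toy is a hypothetical data frame holding only the six rows printed by head() earlier, so its numbers will differ from the 33-employee summary.

```r
# Hypothetical data: the six Body mass index, Age and Month values shown by
# head() above (not the full 33-employee data frame).
toy <- data.frame(
  "Body mass index"  = c(29, 33, 31, 34, 38, 25),
  "Age"              = c(37, 48, 38, 40, 43, 33),
  "Month of absence" = c("Aug", "Aug", "Feb", "Dec", "Sep", "Feb"),
  check.names = FALSE  # keep the spaces in the column names
)

# Keep only numeric columns, mimicking the effect of omit = TRUE
numeric_cols <- toy[sapply(toy, is.numeric)]

# One row per numeric variable: count, mean and standard deviation
summary_table <- data.frame(
  n    = sapply(numeric_cols, function(x) sum(!is.na(x))),
  mean = sapply(numeric_cols, mean, na.rm = TRUE),
  sd   = sapply(numeric_cols, sd, na.rm = TRUE)
)

# Standard error of the mean
summary_table$se <- summary_table$sd / sqrt(summary_table$n)

print(round(summary_table, 2))
```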