6.3 Examining the Final Data Frame
Let’s inspect the cleaned dataset to verify our work. Try running head(Absenteeism.by.employee) in your console — here is what you will see:
| Number of absence | Absenteeism time in hours | Most common reason for absence | Month of absence | Day of the week | Body mass index | Age | Social smoker | Social drinker | Pet | Education | Children | College | High absenteeism |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.64 | 8.64 | 22 | Aug | Tue | 29 | 37 | Non-smoker | Non-drinker | Pet(s) | postgraduate | Parent | college | High Absenteeism |
| 0.50 | 2.08 | 0 | Aug | Tue | 33 | 48 | Smoker | Non-drinker | Pet(s) | high school | Parent | high school | Low Absenteeism |
| 6.28 | 26.78 | 27 | Feb | Thu | 31 | 38 | Non-smoker | Social drinker | No Pet(s) | high school | Non-parent | high school | High Absenteeism |
| 0.08 | 0.00 | 0 | Dec | Wed | 34 | 40 | Non-smoker | Social drinker | Pet(s) | high school | Parent | high school | Low Absenteeism |
| 1.46 | 8.00 | 26 | Sep | Tue | 38 | 43 | Non-smoker | Social drinker | No Pet(s) | high school | Parent | high school | High Absenteeism |
| 0.62 | 5.54 | 22 | Feb | Fri | 25 | 33 | Non-smoker | Non-drinker | Pet(s) | high school | Parent | high school | High Absenteeism |
The data frame now has one row per employee with cleaned, labeled variables — ready for the visualization and modeling work in the chapters ahead.
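Beyond head(), a few base R checks can help confirm this structure. The sketch below is only an illustration: it builds a small stand-in data frame from three columns of the six rows shown above, and the name absent_demo is hypothetical, not part of the chapter's dataset. On the real data, the same calls would be applied to Absenteeism.by.employee.

```r
# Stand-in data frame (hypothetical name) copied from the six rows shown above.
# check.names = FALSE keeps the column names with spaces intact.
absent_demo <- data.frame(
  "Absenteeism time in hours" = c(8.64, 2.08, 26.78, 0.00, 8.00, 5.54),
  "Age" = c(37, 48, 38, 40, 43, 33),
  "High absenteeism" = factor(c(
    "High Absenteeism", "Low Absenteeism", "High Absenteeism",
    "Low Absenteeism", "High Absenteeism", "High Absenteeism"
  )),
  check.names = FALSE
)

n_rows <- nrow(absent_demo)                  # one row per employee
col_classes <- sapply(absent_demo, class)    # storage type of each column
n_missing <- colSums(is.na(absent_demo))     # missing values per column
level_counts <- table(absent_demo[["High absenteeism"]])  # label frequencies

print(n_rows)
print(col_classes)
print(n_missing)
print(level_counts)
```

Checking the row count, the column classes, and the number of missing values together is a quick way to catch a recoding step that silently produced NA values or left a labeled variable as plain numbers.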
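We can also review summary statistics for the cleaned data. Running describe(Absenteeism.by.employee, skew = FALSE, ranges = FALSE, omit = TRUE) produces the table below. Setting skew = FALSE and ranges = FALSE leaves out the skewness and range-based columns, and omit = TRUE drops the non-numeric variables, which is why only five of the fourteen columns appear. The vars column gives each variable's position in the data frame: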
| | vars | n | mean | sd | se |
|---|---|---|---|---|---|
| Number of absence | 1 | 33 | 1.72 | 2.02 | 0.35 |
| Absenteeism time in hours | 2 | 33 | 11.53 | 12.26 | 2.13 |
| Most common reason for absence | 3 | 33 | 14.48 | 11.52 | 2.01 |
| Body mass index | 6 | 33 | 26.58 | 4.80 | 0.84 |
| Age | 7 | 33 | 39.10 | 7.71 | 1.34 |
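As a sanity check on the summary table, the se column can be reproduced by hand, assuming it is the standard error of the mean, sd / sqrt(n). The sketch below uses base R only and copies the sd and se values from the table; the variable names (sd_values, se_reported, and so on) are illustrative.

```r
# Values copied from the summary table above.
n <- 33
sd_values <- c(
  "Number of absence" = 2.02,
  "Absenteeism time in hours" = 12.26,
  "Most common reason for absence" = 11.52,
  "Body mass index" = 4.80,
  "Age" = 7.71
)
se_reported <- c(0.35, 2.13, 2.01, 0.84, 1.34)

# Standard error of the mean: sd divided by the square root of n.
se_computed <- sd_values / sqrt(n)

# Side-by-side comparison, rounded to the two decimals shown in the table.
comparison <- data.frame(
  sd = sd_values,
  se_computed = round(se_computed, 2),
  se_reported = se_reported
)
print(comparison)
```

The computed and reported columns agree to two decimals for all five variables. Because n is 33 in every row, this also confirms that none of the numeric variables has missing values among the 33 employees.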