6.3 Examining the Final Data Frame

Let’s inspect the cleaned dataset to verify our work. Run head(Absenteeism.by.employee) in your console; the output should match the table below:

Table 6.1: First six rows of the cleaned employee-level dataset.
Number of absence	Absenteeism time in hours	Most common reason for absence	Month of absence	Day of the week	Body mass index	Age	Social smoker	Social drinker	Pet	Education	Children	College	High absenteeism
1.64	8.64	22	Aug	Tue	29	37	Non-smoker	Non-drinker	Pet(s)	postgraduate	Parent	college	High Absenteeism
0.50	2.08	0	Aug	Tue	33	48	Smoker	Non-drinker	Pet(s)	high school	Parent	high school	Low Absenteeism
6.28	26.78	27	Feb	Thu	31	38	Non-smoker	Social drinker	No Pet(s)	high school	Non-parent	high school	High Absenteeism
0.08	0.00	0	Dec	Wed	34	40	Non-smoker	Social drinker	Pet(s)	high school	Parent	high school	Low Absenteeism
1.46	8.00	26	Sep	Tue	38	43	Non-smoker	Social drinker	No Pet(s)	high school	Parent	high school	High Absenteeism
0.62	5.54	22	Feb	Fri	25	33	Non-smoker	Non-drinker	Pet(s)	high school	Parent	high school	High Absenteeism

The data frame now has one row per employee with cleaned, labeled variables — ready for the visualization and modeling work in the chapters ahead.
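Beyond head(), a few base R calls can confirm the shape and column types of a data frame. The sketch below is illustrative only: absent_demo is a hypothetical stand-in built from the first three rows and four of the columns shown in the table above, not the real Absenteeism.by.employee object created in the earlier steps.

```r
# Hypothetical stand-in for Absenteeism.by.employee: the first three rows of the
# table above, restricted to four columns. The real data frame has more of both.
absent_demo <- data.frame(
  "Number of absence"         = c(1.64, 0.50, 6.28),
  "Absenteeism time in hours" = c(8.64, 2.08, 26.78),
  "Month of absence"          = c("Aug", "Aug", "Feb"),
  "High absenteeism"          = c("High Absenteeism", "Low Absenteeism",
                                  "High Absenteeism"),
  check.names = FALSE  # keep the spaces in the column names
)

head(absent_demo)             # first rows (up to six)
dim(absent_demo)              # number of rows and columns
str(absent_demo)              # storage type of each column
colSums(is.na(absent_demo))   # count of missing values per column
```

Running the same four calls on Absenteeism.by.employee should show one row per employee and no unexpected column types.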

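We can also review the summary statistics of the cleaned data. Running describe(Absenteeism.by.employee, skew = FALSE, ranges = FALSE, omit = TRUE) produces the summary below. These arguments match describe() from the psych package: skew = FALSE and ranges = FALSE suppress the skewness, kurtosis, and range-related columns, and omit = TRUE drops the non-numeric variables, which is why the vars index jumps from 3 to 6.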

Table 6.3: Summary statistics for the cleaned employee-level dataset.
	vars	n	mean	sd	se
Number of absence	1	33	1.72	2.02	0.35
Absenteeism time in hours	2	33	11.53	12.26	2.13
Most common reason for absence	3	33	14.48	11.52	2.01
Body mass index	6	33	26.58	4.80	0.84
Age	7	33	39.10	7.71	1.34
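The se column is the standard error of the mean, sd / sqrt(n). As a rough sketch of what the summary computes, the base R code below reproduces the n, mean, sd, and se columns for numeric variables. It assumes a hypothetical stand-in data frame, emp_demo, holding only the Age, Body mass index, and Education values from the six rows displayed earlier, so its results will differ from the full 33-employee table.

```r
# Hypothetical stand-in: three columns from the six rows shown in the first table.
emp_demo <- data.frame(
  Age               = c(37, 48, 38, 40, 43, 33),
  "Body mass index" = c(29, 33, 31, 34, 38, 25),
  Education         = c("postgraduate", rep("high school", 5)),
  check.names = FALSE  # keep the space-separated column name
)

# Keep numeric columns only, mirroring the effect of omit = TRUE.
numeric_cols <- emp_demo[sapply(emp_demo, is.numeric)]

# For each numeric column: n, mean, sd, and se = sd / sqrt(n).
summary_stats <- t(sapply(numeric_cols, function(x) {
  n <- sum(!is.na(x))
  s <- sd(x, na.rm = TRUE)
  c(n = n, mean = mean(x, na.rm = TRUE), sd = s, se = s / sqrt(n))
}))
round(summary_stats, 2)

# The same relation reproduces the se column of the summary table (n = 33).
round(2.02 / sqrt(33), 2)  # Number of absence: 0.35
round(7.71 / sqrt(33), 2)  # Age: 1.34
```

Education is excluded because it is a character column, just as the labeled variables are absent from the summary table.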