4.6 Conclusion
This chapter gave you your first hands-on experience with the Absenteeism at Work dataset. You loaded the data, inspected its structure, computed summary statistics, and created four visualizations: a histogram, a bar chart, and boxplots by day of the week and by month. Along the way, you saw that the data has some characteristics (skewed distributions, categorical variables stored as numbers, column names with spaces) that must be addressed before we can do more sophisticated analysis.
In the next chapter, we turn to data preparation and cleaning — the essential step that transforms raw data into a form ready for analysis.
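The three characteristics mentioned above can be checked directly in R. The following is a minimal diagnostic sketch, not part of the chapter's own script; it assumes the tidyverse is already installed and reuses the same file URL and delimiter as the chapter. It only inspects the data and leaves the actual cleaning to the next chapter.

```r
library(tidyverse)

# Same file and delimiter used throughout this chapter
work <- read_delim(
  "https://ljkelly3141.github.io/datasets/bi-book/Absenteeism_at_work.csv",
  delim = ";"
)

# 1. Skewed distribution: a mean well above the median
#    suggests a long right tail
work |>
  summarise(
    mean_hours   = mean(`Absenteeism time in hours`),
    median_hours = median(`Absenteeism time in hours`)
  )

# 2. Categorical variable stored as a number:
#    the reason codes are read in as a numeric column
class(work$`Reason for absence`)

# 3. Column names with spaces: list every name that contains one
names(work)[grepl(" ", names(work))]
```

Comparing the mean to the median is only a rough indicator of skew; the histogram from this chapter remains the more informative view.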
4.6.1 Complete Code Listing
Below is the complete R code used in this chapter as a single script. You can copy this code into an R code chunk in a Quarto document (.qmd file), render it, and reproduce all of the analysis and visualizations from this chapter.
# -------------------------------------------
# 1. Load packages and data
# -------------------------------------------
# Load the tidyverse (data manipulation and plotting)
# and psych (for the describe() function)
if(!require("pacman")) install.packages("pacman")
pacman::p_load("tidyverse", "psych")
# Read the dataset from the hosted URL
# We use read_delim() with delim = ";" because
# this file uses semicolons to separate values
work <- read_delim(
  "https://ljkelly3141.github.io/datasets/bi-book/Absenteeism_at_work.csv",
  delim = ";"
)
# -------------------------------------------
# 2. Inspect the data
# -------------------------------------------
# dim() returns the number of rows and columns
dim(work)
# head() displays the first six rows
head(work)
# names() lists every column name
names(work)
# -------------------------------------------
# 3. Compute summary statistics
# -------------------------------------------
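# describe() reports n, mean, sd, and se for each numeric variable
# skew = FALSE drops skew and kurtosis; ranges = FALSE drops
# median, min, max, and range; omit = TRUE skips non-numeric variables
describe(work, skew = FALSE, ranges = FALSE, omit = TRUE)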
# -------------------------------------------
# 4. Histogram of absenteeism hours
# -------------------------------------------
# geom_histogram() shows the distribution of a single variable
# binwidth = 4 groups absence hours into 4-hour bins
ggplot(work, aes(x = `Absenteeism time in hours`)) +
  geom_histogram(binwidth = 4, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Absenteeism Hours",
    x = "Hours Absent",
    y = "Number of Absence Events"
  ) +
  theme_minimal()
# -------------------------------------------
# 5. Bar chart of absence reasons
# -------------------------------------------
# count() tallies events per reason code
# reorder() sorts bars from most to least frequent
# geom_col() draws the bars
work |>
  count(`Reason for absence`) |>
  mutate(`Reason for absence` = factor(`Reason for absence`)) |>
  ggplot(aes(x = reorder(`Reason for absence`, -n), y = n)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Frequency of Absence Reasons",
    x = "Reason Code",
    y = "Number of Events"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 7, angle = 90, hjust = 1))
# -------------------------------------------
# 6. Boxplot of absenteeism by day of week
# -------------------------------------------
# Convert numeric day codes (2-6) to labels (Mon-Fri)
# geom_boxplot() shows median, IQR, and outliers
work |>
  mutate(Day = factor(
    `Day of the week`,
    levels = 2:6,
    labels = c("Mon", "Tue", "Wed", "Thu", "Fri")
  )) |>
  ggplot(aes(x = Day, y = `Absenteeism time in hours`)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Absenteeism Hours by Day of the Week",
    x = "Day of the Week",
    y = "Hours Absent"
  ) +
  theme_minimal()
# -------------------------------------------
# 7. Boxplot of absenteeism by month
# -------------------------------------------
# Filter out month = 0 (no absence recorded)
# Convert numeric months (1-12) to abbreviated names
work |>
  filter(`Month of absence` > 0) |>
  mutate(Month = factor(
    `Month of absence`,
    levels = 1:12,
    labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
  )) |>
  ggplot(aes(x = Month, y = `Absenteeism time in hours`)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Absenteeism Hours by Month",
    x = "Month",
    y = "Hours Absent"
  ) +
  theme_minimal()
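To run the script above in Quarto, it needs to sit inside an R code chunk in a .qmd file. A minimal sketch of such a file follows; the title, the file name `absenteeism-eda.qmd` used below, and the two chunk options are illustrative assumptions rather than settings taken from this chapter.

````markdown
---
title: "Absenteeism at Work: First Look"
format: html
---

```{r}
#| message: false
#| warning: false

# Paste the complete code listing from this section here
```
````

With Quarto, R, and the packages installed, running `quarto render absenteeism-eda.qmd` in a terminal (or clicking Render in RStudio) should produce an HTML page containing the printed output and the four plots. The `message` and `warning` options only hide package start-up messages and can be removed without affecting the results.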