4.1 Loading the Data

The first step in any data analysis project is loading your data into R. The Absenteeism at Work dataset is hosted online as a text file that, despite its .csv extension, separates values with semicolons rather than commas. We load it with read_delim() from the readr package (part of the Tidyverse); the call itself follows the sidebar below.

Multiple Ways to Do the Same Thing in R

One thing you will quickly discover about R is that there is almost always more than one way to accomplish a task. Reading a semicolon-delimited file is a good example. All three of the following functions can read such a file, though they differ in their defaults and in what they return:

  • read.csv("file", sep = ";"): base R’s CSV reader. The sep argument overrides the default comma delimiter. It requires no extra packages, but it returns a traditional data frame and can be slow on large files.
  • read_csv2("file"): from the readr package. It is designed for semicolon-separated files, which are common in European data where the comma serves as the decimal mark. It is convenient, but the delimiter is fixed at the semicolon, and it also assumes a comma decimal mark, so numeric columns written with a period deserve a check after loading.
  • read_delim("file", delim = ";"): also from readr, and the most flexible of the three. The delim argument accepts whatever separator the file uses (semicolon, tab, pipe, and so on).
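To see the three readers side by side, here is a minimal sketch that parses the same small semicolon-delimited string with each one. The sample text and its column names are made up for illustration and are not the Absenteeism data. The sketch assumes readr 2.0 or later, where I() marks a string as literal data rather than a file path.

```r
library(readr)

# A tiny semicolon-delimited sample, invented for illustration only.
sample_text <- "id;reason;hours\n1;26;4\n2;0;0\n3;23;2\n"

# Base R: the text argument lets read.csv() parse a string instead of a file.
base_df <- read.csv(text = sample_text, sep = ";")

# readr: I() tells the reader that the string is the data itself.
csv2_tbl  <- read_csv2(I(sample_text), show_col_types = FALSE)
delim_tbl <- read_delim(I(sample_text), delim = ";", show_col_types = FALSE)

# Compare what each reader returns.
class(base_df)    # a traditional data frame
class(delim_tbl)  # a tibble, which is also a data frame

# All three readers should agree on the shape of this sample.
dim(base_df)
dim(csv2_tbl)
dim(delim_tbl)
```

The sample contains only whole numbers on purpose, so the different decimal-mark defaults of read_csv2() and read_delim() do not come into play here.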

We use read_delim() throughout this book because it makes the delimiter explicit in the code — anyone reading your script can immediately see what separator the file uses. It also returns a tibble (the Tidyverse’s enhanced data frame) and handles large files efficiently. When in doubt, read_delim() with an explicit delimiter is the safest and most readable choice.

work <- read_delim(
  "https://ljkelly3141.github.io/datasets/bi-book/Absenteeism_at_work.csv",
  delim = ";"
)

Let’s confirm that the data loaded correctly by checking its dimensions:

dim(work)
## [1] 740  21

The dataset has 740 rows and 21 columns — exactly what we expected from Chapter 2. Each row represents a single absence event, and each column represents one of the 21 variables described in that chapter.
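If you would rather have the script stop when the shape is wrong than compare the numbers by eye, the same check can be written as a small function. The sketch below is one possible way to do this. The helper check_dims() is hypothetical (it is not part of base R or readr), and the toy data frame only stands in for a loaded dataset.

```r
# Hypothetical helper: stop with a clear message if a data frame
# does not have the expected number of rows and columns.
check_dims <- function(df, n_rows, n_cols) {
  if (nrow(df) != n_rows || ncol(df) != n_cols) {
    stop(sprintf(
      "Expected %d rows and %d columns but got %d rows and %d columns.",
      n_rows, n_cols, nrow(df), ncol(df)
    ))
  }
  invisible(TRUE)
}

# Toy data frame standing in for a freshly loaded dataset.
toy <- data.frame(id = 1:3, hours = c(4, 0, 2))

# Passes silently because toy has 3 rows and 2 columns.
check_dims(toy, n_rows = 3, n_cols = 2)
```

With the dataset loaded as work, the equivalent call would be check_dims(work, 740, 21).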