4.2 Inspecting the Data
Before computing any statistics, we need to understand the structure of the data. The head() function shows the first six rows by default. Run head(work) in your console; you should see the following:
| ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | Hit target | Disciplinary failure | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 239.554 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 4 |
| 36 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 239.554 | 97 | 1 | 1 | 1 | 1 | 0 | 0 | 98 | 178 | 31 | 0 |
| 3 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 239.554 | 97 | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
| 7 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 239.554 | 97 | 0 | 1 | 2 | 1 | 1 | 0 | 68 | 168 | 24 | 4 |
| 11 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 239.554 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 2 |
| 3 | 23 | 7 | 6 | 1 | 179 | 51 | 18 | 38 | 239.554 | 97 | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
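
head() also takes an n argument, and tail() and dim() follow the same pattern. The sketch below is only an illustration: work_demo is a hypothetical stand-in built from three columns of the rows shown above, not the chapter's full work object.

```r
# Hypothetical stand-in for `work`: six rows, three columns, copied from
# the table above. check.names = FALSE keeps the spaces in the names.
work_demo <- data.frame(
  "ID" = c(11, 36, 3, 7, 11, 3),
  "Reason for absence" = c(26, 0, 23, 7, 23, 23),
  "Absenteeism time in hours" = c(4, 0, 2, 4, 2, 2),
  check.names = FALSE
)

head(work_demo)                          # first six rows (the default)
first_three <- head(work_demo, n = 3)    # n sets how many rows to return
last_two    <- tail(work_demo, n = 2)    # tail() works from the bottom
size        <- dim(work_demo)            # number of rows, then columns

first_three
last_two
size
```

On the real data, dim(work) is a quick way to see how many rows lie beyond the six that head() displays.
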
We can also check the column names. The names() function lists every column, and ncol() confirms the count, which is 21 here:
## [1] "ID" "Reason for absence"
## [3] "Month of absence" "Day of the week"
## [5] "Seasons" "Transportation expense"
## [7] "Distance from Residence to Work" "Service time"
## [9] "Age" "Work load Average/day "
## [11] "Hit target" "Disciplinary failure"
## [13] "Education" "Son"
## [15] "Social drinker" "Social smoker"
## [17] "Pet" "Weight"
## [19] "Height" "Body mass index"
## [21] "Absenteeism time in hours"
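
The calls behind this output can be sketched as follows. The listing above appears to show a trailing space in "Work load Average/day ", so the sketch also flags names that change when whitespace is trimmed. work_demo is a hypothetical two-column stand-in, not the full data set.

```r
# Hypothetical stand-in with two of the column names listed above.
# The trailing space in "Work load Average/day " is copied from that listing.
work_demo <- data.frame(
  "Age" = c(33, 50, 38),
  "Work load Average/day " = c(239.554, 239.554, 239.554),
  check.names = FALSE
)

col_names <- names(work_demo)   # character vector of column names
n_cols    <- ncol(work_demo)    # number of columns

col_names
n_cols

# Names that differ from their whitespace-trimmed version
padded <- col_names[col_names != trimws(col_names)]
padded
```

A stray space like this is easy to miss in printed output, but it matters later, because a column name has to be typed exactly as stored.
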
Notice that R has read every column as a numeric type (dbl, short for double, which is how R stores decimal numbers). This is technically correct, since the data file stores everything as numbers, but some of these variables are really categorical. For example, Reason for absence is coded as numbers 0–28, but those numbers represent categories (disease types, appointment types, or 0 for no reason recorded), not quantities. Similarly, Day of the week uses 2 for Monday through 6 for Friday. We will address this distinction when we clean the data in Chapters 5 and 6. For now, we simply note it.
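
One way to check this yourself, sketched with a hypothetical stand-in (work_demo, built from values in the rows shown earlier), is to ask for the class of each column and then look at the distinct values of a coded one. Base R's class() reports "numeric" for the columns that a tibble printout labels dbl.

```r
# Hypothetical stand-in: three columns copied from the rows shown earlier.
work_demo <- data.frame(
  "Reason for absence" = c(26, 0, 23, 7, 23, 23),
  "Day of the week" = c(3, 3, 4, 5, 5, 6),
  "Work load Average/day " = rep(239.554, 6),
  check.names = FALSE
)

# One class per column; every column here is stored as a number
col_types <- sapply(work_demo, class)
col_types

# A coded categorical column has a small set of whole-number values
reason_codes <- sort(unique(work_demo$`Reason for absence`))
reason_codes

# Frequency of each day code (2 = Monday through 6 = Friday)
day_counts <- table(work_demo$`Day of the week`)
day_counts
```

The check is only a heuristic. A column of whole numbers can still be a genuine quantity (Age, for example), so the codebook remains the authority on which variables are categorical.
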
Column Names with Spaces
You may notice that the column names contain spaces (e.g., Reason for absence). In R, when a column name contains spaces, you need to wrap it in backticks when referencing it in code — for example, work$`Reason for absence`. We will rename these columns to simpler names when we clean the data in Chapter 6.
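
A minimal sketch of both ways to reach such a column, again using a hypothetical stand-in named work_demo in place of the chapter's work object:

```r
# Hypothetical stand-in with one column name that contains spaces.
work_demo <- data.frame(
  "ID" = c(11, 36, 3),
  "Reason for absence" = c(26, 0, 23),
  check.names = FALSE
)

# Backticks let $ handle a name that contains spaces
by_dollar <- work_demo$`Reason for absence`

# Double brackets take the name as an ordinary quoted string
by_bracket <- work_demo[["Reason for absence"]]

by_dollar
identical(by_dollar, by_bracket)   # both forms return the same vector
```

The double-bracket form is convenient when the column name is held in a variable; the backtick form is what you will see most often in interactive code.
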