4.2 Inspecting the Data

Before computing any statistics, we need to understand the structure of the data. The head() function shows the first six rows by default. Try running head(work) in your console; you should see output like this:

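| ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | Hit target | Disciplinary failure | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 239.554 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 4 |
| 36 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 239.554 | 97 | 1 | 1 | 1 | 1 | 0 | 0 | 98 | 178 | 31 | 0 |
| 3 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 239.554 | 97 | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
| 7 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 239.554 | 97 | 0 | 1 | 2 | 1 | 1 | 0 | 68 | 168 | 24 | 4 |
| 11 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 239.554 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 2 |
| 3 | 23 | 7 | 6 | 1 | 179 | 51 | 18 | 38 | 239.554 | 97 | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |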

We can also check the column names. The names() function lists every column, and ncol(work) confirms that there are 21:

names(work)
##  [1] "ID"                              "Reason for absence"             
##  [3] "Month of absence"                "Day of the week"                
##  [5] "Seasons"                         "Transportation expense"         
##  [7] "Distance from Residence to Work" "Service time"                   
##  [9] "Age"                             "Work load Average/day "         
## [11] "Hit target"                      "Disciplinary failure"           
## [13] "Education"                       "Son"                            
## [15] "Social drinker"                  "Social smoker"                  
## [17] "Pet"                             "Weight"                         
## [19] "Height"                          "Body mass index"                
## [21] "Absenteeism time in hours"
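
To confirm the column count and see how R stores each column, you can run something like the following. This is a minimal sketch: because the full file is not reproduced here, it builds a small stand-in data frame called work_demo (a hypothetical name) from three of the rows shown earlier. With the real data you would call the same functions on work.

```r
# Hypothetical stand-in for `work`: three columns, three rows,
# copied from the head(work) output shown earlier.
work_demo <- data.frame(
  ID = c(11, 36, 3),
  `Reason for absence` = c(26, 0, 23),
  `Day of the week` = c(3, 3, 4),
  check.names = FALSE  # keep the spaces in the column names
)

ncol(work_demo)           # number of columns: 3 here, 21 for the full data
sapply(work_demo, class)  # storage class of each column
str(work_demo)            # compact overview: one line per column with its type
```

Every column in the stand-in reports the class "numeric", which matches what the text says about the full data set.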

Notice that R has read all columns as numeric types (dbl for decimal numbers). This is technically correct — the data file stores everything as numbers — but some of these variables are really categorical. For example, Reason for absence is coded as numbers 0–28, but those numbers represent categories (disease types, appointment types, or 0 for no reason recorded), not quantities. Similarly, Day of the week uses 2 for Monday through 6 for Friday. We will address this distinction when we clean the data in Chapters 5 and 6. For now, we simply note it.
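
To see why the distinction matters, here is a small sketch using the Day of the week codes from the six rows shown earlier. It is only an illustration, not necessarily the approach taken in the cleaning chapters. The vector names day_code and day_factor are hypothetical, and the weekday labels assume the coding described above (2 for Monday through 6 for Friday).

```r
# Day-of-week codes from the first six rows of the data.
day_code <- c(3, 3, 4, 5, 5, 6)

# Assumed coding: 2 = Monday, 3 = Tuesday, ..., 6 = Friday.
day_factor <- factor(
  day_code,
  levels = 2:6,
  labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
)

mean(day_code)     # R will average the codes, but the result has no real meaning
table(day_factor)  # counting rows per category is the sensible summary
```

While the column is numeric, R treats the codes as quantities. Once the column is a factor, table() counts how many rows fall on each weekday, and functions such as mean() no longer return a misleading number.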

Column Names with Spaces

You may notice that many of the column names contain spaces (e.g., Reason for absence). In R, when a column name contains spaces, you need to wrap it in backticks when referencing it in code, for example work$`Reason for absence`. One name, "Work load Average/day ", even ends with a trailing space, which must be typed inside the backticks as well. We will rename these columns to simpler names when we clean the data in Chapter 6.
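
The following sketch shows the backtick syntax in action, along with one possible way to rename a column. It again uses a small stand-in data frame, work_demo (a hypothetical name), and the new column name reason is chosen only for illustration; Chapter 6 may use different names or a different method.

```r
# Hypothetical stand-in for `work` with two of the real column names.
work_demo <- data.frame(
  ID = c(11, 36, 3),
  `Reason for absence` = c(26, 0, 23),
  check.names = FALSE  # without this, data.frame() would replace the spaces with dots
)

work_demo$`Reason for absence`     # backticks with $
work_demo[["Reason for absence"]]  # a quoted string inside [[ ]] also works

# One way to rename the column in base R; "reason" is an illustrative name.
names(work_demo)[names(work_demo) == "Reason for absence"] <- "reason"
work_demo$reason                   # no backticks needed any more
```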