4.2 Inspecting the Data

Before computing any statistics, we need to understand the structure of the data. The head() function shows the first few rows of a data frame (six by default). Try running head(work) in your console. Here is what you will see:

ID	Reason for absence	Month of absence	Day of the week	Seasons	Transportation expense	Distance from Residence to Work	Service time	Age	Work load Average/day	Hit target	Disciplinary failure	Education	Son	Social drinker	Social smoker	Pet	Weight	Height	Body mass index	Absenteeism time in hours
11	26	7	3	1	289	36	13	33	239.554	97	0	1	2	1	0	1	90	172	30	4
36	0	7	3	1	118	13	18	50	239.554	97	1	1	1	1	0	0	98	178	31	0
3	23	7	4	1	179	51	18	38	239.554	97	0	1	0	1	0	0	89	170	31	2
7	7	7	5	1	279	5	14	39	239.554	97	0	1	2	1	1	0	68	168	24	4
11	23	7	5	1	289	36	13	33	239.554	97	0	1	2	1	0	1	90	172	30	2
3	23	7	6	1	179	51	18	38	239.554	97	0	1	0	1	0	0	89	170	31	2
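If you want more or fewer rows, head() accepts an n argument, and tail() does the same from the bottom of the data. The sketch below is only an illustration: it rebuilds a small stand-in for work from the six rows printed above, keeping four of the 21 columns, so that it runs on its own. In your own session you would use the full work object instead.

```r
# Stand-in for `work`: the six rows printed above, four columns only.
# check.names = FALSE keeps the spaces in the column names.
work <- data.frame(
  ID = c(11, 36, 3, 7, 11, 3),
  `Reason for absence` = c(26, 0, 23, 7, 23, 23),
  `Day of the week` = c(3, 3, 4, 5, 5, 6),
  `Absenteeism time in hours` = c(4, 0, 2, 4, 2, 2),
  check.names = FALSE
)

first_three <- head(work, n = 3)  # first 3 rows instead of the default 6
last_two <- tail(work, n = 2)     # last 2 rows

print(first_three)
print(last_two)
```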

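We can also check the column names. The names() function lists every column, and the bracketed index at the start of each output line lets you confirm the count: the last entry is number 21, which is the same value ncol(work) returns.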

names(work)

##  [1] "ID"                              "Reason for absence"             
##  [3] "Month of absence"                "Day of the week"                
##  [5] "Seasons"                         "Transportation expense"         
##  [7] "Distance from Residence to Work" "Service time"                   
##  [9] "Age"                             "Work load Average/day "         
## [11] "Hit target"                      "Disciplinary failure"           
## [13] "Education"                       "Son"                            
## [15] "Social drinker"                  "Social smoker"                  
## [17] "Pet"                             "Weight"                         
## [19] "Height"                          "Body mass index"                
## [21] "Absenteeism time in hours"
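To confirm the column count and see how R has stored each column, ncol() and sapply(work, class) are one possible combination; str(work) gives a similar overview in a single call. The sketch below uses a three-column stand-in for work built from values printed earlier (an assumption made only so the example runs by itself).

```r
# Stand-in for `work`: three columns, first three rows from the printout.
work <- data.frame(
  ID = c(11, 36, 3),
  `Reason for absence` = c(26, 0, 23),
  `Work load Average/day ` = c(239.554, 239.554, 239.554),
  check.names = FALSE
)

n_columns <- ncol(work)              # number of columns
column_types <- sapply(work, class)  # storage class of each column

print(n_columns)
print(column_types)
str(work)  # compact overview: dimensions, names, types, first values
```

On the full data set, ncol(work) should agree with the 21 names listed above.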

R has read every column as a numeric type (labeled dbl, short for double, R's type for decimal numbers, wherever column types are displayed). This is technically correct, because the data file stores everything as numbers, but some of these variables are really categorical. For example, Reason for absence is coded as numbers 0–28, but those numbers represent categories (disease types, appointment types, or 0 for no reason recorded), not quantities. Similarly, Day of the week uses 2 for Monday through 6 for Friday. We will address this distinction when we clean the data in Chapters 5 and 6. For now, we simply note it.
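As a small preview of why the distinction matters, the sketch below takes the six Day of the week codes from the rows printed earlier. R will happily average them, but the result has no real meaning; treating the codes as a factor gives counts per category instead. The vector name day_codes and the label vector are illustrative choices, not part of the data set, and the cleaning chapters may take a different approach.

```r
# Day of the week codes from the six rows shown earlier (2 = Monday ... 6 = Friday).
day_codes <- c(3, 3, 4, 5, 5, 6)

# R computes this without complaint, but an "average weekday" is not meaningful.
mean_code <- mean(day_codes)

# Treat the codes as categories with readable labels.
day_factor <- factor(
  day_codes,
  levels = 2:6,
  labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
)

day_counts <- table(day_factor)  # rows per weekday, including zero counts
print(mean_code)
print(day_counts)
```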

Column Names with Spaces

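You may notice that the column names contain spaces (e.g., Reason for absence). In R, when a column name contains spaces, you need to wrap it in backticks when referencing it in code, for example work$`Reason for absence`. Look closely at the names() output and you will also see that one name, "Work load Average/day ", ends with a trailing space, which is easy to miss when typing it. We will rename these columns to simpler names when we clean the data in Chapter 6.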
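The sketch below shows two ways to reach a column whose name contains spaces: backticks with $, and a quoted string with [[ ]]. It also shows what happens when the trailing space in "Work load Average/day " is left off. The two-column work object is a stand-in built from values printed earlier, assumed here only so the example is self-contained.

```r
# Stand-in for `work`: two columns, first three rows from the printout.
work <- data.frame(
  `Reason for absence` = c(26, 0, 23),
  `Work load Average/day ` = c(239.554, 239.554, 239.554),
  check.names = FALSE
)

# Backticks with $, or a quoted string with [[ ]], return the same column.
reasons_a <- work$`Reason for absence`
reasons_b <- work[["Reason for absence"]]
same_column <- identical(reasons_a, reasons_b)

# [[ ]] matches names exactly, so omitting the trailing space finds nothing.
missing_col <- work[["Work load Average/day"]]   # NULL
found_col <- work[["Work load Average/day "]]    # the actual column

print(same_column)
print(is.null(missing_col))
print(found_col)
```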