6 Case Study: Loading and Cleaning the Data
In this case study, we transform the “Absenteeism at Work Data Set” into a structured data frame. The objective is to clean and organize the data, making it suitable for extracting meaningful business intelligence insights. This process involves loading the data, processing it through various stages of cleaning, and finally, examining the refined output.
6.1 Loading the Data
The first step in our data analysis journey involves loading the absenteeism dataset from a specified URL. The read.csv function in R is utilized for this purpose, with the separator specified as a semicolon to correctly parse the CSV file.
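A minimal sketch of this step, with a hypothetical URL standing in for the dataset’s actual location:

```r
# Hypothetical URL; substitute the dataset's actual location.
url <- "https://example.com/Absenteeism_at_work.csv"

# The file is semicolon-separated, so sep = ";" is required.
absenteeism <- read.csv(url, sep = ";")
```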
6.2 Processing the Data
6.2.1 Creating a Mode Function
To analyze the data effectively, we introduce a custom mode function, MODE, which identifies the most frequently occurring value in a given vector. In case of a tie, the function returns only the first mode it encounters after sorting.
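Reconstructed from the breakdown below, the function is:

```r
# Return the most frequent value in x, converted back to a number.
# Ties: only the first mode after sorting by descending frequency is returned.
MODE <- function(x) {
  names(sort(-table(x)))[1] |> as.numeric()
}
```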
Breaking Down the Code
- Function definition: `MODE <- function(x) { ... }` defines a new function named `MODE` that takes a single argument `x`, expected to be a vector of values for which the mode is to be calculated.
- `table(x)`: creates a frequency table of the values in `x`. Each unique value in `x` becomes a name in the table, and the corresponding entry counts the number of times that value appears in `x`.
- `sort(-table(x))`: the `sort` function orders the elements of the frequency table. The negation operator `-` is applied to the table values before sorting, so the result is in descending order of frequency; the most frequent values come first.
- `names(sort(-table(x)))[1]`: after sorting, `names(...)` extracts the names (i.e., the original values from `x`) of the sorted frequency table. Indexing with `[1]` selects the first name, which corresponds to the value with the highest frequency in `x`.
- `|> as.numeric()`: `|>` is the native pipe operator introduced in R 4.1.0, which forwards the left-hand side as the first argument to the function on the right-hand side. Here it forwards the most frequent value (still a character string, because it was extracted as a name from the table) to `as.numeric()`, converting it to a numeric value.
It’s important to note that this function has a limitation: if there are multiple modes in the data set (i.e., more than one value shares the maximum frequency), the function will only return the first of these modes it encounters after sorting. This might not always be desirable, as there could be cases where identifying all modes is important, or where the numeric ordering of modes is not meaningful.
6.2.2 Filtering by Service Time
An initial filtering step involves removing records of employees with less than six months of service, ensuring the analysis focuses on more established workers.
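A sketch of this step, assuming the raw data was loaded into a data frame named `absenteeism`:

```r
library(dplyr)

# Keep only employees with at least six months of service.
Absenteeism.by.employee <- absenteeism |>
  filter(Service.time >= 6)
```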
Breaking Down the Code
- Pipe operator (`|>`): passes the result of one expression as the argument to the next, chaining a sequence of operations in a more readable manner.
- `absenteeism`: the dataset being filtered. It contains records of employee absenteeism, including details such as the employee’s ID, reason for absence, duration of absence, and service time with the company.
- `filter(Service.time >= 6)`: uses the `filter` function from the `dplyr` package, part of the tidyverse suite of data manipulation tools, to subset rows based on a condition. The condition `Service.time >= 6` retains only rows where the `Service.time` column (the employee’s length of service in months) is 6 or more; rows with a service time of less than 6 months are excluded.
- `Absenteeism.by.employee <-`: assigns the filtered dataset to a new variable called `Absenteeism.by.employee`, which now contains only those employees who have been with the company for 6 months or more.
In summary, this code snippet is used to create a new dataset from the original absenteeism dataset, containing only the records of employees with a service time of at least 6 months.
6.2.3 Aggregating Data by Employee
The next phase involves grouping the data by employee ID and calculating various summary statistics. This includes finding the most common reason for absence, the typical month and day of the week for absences, and other key metrics. A crucial step here is to normalize the total absenteeism time by the service time and to calculate the average number of absences per unit of service time.
```r
Absenteeism.by.employee <- Absenteeism.by.employee |>
  group_by(ID) |>
  mutate(Most.common.reason.for.absence = MODE(Reason.for.absence),
         Month.of.absence = MODE(Month.of.absence),
         Day.of.the.week = MODE(Day.of.the.week),
         Education = max(Education),
         Absenteeism.time.in.hours =
           sum(Absenteeism.time.in.hours) / Service.time,
         Number.of.absence = n() / Service.time)
```

Breaking Down the Code
- Group by employee ID: the `group_by(ID)` function groups the data frame by employee ID. All subsequent operations are performed within each group of rows that share the same ID, allowing individualized calculations for each employee.
- Calculate the most common reason for absence: within each group (i.e., for each employee), the custom `MODE` function is applied to the `Reason.for.absence` column to determine the most common reason an employee was absent. If there are multiple modes, the function returns only the first one it encounters after sorting.
- Determine the mode of the month of absence: similarly, `MODE` is used on the `Month.of.absence` column to find the month in which each employee was most frequently absent.
- Find the most common day of the week of absence: `MODE` is applied to the `Day.of.the.week` column, identifying the day of the week on which each employee was most commonly absent.
- Identify the maximum education level: the `max` function is used on the `Education` column to find the highest education level of each employee. This step assumes that education levels are encoded numerically, with higher numbers representing higher levels of education.
- Calculate adjusted absenteeism time: the total absenteeism time in hours (`Absenteeism.time.in.hours`) for each employee is summed and divided by the employee’s service time (`Service.time`), yielding a measure of absenteeism that accounts for length of service.
- Compute the number of absences per unit of service time: the `n()` function counts the number of rows (i.e., absence records) for each employee, which is then divided by `Service.time`. This yields the average number of absences per unit of service time, a standardized measure of absence frequency that accounts for differences in service lengths.
6.2.4 Summarizing Data
The dataset is further summarized by averaging the observations for each employee, streamlining the dataset for analysis.
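A sketch of this step, matching the breakdown below:

```r
# Collapse each employee's records into a single row of averages.
Absenteeism.by.employee <- Absenteeism.by.employee |>
  group_by(ID) |>
  summarise(across(everything(), mean))
```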
Breaking Down the Code
- Group by employee ID: the `group_by(ID)` function groups the data frame by employee ID, ensuring that all subsequent calculations are performed separately for each employee.
- Summarize across all variables: the `summarise` function is combined with `across(everything(), mean)` to apply an operation across all columns in each group. The `everything()` selector targets all columns in the data frame, and the `mean` function is applied to compute the average of each.
- Calculate the mean of each variable: for each employee, the mean of every variable is calculated. This includes all numeric variables present after the previous processing steps, such as the most common reason for absence, the most common month and day of the week of absence, maximum education level, adjusted absenteeism time, and the number of absences per unit of service time.
- Create a new data frame: the result of the `summarise` operation is a new data frame in which each row corresponds to an employee (identified by ID) and each column contains the mean of the respective variable for that employee. This new data frame is reassigned to the `Absenteeism.by.employee` variable.
This process effectively condenses the data, providing a single-row summary for each employee that captures the average values of all measured variables. It’s a common step in data preparation, especially when the goal is to analyze patterns or trends at the individual level without the noise of multiple records per individual.
6.2.5 Enhancing Date and Time Variables
Using the lubridate package, the month and day of the week variables are transformed into more readable formats, enhancing the interpretability of the data.
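A sketch of these transformations, reconstructed from the steps described below; the `+ 1` and `- 1` adjustments assume the original coding starts at 1 for Monday and 1 for January, respectively:

```r
library("lubridate")

Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(
    # wday() treats 1 as Sunday, so shift by one for data coded 1 = Monday.
    Day.of.the.week = wday(Day.of.the.week + 1, label = TRUE) |>
      as.character(),
    # Build a dummy date, add (month - 1) months, and extract the month label.
    Month.of.absence = month(ymd(010101) + months(Month.of.absence - 1),
                             label = TRUE, abbr = TRUE) |>
      as.character()
  )
```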
Breaking Down the Code
- Load the `lubridate` package: the `library("lubridate")` command loads the `lubridate` package, which provides functions that make it easier to work with dates and times in R.
- Transform `Day.of.the.week`:
  - The `wday()` function from the `lubridate` package converts the numeric `Day.of.the.week` values into actual weekday names.
  - Since `wday()` by default treats 1 as Sunday, `Day.of.the.week + 1` is used to adjust for the off-by-one discrepancy in the original data (assuming the data starts with 1 for Monday).
  - The `label = TRUE` argument tells `wday()` to return weekday names instead of numbers.
  - The final result is converted to a character string using `as.character()` so that the new `Day.of.the.week` variable consists of readable weekday names.
- Transform `Month.of.absence`:
  - The `month()` function, also from `lubridate`, converts the numeric `Month.of.absence` values into month names.
  - `ymd(010101)` creates a dummy date (January 1, 0001) to which `months(Month.of.absence - 1)` is added. Subtracting 1 from `Month.of.absence` aligns the numeric month values with their correct positions (e.g., 1 for January, 2 for February), assuming the original data used 1 for January, 2 for February, and so on.
  - The `label = TRUE, abbr = TRUE` arguments ensure the resulting month names are abbreviated and readable.
  - The output is converted to a character string with `as.character()` to store the transformed `Month.of.absence` values as abbreviated month names.
By utilizing the lubridate package, this code effectively enhances the Absenteeism.by.employee data frame, making the day of the week and month of absence variables more intuitive and accessible for analysis or reporting purposes.
6.2.6 Refining the Data Set
The final selection of variables is made, focusing on key attributes that will be used in subsequent analyses.
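Based on the column list below, the selection step looks like:

```r
Absenteeism.by.employee <- Absenteeism.by.employee |>
  select(ID, Number.of.absence, Absenteeism.time.in.hours,
         Most.common.reason.for.absence, Month.of.absence,
         Day.of.the.week, Body.mass.index, Age, Social.smoker,
         Social.drinker, Son, Pet, Education)
```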
Breaking Down the Code
- Data frame specification: `Absenteeism.by.employee <- Absenteeism.by.employee |>` indicates that the operations following the pipe (`|>`) operator apply to the `Absenteeism.by.employee` data frame, and the result overwrites the original data frame.
- Column selection: the `select()` function explicitly chooses which columns to retain in the final data frame. The columns included are:
  - `ID`: employee identifier.
  - `Number.of.absence`: the average number of absences per unit of service time for each employee.
  - `Absenteeism.time.in.hours`: the adjusted measure of absenteeism time, accounting for length of service.
  - `Most.common.reason.for.absence`: the most frequent reason for an employee’s absence.
  - `Month.of.absence`: the month in which an employee was most frequently absent, transformed to a character string of abbreviated month names.
  - `Day.of.the.week`: the day of the week on which an employee was most commonly absent, transformed to a character string of day names.
  - `Body.mass.index`: the body mass index (BMI) of each employee.
  - `Age`: the age of each employee.
  - `Social.smoker`: indicates whether the employee is a smoker.
  - `Social.drinker`: indicates whether the employee is a drinker.
  - `Son`: the number of children the employee has.
  - `Pet`: the number of pets the employee has.
  - `Education`: the highest level of education achieved by the employee.
By selecting these specific columns, the data frame is streamlined to include only the variables that are deemed necessary for further analysis or reporting. This step can help in focusing on the most relevant data and improving the efficiency of subsequent data handling tasks.
6.2.7 Recoding Variables
Several categorical variables, such as smoking and drinking habits, are recoded for clarity, and redundant variables are removed from the dataset.
```r
Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(Social.smoker = ifelse(Social.smoker == 0,
                                "Non-smoker",
                                "Smoker"),
         Social.drinker = ifelse(Social.drinker == 0,
                                 "Non-drinker",
                                 "Social drinker"),
         Children = ifelse(Son == 0,
                           "Non-parent",
                           "Parent"),
         Pet = ifelse(Pet == 0,
                      "No Pet(s)",
                      "Pet(s)")) |>
  select(-Son, -ID)
```

Breaking Down the Code
- Recoding `Social.smoker`: the `mutate` function transforms the `Social.smoker` column. The `ifelse` statement checks each value; if it is 0, it is replaced with “Non-smoker”, otherwise with “Smoker”. This recodes the binary representation (0 and 1) into a more readable format.
- Recoding `Social.drinker`: similarly, `Social.drinker` is recoded with `ifelse`. Values of 0 become “Non-drinker”, and all other values become “Social drinker”, converting numeric codes into descriptive categories.
- Creating the `Children` category: a new column, `Children`, is created with `ifelse` from the `Son` column (the number of children an employee has). A value of 0 is labeled “Non-parent”; at least one child is labeled “Parent”, categorizing employees by parenthood status.
- Recoding `Pet`: the `Pet` column is transformed in the same manner. If an employee has no pets (`Pet == 0`), the value is set to “No Pet(s)”; otherwise it is set to “Pet(s)”, indicating the presence of one or more pets.
- Removing columns: after these transformations, the `select` function with the minus sign (`-`) removes the `Son` and `ID` columns from the data frame. The `Son` column is removed because its information is now encapsulated in the more descriptive `Children` column; the `ID` column is removed because it is no longer needed for the analysis, or to anonymize the data.
The result is a data frame where binary or numeric indicators for smoking, drinking, having children, and owning pets are converted into descriptive categories, enhancing the readability and interpretability of the data. Additionally, unnecessary columns are dropped to streamline the dataset.
6.2.8 Simplifying Education Levels
The education variable is simplified into more general categories, aiding in the analysis and interpretation of the data.
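A sketch of this step, following the breakdown below. Note that `College` is derived before `Education` is converted to a factor, since `ifelse` compares the numeric codes:

```r
library(forcats)

Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(College = ifelse(Education >= 2, "college", "high school"),
         Education = fct_recode(factor(Education),
                                "high school" = "1",
                                "graduate" = "2",
                                "postgraduate" = "3",
                                "master and doctor" = "4"))
```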
Breaking Down the Code
This R code snippet further manipulates the Absenteeism.by.employee data frame by transforming the Education column and adding a new College column:
- Creating the `College` column: the `mutate` function adds a new column named `College` based on the values of the `Education` column. The `ifelse` function checks whether `Education` is greater than or equal to 2; if so, the entry is labeled “college”, indicating that the employee has at least some college education. If the value is less than 2, it is labeled “high school”, indicating that the employee’s highest level of education is high school or lower.
- Transforming the `Education` column: `Education` is then converted into a factor using the `factor()` function, R’s way of categorizing data. The `fct_recode` function from the `forcats` package (loaded as part of the tidyverse) is used to rename the factor levels with more meaningful descriptions:
  - `"high school" = "1"`: education level 1 is recoded as “high school”.
  - `"graduate" = "2"`: education level 2 is recoded as “graduate”.
  - `"postgraduate" = "3"`: education level 3 is recoded as “postgraduate”.
  - `"master and doctor" = "4"`: education level 4 is recoded as “master and doctor”.
This transformation enriches the dataset by providing a clear distinction between employees with only high school education and those with some college education or higher. It also makes the Education column more informative by replacing numerical codes with descriptive labels, making the data easier to understand and analyze.
6.2.9 Categorizing Absenteeism
A new variable is introduced to categorize employees based on their absenteeism time, dividing them into high and low absenteeism groups based on the median.
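Reconstructed from the breakdown below:

```r
# Split employees into high/low groups at the median absenteeism time.
Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(High.absenteeism = ifelse(
    Absenteeism.time.in.hours >= median(Absenteeism.time.in.hours),
    "High Absenteeism",
    "Low Absenteeism"))
```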
Breaking Down the Code
- Using `mutate` to create a new column: the `mutate` function is applied to the `Absenteeism.by.employee` data frame to create a new column named `High.absenteeism`.
- Categorizing based on absenteeism time: the `ifelse` function evaluates each employee’s `Absenteeism.time.in.hours`. If an employee’s absenteeism time is greater than or equal to the median absenteeism time across the entire data frame, they are categorized as “High Absenteeism”; if it is below the median, they are categorized as “Low Absenteeism”.
- Calculating the median: `median(Absenteeism.time.in.hours)` computes the median of the `Absenteeism.time.in.hours` column for the entire data frame. This median serves as the threshold to categorize employees.
This operation effectively splits the dataset into two groups, allowing for further analysis or reporting on patterns or trends related to absenteeism levels among employees. It provides a straightforward method for identifying employees with relatively high or low absenteeism, which can be useful for human resource management and policy-making.
6.2.10 Final Touches
Character variables are converted to factors to facilitate analysis.
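A minimal sketch; base R’s `is.character` is used here in place of the equivalent `is_character` helper mentioned below, so the chunk runs with only `dplyr` and `forcats` loaded:

```r
# Convert every character column to a factor for modeling and plotting.
Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(across(where(is.character), as_factor))
```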
Breaking Down the Code
- The `mutate` function: modifies or creates columns within a data frame. In this context, it applies a transformation across multiple columns.
- The `across` function: applies a specified function or transformation across selected columns of a data frame; columns can be targeted by conditions such as data type or column name.
- The `where` helper: used within `across` to specify which columns should be affected by the transformation. Here, `where(is_character)` selects all columns whose data type is character.
- The `as_factor` function: applied to each selected column to convert character data into factors. Factors in R are data structures used to categorize and store categorical data, and they are beneficial for statistical modeling and visualizations.
By converting character columns to factors, the data is prepared for analysis methods that require categorical variables to be in factor form, such as many types of statistical modeling. This step is often important in data preprocessing, especially when working with categorical data in R.
6.3 Examining the Final Data Frame
The final data frame is then displayed, offering a clear view of the structured and cleaned dataset ready for analysis.
| Number.of.absence | Absenteeism.time.in.hours | Most.common.reason.for.absence | Month.of.absence | Day.of.the.week | Body.mass.index | Age | Social.smoker | Social.drinker | Pet | Education | Children | College | High.absenteeism |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.64 | 8.64 | 22 | Aug | Tue | 29 | 37 | Non-smoker | Non-drinker | Pet(s) | postgraduate | Parent | college | High Absenteeism |
| 0.5 | 2.08 | 0 | Aug | Tue | 33 | 48 | Smoker | Non-drinker | Pet(s) | high school | Parent | high school | Low Absenteeism |
| 6.28 | 26.8 | 27 | Feb | Thu | 31 | 38 | Non-smoker | Social drinker | No Pet(s) | high school | Non-parent | high school | High Absenteeism |
| 0.0769 | 0 | 0 | Dec | Wed | 34 | 40 | Non-smoker | Social drinker | Pet(s) | high school | Parent | high school | Low Absenteeism |
| 1.46 | 8 | 26 | Sep | Tue | 38 | 43 | Non-smoker | Social drinker | No Pet(s) | high school | Parent | high school | High Absenteeism |
| 0.615 | 5.54 | 22 | Feb | Fri | 25 | 33 | Non-smoker | Non-drinker | Pet(s) | high school | Parent | high school | High Absenteeism |
6.4 Saving the Processed Data
Finally, the processed data frame is saved to a CSV file, making it accessible for further analysis or sharing.
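As described in the breakdown below, the call is:

```r
# Write the cleaned data frame to data/Absenteeism_by_employee.csv,
# dropping R's automatic row names.
write.csv(Absenteeism.by.employee,
          file = "data/Absenteeism_by_employee.csv",
          row.names = FALSE)
```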
Breaking Down the Code
- The `write.csv` function: an R function that writes a data frame to a CSV (comma-separated values) file. It is commonly used to export data from R for use in other software or for sharing with others.
- First argument (`Absenteeism.by.employee`): specifies the data frame to write to the CSV file; here, the processed `Absenteeism.by.employee` data frame containing the cleaned absenteeism data.
- The `file` argument: specifies the path and name of the CSV file to be created. Here, the file is named `Absenteeism_by_employee.csv` and saved in a directory named `data`, which should exist in your current working directory (or you should have permission to create it).
- `row.names = FALSE`: indicates that row names from the data frame should not be included in the CSV file. This is usually desirable because R’s automatically generated row names carry no meaningful information for the dataset.
The end result is a CSV file named Absenteeism_by_employee.csv in the data folder, containing all the data from the Absenteeism.by.employee data frame, ready for use outside of R, sharing with others, or for archival purposes.
6.5 Conclusion
This case study demonstrates the process of loading, cleaning, and preparing a dataset for analysis. Through a series of transformations, recoding, and summarizations, the Absenteeism at Work Data Set is refined into a structured format conducive to business intelligence applications, providing a foundational basis for insightful data analysis.
6.6 Homework Assignment: Cleaning Data
6.6.1 Objective
The objective of this case study assignment is to develop hands-on experience in data manipulation, analysis, and visualization using agricultural datasets. You will practice working with real-world data, applying techniques in data cleaning, merging, and aggregation.
6.6.2 The Data
The datasets used in this assignment are provided by the agridat package, which contains historical agricultural data from the United States Department of Agriculture, National Agricultural Statistics Service. These datasets encompass yields and acres harvested for major crops across the U.S. states, spanning from approximately 1900 to 2011. The crops included are barley, corn, cotton, hay, rice, sorghum, soybeans, and wheat.
Each dataset is structured with the following variables:
- year: The year of the data record.
- state: The U.S. state to which the data pertains, treated as a categorical factor.
- acres: The total acres harvested for the crop in question.
- yield: The average yield per acre for the crop.
It’s important to note the units of measurement for yield differ by crop type: bushels per acre for barley, corn, sorghum, soybeans, and wheat; pounds per acre for cotton and rice; and tons per acre for hay.
A caution is advised when interpreting yield values, especially for states with small acres harvested, as these figures might not represent the broader agricultural productivity accurately. We will focus on corn and barley.
6.6.3 Load the Data
Load the historical crop yield datasets for barley and corn, which are provided by the United States Department of Agriculture’s National Agricultural Statistics Service, through the agridat package in R. Use the tail function to display the last few rows of the datasets for barley and corn. This will help you get a sense of the data’s structure and the values it contains.
6.6.4 Define/Document the Variables and Datasets
Provide a brief description of the datasets and the variables they contain.
6.6.5 Join Data
To prepare the datasets for merging and ensure a smooth analysis process, it’s essential to first address potential naming conflicts among key variables. This is achieved by renaming the acres and yield columns in each crop dataset, appending the crop name as a prefix to these variables. This step, accomplished using the dplyr::rename function, helps in clearly distinguishing between the variables from different datasets when they are combined. Following the renaming, the datasets are merged into a single data frame through a full join operation. This merge is conducted on the common variables state and year, effectively combining the corn and barley datasets into one comprehensive data frame for analysis. This process not only facilitates a unified analysis but also exemplifies the method of combining datasets based on shared variables.
6.6.6 Add State Metadata
Integrating state metadata into your analysis involves creating a data frame that contains state names, their regions, and the area of each state, using the available built-in R data. After assembling this data frame, proceed to summarize the area by region to provide a clear overview of the geographical distribution within your dataset.
6.6.7 Finalize the state data frame
Start by narrowing down your dataset to include only records from the year 1900 onwards, ensuring that your analysis focuses on more recent agricultural trends. Next, undertake a state-level aggregation by categorizing the data by both state and decade. For each state-decade pair, calculate the mean of all numeric variables, which will facilitate the examination of trends over time across different states. To add further depth to your analysis, merge this aggregated data with the state metadata using a left_join operation. This integration will introduce region information to your dataset, enabling a more nuanced analysis of agricultural trends not just across time and states, but also across different regions. This comprehensive approach will provide a multi-dimensional view of the data, allowing for a richer understanding of agricultural developments over the past century.
6.6.8 Finalize a regional data frame
Create a regional dataset by first joining your filtered data with state metadata to include region details for each record. Then, group this dataset by year and region, calculating the average yield and total acres harvested for each group. Next, add a decade variable by grouping years into their respective decades, and re-group the data by this decade and region, removing the year variable to focus on these broader categories. Finally, summarize this data to calculate the mean for all variables within each decade-region grouping, producing the region.grain dataset.
6.6.9 Plot the Yield by Region
Plot the average regional yield for corn and barley.
6.6.9.1 Transforming the Data for Visualization
First, prepare your dataset for visualization by converting it into a long format, which is more amenable to plotting with tools like ggplot2 in R. Utilize the pivot_longer function to achieve this transformation, ensuring that each grain type and its associated metric (yield or acres) are delineated into separate columns. This step is crucial for accurately representing the diverse aspects of your dataset in the visualizations, as in the sketch below.
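For instance, assuming the merged columns carry crop-name prefixes from the renaming step in section 6.6.5 (the exact names depend on your own code), a hypothetical reshaping could look like:

```r
library(tidyr)

# Hypothetical column names; adjust to match your renaming step.
region.grain.long <- region.grain |>
  pivot_longer(cols = c(corn.yield, corn.acres, barley.yield, barley.acres),
               names_to = c("grain", ".value"),
               names_sep = "\\.")
```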
6.6.9.2 Creating the Plot
Once your data is in the correct format, focus on visualizing yield trends, specifically. Filter the dataset to include only records from 1950 onwards, concentrating on the ‘yield’ metric to maintain clarity and relevance in your analysis. Employ ggplot2 to craft a line plot that illustrates these yield trends across different regions and over the decades. To enhance the plot’s readability and comparative value, use color coding to differentiate between regions and faceting to separate the data by grain type.