6 Case Study: Loading and Cleaning the Data
In this case study, we transform the “Absenteeism at Work Data Set” into a structured data frame. The objective is to clean and organize the data, making it suitable for extracting meaningful business intelligence insights. This process involves loading the data, processing it through various stages of cleaning, and finally, examining the refined output.
6.1 Loading the Data
The first step in our data analysis journey involves loading the absenteeism dataset from a specified URL. The read.csv function in R is utilized for this purpose, with the separator specified as a semicolon to correctly parse the CSV file.
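A minimal sketch of this step, with a hypothetical URL standing in for the dataset’s actual location:

```r
# Hypothetical URL; substitute the dataset's actual location.
url <- "https://example.com/Absenteeism_at_work.csv"

# The file is semicolon-separated, so sep = ";" is required.
absenteeism <- read.csv(url, sep = ";")
```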
6.2 Processing the Data
6.2.1 Creating a Mode Function
To analyze the data effectively, we introduce a custom mode function, MODE, which identifies the most frequently occurring value in a given vector. In case of a tie, the function returns only the first mode it encounters after sorting.
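Reconstructed from the breakdown below, the function is:

```r
# Return the most frequent value in x, converted back to a number.
# Ties: only the first mode after sorting by descending frequency is returned.
MODE <- function(x) {
  names(sort(-table(x)))[1] |> as.numeric()
}
```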
Breaking Down the Code
- Function definition: `MODE <- function(x) { ... }` defines a new function named `MODE` that takes a single argument `x`, expected to be a vector of values for which the mode is to be calculated.
- `table(x)`: creates a frequency table of the values in `x`. Each unique value in `x` becomes a name in the table, and the corresponding entry counts the number of times that value appears in `x`.
- `sort(-table(x))`: the `sort` function orders the elements of the frequency table. The negation operator `-` is applied to the table values before sorting, so the result is in descending order of frequency; the most frequent values come first.
- `names(sort(-table(x)))[1]`: after sorting, `names(...)` extracts the names (i.e., the original values from `x`) of the sorted frequency table. Indexing with `[1]` selects the first name, which corresponds to the value with the highest frequency in `x`.
- `|> as.numeric()`: `|>` is the native pipe operator introduced in R 4.1.0, which forwards the left-hand side as the first argument to the function on the right-hand side. Here it forwards the most frequent value (still a character string, because it was extracted as a name from the table) to `as.numeric()`, converting it to a numeric value.
It’s important to note that this function has a limitation: if there are multiple modes in the data set (i.e., more than one value shares the maximum frequency), the function will only return the first of these modes it encounters after sorting. This might not always be desirable, as there could be cases where identifying all modes is important, or where the numeric ordering of modes is not meaningful.
6.2.2 Filtering by Service Time
An initial filtering step involves removing records of employees with less than six months of service, ensuring the analysis focuses on more established workers.
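A sketch of this step, assuming the raw data was loaded into a data frame named `absenteeism`:

```r
library(dplyr)

# Keep only employees with at least six months of service.
Absenteeism.by.employee <- absenteeism |>
  filter(Service.time >= 6)
```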
Breaking Down the Code
- Pipe operator (`|>`): passes the result of one expression as the argument to the next, chaining a sequence of operations in a more readable manner.
- `absenteeism`: the dataset being filtered. It contains records of employee absenteeism, including details such as the employee’s ID, reason for absence, duration of absence, and service time with the company.
- `filter(Service.time >= 6)`: uses the `filter` function from the `dplyr` package, part of the tidyverse suite of data manipulation tools, to subset rows based on a condition. The condition `Service.time >= 6` retains only rows where the `Service.time` column (the employee’s length of service in months) is 6 or more; rows with a service time of less than 6 months are excluded.
- `Absenteeism.by.employee <-`: assigns the filtered dataset to a new variable called `Absenteeism.by.employee`, which now contains only those employees who have been with the company for 6 months or more.
In summary, this code snippet is used to create a new dataset from the original absenteeism dataset, containing only the records of employees with a service time of at least 6 months.
6.2.3 Aggregating Data by Employee
The next phase involves grouping the data by employee ID and calculating various summary statistics. This includes finding the most common reason for absence, the typical month and day of the week for absences, and other key metrics. A crucial step here is to normalize the total absenteeism time by the service time and to calculate the average number of absences per unit of service time.
```r
Absenteeism.by.employee <- Absenteeism.by.employee |>
  group_by(ID) |>
  mutate(Most.common.reason.for.absence = MODE(Reason.for.absence),
         Month.of.absence = MODE(Month.of.absence),
         Day.of.the.week = MODE(Day.of.the.week),
         Education = max(Education),
         Absenteeism.time.in.hours =
           sum(Absenteeism.time.in.hours) / Service.time,
         Number.of.absence = n() / Service.time)
```

Breaking Down the Code
- Group by employee ID: the `group_by(ID)` function groups the data frame by employee ID. All subsequent operations are performed within each group of rows that share the same ID, allowing individualized calculations for each employee.
- Calculate the most common reason for absence: within each group (i.e., for each employee), the custom `MODE` function is applied to the `Reason.for.absence` column to determine the most common reason an employee was absent. If there are multiple modes, the function returns only the first one it encounters after sorting.
- Determine the mode of the month of absence: similarly, `MODE` is used on the `Month.of.absence` column to find the month in which each employee was most frequently absent.
- Find the most common day of the week of absence: `MODE` is applied to the `Day.of.the.week` column, identifying the day of the week on which each employee was most commonly absent.
- Identify the maximum education level: the `max` function is used on the `Education` column to find the highest education level of each employee. This step assumes that education levels are encoded numerically, with higher numbers representing higher levels of education.
- Calculate adjusted absenteeism time: the total absenteeism time in hours (`Absenteeism.time.in.hours`) for each employee is summed and divided by the employee’s service time (`Service.time`), yielding a measure of absenteeism that accounts for length of service.
- Compute the number of absences per unit of service time: the `n()` function counts the number of rows (i.e., absence records) for each employee, which is then divided by `Service.time`. This yields the average number of absences per unit of service time, a standardized measure of absence frequency that accounts for differences in service lengths.
6.2.4 Summarizing Data
The dataset is further summarized by averaging the observations for each employee, streamlining the dataset for analysis.
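A sketch of this step, matching the breakdown below:

```r
# Collapse each employee's records into a single row of averages.
Absenteeism.by.employee <- Absenteeism.by.employee |>
  group_by(ID) |>
  summarise(across(everything(), mean))
```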
Breaking Down the Code
- Group by employee ID: the `group_by(ID)` function groups the data frame by employee ID, ensuring that all subsequent calculations are performed separately for each employee.
- Summarize across all variables: the `summarise` function is combined with `across(everything(), mean)` to apply an operation across all columns in each group. The `everything()` selector targets all columns in the data frame, and the `mean` function is applied to compute the average of each.
- Calculate the mean of each variable: for each employee, the mean of every variable is calculated. This includes all numeric variables present after the previous processing steps, such as the most common reason for absence, the most common month and day of the week of absence, maximum education level, adjusted absenteeism time, and the number of absences per unit of service time.
- Create a new data frame: the result of the `summarise` operation is a new data frame in which each row corresponds to an employee (identified by ID) and each column contains the mean of the respective variable for that employee. This new data frame is reassigned to the `Absenteeism.by.employee` variable.
This process effectively condenses the data, providing a single-row summary for each employee that captures the average values of all measured variables. It’s a common step in data preparation, especially when the goal is to analyze patterns or trends at the individual level without the noise of multiple records per individual.
6.2.5 Enhancing Date and Time Variables
Using the lubridate package, the month and day of the week variables are transformed into more readable formats, enhancing the interpretability of the data.
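A sketch of these transformations, reconstructed from the steps described below; the `+ 1` and `- 1` adjustments assume the original coding starts at 1 for Monday and 1 for January, respectively:

```r
library("lubridate")

Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(
    # wday() treats 1 as Sunday, so shift by one for data coded 1 = Monday.
    Day.of.the.week = wday(Day.of.the.week + 1, label = TRUE) |>
      as.character(),
    # Build a dummy date, add (month - 1) months, and extract the month label.
    Month.of.absence = month(ymd(010101) + months(Month.of.absence - 1),
                             label = TRUE, abbr = TRUE) |>
      as.character()
  )
```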
Breaking Down the Code
- Load the `lubridate` package: the `library("lubridate")` command loads the `lubridate` package, which provides functions that make it easier to work with dates and times in R.
- Transform `Day.of.the.week`:
  - The `wday()` function from the `lubridate` package converts the numeric `Day.of.the.week` values into actual weekday names.
  - Since `wday()` by default treats 1 as Sunday, `Day.of.the.week + 1` is used to adjust for the off-by-one discrepancy in the original data (assuming the data starts with 1 for Monday).
  - The `label = TRUE` argument tells `wday()` to return weekday names instead of numbers.
  - The final result is converted to a character string using `as.character()` so that the new `Day.of.the.week` variable consists of readable weekday names.
- Transform `Month.of.absence`:
  - The `month()` function, also from `lubridate`, converts the numeric `Month.of.absence` values into month names.
  - `ymd(010101)` creates a dummy date (January 1, 0001) to which `months(Month.of.absence - 1)` is added. Subtracting 1 from `Month.of.absence` aligns the numeric month values with their correct positions (e.g., 1 for January, 2 for February), assuming the original data used 1 for January, 2 for February, and so on.
  - The `label = TRUE, abbr = TRUE` arguments ensure the resulting month names are abbreviated and readable.
  - The output is converted to a character string with `as.character()` to store the transformed `Month.of.absence` values as abbreviated month names.
By utilizing the lubridate package, this code effectively enhances the Absenteeism.by.employee data frame, making the day of the week and month of absence variables more intuitive and accessible for analysis or reporting purposes.
6.2.6 Refining the Data Set
The final selection of variables is made, focusing on key attributes that will be used in subsequent analyses.
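Based on the column list below, the selection step looks like:

```r
Absenteeism.by.employee <- Absenteeism.by.employee |>
  select(ID, Number.of.absence, Absenteeism.time.in.hours,
         Most.common.reason.for.absence, Month.of.absence,
         Day.of.the.week, Body.mass.index, Age, Social.smoker,
         Social.drinker, Son, Pet, Education)
```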
Breaking Down the Code
- Data frame specification: `Absenteeism.by.employee <- Absenteeism.by.employee |>` indicates that the operations following the pipe (`|>`) operator apply to the `Absenteeism.by.employee` data frame, and the result overwrites the original data frame.
- Column selection: the `select()` function explicitly chooses which columns to retain in the final data frame. The columns included are:
  - `ID`: employee identifier.
  - `Number.of.absence`: the average number of absences per unit of service time for each employee.
  - `Absenteeism.time.in.hours`: the adjusted measure of absenteeism time, accounting for length of service.
  - `Most.common.reason.for.absence`: the most frequent reason for an employee’s absence.
  - `Month.of.absence`: the month in which an employee was most frequently absent, transformed to a character string of abbreviated month names.
  - `Day.of.the.week`: the day of the week on which an employee was most commonly absent, transformed to a character string of day names.
  - `Body.mass.index`: the body mass index (BMI) of each employee.
  - `Age`: the age of each employee.
  - `Social.smoker`: indicates whether the employee is a smoker.
  - `Social.drinker`: indicates whether the employee is a drinker.
  - `Son`: the number of children the employee has.
  - `Pet`: the number of pets the employee has.
  - `Education`: the highest level of education achieved by the employee.
By selecting these specific columns, the data frame is streamlined to include only the variables that are deemed necessary for further analysis or reporting. This step can help in focusing on the most relevant data and improving the efficiency of subsequent data handling tasks.
6.2.7 Recoding Variables
Several categorical variables, such as smoking and drinking habits, are recoded for clarity, and redundant variables are removed from the dataset.
```r
Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(Social.smoker = ifelse(Social.smoker == 0,
                                "Non-smoker",
                                "Smoker"),
         Social.drinker = ifelse(Social.drinker == 0,
                                 "Non-drinker",
                                 "Social drinker"),
         Children = ifelse(Son == 0,
                           "Non-parent",
                           "Parent"),
         Pet = ifelse(Pet == 0,
                      "No Pet(s)",
                      "Pet(s)")) |>
  select(-Son, -ID)
```

Breaking Down the Code
- Recoding `Social.smoker`: the `mutate` function transforms the `Social.smoker` column. The `ifelse` statement checks each value; if it is 0, it is replaced with “Non-smoker”, otherwise with “Smoker”. This recodes the binary representation (0 and 1) into a more readable format.
- Recoding `Social.drinker`: similarly, `Social.drinker` is recoded with `ifelse`. Values of 0 become “Non-drinker”, and all other values become “Social drinker”, converting numeric codes into descriptive categories.
- Creating the `Children` category: a new column, `Children`, is created with `ifelse` from the `Son` column (the number of children an employee has). A value of 0 is labeled “Non-parent”; at least one child is labeled “Parent”, categorizing employees by parenthood status.
- Recoding `Pet`: the `Pet` column is transformed in the same manner. If an employee has no pets (`Pet == 0`), the value is set to “No Pet(s)”; otherwise it is set to “Pet(s)”, indicating the presence of one or more pets.
- Removing columns: after these transformations, the `select` function with the minus sign (`-`) removes the `Son` and `ID` columns from the data frame. The `Son` column is removed because its information is now encapsulated in the more descriptive `Children` column; the `ID` column is removed because it is no longer needed for the analysis, or to anonymize the data.
The result is a data frame where binary or numeric indicators for smoking, drinking, having children, and owning pets are converted into descriptive categories, enhancing the readability and interpretability of the data. Additionally, unnecessary columns are dropped to streamline the dataset.
6.2.8 Simplifying Education Levels
The education variable is simplified into more general categories, aiding in the analysis and interpretation of the data.
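A sketch of this step, following the breakdown below. Note that `College` is derived before `Education` is converted to a factor, since `ifelse` compares the numeric codes:

```r
library(forcats)

Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(College = ifelse(Education >= 2, "college", "high school"),
         Education = fct_recode(factor(Education),
                                "high school" = "1",
                                "graduate" = "2",
                                "postgraduate" = "3",
                                "master and doctor" = "4"))
```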
Breaking Down the Code
This R code snippet further manipulates the Absenteeism.by.employee data frame by transforming the Education column and adding a new College column:
- Creating the `College` column: the `mutate` function adds a new column named `College` based on the values of the `Education` column. The `ifelse` function checks whether `Education` is greater than or equal to 2; if so, the entry is labeled “college”, indicating that the employee has at least some college education. If the value is less than 2, it is labeled “high school”, indicating that the employee’s highest level of education is high school or lower.
- Transforming the `Education` column: `Education` is then converted into a factor using the `factor()` function, R’s way of categorizing data. The `fct_recode` function from the `forcats` package (loaded as part of the tidyverse) is used to rename the factor levels with more meaningful descriptions:
  - `"high school" = "1"`: education level 1 is recoded as “high school”.
  - `"graduate" = "2"`: education level 2 is recoded as “graduate”.
  - `"postgraduate" = "3"`: education level 3 is recoded as “postgraduate”.
  - `"master and doctor" = "4"`: education level 4 is recoded as “master and doctor”.
This transformation enriches the dataset by providing a clear distinction between employees with only high school education and those with some college education or higher. It also makes the Education column more informative by replacing numerical codes with descriptive labels, making the data easier to understand and analyze.
6.2.9 Categorizing Absenteeism
A new variable is introduced to categorize employees based on their absenteeism time, dividing them into high and low absenteeism groups based on the median.
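Reconstructed from the breakdown below:

```r
# Split employees into high/low groups at the median absenteeism time.
Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(High.absenteeism = ifelse(
    Absenteeism.time.in.hours >= median(Absenteeism.time.in.hours),
    "High Absenteeism",
    "Low Absenteeism"))
```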
Breaking Down the Code
- Using `mutate` to create a new column: the `mutate` function is applied to the `Absenteeism.by.employee` data frame to create a new column named `High.absenteeism`.
- Categorizing based on absenteeism time: the `ifelse` function evaluates each employee’s `Absenteeism.time.in.hours`. If an employee’s absenteeism time is greater than or equal to the median absenteeism time across the entire data frame, they are categorized as “High Absenteeism”; if it is below the median, they are categorized as “Low Absenteeism”.
- Calculating the median: `median(Absenteeism.time.in.hours)` computes the median of the `Absenteeism.time.in.hours` column for the entire data frame. This median serves as the threshold to categorize employees.
This operation effectively splits the dataset into two groups, allowing for further analysis or reporting on patterns or trends related to absenteeism levels among employees. It provides a straightforward method for identifying employees with relatively high or low absenteeism, which can be useful for human resource management and policy-making.
6.2.10 Final Touches
Character variables are converted to factors to facilitate analysis.
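A minimal sketch; base R’s `is.character` is used here in place of the equivalent `is_character` helper mentioned below, so the chunk runs with only `dplyr` and `forcats` loaded:

```r
# Convert every character column to a factor for modeling and plotting.
Absenteeism.by.employee <- Absenteeism.by.employee |>
  mutate(across(where(is.character), as_factor))
```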
Breaking Down the Code
- The `mutate` function: modifies or creates columns within a data frame. In this context, it applies a transformation across multiple columns.
- The `across` function: applies a specified function or transformation across selected columns of a data frame; columns can be targeted by conditions such as data type or column name.
- The `where` helper: used within `across` to specify which columns should be affected by the transformation. Here, `where(is_character)` selects all columns whose data type is character.
- The `as_factor` function: applied to each selected column to convert character data into factors. Factors in R are data structures used to categorize and store categorical data, and they are beneficial for statistical modeling and visualizations.
By converting character columns to factors, the data is prepared for analysis methods that require categorical variables to be in factor form, such as many types of statistical modeling. This step is often important in data preprocessing, especially when working with categorical data in R.
6.3 Examining the Final Data Frame
The final data frame is then displayed, offering a clear view of the structured and cleaned dataset ready for analysis.
| Number.of.absence | Absenteeism.time.in.hours | Most.common.reason.for.absence | Month.of.absence | Day.of.the.week | Body.mass.index | Age | Social.smoker | Social.drinker | Pet | Education | Children | College | High.absenteeism |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.64 | 8.64 | 22 | Aug | Tue | 29 | 37 | Non-smoker | Non-drinker | Pet(s) | postgraduate | Parent | college | High Absenteeism |
| 0.5 | 2.08 | 0 | Aug | Tue | 33 | 48 | Smoker | Non-drinker | Pet(s) | high school | Parent | high school | Low Absenteeism |
| 6.28 | 26.8 | 27 | Feb | Thu | 31 | 38 | Non-smoker | Social drinker | No Pet(s) | high school | Non-parent | high school | High Absenteeism |
| 0.0769 | 0 | 0 | Dec | Wed | 34 | 40 | Non-smoker | Social drinker | Pet(s) | high school | Parent | high school | Low Absenteeism |
| 1.46 | 8 | 26 | Sep | Tue | 38 | 43 | Non-smoker | Social drinker | No Pet(s) | high school | Parent | high school | High Absenteeism |
| 0.615 | 5.54 | 22 | Feb | Fri | 25 | 33 | Non-smoker | Non-drinker | Pet(s) | high school | Parent | high school | High Absenteeism |
6.4 Saving the Processed Data
Finally, the processed data frame is saved to a CSV file, making it accessible for further analysis or sharing.
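As described in the breakdown below, the call is:

```r
# Write the cleaned data frame to data/Absenteeism_by_employee.csv,
# dropping R's automatic row names.
write.csv(Absenteeism.by.employee,
          file = "data/Absenteeism_by_employee.csv",
          row.names = FALSE)
```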
Breaking Down the Code
- The `write.csv` function: an R function that writes a data frame to a CSV (comma-separated values) file. It is commonly used to export data from R for use in other software or for sharing with others.
- First argument (`Absenteeism.by.employee`): specifies the data frame to write to the CSV file; here, the processed `Absenteeism.by.employee` data frame containing the cleaned absenteeism data.
- The `file` argument: specifies the path and name of the CSV file to be created. Here, the file is named `Absenteeism_by_employee.csv` and saved in a directory named `data`, which should exist in your current working directory (or you should have permission to create it).
- `row.names = FALSE`: indicates that row names from the data frame should not be included in the CSV file. This is usually desirable because R’s automatically generated row names carry no meaningful information for the dataset.
The end result is a CSV file named Absenteeism_by_employee.csv in the data folder, containing all the data from the Absenteeism.by.employee data frame, ready for use outside of R, sharing with others, or for archival purposes.
6.5 Conclusion
This case study demonstrates the process of loading, cleaning, and preparing a dataset for analysis. Through a series of transformations, recoding, and summarizations, the Absenteeism at Work Data Set is refined into a structured format conducive to business intelligence applications, providing a foundational basis for insightful data analysis.
6.6 Homework Assignment: Cleaning Data
6.6.1 Objective
The objective of this case study assignment is to develop hands-on experience in data manipulation, analysis, and visualization using agricultural datasets. You will practice working with real-world data, applying techniques in data cleaning, merging, and aggregation.
6.6.2 The Data
The datasets used in this assignment are provided by the agridat package, which contains historical agricultural data from the United States Department of Agriculture, National Agricultural Statistics Service. These datasets encompass yields and acres harvested for major crops across the U.S. states, spanning from approximately 1900 to 2011. The crops included are barley, corn, cotton, hay, rice, sorghum, soybeans, and wheat.
Each dataset is structured with the following variables:
- year: The year of the data record.
- state: The U.S. state to which the data pertains, treated as a categorical factor.
- acres: The total acres harvested for the crop in question.
- yield: The average yield per acre for the crop.
It’s important to note the units of measurement for yield differ by crop type: bushels per acre for barley, corn, sorghum, soybeans, and wheat; pounds per acre for cotton and rice; and tons per acre for hay.
A caution is advised when interpreting yield values, especially for states with small acres harvested, as these figures might not represent the broader agricultural productivity accurately. We will focus on corn and barley.
6.6.3 Load the Data
Load the historical crop yield datasets for barley and corn, which are provided by the United States Department of Agriculture’s National Agricultural Statistics Service, through the agridat package in R. Use the tail function to display the last few rows of the datasets for barley and corn. This will help you get a sense of the data’s structure and the values it contains.
6.6.4 Define/Document the Variables and Datasets
Provide a brief description of the datasets and the variables they contain.
6.6.5 Join Data
To prepare the datasets for merging and ensure a smooth analysis process, it’s essential to first address potential naming conflicts among key variables. This is achieved by renaming the acres and yield columns in each crop dataset, appending the crop name as a prefix to these variables. This step, accomplished using the dplyr::rename function, helps in clearly distinguishing between the variables from different datasets when they are combined. Following the renaming, the datasets are merged into a single data frame through a full join operation. This merge is conducted on the common variables state and year, effectively combining the corn and barley datasets into one comprehensive data frame for analysis. This process not only facilitates a unified analysis but also exemplifies the method of combining datasets based on shared variables.
6.6.6 Add State Metadata
Integrating state metadata into your analysis involves creating a data frame that contains state names, their regions, and the area of each state, using the available built-in R data. After assembling this data frame, proceed to summarize the area by region to provide a clear overview of the geographical distribution within your dataset.
6.6.7 Finalize the state data frame
Start by narrowing down your dataset to include only records from the year 1900 onwards, ensuring that your analysis focuses on more recent agricultural trends. Next, undertake a state-level aggregation by categorizing the data by both state and decade. For each state-decade pair, calculate the mean of all numeric variables, which will facilitate the examination of trends over time across different states. To add further depth to your analysis, merge this aggregated data with the state metadata using a left_join operation. This integration will introduce region information to your dataset, enabling a more nuanced analysis of agricultural trends not just across time and states, but also across different regions. This comprehensive approach will provide a multi-dimensional view of the data, allowing for a richer understanding of agricultural developments over the past century.
6.6.8 Finalize a regional data frame
Create a regional dataset by first joining your filtered data with state metadata to include region details for each record. Then, group this dataset by year and region, calculating the average yield and total acres harvested for each group. Next, add a decade variable by grouping years into their respective decades, and re-group the data by this decade and region, removing the year variable to focus on these broader categories. Finally, summarize this data to calculate the mean for all variables within each decade-region grouping, producing the region.grain dataset.
6.6.9 Plot the Yield by Region
Plot the average regional yield for corn and barley.
6.6.9.1 Transforming the Data for Visualization
First, prepare your dataset for visualization by converting it into a long format, which is more amenable to plotting with tools like ggplot2 in R. Utilize the pivot_longer function to achieve this transformation, ensuring that each grain type and its associated metric (yield or acres) are delineated into separate columns. This step is crucial for accurately representing the diverse aspects of your dataset in the visualizations, as in the sketch below.
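For instance, assuming the merged columns carry crop-name prefixes from the renaming step in section 6.6.5 (the exact names depend on your own code), a hypothetical reshaping could look like:

```r
library(tidyr)

# Hypothetical column names; adjust to match your renaming step.
region.grain.long <- region.grain |>
  pivot_longer(cols = c(corn.yield, corn.acres, barley.yield, barley.acres),
               names_to = c("grain", ".value"),
               names_sep = "\\.")
```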
6.6.9.2 Creating the Plot
Once your data is in the correct format, focus on visualizing yield trends, specifically. Filter the dataset to include only records from 1950 onwards, concentrating on the ‘yield’ metric to maintain clarity and relevance in your analysis. Employ ggplot2 to craft a line plot that illustrates these yield trends across different regions and over the decades. To enhance the plot’s readability and comparative value, use color coding to differentiate between regions and faceting to separate the data by grain type.