Homework: Cleaning Agricultural Data
6.5.2 Objective
The objective of this assignment is to develop hands-on experience in data manipulation, cleaning, merging, and visualization using agricultural datasets. You will apply the techniques learned in Chapters 5 and 6 to a different domain.
6.5.3 The Data
The datasets used in this assignment are provided by the agridat package, which contains historical agricultural data from the United States Department of Agriculture, National Agricultural Statistics Service. These datasets encompass yields and acres harvested for major crops across U.S. states, spanning from approximately 1900 to 2011. We will focus on corn and barley.
Each dataset is structured with the following variables:
- year: The year of the data record.
- state: The U.S. state to which the data pertains.
- acres: The total acres harvested for the crop.
- yield: The average yield per acre (bushels per acre for barley and corn).
6.5.4 Step 1: Load the Data
Load the historical crop yield datasets for barley and corn through the agridat package. Use the tail() function to display the last few rows of each dataset.
- Use pacman::p_load() to load the tidyverse and agridat packages.
- Access the datasets using data("nass.corn") and data("nass.barley").
- Use tail() to view the last few rows of both datasets.
6.5.5 Step 2: Define the Variables
Provide a brief description of the datasets and the variables they contain. Summarize the types of crops included, the range of years covered, and the metrics provided.
6.5.6 Step 3: Join the Datasets
Rename the acres and yield columns in each dataset to avoid conflicts, then merge them:
- Rename variables in the barley dataset: acres becomes barley.acres and yield becomes barley.yield.
- Apply the same renaming to the corn dataset (corn.acres, corn.yield).
- Perform a full join on state and year: use full_join() with by = c("state", "year").
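The rename-then-join pattern can be sketched on a small example. The two tibbles below are made-up stand-ins for nass.barley and nass.corn (the names barley.toy, corn.toy, and grain.toy and all values are hypothetical, not taken from agridat); only the column layout matches the real data.

```r
library(dplyr)

# Hypothetical stand-ins for nass.barley and nass.corn (invented values)
barley.toy <- tibble(
  year  = c(2010, 2011, 2010),
  state = c("Idaho", "Idaho", "Montana"),
  acres = c(470000, 500000, 620000),
  yield = c(92, 93, 56)
)
corn.toy <- tibble(
  year  = c(2010, 2011, 2010),
  state = c("Idaho", "Idaho", "Iowa"),
  acres = c(110000, 120000, 13050000),
  yield = c(180, 185, 165)
)

# Prefix the measurement columns so they do not collide after the join
barley.renamed <- barley.toy |>
  rename(barley.acres = acres, barley.yield = yield)
corn.renamed <- corn.toy |>
  rename(corn.acres = acres, corn.yield = yield)

# full_join() keeps every state-year found in either table;
# a crop that was not recorded for a state-year shows up as NA
grain.toy <- full_join(barley.renamed, corn.renamed, by = c("state", "year"))
grain.toy
```

A full join is used here rather than an inner join so that state-years with only one of the two crops are kept; the missing crop's columns are filled with NA.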
6.5.7 Step 4: Add State Metadata
Create a data frame with state names, regions, and areas using R’s built-in variables, then summarize area by region:
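One possible way to do this, as a sketch rather than the required solution, uses the state.name, state.region, and state.area vectors that ship with base R. The object names state.info and region.area are assumptions chosen for illustration.

```r
library(dplyr)

# state.name, state.region, and state.area are built into R (datasets package);
# state.area is given in square miles
state.info <- tibble(
  state  = state.name,
  region = state.region,
  area   = state.area
)

# Total area per region
region.area <- state.info |>
  group_by(region) |>
  summarise(area = sum(area))

region.area
```

Because state.name holds full state names, the state column should line up with the state column in the crop data when the two are joined later.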
6.5.8 Step 5: Create a State-Level Summary
- Filter the joined dataset to include only records from 1900 onwards.
- Add a decade variable: mutate(decade = trunc(year/10)*10).
- Remove the year variable.
- Group by decade and state, then calculate the mean for all numeric variables using summarise(across(everything(), mean)).
- Left join with the state metadata to add region information.
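The steps above can be sketched on a toy table. Everything below is illustrative: grain.toy, state.info, and state.grain are hypothetical names and the numbers are invented. The sketch also passes na.rm = TRUE to mean(), which is an assumption beyond the instruction above; without it, any NA introduced by the full join would turn the whole decade mean into NA.

```r
library(dplyr)

# Hypothetical joined data in the shape produced by Step 3 (invented values)
grain.toy <- tibble(
  year         = c(1895, 1901, 1909, 1912),
  state        = rep("Iowa", 4),
  barley.acres = c(100, 200, 400, 300),
  barley.yield = c(20, 22, 26, 30),
  corn.acres   = c(1000, 2000, 4000, 3000),
  corn.yield   = c(30, 32, NA, 40)
)

# State metadata from R's built-in vectors
state.info <- tibble(state = state.name, region = state.region)

state.grain <- grain.toy |>
  filter(year >= 1900) |>                      # drop records before 1900
  mutate(decade = trunc(year / 10) * 10) |>    # 1901 -> 1900, 1912 -> 1910
  select(-year) |>
  group_by(decade, state) |>
  summarise(
    across(everything(), \(x) mean(x, na.rm = TRUE)),  # na.rm is an added assumption
    .groups = "drop"
  ) |>
  left_join(state.info, by = "state")

state.grain
```

Inside a grouped summarise(), everything() skips the grouping columns, so only the acres and yield columns are averaged.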
6.5.9 Step 6: Create a Regional Summary
- Group your data by decade and region.
- Calculate mean yields and total acres:
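A minimal sketch of one way to do this, assuming the state-level summary has the columns created in the earlier steps. The input table state.grain and its values are invented for illustration; region.grain is the name the plotting code in Step 7 expects.

```r
library(dplyr)

# Hypothetical state-level summary in the shape produced by Step 5 (invented values)
state.grain <- tibble(
  decade       = c(1950, 1950, 1950, 1960),
  state        = c("Iowa", "Illinois", "Idaho", "Iowa"),
  region       = c("North Central", "North Central", "West", "North Central"),
  barley.acres = c(10, 30, 500, 20),
  barley.yield = c(30, 34, 40, 36),
  corn.acres   = c(1000, 900, 50, 1100),
  corn.yield   = c(50, 54, 60, 70)
)

region.grain <- state.grain |>
  group_by(decade, region) |>
  summarise(
    across(ends_with(".yield"), \(x) mean(x, na.rm = TRUE)),  # mean yield across states
    across(ends_with(".acres"), \(x) sum(x, na.rm = TRUE)),   # total acres across states
    .groups = "drop"
  )

region.grain
```

Note that the yield here is an unweighted mean of state-level means, so a state with few acres counts as much as a state with many. Keeping the crop.metric column names intact matters, because Step 7 splits them on the dot.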
6.5.10 Step 7: Visualize Yield Trends
Transform the regional data to long format using pivot_longer() and separate(), then create a line plot of yield trends by region:
region.grain.long <- region.grain |>
  pivot_longer(cols = -c("decade", "region")) |>
  separate(name, into = c("grain", "metric"), sep = "\\.")

region.grain.long |>
  filter(decade >= 1950 & metric == "yield") |>
  ggplot(aes(x = decade, y = value, color = region)) +
  geom_line() +
  facet_wrap(~grain, ncol = 2) +
  labs(y = "Yield", color = "") +
  theme_minimal()

Submission: Render your Quarto document to Word format and submit the rendered file. See the Quarto appendix for detailed instructions on creating and rendering Quarto documents.