Homework: Cleaning Agricultural Data

6.5.2 Objective

The objective of this assignment is to develop hands-on experience in data manipulation, cleaning, merging, and visualization using agricultural datasets. You will apply the techniques learned in Chapters 5 and 6 to a different domain.

6.5.3 The Data

The datasets used in this assignment are provided by the agridat package, which includes, among many other datasets, historical agricultural data from the United States Department of Agriculture, National Agricultural Statistics Service. These datasets encompass yields and acres harvested for major crops across U.S. states, spanning from approximately 1900 to 2011. We will focus on corn and barley.

Each dataset is structured with the following variables:

year: The year of the data record.
state: The U.S. state to which the data pertains.
acres: The total acres harvested for the crop.
yield: The average yield per acre (bushels per acre for barley and corn).

6.5.4 Step 1: Load the Data

Load the historical crop yield datasets for barley and corn through the agridat package. Use the tail() function to display the last few rows of each dataset.

Use pacman::p_load() to load the tidyverse and agridat packages.
Access the datasets using data("nass.corn") and data("nass.barley").
Use tail() to view the last few rows of both datasets.
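
The three items above can be sketched as follows. This is a minimal example that assumes the pacman package is already installed; p_load() then installs and attaches the other packages if they are missing.

```r
# Assumes pacman itself is installed: install.packages("pacman")
pacman::p_load(tidyverse, agridat)

# Load both datasets into the workspace
data("nass.corn")
data("nass.barley")

# Show the last six rows of each dataset
tail(nass.corn)
tail(nass.barley)
```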

6.5.5 Step 2: Define the Variables

Provide a brief description of the datasets and the variables they contain. Summarize the types of crops included, the range of years covered, and the metrics provided.
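
One possible way to gather the facts for this description is sketched below. The data frame toy is a hypothetical stand-in with invented values, used only so the example runs without agridat; the same calls should work on nass.corn and nass.barley.

```r
# Toy stand-in with the same four columns as nass.corn / nass.barley.
# The values are invented for illustration only.
toy <- data.frame(
  year  = c(1990, 1991, 1990, 1991),
  state = factor(c("Iowa", "Iowa", "Ohio", "Ohio")),
  acres = c(100, 110, 50, 55),
  yield = c(120, 125, 110, 112)
)

str(toy)                           # column names and types
range(toy$year)                    # first and last year covered
length(unique(toy$state))          # number of distinct states
summary(toy[c("acres", "yield")])  # spread of the two metrics
```

Replacing toy with one of the real datasets gives the year range, state count, and metric summaries to report in your description.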

6.5.6 Step 3: Join the Datasets

Rename the acres and yield columns in each dataset to avoid conflicts, then merge them:

Rename variables in the barley dataset: acres becomes barley.acres and yield becomes barley.yield.

barley <- nass.barley |>
  dplyr::rename(barley.acres = acres,
                barley.yield = yield)

Apply the same renaming to the corn dataset (corn.acres, corn.yield).
Perform a full join on state and year: use full_join() with by = c("state", "year").
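
To see what the full join does before running it on the real data, here is a small sketch. The data frames barley.toy, corn.toy, and grain.toy are hypothetical names with invented values; only the column naming follows the renaming scheme above.

```r
library(dplyr)

# Toy stand-ins for the renamed datasets; values are invented.
barley.toy <- data.frame(
  year = c(2000, 2001),
  state = c("Idaho", "Idaho"),
  barley.acres = c(700, 680),
  barley.yield = c(75, 76)
)
corn.toy <- data.frame(
  year = c(2001, 2002),
  state = c("Idaho", "Idaho"),
  corn.acres = c(50, 55),
  corn.yield = c(150, 155)
)

# full_join keeps every state-year found in either table;
# columns with no match on one side are filled with NA.
grain.toy <- full_join(barley.toy, corn.toy, by = c("state", "year"))
grain.toy
```

Only 2001 appears in both toy tables, so the 2000 row has NA in the corn columns and the 2002 row has NA in the barley columns. Expect the same pattern in the real data wherever a state grew only one of the two crops in a given year.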

6.5.7 Step 4: Add State Metadata

Create a data frame of state names, regions, and areas (in square miles) from R’s built-in state.name, state.region, and state.area vectors, then total the area by region:

state.data <- data.frame(
  state = state.name,
  region = state.region,
  area = state.area
)

region.data <- state.data |>
  group_by(region) |>
  summarise(area = sum(area))

6.5.8 Step 5: Create a State-Level Summary

Filter the joined dataset to include only records from 1900 onwards.
Add a decade variable: mutate(decade = trunc(year/10)*10).
Remove the year variable.
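Group by decade and state, then calculate the mean of the remaining variables (all numeric at this point) using summarise(across(everything(), mean)). Because the full join introduces missing values, any state-decade group containing an NA returns NA unless na.rm = TRUE is passed to mean().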
Left join with the state metadata to add region information.
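
The five steps above can be chained into one pipeline. The sketch below runs on a toy data frame with invented values; grain.toy, state.toy, and state.summary.toy are hypothetical names, and your own objects from Steps 3 and 4 would take their place.

```r
library(dplyr)

# Toy joined data and region lookup; values are invented for illustration.
grain.toy <- data.frame(
  year = c(1895, 1901, 1909, 1912),
  state = "Iowa",
  corn.acres = c(90, 100, 120, 130),
  corn.yield = c(28, 30, 34, 36)
)
state.toy <- data.frame(state = "Iowa", region = "North Central")

state.summary.toy <- grain.toy |>
  filter(year >= 1900) |>                    # drop records before 1900
  mutate(decade = trunc(year / 10) * 10) |>  # 1901 and 1909 -> 1900, 1912 -> 1910
  select(-year) |>                           # year is no longer needed
  group_by(decade, state) |>
  summarise(across(everything(), mean), .groups = "drop") |>
  left_join(state.toy, by = "state")         # attach region information
state.summary.toy
```

The 1895 record is filtered out, the two records from the 1900s are averaged into one row, and the 1912 record forms its own decade, so the result has two rows.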

6.5.9 Step 6: Create a Regional Summary

Group your data by decade and region.
Calculate mean yields and total acres:

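# state.summary is a placeholder name for the data frame you built in Step 5
region.grain <- state.summary |>
  group_by(decade, region) |>
  summarise(across(contains("yield"), ~ mean(.x, na.rm = TRUE)),  # mean yield
            across(contains("acres"), ~ sum(.x, na.rm = TRUE)),   # total acres
            .groups = "drop")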

6.5.10 Step 7: Visualize Yield Trends

Transform the regional data to long format using pivot_longer() and separate(), then create a line plot of yield trends by region:

region.grain.long <- region.grain |>
  pivot_longer(cols = -c("decade", "region")) |>
  separate(name, into = c("grain", "metric"), sep = "\\.")

region.grain.long |>
  filter(decade >= 1950 & metric == "yield") |>
  ggplot(aes(x = decade, y = value, color = region)) +
  geom_line() +
  facet_wrap(~grain, ncol = 2) +
  labs(y = "Yield", color = "") +
  theme_minimal()

Submission: Render your Quarto document to Word format and submit the rendered file. See the Quarto appendix for detailed instructions on creating and rendering Quarto documents.