Chapter 10 Data Wrangling with dplyr

10.0.1 Introduction

This chapter focuses on the practical applications of the dplyr package, a cornerstone for data manipulation within the R programming language. We will explore various functions of dplyr through the lens of the Online Retail II dataset provided by Chen, Daqing (2019). This dataset, comprising a rich array of online retail transactions, serves as an excellent basis for demonstrating the capabilities of dplyr. The dataset can be accessed here.

10.0.2 Select

The select function is your gateway to simplifying datasets by choosing only the columns that are necessary for your analysis. This function is integral for focusing on the variables of interest without getting lost in an ocean of data.

In the following example, we utilize select to extract just the Invoice and StockCode columns from the retail_data_small dataset. This operation allows us to narrow down the data to these two critical identifiers, providing a clearer view of the transactions without the clutter of unnecessary information.

selected_data <- retail_data_small %>%
  select(Invoice, StockCode)

head(selected_data)

## # A tibble: 6 × 2
##   Invoice StockCode
##   <chr>   <chr>    
## 1 511133  21080    
## 2 499422  21110    
## 3 532319  20829    
## 4 516314  22327    
## 5 515012  21397    
## 6 C490513 85043

10.0.3 Filter

The filter function is a powerful tool for subsetting data based on specific conditions. It allows you to hone in on the rows that meet your criteria, thus enabling targeted analysis.

Here, we demonstrate the use of filter to extract rows from our dataset where the quantity of items purchased exceeds 10. This operation is particularly useful when interested in larger transactions or when trying to identify bulk purchases within the dataset.

filtered_data <- retail_data_small %>%
  filter(Quantity > 10)

head(filtered_data)

## # A tibble: 6 × 8
##   Invoice StockCode Description Quantity InvoiceDate         Price `Customer ID`
##   <chr>   <chr>     <chr>          <dbl> <dttm>              <dbl>         <dbl>
## 1 515012  21397     BLUE  SPOT…       12 2010-07-07 16:14:00  1.25         15581
## 2 527445  22189     CREAM HEAR…       12 2010-10-18 09:33:00  3.95         14741
## 3 507176  22045     SPACEBOY G…       25 2010-05-06 15:05:00  0.42         17069
## 4 533085  11001     ASSTD DESI…       16 2010-11-16 09:56:00  1.69         17994
## 5 528325  22812     PACK 3 BOX…       12 2010-10-21 13:02:00  1.95         15680
## 6 499506  21673     WHITE SPOT…       12 2010-02-28 16:00:00  1.25         15031
## # ℹ 1 more variable: Country <chr>

10.0.4 Mutate

mutate is a versatile function that allows you to create or transform variables within your dataset. It’s particularly useful when you need to derive new metrics or modify existing ones for further analysis.

In this snippet, we use mutate to create a new variable, TotalPrice, which is the product of Quantity and UnitPrice in the retail_data_small dataset. This new variable represents the total sales value of each transaction, providing a crucial metric for financial analysis and insights into customer behavior.

mutated_data <- retail_data_small %>%
  mutate(TotalPrice = Quantity * Price)

head(mutated_data)

## # A tibble: 6 × 9
##   Invoice StockCode Description Quantity InvoiceDate         Price `Customer ID`
##   <chr>   <chr>     <chr>          <dbl> <dttm>              <dbl>         <dbl>
## 1 511133  21080     SET/20 RED…        4 2010-06-07 11:25:00  0.85         17227
## 2 499422  21110     LARGE CAKE…        1 2010-02-26 16:33:00 13.6             NA
## 3 532319  20829     GLITTER HA…        3 2010-11-11 15:09:00  2.1          17759
## 4 516314  22327     ROUND SNAC…        6 2010-07-19 14:31:00  2.95         12826
## 5 515012  21397     BLUE  SPOT…       12 2010-07-07 16:14:00  1.25         15581
## 6 C490513 85043     RED HEART …       -1 2009-12-06 14:58:00  4.95         17610
## # ℹ 2 more variables: Country <chr>, TotalPrice <dbl>

10.0.5 Summarize

Summarizing data is essential for understanding the overarching trends and metrics within your dataset. The summarize function in dplyr assists in calculating these summary statistics efficiently.

In this example, we calculate the total quantity, mean quantity, and mean unit price for items in our dataset. These summary statistics provide a quick snapshot of the data, offering insights into average purchase sizes, pricing, and total sales volume.

summary_data <- mutated_data %>%
  summarize(MeanQuantity = mean(Quantity), 
            MeanUnitPrice = mean(Price), 
            TotalSales = sum(TotalPrice))

knitr::kable(summary_data)

MeanQuantity	MeanUnitPrice	TotalSales
10.09318	5.758615	887393.4

10.0.6 Group_by

Segmenting data into groups based on certain criteria is a common task in data analysis. The group_by function in dplyr makes this task straightforward by allowing you to define the grouping variables.

In the following example, we group retail_data_small by Country. We then calculate the mean quantity, mean unit price, and total sales for each country. This operation provides a country-level view of the data, revealing patterns and insights that are specific to each geographical segment.

grouped_data <- mutated_data %>%
  group_by(Country) %>%
  summarize(MeanQuantity = mean(Quantity), 
            MeanUnitPrice = mean(Price), 
            TotalSales = sum(TotalPrice))

head(grouped_data)

## # A tibble: 6 × 4
##   Country   MeanQuantity MeanUnitPrice TotalSales
##   <chr>            <dbl>         <dbl>      <dbl>
## 1 Australia        23.7          10.3      1632. 
## 2 Austria          10.6           3.64     1203. 
## 3 Bahrain           9.2           3.04      113. 
## 4 Belgium           9.35          5.74     2429. 
## 5 Bermuda          19.5           1.90       77.7
## 6 Brazil            5.56          1.34       38.6

10.0.7 Pivot_longer

The pivot_longer function is a transformative tool that converts data from a wide format to a long format. This reformatting is often necessary for certain types of analysis and visualization.

In this part, we demonstrate how to convert a hypothetical wide format data into a long format. By melting columns starting with “year” into two new columns, Year and Value, we prepare the dataset for analyses that require a long format structure.

long_data <- grouped_data %>%
  pivot_longer(cols = -Country,
               names_to = "Variable",
               values_to = "Value")

head(long_data)

## # A tibble: 6 × 3
##   Country   Variable        Value
##   <chr>     <chr>           <dbl>
## 1 Australia MeanQuantity    23.7 
## 2 Australia MeanUnitPrice   10.3 
## 3 Australia TotalSales    1632.  
## 4 Austria   MeanQuantity    10.6 
## 5 Austria   MeanUnitPrice    3.64
## 6 Austria   TotalSales    1203.

10.0.8 Pivot_wider

Converting data from a long format back to a wide format is just as crucial for certain analyses and reporting needs. The pivot_wider function in dplyr facilitates this reverse transformation.

Here, we take the previously created long format data and spread it back into a wide format. By spreading the Year and Value columns, we reorganize the data into a structure that might be more familiar or useful for certain types of analysis or visualization.

wide_data <- long_data %>%
  pivot_wider(names_from = Variable,
              values_from = Value)

head(wide_data)

## # A tibble: 6 × 4
##   Country   MeanQuantity MeanUnitPrice TotalSales
##   <chr>            <dbl>         <dbl>      <dbl>
## 1 Australia        23.7          10.3      1632. 
## 2 Austria          10.6           3.64     1203. 
## 3 Bahrain           9.2           3.04      113. 
## 4 Belgium           9.35          5.74     2429. 
## 5 Bermuda          19.5           1.90       77.7
## 6 Brazil            5.56          1.34       38.6

10.0.9 Conclusion

This chapter has provided a practical exploration into the essential functions of the dplyr package for data manipulation and summarization. Through hands-on examples using the Online Retail II dataset, we’ve demonstrated how dplyr can simplify and expedite data wrangling tasks, making it an invaluable tool for any data analyst or R programmer.

10.0.10 References

Chen, Daqing. (2019). Online Retail II. UCI Machine Learning Repository. https://doi.org/10.24432/C5CG6D