Chapter 10 Data Wrangling with dplyr
10.0.1 Introduction
This chapter focuses on the practical applications of the dplyr package, a cornerstone for data manipulation within the R programming language. We will explore various functions of dplyr through the lens of the Online Retail II dataset provided by Chen, Daqing (2019). This dataset, comprising a rich array of online retail transactions, serves as an excellent basis for demonstrating the capabilities of dplyr. The dataset can be accessed here.
10.0.2 Select
The select function is your gateway to simplifying datasets by choosing only the columns that are necessary for your analysis. This function is integral for focusing on the variables of interest without getting lost in an ocean of data.
In the following example, we utilize select to extract just the Invoice and StockCode columns from the retail_data_small dataset. This operation allows us to narrow down the data to these two critical identifiers, providing a clearer view of the transactions without the clutter of unnecessary information.
## # A tibble: 6 × 2
## Invoice StockCode
## <chr> <chr>
## 1 511133 21080
## 2 499422 21110
## 3 532319 20829
## 4 516314 22327
## 5 515012 21397
## 6 C490513 85043
10.0.3 Filter
The filter function is a powerful tool for subsetting data based on specific conditions. It allows you to hone in on the rows that meet your criteria, thus enabling targeted analysis.
Here, we demonstrate the use of filter to extract rows from our dataset where the quantity of items purchased exceeds 10. This operation is particularly useful when interested in larger transactions or when trying to identify bulk purchases within the dataset.
## # A tibble: 6 × 8
## Invoice StockCode Description Quantity InvoiceDate Price `Customer ID`
## <chr> <chr> <chr> <dbl> <dttm> <dbl> <dbl>
## 1 515012 21397 BLUE SPOT… 12 2010-07-07 16:14:00 1.25 15581
## 2 527445 22189 CREAM HEAR… 12 2010-10-18 09:33:00 3.95 14741
## 3 507176 22045 SPACEBOY G… 25 2010-05-06 15:05:00 0.42 17069
## 4 533085 11001 ASSTD DESI… 16 2010-11-16 09:56:00 1.69 17994
## 5 528325 22812 PACK 3 BOX… 12 2010-10-21 13:02:00 1.95 15680
## 6 499506 21673 WHITE SPOT… 12 2010-02-28 16:00:00 1.25 15031
## # ℹ 1 more variable: Country <chr>
10.0.4 Mutate
mutate is a versatile function that allows you to create or transform variables within your dataset. It’s particularly useful when you need to derive new metrics or modify existing ones for further analysis.
In this snippet, we use mutate to create a new variable, TotalPrice, which is the product of Quantity and UnitPrice in the retail_data_small dataset. This new variable represents the total sales value of each transaction, providing a crucial metric for financial analysis and insights into customer behavior.
## # A tibble: 6 × 9
## Invoice StockCode Description Quantity InvoiceDate Price `Customer ID`
## <chr> <chr> <chr> <dbl> <dttm> <dbl> <dbl>
## 1 511133 21080 SET/20 RED… 4 2010-06-07 11:25:00 0.85 17227
## 2 499422 21110 LARGE CAKE… 1 2010-02-26 16:33:00 13.6 NA
## 3 532319 20829 GLITTER HA… 3 2010-11-11 15:09:00 2.1 17759
## 4 516314 22327 ROUND SNAC… 6 2010-07-19 14:31:00 2.95 12826
## 5 515012 21397 BLUE SPOT… 12 2010-07-07 16:14:00 1.25 15581
## 6 C490513 85043 RED HEART … -1 2009-12-06 14:58:00 4.95 17610
## # ℹ 2 more variables: Country <chr>, TotalPrice <dbl>
10.0.5 Summarize
Summarizing data is essential for understanding the overarching trends and metrics within your dataset. The summarize function in dplyr assists in calculating these summary statistics efficiently.
In this example, we calculate the total quantity, mean quantity, and mean unit price for items in our dataset. These summary statistics provide a quick snapshot of the data, offering insights into average purchase sizes, pricing, and total sales volume.
summary_data <- mutated_data %>%
summarize(MeanQuantity = mean(Quantity),
MeanUnitPrice = mean(Price),
TotalSales = sum(TotalPrice))
knitr::kable(summary_data)| MeanQuantity | MeanUnitPrice | TotalSales |
|---|---|---|
| 10.09318 | 5.758615 | 887393.4 |
10.0.6 Group_by
Segmenting data into groups based on certain criteria is a common task in data analysis. The group_by function in dplyr makes this task straightforward by allowing you to define the grouping variables.
In the following example, we group retail_data_small by Country. We then calculate the mean quantity, mean unit price, and total sales for each country. This operation provides a country-level view of the data, revealing patterns and insights that are specific to each geographical segment.
grouped_data <- mutated_data %>%
group_by(Country) %>%
summarize(MeanQuantity = mean(Quantity),
MeanUnitPrice = mean(Price),
TotalSales = sum(TotalPrice))
head(grouped_data)## # A tibble: 6 × 4
## Country MeanQuantity MeanUnitPrice TotalSales
## <chr> <dbl> <dbl> <dbl>
## 1 Australia 23.7 10.3 1632.
## 2 Austria 10.6 3.64 1203.
## 3 Bahrain 9.2 3.04 113.
## 4 Belgium 9.35 5.74 2429.
## 5 Bermuda 19.5 1.90 77.7
## 6 Brazil 5.56 1.34 38.6
10.0.7 Pivot_longer
The pivot_longer function is a transformative tool that converts data from a wide format to a long format. This reformatting is often necessary for certain types of analysis and visualization.
In this part, we demonstrate how to convert a hypothetical wide format data into a long format. By melting columns starting with “year” into two new columns, Year and Value, we prepare the dataset for analyses that require a long format structure.
long_data <- grouped_data %>%
pivot_longer(cols = -Country,
names_to = "Variable",
values_to = "Value")
head(long_data)## # A tibble: 6 × 3
## Country Variable Value
## <chr> <chr> <dbl>
## 1 Australia MeanQuantity 23.7
## 2 Australia MeanUnitPrice 10.3
## 3 Australia TotalSales 1632.
## 4 Austria MeanQuantity 10.6
## 5 Austria MeanUnitPrice 3.64
## 6 Austria TotalSales 1203.
10.0.8 Pivot_wider
Converting data from a long format back to a wide format is just as crucial for certain analyses and reporting needs. The pivot_wider function in dplyr facilitates this reverse transformation.
Here, we take the previously created long format data and spread it back into a wide format. By spreading the Year and Value columns, we reorganize the data into a structure that might be more familiar or useful for certain types of analysis or visualization.
## # A tibble: 6 × 4
## Country MeanQuantity MeanUnitPrice TotalSales
## <chr> <dbl> <dbl> <dbl>
## 1 Australia 23.7 10.3 1632.
## 2 Austria 10.6 3.64 1203.
## 3 Bahrain 9.2 3.04 113.
## 4 Belgium 9.35 5.74 2429.
## 5 Bermuda 19.5 1.90 77.7
## 6 Brazil 5.56 1.34 38.6
10.0.9 Conclusion
This chapter has provided a practical exploration into the essential functions of the dplyr package for data manipulation and summarization. Through hands-on examples using the Online Retail II dataset, we’ve demonstrated how dplyr can simplify and expedite data wrangling tasks, making it an invaluable tool for any data analyst or R programmer.
10.0.10 References
Chen, Daqing. (2019). Online Retail II. UCI Machine Learning Repository. https://doi.org/10.24432/C5CG6D