Chapter 6 Analysis of Variance (ANOVA)

6.1 Case Study Examples

6.1.1 Test in E-Commerce Data (ANOVA)

Introduction

This case study examines the application of the Analysis of Variance (ANOVA) test using R on a real-world dataset. The dataset used in this case study is the Online Retail II data set available from the UCI Machine Learning Repository. It is a transactional data set containing all transactions for a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011.

The main question we want to answer in this case study is: “Is there a significant difference in the average transaction amount across different countries?” To answer this question, we will conduct an Analysis of Variance (ANOVA) test, a statistical method used to test differences between two or more means. It can be viewed as an extension of the t-test.

Objective

Conduct an analysis of variance (ANOVA) to answer the question: “Is there a significant difference in the average invoice total across different countries?” without utilizing the built-in aov() command.

Loading Necessary Packages

This case uses the dplyr package, a popular data-manipulation package in R. It provides a set of tools designed specifically for data frames, making it easy to transform and summarize your data, and it is part of the larger tidyverse collection of packages, which are designed to work together in data analysis. We also load openxlsx, which supplies read.xlsx() and convertToDate() for working with Excel files, along with readxl.

pacman::p_load(tidyverse, openxlsx, readxl)

Description of the Data

# Load the data
data <- read.xlsx("https://ljkelly3141.github.io/ABE-Book/data/online_retail_II.xlsx")

# Print first few rows of the dataset
head(data)
##   Invoice StockCode                         Description Quantity InvoiceDate
## 1  489434     85048 15CM CHRISTMAS GLASS BALL 20 LIGHTS       12    40148.32
## 2  489434    79323P                  PINK CHERRY LIGHTS       12    40148.32
## 3  489434    79323W                 WHITE CHERRY LIGHTS       12    40148.32
## 4  489434     22041        RECORD FRAME 7" SINGLE SIZE        48    40148.32
## 5  489434     21232      STRAWBERRY CERAMIC TRINKET BOX       24    40148.32
## 6  489434     22064          PINK DOUGHNUT TRINKET POT        24    40148.32
##   Price Customer.ID        Country
## 1  6.95       13085 United Kingdom
## 2  6.75       13085 United Kingdom
## 3  6.75       13085 United Kingdom
## 4  2.10       13085 United Kingdom
## 5  1.25       13085 United Kingdom
## 6  1.65       13085 United Kingdom

Here is the description of relevant variables in the data set:

Variable Description
Invoice Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction.
StockCode Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
Description Product (item) name. Nominal.
Quantity The quantity of each product (item) per transaction. Numeric.
InvoiceDate Invoice date and time. Numeric. The day and time when a transaction was generated (stored as an Excel serial number until converted).
Price Unit price. Numeric. Product price per unit in sterling (£).
Customer.ID Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
Country Country name. Nominal. The name of the country where a customer resides.

Data Cleaning

Before performing an ANOVA test, it’s essential to clean the data. Let’s break down each step to clean and then summarize the data:

Step 1: Remove missing values

The first part of the code chunk removes rows that have missing values in the ‘Customer.ID’ or ‘Description’ columns. This is achieved using the filter() function from the dplyr package.

data_clean <- data %>%
  filter(!is.na(Customer.ID)) %>%
  filter(!is.na(Description))

The %>% operator is a pipe operator that takes the output of the code on its left and uses it as input for the function on its right. The filter() function is used to subset rows in a data frame based on a condition. The is.na() function checks for missing values, and the ! operator negates the condition. So filter(!is.na(Customer.ID)) means “filter the data to include only the rows where ‘Customer.ID’ is not missing”.
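To see the pipe equivalence concretely, here is a minimal sketch on a hypothetical data frame (not the retail data) with one missing customer ID:

```r
library(dplyr)

# A hypothetical data frame with one missing customer ID
df <- data.frame(Customer.ID = c(13085, NA, 14688),
                 Quantity    = c(12, 6, 24))

# These two calls are equivalent: the pipe feeds df in as filter()'s first argument
a <- filter(df, !is.na(Customer.ID))
b <- df %>% filter(!is.na(Customer.ID))

identical(a, b)  # both keep only the two complete rows
```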

Step 2: Filter for positive quantities and prices

The second part of the code chunk filters the data to include only rows where ‘Quantity’ is greater than 0 and ‘Price’ is greater than 0.

data_clean <- data_clean %>%
  filter(Quantity > 0) %>%
  filter(Price > 0) 

This is again done using the filter() function. So filter(Quantity > 0) means “filter the data to include only the rows where ‘Quantity’ is greater than 0”.

Step 3: Calculate Total Amount of each Invoice

Here we will calculate the total amount of each invoice as follows:

data_final <- data_clean %>%
  mutate(TransTotal = Price * Quantity,
         InvoiceDate = convertToDate(InvoiceDate)) %>% 
  group_by(Invoice) %>% 
  mutate(InvoiceTotal = sum(TransTotal)) %>% 
  distinct(Invoice, .keep_all = TRUE) %>% 
  select(Invoice,InvoiceDate,InvoiceTotal,Country)

This is a bit much, so let’s break down the operations:

  1. Initial Data Preparation: The mutate() function is first employed on the data_clean dataset to:

    • Calculate a new variable, TransTotal, which is the product of Price and Quantity.
    • Convert InvoiceDate from an Excel serial number to a date using the convertToDate() function from the openxlsx package.
  2. Grouping: The dataset is then grouped by the Invoice column using group_by(Invoice). This means that any subsequent operations will be applied to each unique invoice separately.

  3. Summarizing Invoices: Within each unique invoice group, another mutate() function calculates the InvoiceTotal which is the sum of all the transaction totals (TransTotal) within that particular invoice.

  4. Filtering Unique Invoices: The distinct(Invoice, .keep_all = TRUE) function is applied to retain only unique invoices. The .keep_all = TRUE parameter ensures that all other columns (associated with that unique invoice) are retained.

  5. Selecting Relevant Columns: Finally, the select() function is used to retain only the columns Invoice, InvoiceDate, InvoiceTotal, and Country in the final dataset.

  6. Previewing the Data: Afterward, running head(data_final) displays the first six rows of the data_final dataset, providing a snapshot of the processed data.
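Steps 2 through 4 above can be illustrated on a toy two-invoice example (hypothetical values, not the retail data):

```r
library(dplyr)

# A toy two-invoice example illustrating group_by + mutate + distinct
toy <- data.frame(Invoice    = c("A", "A", "B"),
                  TransTotal = c(10, 5, 7))

toy_summary <- toy %>%
  group_by(Invoice) %>%
  mutate(InvoiceTotal = sum(TransTotal)) %>%  # 15 for both invoice A rows, 7 for B
  distinct(Invoice, .keep_all = TRUE)         # collapses to one row per invoice

toy_summary
```

After the collapse, each invoice appears exactly once, carrying its total and the first row's other column values.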

Performing the ANOVA Test Manually

In R, we can calculate the ANOVA manually. Here are the steps:

Step 1: Calculate the overall and country level means

Calculate the invoice count and average invoice total for each country.

country.summary <- data_final %>% 
  group_by(Country) %>% 
  summarise(CountryCount = n(),
            CountryMean = mean(InvoiceTotal)) %>% 
  arrange(desc(CountryCount))
country.summary
## # A tibble: 37 × 3
##    Country        CountryCount CountryMean
##    <chr>                 <int>       <dbl>
##  1 United Kingdom        17612        421.
##  2 Germany                 347        583.
##  3 EIRE                    316       1127.
##  4 France                  236        620.
##  5 Netherlands             135       1991.
##  6 Sweden                   68        782.
##  7 Spain                    66        721.
##  8 Belgium                  52        472.
##  9 Australia                40        786.
## 10 Portugal                 40        596.
## # ℹ 27 more rows

Filter countries with too few invoices.

data_final <- data_final %>% 
  group_by(Country) %>% 
  mutate(CountryCount = n()) %>% 
  filter(CountryCount>=50) %>% 
  ungroup()

country.summary <- country.summary %>% 
  filter(CountryCount>=50)

Calculate the overall mean.

overall_mean <- mean(data_final$InvoiceTotal)
overall_mean
## [1] 452.0799

Step 2: Total Sum of Squares

Calculate the total sum of squared differences from the overall mean, i.e. \(\sum_{i=1}^{N}{(Y_i-\bar{Y})^2}\).

SS.tot <- sum((data_final$InvoiceTotal-overall_mean)^2)
SS.tot
## [1] 15641663963

Step 3: Within-group sum of squares

Calculate the within-group sum of squares, i.e. \(SS_{w} = \sum_{k=1}^{G}{\sum_{i=1}^{N_k}{(Y_{ik}-\bar{Y_k})^2}}\). In R we proceed in two steps. First, we make a new variable that holds the group mean for each observation.

data_final <- data_final %>% 
  group_by(Country) %>% 
  mutate(CountryMean = mean(InvoiceTotal)) %>% 
  ungroup()

Then, we can simply sum the squared differences.

SS.w <- sum((data_final$InvoiceTotal - data_final$CountryMean)^2)
SS.w
## [1] 15136264656

Step 4: Between-group sum of squares

Calculate the between-group sum of squares, i.e. \(SS_{b} = \sum_{k=1}^{G}{\sum_{i=1}^{N_k}{(\bar{Y_k}-\bar{Y})^2}}\).

SS.b <- sum((data_final$CountryMean - overall_mean)^2)
SS.b
## [1] 505399306

Step 5: Degrees of freedom

Each of the sum-of-squares estimates has an associated degrees of freedom.

# Within-group
DF.w <- nrow(data_final)-nrow(country.summary)
DF.w
## [1] 18824
# Between-group
DF.b <- nrow(country.summary)-1
DF.b
## [1] 7

Step 6: Mean squares value

MS.w <- SS.w/DF.w
MS.w
## [1] 804094
MS.b <- SS.b/DF.b
MS.b
## [1] 72199901

Step 7: F-Stat

F.stat <- MS.b/MS.w
F.stat
## [1] 89.79038

Step 8: p-value

p.value <- pf( F.stat, df1 = DF.b, df2 = DF.w, lower.tail = FALSE)
p.value
## [1] 2.720518e-129
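Before moving on, a quick sanity check: the three sums of squares must satisfy the identity \(SS_{tot} = SS_w + SS_b\). A minimal sketch of the check on simulated data (not the retail set):

```r
# Sanity check of the sum-of-squares identity SS.tot = SS.w + SS.b
set.seed(1)
y <- c(rnorm(5, mean = 0), rnorm(5, mean = 2))  # two groups of five observations
g <- rep(c("a", "b"), each = 5)

grand  <- mean(y)
gmeans <- ave(y, g)            # each observation's group mean

SS.tot <- sum((y - grand)^2)
SS.w   <- sum((y - gmeans)^2)
SS.b   <- sum((gmeans - grand)^2)

all.equal(SS.tot, SS.w + SS.b)  # TRUE up to floating-point error
```

Running the same check on SS.tot, SS.w, and SS.b computed above is a good way to catch grouping mistakes.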

Conclusion

The one-way ANOVA showed a significant difference in invoice totals across countries (F(7, 18824) = 89.79, p < .001).

Save the data for the next case

write.csv(data_final, file = "docs/data/online_retail_final.csv")

6.1.2 Test in E-Commerce Data (Checking our assumptions)

In this case, we will continue to work with the Online Retail II data set available from the UCI Machine Learning Repository.

Load and Clean Data

Load and clean the data as we did in the previous case.

Objective

Conduct an analysis of variance (ANOVA) to answer the question: “Is there a significant difference in the average invoice total across different countries?” utilizing the built-in aov() command, and then check the validity of our conclusion.

Instructions

Step 1: Conduct basic ANOVA

NOTE: You should always begin with exploratory analysis, but since this case is a continuation of the previous case, we can rely on the exploratory analysis conducted there.

invoice.aov <- aov(InvoiceTotal~Country, data=data_final)
summary(invoice.aov)
##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## Country         7 5.054e+08 72199901   89.79 <2e-16 ***
## Residuals   18824 1.514e+10   804094                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 2: Checking the Homogeneity of Variance Assumption

Draw a boxplot of invoice total by Country.

data_final %>% ggplot(aes(x=Country,y=InvoiceTotal)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0,2000))
## Warning: Removed 409 rows containing non-finite
## values (`stat_boxplot()`).

Use the Levene test of Homogeneity of Variance

pacman::p_load("car")
leveneTest(invoice.aov)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     7  79.659 < 2.2e-16 ***
##       18824                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both the boxplot and the Levene test indicate that the Homogeneity of Variance assumption is violated; thus, we should run the Welch one-way test.

oneway.test(InvoiceTotal~Country, data=data_final)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  InvoiceTotal and Country
## F = 14.174, num df = 7.00, denom df = 313.66, p-value = 5.408e-16

The conclusion remains the same when we account for heteroscedasticity.

Step 3: Checking the Normality of Residuals

To evaluate the assumptions of our ANOVA model, one critical step is to determine whether the residuals are normally distributed. This assumption is essential as significant deviations can compromise the reliability of the model’s results.

# Displaying a histogram of the residuals
hist(invoice.aov$residuals)

The histogram provides a visual representation of the distribution of residuals. A bell-shaped curve suggests a normal distribution, while noticeable deviations from this shape indicate non-normality.

# Q-Q plot of the residuals
qqnorm(invoice.aov$residuals)
qqline(invoice.aov$residuals)  # Adding a reference line

The Q-Q plot (quantile-quantile plot) is another graphical method to assess normality. If the residuals are normally distributed, the points should fall roughly along the diagonal reference line.

# Shapiro-Wilk test for normality
shapiro.test(sample(invoice.aov$residuals,5000))
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(invoice.aov$residuals, 5000)
## W = 0.26078, p-value < 2.2e-16

The Shapiro-Wilk test is a formal statistical test for normality; note that R’s shapiro.test() accepts at most 5,000 observations, which is why we test a random sample of the residuals. A significant p-value (typically below a threshold such as 0.10, 0.05, or 0.01) suggests that the residuals do not follow a normal distribution.

All three of the above examinations indicate non-normality. As such, we should use the Kruskal-Wallis test, a rank-based alternative that does not assume normal residuals:

kruskal.test(InvoiceTotal~Country, data=data_final)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  InvoiceTotal by Country
## Kruskal-Wallis chi-squared = 349.68, df = 7, p-value < 2.2e-16

The conclusion remains the same.

6.2 Case Study Assignments

6.2.1 Analyzing Job Satisfaction Using ANOVA

Background:

As budding business analysts, understanding the nuances within Human Resource (HR) metrics is critical. HR analytics aids in deriving insights from data, which in turn informs decisions about process improvement. One dataset that can shed light on this is the “HR Analytics: Case Study” compiled by Bhanupratap Biswas in 2023.

In this assignment, you will dive deep into this dataset to discover whether job satisfaction varies based on certain categorical variables: BusinessTravel, Department, and JobRole.4

Objective:

Use ANOVA to determine if job satisfaction differs based on the categories BusinessTravel, Department, and JobRole.

Instructions:

Set Up:

Before beginning the analysis, it’s essential to have all the necessary packages installed and loaded. If you haven’t already, make sure to install and load the pacman package using if(!require(pacman)) install.packages("pacman"). Once you’ve confirmed that pacman is available, load the required packages with pacman::p_load(tidyverse, car, psych).

# Check to see if pacman is installed, and if not, install it.
if(!require(pacman)) install.packages("pacman")
# Load packages
pacman::p_load(tidyverse, car, psych)

Load the Data:

  1. Import the dataset using the following link:
hr_data <- read.csv("https://ljkelly3141.github.io/ABE-Book/data/Employee-Attrition.csv")

Data Cleaning:

  1. Select the variables BusinessTravel, Department, JobRole, and JobSatisfaction for your analysis.
  2. Identify and remove any observations with missing values.

To achieve this, use the following sample code:

# Selecting the required variables and storing in a new data frame
hr_clean <- hr_data %>% 
  select(BusinessTravel, Department, JobRole, JobSatisfaction) %>%
  drop_na()

After executing these steps, review the hr_clean data frame to ensure it’s prepared for the subsequent analysis.

Summary Statistics:

  1. Compute the summary statistics of JobSatisfaction grouped by each of the categorical variables: BusinessTravel, Department, and JobRole.
  2. Use the psych::describe() function to obtain comprehensive statistics.
  3. Present the results in a clear and organized manner.

Here is an example demonstrating one method of achieving this:

summary_by_travel <- hr_clean %>% 
  group_by(BusinessTravel) %>% 
  do(describe(.$JobSatisfaction,ranges = FALSE,skew = FALSE))

summary_by_travel
## # A tibble: 3 × 6
## # Groups:   BusinessTravel [3]
##   BusinessTravel     vars     n  mean    sd     se
##   <chr>             <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Non-Travel            1   150  2.79  1.03 0.0842
## 2 Travel_Frequently     1   277  2.79  1.10 0.0661
## 3 Travel_Rarely         1  1043  2.70  1.11 0.0345

The code above performs a series of operations on the hr_clean data frame with the aim of generating descriptive statistics for the JobSatisfaction variable, specifically grouped by the levels within BusinessTravel. Here’s a breakdown:

  1. The code starts with the hr_clean data frame.
  2. It then groups the data by the BusinessTravel variable using group_by(BusinessTravel). This ensures that any subsequent operations are performed separately for each unique value (or level) of BusinessTravel.
  3. The do() function is then applied to each group. Inside the do() function, the describe() function from the psych package is used to compute a variety of descriptive statistics for the JobSatisfaction variable within each group.
  4. Additionally, certain statistics are omitted from the output by setting their respective arguments to FALSE. In this case, the ranges (ranges = FALSE) and skewness (skew = FALSE) are excluded from the results.
  5. The final output, which is a summarized table of the desired statistics for JobSatisfaction categorized by BusinessTravel, is stored in the summary_by_travel object.
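If you prefer to stay within dplyr, the same grouped statistics can be produced with summarise() instead of do()/describe(). A minimal sketch on a hypothetical stand-in for hr_clean (toy values, for illustration only):

```r
library(dplyr)

# A hypothetical stand-in for hr_clean
hr_toy <- data.frame(
  BusinessTravel  = c("Non-Travel", "Non-Travel", "Travel_Rarely", "Travel_Rarely"),
  JobSatisfaction = c(3, 2, 4, 2)
)

travel_stats <- hr_toy %>%
  group_by(BusinessTravel) %>%
  summarise(n    = n(),
            mean = mean(JobSatisfaction),
            sd   = sd(JobSatisfaction),
            se   = sd / sqrt(n))   # summarise can reuse just-created columns

travel_stats
```

With the real data, replace hr_toy by hr_clean to reproduce the n, mean, sd, and se columns of the describe() output.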

Data Visualization:

Create visualizations that display job satisfaction across the different categories of BusinessTravel, Department, and JobRole. Consider using boxplots for this purpose, e.g.:

hr_clean %>%
  ggplot(aes(x = BusinessTravel, y = JobSatisfaction)) +
  geom_boxplot()

ANOVA Analysis:

Conduct ANOVA tests to determine if job satisfaction differs among the aforementioned categories.
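As a starting point, one of the three tests might look like the following sketch. It runs on a hypothetical stand-in for hr_clean (toy values); with the real data, replace hr_toy by hr_clean and repeat the formula with Department and JobRole.

```r
# Hedged sketch of one of the three requested ANOVA tests
hr_toy <- data.frame(
  BusinessTravel  = rep(c("Non-Travel", "Travel_Frequently", "Travel_Rarely"),
                        each = 5),
  JobSatisfaction = c(3, 2, 4, 2, 3,  4, 3, 2, 4, 3,  1, 2, 3, 2, 4)
)

travel.aov <- aov(JobSatisfaction ~ BusinessTravel, data = hr_toy)
summary(travel.aov)   # repeat with Department and JobRole on the real data
```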

Check Assumptions:

Before drawing any conclusions from your ANOVA results, validate the assumptions associated with the test. Specifically, check for:

  • Homogeneity of variance across groups
  • Normality of residuals for each ANOVA test

If assumptions are violated, determine the appropriate correction. Which model is most appropriate?
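Mirroring the worked case, the checks for one factor can be sketched as follows on hypothetical data. Note the worked case uses car::leveneTest(); bartlett.test() is swapped in here as a base-R analogue, though it is more sensitive to non-normality.

```r
# Assumption checks for one factor, sketched with base-R tests on toy data
hr_toy <- data.frame(
  BusinessTravel  = rep(c("Non-Travel", "Travel_Rarely"), each = 10),
  JobSatisfaction = c(3, 2, 4, 2, 3, 1, 4, 3, 2, 3,
                      4, 4, 3, 2, 4, 3, 4, 2, 3, 4)
)
travel.aov <- aov(JobSatisfaction ~ BusinessTravel, data = hr_toy)

bartlett.test(JobSatisfaction ~ BusinessTravel, data = hr_toy)  # equal variances?
shapiro.test(travel.aov$residuals)                              # normal residuals?
kruskal.test(JobSatisfaction ~ BusinessTravel, data = hr_toy)   # robust fallback
```

On the real data, run the same checks for each of the three factors before settling on a final model.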

Questions to Answer:

  1. Based on your ANOVA results, does job satisfaction differ significantly across the categories of BusinessTravel, Department, and JobRole?
  2. Were the assumptions of ANOVA met for each test? If not, what implications might this have for your results?
  3. Which factor(s) – if any – would you recommend a business focus on if they want to improve job satisfaction?

  1. Bhanupratap Biswas. (2023). HR Analytics: Case Study [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/5905686