Chapter 5 The Chi-Square Distribution

PRELIMINARY AND INCOMPLETE

5.1 Case Study: Chi-Squared Goodness of Fit Test

This case study applies the chi-squared goodness of fit test to data from direct marketing campaigns of a Portuguese banking institution. The dataset was provided by S. Moro, P. Cortez and P. Rita (2014).\(^1\)

Case Background

The bank’s marketing team hypothesizes that the distribution of job types amongst their clients mirrors the general job distribution in society. If the bank’s clients’ jobs follow the same distribution as society at large, it indicates that their services have broad appeal, reaching a diverse array of industries and professions. This could be leveraged in their marketing materials to attract new customers from all walks of life.

In order to verify this assumption and refine their marketing strategy, the marketing team plans to conduct a goodness of fit test. This statistical test will compare the observed job distribution among the bank’s clients with the expected job distribution.

Description of the Data

The bank dataset consists of several variables related to bank clients. The following variables are a subset of the contents of the bank dataset:

Variable Description
age Age of the client
job Type of job
marital Marital status
education Level of education
default Has credit in default?
balance Average yearly balance, in euros

The ‘job’ variable is categorical, representing the client’s job type. It has 12 categories: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, and ‘unknown’.

Step 1: Load the Data and Inspect the ‘job’ Variable

First, we need to load the data and examine the distribution of the ‘job’ variable.

# Load the dataset
bank <- read.csv("https://ljkelly3141.github.io/ABE-Book/data/bank-full.csv",
                 sep = ';')

read.csv is a function in R used for loading CSV files into R. The first argument to read.csv is a string that provides the location of the CSV file. In this case, the location is a URL, meaning that R will download the CSV from the web. The sep argument specifies the character that separates or “delimits” the data fields in the file. CSV stands for “comma-separated values”, but sometimes data fields are separated by other characters. In this case, the fields are separated by semicolons (;), so sep = ';' tells R to look for semicolons as separators.
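To see how the sep argument changes parsing, here is a minimal self-contained sketch; csv_text and demo are illustrative names, not part of the bank example. The text argument lets read.csv parse an inline string instead of a file:

```r
# A tiny semicolon-delimited "file" supplied inline via the 'text' argument
csv_text <- 'age;job\n30;admin.\n45;technician'

# sep = ';' tells read.csv to split fields on semicolons
demo <- read.csv(text = csv_text, sep = ";")
nrow(demo)
## [1] 2
```

If sep were left at its default comma, each line would be read as a single field instead of two.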

Now let’s have a look at the job variable.

job.table <- table(bank$job)
job.table %>% knitr::kable(col.names = c("Job Categories", "Freq"))
Job Categories Freq
admin. 5171
blue-collar 9732
entrepreneur 1487
housemaid 1240
management 9458
retired 2264
self-employed 1579
services 4154
student 938
technician 7597
unemployed 1303
unknown 288

Step 2: Formulate a Hypothesis

In English, our hypotheses are:

Null hypothesis (\(H_0\)): The observed distribution of job categories matches the expected distribution.

Alternate hypothesis (\(H_a\)): The observed distribution of job categories does not match the expected distribution.

In mathematical terms:

\(H_0\): Each proportion \(p_i = p_{i,0}\), where \(p_{i,0}\) is the societal proportion of job category \(i\) and \(i = 1, \dots, k\) indexes the \(k\) job categories

\(H_a\): At least one proportion \(p_i \not= p_{i,0}\)

Step 3: Determine Expected Frequencies and the Degrees of Freedom

Now, we will calculate the expected frequencies of job categories under the null hypothesis, using the societal distribution of job types.

Calculate the number of categories in ‘job’

We have \(k\) categories, but only \(k-1\) are free to vary. The last category’s proportion is fixed once we’ve set the proportions in the other categories. Thus, we have \(k - 1\) degrees of freedom. In this case study, the ‘job’ variable in the bank dataset has 12 categories; thus, the degrees of freedom for the chi-squared test are 12 - 1 = 11.

n.categories <- length(job.table)
n.categories
## [1] 12

The variable n.categories now holds the number of unique job categories in our dataset.

The expected count for each job category

expected.counts <-  c(4521, 11303, 1356, 1356, 9042, 2261, 1356, 4521, 904, 6782, 1356, 453) 

The variable expected.counts now contains the expected count for each job category. The expected counts are based on the societal distribution of job types.\(^2\)

Assign the names of the categories to the expected counts

names(expected.counts) <- names(job.table)

The expected.counts now has names assigned to it for clear identification. Each count corresponds to a specific job category from the job.table.
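A quick sanity check, assuming job.table and expected.counts are defined as above: for a goodness of fit test, the expected counts must sum to the same total as the observed counts.

```r
# Both totals should equal the number of clients in the dataset
sum(job.table)
## [1] 45211
sum(expected.counts)
## [1] 45211
```

If the two totals differed, the expected counts would not describe the same sample size as the data and the test statistic would be meaningless.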

Step 4: Calculate the Chi-Square Statistic

Next, we calculate the chi-squared statistic, which is the sum of squared differences between observed and expected counts, normalized by expected counts.
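In symbols, with \(O_i\) the observed count and \(E_i\) the expected count for category \(i\):

\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]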

# Compute observed counts
observed.counts <- table(bank$job)

# Compute chi-squared statistic
chi.sq <- sum((observed.counts - expected.counts)^2 / expected.counts)
chi.sq
## [1] 581.3817
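As a cross-check, R’s built-in chisq.test() should reproduce the same statistic when given the observed counts and the expected proportions (a sketch, assuming observed.counts and expected.counts from the steps above):

```r
# chisq.test takes expected *proportions* via the p argument;
# X-squared should match the chi.sq value computed by hand above
chisq.test(observed.counts, p = expected.counts / sum(expected.counts))
```

Because observed and expected totals are equal here, the expected counts implied by p are identical to expected.counts, so the statistics agree exactly.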

Step 5: Calculate the P-value

The p-value is the probability, under the null hypothesis, of obtaining a chi-squared statistic at least as extreme as the one we observed.

# Compute p-value
p.value <- pchisq(chi.sq, df = n.categories - 1, lower.tail = FALSE)
p.value
## [1] 1.342451e-117

Step 6: Interpret the Result

Lastly, we interpret the result. If the p-value is less than the chosen significance level, \(\alpha\), we reject the null hypothesis; otherwise, we do not reject the null hypothesis. Given that the p-value is extremely small, we reject the null hypothesis: the job distribution in the client database does not mirror the expected distribution.
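Equivalently, we can reach the same conclusion by comparing the test statistic to the critical value of the chi-squared distribution at \(\alpha = 0.05\) with 11 degrees of freedom (a sketch, assuming chi.sq from Step 4):

```r
# Critical value for alpha = 0.05 with df = 11
crit <- qchisq(0.95, df = 11)
crit
## [1] 19.67514

# The test statistic (581.38) far exceeds the critical value
chi.sq > crit
## [1] TRUE
```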

5.2 Case Study: Chi-Squared Test of Independence

In this case study, we will conduct a chi-squared test of independence using data from direct marketing campaigns of a Portuguese banking institution. The dataset was provided by S. Moro, P. Cortez and P. Rita (2014).\(^1\)

Context for the Test of Independence

The bank’s marketing team is interested in examining whether a client’s job type (job) is associated with the possession of a housing loan (housing). If these two variables are independent, the job type provides no information about the likelihood of a client having a housing loan, and vice versa. However, if they are dependent, it could imply that clients from certain job categories are more likely to have a housing loan. To examine this, we’ll perform a chi-squared test of independence.

Description of the Data

The bank dataset contains several variables related to the bank clients and the outcomes of marketing campaigns. For our analysis, we’ll focus on two categorical variables:

Variable Description
job Type of job
housing Has a housing loan? (yes/no)

Step 1: Load the Data

First, we need to load the data and examine the distribution of the job and housing variables.

# Load the dataset
bank <- read.csv("https://ljkelly3141.github.io/ABE-Book/data/bank-full.csv",
                 sep = ";")

# Check the distribution of 'job' and 'housing' variables
table(bank$job, bank$housing) %>% knitr::kable()
no yes
admin. 1989 3182
blue-collar 2684 7048
entrepreneur 618 869
housemaid 842 398
management 4780 4678
retired 1773 491
self-employed 814 765
services 1388 2766
student 689 249
technician 3482 4115
unemployed 760 543
unknown 262 26

Step 2: Formulate a Hypothesis

Null hypothesis (\(H_0\)): The job and housing variables are independent.

Alternate hypothesis (\(H_a\)): The job and housing variables are dependent.

Step 3: Conduct the Chi-Squared Test of Independence

Use the table function to calculate the cross tabulation of the two variables.

cross.tab <- table(bank$job, bank$housing)
cross.tab
##                
##                   no  yes
##   admin.        1989 3182
##   blue-collar   2684 7048
##   entrepreneur   618  869
##   housemaid      842  398
##   management    4780 4678
##   retired       1773  491
##   self-employed  814  765
##   services      1388 2766
##   student        689  249
##   technician    3482 4115
##   unemployed     760  543
##   unknown        262   26
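Before running the test, it helps to see where the expected counts come from: under independence, the expected count in each cell is (row total × column total) / grand total. A minimal sketch with a small made-up 2×2 table (not the bank data):

```r
# Hypothetical 2x2 table of counts (illustrative only)
m <- matrix(c(30, 20, 10, 40), nrow = 2,
            dimnames = list(c("A", "B"), c("no", "yes")))

# Expected cell counts under independence:
# (row total * column total) / grand total, for every cell at once
expected <- outer(rowSums(m), colSums(m)) / sum(m)
# Row A: 20 20; row B: 30 30
```

The chi-squared statistic then compares these expected counts to the observed counts cell by cell, just as in the goodness of fit test.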

We use the chisq.test() function in R to perform the test.

chisq.stat <- chisq.test(cross.tab)
chisq.stat
## 
##  Pearson's Chi-squared test
## 
## data:  cross.tab
## X-squared = 3588.7, df = 11, p-value < 2.2e-16
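The chisq.stat object also stores the expected counts and the Pearson residuals, which show which cells contribute most to the statistic (a sketch, assuming chisq.stat from above):

```r
# Expected counts under independence
round(chisq.stat$expected, 1)

# Pearson residuals: (observed - expected) / sqrt(expected);
# cells with large absolute residuals drive the dependence
round(chisq.stat$residuals, 2)
```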

Step 4: Interpret the Result

Lastly, we interpret the result. If the p-value is less than the chosen significance level, \(\alpha\), we reject the null hypothesis; otherwise, we do not reject the null hypothesis. Given that the p-value is extremely small, we reject the null hypothesis: the job and housing variables are not independent.


  1. Moro, S., Cortez, P., & Rita, P. (2014). A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31. Retrieved from https://archive.ics.uci.edu/ml/datasets/bank+marketing↩︎

  2. Note that this data is simulated and may not represent the actual societal distribution of job types.↩︎