Chapter 5 The Chi-Square Distribution
PRELIMINARY AND INCOMPLETE
5.1 Case Study: Chi-Squared Goodness of Fit Test
This case study applies the chi-squared goodness of fit test to data from direct marketing campaigns of a Portuguese banking institution. The dataset, provided by S. Moro, P. Cortez and P. Rita (2014), is publicly available.\(^2\)
Case Background
The bank’s marketing team hypothesizes that the distribution of job types among its clients mirrors the general job distribution in society. If client jobs are spread across job types in the same proportions as in the wider population, it suggests that the bank’s services have broad appeal, reaching a diverse array of industries and professions. The bank could use this in its marketing materials to attract new customers from all walks of life.
In order to verify this assumption and refine their marketing strategy, the marketing team plans to conduct a goodness of fit test. This statistical test will compare the observed job distribution among the bank’s clients with the expected job distribution.
Description of the Data
The bank dataset consists of several variables related to bank clients. The following variables are a subset of the contents of the bank dataset:
| Variable | Description |
|---|---|
| age | Age of the client |
| job | Type of job |
| marital | Marital status |
| education | Level of education |
| default | Has credit in default? |
| balance | Average yearly balance, in euros |
| … | … |
The ‘job’ variable is categorical and represents the client’s job type. It has 12 categories: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’ and ‘unknown’.
Step 1: Load the Data and Inspect the ‘job’ Variable
First, we need to load the data and examine the distribution of the ‘job’ variable.
# Load the dataset
bank <- read.csv("https://ljkelly3141.github.io/ABE-Book/data/bank-full.csv",
                 sep = ';')
read.csv() is an R function for loading CSV files. Its first argument is a string giving the location of the file. Here the location is a URL, so R downloads the file from the web. The sep argument specifies the character that separates, or “delimits”, the data fields in the file. CSV stands for “comma-separated values”, but some files use a different delimiter. In this file the fields are separated by semicolons (;), so sep = ';' tells R to split each line on semicolons.
Now let’s have a look at the job variable.
| Job Categories | Freq |
|---|---|
| admin. | 5171 |
| blue-collar | 9732 |
| entrepreneur | 1487 |
| housemaid | 1240 |
| management | 9458 |
| retired | 2264 |
| self-employed | 1579 |
| services | 4154 |
| student | 938 |
| technician | 7597 |
| unemployed | 1303 |
| unknown | 288 |
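A frequency table like the one above can be produced with R's table() function. The sketch below is an illustration, not the book's own code: so that it runs without a network connection, it rebuilds a stand-in `bank` data frame from the frequencies shown above (the `freq` vector and the rebuilt data frame are assumptions made for this example). With the real data from Step 1 loaded, only the statements from `job.table <- ...` onward are needed. The name job.table is the one used later in the text.

```r
# Frequencies copied from the table above (stand-in for the downloaded data)
freq <- c("admin." = 5171, "blue-collar" = 9732, "entrepreneur" = 1487,
          "housemaid" = 1240, "management" = 9458, "retired" = 2264,
          "self-employed" = 1579, "services" = 4154, "student" = 938,
          "technician" = 7597, "unemployed" = 1303, "unknown" = 288)

# Rebuild a one-column data frame with one row per client
bank <- data.frame(job = rep(names(freq), times = freq))

# Count the clients in each job category
job.table <- table(bank$job)
job.table

# Share of clients in each category, rounded to three decimals
round(prop.table(job.table), 3)
```

prop.table() is convenient here because the hypotheses in the next step are stated in terms of proportions rather than raw counts.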
Step 2: Formulate a Hypothesis
In English, our hypotheses are:
Null hypothesis (\(H_0\)): The observed distribution of job categories matches the expected distribution.
Alternate hypothesis (\(H_a\)): The observed distribution of job categories does not match the expected distribution.
In mathematical terms:
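\(H_0\): \(p_i = p_{i,0}\) for every category \(i = 1, \dots, k\), where \(k\) is the number of job categories and \(p_{i,0}\) is the expected proportion for category \(i\) (equally likely categories are the special case \(p_{i,0} = 1/k\))
\(H_a\): \(p_i \neq p_{i,0}\) for at least one category \(i\)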
Step 3: Determine Expected Frequencies and the Degrees of Freedom
Now, we will calculate the expected frequencies of job categories, that is, the counts we would expect in each category if the null hypothesis were true.
Calculate the number of categories in ‘job’
We have \(k\) categories, but only \(k-1\) are free to vary: once the proportions of the other categories are set, the last category’s proportion is fixed, because the proportions must sum to 1. Thus, we have \(k - 1\) degrees of freedom. In this case study, the ‘job’ variable has 12 categories, so the chi-squared test has \(12 - 1 = 11\) degrees of freedom.
## [1] 12
The variable n.categories now holds the number of unique job categories in our dataset.
The expected count for each job category
The variable expected.counts now contains the expected count for each job category. The expected counts are based on the societal distribution of job types.\(^3\)
Assign the names of the categories to the expected counts
The expected.counts vector now has names assigned to it for clear identification. Each count corresponds to a specific job category from job.table.
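The three sub-steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's own code: the societal proportions behind the book's expected counts are not shown in the text, so the hypothetical vector `p0` uses equal shares purely as a placeholder. With that placeholder the expected counts differ from the book's, and the statistic computed in Step 4 would not match the value reported there. The names `p0`, `n` and `deg.freedom` are introduced for this example; n.categories, expected.counts and job.table follow the text.

```r
# Observed counts from Step 1 (copied from the frequency table)
job.table <- c("admin." = 5171, "blue-collar" = 9732, "entrepreneur" = 1487,
               "housemaid" = 1240, "management" = 9458, "retired" = 2264,
               "self-employed" = 1579, "services" = 4154, "student" = 938,
               "technician" = 7597, "unemployed" = 1303, "unknown" = 288)

# Number of categories and degrees of freedom (k - 1)
n.categories <- length(job.table)
n.categories
deg.freedom <- n.categories - 1

# ASSUMPTION: placeholder null proportions (equal shares). Replace with the
# societal proportions, which must sum to 1.
p0 <- rep(1 / n.categories, n.categories)

# Expected count = total sample size * null proportion
n <- sum(job.table)
expected.counts <- n * p0

# Label each expected count with its job category
names(expected.counts) <- names(job.table)
round(expected.counts, 1)
```

Whatever proportions are used, the expected counts should add up to the sample size, which is a quick check that `p0` sums to 1.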
Step 4: Calculate the Chi-Square Statistic
Next, we calculate the chi-squared statistic, which is the sum of squared differences between observed and expected counts, normalized by expected counts.
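In symbols, writing \(O_i\) for the observed count and \(E_i\) for the expected count in category \(i\) (notation introduced here for illustration), the statistic is

\[
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
\]

Each term measures the squared gap between observed and expected counts relative to the expected count, so a large value of \(\chi^2\) signals a poor fit. Under \(H_0\), the statistic approximately follows a chi-square distribution with \(k - 1\) degrees of freedom.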
# Compute observed counts
observed.counts <- table(bank$job)
# Compute chi-squared statistic
chi.sq <- sum((observed.counts - expected.counts)^2 / expected.counts)
chi.sq
## [1] 581.3817
Step 5: Calculate the P-value
The p-value is the probability, under the null hypothesis, of getting a chi-squared statistic at least as extreme as the one we observed.
## [1] 1.342451e-117
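One way to obtain such a p-value is with pchisq(), the chi-square distribution function in base R. The sketch below is an illustration rather than the book's own code: it plugs in the statistic and degrees of freedom reported in the text, and the names `deg.freedom` and `p.value` are introduced here for the example.

```r
# Test statistic and degrees of freedom reported in the text
chi.sq <- 581.3817
deg.freedom <- 12 - 1

# Upper-tail probability P(X >= chi.sq) for X ~ chi-square(deg.freedom)
p.value <- pchisq(chi.sq, df = deg.freedom, lower.tail = FALSE)
p.value

# 5% critical value for comparison; reject H0 when chi.sq exceeds it
qchisq(0.95, df = deg.freedom)
```

Using lower.tail = FALSE rather than 1 - pchisq(...) matters for a statistic this large: the subtraction would round to exactly 0 in floating-point arithmetic, whereas the upper-tail call keeps the tiny probability.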
Step 6: Interpret the Result
Lastly, we interpret the result. If the p-value is less than the chosen significance level, \(\alpha\), we reject the null hypothesis; otherwise, we do not reject it. Here the p-value is extremely small, far below any conventional \(\alpha\) such as 0.05, so we reject the null hypothesis. The job distribution in the client database does not mirror the expected distribution.
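Steps 3 to 5 can also be carried out in a single call, because chisq.test() performs a goodness of fit test when given a vector of counts and a vector of null proportions through its p argument. The sketch below is an illustration under an assumption: the societal proportions used in the case study are not shown, so the hypothetical `p0` again uses equal shares as a placeholder, and the resulting statistic will not match the value reported above.

```r
# Observed counts copied from the frequency table in Step 1
observed.counts <- c("admin." = 5171, "blue-collar" = 9732,
                     "entrepreneur" = 1487, "housemaid" = 1240,
                     "management" = 9458, "retired" = 2264,
                     "self-employed" = 1579, "services" = 4154,
                     "student" = 938, "technician" = 7597,
                     "unemployed" = 1303, "unknown" = 288)

# ASSUMPTION: placeholder null proportions (equal shares)
p0 <- rep(1 / length(observed.counts), length(observed.counts))

# Goodness of fit test: statistic, degrees of freedom and p-value in one call
gof <- chisq.test(observed.counts, p = p0)
gof

# Cross-check against the manual formula from Step 4
manual <- sum((observed.counts - gof$expected)^2 / gof$expected)
all.equal(unname(gof$statistic), manual)
```

Working through the steps by hand first and then confirming with chisq.test() is a useful habit, since a mismatch usually points to proportions that do not sum to 1 or to misaligned category names.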
5.2 Case Study: Chi-Squared Test of Independence
In this case study, we will conduct a chi-squared test of independence using data from direct marketing campaigns of a Portuguese banking institution. The dataset, provided by S. Moro, P. Cortez and P. Rita (2014), is publicly available.\(^2\)
Context for the Test of Independence
The bank’s marketing team is interested in examining whether a client’s job type (job) is associated with the possession of a housing loan (housing). If these two variables are independent, the job type provides no information about the likelihood of a client having a housing loan, and vice versa. However, if they are dependent, it could imply that clients from certain job categories are more likely to have a housing loan. To examine this, we’ll perform a chi-squared test of independence.
Description of the Data
The bank dataset contains several variables related to the bank clients and the outcomes of marketing campaigns. For our analysis, we’ll focus on two categorical variables:
| Variable | Description |
|---|---|
| job | Type of job |
| housing | Has a housing loan? (yes/no) |
Step 1: Load the Data
First, we need to load the data and examine the distribution of the job and housing variables.
# Load the dataset
bank <- read.csv("https://ljkelly3141.github.io/ABE-Book/data/bank-full.csv",
sep = ";")
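# Check the distribution of 'job' and 'housing' variables
# (the %>% pipe comes from magrittr; load it if no tidyverse package is attached)
library(magrittr)
table(bank$job, bank$housing) %>% knitr::kable()

| | no | yes |
|---|---|---|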
| admin. | 1989 | 3182 |
| blue-collar | 2684 | 7048 |
| entrepreneur | 618 | 869 |
| housemaid | 842 | 398 |
| management | 4780 | 4678 |
| retired | 1773 | 491 |
| self-employed | 814 | 765 |
| services | 1388 | 2766 |
| student | 689 | 249 |
| technician | 3482 | 4115 |
| unemployed | 760 | 543 |
| unknown | 262 | 26 |
Step 2: Formulate a Hypothesis
Null hypothesis (\(H_0\)): The job and housing variables are independent.
Alternate hypothesis (\(H_a\)): The job and housing variables are dependent.
Step 3: Conduct the Chi-Squared Test of Independence
Use the table function to calculate the cross tabulation of the two variables.
##
## no yes
## admin. 1989 3182
## blue-collar 2684 7048
## entrepreneur 618 869
## housemaid 842 398
## management 4780 4678
## retired 1773 491
## self-employed 814 765
## services 1388 2766
## student 689 249
## technician 3482 4115
## unemployed 760 543
## unknown 262 26
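The test compares each cell of this cross tabulation with the count expected if job and housing were independent. Writing \(O_{ij}\) for the observed count in row \(i\) (job) and column \(j\) (housing), \(R_i\) and \(C_j\) for the row and column totals, and \(n\) for the grand total (notation introduced here for illustration):

\[
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{df} = (r - 1)(c - 1)
\]

With \(r = 12\) job categories and \(c = 2\) housing answers, the test has \((12 - 1)(2 - 1) = 11\) degrees of freedom.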
We use the chisq.test() function in R to perform the test.
##
## Pearson's Chi-squared test
##
## data: cross.tab
## X-squared = 3588.7, df = 11, p-value < 2.2e-16
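The call that produces output like this can be sketched as follows. This is an illustration rather than the book's own code: so that it runs without downloading the data, it rebuilds cross.tab as a matrix from the counts printed above. With the data loaded, `cross.tab <- table(bank$job, bank$housing)` would give the same table. The name `result` is introduced for this example.

```r
# Cross tabulation copied from the output above: rows are job categories,
# columns are the housing-loan answers
cross.tab <- matrix(
  c(1989, 3182,   2684, 7048,   618,  869,
     842,  398,   4780, 4678,  1773,  491,
     814,  765,   1388, 2766,   689,  249,
    3482, 4115,    760,  543,   262,   26),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    job = c("admin.", "blue-collar", "entrepreneur", "housemaid",
            "management", "retired", "self-employed", "services",
            "student", "technician", "unemployed", "unknown"),
    housing = c("no", "yes")))

# Pearson chi-squared test of independence
result <- chisq.test(cross.tab)
result

# Expected counts under independence: row total * column total / n
head(round(result$expected, 1))
```

Comparing result$expected with cross.tab shows which job categories drive the statistic. For example, far fewer retired clients have a housing loan than independence would predict.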
Step 4: Interpret the Result
Lastly, we interpret the result. If the p-value is less than the chosen significance level, \(\alpha\), we reject the null hypothesis; otherwise, we do not reject it. Given that the p-value is extremely small, we reject the null hypothesis. The job and housing variables are not independent.
\(^2\) Moro, S., Cortez, P., & Rita, P. (2014). A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31. Retrieved from https://archive.ics.uci.edu/ml/datasets/bank+marketing
\(^3\) Note that this data is simulated and may not represent the actual societal distribution of job types.