How to perform a correlation test in RStudio
In this article, we will learn how to perform a correlation hypothesis test between two variables using RStudio. The objective is to determine if there is a significant relationship between the selected variables.
The data we will use
For this example, we will use a dataset called data that contains the following variables:
- Campaign: Binary categorical variable that indicates whether a specific day belongs to the Black Friday campaign. It takes the value of 1 if the day is part of the campaign and 0 otherwise.
- Purchases: Discrete numeric variable containing the number of purchases made by users on a specific day.
- Revenue: Continuous numeric variable containing the amount of revenue achieved on a specific day.
The dataset has 30 rows, one per day, covering one month.
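The article's actual dataset is not reproduced here. To follow along, a data frame with the same three columns can be simulated. This is only a sketch: the seed, the number of campaign days, the Poisson rates, the price range and the discount are all made-up assumptions, not values from the real data.

```r
# Hypothetical stand-in for the article's dataset (all parameters are assumptions).
set.seed(42)                                  # reproducible simulated values
n_days   <- 30
campaign <- rep(c(0, 1), times = c(23, 7))    # assumed: the last 7 days are campaign days

# Assumed: more purchases per day during the campaign
purchases <- rpois(n_days, lambda = ifelse(campaign == 1, 60, 40))

# Assumed: a price of 20 to 30 per purchase, with a 20% discount on campaign days
revenue <- round(purchases * runif(n_days, min = 20, max = 30) *
                   ifelse(campaign == 1, 0.8, 1), 2)

data <- data.frame(Campaign = campaign, Purchases = purchases, Revenue = revenue)
str(data)    # one row per day, three numeric columns
head(data)
```

With real data, this block would be replaced by whatever import step loads the file into a data frame named data.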
Next, we will analyze how the Purchases and Revenue variables each correlate with the Campaign variable. We want to determine whether purchases and revenue increase during the Black Friday campaign.
Step 1: Hypothesis statement
First, we must define our hypotheses:
- Null hypothesis (H₀): There is no correlation between the two variables.
- Alternative hypothesis (H₁): There is a correlation between the two variables.
Step 2: Initial data visualization
Before performing the statistical test, it is useful to visualize the relationship between the two variables:
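A minimal sketch of such a plot, using base R graphics and simulated data. The original plotting code may well have used a different package, and the data frame below is a hypothetical stand-in for the real dataset.

```r
# Hypothetical data standing in for the article's dataset (parameters are assumptions).
set.seed(42)
campaign  <- rep(c(0, 1), times = c(23, 7))
purchases <- rpois(30, lambda = ifelse(campaign == 1, 60, 40))
data <- data.frame(Campaign = campaign, Purchases = purchases)

# Scatter plot of purchases against the campaign indicator
plot(data$Campaign, data$Purchases,
     xlab = "Campaign (0 = no, 1 = yes)",
     ylab = "Purchases per day",
     main = "Purchases vs. Campaign")

# Fit a simple linear regression and draw it as a red line
fit <- lm(Purchases ~ Campaign, data = data)
abline(fit, col = "red")
```

Replacing Purchases with Revenue in the plot and lm calls gives the corresponding plot for revenue.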
This graph allows us to visually observe whether there appears to be a relationship between the campaign and purchases. The red line represents a linear regression, which is useful for identifying trends.
We observe a positive correlation between purchases and the days belonging to the Black Friday campaign; that is, purchases increase on Black Friday days. Next, we check whether the same holds for revenue: even if purchases increase during the campaign, revenue could remain stable because of the discounts applied.
A positive correlation is also observed between revenue and the days belonging to the Black Friday campaign, although it is weaker than for purchases.
Step 3: Method selection
There are different methods to calculate the correlation depending on the characteristics of the data:
- Pearson: Pearson's correlation coefficient measures the strength of the linear relationship between two quantitative variables, and its significance test assumes that both are approximately normally distributed. It is not ideal when one variable is categorical (binary) and the other is discrete, as Pearson is designed for continuous variables with a linear relationship.
- Spearman: Spearman's correlation coefficient is a non-parametric, rank-based measure that assesses monotonic relationships, which need not be linear. It works best when the data do not follow a normal distribution or when the variables are not continuous. Given that one of our variables is binary and one is discrete, Spearman is more appropriate.
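In R, both coefficients are available through the method argument of the base cor function. The sketch below computes each on simulated data, which is a hypothetical stand-in for the real dataset, so the printed values will not match the article's.

```r
# Hypothetical data standing in for the article's dataset (parameters are assumptions).
set.seed(42)
campaign  <- rep(c(0, 1), times = c(23, 7))
purchases <- rpois(30, lambda = ifelse(campaign == 1, 60, 40))
data <- data.frame(Campaign = campaign, Purchases = purchases)

# Pearson works on the raw values; Spearman works on their ranks
pearson  <- cor(data$Campaign, data$Purchases, method = "pearson")
spearman <- cor(data$Campaign, data$Purchases, method = "spearman")

round(c(pearson = pearson, spearman = spearman), 4)
```

Because Campaign only takes two values, its ranks contain many ties. Spearman's method assigns tied observations their average rank.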
Step 4: Obtaining the correlation coefficient
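A sketch of how the coefficients for every pair of variables can be obtained at once, using base R's cor on simulated data (a hypothetical stand-in for the real dataset). A chart with coefficients above the diagonal and significance asterisks resembles the output of chart.Correlation from the PerformanceAnalytics package. That this package was used is an assumption, so the call is optional and only runs if the package is installed.

```r
# Hypothetical data standing in for the article's dataset (parameters are assumptions).
set.seed(42)
campaign  <- rep(c(0, 1), times = c(23, 7))
purchases <- rpois(30, lambda = ifelse(campaign == 1, 60, 40))
revenue   <- round(purchases * runif(30, min = 20, max = 30) *
                     ifelse(campaign == 1, 0.8, 1), 2)
data <- data.frame(Campaign = campaign, Purchases = purchases, Revenue = revenue)

# Spearman correlation matrix for all pairs of columns
cor_matrix <- cor(data, method = "spearman")
round(cor_matrix, 2)

# Optional correlation chart (assumption: PerformanceAnalytics is the package used)
if (requireNamespace("PerformanceAnalytics", quietly = TRUE)) {
  # Ties in the data can trigger warnings about exact p-values; they are suppressed here
  suppressWarnings(
    PerformanceAnalytics::chart.Correlation(data, histogram = TRUE, method = "spearman")
  )
}
```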
In this graph, we must look at the coefficients that appear in the cells above the main diagonal. These tell us the strength and direction of the correlation between pairs of variables. We will focus on the correlation between Campaign and the two numerical variables, since we already know that there is a relationship between Purchases and Revenue.
- Purchases and Campaign: The correlation is 0.71, which suggests a fairly strong positive relationship. This means that on Black Friday days, purchases increase.
- Revenue and Campaign: The correlation is 0.61, indicating a moderate positive relationship. This means that on Black Friday days revenue increases, although less strongly than purchases.
The three asterisks (***) next to the numbers indicate that the correlations are statistically significant at a high level (conventionally, p < 0.001), i.e. it is highly unlikely that these relationships are the product of chance. These significance marks come from a correlation hypothesis test, which we carry out explicitly in the next step.
Step 5: Performing the correlation test
We will proceed to perform the correlation test in more detail:
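A minimal sketch of the test for Campaign and Purchases, using cor.test from base R's stats package with the Spearman method. The data frame is simulated as a hypothetical stand-in for the real dataset, so the output will differ from the results reported in this article.

```r
# Hypothetical data standing in for the article's dataset (parameters are assumptions).
set.seed(42)
campaign  <- rep(c(0, 1), times = c(23, 7))
purchases <- rpois(30, lambda = ifelse(campaign == 1, 60, 40))
data <- data.frame(Campaign = campaign, Purchases = purchases)

# Spearman correlation test.
# exact = FALSE requests the asymptotic p-value, because an exact
# p-value cannot be computed when the data contain ties.
test_purchases <- cor.test(data$Campaign, data$Purchases,
                           method = "spearman", exact = FALSE)
test_purchases
```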
This command provides us with a p-value and a correlation coefficient (rho). The p-value tells us if the correlation is statistically significant. If it is less than 0.05, we have enough evidence to reject the null hypothesis and conclude that there is a significant correlation between the two variables.
We obtain the following results:
With a p-value well below 0.05, we have enough evidence to reject the null hypothesis and conclude that there is a correlation between purchases and Black Friday campaign days, with a positive coefficient of 0.7057.
With a p-value of less than 0.05, we have sufficient evidence to reject the null hypothesis and conclude that there is a correlation between revenue and Black Friday campaign days, with a positive coefficient of 0.6146.
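The decision rule applied to both variables can be sketched as follows. The helper summarise_test and the data are hypothetical: the data frame is simulated, so the coefficients and p-values it produces are not the ones reported above.

```r
# Hypothetical data standing in for the article's dataset (parameters are assumptions).
set.seed(42)
campaign  <- rep(c(0, 1), times = c(23, 7))
purchases <- rpois(30, lambda = ifelse(campaign == 1, 60, 40))
revenue   <- round(purchases * runif(30, min = 20, max = 30) *
                     ifelse(campaign == 1, 0.8, 1), 2)
data <- data.frame(Campaign = campaign, Purchases = purchases, Revenue = revenue)

alpha <- 0.05  # significance level used in the article

# Hypothetical helper: test one variable against Campaign and summarise the result
summarise_test <- function(variable) {
  result <- cor.test(data$Campaign, data[[variable]],
                     method = "spearman", exact = FALSE)
  data.frame(
    variable  = variable,
    rho       = unname(result$estimate),  # correlation coefficient
    p_value   = result$p.value,
    reject_h0 = result$p.value < alpha    # TRUE means H0 is rejected
  )
}

results <- do.call(rbind, lapply(c("Purchases", "Revenue"), summarise_test))
print(results)
```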