How to perform a correlation test in RStudio

In this article, we will learn how to perform a correlation hypothesis test between two variables in R, using RStudio. The objective is to determine whether there is a significant relationship between the selected variables.

The data we will use

For this example, we will use a dataset called data that contains the following variables:

  1. Campaign: Binary categorical variable that indicates whether a specific day belongs to the Black Friday campaign. It takes the value of 1 if the day is part of the campaign and 0 otherwise.

  2. Purchases: Discrete numeric variable containing the number of purchases made by users on a specific day.

  3. Revenue: Continuous numeric variable containing the amount of revenue achieved on a specific day. 

Our dataset contains 30 rows, one per day, i.e. one month of data.
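
Since the original dataset is not included, a data frame with the same structure can be simulated in R (the values below are illustrative only, not the article's data):

```r
# Simulated stand-in for the article's dataset: 30 days, 7 of them
# belonging to the Black Friday campaign.
set.seed(42)
Campaign  <- c(rep(0, 23), rep(1, 7))                  # binary: campaign day or not
Purchases <- rpois(30, lambda = 100 + 80 * Campaign)   # discrete purchase counts
Revenue   <- Purchases * runif(30, min = 8, max = 12)  # continuous revenue
data <- data.frame(Campaign, Purchases, Revenue)
str(data)  # 30 observations of 3 variables
```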


Next, we will analyze the correlation of the Purchases and Revenue variables with the Campaign variable. We want to determine whether purchases and revenue increase during the Black Friday campaign.

Step 1: Hypothesis statement

First, we must define our hypotheses:

  • Null hypothesis (H₀): There is no correlation between the two variables.
  • Alternative hypothesis (H₁): There is correlation between the two variables.

Step 2: Initial data visualization

Before performing the statistical test, it is useful to visualize the relationship between the two variables:

[Screenshot: plotting Purchases against Campaign]

This graph allows us to visually observe whether there appears to be a relationship between the campaign and purchases. The red line represents a linear regression, which is useful for identifying trends.
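
A plot of this kind can be reproduced, for example, with base R graphics (the article's exact code is not shown, so this is one possible sketch; the data are simulated as above):

```r
# Simulated data with the structure described in the article.
set.seed(42)
data <- data.frame(
  Campaign  = c(rep(0, 23), rep(1, 7)),
  Purchases = rpois(30, lambda = 100 + 80 * c(rep(0, 23), rep(1, 7)))
)

# Scatter plot of purchases by campaign day, with a fitted regression line.
plot(data$Campaign, data$Purchases,
     xlab = "Campaign (0 = normal day, 1 = Black Friday)",
     ylab = "Purchases")
fit <- lm(Purchases ~ Campaign, data = data)
abline(fit, col = "red")  # the red regression line mentioned in the text
```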

[Screenshot: Purchases vs. Campaign plot with regression line]

We observe a positive correlation between purchases and the days belonging to the Black Friday campaign; that is, sales increase during the Black Friday days. Now we will analyze whether the same happens with revenue, since it is possible that, although sales increase during the campaign, revenue remains stable due to the discounts applied.

[Screenshot: Revenue vs. Campaign plot]

A positive correlation is also observed between revenue and the days belonging to the Black Friday campaign, although it is weaker than the one with purchases.

Step 3: Method selection

There are different methods to calculate the correlation depending on the characteristics of the data:

  • Pearson: Pearson's correlation coefficient measures the linear relationship between two continuous variables and assumes that both are quantitative and normally distributed. It is not ideal here, since one of our variables is categorical (binary) and the other is discrete.
  • Spearman: Spearman's correlation coefficient is a non-parametric measure that assesses the monotonic (not necessarily linear) relationship between two variables. It works best when the data do not follow a normal distribution or when the variables are not continuous. Given that one of our variables is binary and another is discrete, Spearman is more appropriate.

Step 4: Obtaining the correlation coefficient

[Screenshot: code to generate the correlation matrix]

[Screenshot: Spearman correlation matrix with significance stars]
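
The coefficients in such a matrix can be computed with base R's cor(); the significance stars shown in the figure come from a plotting helper such as PerformanceAnalytics::chart.Correlation (an assumption about what was used, not confirmed by the article). A minimal sketch on simulated data:

```r
# Simulated stand-in for the article's dataset.
set.seed(42)
Campaign  <- c(rep(0, 23), rep(1, 7))
Purchases <- rpois(30, lambda = 100 + 80 * Campaign)
Revenue   <- Purchases * runif(30, min = 8, max = 12)
data <- data.frame(Campaign, Purchases, Revenue)

# Pairwise Spearman coefficients, rounded for readability.
round(cor(data, method = "spearman"), 2)

# With the PerformanceAnalytics package installed, a matrix plot with
# significance stars can be drawn like this:
# PerformanceAnalytics::chart.Correlation(data, method = "spearman")
```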

In this matrix, we must look at the coefficients that appear in the cells above the main diagonal. These tell us the strength and direction of the correlation between pairs of variables. We will focus on the correlations between Campaign and the two numeric variables, since we already know that there is a relationship between Purchases and Revenue.

  • Purchases and Campaign: The correlation is 0.71, which suggests a fairly strong positive relationship. This means that purchases increase on Black Friday days.
  • Revenue and Campaign: The correlation is 0.61, indicating a moderate positive relationship. This means that revenue increases on Black Friday days, although less strongly than purchases.

The three asterisks (***) next to the coefficients indicate that the correlations are statistically significant at a high level, i.e. it is highly unlikely that these relationships are the product of chance. The next step formalizes this with a correlation hypothesis test.

Step 5: Performing the correlation test

We will now perform the correlation test in more detail:

[Screenshot: cor.test command]

This command provides us with a p-value and a correlation coefficient (rho). The p-value tells us if the correlation is statistically significant. If it is less than 0.05, we have enough evidence to reject the null hypothesis and conclude that there is a significant correlation between the two variables.
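
The test itself can be run with R's cor.test(); the column names and simulated data below follow the sketch above, and exact = FALSE avoids the exact-p-value warning caused by tied ranks:

```r
# Simulated stand-in for the article's dataset.
set.seed(42)
Campaign  <- c(rep(0, 23), rep(1, 7))
Purchases <- rpois(30, lambda = 100 + 80 * Campaign)
Revenue   <- Purchases * runif(30, min = 8, max = 12)
data <- data.frame(Campaign, Purchases, Revenue)

# Spearman correlation tests against the campaign indicator.
cor.test(data$Purchases, data$Campaign, method = "spearman", exact = FALSE)
cor.test(data$Revenue,   data$Campaign, method = "spearman", exact = FALSE)
```

Each call reports the Spearman coefficient (rho) and the p-value used to accept or reject the null hypothesis.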

We obtain the following results:

[Screenshot: cor.test output for Purchases and Campaign]

With a p-value well below 0.05, we have enough evidence to reject the null hypothesis and conclude that there is a correlation between purchases and Black Friday campaign days, with a positive coefficient of 0.7057.

[Screenshot: cor.test output for Revenue and Campaign]

With a p-value of less than 0.05, we have sufficient evidence to reject the null hypothesis and conclude that there is a correlation between revenue and Black Friday campaign days, with a positive coefficient of 0.6146.

 


Group your data like a pro: clustering with K-Means and BigQuery ML

Working with large volumes of marketing data—whether it’s web traffic, keywords, users, or campaigns—can feel overwhelming. These data sets often aren’t organized or categorized in a useful way, and facing them can feel like trying to understand a conversation in an unfamiliar language.

But what if you could automatically discover patterns and create data groups—without manual rules, endless scripts, or leaving your BigQuery analysis environment?

That’s exactly what K-Means with BigQuery ML allows you to do.

What is K-Means and why should you care?

K-Means is a clustering algorithm—a technique for grouping similar items. Imagine you have a table with thousands of URLs, users, or products. Instead of going through each one manually, K-Means can automatically find groups with common patterns: pages with similar performance, campaigns with similar outcomes, or users with shared behaviors.

And the best part? With BigQuery ML, you can apply K-Means using plain SQL—no need for Python scripts or external tools.

How does it actually work?

The process behind K-Means is surprisingly simple:

  1. You choose how many groups you want (the well-known “K”).

  2. The algorithm picks initial points called centroids.

  3. Each row in your data is assigned to the nearest centroid.

  4. The centroids are recalculated using the assigned data.

  5. This process repeats until the groups stabilize.

The result? Every row in your table is tagged with the cluster it belongs to. Now you can analyze the patterns of each group and make better-informed decisions.

How to apply it in BigQuery ML

BigQuery ML simplifies the entire process. With just a few lines of SQL, you can:

  • Train a K-Means model on your data

  • Retrieve the generated centroids

  • Classify each row with its corresponding cluster
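
These three steps can be sketched in BigQuery SQL; the project, dataset, table, and column names below are hypothetical placeholders to adapt to your own data:

```sql
-- 1) Train a K-Means model with K = 4 (hypothetical names throughout).
CREATE OR REPLACE MODEL `my_project.marketing.pages_kmeans`
OPTIONS (
  model_type           = 'kmeans',
  num_clusters         = 4,
  standardize_features = TRUE
) AS
SELECT sessions, conversions, revenue
FROM `my_project.marketing.page_metrics`;

-- 2) Retrieve the generated centroids.
SELECT *
FROM ML.CENTROIDS(MODEL `my_project.marketing.pages_kmeans`);

-- 3) Tag each row with its corresponding cluster.
SELECT centroid_id, page, sessions, conversions, revenue
FROM ML.PREDICT(
  MODEL `my_project.marketing.pages_kmeans`,
  (SELECT page, sessions, conversions, revenue
   FROM `my_project.marketing.page_metrics`)
);
```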

This opens up a wide range of possibilities to enrich your dashboards and marketing analysis:

  • Group pages by performance (visits, conversions, revenue)

  • Detect behaviors of returning, new, or inactive users

  • Identify products often bought together or with similar buyer profiles

  • Spot keywords with unusual performance

How many clusters do I need?

Choosing the right number of clusters (“K”) is critical. Here are a few strategies:

  • Business knowledge: If you already know you have 3 customer types or 4 product categories, start there.

  • Elbow Method: Run models with different K values and watch for the point where segmentation no longer improves significantly.

  • Iterate thoughtfully: Test, review, and adjust based on how your data behaves.
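
For the Elbow Method in particular, each candidate model can be scored directly in SQL: for K-Means models, ML.EVALUATE returns the Davies–Bouldin index (lower is better) and the mean squared distance. The model name below is a hypothetical placeholder:

```sql
-- Evaluate one trained K-Means model; rerun for models trained with
-- different K values and compare the metrics across them.
SELECT davies_bouldin_index, mean_squared_distance
FROM ML.EVALUATE(MODEL `my_project.marketing.pages_kmeans`);
```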

Real-world examples

With K-Means in BigQuery, you can answer questions like:

  • What types of users visit my site, and how do they differ?

  • Which pages show similar performance trends?

  • Which campaigns are generating outlier results?

Grouping data this way not only saves time—it reveals opportunities and issues that might otherwise go unnoticed.

Conclusion

If you're handling large data sets and need to identify patterns fast, clustering with K-Means and BigQuery ML can be a game-changer. You don’t need to be a data scientist or build complex solutions from scratch. You just need to understand your business and ask the right questions—BigQuery can handle the rest.

Start simple: take your top-performing pages, group them by sessions and conversions, and see what patterns emerge. You might uncover insights that completely shift how you approach your digital strategy.

 
