Big Query Use cases¶
Estimated time to read: 8 minutes
Use Case 6: A/B Testing Analysis¶
A/B testing is a common method for comparing the performance of two different versions of a web page, advertisement, or other marketing asset. With BigQuery, you can use SQL queries to analyse the results of an A/B test.
Assuming you have a dataset called user_analytics
with a table called ab_test_results
containing columns user_id
, variant
(A or B), session_duration
, and conversion
(1 if the user converted, 0 otherwise), you could analyse the results as follows:
Creating a mock dataset for this A/B test analysis can be done using SQL in BigQuery. For example, let's create a simple table named ab_test_results
with fields for user_id
, variant
, session_duration
, and conversion
.
Here's an SQL code to create this table and populate it with some sample data:
CREATE OR REPLACE TABLE `user_analytics.ab_test_results` (
user_id STRING,
variant STRING,
session_duration INT64,
conversion INT64
) AS
SELECT
FORMAT("USER-%05d", x) AS user_id,
IF(MOD(x, 2) = 0, "A", "B") AS variant,
CAST(300 + RAND() * 300 AS INT64) AS session_duration,
IF(RAND() < 0.15, 1, 0) AS conversion
FROM (
SELECT
GENERATE_ARRAY(1, 10000) AS numbers
) t, UNNEST(t.numbers) x;
This code creates a table with 10,000 rows. Each row represents a user. The user_id
is a string like "USER-00001", "USER-00002", etc. The variant
is "A" for even-numbered users and "B" for odd-numbered users. The session_duration
is a random number between 300 and 600 (simulating the duration of a session in seconds), and conversion
is a random binary value (0 or 1) with a probability of 0.15 for a conversion (simulating a conversion rate of 15%).
Please note that the conversion rate and session duration are random for this mock dataset, so it won't necessarily reflect real-world distributions or conversion rates.
You can adjust the table size, session duration range, conversion rate, or other parameters as needed for your specific use case.
First, example, without ML calculates the overall conversion rate for each variant:
SELECT
variant,
COUNTIF(conversion = 1) AS conversions,
COUNT(*) AS total_users,
COUNTIF(conversion = 1) / COUNT(*) AS conversion_rate
FROM
`user_analytics.ab_test_results`
GROUP BY
variant;
This query returns the number of conversions (users who completed a desired action, like making a purchase or signing up for a newsletter) and the total number of users for each variant, as well as the conversion rate (the percentage of users who converted).
To determine if the difference in conversion rates is statistically significant, you could perform a chi-square test. BigQuery doesn't have built-in functionality for this, but you can compute the chi-square statistic and p-value in a programming language like Python or R using the result of the previous query.
If you want to dig deeper, you could also look at the effect of the variant on session duration:
SELECT
variant,
AVG(session_duration) AS average_session_duration
FROM
`user_analytics.ab_test_results`
GROUP BY
variant;
This query returns the average session duration for each variant. A significant difference in session duration could indicate that one variant is more engaging than the other.
By analysing A/B test results in BigQuery, you can use data to make informed decisions about which variant to implement. This can lead to improved user experience, higher conversion rates, and increased revenue.
Some of the most commonly used BigQuery functions for A/B testing include:
COUNT()
- This function counts the number of rows in a table. This can be used to track the number of users who have seen each version of a product. SUM()
- This function sums the values in a column. This can be used to track the number of conversions, clicks, or other metrics for each version of a product. AVG()
- This function calculates the average value in a column. This can be used to track the average time spent on a page, the average number of pages viewed, or other metrics. STDDEV()
- This function calculates the standard deviation of a column. This can be used to track the variation in the values in a column. CONFIDENCE_INTERVAL()
- This function calculates the confidence interval for a column. This can be used to determine the statistical significance of the results of an A/B test.
Next, find BigQuery ML code to check the A/B test using the mock data created.
Create a model to predict whether a user will convert¶
CREATE MODEL my_model
OPTIONS (
MODEL_TYPE = 'LINEAR_REGRESSION',
INPUT_LABEL = 'variant',
OUTPUT_LABEL = 'conversion'
);
# Train the model on the A/B test data
TRAIN MODEL my_model
OPTIONS (
DATA_SET = 'user_analytics.ab_test_results'
);
Predict whether a user will convert on each version of the homepage¶
SELECT
variant,
prediction
FROM
user_analytics.ab_test_results
WHERE
variant IN ('A', 'B');
# Compare the predictions to see which version of the homepage is more likely to generate a conversion
SELECT
variant,
COUNT(*) AS num_users,
SUM(prediction) AS num_predictions,
AVG(prediction) AS avg_prediction
FROM
(
SELECT
variant,
prediction
FROM
user_analytics.ab_test_results
WHERE
variant IN ('A', 'B')
) AS t
GROUP BY
variant;
Print the results¶
SELECT
variant,
num_users,
num_predictions,
avg_prediction
FROM
(
SELECT
variant,
COUNT(*) AS num_users,
SUM(prediction) AS num_predictions,
AVG(prediction) AS avg_prediction
FROM
(
SELECT
variant,
prediction
FROM
user_analytics.ab_test_results
WHERE
variant IN ('A', 'B')
) AS t
GROUP BY
variant
) AS t
ORDER BY
avg_prediction DESC;
Make a decision about which version of the homepage to use¶
IF
(
SELECT
avg_prediction
FROM
(
SELECT
variant,
AVG(prediction) AS avg_prediction
FROM
(
SELECT
variant,
prediction
FROM
user_analytics.ab_test_results
WHERE
variant IN ('A', 'B')
) AS t
GROUP BY
variant
) AS t
WHERE
variant = 'A'
) >
(
SELECT
avg_prediction
FROM
(
SELECT
variant,
AVG(prediction) AS avg_prediction
FROM
(
SELECT
variant,
prediction
FROM
user_analytics.ab_test_results
WHERE
variant IN ('A', 'B')
) AS t
GROUP BY
variant
) AS t
WHERE
variant = 'B'
)
THEN
SELECT
'Use version A'
ELSE
SELECT
'Use version B'
END AS decision;
The results:
Use Case 7: User Segmentation based on Website Usage¶
Let's create a use case where we want to segment users based on their website usage patterns. This is a common use case in digital marketing, where different user segments can be targeted with tailored campaigns.
Assume we have a table called website_usage
in the user_analytics
dataset. This table contains columns user_id
, session_duration
(average duration of a user's sessions in seconds), pages_per_session
(average number of pages viewed per session), and bounce_rate
(the percentage of sessions where the user left the site after viewing only one page).
Before we start, you can generate mock data to simulate this scenario. Below is a SQL code to generate a mock dataset in BigQuery:
CREATE OR REPLACE TABLE `user_analytics.website_usage` (
user_id STRING,
session_duration INT64,
pages_per_session INT64,
bounce_rate FLOAT64
) AS
SELECT
FORMAT("USER-%05d", x) AS user_id,
CAST(600 + RAND() * 1800 AS INT64) AS session_duration, -- Random session durations between 10-30 minutes
CAST(5 + RAND() * 20 AS INT64) AS pages_per_session, -- Random pages per session between 5 and 25
ROUND(RAND(), 2) AS bounce_rate -- Random bounce rate between 0 and 1
FROM (
SELECT
GENERATE_ARRAY(1, 10000) AS numbers
) t, UNNEST(t.numbers) x;
This code creates a website_usage
table in the user_analytics
dataset with 10,000 rows. Each row represents a user's website usage with the user_id
string generated in a similar way as before. The session_duration
, pages_per_session
, and bounce_rate
values are randomly generated with ranges that might be reasonable for a typical website, but these ranges can be adjusted to better match your specific use case.
Once you have this data, you can run the previous clustering SQL statements to create your user segmentation model and predict the segment for each user.
Now. we can use BigQuery ML's k-means clustering algorithm to create clusters based on these features:
CREATE OR REPLACE MODEL `user_analytics.user_segmentation_model`
OPTIONS(model_type='kmeans',
num_clusters=3) AS
SELECT
session_duration,
pages_per_session,
bounce_rate
FROM
`user_analytics.website_usage`;
This query creates a model that groups users into three clusters based on the three features we selected.
After the model is trained, we can use it to assign each user to a cluster:
SELECT
user_id,
centroid_id AS cluster_id
FROM
ML.PREDICT(MODEL `user_analytics.user_segmentation_model`,
(SELECT
user_id,
session_duration,
pages_per_session,
bounce_rate
FROM
`user_analytics.website_usage`));
This query returns each user's user_id
and their assigned cluster_id
. This information can be used to inform marketing strategies, website design, and more.
Note: K-means clustering is an unsupervised machine learning algorithm. The number of clusters (specified by num_clusters
in the OPTIONS
clause) is a hyperparameter that you need to choose. There are various methods to select an appropriate number of clusters, such as the elbow method or silhouette analysis.
Alternative code: