Big Query Use cases¶

Estimated time to read: 8 minutes

Use Case 6: A/B Testing Analysis¶

A/B testing is a common method for comparing the performance of two different versions of a web page, advertisement, or other marketing asset. With BigQuery, you can use SQL queries to analyse the results of an A/B test.

Assuming you have a dataset called user_analytics with a table called ab_test_results containing columns user_id, variant (A or B), session_duration, and conversion (1 if the user converted, 0 otherwise), you could analyse the results as follows:

Creating a mock dataset for this A/B test analysis can be done using SQL in BigQuery. For example, let's create a simple table named ab_test_results with fields for user_id, variant, session_duration, and conversion.

Here's an SQL code to create this table and populate it with some sample data:

SQL

CREATE OR REPLACE TABLE `user_analytics.ab_test_results` (
  user_id STRING,
  variant STRING,
  session_duration INT64,
  conversion INT64
) AS
SELECT
  FORMAT("USER-%05d", x) AS user_id,
  IF(MOD(x, 2) = 0, "A", "B") AS variant,
  CAST(300 + RAND() * 300 AS INT64) AS session_duration,
  IF(RAND() < 0.15, 1, 0) AS conversion
FROM (
  SELECT
    GENERATE_ARRAY(1, 10000) AS numbers
) t, UNNEST(t.numbers) x;

This code creates a table with 10,000 rows. Each row represents a user. The user_id is a string like "USER-00001", "USER-00002", etc. The variant is "A" for even-numbered users and "B" for odd-numbered users. The session_duration is a random number between 300 and 600 (simulating the duration of a session in seconds), and conversion is a random binary value (0 or 1) with a probability of 0.15 for a conversion (simulating a conversion rate of 15%).

Please note that the conversion rate and session duration are random for this mock dataset, so it won't necessarily reflect real-world distributions or conversion rates.

You can adjust the table size, session duration range, conversion rate, or other parameters as needed for your specific use case.

First, example, without ML calculates the overall conversion rate for each variant:

SQL

SELECT
  variant,
  COUNTIF(conversion = 1) AS conversions,
  COUNT(*) AS total_users,
  COUNTIF(conversion = 1) / COUNT(*) AS conversion_rate
FROM
  `user_analytics.ab_test_results`
GROUP BY
  variant;

This query returns the number of conversions (users who completed a desired action, like making a purchase or signing up for a newsletter) and the total number of users for each variant, as well as the conversion rate (the percentage of users who converted).

To determine if the difference in conversion rates is statistically significant, you could perform a chi-square test. BigQuery doesn't have built-in functionality for this, but you can compute the chi-square statistic and p-value in a programming language like Python or R using the result of the previous query.

If you want to dig deeper, you could also look at the effect of the variant on session duration:

SQL

SELECT
  variant,
  AVG(session_duration) AS average_session_duration
FROM
  `user_analytics.ab_test_results`
GROUP BY
  variant;

This query returns the average session duration for each variant. A significant difference in session duration could indicate that one variant is more engaging than the other.

By analysing A/B test results in BigQuery, you can use data to make informed decisions about which variant to implement. This can lead to improved user experience, higher conversion rates, and increased revenue.

Some of the most commonly used BigQuery functions for A/B testing include:

COUNT() - This function counts the number of rows in a table. This can be used to track the number of users who have seen each version of a product. SUM() - This function sums the values in a column. This can be used to track the number of conversions, clicks, or other metrics for each version of a product. AVG() - This function calculates the average value in a column. This can be used to track the average time spent on a page, the average number of pages viewed, or other metrics. STDDEV() - This function calculates the standard deviation of a column. This can be used to track the variation in the values in a column. CONFIDENCE_INTERVAL() - This function calculates the confidence interval for a column. This can be used to determine the statistical significance of the results of an A/B test.

Next, find BigQuery ML code to check the A/B test using the mock data created.

Create a model to predict whether a user will convert¶

SQL

CREATE MODEL my_model
OPTIONS (
  MODEL_TYPE = 'LINEAR_REGRESSION',
  INPUT_LABEL = 'variant',
  OUTPUT_LABEL = 'conversion'
);

# Train the model on the A/B test data
TRAIN MODEL my_model
OPTIONS (
  DATA_SET = 'user_analytics.ab_test_results'
);

Predict whether a user will convert on each version of the homepage¶

SQL

SELECT
  variant,
  prediction
FROM
  user_analytics.ab_test_results
WHERE
  variant IN ('A', 'B');

# Compare the predictions to see which version of the homepage is more likely to generate a conversion
SELECT
  variant,
  COUNT(*) AS num_users,
  SUM(prediction) AS num_predictions,
  AVG(prediction) AS avg_prediction
FROM
  (
    SELECT
      variant,
      prediction
    FROM
      user_analytics.ab_test_results
    WHERE
      variant IN ('A', 'B')
  ) AS t
GROUP BY
  variant;

Print the results¶

SQL

SELECT
  variant,
  num_users,
  num_predictions,
  avg_prediction
FROM
  (
    SELECT
      variant,
      COUNT(*) AS num_users,
      SUM(prediction) AS num_predictions,
      AVG(prediction) AS avg_prediction
    FROM
      (
        SELECT
          variant,
          prediction
        FROM
          user_analytics.ab_test_results
        WHERE
          variant IN ('A', 'B')
      ) AS t
    GROUP BY
      variant
  ) AS t
ORDER BY
  avg_prediction DESC;

Make a decision about which version of the homepage to use¶

SQL

IF
  (
    SELECT
      avg_prediction
    FROM
      (
        SELECT
          variant,
          AVG(prediction) AS avg_prediction
        FROM
          (
            SELECT
              variant,
              prediction
            FROM
              user_analytics.ab_test_results
            WHERE
              variant IN ('A', 'B')
          ) AS t
        GROUP BY
          variant
      ) AS t
    WHERE
      variant = 'A'
  ) >
  (
    SELECT
      avg_prediction
    FROM
      (
        SELECT
          variant,
          AVG(prediction) AS avg_prediction
        FROM
          (
            SELECT
              variant,
              prediction
            FROM
              user_analytics.ab_test_results
            WHERE
              variant IN ('A', 'B')
          ) AS t
        GROUP BY
          variant
      ) AS t
    WHERE
      variant = 'B'
  )
THEN
  SELECT
    'Use version A'
ELSE
  SELECT
    'Use version B'
END AS decision;

The results:

Use Case 7: User Segmentation based on Website Usage¶

Let's create a use case where we want to segment users based on their website usage patterns. This is a common use case in digital marketing, where different user segments can be targeted with tailored campaigns.

Assume we have a table called website_usage in the user_analytics dataset. This table contains columns user_id, session_duration (average duration of a user's sessions in seconds), pages_per_session (average number of pages viewed per session), and bounce_rate (the percentage of sessions where the user left the site after viewing only one page).

Before we start, you can generate mock data to simulate this scenario. Below is a SQL code to generate a mock dataset in BigQuery:

SQL

CREATE OR REPLACE TABLE `user_analytics.website_usage` (
  user_id STRING,
  session_duration INT64,
  pages_per_session INT64,
  bounce_rate FLOAT64
) AS
SELECT
  FORMAT("USER-%05d", x) AS user_id,
  CAST(600 + RAND() * 1800 AS INT64) AS session_duration,  -- Random session durations between 10-30 minutes
  CAST(5 + RAND() * 20 AS INT64) AS pages_per_session,  -- Random pages per session between 5 and 25
  ROUND(RAND(), 2) AS bounce_rate  -- Random bounce rate between 0 and 1
FROM (
  SELECT
    GENERATE_ARRAY(1, 10000) AS numbers
) t, UNNEST(t.numbers) x;

This code creates a website_usage table in the user_analytics dataset with 10,000 rows. Each row represents a user's website usage with the user_id string generated in a similar way as before. The session_duration, pages_per_session, and bounce_rate values are randomly generated with ranges that might be reasonable for a typical website, but these ranges can be adjusted to better match your specific use case.

Once you have this data, you can run the previous clustering SQL statements to create your user segmentation model and predict the segment for each user.

Now. we can use BigQuery ML's k-means clustering algorithm to create clusters based on these features:

SQL

CREATE OR REPLACE MODEL `user_analytics.user_segmentation_model`
OPTIONS(model_type='kmeans',
        num_clusters=3) AS
SELECT
  session_duration,
  pages_per_session,
  bounce_rate
FROM
  `user_analytics.website_usage`;

This query creates a model that groups users into three clusters based on the three features we selected.

After the model is trained, we can use it to assign each user to a cluster:

SQL

SELECT
  user_id,
  centroid_id AS cluster_id
FROM
  ML.PREDICT(MODEL `user_analytics.user_segmentation_model`,
    (SELECT
      user_id,
      session_duration,
      pages_per_session,
      bounce_rate
    FROM
      `user_analytics.website_usage`));

This query returns each user's user_id and their assigned cluster_id. This information can be used to inform marketing strategies, website design, and more.

Note: K-means clustering is an unsupervised machine learning algorithm. The number of clusters (specified by num_clusters in the OPTIONS clause) is a hyperparameter that you need to choose. There are various methods to select an appropriate number of clusters, such as the elbow method or silhouette analysis.

Alternative code:

Create a k-means model with 3 clusters¶

SQL

CREATE MODEL my_model
OPTIONS (
  MODEL_TYPE = 'KMEANS',
  NUM_CLUSTERS = 3
);

Train the model on the website usage data¶

SQL

TRAIN MODEL my_model
OPTIONS (
  DATA_SET = 'user_analytics.website_usage'
);

Predict which cluster each user belongs to¶

SQL

SELECT
  user_id,
  prediction
FROM
  user_analytics.website_usage
JOIN
  ML.PREDICT(my_model) AS m USING (user_id);