Skip to content

Becoming an Advanced BigQuery User

Estimated time to read: 8 minutes

BigQuery is a powerful, fully-managed data warehouse and analytics platform provided by Google Cloud. It is designed to help you analyze large datasets in real time. In this guide, we will discuss how to become an advanced user of BigQuery, focusing on Google Analytics data, and provide example code for creating reports for a website.

1. BigQuery Basics

Before diving into advanced techniques, it's crucial to understand BigQuery's essential concepts:

  • Projects: Organizational units within Google Cloud. They store your datasets and provide access control.
  • Datasets: Containers for tables and views.
  • Tables: Two-dimensional data structures that store your data.
  • Views: Virtual tables created from a SQL query on one or more tables.
  • Jobs: Asynchronous tasks that you submit to BigQuery for processing, like loading data or running queries.

2. Google Analytics Integration with BigQuery

To use Google Analytics data in BigQuery, you need to link your Google Analytics account to your Google Cloud project. This is done through the Google Analytics interface. Once linked, BigQuery will start receiving daily exports of your Google Analytics data. It will create a dataset with tables organized by date, which can be queried.

3. SQL in BigQuery

BigQuery uses a dialect of SQL called BigQuery SQL. It is similar to standard SQL but has some unique features and functions. Here are some important differences:

  • BigQuery supports nested and repeated fields, allowing you to work with complex, hierarchical data structures.
  • BigQuery uses approximate aggregation functions (e.g., APPROX_COUNT_DISTINCT) for faster performance on large datasets.
  • BigQuery supports user-defined functions written in JavaScript or SQL.
  • Some traditional SQL functions have different names in BigQuery, like STRING_AGG for GROUP_CONCAT.

4. Example Reports

Now let's dive into some examples of Google Analytics reports using BigQuery SQL. We will use a standard Google Analytics table called ga_sessions_* for these examples.

4.1. Total Number of Sessions by Date

One of the metrics you'll need to monitor for your website is session per day. This metric is a good approximation of the number of visitors and the frequency of visits daily. To calculate the total number of sessions by date:

SQL
SELECT
  date,
  COUNT(*) AS sessions
FROM
  `your_project_id.your_dataset_id.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20220401' AND '20220430'
GROUP BY
  date
ORDER BY
  date;

Notice that dates are imported as strings in the BigQuery table.

4.2. Total Visits by Page Path

Another important metric for your website is pageviews per page path. That will give you an indication of how many times visitors visit a page path. This report shows the total number of visits by page path.

SQL
SELECT
  hits.page.pagePath AS page_path,
  COUNT(*) AS visits
FROM
  `your_project_id.your_dataset_id.ga_sessions_*`,
  UNNEST(hits) AS hits
WHERE
  _TABLE_SUFFIX BETWEEN '20220401' AND '20220430'
  AND hits.type = 'PAGE'
GROUP BY
  page_path
ORDER BY
  pageviews DESC;

4.3. Top Traffic Sources

To identify the source of your traffic you need a report on sources per session. Sources can be organic, referrals, social media, etc. This report shows the top traffic sources by sessions.

SQL
SELECT
  trafficSource.source AS source,
  COUNT(*) AS sessions
FROM
  `your_project_id.your_dataset_id.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20220401' AND '20220430'
GROUP BY
  source
ORDER BY
  sessions DESC
LIMIT 10;

4.4. User Demographics

When your product or service addresses multiple ages and/or genders, a demographics breakdown of visitors will provide you with valuable insights. This report shows user demographics by age bracket and gender.

SQL
SELECT
  user_demographics.ageBracket AS age_bracket,
  user_demographics.gender AS gender,
  COUNT(*) AS users
FROM
  `your_project_id.your_dataset_id.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20220401' AND '20220430'
GROUP BY
age_bracket,
gender
ORDER BY
age_bracket,
gender;

5. Performance Optimization

To become an advanced BigQuery user, you must understand how to optimize your queries for performance. Some essential techniques include:

  • Use APPROX_COUNT_DISTINCT instead of COUNT(DISTINCT) for faster approximations on large datasets.
  • Limit the number of columns in your SELECT statement, only choosing the ones you need.
  • Use the EXTRACT function to filter by date parts instead of using string functions.
  • Use window functions for complex calculations, like running totals or moving averages.
  • Partition your tables by date to reduce the amount of data scanned during queries.
  • Cluster your tables on frequently filtered columns to improve performance.

6. Comparing BigQuery with Other RDBMS

If you have experience with other RDBMS like MySQL, PostgreSQL, Oracle, or MSSQL, there are some key differences to be aware of when transitioning to BigQuery:

Feature BigQuery MySQL PostgreSQL Oracle MSSQL
SQL Dialect BigQuery SQL MySQL SQL PostgreSQL SQL Oracle SQL T-SQL
Standard SQL:2011 Compliant Partial Partial Partial Partial Partial
Nested and Repeated fields Yes No No No No
Approximate aggregation functions Yes No Yes No No
UDFs (User-Defined Functions) JavaScript, SQL SQL SQL, PL/pgSQL PL/SQL SQL, CLR
Window Functions Yes Yes Yes Yes Yes
CTEs (Common Table Expressions) Yes Yes Yes Yes Yes
Full-text Search CONTAINS, CONTAINS_REGEX Full-Text Search Full-Text Search, tsvector Oracle Text Full-Text Search
Geospatial data support Yes Yes Yes Yes Yes
JSON data support Yes Yes Yes Yes Yes
ARRAY data type Yes No Yes No No
ARRAY Functions (e.g., ARRAY_AGG) Yes No Yes Yes Yes

Please note that while there are differences in SQL dialects and specific features between these RDBMS, all of them support most of the core SQL functionality. The main differences between BigQuery and the others include the support for nested and repeated fields, approximate aggregation functions, and the availability of ARRAY data types and functions.

Keep in mind that this comparison matrix is not exhaustive, and there are many other differences between these RDBMS that may impact your decision-making when transitioning to BigQuery. It's essential to thoroughly review the documentation and understand the specific requirements of your use case.

In conclusion, becoming an advanced BigQuery user requires understanding the platform's unique features, mastering BigQuery SQL, and learning to optimize queries for performance. By following this guide and exploring the examples provided, you'll be well on your way to becoming proficient in using BigQuery with Google Analytics data. The main differences between BigQuery and other RDBMS include the managed service, storage-based pricing, support for nested and repeated fields, and approximate aggregation functions. By understanding these differences, you can transition smoothly from other databases and become an advanced BigQuery user.

7. Migrating from Other RDBMS to BigQuery

To help you transition from other RDBMS like MySQL, PostgreSQL, Oracle, or MSSQL to BigQuery, here are some tips and best practices for migrating your data and adjusting your SQL queries:

7.1. Data Migration

When migrating your data from an RDBMS to BigQuery, consider the following steps:

  1. Export your data from the source RDBMS in a format compatible with BigQuery, such as CSV or JSON.
  2. Upload the exported data to Google Cloud Storage.
  3. Use the BigQuery Data Transfer Service or the bq command-line tool to load the data from Google Cloud Storage into BigQuery.

7.2. Schema Design

When designing your BigQuery schema, keep the following in mind:

  1. Use the appropriate data types for your columns, considering BigQuery's support for nested and repeated fields.
  2. Define partitioning and clustering options for large tables to optimize query performance.

7.3. Query Adjustments

When adjusting your SQL queries for BigQuery, consider the following:

  1. Replace traditional SQL functions with their BigQuery SQL equivalents, such as STRING_AGG for GROUP_CONCAT.
  2. Use BigQuery's approximate aggregation functions, like APPROX_COUNT_DISTINCT, for faster performance on large datasets.
  3. Adjust your queries to account for BigQuery's support for nested and repeated fields, using UNNEST and other related functions.
  4. Utilize window functions for complex calculations, like running totals or moving averages.

7.4. Performance Optimization

Optimizing your queries for performance is crucial in BigQuery. Keep these best practices in mind:

  1. Minimize the amount of data scanned by selecting only the necessary columns and applying filters.
  2. Use partitioning and clustering options to improve query performance.
  3. Use the EXTRACT function to filter by date parts, instead of using string functions.
  4. Monitor and analyze your query performance using BigQuery's built-in monitoring tools.

8. Learning Resources

To deepen your knowledge and become an advanced BigQuery user, consider the following learning resources:

  1. BigQuery Documentation - Official documentation from Google Cloud.
  2. BigQuery for Data Warehousing - A Coursera course that covers advanced topics in BigQuery.
  3. BigQuery: The Definitive Guide - A comprehensive book on BigQuery by Valliappa Lakshmanan and Jordan Tigani.
  4. BigQuery SQL Reference - Official SQL reference for BigQuery.

By following these guidelines and resources, you can successfully migrate from other RDBMS to BigQuery and take full advantage of its unique features, performance optimizations, and integration with Google Analytics data. With practice and hands-on experience, you'll become an advanced BigQuery user, allowing you to analyze large datasets and create insightful reports for your website. ß