
BigQuery best practices for starters

Estimated time to read: 15 minutes

Here are 20 advanced tips and tricks to simplify your work with BigQuery:

  1. Use partitioned tables: Partition your tables on a DATE or TIMESTAMP column to reduce the amount of data scanned and improve query performance.
  2. Take advantage of clustering: Cluster your tables on frequently filtered or joined columns to further optimise query performance and reduce costs.
  3. Use the EXTRACT function: Use EXTRACT to work with date parts instead of string functions, which improves readability and performance (see the short example after this list).
  4. Limit columns in SELECT: Select only the columns you need in your queries to minimise data processing and improve performance.
  5. Use approximate aggregation functions: Use APPROX_COUNT_DISTINCT and other approximate aggregation functions for faster results on large datasets.
  6. Cache query results: BigQuery caches query results for 24 hours, so rerunning the same query within that time frame is faster and does not incur additional cost.
  7. Use WITH clauses for subqueries: Use WITH clauses (Common Table Expressions) to make your queries more readable and maintainable.
  8. Utilize window functions: Use window functions (e.g., ROW_NUMBER, RANK) to perform complex calculations, like running totals or moving averages.
  9. Employ ARRAY_AGG and UNNEST: Use ARRAY_AGG and UNNEST to work with arrays and simplify queries with complex data structures.
  10. Take advantage of STRUCT: Use STRUCT to create complex data structures and perform operations on them within your queries.
  11. Use User-Defined Functions (UDFs): Create UDFs in JavaScript or SQL for custom calculations that aren't supported by built-in functions.
  12. Schedule queries: Schedule recurring queries in BigQuery to automate data processing and reporting tasks.
  13. Dry run queries: Use the --dry_run flag with the bq command-line tool to estimate query costs without actually running the query.
  14. Monitor query performance: Analyze query performance using BigQuery's built-in monitoring tools and the INFORMATION_SCHEMA views.
  15. Use destination tables: Store query results in a destination table to save intermediate results for further analysis or reporting.
  16. Batch load data: Load data in batches to reduce the number of API calls and improve overall performance.
  17. Use federated data sources: Query data stored in external sources like Google Sheets or Cloud Storage without loading it into BigQuery.
  18. Optimize data storage: Use BigQuery's columnar storage format to store data efficiently and minimise storage costs.
  19. Control access: Use Identity and Access Management (IAM) to manage access to datasets, tables, and views, ensuring the right people have the right permissions.
  20. Stream data into BigQuery: Use streaming inserts for real-time data ingestion and analysis, but be aware of the associated costs and limitations.
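Most of the tips above are demonstrated in the worked examples later in this article. Tip 3 (EXTRACT) is not, so here is a small illustrative query; it assumes the mydataset.orders table created in the next section and simply counts orders per month:

SQL
SELECT
  EXTRACT(YEAR FROM order_date) AS order_year,
  EXTRACT(MONTH FROM order_date) AS order_month,
  COUNT(*) AS num_orders
FROM
  mydataset.orders
GROUP BY
  order_year, order_month
ORDER BY
  order_year, order_month;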

These tips and tricks will help you simplify your work with BigQuery, optimise your queries, and get the most out of your data analysis. By incorporating these best practices into your workflow, you'll improve performance, reduce costs, and streamline your data processing tasks.

Below, I will provide example code and steps that put some of the tips above into practice. We'll start by creating a dataset and a few tables to play with.

Create a dataset

To create a dataset in BigQuery, you can use the Google Cloud Console or the bq command-line tool. Here's the command to create a dataset using the bq tool:

Bash
bq mk mydataset

Replace mydataset with the desired name for your dataset.

Create tables

Let's create two tables, orders and order_items. We'll partition and cluster the orders table for better performance.

Create the orders table:

SQL
CREATE TABLE mydataset.orders (
  order_id INT64,
  customer_id INT64,
  order_date DATE,
  status STRING
)
PARTITION BY order_date
CLUSTER BY customer_id;

Create the order_items table:

SQL
CREATE TABLE mydataset.order_items (
  order_item_id INT64,
  order_id INT64,
  product_id INT64,
  quantity INT64,
  price FLOAT64
);

Load data

Load sample data into the tables. You can upload your data to Google Cloud Storage and then load it into BigQuery using the bq tool or the BigQuery Console.
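For example, if the sample data has been exported as CSV files (with a header row) to a bucket called mybucket, commands along these lines would load it; the bucket and file names are placeholders, so adjust them to your own data:

Bash
bq load --source_format=CSV --skip_leading_rows=1 mydataset.orders gs://mybucket/orders.csv
bq load --source_format=CSV --skip_leading_rows=1 mydataset.order_items gs://mybucket/order_items.csv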

Example query using tips

Now, let's write an example query that utilises some of the tips we discussed earlier.

SQL
WITH daily_order_stats AS (
  SELECT
    order_date,
    COUNT(DISTINCT o.order_id) AS num_orders,
    APPROX_COUNT_DISTINCT(customer_id) AS unique_customers,
    SUM(oi.total_price) AS total_revenue
  FROM
    mydataset.orders o
  JOIN (
    SELECT
      order_id,
      SUM(quantity * price) AS total_price
    FROM
      mydataset.order_items
    GROUP BY
      order_id
  ) oi
  ON o.order_id = oi.order_id
  WHERE
    order_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY
    order_date
),
moving_average AS (
  SELECT
    order_date,
    total_revenue,
    AVG(total_revenue) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS revenue_moving_average
  FROM
    daily_order_stats
)
SELECT
  order_date,
  total_revenue,
  revenue_moving_average
FROM
  moving_average
ORDER BY
  order_date;

This query calculates the daily revenue and 7-day moving average for the last 30 days, using the following tips:

  • WITH clauses for subqueries (Tip 7)
  • APPROX_COUNT_DISTINCT for an approximate unique customer count (Tip 5)
  • A JOIN against a subquery that pre-aggregates order totals
  • A filter on the partitioning column order_date, built with DATE_SUB and CURRENT_DATE, so BigQuery can prune partitions (Tip 1)
  • A window function for the moving average calculation (Tip 8)

This example demonstrates how to apply several advanced tips and tricks when working with BigQuery. By following these guidelines, you can simplify your work, optimize query performance, and get the most out of your data analysis.

Here's a continuation of the example that demonstrates more advanced tips in BigQuery:

Create a User-Defined Function (UDF)

Let's create a simple JavaScript UDF to calculate the price with a discount applied:

SQL
CREATE TEMPORARY FUNCTION discounted_price(price FLOAT64, discount FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
  return price * (1 - discount);
""";

Query using the UDF

Now, let's use this UDF in a query to calculate the discounted price for each order item. A temporary function only exists for the duration of the query job that defines it, so run the CREATE TEMPORARY FUNCTION statement and this query together as a single script:

SQL
SELECT
  order_item_id,
  product_id,
  quantity,
  price,
  discounted_price(price, 0.1) AS discounted_price
FROM
  mydataset.order_items;

This query calculates the discounted price for each order item with a 10% discount, utilizing the UDF we created (Tip 11).

Create a View with STRUCT and ARRAY_AGG

Let's create a view that aggregates the order items into an array, using STRUCT for complex data structures and ARRAY_AGG to build an array:

SQL
CREATE VIEW mydataset.orders_with_items AS
SELECT
  o.order_id,
  o.customer_id,
  o.order_date,
  o.status,
  ARRAY_AGG(STRUCT(oi.order_item_id, oi.product_id, oi.quantity, oi.price)) AS items
FROM
  mydataset.orders o
JOIN
  mydataset.order_items oi
ON o.order_id = oi.order_id
GROUP BY
  o.order_id, o.customer_id, o.order_date, o.status;

This view combines the orders and order_items tables using STRUCT and ARRAY_AGG (Tips 10 and 9).

Query the View with UNNEST

Now, let's query the view and use UNNEST to flatten the items array:

SQL
SELECT
  order_id,
  customer_id,
  order_date,
  status,
  item.order_item_id,
  item.product_id,
  item.quantity,
  item.price
FROM
  mydataset.orders_with_items,
  UNNEST(items) AS item;

This query uses UNNEST to work with the array data in the view (Tip 9).

Schedule a Query

You can schedule a recurring query in BigQuery to automate data processing and reporting tasks (Tip 12). To do this, use the BigQuery Console:

  1. Navigate to the BigQuery Console.
  2. Click on "Scheduled queries" in the left-hand menu.
  3. Click "New scheduled query" and configure the query, schedule, and destination table for the query results.
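If you prefer the command line, a scheduled query can also be created with the bq tool. The following is only a sketch; the target dataset, display name, schedule, destination table, and query are placeholders to adapt to your project:

Bash
bq mk \
  --transfer_config \
  --data_source=scheduled_query \
  --target_dataset=mydataset \
  --display_name="Daily order count" \
  --schedule="every 24 hours" \
  --params='{"query":"SELECT CURRENT_DATE() AS run_date, COUNT(*) AS num_orders FROM mydataset.orders","destination_table_name_template":"daily_order_counts","write_disposition":"WRITE_APPEND"}'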

Monitor Query Performance

Analyze your query performance using BigQuery's built-in monitoring tools and the INFORMATION_SCHEMA views (Tip 14). To view query history and performance:

  1. Navigate to the BigQuery Console.
  2. Click on "Query history" in the left-hand menu.
  3. Select a query to view details such as execution time, bytes processed, and slot usage.

You can also query the INFORMATION_SCHEMA views to analyze query performance. For example, to retrieve the 10 most recent queries and their execution times:

SQL
SELECT
  query,
  start_time,
  end_time,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  state = 'DONE'
  AND query IS NOT NULL
ORDER BY
  end_time DESC
LIMIT 10;

Replace region-us with your specific region.

By utilizing these additional tips and the examples provided, you can further optimize your work with BigQuery and enhance your data analysis experience. The following sections walk through a few more of them.

Dry Run Queries

You can use the --dry_run flag with the bq command-line tool to estimate query costs without actually running the query (Tip 13):

Bash
bq query --dry_run --nouse_legacy_sql "SELECT * FROM mydataset.orders WHERE order_date = '2023-01-01'"

This command will return the estimated bytes processed without incurring any cost or executing the query.
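The same estimate is available from the client libraries. Here is a minimal sketch using the Python client, reusing the mydataset.orders table from earlier; QueryJobConfig(dry_run=True) validates the query and reports the bytes it would scan without executing it:

Python
from google.cloud import bigquery

client = bigquery.Client()

# A dry run validates the query and estimates the bytes it would process
# without running it; disabling the cache keeps the estimate realistic.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query_job = client.query(
    "SELECT order_id, status FROM mydataset.orders WHERE order_date = '2023-01-01'",
    job_config=job_config,
)

print(f"This query would process {query_job.total_bytes_processed} bytes.")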

Control Access

Use Identity and Access Management (IAM) to manage access to datasets, tables, and views, ensuring the right people have the right permissions (Tip 19). To manage access, follow these steps in the Google Cloud Console:

  1. Navigate to the BigQuery Console.
  2. Click on the dataset or table you want to manage access for.
  3. Click on the "Share dataset" or "Share table" button.
  4. Add members and assign roles according to your access requirements.
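Access can also be granted with SQL DCL statements, which is convenient for scripted setups. A hedged example, assuming a hypothetical user alice@example.com who only needs to read the dataset:

SQL
GRANT `roles/bigquery.dataViewer`
ON SCHEMA mydataset
TO "user:alice@example.com";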

Stream Data into BigQuery

Use streaming inserts for real-time data ingestion and analysis (Tip 20). Streaming data into BigQuery can be done using the BigQuery API, client libraries, or third-party tools. Here's an example using the Python client library:

Python
from google.cloud import bigquery

client = bigquery.Client()

# Table to stream rows into; adjust to your own dataset and table.
table_id = "mydataset.orders"

rows_to_insert = [
    {"order_id": 1, "customer_id": 123, "order_date": "2023-01-01", "status": "completed"},
    {"order_id": 2, "customer_id": 456, "order_date": "2023-01-02", "status": "processing"},
]

# insert_rows_json streams the rows and returns a list of per-row errors
# (an empty list means every row was accepted).
errors = client.insert_rows_json(table_id, rows_to_insert)
if not errors:
    print("Rows inserted successfully.")
else:
    print(f"Encountered errors: {errors}")

Replace mydataset.orders with your own dataset and table. Note that streaming inserts have limitations and costs associated with them; review the documentation before using this feature.

Federated Data Sources

Query data stored in external sources like Google Sheets or Cloud Storage without loading it into BigQuery (Tip 17). To query data stored in Google Cloud Storage, create an external table:

SQL
CREATE EXTERNAL TABLE mydataset.external_orders_csv (
  order_id INT64,
  customer_id INT64,
  order_date DATE,
  status STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/orders.csv']
);

Now, you can query the external table just like any other BigQuery table:

SQL
SELECT * FROM mydataset.external_orders_csv;

These additional examples demonstrate how you can make use of advanced tips and tricks in BigQuery to simplify your work, optimize query performance, and effectively analyze your data. By incorporating these best practices into your workflow, you'll improve performance, reduce costs, and streamline your data processing tasks.

We have already covered the majority of the 20 tips and tricks for BigQuery. However, I will provide a few more examples related to the remaining tips that haven't been fully explored:

Optimize Data Storage

BigQuery always stores table data in its own compressed, columnar format, so storage efficiency mostly comes from how you model and load your data (Tip 18). When loading data into BigQuery, consider the following recommendations:

  • Load data in a compressed, self-describing format such as Avro or Parquet; these files load quickly and are converted efficiently into BigQuery's columnar storage (a sample load command follows this list).
  • Choose the appropriate data types for your columns, such as INT64 for integers or DATE for dates; avoid storing numeric or date values as STRING.
  • Normalise your data, but be mindful of the trade-offs: too much normalisation can lead to complex queries and hurt performance.
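As a sketch of the first recommendation, a Parquet load lets BigQuery read the schema from the file and keeps the transfer compact; the bucket, file, and table names here are placeholders:

Bash
bq load --source_format=PARQUET mydataset.orders_parquet gs://mybucket/orders.parquet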

Limit Columns in SELECT

Select only the columns you need in your queries to minimise data processing and improve performance (Tip 4). Here's an example of a query that retrieves only the necessary columns:

SQL
SELECT
  order_id,
  customer_id,
  status
FROM
  mydataset.orders
WHERE
  order_date >= '2023-01-01';

This query retrieves only the order_id, customer_id, and status columns, reducing the amount of data processed and improving query performance.

Batch Load Data

Load data in batches to reduce the number of API calls and improve overall performance (Tip 16). You can load data into BigQuery using the bq command-line tool, API, or client libraries. When using the bq tool, you can load multiple files at once:

Bash
bq load --source_format=CSV mydataset.orders gs://mybucket/orders_*.csv

This command loads all CSV files with names starting with orders_ from the specified Google Cloud Storage bucket.

Use Destination Tables

Store query results in a destination table to save intermediate results for further analysis or reporting (Tip 15). You can use the bq command-line tool to run a query and save the results to a destination table:

Bash
bq query --nouse_legacy_sql --destination_table=mydataset.query_results "SELECT * FROM mydataset.orders WHERE order_date >= '2023-01-01'"

This command saves the query results in a new table called query_results within the mydataset dataset.

By optimising your data storage, selecting only the necessary columns, loading data in batches, and storing query results in destination tables, you'll make your work with BigQuery more efficient: better performance, lower costs, and more streamlined data processing.

As we have covered most of the advanced tips and tricks for working with BigQuery, I will now provide some general advice to further improve your experience with the platform.

Caching

BigQuery automatically caches query results for a certain period, which can help speed up query execution (Tip 6). When you run a query, BigQuery checks whether the same query has been executed recently. If the cached results are available, the platform returns them instead of re-executing the query. Keep in mind that cached results might not be suitable in all cases, especially when real-time data analysis is required.

To force BigQuery to bypass the cache and execute the query, you can use the bq command-line tool:

Bash
bq query --nouse_legacy_sql --nouse_cache "SELECT * FROM mydataset.orders WHERE order_date >= '2023-01-01'"

Alternatively, you can disable caching for a specific query by unchecking the "Use cached results" option in the BigQuery Console.
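The cache can also be controlled per query from the client libraries. A minimal sketch with the Python client, reusing the mydataset.orders table:

Python
from google.cloud import bigquery

client = bigquery.Client()

# use_query_cache=False forces BigQuery to re-execute the query instead of
# returning cached results.
job_config = bigquery.QueryJobConfig(use_query_cache=False)

rows = client.query(
    "SELECT COUNT(*) AS num_orders FROM mydataset.orders WHERE order_date >= '2023-01-01'",
    job_config=job_config,
).result()

for row in rows:
    print(row.num_orders)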

Monitor BigQuery Usage

Keep track of your BigQuery usage and costs by monitoring your resources with Google Cloud Monitoring and setting up alerts. To monitor your BigQuery usage, follow these steps:

  1. Navigate to the Google Cloud Console.
  2. Click on "Monitoring" in the left-hand menu.
  3. In the Monitoring Dashboard, you can customise charts to display metrics related to BigQuery, such as query costs, storage usage, and more.

To set up alerts, use the following steps:

  1. Navigate to the Google Cloud Console.
  2. Click on "Monitoring" in the left-hand menu.
  3. Click on "Alerting" in the Monitoring section.
  4. Click on "Create Policy" and configure the alert conditions, such as setting a threshold for query costs or storage usage.

By following these steps, you can monitor your BigQuery usage and costs more effectively, enabling you to optimise your resources and avoid unexpected charges.

Use Partition Pruning and Clustering

Partition pruning and clustering can help optimise query performance and reduce costs by narrowing down the amount of data that needs to be scanned (Tips 1 and 2). When creating a table, consider using partitioning and clustering:

SQL
CREATE TABLE mydataset.optimized_orders (
  order_id INT64,
  customer_id INT64,
  order_date DATE,
  status STRING
)
PARTITION BY order_date
CLUSTER BY customer_id;

In this example, the table is partitioned by order_date and clustered by customer_id. When querying this table, BigQuery scans only the relevant partitions and, within them, only the blocks that match the clustering column, reducing the amount of data processed and improving query performance.
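For instance, a query that filters on the partitioning column (and, ideally, on the clustering column as well) only touches the matching data; the dates and customer ID below are placeholders:

SQL
SELECT
  order_id,
  customer_id,
  status
FROM
  mydataset.optimized_orders
WHERE
  order_date BETWEEN '2023-01-01' AND '2023-01-31'
  AND customer_id = 123;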

Incorporating these additional best practices into your BigQuery workflow will improve performance, reduce costs, and streamline your data processing tasks. Continuously explore the new features and improvements BigQuery offers to stay up to date with the platform's capabilities, and always look for opportunities to optimise your work.

As we have already covered numerous advanced tips and tricks for BigQuery, let's focus on some general strategies to stay informed about new features, best practices, and case studies related to BigQuery and other Google Cloud services.

Stay Up-to-Date with BigQuery Release Notes

Regularly review BigQuery release notes to learn about new features, improvements, and bug fixes. Staying up-to-date with the latest changes ensures you're leveraging the full potential of the platform. To access BigQuery release notes, visit the Google Cloud documentation and select "Release Notes" for BigQuery.

Follow Google Cloud Blog and BigQuery Social Media Channels

Subscribe to the Google Cloud Blog to stay informed about new features, best practices, and customer stories across various Google Cloud services, including BigQuery. You can also follow BigQuery's social media channels, such as Twitter and LinkedIn, for updates and relevant news.

Attend BigQuery Webinars and Events

Participate in BigQuery webinars, workshops, and events to learn from experts, gain insights into real-world use cases, and network with other professionals. Google Cloud offers a variety of events, both online and in-person, covering a wide range of topics. Visit the Google Cloud events page to find upcoming events that interest you.

Engage with the BigQuery Community

Join BigQuery-focused communities, such as the google-bigquery tag on Stack Overflow, the Google Cloud Community forums, and relevant Slack channels and mailing lists, to share your knowledge, ask questions, and learn from others.

By actively participating in these communities, you can learn from the experiences of others, get help with specific issues, and stay informed about the latest trends and best practices in BigQuery.

Learn from BigQuery Case Studies

Study BigQuery case studies to understand how different organisations use the platform to solve their business challenges. By analysing real-world use cases, you can gain insights into best practices, discover innovative solutions, and understand the benefits of BigQuery in various industries. Access BigQuery case studies on the Google Cloud website.

Implementing these strategies will help you stay informed about the latest developments in BigQuery, improve your skills, and adapt to new features and best practices. Continuously learning and engaging with the BigQuery community will enable you to get the most out of the platform and enhance your data analysis capabilities.