BigQuery Cheat Sheet¶
Estimated time to read: 8 minutes
Functions¶
String Functions¶
Function Name | Description |
---|---|
CONCAT | Concatenates multiple strings into a single string. |
LENGTH | Returns the length of a string. |
LOWER | Converts a string to lowercase. |
UPPER | Converts a string to uppercase. |
TRIM | Removes leading and trailing spaces from a string. |
REPLACE | Replaces occurrences of a substring within a string. |
SUBSTR | Extracts a substring from a string. |
REGEXP_REPLACE | Replaces parts of a string that match a regular expression. |
LPAD() | Pads a string with a specified character to a given length on the left side. |
RPAD() | Pads a string with a specified character to a given length on the right side. |
SPLIT() | Splits a string into an array based on a specified delimiter. |
REGEXP_CONTAINS() | Tests whether a string contains a substring matching a regular expression. |
REGEXP_EXTRACT() | Extracts the first matching substring from a string based on a regular expression. |
REPEAT() | Repeats a string a specified number of times. |
REVERSE() | Reverses the characters in a string. |
Date and Time Functions¶
Function Name | Description |
---|---|
CURRENT_DATE | Returns the current date. |
CURRENT_TIMESTAMP | Returns the current timestamp. |
DATE_ADD | Adds an interval to a date. |
DATE_DIFF | Computes the difference between two dates. |
DATE_TRUNC | Truncates a date to a specified granularity. |
EXTRACT | Extracts a specific date part from a date, timestamp, or datetime. |
FORMAT_DATETIME | Formats a datetime value as a string. |
PARSE_DATE | Parses a string into a date using a specified format. |
DATETIME_ADD() | Adds an interval to a datetime. |
DATETIME_DIFF() | Computes the difference between two datetimes. |
DATETIME_TRUNC() | Truncates a datetime to a specified granularity. |
FORMAT_TIMESTAMP() | Formats a timestamp value as a string. |
PARSE_TIMESTAMP() | Parses a string into a timestamp using a specified format. |
TIMESTAMP_ADD() | Adds an interval to a timestamp. |
TIMESTAMP_DIFF() | Computes the difference between two timestamps. |
TIMESTAMP_TRUNC() | Truncates a timestamp to a specified granularity. |
Numeric Functions¶
Function Name | Description |
---|---|
ABS | Returns the absolute value of a number. |
CEIL | Returns the smallest integer greater than or equal to a given number. |
FLOOR | Returns the largest integer less than or equal to a given number. |
MOD | Returns the remainder of one number divided by another. |
ROUND | Rounds a number to the nearest integer or a specified number of decimal places. |
Aggregate Functions¶
Function Name | Description |
---|---|
AVG | Calculates the average value of a set of numbers. |
COUNT | Counts the number of non-null values in a column. |
MAX | Returns the maximum value in a column. |
MIN | Returns the minimum value in a column. |
SUM | Calculates the sum of all non-null values in a column. |
Window Functions¶
Function Name | Description |
---|---|
ROW_NUMBER | Assigns a unique number to each row within a result set. |
RANK | Assigns a unique rank to each row within a result set, with the same rank assigned to rows with equal values. |
DENSE_RANK | Assigns a unique rank to each row within a result set, without gaps in the ranking sequence for equal values. |
NTILE | Divides a result set into a specified number of roughly equal groups and assigns a unique group number to each row. It doesn't provide the percentile value! |
Mathematical Functions¶
Function Name | Description |
---|---|
RAND() | Returns a random floating-point value between 0 and 1. |
SQRT() | Returns the square root of a number. |
LOG() | Computes the natural logarithm of a number. |
LOG10() | Computes the base-10 logarithm of a number. |
POWER() | Raises a number to the power of another number. |
SIN(), COS(), TAN() | Trigonometric functions. |
ASIN(), ACOS(), ATAN() | Inverse trigonometric functions. |
ATAN2() | Returns the angle between the X-axis and the line segment connecting the origin and the point (x, y). |
Geospatial Functions¶
Function Name | Description |
---|---|
ST_DISTANCE() | Calculates the distance between two geospatial points. |
ST_CONTAINS() | Tests whether a geometry contains another geometry. |
ST_INTERSECTS() | Tests whether two geometries intersect. |
ST_AREA() | Computes the area of a polygon. |
ST_LENGTH() | Computes the length of a linestring. |
ST_CENTROID() | Computes the centroid of a geometry. |
ST_BUFFER() | Creates a buffer around a geometry. |
ST_MAKELINE() | Creates a linestring from a sequence of points. |
ST_MAKEPOLYGON() | Creates a polygon from a sequence of points. |
ST_MAKEPOINT() | Creates a point from X and Y coordinates. |
ST_UNION() | Computes the union of multiple geometries. |
JSON Functions¶
Function Name | Description |
---|---|
JSON_EXTRACT() | Extracts data from a JSON string. |
JSON_QUERY() | Queries JSON data using JSONPath expressions. |
JSON_EXTRACT_SCALAR() | Extracts scalar data from a JSON string. |
JSON_ARRAY_LENGTH() | Returns the length of a JSON array. |
Array Functions¶
Function Name | Description |
---|---|
ARRAY_LENGTH() | Returns the number of elements in an array. |
ARRAY_CONCAT() | Concatenates two or more arrays. |
ARRAY_REVERSE() | Reverses the order of elements in an array. |
ARRAY_TO_STRING() | Converts an array to a string with a specified delimiter. |
Other Functions¶
Function Name | Description |
---|---|
COALESCE() | Returns the first non-null value in a list of expressions. |
NULLIF() | Returns NULL if the two given expressions are equal. Otherwise returns the first expression. |
GREATEST() | Returns the greatest value from a list of expressions. |
LEAST() | Returns the smallest value from a list of expressions. |
CAST() | Converts a value from one data type to another. |
SAFE_CAST() | Safely converts a value from one data type to another, returning NULL if the conversion fails. |
IF() | Evaluates a conditional expression and returns the value of one of the two specified expressions, depending on whether the condition is true or false. |
CASE() | Evaluates a list of conditions and returns the first true result or a default value if no conditions are met. |
ARRAY_AGG() | Creates an array from a set of input values. |
STRING_AGG() | Concatenates non-null input values into a single string, separated by a specified delimiter. |
ARRAY(SELECT ...) | Creates an array from the results of a subquery. |
STRUCT() | Creates a structure with named fields from a list of expressions. |
GENERATE_ARRAY() | Generates an array of integers within a specified range. |
Bit Functions¶
Function Name | Description |
---|---|
BIT_COUNT() | Counts the number of bits set to 1 in an integer. |
BIT_XOR() | Computes the bitwise exclusive OR (XOR) of an integer column. |
BIT_AND() | Computes the bitwise AND of an integer column. |
BIT_OR() | Computes the bitwise OR of an integer column. |
BigQuery ML supports the following machine learning models¶
Find below the merged table with extra columns for use cases aligned:
Model Name | Description | Use Cases |
---|---|---|
Linear Regression (BQML.LINEAR_REG) | A regression model used to predict the value of a variable based on the value of another variable. | Predicting the price of a house in relation to the square meters, the number of sales in relation to the location, or the number of clicks on an ad in relation to the time of day. |
Logistic Regression (BQML.LOGISTIC_REG) | A classification model used to predict the probability of an event happening based on a set of known, independent variables. | Predicting whether a customer will churn, whether an email is spam, or whether a tumour is malignant based on specific predefined characteristics. |
K-means Clustering (BQML.KMEANS) | An unsupervised learning model used to group n observations in K clusters in which each observation belongs to the cluster with the nearest mean value. | Grouping customers into segments based on their purchase history, website activity, or demographics. |
Matrix Factorization (BQML.MATRIX_FACTORIZATION) | A recommendation model that finds relations between items' and users' entities. | Recommending products to customers, movies to viewers, or songs to listeners. |
Time Series (ARIMA, ARIMA_PLUS, and BQML.AUTO_ARIMA) | ARIMA stands for autoregressive integrated moving average. Models for time series forecasting or describing past events in a time period. | Predicting future sales, website traffic, or customer churn. |
Deep Neural Networks (BQML.DNN) | A deep learning model consisting of multiple layers of interconnected neurons. | Classifying images, translating languages, or generating text. |
TensorFlow Model Import (BQML.TF_IMPORT) | Enables you to deploy pre-trained TensorFlow models in BigQuery. | Deploying pre-trained TensorFlow models in BigQuery. |
Boosted Tree Models (BQML.BOOSTED_TREE and BQML.BOOSTED_TREE_CLASSIFIER) | Ensemble learning models built from decision trees to improve the accuracy of classification and regression models. | Improving the accuracy of classification and regression models. |
XGBoost (BQML.XGB_REG and BQML.XGB_CLASSIFIER) | Extreme Gradient Boosting model. The algorithm is trained on smaller decision trees, and the final outcome is the combination of these results. | Regression and classification tasks, known for their high performance and flexibility. |
AutoML Tables (BQML.AUTOML_TABLES) | A fully managed machine learning service that automatically builds and tunes machine learning models for tabular data. | Automatically building and tuning machine learning models for tabular data. |
Autoencoder (BQML.AUTOENCODER) | A type of neural network that can be used to reduce the dimensionality of data, denoise images, or detect fraud. | Reducing the dimensionality of data, denoising images, or detecting fraud. |
ML model comparison¶
Model Name | Type | Problem Type | Complexity | Computational Cost | Ease of Use |
---|---|---|---|---|---|
Linear Regression | Supervised Learning | Regression | Simple | Can be expensive for large datasets | Easy to understand and implement |
Logistic Regression | Supervised Learning | Categorical Classification | Simple | Can be expensive for large datasets | Easy to understand and implement |
K-means Clustering | Unsupervised Learning | Clustering | Simple | Can be computationally expensive for large datasets | Easy to understand and implement |
Matrix Factorization | Unsupervised Learning | Recommendation | Complex | Can be computationally expensive for large datasets | More complex to use, but can save time and effort |
Time Series | Supervised Learning | Forecasting | Complex | Can be computationally expensive for large datasets | More complex to use, but can save time and effort |
Deep Neural Networks | Supervised Learning | Regression, Classification, Image Recognition, Natural Language Processing, etc. | Complex | Can be computationally expensive for large datasets | More complex to use, but can achieve state-of-the-art results |
TensorFlow Model Import | Supervised Learning | Regression, Classification, Image Recognition, Natural Language Processing, etc. | Complex | Can be computationally expensive for large datasets | More complex to use, but can save time and effort |
Boosted Tree Models | Supervised Learning | Regression, Classification | Complex | Can be computationally expensive for large datasets | More complex to use, but can achieve state-of-the-art results |
XGBoost | Supervised Learning | Regression, Classification | Complex | Can be computationally expensive for large datasets | More complex to use, but can achieve state-of-the-art results |
AutoML Tables | Supervised Learning | Regression, Classification, Clustering, Recommendation, Forecasting, etc. | Complex | Can be computationally expensive for large datasets | More complex to use, but can save time and effort |
Autoencoder | Unsupervised Learning | Dimensionality Reduction, Image Denoising, Fraud Detection, etc. | Complex | Can be computationally expensive for large datasets | More complex to use, but can save time and effort |