BigQuery Cheat Sheet¶

Estimated time to read: 8 minutes

Functions¶

String Functions¶

Function Name	Description
CONCAT	Concatenates multiple strings into a single string.
LENGTH	Returns the length of a string.
LOWER	Converts a string to lowercase.
UPPER	Converts a string to uppercase.
TRIM	Removes leading and trailing spaces from a string.
REPLACE	Replaces occurrences of a substring within a string.
SUBSTR	Extracts a substring from a string.
REGEXP_REPLACE	Replaces parts of a string that match a regular expression.
LPAD()	Pads a string with a specified character to a given length on the left side.
RPAD()	Pads a string with a specified character to a given length on the right side.
SPLIT()	Splits a string into an array based on a specified delimiter.
REGEXP_CONTAINS()	Tests whether a string contains a substring matching a regular expression.
REGEXP_EXTRACT()	Extracts the first matching substring from a string based on a regular expression.
REPEAT()	Repeats a string a specified number of times.
REVERSE()	Reverses the characters in a string.

Date and Time Functions¶

Function Name	Description
CURRENT_DATE	Returns the current date.
CURRENT_TIMESTAMP	Returns the current timestamp.
DATE_ADD	Adds an interval to a date.
DATE_DIFF	Computes the difference between two dates.
DATE_TRUNC	Truncates a date to a specified granularity.
EXTRACT	Extracts a specific date part from a date, timestamp, or datetime.
FORMAT_DATETIME	Formats a datetime value as a string.
PARSE_DATE	Parses a string into a date using a specified format.
DATETIME_ADD()	Adds an interval to a datetime.
DATETIME_DIFF()	Computes the difference between two datetimes.
DATETIME_TRUNC()	Truncates a datetime to a specified granularity.
FORMAT_TIMESTAMP()	Formats a timestamp value as a string.
PARSE_TIMESTAMP()	Parses a string into a timestamp using a specified format.
TIMESTAMP_ADD()	Adds an interval to a timestamp.
TIMESTAMP_DIFF()	Computes the difference between two timestamps.
TIMESTAMP_TRUNC()	Truncates a timestamp to a specified granularity.

Numeric Functions¶

Function Name	Description
ABS	Returns the absolute value of a number.
CEIL	Returns the smallest integer greater than or equal to a given number.
FLOOR	Returns the largest integer less than or equal to a given number.
MOD	Returns the remainder of one number divided by another.
ROUND	Rounds a number to the nearest integer or a specified number of decimal places.

Aggregate Functions¶

Function Name	Description
AVG	Calculates the average value of a set of numbers.
COUNT	Counts the number of non-null values in a column.
MAX	Returns the maximum value in a column.
MIN	Returns the minimum value in a column.
SUM	Calculates the sum of all non-null values in a column.

Window Functions¶

Function Name	Description
ROW_NUMBER	Assigns a unique number to each row within a result set.
RANK	Assigns a unique rank to each row within a result set, with the same rank assigned to rows with equal values.
DENSE_RANK	Assigns a unique rank to each row within a result set, without gaps in the ranking sequence for equal values.
NTILE	Divides a result set into a specified number of roughly equal groups and assigns a unique group number to each row. It doesn't provide the percentile value!

Mathematical Functions¶

Function Name	Description
RAND()	Returns a random floating-point value between 0 and 1.
SQRT()	Returns the square root of a number.
LOG()	Computes the natural logarithm of a number.
LOG10()	Computes the base-10 logarithm of a number.
POWER()	Raises a number to the power of another number.
SIN(), COS(), TAN()	Trigonometric functions.
ASIN(), ACOS(), ATAN()	Inverse trigonometric functions.
ATAN2()	Returns the angle between the X-axis and the line segment connecting the origin and the point (x, y).

Geospatial Functions¶

Function Name	Description
ST_DISTANCE()	Calculates the distance between two geospatial points.
ST_CONTAINS()	Tests whether a geometry contains another geometry.
ST_INTERSECTS()	Tests whether two geometries intersect.
ST_AREA()	Computes the area of a polygon.
ST_LENGTH()	Computes the length of a linestring.
ST_CENTROID()	Computes the centroid of a geometry.
ST_BUFFER()	Creates a buffer around a geometry.
ST_MAKELINE()	Creates a linestring from a sequence of points.
ST_MAKEPOLYGON()	Creates a polygon from a sequence of points.
ST_MAKEPOINT()	Creates a point from X and Y coordinates.
ST_UNION()	Computes the union of multiple geometries.

JSON Functions¶

Function Name	Description
JSON_EXTRACT()	Extracts data from a JSON string.
JSON_QUERY()	Queries JSON data using JSONPath expressions.
JSON_EXTRACT_SCALAR()	Extracts scalar data from a JSON string.
JSON_ARRAY_LENGTH()	Returns the length of a JSON array.

Array Functions¶

Function Name	Description
ARRAY_LENGTH()	Returns the number of elements in an array.
ARRAY_CONCAT()	Concatenates two or more arrays.
ARRAY_REVERSE()	Reverses the order of elements in an array.
ARRAY_TO_STRING()	Converts an array to a string with a specified delimiter.

Other Functions¶

Function Name	Description
COALESCE()	Returns the first non-null value in a list of expressions.
NULLIF()	Returns NULL if the two given expressions are equal. Otherwise returns the first expression.
GREATEST()	Returns the greatest value from a list of expressions.
LEAST()	Returns the smallest value from a list of expressions.
CAST()	Converts a value from one data type to another.
SAFE_CAST()	Safely converts a value from one data type to another, returning NULL if the conversion fails.
IF()	Evaluates a conditional expression and returns the value of one of the two specified expressions, depending on whether the condition is true or false.
CASE()	Evaluates a list of conditions and returns the first true result or a default value if no conditions are met.
ARRAY_AGG()	Creates an array from a set of input values.
STRING_AGG()	Concatenates non-null input values into a single string, separated by a specified delimiter.
ARRAY(SELECT ...)	Creates an array from the results of a subquery.
STRUCT()	Creates a structure with named fields from a list of expressions.
GENERATE_ARRAY()	Generates an array of integers within a specified range.

Bit Functions¶

Function Name	Description
BIT_COUNT()	Counts the number of bits set to 1 in an integer.
BIT_XOR()	Computes the bitwise exclusive OR (XOR) of an integer column.
BIT_AND()	Computes the bitwise AND of an integer column.
BIT_OR()	Computes the bitwise OR of an integer column.

BigQuery ML supports the following machine learning models¶

Find below the merged table with extra columns for use cases aligned:

Model Name	Description	Use Cases
Linear Regression (BQML.LINEAR_REG)	A regression model used to predict the value of a variable based on the value of another variable.	Predicting the price of a house in relation to the square meters, the number of sales in relation to the location, or the number of clicks on an ad in relation to the time of day.
Logistic Regression (BQML.LOGISTIC_REG)	A classification model used to predict the probability of an event happening based on a set of known, independent variables.	Predicting whether a customer will churn, whether an email is spam, or whether a tumour is malignant based on specific predefined characteristics.
K-means Clustering (BQML.KMEANS)	An unsupervised learning model used to group n observations in K clusters in which each observation belongs to the cluster with the nearest mean value.	Grouping customers into segments based on their purchase history, website activity, or demographics.
Matrix Factorization (BQML.MATRIX_FACTORIZATION)	A recommendation model that finds relations between items' and users' entities.	Recommending products to customers, movies to viewers, or songs to listeners.
Time Series (ARIMA, ARIMA_PLUS, and BQML.AUTO_ARIMA)	ARIMA stands for autoregressive integrated moving average. Models for time series forecasting or describing past events in a time period.	Predicting future sales, website traffic, or customer churn.
Deep Neural Networks (BQML.DNN)	A deep learning model consisting of multiple layers of interconnected neurons.	Classifying images, translating languages, or generating text.
TensorFlow Model Import (BQML.TF_IMPORT)	Enables you to deploy pre-trained TensorFlow models in BigQuery.	Deploying pre-trained TensorFlow models in BigQuery.
Boosted Tree Models (BQML.BOOSTED_TREE and BQML.BOOSTED_TREE_CLASSIFIER)	Ensemble learning models built from decision trees to improve the accuracy of classification and regression models.	Improving the accuracy of classification and regression models.
XGBoost (BQML.XGB_REG and BQML.XGB_CLASSIFIER)	Extreme Gradient Boosting model. The algorithm is trained on smaller decision trees, and the final outcome is the combination of these results.	Regression and classification tasks, known for their high performance and flexibility.
AutoML Tables (BQML.AUTOML_TABLES)	A fully managed machine learning service that automatically builds and tunes machine learning models for tabular data.	Automatically building and tuning machine learning models for tabular data.
Autoencoder (BQML.AUTOENCODER)	A type of neural network that can be used to reduce the dimensionality of data, denoise images, or detect fraud.	Reducing the dimensionality of data, denoising images, or detecting fraud.

ML model comparison¶

Model Name	Type	Problem Type	Complexity	Computational Cost	Ease of Use
Linear Regression	Supervised Learning	Regression	Simple	Can be expensive for large datasets	Easy to understand and implement
Logistic Regression	Supervised Learning	Categorical Classification	Simple	Can be expensive for large datasets	Easy to understand and implement
K-means Clustering	Unsupervised Learning	Clustering	Simple	Can be computationally expensive for large datasets	Easy to understand and implement
Matrix Factorization	Unsupervised Learning	Recommendation	Complex	Can be computationally expensive for large datasets	More complex to use, but can save time and effort
Time Series	Supervised Learning	Forecasting	Complex	Can be computationally expensive for large datasets	More complex to use, but can save time and effort
Deep Neural Networks	Supervised Learning	Regression, Classification, Image Recognition, Natural Language Processing, etc.	Complex	Can be computationally expensive for large datasets	More complex to use, but can achieve state-of-the-art results
TensorFlow Model Import	Supervised Learning	Regression, Classification, Image Recognition, Natural Language Processing, etc.	Complex	Can be computationally expensive for large datasets	More complex to use, but can save time and effort
Boosted Tree Models	Supervised Learning	Regression, Classification	Complex	Can be computationally expensive for large datasets	More complex to use, but can achieve state-of-the-art results
XGBoost	Supervised Learning	Regression, Classification	Complex	Can be computationally expensive for large datasets	More complex to use, but can achieve state-of-the-art results
AutoML Tables	Supervised Learning	Regression, Classification, Clustering, Recommendation, Forecasting, etc.	Complex	Can be computationally expensive for large datasets	More complex to use, but can save time and effort
Autoencoder	Unsupervised Learning	Dimensionality Reduction, Image Denoising, Fraud Detection, etc.	Complex	Can be computationally expensive for large datasets	More complex to use, but can save time and effort