Essential SQL Functions for GA4 BigQuery Export Analysis

The fundamental functions you’ll need to master if you want to work with the GA4 events data export in BigQuery

Jim Barlow
Decode Data

--

You just have to find the right pieces. Photo by Xavi Cabrera on Unsplash

Motivation

The GA4 BigQuery Export is an extremely valuable raw resource, giving you ownership of your raw Google Analytics event data and access to a wide, continually expanding range of BigQuery-enabled data capabilities. It’s also free to enable and should only incur low storage costs over time.

However it is virtually impossible to query.

Which is an issue if you want to do anything to benefit from this data source.

In order to understand the underlying structure of the data and remodel it into alternative structure, which is optimised for querying, analysis and insights, there are a number of functions and concepts which need to be understood.

This article is intended to catalogue all of the functions you will need to understand in order to work with the GA4 BigQuery event export. The next article in the series will detail all of the SQL concepts you will need to understand in order to model the GA4 BigQuery event export.

Situation

When you have enabled the GA4 export to BigQuery, at a minimum you will see a table called events_YYYYMMDD, where YYYYMMDD represents the date of the event export. This is a date-sharded table, which is actually a set of tables with a common prefix and a date-based suffix. It can be queried efficiently (much like a date-partitioned table), but requires a different syntax.

The unique structure of this table presents a number of challenges to anybody wanting to query or analyse it in an efficient, readable and reusable manner.

In order to remodel this event data into a more useable event-level structure without losing information, there are a number of SQL functions you will have to use. This article outlines them all.

Functions

FORMAT_DATE

The FORMAT_DATE function is used to convert a DATE value into a STRING with a specific date syntax. In this case the '%Y%m%d' date syntax formats the output into a STRING in the format YYYYMMDD.

SELECT FORMAT_DATE('%Y%m%d', '2024-01-07')
> 20240107

PARSE_DATE

The PARSE_DATE function is the inverse of the FORMAT_DATE function, and is used to convert a STRING value into a DATE value, given an expected date syntax in the input STRING value.

sSELECT PARSE_DATE('%Y%m%d', '20240107')
> 2024-01-07

_TABLE_SUFFIX

The _TABLE_SUFFIX is technically a pseudo-column (a metadata column which is hidden but can be queried), which can be used in conjunction with a wildcard query and the FORMAT_DATE function to efficiently query the events_YYYYMMDD table. Note that this query assumes that the start_date and end_date DATE variables have been set as script variables or table function arguments.

SELECT 
_TABLE_SUFFIX AS table_suffix, *
FROM `[project_id].analytics_#########].events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE("%Y%m%d", start_date) AND
FORMAT_DATE("%Y%m%d", end_date)

This query syntax returns all source table columns with the _TABLE_SUFFIX as a STRING column called table_suffix before the rest of the columns.

TO_JSON_STRING

The TO_JSON_STRING function returns a JSON-formatted STRING representation of the data passed to it as an argument. This argument can be a variable, column value or the name of a preceding common table expression, as in the example below.

WITH
add_ids AS (
SELECT
'project_id' AS project_id,
'analytics_123456789' AS dataset_name)

SELECT
TO_JSON_STRING(add_ids) AS add_ids_json
FROM add_ids
> {"project_id":"project_id","dataset_name":"analytics_123456789"}

NULL values are also preserved in the output, which makes it robust to unpredictable input data. This can be observed in the output from the example below.

WITH
add_ids AS (
SELECT
'project_id' AS project_id,
'analytics_123456789' AS dataset_name,
NULL AS example_value)

SELECT
TO_JSON_STRING(add_ids) AS add_ids_json
FROM add_ids
> {"project_id":"project_id","dataset_name":"analytics_123456789","example_value":null}

ARRAY and STRUCT values also directly map to JSON, so any BigQuery data structure can be represented in JSON.

Note that the function TO_JSON can be used interchangeably with this function, in which case the function will return a JSON value instead of a JSON-formatted STRING value.

SHA256

The SHA256 function is a hashing function, which returns a 44 character BYTES value based on an input STRING or BYTES of any length. For a given input it will always return the same output, which makes it a useful function to check for uniqueness or identicality of inputs. Adding this to the example above gives a unique id for each row of data in the add_ids common table expression.

WITH
add_ids AS (
SELECT
'project_id' AS project_id,
'analytics_123456789' AS dataset_name,
NULL AS example_value)

SELECT
SHA256(TO_JSON_STRING(add_ids)) AS row_id
FROM add_ids
> XBbBYHTigXnQLJwZQ9dJ0y8TJexcWfhBiJWfp2pdVfE=

TO_HEX

The TO_HEX function is a function which converts a sequence of BYTES (as returned by the SHA256 function above) into a hexadecimal STRING. This means that the resulting STRING will only contain the characters 0..9 and a..f, making it cleaner for certain additional operations and copy/paste actions in the user interface.

WITH
add_ids AS (
SELECT
'project_id' AS project_id,
'analytics_123456789' AS dataset_name,
NULL AS example_value)

SELECT
TO_HEX(SHA256(TO_JSON_STRING(add_ids))) AS row_id
FROM add_ids
> 5c16c16074e28179d02c9c1943d749d32f1325ec5c59f84188959fa76a5d55f1

COALESCE

The COALESCE function returns the first non-null element from a sequence of values, which could be columns or variables. Note that all elements in the sequence need to be coercible to a common supertype.

SELECT COALESCE (NULL,"value_x","value_y", NULL) AS value
> value_x

SAFE_CAST

The SAFE_CAST function is the SAFE version of the CAST function, meaning that if the input value cannot be CAST to the target data type, it will return NULL value instead of an error. Note that for many other functions this behaviour can be enforced by using the SAFE. prefix.

SELECT SAFE_CAST("value_x" AS INT64) AS value
> null

Conversely, if using the CAST function, the function will error.

SELECT CAST("value_x" AS INT64) AS value
> Query error: Bad int64 value: value_x at [1:1]

TIMESTAMP_MICROS

The TIMESTAMP_MICROS function converts a 16-digit unix timestamp (microseconds) to a TIMESTAMP.

SELECT TIMESTAMP_MICROS(1711353064567254)
> 2024-03-25 07:51:04.567254 UTC

DATE

The DATE function extracts the DATE from a TIMESTAMP. Used in conjunction with the TIMESTAMP_MICROS function, it enables conversion from a unix timestamp to a DATE.

SELECT DATE(TIMESTAMP_MICROS(1711353064567254));
> 2024-03-25

TIME

The TIME function extracts the TIME (independent of the date) from a TIMESTAMP. Used in conjunction with the TIMESTAMP_MICROS function, it enables conversion from a unix timestamp to a TIME.

SELECT TIME(TIMESTAMP_MICROS(1711353064567254));
> 07:51:04.567254

REGEXP_CONTAINS

The REGEXP_CONTAINS function returns a BOOL, evaluating to TRUE if the input STRING matches the input regular expression:

SELECT REGEXP_CONTAINS('somebody@email.com', r'@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+') AS email_is_valid
> true

If the STRING does not match the regular expression, it will evaluate to FALSE.

SELECT REGEXP_CONTAINS('not_an_email.com', r'@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+') AS email_is_valid
> false

One important characteristic to note is that if either function argument is NULL, the expression will evaluate to NULL.

SELECT REGEXP_CONTAINS(NULL , r'@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+') AS email_is_valid
> null
SELECT REGEXP_CONTAINS('somebody@email.com', NULL) AS email_is_valid
> null

Conclusion

Familiarity with these functions is a pre-requisite for anybody working with the GA4 BigQuery event export. However, in order to leverage them there is also a number of fundamental concepts and syntaxes within which they need to be deployed.

The following article in this series will introduce the reader to these additional foundations.

Please follow me to ensure that you will be notified upon release.

--

--