Essential SQL Functions for GA4 BigQuery Export Analysis
The fundamental functions you’ll need to master if you want to work with the GA4 events data export in BigQuery
Motivation
The GA4 BigQuery Export is an extremely valuable raw resource, giving you ownership of your raw Google Analytics event data and access to a wide, continually expanding range of BigQuery-enabled data capabilities. It’s also free to enable and should only incur low storage costs over time.
However it is virtually impossible to query.
Which is an issue if you want to do anything to benefit from this data source.
In order to understand the underlying structure of the data and remodel it into alternative structure, which is optimised for querying, analysis and insights, there are a number of functions and concepts which need to be understood.
This article is intended to catalogue all of the functions you will need to understand in order to work with the GA4 BigQuery event export. The next article in the series will detail all of the SQL concepts you will need to understand in order to model the GA4 BigQuery event export.
Situation
When you have enabled the GA4 export to BigQuery, at a minimum you will see a table called events_YYYYMMDD
, where YYYYMMDD
represents the date of the event export. This is a date-sharded table, which is actually a set of tables with a common prefix and a date-based suffix. It can be queried efficiently (much like a date-partitioned table), but requires a different syntax.
The unique structure of this table presents a number of challenges to anybody wanting to query or analyse it in an efficient, readable and reusable manner.
In order to remodel this event data into a more useable event-level structure without losing information, there are a number of SQL functions you will have to use. This article outlines them all.
Functions
FORMAT_DATE
The FORMAT_DATE
function is used to convert a DATE
value into a STRING
with a specific date syntax. In this case the '%Y%m%d'
date syntax formats the output into a STRING
in the format YYYYMMDD
.
SELECT FORMAT_DATE('%Y%m%d', '2024-01-07')
> 20240107
PARSE_DATE
The PARSE_DATE
function is the inverse of the FORMAT_DATE
function, and is used to convert a STRING
value into a DATE
value, given an expected date syntax in the input STRING
value.
sSELECT PARSE_DATE('%Y%m%d', '20240107')
> 2024-01-07
_TABLE_SUFFIX
The _TABLE_SUFFIX
is technically a pseudo-column (a metadata column which is hidden but can be queried), which can be used in conjunction with a wildcard query and the FORMAT_DATE
function to efficiently query the events_YYYYMMDD
table. Note that this query assumes that the start_date
and end_date
DATE
variables have been set as script variables or table function arguments.
SELECT
_TABLE_SUFFIX AS table_suffix, *
FROM `[project_id].analytics_#########].events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE("%Y%m%d", start_date) AND
FORMAT_DATE("%Y%m%d", end_date)
This query syntax returns all source table columns with the _TABLE_SUFFIX
as a STRING
column called table_suffix
before the rest of the columns.
TO_JSON_STRING
The TO_JSON_STRING
function returns a JSON-formatted STRING
representation of the data passed to it as an argument. This argument can be a variable, column value or the name of a preceding common table expression, as in the example below.
WITH
add_ids AS (
SELECT
'project_id' AS project_id,
'analytics_123456789' AS dataset_name)
SELECT
TO_JSON_STRING(add_ids) AS add_ids_json
FROM add_ids
> {"project_id":"project_id","dataset_name":"analytics_123456789"}
NULL
values are also preserved in the output, which makes it robust to unpredictable input data. This can be observed in the output from the example below.
WITH
add_ids AS (
SELECT
'project_id' AS project_id,
'analytics_123456789' AS dataset_name,
NULL AS example_value)
SELECT
TO_JSON_STRING(add_ids) AS add_ids_json
FROM add_ids
> {"project_id":"project_id","dataset_name":"analytics_123456789","example_value":null}
ARRAY
and STRUCT
values also directly map to JSON
, so any BigQuery data structure can be represented in JSON
.
Note that the function TO_JSON
can be used interchangeably with this function, in which case the function will return a JSON
value instead of a JSON-formatted STRING
value.
SHA256
The SHA256
function is a hashing function, which returns a 44 character BYTES
value based on an input STRING
or BYTES
of any length. For a given input it will always return the same output, which makes it a useful function to check for uniqueness or identicality of inputs. Adding this to the example above gives a unique id for each row of data in the add_ids
common table expression.
WITH
add_ids AS (
SELECT
'project_id' AS project_id,
'analytics_123456789' AS dataset_name,
NULL AS example_value)
SELECT
SHA256(TO_JSON_STRING(add_ids)) AS row_id
FROM add_ids
> XBbBYHTigXnQLJwZQ9dJ0y8TJexcWfhBiJWfp2pdVfE=
TO_HEX
The TO_HEX function is a function which converts a sequence of BYTES
(as returned by the SHA256
function above) into a hexadecimal STRING
. This means that the resulting STRING
will only contain the characters 0..9
and a..f
, making it cleaner for certain additional operations and copy/paste actions in the user interface.
WITH
add_ids AS (
SELECT
'project_id' AS project_id,
'analytics_123456789' AS dataset_name,
NULL AS example_value)
SELECT
TO_HEX(SHA256(TO_JSON_STRING(add_ids))) AS row_id
FROM add_ids
> 5c16c16074e28179d02c9c1943d749d32f1325ec5c59f84188959fa76a5d55f1
COALESCE
The COALESCE function returns the first non-null element from a sequence of values, which could be columns or variables. Note that all elements in the sequence need to be coercible to a common supertype.
SELECT COALESCE (NULL,"value_x","value_y", NULL) AS value
> value_x
SAFE_CAST
The SAFE_CAST function is the SAFE
version of the CAST function, meaning that if the input value cannot be CAST
to the target data type, it will return NULL
value instead of an error. Note that for many other functions this behaviour can be enforced by using the SAFE. prefix.
SELECT SAFE_CAST("value_x" AS INT64) AS value
> null
Conversely, if using the CAST function, the function will error.
SELECT CAST("value_x" AS INT64) AS value
> Query error: Bad int64 value: value_x at [1:1]
TIMESTAMP_MICROS
The TIMESTAMP_MICROS function converts a 16-digit unix timestamp (microseconds) to a TIMESTAMP
.
SELECT TIMESTAMP_MICROS(1711353064567254)
> 2024-03-25 07:51:04.567254 UTC
DATE
The DATE function extracts the DATE
from a TIMESTAMP
. Used in conjunction with the TIMESTAMP_MICROS
function, it enables conversion from a unix timestamp to a DATE
.
SELECT DATE(TIMESTAMP_MICROS(1711353064567254));
> 2024-03-25
TIME
The TIME function extracts the TIME
(independent of the date) from a TIMESTAMP
. Used in conjunction with the TIMESTAMP_MICROS
function, it enables conversion from a unix timestamp to a TIME
.
SELECT TIME(TIMESTAMP_MICROS(1711353064567254));
> 07:51:04.567254
REGEXP_CONTAINS
The REGEXP_CONTAINS function returns a BOOL
, evaluating to TRUE
if the input STRING
matches the input regular expression:
SELECT REGEXP_CONTAINS('somebody@email.com', r'@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+') AS email_is_valid
> true
If the STRING
does not match the regular expression, it will evaluate to FALSE
.
SELECT REGEXP_CONTAINS('not_an_email.com', r'@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+') AS email_is_valid
> false
One important characteristic to note is that if either function argument is NULL
, the expression will evaluate to NULL
.
SELECT REGEXP_CONTAINS(NULL , r'@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+') AS email_is_valid
> null
SELECT REGEXP_CONTAINS('somebody@email.com', NULL) AS email_is_valid
> null
Conclusion
Familiarity with these functions is a pre-requisite for anybody working with the GA4 BigQuery event export. However, in order to leverage them there is also a number of fundamental concepts and syntaxes within which they need to be deployed.
The following article in this series will introduce the reader to these additional foundations.
Please follow me to ensure that you will be notified upon release.