Learn how to practice Google Analytics 4 (GA4) analysis in MySQL by building user paths and tracking source/medium. Step-by-step guide with SQL code, dynamic queries, and session analysis using GA4 event data.
We have a BigQuery table exported from GA4. The full schema explanation is available here, but we are only interested in event_timestamp, user_pseudo_id, and event_params, because we are building User Paths. Our final goal is to turn a raw event table like this:

Event Timestamp | User Pseudo ID | Event Params (Key / Int Value / String Value) |
---|---|---|
1695984000000 | user_123 | ga_session_id / 000000001 / null; source / null / google; medium / null / cpc
1695984060000 | user_123 | ga_session_id / 000000002 / null; source / null / meta; medium / null / cpc
1695984120000 | user_456 | ga_session_id / 000000003 / null; source / null / google; medium / null / organic

into a table of user_pseudo_ids and their source/medium steps:

User Pseudo ID | Step 1 | Step 2 |
---|---|---|
user_123 | google/cpc | meta/cpc
user_456 | google/organic | -
The first thing you face when working with BigQuery is its cost. It's not a problem with small datasets, but what's the point of data analysis practice if the insights fit in a simple CSV file?
Anyway, my practice with BigQuery has always been nerve-racking. For example, I paid a few dollars working with the Google Ads Transparency Center dataset, because even
SELECT COUNT(*) FROM `bigquery-public-data.google_ads_transparency_center.creative_stats`
cost about 150 GB per call. With 1 TB of free usage per month, it takes only 7 calls before you start paying, and if you have any experience with SQL, you know how many calls it takes to make sure everything works.
So, in the end, I don't want to pay extra for practice.
Although this article primarily uses MySQL syntax, I have also translated everything into Python and BigQuery. The Python version is done with Pandas, and if you are familiar with its terminology, it can be easier to follow than SQL, which is table-oriented and requires considerable practice before the logic behind the queries becomes intuitive.
First, you need to copy this dataset to your own BigQuery project so you can download it to your PC. In the next article, I will generate this dataset synthetically, because BigQuery is exactly the cost and dependency we are trying to avoid.
- `my_project` → your project ID
- `my_dataset` → your dataset ID
- `my_table` → the new table you are creating
- The `events_202*` pattern selects all of the GA4 sample's events tables matching that prefix (the export is partitioned by date, one table per day).
BigQuery tables are referenced as: project_id.dataset_id.table_id
Project ID → Your Google Cloud project name. You can see it in the BigQuery console at the top left.
Dataset ID → The dataset (folder) inside your project. It groups related tables.
Table ID → The specific table you want to query or create.
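The copy itself is a single CREATE TABLE ... AS SELECT statement run in the BigQuery console. Here is a minimal sketch, assuming the public GA4 sample dataset (bigquery-public-data.ga4_obfuscated_sample_ecommerce) and the placeholder names above:
-- Run in BigQuery: copy the public GA4 sample events into your own project
-- (dataset ID assumed to be the public GA4 obfuscated e-commerce sample)
CREATE TABLE `my_project.my_dataset.my_table` AS
SELECT *
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_202*`;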
Next, you need to use Python to upload data from BigQuery to MySQL:
import pandas_gbq
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from google.oauth2 import service_account
# --- MySQL connection via SQLAlchemy ---
#TODO fill your local MySQL user and password
engine = create_engine(
"mysql+pymysql://user:password@localhost/mydb?charset=utf8mb4"
)
# --- BigQuery base SQL ---
base_sql = """
SELECT
event_date,
event_timestamp,
event_name,
TO_JSON_STRING(event_params) AS event_params,
event_previous_timestamp,
IFNULL(event_value_in_usd, 0) AS event_value_in_usd,
event_bundle_sequence_id,
event_server_timestamp_offset,
user_id,
user_pseudo_id,
TO_JSON_STRING(privacy_info) AS privacy_info,
TO_JSON_STRING(user_properties) AS user_properties,
user_first_touch_timestamp,
TO_JSON_STRING(user_ltv) AS user_ltv,
TO_JSON_STRING(device) AS device,
TO_JSON_STRING(geo) AS geo,
TO_JSON_STRING(app_info) AS app_info,
TO_JSON_STRING(traffic_source) AS traffic_source,
stream_id,
platform,
TO_JSON_STRING(event_dimensions) AS event_dimensions,
TO_JSON_STRING(ecommerce) AS ecommerce,
TO_JSON_STRING(items) AS items
FROM `my_project.my_dataset.my_table`
"""
credentials = service_account.Credentials.from_service_account_file('secret.json')
pandas_gbq.context.credentials = credentials
pandas_gbq.context.project = "my_project"
total_rows = 4295584
chunk_size = 100000
# Iterate over the table in chunks of chunk_size rows, starting from the first row
for offset in range(0, total_rows, chunk_size):
    sql = f"{base_sql} LIMIT {chunk_size} OFFSET {offset}"
    print(f"Fetching rows {offset} to {offset + chunk_size}...")
    df = pandas_gbq.read_gbq(sql)
    if df.empty:
        print("No more rows to fetch.")
        break
    # Fill numeric NaN with 0
    for col in df.select_dtypes(include=[np.number]).columns:
        df[col] = df[col].fillna(0)
    # Replace text NaN with None
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].where(pd.notnull(df[col]), None)
    # --- Push chunk to MySQL ---
    df.to_sql(
        'ga4_ecom',
        con=engine,
        if_exists='append',
        index=False
    )
    print(f"Inserted chunk ending at row {offset + len(df)}")
print("All data inserted successfully.")
Note that it took 43 queries (chunks) to upload all 4,295,584 rows from the sample dataset.
You need a Google Cloud project with the BigQuery API enabled and a secret key created for a service account. You can read more about credentials here. It looks complex, but in short: go to IAM & Admin -> Service Accounts, create a service account, create a secret key, and grant it access on the IAM page.
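Once the load finishes, a quick sanity check in MySQL (a minimal sketch, assuming the ga4_ecom table name used in the script above):
-- Verify that all rows made it into MySQL
SELECT COUNT(*) AS total_rows FROM ga4_ecom;
-- Expected: 4295584 if the whole sample was transferred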
In this section, I explain the step-by-step process of working with the GA4 data, ending with a stored procedure that builds all the tables in one call.
The first table we need to create is an unnested table of the JSON values in event_params. We're interested in ga_session_id, source, and medium. We go from this:

Event Timestamp | User Pseudo ID | Event Params (Key / Int Value / String Value) |
---|---|---|
1695984000000 | user_123 | ga_session_id / 000000001 / null; source / null / google; medium / null / cpc
1695984060000 | user_123 | ga_session_id / 000000002 / null; source / null / meta; medium / null / cpc
1695984120000 | user_456 | ga_session_id / 000000003 / null; source / null / google; medium / null / organic

to this:
Event Timestamp | User Pseudo ID | Key | Int Value | String Value |
---|---|---|---|---|
1695984000000 | user_123 | ga_session_id | 000000001 | null |
1695984000000 | user_123 | source | null | google
1695984000000 | user_123 | medium | null | cpc |
1695984060000 | user_123 | ga_session_id | 000000002 | null |
1695984060000 | user_123 | source | null | meta |
1695984060000 | user_123 | medium | null | cpc |
1695984120000 | user_456 | ga_session_id | 000000003 | null |
1695984120000 | user_456 | source | null | google
1695984120000 | user_456 | medium | null | organic |
These queries create the first table, jjson_table_sample, in three variants. We uploaded the data into the ga4_ecom table, but we'll first work with a smaller test_sample table created with this query:
CREATE TABLE test_sample AS SELECT * FROM ga4_ecom LIMIT 10000
This is because the final query over the full table takes a lot of time. Select a tab to see the Python (pandas), MySQL (JSON_TABLE), or BigQuery (UNNEST + JSON_VALUE) variant.
JSON_TABLE
CREATE TABLE jjson_table_sample AS
SELECT
t.event_timestamp,
t.user_pseudo_id,
tt.kkey,
tt.int_value,
tt.string_value
FROM test_sample AS t
JOIN JSON_TABLE(
t.event_params,
"$[*]" COLUMNS (
kkey VARCHAR(100) PATH "$.key",
int_value BIGINT PATH "$.value.int_value",
string_value VARCHAR(255) PATH "$.value.string_value"
)
) AS tt
WHERE tt.kkey IN ('ga_session_id','source','medium');
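To sanity-check the result, you can count rows per extracted key in the table we just created; only the three filtered keys should appear:
-- Quick check: which keys survived the filter and how many rows each has
SELECT
  kkey,
  COUNT(*) AS rows_per_key
FROM jjson_table_sample
GROUP BY kkey;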
Note that the GA4 table contains many events, such as page_view, scroll, begin_checkout, and purchase, and we don't need all of them separately. So we assign a session_id to every (event_timestamp, user_pseudo_id) partition by taking the MAX of int_value over that partition:
Event Timestamp | User Pseudo ID | Key | Int Value | String Value | Session ID |
---|---|---|---|---|---|
1695984000000 (page_view) | user_123 | ga_session_id | 000000001 | null | 000000001
1695984000000 (page_view) | user_123 | source | null | google | 000000001
1695984000000 (page_view) | user_123 | medium | null | cpc | 000000001
1695984100000 (purchase) | user_123 | ga_session_id | 000000001 | null | 000000001
1695984100000 (purchase) | user_123 | source | null | google | 000000001
1695984100000 (purchase) | user_123 | medium | null | cpc | 000000001
1695984060000 | user_123 | ga_session_id | 000000002 | null | 000000002
1695984060000 | user_123 | source | null | meta | 000000002
1695984060000 | user_123 | medium | null | cpc | 000000002
1695984120000 | user_456 | ga_session_id | 000000003 | null | 000000003
1695984120000 | user_456 | source | null | google | 000000003
1695984120000 | user_456 | medium | null | organic | 000000003
Select a tab to see MySQL or Python.
SELECT
event_timestamp,
user_pseudo_id,
kkey,
int_value,
string_value,
MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
FROM jjson_table;
Now let's form a table with session_start, user_pseudo_id, and session_id by taking the minimum event_timestamp grouped by user_pseudo_id and session_id:
Session start | User Pseudo ID | Session ID |
---|---|---|
1695984000000 | user_123 | 000000001 |
1695984060000 | user_123 | 000000002 |
1695984120000 | user_456 | 000000003 |
Select a tab to see MySQL or Python.
CREATE TABLE time_user_session AS
WITH sess AS
(SELECT
event_timestamp,
user_pseudo_id,
kkey,
int_value,
string_value,
MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
FROM jjson_table)
SELECT
MIN(event_timestamp) AS session_start,
user_pseudo_id,
session_id
FROM sess
GROUP BY
user_pseudo_id,
session_id;
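Before moving on, it can be useful to see how many sessions each user has; the maximum of this count is what the dynamic procedure below uses to decide how many step columns to create. A minimal sketch against the table we just built:
-- Sessions per user; the largest value defines the width of the final pivot
SELECT
  user_pseudo_id,
  COUNT(*) AS session_cnt
FROM time_user_session
GROUP BY user_pseudo_id
ORDER BY session_cnt DESC
LIMIT 10;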
In the next step, we need to concatenate source and medium into one column. To do that, we LEFT JOIN the unnested table onto the just-created time_user_session table on the session-start timestamp, so only session-start rows contribute values and we avoid duplicates. We also fill empty sources and mediums with 'not defined'.
Select a tab to see MySQL or Python.
CREATE TABLE start_user_session_source AS
SELECT
t.session_start,
t.user_pseudo_id,
t.session_id,
CONCAT_WS(
'/',
COALESCE(MAX(CASE WHEN rt.kkey = 'source' THEN rt.string_value END), 'not defined'),
COALESCE(MAX(CASE WHEN rt.kkey = 'medium' THEN rt.string_value END), 'not defined')
) AS source_medium
FROM time_user_session t
LEFT JOIN jjson_table rt
ON t.session_start = rt.event_timestamp
AND t.user_pseudo_id = rt.user_pseudo_id
GROUP BY t.session_start, t.user_pseudo_id, t.session_id;
The next step is as brilliant as it is simple. If you know window functions, you know ROW_NUMBER, which numbers the rows within each partition. So we sort the table by session_start and form a column for the 1st row number, another for the 2nd, the 3rd, and so on. It looks like this:
SELECT
user_id,
MAX(CASE WHEN rn = 1 THEN source END) AS step1,
MAX(CASE WHEN rn = 2 THEN source END) AS step2,
MAX(CASE WHEN rn = 3 THEN source END) AS step3
-- Continue if you need more steps
FROM (
SELECT
user_id,
source,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY start_session) AS rn
FROM sessions
) AS ordered
GROUP BY user_id;
But we want a dynamic procedure that builds the whole thing in a single call.
Select a tab to see Python, MySQL or BigQuery.
DELIMITER $$
CREATE PROCEDURE build_user_path()
BEGIN
-- Declare constants for loops
DECLARE max_steps INT DEFAULT 0;
DECLARE i INT DEFAULT 1;
DROP TABLE IF EXISTS jjson_table,time_user_session,start_user_session_source;
CREATE TABLE jjson_table AS
SELECT event_timestamp, user_pseudo_id, kkey, int_value, string_value FROM
ga4_ecom,
JSON_TABLE(
event_params,
"$[*]" COLUMNS
(kkey VARCHAR(100) PATH "$.key",
int_value BIGINT PATH "$.value.int_value",
string_value VARCHAR(100) PATH "$.value.string_value")
) AS tt
WHERE tt.kkey IN ('ga_session_id','source','medium');
CREATE TABLE time_user_session AS
WITH sess AS
(SELECT
event_timestamp,
user_pseudo_id,
kkey,
int_value,
string_value,
MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
FROM jjson_table)
SELECT
MIN(event_timestamp) AS session_start,
user_pseudo_id,
session_id
FROM sess
GROUP BY
user_pseudo_id,
session_id;
CREATE TABLE start_user_session_source AS
SELECT
t.session_start,
t.user_pseudo_id,
t.session_id,
CONCAT_WS(
'/',
COALESCE(MAX(CASE WHEN rt.kkey = 'source' THEN rt.string_value END), 'not defined'),
COALESCE(MAX(CASE WHEN rt.kkey = 'medium' THEN rt.string_value END), 'not defined')
) AS source_medium
FROM time_user_session t
LEFT JOIN jjson_table rt
ON t.session_start = rt.event_timestamp
AND t.user_pseudo_id = rt.user_pseudo_id
GROUP BY t.session_start, t.user_pseudo_id, t.session_id;
-- Maximum number of steps per user in a local procedure variable
SELECT MAX(cnt) INTO max_steps
-- Creates a table user_pseudo_id and step counts
FROM (
SELECT user_pseudo_id, COUNT(*) AS cnt
FROM start_user_session_source
GROUP BY user_pseudo_id
) t;
-- 2) Build CASE expressions into a user variable (PREPARE requires a user var)
SET @case_expressions = '';
SET i = 1;
WHILE i <= max_steps DO
IF i > 1 THEN
SET @case_expressions = CONCAT(@case_expressions, ', ');
END IF;
SET @case_expressions = CONCAT(
@case_expressions,
'MAX(CASE WHEN rn = ', i, ' THEN source_medium END) AS step', i
);
SET i = i + 1;
END WHILE;
-- 3) Build the final SQL into a user variable @sql_text
SET @sql_text = CONCAT(
'SELECT user_pseudo_id, ', @case_expressions,
' FROM (',
'SELECT user_pseudo_id, source_medium, ',
'ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY session_start) AS rn ',
'FROM start_user_session_source',
') AS ordered ',
'GROUP BY user_pseudo_id'
);
-- drop existing target (if any)
DROP TABLE IF EXISTS user_paths;
-- create table from dynamic SQL
SET @create_sql = CONCAT('CREATE TABLE user_paths AS ', @sql_text);
PREPARE stmt FROM @create_sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END $$
DELIMITER ;
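With the procedure in place, the whole pipeline runs in one call and the result lands in the user_paths table:
-- Run the whole pipeline, then inspect the resulting pivot table
CALL build_user_path();
SELECT * FROM user_paths LIMIT 10;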
And the final result is pretty impressive:
user_pseudo_id | step1 | step2 | step3 | step4 | step5 | step6 | step7 | step8 | step9 | step10 | step11 | step12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
10028188.3857887509 | <Other>/referral | (data deleted)/(data deleted) | (data deleted)/(data deleted) | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
10101905.8688293836 | google/organic | shop.googlemerchandisestore.com/referral | shop.googlemerchandisestore.com/referral | not defined/not defined | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
10115718.4867090359 | google/organic | shop.googlemerchandisestore.com/referral | <Other>/<Other> | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
1011939.1001718987 | analytics.google.com/referral | google/organic | (data deleted)/(data deleted) | not defined/not defined | creatoracademy.youtube.com/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
1018731.0091054136 | <Other>/<Other> | (data deleted)/(data deleted) | <Other>/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
1023953.9842241768 | <Other>/<Other> | <Other>/<Other> | <Other>/referral | (direct)/(none) | google/organic | not defined/not defined | not defined/not defined | NULL | NULL | NULL | NULL | NULL |
1026932.0858862293 | <Other>/referral | google/organic | <Other>/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
1033552.6644233006 | analytics.google.com/referral | (direct)/(none) | analytics.google.com/referral | (data deleted)/(data deleted) | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
1033716.6724392262 | (direct)/(none) | shop.googlemerchandisestore.com/referral | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
10404067.7211955661 | google/organic | (data deleted)/(data deleted) | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
And now you are left with a logical question: "Cool! But what am I going to do with 188 unique source/medium values and 6,398 unique user paths (not to mention how users are distributed between them)?"
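Here is how you can get those counts yourself, a minimal sketch assuming the tables created by the procedure above and the 12 step columns shown in the results:
-- Unique source/medium values across all sessions
SELECT COUNT(DISTINCT source_medium) AS unique_source_medium
FROM start_user_session_source;
-- Unique user paths (CONCAT_WS skips NULL steps)
SELECT COUNT(DISTINCT CONCAT_WS(' > ',
  step1, step2, step3, step4, step5, step6,
  step7, step8, step9, step10, step11, step12)) AS unique_user_paths
FROM user_paths;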
That's where logistic regression comes in, and I'm going to cover it next time.