Learn how to practice Google Analytics 4 (GA4) analysis in MySQL by building user paths and tracking source/medium. Step-by-step guide with SQL code, dynamic queries, and session analysis using GA4 event data.
We have a BigQuery table exported from GA4. The full schema explanation is available here, but we are only interested in event_timestamp, user_pseudo_id, and event_params, because we are building User Paths. Our final goal is to turn a raw event table like this:

Event Timestamp | User Pseudo ID | Event Params (Key / Int Value / String Value) |
---|---|---|
1695984000000 | user_123 | ga_session_id / 000000001 / null; source / null / google; medium / null / cpc
1695984060000 | user_123 | ga_session_id / 000000002 / null; source / null / meta; medium / null / cpc
1695984120000 | user_456 | ga_session_id / 000000003 / null; source / null / google; medium / null / organic

into a table of user_pseudo_ids and their source/medium steps:

User Pseudo ID | Step 1 | Step 2 |
---|---|---|
user_123 | google/cpc | meta/cpc
user_456 | google/organic | -
The first thing you face when working with BigQuery is its cost. It's not a problem with small datasets, but what's the point of data analysis practice if the insights fit in a simple CSV file?
Anyway, my practice with BigQuery has always been nerve-racking. For example, I paid a few dollars working with the Google Ads Transparency Center dataset, because even
SELECT COUNT(*) FROM `bigquery-public-data.google_ads_transparency_center.creative_stats`
cost about 150 GB per call. With 1 TB of free usage per month, it takes only 7 calls before you start paying, and if you have any experience with SQL, you know how many calls it takes to make sure everything works.
So, in the end, I don't want to pay extra for practice.
Although this article primarily uses MySQL syntax, I have also translated everything into Python and BigQuery. The Python version is done with Pandas, and if you are familiar with its terminology, it can be easier to follow than SQL, which is table-oriented and requires considerable practice before the logic behind the queries becomes intuitive.
First, you need to copy this dataset to your own BigQuery project so you can download it to your PC. In the next article, I will generate this dataset synthetically, because BigQuery is exactly the cost and dependency we are trying to avoid.
- `my_project` → your project ID
- `my_dataset` → your dataset ID
- `my_table` → the new table you are creating
- The `events_202*` pattern selects all of the GA4 sample's events tables matching that prefix (the export is partitioned by date, one table per day).
BigQuery tables are referenced as: project_id.dataset_id.table_id
Project ID → Your Google Cloud project name. You can see it in the BigQuery console at the top left.
Dataset ID → The dataset (folder) inside your project. It groups related tables.
Table ID → The specific table you want to query or create.
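The copy itself is a single CREATE TABLE ... AS SELECT statement run in the BigQuery console. Here is a minimal sketch, assuming the public GA4 sample dataset (bigquery-public-data.ga4_obfuscated_sample_ecommerce) and the placeholder names above:
-- Run in BigQuery: copy the public GA4 sample events into your own project
-- (dataset ID assumed to be the public GA4 obfuscated e-commerce sample)
CREATE TABLE `my_project.my_dataset.my_table` AS
SELECT *
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_202*`;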
Next, you need to use Python to upload data from BigQuery to MySQL:
import pandas_gbq
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from google.oauth2 import service_account
# --- MySQL connection via SQLAlchemy ---
#TODO fill your local MySQL user and password
engine = create_engine(
"mysql+pymysql://user:password@localhost/mydb?charset=utf8mb4"
)
# --- BigQuery base SQL ---
base_sql = """
SELECT
event_date,
event_timestamp,
event_name,
TO_JSON_STRING(event_params) AS event_params,
event_previous_timestamp,
IFNULL(event_value_in_usd, 0) AS event_value_in_usd,
event_bundle_sequence_id,
event_server_timestamp_offset,
user_id,
user_pseudo_id,
TO_JSON_STRING(privacy_info) AS privacy_info,
TO_JSON_STRING(user_properties) AS user_properties,
user_first_touch_timestamp,
TO_JSON_STRING(user_ltv) AS user_ltv,
TO_JSON_STRING(device) AS device,
TO_JSON_STRING(geo) AS geo,
TO_JSON_STRING(app_info) AS app_info,
TO_JSON_STRING(traffic_source) AS traffic_source,
stream_id,
platform,
TO_JSON_STRING(event_dimensions) AS event_dimensions,
TO_JSON_STRING(ecommerce) AS ecommerce,
TO_JSON_STRING(items) AS items
FROM `my_project.my_dataset.my_table`
"""
credentials = service_account.Credentials.from_service_account_file('secret.json')
pandas_gbq.context.credentials = credentials
pandas_gbq.context.project = "my_project"
total_rows = 4295584
chunk_size = 100000
# Iterate over the table in chunks of chunk_size rows, starting from the first row
for offset in range(0, total_rows, chunk_size):
    sql = f"{base_sql} LIMIT {chunk_size} OFFSET {offset}"
    print(f"Fetching rows {offset} to {offset + chunk_size}...")
    df = pandas_gbq.read_gbq(sql)
    if df.empty:
        print("No more rows to fetch.")
        break
    # Fill numeric NaN with 0
    for col in df.select_dtypes(include=[np.number]).columns:
        df[col] = df[col].fillna(0)
    # Replace text NaN with None
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].where(pd.notnull(df[col]), None)
    # --- Push chunk to MySQL ---
    df.to_sql(
        'ga4_ecom',
        con=engine,
        if_exists='append',
        index=False
    )
    print(f"Inserted chunk ending at row {offset + len(df)}")
print("All data inserted successfully.")
Note that it took 43 queries (chunks) to upload all 4,295,584 rows from the sample dataset.
You need a Google Cloud project with the BigQuery API enabled and a secret key created for a service account. You can read more about credentials here. It looks complex, but in short: go to IAM & Admin -> Service Accounts, create a service account, create a secret key, and grant it access on the IAM page.
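Once the load finishes, a quick sanity check in MySQL (a minimal sketch, assuming the ga4_ecom table name used in the script above):
-- Verify that all rows made it into MySQL
SELECT COUNT(*) AS total_rows FROM ga4_ecom;
-- Expected: 4295584 if the whole sample was transferred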
In this section, I explain the step-by-step process of working with the GA4 data, ending with a stored procedure that builds all the tables in one call.
The first table we need to create is an unnested table of the JSON values in event_params. We're interested in ga_session_id, source, and medium. We go from this:

Event Timestamp | User Pseudo ID | Event Params (Key / Int Value / String Value) |
---|---|---|
1695984000000 | user_123 | ga_session_id / 000000001 / null; source / null / google; medium / null / cpc
1695984060000 | user_123 | ga_session_id / 000000002 / null; source / null / meta; medium / null / cpc
1695984120000 | user_456 | ga_session_id / 000000003 / null; source / null / google; medium / null / organic

to this:
Event Timestamp | User Pseudo ID | Key | Int Value | String Value |
---|---|---|---|---|
1695984000000 | user_123 | ga_session_id | 000000001 | null |
1695984000000 | user_123 | source | null | google
1695984000000 | user_123 | medium | null | cpc |
1695984060000 | user_123 | ga_session_id | 000000002 | null |
1695984060000 | user_123 | source | null | meta |
1695984060000 | user_123 | medium | null | cpc |
1695984120000 | user_456 | ga_session_id | 000000003 | null |
1695984120000 | user_456 | source | null | google
1695984120000 | user_456 | medium | null | organic |
These queries create the first table, jjson_table_sample, in three variants. We uploaded the data into the ga4_ecom table, but we'll first work with a smaller test_sample table created with this query:
CREATE TABLE test_sample AS SELECT * FROM ga4_ecom LIMIT 10000
This is because the final query over the full table takes a lot of time. Select a tab to see the Python (pandas), MySQL (JSON_TABLE), or BigQuery (UNNEST + JSON_VALUE) variant.
JSON_TABLE
CREATE TABLE jjson_table_sample AS
SELECT
t.event_timestamp,
t.user_pseudo_id,
tt.kkey,
tt.int_value,
tt.string_value
FROM test_sample AS t
JOIN JSON_TABLE(
t.event_params,
"$[*]" COLUMNS (
kkey VARCHAR(100) PATH "$.key",
int_value BIGINT PATH "$.value.int_value",
string_value VARCHAR(255) PATH "$.value.string_value"
)
) AS tt
WHERE tt.kkey IN ('ga_session_id','source','medium');
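To sanity-check the result, you can count rows per extracted key in the table we just created; only the three filtered keys should appear:
-- Quick check: which keys survived the filter and how many rows each has
SELECT
  kkey,
  COUNT(*) AS rows_per_key
FROM jjson_table_sample
GROUP BY kkey;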
Note that the GA4 table contains many events, such as page_view, scroll, begin_checkout, and purchase, and we don't need all of them separately. So we assign a session_id to every (event_timestamp, user_pseudo_id) partition by taking the MAX of int_value over that partition:
Event Timestamp | User Pseudo ID | Key | Int Value | String Value | Session ID |
---|---|---|---|---|---|
1695984000000 (page_view) | user_123 | ga_session_id | 000000001 | null | 000000001
1695984000000 (page_view) | user_123 | source | null | google | 000000001
1695984000000 (page_view) | user_123 | medium | null | cpc | 000000001
1695984100000 (purchase) | user_123 | ga_session_id | 000000001 | null | 000000001
1695984100000 (purchase) | user_123 | source | null | google | 000000001
1695984100000 (purchase) | user_123 | medium | null | cpc | 000000001
1695984060000 | user_123 | ga_session_id | 000000002 | null | 000000002
1695984060000 | user_123 | source | null | meta | 000000002
1695984060000 | user_123 | medium | null | cpc | 000000002
1695984120000 | user_456 | ga_session_id | 000000003 | null | 000000003
1695984120000 | user_456 | source | null | google | 000000003
1695984120000 | user_456 | medium | null | organic | 000000003
Select a tab to see MySQL or Python.
SELECT
event_timestamp,
user_pseudo_id,
kkey,
int_value,
string_value,
MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
FROM jjson_table;
Now let's form a table with session_start, user_pseudo_id, and session_id by taking the minimum event_timestamp grouped by user_pseudo_id and session_id:
Session start | User Pseudo ID | Session ID |
---|---|---|
1695984000000 | user_123 | 000000001 |
1695984060000 | user_123 | 000000002 |
1695984120000 | user_456 | 000000003 |
Select a tab to see MySQL or Python.
CREATE TABLE time_user_session AS
WITH sess AS
(SELECT
event_timestamp,
user_pseudo_id,
kkey,
int_value,
string_value,
MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
FROM jjson_table)
SELECT
MIN(event_timestamp) AS session_start,
user_pseudo_id,
session_id
FROM sess
GROUP BY
user_pseudo_id,
session_id;
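Before moving on, it can be useful to see how many sessions each user has; the maximum of this count is what the dynamic procedure below uses to decide how many step columns to create. A minimal sketch against the table we just built:
-- Sessions per user; the largest value defines the width of the final pivot
SELECT
  user_pseudo_id,
  COUNT(*) AS session_cnt
FROM time_user_session
GROUP BY user_pseudo_id
ORDER BY session_cnt DESC
LIMIT 10;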
In the next step, we need to concatenate source and medium into one column. To do that, we LEFT JOIN the unnested table onto the just-created time_user_session table on the session-start timestamp, so only session-start rows contribute values and we avoid duplicates. We also fill empty sources and mediums with 'not defined'.
Select a tab to see MySQL or Python.
CREATE TABLE start_user_session_source AS
SELECT
t.session_start,
t.user_pseudo_id,
t.session_id,
CONCAT_WS(
'/',
COALESCE(MAX(CASE WHEN rt.kkey = 'source' THEN rt.string_value END), 'not defined'),
COALESCE(MAX(CASE WHEN rt.kkey = 'medium' THEN rt.string_value END), 'not defined')
) AS source_medium
FROM time_user_session t
LEFT JOIN jjson_table rt
ON t.session_start = rt.event_timestamp
AND t.user_pseudo_id = rt.user_pseudo_id
GROUP BY t.session_start, t.user_pseudo_id, t.session_id;
The next step is as brilliant as it is simple. If you know window functions, you know ROW_NUMBER, which numbers the rows within each partition. So we sort the table by session_start and form a column for the 1st row number, another for the 2nd, the 3rd, and so on. It looks like this:
SELECT
user_id,
MAX(CASE WHEN rn = 1 THEN source END) AS step1,
MAX(CASE WHEN rn = 2 THEN source END) AS step2,
MAX(CASE WHEN rn = 3 THEN source END) AS step3
-- Continue if you need more steps
FROM (
SELECT
user_id,
source,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY start_session) AS rn
FROM sessions
) AS ordered
GROUP BY user_id;
But we want a dynamic procedure that builds the whole thing in a single call.
Select a tab to see Python, MySQL or BigQuery.
DELIMITER $$
CREATE PROCEDURE build_user_path()
BEGIN
-- Declare constants for loops
DECLARE max_steps INT DEFAULT 0;
DECLARE i INT DEFAULT 1;
DROP TABLE IF EXISTS jjson_table,time_user_session,start_user_session_source;
CREATE TABLE jjson_table AS
SELECT event_timestamp, user_pseudo_id, kkey, int_value, string_value FROM
ga4_ecom,
JSON_TABLE(
event_params,
"$[*]" COLUMNS
(kkey VARCHAR(100) PATH "$.key",
int_value BIGINT PATH "$.value.int_value",
string_value VARCHAR(100) PATH "$.value.string_value")
) AS tt
WHERE tt.kkey IN ('ga_session_id','source','medium');
CREATE TABLE time_user_session AS
WITH sess AS
(SELECT
event_timestamp,
user_pseudo_id,
kkey,
int_value,
string_value,
MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
FROM jjson_table)
SELECT
MIN(event_timestamp) AS session_start,
user_pseudo_id,
session_id
FROM sess
GROUP BY
user_pseudo_id,
session_id;
CREATE TABLE start_user_session_source AS
SELECT
t.session_start,
t.user_pseudo_id,
t.session_id,
CONCAT_WS(
'/',
COALESCE(MAX(CASE WHEN rt.kkey = 'source' THEN rt.string_value END), 'not defined'),
COALESCE(MAX(CASE WHEN rt.kkey = 'medium' THEN rt.string_value END), 'not defined')
) AS source_medium
FROM time_user_session t
LEFT JOIN jjson_table rt
ON t.session_start = rt.event_timestamp
AND t.user_pseudo_id = rt.user_pseudo_id
GROUP BY t.session_start, t.user_pseudo_id, t.session_id;
-- Maximum number of steps per user in a local procedure variable
SELECT MAX(cnt) INTO max_steps
-- Creates a table user_pseudo_id and step counts
FROM (
SELECT user_pseudo_id, COUNT(*) AS cnt
FROM start_user_session_source
GROUP BY user_pseudo_id
) t;
-- 2) Build CASE expressions into a user variable (PREPARE requires a user var)
SET @case_expressions = '';
SET i = 1;
WHILE i <= max_steps DO
IF i > 1 THEN
SET @case_expressions = CONCAT(@case_expressions, ', ');
END IF;
SET @case_expressions = CONCAT(
@case_expressions,
'MAX(CASE WHEN rn = ', i, ' THEN source_medium END) AS step', i
);
SET i = i + 1;
END WHILE;
-- 3) Build the final SQL into a user variable @sql_text
SET @sql_text = CONCAT(
'SELECT user_pseudo_id, ', @case_expressions,
' FROM (',
'SELECT user_pseudo_id, source_medium, ',
'ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY session_start) AS rn ',
'FROM start_user_session_source',
') AS ordered ',
'GROUP BY user_pseudo_id'
);
-- drop existing target (if any)
DROP TABLE IF EXISTS user_paths;
-- create table from dynamic SQL
SET @create_sql = CONCAT('CREATE TABLE user_paths AS ', @sql_text);
PREPARE stmt FROM @create_sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END $$
DELIMITER ;
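With the procedure in place, the whole pipeline runs in one call and the result lands in the user_paths table:
-- Run the whole pipeline, then inspect the resulting pivot table
CALL build_user_path();
SELECT * FROM user_paths LIMIT 10;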
And the final result is pretty impressive:
user_pseudo_id | step1 | step2 | step3 | step4 | step5 | step6 | step7 | step8 | step9 | step10 | step11 | step12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
10028188.3857887509 | <Other>/referral | (data deleted)/(data deleted) | (data deleted)/(data deleted) | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
10101905.8688293836 | google/organic | shop.googlemerchandisestore.com/referral | shop.googlemerchandisestore.com/referral | not defined/not defined | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
10115718.4867090359 | google/organic | shop.googlemerchandisestore.com/referral | <Other>/<Other> | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
1011939.1001718987 | analytics.google.com/referral | google/organic | (data deleted)/(data deleted) | not defined/not defined | creatoracademy.youtube.com/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
1018731.0091054136 | <Other>/<Other> | (data deleted)/(data deleted) | <Other>/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
1023953.9842241768 | <Other>/<Other> | <Other>/<Other> | <Other>/referral | (direct)/(none) | google/organic | not defined/not defined | not defined/not defined | NULL | NULL | NULL | NULL | NULL |
1026932.0858862293 | <Other>/referral | google/organic | <Other>/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
1033552.6644233006 | analytics.google.com/referral | (direct)/(none) | analytics.google.com/referral | (data deleted)/(data deleted) | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
1033716.6724392262 | (direct)/(none) | shop.googlemerchandisestore.com/referral | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
10404067.7211955661 | google/organic | (data deleted)/(data deleted) | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
And now you are left with a logical question: "Cool! But what am I going to do with 188 unique source/medium values and 6,398 unique user paths (not to mention how users are distributed between them)?"
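Here is how you can get those counts yourself, a minimal sketch assuming the tables created by the procedure above and the 12 step columns shown in the results:
-- Unique source/medium values across all sessions
SELECT COUNT(DISTINCT source_medium) AS unique_source_medium
FROM start_user_session_source;
-- Unique user paths (CONCAT_WS skips NULL steps)
SELECT COUNT(DISTINCT CONCAT_WS(' > ',
  step1, step2, step3, step4, step5, step6,
  step7, step8, step9, step10, step11, step12)) AS unique_user_paths
FROM user_paths;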
That's where logistic regression comes in, and I'm going to cover it next time.