How to Practice GA4 Analysis in MySQL/BigQuery: Build User Paths by Source/Medium

By: Roman Myskin - Sept. 29, 2025


Learn how to practice Google Analytics 4 (GA4) analysis in MySQL by building user paths and tracking source/medium. Step-by-step guide with SQL code, dynamic queries, and session analysis using GA4 event data.

Idea

We have a BigQuery table imported from GA4. The full schema explanation is available here, but we are interested in event_timestamp, user_pseudo_id, and event_params, because we only need User Paths. Our final goal is a table of user_pseudo_ids and their source/medium steps.

GA4 Raw Event Data

Event Timestamp | User Pseudo ID | Event Params (Key / Int Value / String Value)
1695984000000 | user_123
  • key: ga_session_id, int_value: 000000001, string_value: null
  • key: source, int_value: null, string_value: google
  • key: medium, int_value: null, string_value: cpc
1695984060000 | user_123
  • key: ga_session_id, int_value: 000000002, string_value: null
  • key: source, int_value: null, string_value: meta
  • key: medium, int_value: null, string_value: cpc
1695984120000 | user_456
  • key: ga_session_id, int_value: 000000003, string_value: null
  • key: source, int_value: null, string_value: google
  • key: medium, int_value: null, string_value: organic

Transformed User Steps

User Pseudo ID | Step 1 | Step 2
user_123 | google/cpc | meta/cpc
user_456 | google/organic | -

Why MySQL?

The first thing you face when working with BigQuery is its cost. That's not an issue with small datasets, but what's the point of data analysis practice if the insights fit in a simple CSV file?

Anyway, my experience with BigQuery has always been nerve-racking. For example, I paid a few dollars working with the Google Ads Transparency Center dataset, because even

SELECT COUNT(*) FROM `bigquery-public-data.google_ads_transparency_center.creative_stats`

cost about 150 GB of scanned data per call. With 1 TB of free usage per month, that's only about 7 calls (1 TB / 150 GB ≈ 6.7) before you start paying, and if you have experience with SQL, you know it takes many calls to make sure everything works.

So, bottom line: I don't want to pay extra just to practice.

Although this article primarily uses MySQL syntax, I have also translated the queries into Python and BigQuery. The Python version uses pandas, and if you are familiar with its terminology, it can be easier to follow than SQL, which is table-oriented and takes considerable practice before the logic behind the queries becomes intuitive.

Download the dataset

First, you need to copy this dataset into your BigQuery project so you can download it to your PC. In the next article, I will generate this dataset synthetically, because the BigQuery sample dataset for the Google Analytics 4 ecommerce web implementation contains natural data but is not well suited for User Path analysis (it has only one cpc source/medium: google/cpc).

CREATE TABLE `my_project.my_dataset.my_table` AS SELECT * FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_202*`

The events_202* wildcard matches all of the date-sharded daily events_YYYYMMDD tables in the sample dataset.

BigQuery tables are referenced as: project_id.dataset_id.table_id

Next, use Python to pull the data from BigQuery and load it into MySQL:

import pandas_gbq
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from google.oauth2 import service_account
# --- MySQL connection via SQLAlchemy ---
#TODO fill your local MySQL user and password
engine = create_engine(
    "mysql+pymysql://user:password@localhost/mydb?charset=utf8mb4"
)

# --- BigQuery base SQL ---
base_sql = """ 
SELECT
  event_date,
  event_timestamp,
  event_name,
  TO_JSON_STRING(event_params) AS event_params,
  event_previous_timestamp,
  IFNULL(event_value_in_usd, 0) AS event_value_in_usd,
  event_bundle_sequence_id,
  event_server_timestamp_offset,
  user_id,
  user_pseudo_id,
  TO_JSON_STRING(privacy_info) AS privacy_info,
  TO_JSON_STRING(user_properties) AS user_properties,
  user_first_touch_timestamp,
  TO_JSON_STRING(user_ltv) AS user_ltv,
  TO_JSON_STRING(device) AS device,
  TO_JSON_STRING(geo) AS geo,
  TO_JSON_STRING(app_info) AS app_info,
  TO_JSON_STRING(traffic_source) AS traffic_source,
  stream_id,
  platform,
  TO_JSON_STRING(event_dimensions) AS event_dimensions,
  TO_JSON_STRING(ecommerce) AS ecommerce,
  TO_JSON_STRING(items) AS items
FROM `my_project.my_dataset.my_table`
"""
credentials = service_account.Credentials.from_service_account_file('secret.json')
pandas_gbq.context.credentials = credentials
pandas_gbq.context.project = "my_project"
total_rows = 4295584  # total row count of the sample dataset
chunk_size = 100000

for offset in range(0, total_rows, chunk_size):  # start at 0 so the first chunk is included
    # NOTE: LIMIT/OFFSET without ORDER BY is not guaranteed to be
    # deterministic, but it is acceptable for a one-off bulk copy
    sql = f"{base_sql} LIMIT {chunk_size} OFFSET {offset}"
    print(f"Fetching rows {offset} to {offset + chunk_size}...")

    df = pandas_gbq.read_gbq(sql)

    if df.empty:
        print("No more rows to fetch.")
        break

    # Fill numeric NaN with 0
    for col in df.select_dtypes(include=[np.number]).columns:
        df[col] = df[col].fillna(0)

    # Replace text NaN with None
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].where(pd.notnull(df[col]), None)

    # --- Push chunk to MySQL ---
    df.to_sql(
        'ga4_ecom',
        con=engine,
        if_exists='append',
        index=False
    )

    print(f"Inserted chunk ending at row {offset + len(df)}")

print("All data inserted successfully.")

Note: it took 43 chunked queries to load all 4,295,584 rows from the sample dataset.

You need a Google Cloud project with the BigQuery API enabled and a secret key from a service account. You can read more about credentials here; it looks complex, but in short: go to IAM & Admin -> Service Accounts, create a service account, create a secret key, and grant access on the IAM page.

MySQL queries

In this section, I explain the step-by-step process of working with GA4 data, ending with a stored procedure that creates all the tables in one CALL.

The first table we need to create is an unnested table of the JSON values in event_params. We're interested in the ga_session_id, source, and medium keys.


jjson_table with the unnested event_params column

Event Timestamp | User Pseudo ID | Key | Int Value | String Value
1695984000000 | user_123 | ga_session_id | 000000001 | null
1695984000000 | user_123 | source | null | google
1695984000000 | user_123 | medium | null | cpc
1695984060000 | user_123 | ga_session_id | 000000002 | null
1695984060000 | user_123 | source | null | meta
1695984060000 | user_123 | medium | null | cpc
1695984120000 | user_456 | ga_session_id | 000000003 | null
1695984120000 | user_456 | source | null | google
1695984120000 | user_456 | medium | null | organic

Create jjson_table_sample — three variants

These queries create the first table. We loaded the data into the ga4_ecom table, but because the final queries take a long time on the full table, we'll first practice on a smaller test_sample table created with this query:

CREATE TABLE test_sample AS SELECT * FROM ga4_ecom LIMIT 10000

The three variants are Python (pandas), MySQL (JSON_TABLE), and BigQuery (UNNEST + JSON_VALUE); the MySQL version is shown below, followed by a pandas sketch.

MySQL — using JSON_TABLE
CREATE TABLE jjson_table_sample AS
SELECT
  t.event_timestamp,
  t.user_pseudo_id,
  tt.kkey,
  tt.int_value,
  tt.string_value
FROM test_sample AS t
JOIN JSON_TABLE(
  t.event_params,
  "$[*]" COLUMNS (
    kkey VARCHAR(100) PATH "$.key",
    int_value BIGINT PATH "$.value.int_value",
    string_value VARCHAR(255) PATH "$.value.string_value"
  )
) AS tt
WHERE tt.kkey IN ('ga_session_id','source','medium');
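
For the pandas-minded, here is a minimal sketch of the same unnesting step. It assumes test_sample has been read into a DataFrame with the same columns (iterrows is slow, but fine for a 10,000-row sample):

import json
import pandas as pd

# test_sample: DataFrame with the same columns as the MySQL table;
# event_params holds the JSON string produced by TO_JSON_STRING
def unnest_params(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for _, r in df.iterrows():
        for p in json.loads(r["event_params"]):
            if p["key"] in ("ga_session_id", "source", "medium"):
                rows.append({
                    "event_timestamp": r["event_timestamp"],
                    "user_pseudo_id": r["user_pseudo_id"],
                    "kkey": p["key"],
                    "int_value": p["value"].get("int_value"),
                    "string_value": p["value"].get("string_value"),
                })
    return pd.DataFrame(rows)

jjson_table = unnest_params(test_sample)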

Note that the GA4 table contains many event types (page_view, scroll, begin_checkout, purchase, ...), and all events within one session share the same ga_session_id. Within each (event_timestamp, user_pseudo_id) partition, only the ga_session_id row has int_value set, so taking MAX(int_value) over that partition copies the session id onto the source and medium rows too.

Applied Session ID

Event Timestamp | User Pseudo ID | Key | Int Value | String Value | Session ID
1695984000000 (page_view) | user_123 | ga_session_id | 000000001 | null | 000000001
1695984000000 (page_view) | user_123 | source | null | google | 000000001
1695984000000 (page_view) | user_123 | medium | null | cpc | 000000001
1695984100000 (purchase) | user_123 | ga_session_id | 000000001 | null | 000000001
1695984100000 (purchase) | user_123 | source | null | google | 000000001
1695984100000 (purchase) | user_123 | medium | null | cpc | 000000001
1695984060000 | user_123 | ga_session_id | 000000002 | null | 000000002
1695984060000 | user_123 | source | null | meta | 000000002
1695984060000 | user_123 | medium | null | cpc | 000000002
1695984120000 | user_456 | ga_session_id | 000000003 | null | 000000003
1695984120000 | user_456 | source | null | google | 000000003
1695984120000 | user_456 | medium | null | organic | 000000003


MySQL/BigQuery

SELECT 
    event_timestamp, 
    user_pseudo_id, 
    kkey, 
    int_value, 
    string_value,
    MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
FROM jjson_table;
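
In pandas, the same windowing is a one-line groupby transform. A minimal sketch, assuming jjson_table is the DataFrame built above:

# broadcast the per-(event_timestamp, user_pseudo_id) max of int_value
# back onto every row of the group, like MAX(...) OVER(PARTITION BY ...)
jjson_table["session_id"] = (
    jjson_table.groupby(["event_timestamp", "user_pseudo_id"])["int_value"]
    .transform("max")
)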

Now let's form a table with session_start, user_pseudo_id, and session_id by taking the minimum event_timestamp, grouped by user_pseudo_id and session_id:

Session Start | User Pseudo ID | Session ID
1695984000000 | user_123 | 000000001
1695984060000 | user_123 | 000000002
1695984120000 | user_456 | 000000003


MySQL/BigQuery

  CREATE TABLE time_user_session AS
  WITH sess AS 
  (SELECT 
    event_timestamp, 
    user_pseudo_id, 
    kkey, 
    int_value, 
    string_value,
    MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
  FROM jjson_table)
  SELECT 
    MIN(event_timestamp) AS session_start, 
    user_pseudo_id, 
    session_id
  FROM sess
  GROUP BY 
    user_pseudo_id, 
    session_id;
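
The pandas counterpart is a plain groupby aggregation (again, a sketch under the same assumptions):

# earliest event_timestamp per (user, session) pair
time_user_session = (
    jjson_table.groupby(["user_pseudo_id", "session_id"], as_index=False)
    .agg(session_start=("event_timestamp", "min"))
)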

Next, we need to concatenate source and medium into one column. To do that, we LEFT JOIN the unnested jjson_table to the just-created time_user_session table on session start (which filters out non-session-start rows and prevents duplicates). We also fill missing sources and mediums with 'not defined'.


MySQL/BigQuery

  CREATE TABLE start_user_session_source AS
  SELECT 
    t.session_start,
    t.user_pseudo_id,
    t.session_id,
    CONCAT_WS(
      '/',
      COALESCE(MAX(CASE WHEN rt.kkey = 'source' THEN rt.string_value END), 'not defined'),
      COALESCE(MAX(CASE WHEN rt.kkey = 'medium' THEN rt.string_value END), 'not defined')
    ) AS source_medium
  FROM time_user_session t
  LEFT JOIN jjson_table rt
       ON t.session_start = rt.event_timestamp
      AND t.user_pseudo_id = rt.user_pseudo_id
  GROUP BY t.session_start, t.user_pseudo_id, t.session_id;
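
A rough pandas sketch of the same join-and-concatenate step; here pivot_table plays the role of the conditional MAX aggregation:

# join the unnested rows back on (session_start, user) to keep only
# session-start rows, then spread kkey values into columns
merged = time_user_session.merge(
    jjson_table[["event_timestamp", "user_pseudo_id", "kkey", "string_value"]],
    left_on=["session_start", "user_pseudo_id"],
    right_on=["event_timestamp", "user_pseudo_id"],
    how="left",
)
wide = merged.pivot_table(
    index=["session_start", "user_pseudo_id", "session_id"],
    columns="kkey",
    values="string_value",
    aggfunc="max",
).reset_index()
for col in ("source", "medium"):
    if col not in wide.columns:
        wide[col] = None  # session carried no source/medium params at all
    wide[col] = wide[col].fillna("not defined")
wide["source_medium"] = wide["source"] + "/" + wide["medium"]
start_user_session_source = wide[
    ["session_start", "user_pseudo_id", "session_id", "source_medium"]
]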

The next step is as brilliant as it is simple. If you know window functions, ROW_NUMBER assigns a sequence number to the rows of every partition. So we sort each user's sessions by session_start and spread the 1st, 2nd, 3rd, and so on into columns. It looks like this:

SELECT
  user_id,
  MAX(CASE WHEN rn = 1 THEN source END) AS step1,
  MAX(CASE WHEN rn = 2 THEN source END) AS step2,
  MAX(CASE WHEN rn = 3 THEN source END) AS step3
  -- Continue if you need more steps
FROM (
  SELECT
    user_id,
    source,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY start_session) AS rn
  FROM sessions
) AS ordered
GROUP BY user_id;
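
pandas sidesteps the fixed column list entirely: cumcount numbers each user's sessions in time order, and a single pivot spreads them into as many step columns as the data needs. A minimal sketch:

ordered = start_user_session_source.sort_values("session_start").copy()
# 1-based rank of each user's sessions, the pandas analogue of ROW_NUMBER
ordered["rn"] = ordered.groupby("user_pseudo_id").cumcount() + 1
user_paths = (
    ordered.pivot(index="user_pseudo_id", columns="rn", values="source_medium")
    .rename(columns=lambda n: f"step{n}")
    .reset_index()
)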

But the number of steps varies between datasets, so in MySQL we want a dynamic procedure that builds the column list itself and runs with a single CALL.


MySQL
DELIMITER $$

CREATE PROCEDURE build_user_path()
BEGIN
  -- Declare loop variables
  DECLARE max_steps INT DEFAULT 0;
  DECLARE i INT DEFAULT 1;

  -- Rebuild the intermediate tables from scratch
  DROP TABLE IF EXISTS jjson_table, time_user_session, start_user_session_source;

  CREATE TABLE jjson_table AS
  SELECT event_timestamp, user_pseudo_id, kkey, int_value, string_value FROM 
    ga4_ecom, 
    JSON_TABLE(
      event_params,
      "$[*]" COLUMNS
        (kkey VARCHAR(100) PATH "$.key",
        int_value BIGINT PATH "$.value.int_value",
        string_value VARCHAR(100) PATH "$.value.string_value")
    ) AS tt
  WHERE tt.kkey IN ('ga_session_id','source','medium');

  CREATE TABLE time_user_session AS
  WITH sess AS 
  (SELECT 
    event_timestamp, 
    user_pseudo_id, 
    kkey, 
    int_value, 
    string_value,
    MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
  FROM jjson_table)
  SELECT 
    MIN(event_timestamp) AS session_start, 
    user_pseudo_id, 
    session_id
  FROM sess
  GROUP BY 
    user_pseudo_id, 
    session_id;

  CREATE TABLE start_user_session_source AS
  SELECT 
    t.session_start,
    t.user_pseudo_id,
    t.session_id,
    CONCAT_WS(
      '/',
      COALESCE(MAX(CASE WHEN rt.kkey = 'source' THEN rt.string_value END), 'not defined'),
      COALESCE(MAX(CASE WHEN rt.kkey = 'medium' THEN rt.string_value END), 'not defined')
    ) AS source_medium
  FROM time_user_session t
  LEFT JOIN jjson_table rt
       ON t.session_start = rt.event_timestamp
      AND t.user_pseudo_id = rt.user_pseudo_id
  GROUP BY t.session_start, t.user_pseudo_id, t.session_id;

  -- 1) Find the maximum number of steps (sessions) per user
  SELECT MAX(cnt) INTO max_steps
  FROM (
    SELECT user_pseudo_id, COUNT(*) AS cnt
    FROM start_user_session_source
    GROUP BY user_pseudo_id
  ) t;


  -- 2) Build CASE expressions into a user variable (PREPARE requires a user var)
  SET @case_expressions = '';
  SET i = 1;
  WHILE i <= max_steps DO
    IF i > 1 THEN
      SET @case_expressions = CONCAT(@case_expressions, ', ');
    END IF;
    SET @case_expressions = CONCAT(
      @case_expressions,
      'MAX(CASE WHEN rn = ', i, ' THEN source_medium END) AS step', i
    );
    SET i = i + 1;
  END WHILE;

  -- 3) Build the final SQL into a user variable @sql_text
  SET @sql_text = CONCAT(
  'SELECT user_pseudo_id, ', @case_expressions,
  ' FROM (',
    'SELECT user_pseudo_id, source_medium, ',
    'ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY session_start) AS rn ',
    'FROM start_user_session_source',
  ') AS ordered ',
  'GROUP BY user_pseudo_id'
  );

  -- drop existing target (if any)
  DROP TABLE IF EXISTS user_paths;

  -- create table from dynamic SQL
  SET @create_sql = CONCAT('CREATE TABLE user_paths AS ', @sql_text);
  PREPARE stmt FROM @create_sql;
  EXECUTE stmt;
  DEALLOCATE PREPARE stmt;
    
END $$
DELIMITER ;
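
With the procedure in place, rebuilding everything is a single statement: CALL build_user_path();. The result lands in the user_paths table (SELECT * FROM user_paths). Just be prepared for it to run for a while on the full ga4_ecom table.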

And the final result is pretty impressive:

User Path Table

user_pseudo_id | step1 | step2 | step3 | step4 | step5 | step6 | step7 | step8 | step9 | step10 | step11 | step12
10028188.3857887509 | <Other>/referral | (data deleted)/(data deleted) | (data deleted)/(data deleted) | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL
10101905.8688293836 | google/organic | shop.googlemerchandisestore.com/referral | shop.googlemerchandisestore.com/referral | not defined/not defined | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL
10115718.4867090359 | google/organic | shop.googlemerchandisestore.com/referral | <Other>/<Other> | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL
1011939.1001718987 | analytics.google.com/referral | google/organic | (data deleted)/(data deleted) | not defined/not defined | creatoracademy.youtube.com/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL
1018731.0091054136 | <Other>/<Other> | (data deleted)/(data deleted) | <Other>/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL
1023953.9842241768 | <Other>/<Other> | <Other>/<Other> | <Other>/referral | (direct)/(none) | google/organic | not defined/not defined | not defined/not defined | NULL | NULL | NULL | NULL | NULL
1026932.0858862293 | <Other>/referral | google/organic | <Other>/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL
1033552.6644233006 | analytics.google.com/referral | (direct)/(none) | analytics.google.com/referral | (data deleted)/(data deleted) | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL
1033716.6724392262 | (direct)/(none) | shop.googlemerchandisestore.com/referral | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL
10404067.7211955661 | google/organic | (data deleted)/(data deleted) | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL

And now you'll ask a logical question: "Cool! But what am I going to do with 188 unique source/mediums and 6,398 unique user paths (never mind how users are distributed between them)?"

That's where Logistic Regression comes in, which I'm going to cover next time.


