Chore: Added stage tables for portal pageviews #1267

Open: wants to merge 51 commits into base: master
Showing changes from 32 commits

Commits
2e0cdbc
Added stage tables
munish7771 May 2, 2023
f0acfe5
Update base_portal_prod__pageviews.sql
munish7771 May 2, 2023
3b05b3d
Revert "Update base_portal_prod__pageviews.sql"
munish7771 May 2, 2023
3473a19
Revert "Added stage tables"
munish7771 May 2, 2023
1ba4685
Update stage models
munish7771 May 2, 2023
d9d1475
Update base_portal_prod__pageviews.sql
munish7771 May 2, 2023
8e06fc1
Update base_portal_prod__pageviews.sql
munish7771 May 2, 2023
76e37d9
Update base_portal_prod__pageviews.sql
munish7771 May 2, 2023
e499ede
Update base_portal_prod__pageviews.sql
munish7771 May 2, 2023
8a75f26
Update base_portal_prod__pageviews.sql
munish7771 May 2, 2023
024e972
Update base_portal_prod__pageviews.sql
munish7771 May 2, 2023
67d054c
Added new base models
munish7771 May 2, 2023
71c2d47
Update base_portal_prod__identifies.sql
munish7771 May 2, 2023
b681562
Update stg_portal_prod__pageviews.sql
munish7771 May 2, 2023
6a9748d
Added Intermediate model
munish7771 May 3, 2023
3c56450
Update int_portal_prod_signups.sql
munish7771 May 3, 2023
864850e
Updated intermediate model
munish7771 May 3, 2023
3a7ff3e
Update int_portal_prod_signups_aggregated_to_users.sql
munish7771 May 3, 2023
cad413a
Update int_portal_prod_signups_aggregated_to_users.sql
munish7771 May 3, 2023
fda7444
Update int_portal_prod_signups_aggregated_to_users.sql
munish7771 May 3, 2023
17a8feb
Some more changes
munish7771 May 3, 2023
c01aa85
Update int_portal_prod_signups_aggregated_to_users.sql
munish7771 May 3, 2023
ac2c852
Update int_portal_prod_signups_aggregated_to_users.sql
munish7771 May 3, 2023
280b8dc
Update int_portal_prod_signups_aggregated_to_users.sql
munish7771 May 3, 2023
7794524
Update int_portal_prod_signups_aggregated_to_users.sql
munish7771 May 3, 2023
3d0d7c1
Test changes
munish7771 May 3, 2023
e5c35df
Update int_portal_prod_signups_aggregated_to_users.sql
munish7771 May 3, 2023
5c7cb65
Added documentation
munish7771 May 3, 2023
e20fe5d
Update _portal_prod__models.yml
munish7771 May 3, 2023
82e5231
Update _int_signup__models.yml
munish7771 May 3, 2023
5838893
Update int_signups_aggregated_to_users.sql
munish7771 May 3, 2023
3e3426c
Update int_signups_aggregated_to_users.sql
munish7771 May 3, 2023
4375545
Updated base models
munish7771 May 4, 2023
0f171eb
Updated intermediate tables
munish7771 May 4, 2023
5fcf837
Update stg_portal_prod__identifies.sql
munish7771 May 4, 2023
e55f634
Update int_rudder_portal_user_mapping.sql
munish7771 May 4, 2023
fe86592
Create int_user_signup_stages.sql
munish7771 May 4, 2023
8d8052c
Update int_rudder_portal_user_mapping.sql
munish7771 May 4, 2023
14557c9
Update int_user_signup_stages.sql
munish7771 May 4, 2023
26ac99a
Update int_user_signup_stages.sql
munish7771 May 4, 2023
f64a115
Update int_rudder_portal_user_mapping.sql
munish7771 May 4, 2023
8b5aa68
Update int_user_signup_stages.sql
munish7771 May 4, 2023
1ac6047
Update int_user_signup_stages.sql
munish7771 May 4, 2023
23427ec
Update int_user_signup_stages.sql
munish7771 May 4, 2023
b618102
Update int_user_signup_stages.sql
munish7771 May 4, 2023
0243f29
Update _int_signup__models.yml
munish7771 May 4, 2023
4f1d7be
Update int_rudder_portal_user_mapping.sql
munish7771 May 4, 2023
fb51c91
Update int_rudder_portal_user_mapping.sql
munish7771 May 4, 2023
24842b5
Update int_rudder_portal_user_mapping.sql
munish7771 May 4, 2023
ddccc72
Changed event table name
munish7771 May 4, 2023
e9cbd79
Review changes
munish7771 May 5, 2023
@@ -0,0 +1,17 @@
version: 2
Review comment (@ifoukarakis, Contributor, May 3, 2023):
A better home for signup-related models would be a directory named after the team responsible for this flow. Placing them under data_eng feels a bit strange.


models:
- name: int_signups_aggregated_to_users
description: User signup stages, aggregated by users.

columns:
- name: portal_customer_id
description: Customer identifier that joins to customer info coming from stripe.
tests:
ifoukarakis marked this conversation as resolved.
- not_null
- name: account_created
description: Boolean value indicating if the user created the account.
- name: email_verified
description: Boolean value indicating if the user verified the email address.
- name: workspace_created
description: Boolean value indicating if the user created the workspace, defaults to false.
@@ -0,0 +1,45 @@
{{
Review comment (Contributor):
If I understand correctly, this model captures the users that have completed signup. The only aggregation seems to be there for deduplication (?). Perhaps the model could be named int_user_signups.

config({
"materialized": "table",
"incremental_strategy": "merge",
Review comment (Contributor):
An incremental strategy is defined, but the model is materialized as a table, not as an incremental model.
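
A minimal sketch of the two consistent options, assuming the adapter in use supports the `merge` strategy (Snowflake does, for example). Which option fits depends on whether the model should really be rebuilt on every run. The sketch also assumes `email_verified` was the intended update column, since the config lists `verify_email` while the model outputs `email_verified`.

```sql
{# Option A (assumption: rebuilding the whole table on every run is acceptable).
   Keep the table materialization and drop the incremental-only settings. #}
{{ config(materialized="table") }}

{# Option B (assumption: the model should be merged incrementally).
   Replace the call above with:

   config(
       materialized="incremental",
       incremental_strategy="merge",
       unique_key="portal_customer_id",
       merge_update_columns=["email_verified"]
   )
#}
```

Option B is kept inside a Jinja comment on purpose: dbt renders Jinja even inside SQL comments, so two live `config()` calls in one file would both run.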

"unique_key": ['portal_customer_id'],
"merge_update_columns": ['verify_email']
})
}}

WITH identifies as (
SELECT user_id,
coalesce(portal_customer_id, context_traits_portal_customer_id) as portal_customer_id,
Review comment (Contributor):
This coalesce should ideally live in the staging model. It looks reusable, and having it in the first model would ensure other models follow the same convention. It would also make downstream queries easier to read.

The same goes for any filtering or deduplication that might be needed.
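
A rough sketch of what that could look like in `stg_portal_prod__identifies`, assuming rows without any customer id are not needed downstream (that filter is an assumption, taken from the `WHERE` clause of this model):

```sql
-- stg_portal_prod__identifies.sql (sketch, not the PR's actual file)
select
    user_id,
    -- Resolve the customer id once, so downstream models never repeat the coalesce.
    coalesce(portal_customer_id, context_traits_portal_customer_id) as portal_customer_id,
    received_at
from {{ source('portal_prod', 'identifies') }}
-- Assumption: identify calls that carry no customer id at all can be dropped here.
where coalesce(portal_customer_id, context_traits_portal_customer_id) is not null
```

Downstream models would then select `portal_customer_id` directly and never see `context_traits_portal_customer_id`.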

ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY RECEIVED_AT) AS row_number
Review comment (Contributor):
Why not use QUALIFY in the identifies CTE?

If I understand correctly, the idea is to create a mapping between user_id and portal_customer_id. For this case, the logic is rather simple, as for each user_id a single portal_customer_id must exist and vice versa. This can be moved to a separate intermediate model because:

  • it's extremely likely that it will be reused,
  • this way tests on the 1:1 mapping can be added.
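
A minimal sketch of such a mapping model, assuming the warehouse supports `QUALIFY` (Snowflake does). The file name is borrowed from a later commit in this PR (`int_rudder_portal_user_mapping.sql`), whose contents are not shown in this diff, so the body below is illustrative only.

```sql
-- int_rudder_portal_user_mapping.sql (sketch)
select
    user_id,
    coalesce(portal_customer_id, context_traits_portal_customer_id) as portal_customer_id
from {{ ref('stg_portal_prod__identifies') }}
where coalesce(portal_customer_id, context_traits_portal_customer_id) is not null
-- Keep the earliest identify call per user, matching the ROW_NUMBER() = 1 filter
-- that the current model applies in its join.
qualify row_number() over (partition by user_id order by received_at) = 1
```

This guarantees one row per `user_id` only. Adding `unique` and `not_null` tests on both `user_id` and `portal_customer_id` in the model's YAML would check the 1:1 assumption in the other direction too.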

In fact, this is a common challenge when working with data from CDPs. Here are a few examples of how it's done with dbt:

FROM
{{ ref('stg_portal_prod__identifies') }}
WHERE
coalesce(portal_customer_id, context_traits_portal_customer_id) IS NOT NULL and received_at >= '2023-04-04'
Review comment (Contributor):
Why 2023-04-04? Any chance this is a leftover to make things run faster?

), pageviews as (
SELECT
user_id,
event_table,
received_at
FROM
{{ ref('stg_portal_prod__pageviews') }}
WHERE
received_at >= '2023-04-04'
Review comment (Contributor):
The same date appears here. What does it mean?

Also, should this be received_at or timestamp? The former is the time at which RudderStack processed the event, while the latter is the event's own timestamp.
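
If the cutoff turns out to be intentional, one way to make its meaning explicit is a named project variable. This is only a sketch: the variable name `signup_tracking_start_date` is hypothetical, and the default simply mirrors the literal used in this model.

```sql
{# Hypothetical variable; it can be overridden in dbt_project.yml or with --vars. #}
{% set signup_cutoff = var('signup_tracking_start_date', '2023-04-04') %}

select
    user_id,
    event_table,
    received_at
from {{ ref('stg_portal_prod__pageviews') }}
-- Assumption: filtering on ingestion time is what is wanted. The staging model
-- would have to expose the event's timestamp column before it could be used here.
where received_at >= '{{ signup_cutoff }}'
```

Defining the date once also keeps the identifies and pageviews filters from drifting apart.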

), signups as(
Review comment (Contributor):
Suggested change:
- ), signups as(
+ ), signups as (

SELECT
identifies.portal_customer_id,
-- Account is created when portal_customer_id exists
true AS account_created,
-- Email is verified and user is redirected to `pageview_create_workspace` screen.
MAX(CASE WHEN pageviews.event_table = 'pageview_create_workspace' THEN true ELSE false END) AS email_verified,
-- Setting to false as we consider `workspace_installation_id` from stripe as source of truth
false AS workspace_created
Review comment (Contributor):
Why is this always false? If this information will come from a different source, why add it here?

FROM
pageviews
JOIN (select * from identifies where row_number = 1) identifies
Review comment (Contributor):
Why not use QUALIFY in the identifies CTE?

ON pageviews.user_id = identifies.user_id
GROUP BY
identifies.portal_customer_id
)

select * from
signups
@@ -19,3 +19,30 @@ models:
- name: received_at
description: Timestamp registered by RudderStack when the event was ingested (received).

- name: stg_portal_prod__identifies
description: |
Contains mapping from rudder user_id to portal user_id

columns:
- name: user_id
description: The ID of the user that sent the event.
- name: portal_customer_id
description: Portal customer id for the user
- name: context_traits_portal_customer_id
description: Duplicate of portal_customer_id ingested via context traits, required since sometimes portal_customer_id is null
- name: received_at
description: Timestamp registered by RudderStack when the event was ingested (received).

- name: stg_portal_prod__pageviews
description: |
Contains mapping from rudder user_id to portal user_id
catalintomai marked this conversation as resolved.

columns:
- name: pageview_id
description: The pageview ID of the event.
- name: user_id
description: The ID of the user that sent the event.
- name: event_table
description: The event table that records the event request.
- name: received_at
description: Timestamp registered by RudderStack when the event was ingested (received).
@@ -0,0 +1,5 @@
{% set rudder_relations = dbt_utils.get_relations_by_prefix(schema="PORTAL_PROD", database="RAW", prefix="PAGEVIEW_") %}

{{ dbt_utils.union_relations(
Review comment (Contributor):
Are all pageview columns required?

Unioning the event tables to build a staging "tracks" model doesn't strictly follow dbt's recommended project structure, but it is definitely necessary here. To make things more predictable, explicitly define which columns to keep; in this case, the columns shared between the event tables. That way, the current model won't be affected whenever a new property is added to any pageview_ table.

relations=rudder_relations
) }}
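
The explicit column list suggested in the comment above could be sketched with the `include` argument of `dbt_utils.union_relations`. The columns listed are the ones the staging model in this PR selects; whether they are the full shared set, and the casing the adapter expects, are assumptions.

```sql
{% set rudder_relations = dbt_utils.get_relations_by_prefix(
    schema="PORTAL_PROD",
    database="RAW",
    prefix="PAGEVIEW_"
) %}

{# Only the listed columns are kept; a new property added to any PAGEVIEW_ table
   is ignored until it is added here. Upper-case names are an assumption about
   how the warehouse stores identifiers. #}
{{ dbt_utils.union_relations(
    relations=rudder_relations,
    include=["ID", "USER_ID", "EVENT", "RECEIVED_AT"]
) }}
```

`include` and `exclude` are mutually exclusive in `union_relations`, so only one of them should be passed.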
@@ -0,0 +1,14 @@
WITH identifies as(
SELECT
{{ dbt_utils.star(source('portal_prod', 'identifies')) }}
FROM
{{ source('portal_prod', 'identifies') }}
)

SELECT
user_id,
portal_customer_id,
context_traits_portal_customer_id,
received_at
FROM
identifies
Review comment (Contributor):
This introduces a new pattern for staging models. Let's follow the pattern described in dbt's documentation instead. dbt codegen generates it automatically, so it takes less effort.
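
For reference, the documented pattern is roughly a `source` CTE followed by a `renamed` CTE. A sketch for this model, using only the columns already selected in this file:

```sql
with source as (

    -- Import CTE: the only place where the raw source is referenced.
    select * from {{ source('portal_prod', 'identifies') }}

),

renamed as (

    -- Column selection, renaming and light casting belong here.
    select
        user_id,
        portal_customer_id,
        context_traits_portal_customer_id,
        received_at
    from source

)

select * from renamed
```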

@@ -0,0 +1,16 @@

with pageviews as(
SELECT
{{ dbt_utils.star(ref('base_portal_prod__tracks')) }}
FROM
{{ ref ('base_portal_prod__tracks') }}
)

select
id as pageview_id,
user_id,
event as event_table,
received_at
from
pageviews
Review comment (Contributor):
Not sure if the CTE is needed.
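
Without the CTE, the model reduces to a direct select. This is a sketch that assumes the team does not require an import CTE in every model:

```sql
-- stg_portal_prod__pageviews.sql (sketch without the CTE)
select
    id as pageview_id,
    user_id,
    event as event_table,
    received_at
from {{ ref('base_portal_prod__tracks') }}
```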