-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Chore: Added stage tables for portal pageviews #1267
base: master
Are you sure you want to change the base?
Changes from 32 commits
2e0cdbc
f0acfe5
3b05b3d
3473a19
1ba4685
d9d1475
8e06fc1
76e37d9
e499ede
8a75f26
024e972
67d054c
71c2d47
b681562
6a9748d
3c56450
864850e
3a7ff3e
cad413a
fda7444
17a8feb
c01aa85
ac2c852
280b8dc
7794524
3d0d7c1
e5c35df
5c7cb65
e20fe5d
82e5231
5838893
3e3426c
4375545
0f171eb
5fcf837
e55f634
fe86592
8d8052c
14557c9
26ac99a
f64a115
8b5aa68
1ac6047
23427ec
b618102
0243f29
4f1d7be
fb51c91
24842b5
ddccc72
e9cbd79
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
version: 2 | ||
|
||
models: | ||
- name: int_signups_aggregated_to_users | ||
description: User signup stages, aggregated by users. | ||
|
||
columns: | ||
- name: portal_customer_id | ||
description: Customer identifier that joins to customer info coming from stripe. | ||
tests: | ||
ifoukarakis marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- not_null | ||
- name: account_created | ||
description: Boolean value indicating if the user created the account. | ||
- name: email_verified | ||
description: Boolean value indicating if the user verified the email address. | ||
- name: workspace_created | ||
description: Boolean value indicating if the user created the workspace, defaults to false. |
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,45 @@ | ||||||
{{ | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I understand correctly, this model capture the users that have completed signups. The only aggregation that happens seems to be happening in order to achieve some deduplication (?). Perhaps the name of the model could be |
||||||
config({ | ||||||
"materialized": "table", | ||||||
"incremental_strategy": "merge", | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Incremental strategy is defined, but the model is not incremental. |
||||||
"unique_key": ['portal_customer_id'], | ||||||
"merge_update_columns": ['verify_email'] | ||||||
}) | ||||||
}} | ||||||
|
||||||
WITH identifies as ( | ||||||
SELECT user_id, | ||||||
coalesce(portal_customer_id, context_traits_portal_customer_id) as portal_customer_id, | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This The same goes for any filtering or deduplication that might be needed. |
||||||
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY RECEIVED_AT) AS row_number | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not use If I understand correctly, the idea is to create a mapping between
In fact it's a common challenge when using data from CDP platforms. Here's a few examples on how it's performed using DBT: |
||||||
FROM | ||||||
{{ ref('stg_portal_prod__identifies') }} | ||||||
WHERE | ||||||
coalesce(portal_customer_id, context_traits_portal_customer_id) IS NOT NULL and received_at >= '2023-04-04' | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why |
||||||
), pageviews as ( | ||||||
SELECT | ||||||
user_id, | ||||||
event_table, | ||||||
received_at | ||||||
FROM | ||||||
{{ ref('stg_portal_prod__pageviews') }} | ||||||
WHERE | ||||||
received_at >= '2023-04-04' | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The same date appears here. What does it mean? Also does it have to be |
||||||
), signups as( | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
SELECT | ||||||
identifies.portal_customer_id, | ||||||
-- Account is created when portal_customer_id exists | ||||||
true AS account_created, | ||||||
-- Email is verified and user is redirected to `pageview_create_workspace` screen. | ||||||
MAX(CASE WHEN pageviews.event_table = 'pageview_create_workspace' THEN true ELSE false END) AS email_verified, | ||||||
-- Setting to false as we consider `workspace_installation_id` from stripe as source of truth | ||||||
false AS workspace_created | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is this false always? If it is information that will come from a different source, why add it here? |
||||||
FROM | ||||||
pageviews | ||||||
JOIN (select * from identifies where row_number = 1) identifies | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not use |
||||||
ON pageviews.user_id = identifies.user_id | ||||||
GROUP BY | ||||||
identifies.portal_customer_id | ||||||
) | ||||||
|
||||||
select * from | ||||||
signups |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
{% set rudder_relations = dbt_utils.get_relations_by_prefix(schema="PORTAL_PROD", database="RAW", prefix="PAGEVIEW_") %} | ||
|
||
{{ dbt_utils.union_relations( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are all pageview columns required? Unioning the tables to create a staging "tracks" from events is not really following DBT's suggestion for project structure, but definitely a necessary thing to do. One thing that can be done to make things more predictable is to explicitly define which columns to keep. In this case it should be the columns shared between event tables. This way whenever a new property is added on any |
||
relations=rudder_relations | ||
) }} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
WITH identifies as( | ||
SELECT | ||
{{ dbt_utils.star(source('portal_prod', 'identifies')) }} | ||
FROM | ||
{{ source('portal_prod', 'identifies') }} | ||
) | ||
|
||
SELECT | ||
user_id, | ||
portal_customer_id, | ||
context_traits_portal_customer_id, | ||
received_at | ||
FROM | ||
identifies | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This approach introduces a new approach for staging models. Let's follow the pattern described in DBT's documentation. DBT codegen automatically generated this, so less effort there. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
|
||
with pageviews as( | ||
SELECT | ||
{{ dbt_utils.star(ref('base_portal_prod__tracks')) }} | ||
FROM | ||
{{ ref ('base_portal_prod__tracks') }} | ||
) | ||
|
||
select | ||
id as pageview_id, | ||
user_id, | ||
event as event_table, | ||
received_at | ||
from | ||
pageviews | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure if the CTE is needed. |
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A better place to place signup related models can be under the name of the team responsible for this flow. Placing them under
data_eng
feels a bit strange.