Upgrade report data pipelines (#30)
* demo report

* fix local package

* crawl reports tag triggered

* timeseries added

* split tables

* lint

* tech report tables

* check tech report sql

* missing declaration

* formatting

* preOps

* dataset change

* cwv_tech_report tested

* tech_reports moved

* exporter function draft

* fix dependencies

* rename

* dataset renamed

* storage exp draft

* date column for histograms

* dev flag

* gsc export tested

* pubsub sink prepared

* export fn deployed

* order incompatible with partitions

* monitoring

* lint

* event parsing draft

* cleanup before inserts

* event parsing

* partitioned exports

* exclude scripts

* firestore export draft

* optional description

* single dataset

* move

* incremental operations

* docs update

* firestore dict tested

* reports tested

* full sql export

* trigger params

* hashed doc ids

* more resources and timeout

* extend timeout

* gzip

* event example

* esm

* more parallelization improvements

* tested batch reports

* testing fast deletion

* deletion tested

* limit concurrency

* retries

* wait to resolve

* tested deployed version

* cleanup for test merge

* cwv-tech-report to prod db

* note to unwrap pubsub payloads

* cleanup

* lint

* revisited template builder

* cleanup

* tf 6.13

* lint

* renamed

* aligned timeout with prod

* simplify tags
max-ostapenko authored Dec 9, 2024
1 parent 5f0c2ed commit ef54451
Showing 41 changed files with 4,113 additions and 201 deletions.
1 change: 0 additions & 1 deletion .github/workflows/linter.yaml
@@ -33,4 +33,3 @@ jobs:
VALIDATE_JSCPD: false
VALIDATE_JAVASCRIPT_PRETTIER: false
VALIDATE_MARKDOWN_PRETTIER: false
VALIDATE_GITHUB_ACTIONS: false
1 change: 1 addition & 0 deletions .gitignore
@@ -3,4 +3,5 @@ node_modules/

# Terraform
infra/tf/.terraform/
infra/tf/tmp/
**/*.zip
11 changes: 2 additions & 9 deletions Makefile
@@ -1,14 +1,7 @@
FN_NAME = dataform-trigger

.PHONY: *

start:
npx functions-framework --target=$(FN_NAME) --source=./infra/dataform-trigger/ --signature-type=http --port=8080 --debug

tf_plan:
terraform -chdir=infra/tf init -upgrade && terraform -chdir=infra/tf plan \
-var="FUNCTION_NAME=$(FN_NAME)"
terraform -chdir=infra/tf init -upgrade && terraform -chdir=infra/tf plan

tf_apply:
terraform -chdir=infra/tf init && terraform -chdir=infra/tf apply -auto-approve \
-var="FUNCTION_NAME=$(FN_NAME)"
terraform -chdir=infra/tf init && terraform -chdir=infra/tf apply -auto-approve
42 changes: 7 additions & 35 deletions README.md
@@ -16,7 +16,7 @@ Tag: `crawl_complete`

### Core Web Vitals Technology Report

Tag: `cwv_tech_report`
Tag: `crux_ready`

- httparchive.core_web_vitals.technologies

@@ -26,7 +26,7 @@ Consumers:

### Blink Features Report

Tag: `blink_features_report`
Tag: `crawl_complete`

- httparchive.blink_features.features
- httparchive.blink_features.usage
@@ -35,30 +35,15 @@ Consumers:

- chromestatus.com - [example](https://chromestatus.com/metrics/feature/timeline/popularity/2089)

### Legacy crawl results (to be deprecated)

Tag: `crawl_results_legacy`

- httparchive.all.pages
- httparchive.all.parsed_css
- httparchive.all.requests
- httparchive.lighthouse.YYYY_MM_DD_client
- httparchive.pages.YYYY_MM_DD_client
- httparchive.requests.YYYY_MM_DD_client
- httparchive.response_bodies.YYYY_MM_DD_client
- httparchive.summary_pages.YYYY_MM_DD_client
- httparchive.summary_requests.YYYY_MM_DD_client
- httparchive.technologies.YYYY_MM_DD_client

## Schedules

1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataformTrigger?authuser=7&project=httparchive) PubSub subscription

Tags: ["crawl_complete", "blink_features_report", "crawl_results_legacy"]
Tags: ["crawl_complete"]

2. [bq-poller-cwv-tech-report](https://console.cloud.google.com/cloudscheduler/jobs/edit/us-east4/bq-poller-cwv-tech-report?authuser=7&project=httparchive) Scheduler

Tags: ["cwv_tech_report"]
Tags: ["crux_ready"]

### Triggering workflows

@@ -72,20 +57,7 @@ In order to unify the workflow triggering mechanism, we use [a Cloud Run functio
2. Make adjustments to the dataform configuration files and manually run a workflow to verify.
3. Push all your changes to a dev branch & open a PR with the link to the BigQuery artifacts generated in the test workflow.

### Dataform development workspace hints

1. In workflow settings vars:

- set `env_name: dev` to process sampled data in dev workspace.
- change `today` variable to a month in the past. May be helpful for testing pipelines based on `chrome-ux-report` data.

2. `definitions/extra/test_env.sqlx` script helps to setup the tables required to run pipelines when in dev workspace. It's disabled by default.

### Error Monitoring

The issues within the pipeline are being tracked using the following alerts:

1. the event trigger processing fails - [Dataform Trigger Function Error](https://console.cloud.google.com/monitoring/alerting/policies/570799173843203905?authuser=7&project=httparchive)
2. a job in the workflow fails - [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/16526940745374967367?authuser=7&project=httparchive)
#### Workspace hints

Error notifications are sent to [#10x-infra](https://httparchive.slack.com/archives/C030V4WAVL3) Slack channel.
1. In `workflow_settings.yaml` set `env_name: dev` to process sampled data.
2. In `includes/constants.js` set `today` or other variables to a custom value.
File renamed without changes.
@@ -5,3 +5,8 @@ for (const table of stagingTables) {
name: table
})
}

declare({
schema: 'wappalyzer',
name: 'apps'
})
2 changes: 1 addition & 1 deletion definitions/output/blink_features/features.js
@@ -6,7 +6,7 @@ publish('features', {
partitionBy: 'yyyymmdd',
clusterBy: ['client', 'rank']
},
tags: ['blink_features_report']
tags: ['crawl_complete']
}).preOps(ctx => `
DELETE FROM ${ctx.self()}
WHERE yyyymmdd = DATE '${constants.currentMonth}';
2 changes: 1 addition & 1 deletion definitions/output/blink_features/usage.js
@@ -2,7 +2,7 @@ publish('usage', {
schema: 'blink_features',
type: 'incremental',
protected: true,
tags: ['blink_features_report']
tags: ['crawl_complete']
}).preOps(ctx => `
DELETE FROM ${ctx.self()}
WHERE yyyymmdd = REPLACE('${constants.currentMonth}', '-', '');
65 changes: 28 additions & 37 deletions definitions/output/core_web_vitals/technologies.js
@@ -9,17 +9,25 @@ publish('technologies', {
clusterBy: ['geo', 'app', 'rank', 'client'],
requirePartitionFilter: true
},
tags: ['cwv_tech_report'],
tags: ['crux_ready'],
dependOnDependencyAssertions: true
}).preOps(ctx => `
DELETE FROM ${ctx.self()}
WHERE date = '${pastMonth}';
CREATE TEMP FUNCTION IS_GOOD(good FLOAT64, needs_improvement FLOAT64, poor FLOAT64) RETURNS BOOL AS (
CREATE TEMP FUNCTION IS_GOOD(
good FLOAT64,
needs_improvement FLOAT64,
poor FLOAT64
) RETURNS BOOL AS (
SAFE_DIVIDE(good, good + needs_improvement + poor) >= 0.75
);
CREATE TEMP FUNCTION IS_NON_ZERO(good FLOAT64, needs_improvement FLOAT64, poor FLOAT64) RETURNS BOOL AS (
CREATE TEMP FUNCTION IS_NON_ZERO(
good FLOAT64,
needs_improvement FLOAT64,
poor FLOAT64
) RETURNS BOOL AS (
good + needs_improvement + poor > 0
);
`).query(ctx => `
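The two temporary functions defined in `preOps` above reduce to simple arithmetic. The following is a minimal JavaScript sketch of the same logic, assuming that a `null` return stands in for SQL `NULL` (which `SAFE_DIVIDE` yields when the denominator is zero). The names `isGood` and `isNonZero` are illustrative and do not appear in the repository.

```javascript
// Illustrative restatement of IS_GOOD / IS_NON_ZERO; not code from this commit.

// True when at least 75% of the measured experiences fall in the "good" bin.
// Returns null when there is no data, mirroring SAFE_DIVIDE returning NULL.
function isGood(good, needsImprovement, poor) {
  const total = good + needsImprovement + poor
  if (total === 0) return null
  return good / total >= 0.75
}

// True when any of the three bins carries data at all.
function isNonZero(good, needsImprovement, poor) {
  return good + needsImprovement + poor > 0
}

console.log(isGood(3, 1, 0))    // true  (3 / 4 = 0.75 meets the threshold)
console.log(isGood(1, 1, 0))    // false (1 / 2 = 0.5)
console.log(isGood(0, 0, 0))    // null  (no data)
console.log(isNonZero(0, 0, 0)) // false
```

Pairing the two checks lets the query distinguish "metric not good" from "metric not measured", which is presumably why the `crux` CTE emits both an `any_*` and a `good_*` column per metric.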
@@ -28,17 +36,15 @@ WITH geo_summary AS (
CAST(REGEXP_REPLACE(CAST(yyyymm AS STRING), r'(\\d{4})(\\d{2})', r'\\1-\\2-01') AS DATE) AS date,
* EXCEPT (country_code),
\`chrome-ux-report\`.experimental.GET_COUNTRY(country_code) AS geo
FROM
${ctx.ref('chrome-ux-report', 'materialized', 'country_summary')}
FROM ${ctx.ref('chrome-ux-report', 'materialized', 'country_summary')}
WHERE
yyyymm = CAST(FORMAT_DATE('%Y%m', '${pastMonth}') AS INT64) AND
device IN ('desktop', 'phone')
UNION ALL
SELECT
* EXCEPT (yyyymmdd, p75_fid_origin, p75_cls_origin, p75_lcp_origin, p75_inp_origin),
'ALL' AS geo
FROM
${ctx.ref('chrome-ux-report', 'materialized', 'device_summary')}
FROM ${ctx.ref('chrome-ux-report', 'materialized', 'device_summary')}
WHERE
date = '${pastMonth}' AND
device IN ('desktop', 'phone')
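The `geo_summary` hunk above converts between an integer `yyyymm` column and a first-of-month `DATE`. The backslashes in `r'(\\d{4})(\\d{2})'` are doubled only because the SQL sits inside a JavaScript template literal; BigQuery receives `\d`. A small sketch of both directions in plain JavaScript, with the hypothetical helper names `yyyymmToDate` and `dateToYyyymm`:

```javascript
// Illustrative helpers; the names are assumptions, not part of the repository.

// 202411 -> '2024-11-01', the same rewrite the REGEXP_REPLACE performs.
function yyyymmToDate(yyyymm) {
  return String(yyyymm).replace(/(\d{4})(\d{2})/, '$1-$2-01')
}

// '2024-11-01' -> 202411, the same result as
// CAST(FORMAT_DATE('%Y%m', date) AS INT64).
function dateToYyyymm(isoDate) {
  return Number(isoDate.slice(0, 4) + isoDate.slice(5, 7))
}

console.log(yyyymmToDate(202411))       // 2024-11-01
console.log(dateToYyyymm('2024-11-01')) // 202411
```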
@@ -81,20 +87,17 @@ crux AS (
IS_GOOD(fast_ttfb, avg_ttfb, slow_ttfb) AS good_ttfb,
IS_NON_ZERO(fast_inp, avg_inp, slow_inp) AS any_inp,
IS_GOOD(fast_inp, avg_inp, slow_inp) AS good_inp
FROM
geo_summary,
FROM geo_summary,
UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS _rank
WHERE
rank <= _rank
WHERE rank <= _rank
),
technologies AS (
SELECT
technology.technology AS app,
client,
page AS url
FROM
${ctx.ref('crawl', 'pages')},
FROM ${ctx.ref('crawl', 'pages')},
UNNEST(technologies) AS technology
WHERE
date = '${pastMonth}'
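The cross join with `UNNEST([1000, 10000, ...]) AS _rank` filtered by `rank <= _rank` in the `crux` CTE above places each origin in every bucket whose threshold is at least its rank, so the buckets are cumulative rather than disjoint. A short sketch of that expansion, assuming plain numbers for ranks; `RANK_BUCKETS` and `bucketsFor` are illustrative names:

```javascript
// Illustrative only: reproduces the effect of
//   FROM geo_summary, UNNEST([...]) AS _rank WHERE rank <= _rank
// for a single origin.
const RANK_BUCKETS = [1000, 10000, 100000, 1000000, 10000000, 100000000]

// Every bucket an origin with the given rank contributes to.
function bucketsFor(rank) {
  return RANK_BUCKETS.filter((bucket) => rank <= bucket)
}

console.log(bucketsFor(1000))  // all six buckets
console.log(bucketsFor(5000))  // [10000, 100000, 1000000, 10000000, 100000000]
```

One consequence is that a row in the top-1,000 bucket is also counted in all larger buckets, so per-bucket totals should not be summed across buckets.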
@@ -106,8 +109,7 @@ UNION ALL
'ALL' AS app,
client,
page AS url
FROM
${ctx.ref('crawl', 'pages')}
FROM ${ctx.ref('crawl', 'pages')}
WHERE
date = '${pastMonth}'
${constants.devRankFilter}
@@ -117,21 +119,18 @@ categories AS (
SELECT
technology.technology AS app,
ARRAY_TO_STRING(ARRAY_AGG(DISTINCT category IGNORE NULLS ORDER BY category), ', ') AS category
FROM
${ctx.ref('crawl', 'pages')},
FROM ${ctx.ref('crawl', 'pages')},
UNNEST(technologies) AS technology,
UNNEST(technology.categories) AS category
WHERE
date = '${pastMonth}'
${constants.devRankFilter}
GROUP BY
app
GROUP BY app
UNION ALL
SELECT
'ALL' AS app,
ARRAY_TO_STRING(ARRAY_AGG(DISTINCT category IGNORE NULLS ORDER BY category), ', ') AS category
FROM
${ctx.ref('crawl', 'pages')},
FROM ${ctx.ref('crawl', 'pages')},
UNNEST(technologies) AS technology,
UNNEST(technology.categories) AS category
WHERE
@@ -153,8 +152,7 @@ summary_stats AS (
SAFE.FLOAT64(lighthouse.categories.performance.score) AS performance,
SAFE.FLOAT64(lighthouse.categories.pwa.score) AS pwa,
SAFE.FLOAT64(lighthouse.categories.seo.score) AS seo
FROM
${ctx.ref('crawl', 'pages')}
FROM ${ctx.ref('crawl', 'pages')}
WHERE
date = '${pastMonth}'
${constants.devRankFilter}
@@ -174,16 +172,11 @@ lab_data AS (
AVG(performance) AS performance,
AVG(pwa) AS pwa,
AVG(seo) AS seo
FROM
summary_stats
JOIN
technologies
USING
(client, url)
JOIN
categories
USING
(app)
FROM summary_stats
JOIN technologies
USING (client, url)
JOIN categories
USING (app)
GROUP BY
client,
root_page_url,
@@ -232,10 +225,8 @@ SELECT
SAFE_CAST(APPROX_QUANTILES(bytesJS, 1000)[OFFSET(500)] AS INT64) AS median_bytes_js,
SAFE_CAST(APPROX_QUANTILES(bytesImg, 1000)[OFFSET(500)] AS INT64) AS median_bytes_image
FROM
lab_data
JOIN
crux
FROM lab_data
JOIN crux
USING
(client, root_page_url)
GROUP BY
49 changes: 49 additions & 0 deletions definitions/output/reports/cwv_tech_adoption.js
@@ -0,0 +1,49 @@
const pastMonth = constants.fnPastMonth(constants.currentMonth)

publish('cwv_tech_adoption', {
schema: 'reports',
type: 'incremental',
protected: true,
bigquery: {
partitionBy: 'date',
clusterBy: ['rank', 'geo']
},
tags: ['crux_ready']
}).preOps(ctx => `
CREATE TEMPORARY FUNCTION GET_ADOPTION(
records ARRAY<STRUCT<
client STRING,
origins INT64
>>)
RETURNS STRUCT<
desktop INT64,
mobile INT64
>
LANGUAGE js AS '''
return Object.fromEntries(
records.map(({client, origins}) => {
return [client, origins]
}))
''';
DELETE FROM ${ctx.self()}
WHERE date = '${pastMonth}';
`).query(ctx => `
/* {"dataform_trigger": "report_cwv_tech_complete", "date": "${pastMonth}", "name": "adoption", "type": "report"} */
SELECT
date,
app AS technology,
rank,
geo,
GET_ADOPTION(ARRAY_AGG(STRUCT(
client,
origins
))) AS adoption
FROM ${ctx.ref('core_web_vitals', 'technologies')}
WHERE date = '${pastMonth}'
GROUP BY
date,
app,
rank,
geo
`)
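The body of the `GET_ADOPTION` UDF in `cwv_tech_adoption.js` is ordinary JavaScript, so its pivoting behaviour can be checked outside BigQuery. The sketch below wraps the same expression in a function named `getAdoption` (an illustrative name) and feeds it hand-written sample records; how BigQuery marshals `INT64` values into the UDF is not modelled here.

```javascript
// Same expression as the UDF body, wrapped in a function for local testing.
// The sample numbers below are made up for illustration.
function getAdoption(records) {
  return Object.fromEntries(
    records.map(({ client, origins }) => {
      return [client, origins]
    }))
}

const adoption = getAdoption([
  { client: 'desktop', origins: 120 },
  { client: 'mobile', origins: 340 }
])

console.log(adoption.desktop) // 120
console.log(adoption.mobile)  // 340
```

The effect is to turn one row per client into a single struct with a field per client, which matches the declared return type `STRUCT<desktop INT64, mobile INT64>`.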
51 changes: 51 additions & 0 deletions definitions/output/reports/cwv_tech_categories.js
@@ -0,0 +1,51 @@
const pastMonth = constants.fnPastMonth(constants.currentMonth)

publish('cwv_tech_categories', {
schema: 'reports',
type: 'table',
tags: ['crux_ready']
}).query(ctx => `
/* {"dataform_trigger": "report_cwv_tech_complete", "name": "categories", "type": "dict"} */
WITH pages AS (
SELECT
root_page,
technologies
FROM ${ctx.ref('crawl', 'pages')}
WHERE
date = '${pastMonth}' AND
client = 'mobile'
${constants.devRankFilter}
),categories AS (
SELECT
category,
COUNT(DISTINCT root_page) AS origins
FROM pages,
UNNEST(technologies) AS t,
UNNEST(t.categories) AS category
GROUP BY category
),
technologies AS (
SELECT
category,
technology,
COUNT(DISTINCT root_page) AS origins
FROM pages,
UNNEST(technologies) AS t,
UNNEST(t.categories) AS category
GROUP BY
category,
technology
)
SELECT
category,
categories.origins,
ARRAY_AGG(technology ORDER BY technologies.origins DESC) AS technologies
FROM categories
JOIN technologies
USING (category)
GROUP BY
category,
categories.origins
ORDER BY categories.origins DESC
`)
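Both report queries above start with a SQL comment that carries a JSON object (`dataform_trigger`, `name`, `type`, and for the adoption report a `date`). The diff shown here does not include the code that consumes it; the commit message mentions event parsing and a Pub/Sub sink, so a downstream function presumably reads this metadata from the query text. A minimal sketch of such extraction, with `parseTriggerComment` as a hypothetical helper that is not taken from the repository:

```javascript
// Hypothetical helper: pulls the first /* {...} */ JSON comment out of a query.
// Returns the parsed object, or null when no valid JSON comment is found.
function parseTriggerComment(sql) {
  const match = sql.match(/\/\*\s*(\{[\s\S]*?\})\s*\*\//)
  if (!match) return null
  try {
    return JSON.parse(match[1])
  } catch {
    return null
  }
}

const query = `
/* {"dataform_trigger": "report_cwv_tech_complete", "name": "categories", "type": "dict"} */
SELECT 1
`

const meta = parseTriggerComment(query)
console.log(meta.dataform_trigger) // report_cwv_tech_complete
console.log(meta.name, meta.type)  // categories dict
```

Keeping the metadata in a comment means it travels with the executed query text without changing the query result.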