-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1922986 - Create script that exports BMO data as JSON suitable for import into a BigQuery instance in GCP #2370
base: master
Are you sure you want to change the base?
Conversation
…r import into a BigQuery instance in GCP
…g ./extensions/BMO/bin/export_bmo_etl.pl
* upstream/master: Bumped version to 20241211.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just a few comments/questions.
It's a lot less scary in terms of PR size once you realize there is quite a lot of repetition. You could abstract a lot of that away, but if you're pressed for time/deadlines it's okay to leave as-is.
extensions/BMO/bin/export_bmo_etl.pl
Outdated
# to make sure other entries with this date are not already present. | ||
check_for_duplicates(); | ||
|
||
### Bugs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code related to each individual table should be moved to a separate function, like extract_bugs
, instead of being one large script separated by comments. This goes for attachments, flags, keywords etc.
Nit: also, looking at each of these sections, it seems there is some repeated structure amongst them. Each one preps a statement, runs a while loop over the total number of rows, fetches a row, checks the cache, then extracts the data and submits. You could eliminate that repetition.
extensions/BMO/bin/export_bmo_etl.pl
Outdated
$table_name = 'bug_mentors'; | ||
$rows | ||
= $dbh->selectall_arrayref( | ||
'SELECT bug_id, user_id FROM bug_mentors ORDER BY bug_id', | ||
{Slice => {}}); | ||
|
||
$total = scalar @{$rows}; | ||
$count = 0; | ||
|
||
print "Processing $total $table_name.\n" if $verbose; | ||
|
||
@results = (); | ||
foreach my $row (@{$rows}) { | ||
my $data = {bug_id => $row->{bug_id}, user_id => $row->{user_id}}; | ||
|
||
push @results, $data; | ||
|
||
$count++; | ||
|
||
# Send the rows to the server if we have a specific sized block' | ||
# or we are at the last row | ||
if (scalar @results == API_BLOCK_COUNT || $total == $count) { | ||
send_data($table_name, \@results, $count); | ||
@results = (); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These sections should also be moved to separate functions.
There is room for abstracting away the general structure here as well, lots of shared code between "mentors"/"see also"/etc.
extensions/BMO/template/en/default/hook/admin/params/editparams-current_panel.html.tmpl
Outdated
Show resolved
Hide resolved
I need to make some other non-related changes anyway as I realized i left out two additional requirements. 1) I need to skip sending bug data related to the 'Legal' product and 2) I need to filter the summaries, etc. for bugs that are private. |
I am going to make one last schema change to the BQ data so I will need to create the needed PRs in webservices-infra for this :( Once that is done, I will be able to submitted the updated code here for last review. Context: The bugs.group field needs to be a string as I originally had it as 1) most bugs do not have more than one group anyway and 2) if there is more than one group, we just want to put the group name with the least amount of users. I missed that little detail from the schema outlined here: https://docs.google.com/spreadsheets/d/1LDrYkGxZGnktI3-Ei3fpPK2TAXjxTj7fBUjHaJ8tpBE/edit Sorry for the extra delay. |
Sorry this so late. Lots of moving pieces had to be coordinated with webservices-infra, etc. just so I could test and then small changes one after another to get to this point.
Let me know if you have any questions as it is a lot.