Fix CPDB nightly build #1405
base: main
5835d7c to fb996d8
run_sql_command "DROP TABLE IF EXISTS doitt_buildingfootprints_source;"
echo "fixing doitt_buildingfootprints_source and saving result into doitt_buildingfootprints"
run_sql_command "
DROP TABLE IF EXISTS doitt_buildingfootprints;
I think I might lean toward dropping this entirely and just handling null bins as needed. But I think that's getting into more of a general refactor. If we ever dbtify cpdb, something like that can happen then.
Yeah, we can drop this step. I don't think we need to handle the null bins
agree we can drop this. looks to me like `attributes_agencyverified_geoms.sql` handles null bins during the `UPDATE dcp_cpdb_agencyverified` part.
and if this is dropped, then maybe there's no need to do `import_as: doitt_buildingfootprints_source` in the recipe?
@damonmcc yep, I will make the updates. Just waiting to meet with AD tmw to finalize this PR
@@ -1121,16 +1121,16 @@ Canceled - remove from map,DCAS,,,,, Fixed Asset, BPL - MILL BASIN BR LED RETROF
,DOHMH,,,,, Fixed Asset, APICHA COMMUNITY HEALTH CENTER: OUTFITTING OF CLINICAL EXAM ,Pass-through,850HLQNAPICH,unmapped,,,
,DOHMH,,,,, , CATHOLIC MANAGED LONG TERM CARE (D/B/A ARCHCARE SENIOR LIFE ,Pass-through,850HL82ARCHC,unmapped,,,
,DOHMH,,,,, Fixed Asset, DOH CLINIC CODE CORRECTIONS ,Pass-through,81901199601,unmapped,,,
385 Throop Ave.,DOHMH,3017960001.0,3050268???,Brooklyn,, Fixed Asset, DOHMH - LED LIGHTING UPGRADE (Bedford) ,Yes - Multi-site,816ACEDOH502,unmapped,,,11221.0
Have you investigated these bbls/bins at all? Do we know if this is a valid bin?
They do seem to all be 7 digits after removing question marks which at least means they're all theoretically valid bins now. I'm so fascinated by the question marks
@fvankrieken, great question. Yes, I did investigate them in `doitt_buildingfootprints` and found them to be valid bins, just without the question marks at the end.
I think we might just want to check with AD if she has any idea why these were like this in the first place.
We have other invalid rows like
Which do have at least a valid BIN. Not that we should do anything about those right now, but maybe the question marks were included because they were unsure?? Just would be worth trying to make sure there's no reason/meaning. Don't see anything in the cpdb wiki
Would you like me to check with AD prior to merging the PR?
At some point we need to have an effort to go through our whole codebase and make sure bbls and bins are strings everywhere. But today is not that day.
Yeah, this should probably happen after we fully implement data checks for source data to validate data types (rn the column data type in an ingest recipe is ignored)
- Keep raw dataset version
- Do processing in database instead of pandas
The bin column in `dcp_cpdb_agencyverified` has invalid values like "#REF!", "2610 / 336", and "3004301 (Additional BIN: 3829437 )". During the join, the table is filtered for decimal and integer values. Added a cast to decimal because a decimal string like "123.0" cannot be cast to integer directly.
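The filter-then-cast idea in this commit message can be sketched in Python. This is only an illustration: the PR does the work in SQL, and `parse_bin` and `NUMERIC_PATTERN` are hypothetical names, not code from the repo. Python's `int("123.0")` fails in the same way a direct string-to-integer cast does, so the sketch goes through a decimal type first.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical pattern: plain integers ("3004301") or decimal strings ("123.0").
NUMERIC_PATTERN = re.compile(r"\d+(\.\d+)?")


def parse_bin(raw: Optional[str]) -> Optional[int]:
    """Return the value as an int, or None if it is not a clean numeric string."""
    if raw is None:
        return None
    value = raw.strip()
    # Filter step: reject anything that is not purely an integer or decimal.
    if NUMERIC_PATTERN.fullmatch(value) is None:
        return None
    # Cast step: int("123.0") raises ValueError, so convert via Decimal first.
    # Note that int() truncates any fractional part.
    return int(Decimal(value))


if __name__ == "__main__":
    samples = ["123.0", "3004301", "#REF!", "2610 / 336",
               "3004301 (Additional BIN: 3829437 )", None]
    for sample in samples:
        print(repr(sample), "->", parse_bin(sample))
```

Rows that fail the filter come back as `None` here, which roughly corresponds to being excluded from the join.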
So there are two points I can find that refer to
Can you assess what impact that might have on these queries and any downstream logic?
I'm running repeat builds of 24adopt on main and this branch and would like to look at the final tables and differences between them before merging. I can't imagine anything that would prevent merging, but just seems that we should assess impact before getting this fix on main
We haven't used
Sorry - we made a decision to not filter
@fvankrieken, the most recent
@fvankrieken, filtering for non-null geoms in When you were comparing the tables between my older branch and main, was the datasource |
Did you repeat 24adopt, or build with latest data? I repeated 24adopt, so if you built with latest there's a chance things have changed. Given that you now have no This is what I saw on my runs - all rows that are getting assigned "footprint_script" as source on dev branch shown below
Closes #1399.
What
Nightly qa build for CPDB was failing, and this PR is meant to fix it. I recommend reviewing the GH issue linked above for more details.
Note: the build started passing again, but it's the same behavior as in builds prior to the failures, where `doitt_buildingfootprints` is empty prior to the join. This PR is still very much needed.
Changes
- `dcp_cpdb_agencyverified`: remove `?` from values in the `bin` column
- `doitt_buildingfootprints` dataset from `pandas` to SQL and keep original raw version (previously it was dropped after pre-processing, making it hard to debug)
- `bin` column only has valid numeric values. Very basic.
- `pandas`? Because `pandas` changes the raw `bin` column to decimal (there are NULL values present) and then uploads the dataset back to the database with the column as decimal. We do correct data types in `ingest`, and this `pandas` behavior adds more complexity.
- & `doitt_buildingfootprints` to filter out invalid `bin` values.
Successful build here.
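The `?` cleanup for `dcp_cpdb_agencyverified` can be illustrated with a small Python sketch, using the `3050268???` value from the diff in this PR. The function names `clean_bin` and `looks_like_bin` are hypothetical, and the 7-digit check reflects only the observation made in the review thread, not a full BIN validation.

```python
import re


def clean_bin(raw: str) -> str:
    """Strip question marks and surrounding whitespace from a raw bin value."""
    return raw.replace("?", "").strip()


def looks_like_bin(value: str) -> bool:
    """True when the value is exactly 7 digits (a shape check only)."""
    return re.fullmatch(r"\d{7}", value) is not None


if __name__ == "__main__":
    raw = "3050268???"  # value taken from the diff above
    cleaned = clean_bin(raw)
    print(raw, "->", cleaned, "| 7 digits:", looks_like_bin(cleaned))
```

A shape check like this does not confirm that a building with that bin exists; that still requires a lookup against `doitt_buildingfootprints`.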