Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CPDB nightly build #1405

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Fix CPDB nightly build #1405

wants to merge 8 commits into from

Conversation

sf-dcp
Copy link
Contributor

@sf-dcp sf-dcp commented Jan 27, 2025

Closes #1399.

What

Nightly qa build for CPDB was failing, and this PR is meant to fix it. I recommend reviewing the GH issue linked above for more details.

Note: the build started passing again but it's the same behavior as in builds prior fails where doitt_buildingfootprints is empty prior the join. This PR is still very much needed.

Changes

  1. Clean up input csv dcp_cpdb_agencyverified: remove ? from values in the bin column
  2. Move pre-processing of doitt_buildingfootprints dataset from pandas to SQL and keep original raw version (previously it was dropped after pre-processing making it hard to debug)
    • During the pre-processing step, the dataset is filtered where thebin column only has valid numeric values. Very basic.
    • Why not to keep pre-processing in pandas? Because pandas changes the raw bin column to decimal (there are NULL values present) and then uploads the dataset back to database with the column as decimal. We do correct data types in ingest and this pandas behavior adds more complexity.
  3. Update the join query between dcp_cpdb_agencyverified&doitt_buildingfootprintsto filter out invalidbin` values.

Successful build here.

@sf-dcp sf-dcp force-pushed the sf-debug-cpdb branch 7 times, most recently from 5835d7c to fb996d8 Compare January 28, 2025 19:24
@sf-dcp sf-dcp marked this pull request as ready for review January 28, 2025 19:30
run_sql_command "DROP TABLE IF EXISTS doitt_buildingfootprints_source;"
echo "fixing doitt_buildingfootprints_source and saving result into doitt_buildingfootprints"
run_sql_command "
DROP TABLE IF EXISTS doitt_buildingfootprints;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might lean dropping this entirely, and just handling null bins as needed. But I think that's getting into more of a general refactor - if we ever dbtify cpdb something like that can happen then

Copy link
Contributor Author

@sf-dcp sf-dcp Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can drop this step. I don't think we need to handle the null bins

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree we can drop this. looks to me like attributes_agencyverified_geoms.sql handles null bins during the UPDATE dcp_cpdb_agencyverified part

and if this is dropped, than maybe there's no need to do import_as: doitt_buildingfootprints_source in the recipe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@damonmcc yep, I will make the updates. Just waiting to meet with AD tmw to finalize this PR

@@ -1121,16 +1121,16 @@ Canceled - remove from map,DCAS,,,,, Fixed Asset, BPL - MILL BASIN BR LED RETROF
,DOHMH,,,,, Fixed Asset, APICHA COMMUNITY HEALTH CENTER: OUTFITTING OF CLINICAL EXAM ,Pass-through,850HLQNAPICH,unmapped,,,
,DOHMH,,,,, , CATHOLIC MANAGED LONG TERM CARE (D/B/A ARCHCARE SENIOR LIFE ,Pass-through,850HL82ARCHC,unmapped,,,
,DOHMH,,,,, Fixed Asset, DOH CLINIC CODE CORRECTIONS ,Pass-through,81901199601,unmapped,,,
385 Throop Ave.,DOHMH,3017960001.0,3050268???,Brooklyn,, Fixed Asset, DOHMH - LED LIGHTING UPGRADE (Bedford) ,Yes - Multi-site,816ACEDOH502,unmapped,,,11221.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you investigated these bbls/bins at all? Do we know if this is a valid bin?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do seem to all be 7 digits after removing question marks which at least means they're all theoretically valid bins now. I'm so fascinated by the question marks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fvankrieken, great question. Yes, I did investigate them in doitt_buildingfootprints and found them to be valid bins but without a question mark in the end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might just want to check with AD if she has any idea why these were like this in the first place.

We have other invalid rows like
image

Which do have at least a valid BIN. Not that we should do anything about those right now, but maybe the question marks were included because they were unsure?? Just would be worth trying to make sure there's no reason/meaning. Don't see anything in the cpdb wiki

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like me to check with AD prior merging the PR?

@fvankrieken
Copy link
Contributor

At some point we need to have an effort to go through our whole codebase and make sure bbls and bins are strings everywhere. But today is not that day.

@sf-dcp
Copy link
Contributor Author

sf-dcp commented Jan 28, 2025

At some point we need to have an effort to go through our whole codebase and make sure bbls and bins are strings everywhere. But today is not that day.

Yeah, this should probably happen after we fully implement data checks for source data to validate data type (rn column data type in a ingest recipe is ignored)

- Keep raw dataset version
- Do processing in database instead of pandas
The bin column in `dcp_cpdb_agencyverified` has invalid numeric values like "#REF!", "2610 / 336", and "3004301 (Additional BIN: 3829437 )". During the join, the table is filtered for decimal and integer values.
Added casting to decimal because decimal string like "123.0" cannot be casted to integer directly
@sf-dcp
Copy link
Contributor Author

sf-dcp commented Jan 31, 2025

Met with AD: we decided to filter out all non-numerical bin values like 123??? or 123 (additional bin 321). I also created #1415.

CPDB run with new code updates here.

@fvankrieken
Copy link
Contributor

So there are two points I can find that refer to doitt_buildingfootprints without either joining on bin column or specifying where bin is not null

Can you assess what impact that might have on these queries and any downstream logic?

@fvankrieken
Copy link
Contributor

I'm running repeat builds of 24adopt on main and this branch and would like to look at the final tables and differences between them before merging. I can't imagine anything that would prevent merging, but just seems that we should assess impact before getting this fix on main

@sf-dcp
Copy link
Contributor Author

sf-dcp commented Feb 3, 2025

So there are two points I can find that refer to doitt_buildingfootprints without either joining on bin column or specifying where bin is not null

Can you assess what impact that might have on these queries and any downstream logic?

We haven't used doitt_buildingfootprints during CPDB at all since the table had been empty after its pre-processing step. Can you clarify how to assess the impact? @fvankrieken

@fvankrieken
Copy link
Contributor

So there are two points I can find that refer to doitt_buildingfootprints without either joining on bin column or specifying where bin is not null

Can you assess what impact that might have on these queries and any downstream logic?

We haven't used doitt_buildingfootprints during CPDB at all since the table had been empty after its pre-processing step. Can you clarify how to assess the impact? @fvankrieken

Sorry - we made a decision to not filter where bin is not null, I am referring to the impact of that decision

@sf-dcp
Copy link
Contributor Author

sf-dcp commented Feb 3, 2025

@fvankrieken, the most recent doit_buildingfootprints version has a total of 2 records with null bins (geometry is not null). So the impact is not dramatic. But it makes sense to add a clause 'where bin is not null' just in case, before we add actual data checks.

@sf-dcp
Copy link
Contributor Author

sf-dcp commented Feb 5, 2025

@fvankrieken, filtering for non-null geoms in doitt_buildingfootprints didn't solve the problem: the projects 846P-4OLMST3 & 846P-400ZOO8 still got a NULL value. No datasource is present.

When you were comparing the tables between my older branch and main, was the datasource doitt_buildingfootprints present?

@fvankrieken
Copy link
Contributor

@fvankrieken, filtering for non-null geoms in doitt_buildingfootprints didn't solve the problem: the projects 846P-4OLMST3 & 846P-400ZOO8 still got a NULL value. No datasource is present.

When you were comparing the tables between my older branch and main, was the datasource doitt_buildingfootprints present?

Did you repeat 24adopt, or build with latest data? I repeated 24adopt, so if you built with latest there's a chance things have changed. Given that you now have no datasource I would guess this might be the case and that your code changes are working as expected. However, you should repeat 24adopt to make sure

This is what I saw on my runs - all rows that are getting assigned "footprint_script" as source on dev branch shown below

maprojid description dev_geom_source main_geom_source dev_geom main_geom
846P-4OLMST3 OLMSTED CTR FURNITURE Q099-808MA1 footprint_script dpr ST_MultiPolygon
846SANDY3-14 SANDY-Dock at 79 Street Boat Basin M071-310M footprint_script dpr ST_MultiPoint ST_MultiPolygon
846P-400ZOO8 Queens Zoo Education Classroom Construction Q099-123M footprint_script dpr ST_MultiPolygon
846P-4OLMST OLMSTED CTR RC AND ANNEX DESIGN FEE Q099-808M/1108M footprint_script dpr
846P-415PAVI FMCP - NYS PAVILION RC Q099-116M footprint_script dpr ST_MultiPoint ST_MultiPolygon
846SANDY2-14 SANDY- Abe Stark boilers and chillers (B169-111M) footprint_script dpr ST_MultiPoint ST_MultiPolygon
846SANDY4-21 SANDY-FOREST PARK GREENHOUSE BROKEN GLASS REPLACEMENT footprint_script Algorithm
846P-415LWPG FMCP - Lawrence Plgd & PRB R/C Q099-216M- DESIGN footprint_script dpr
846P-200CRSL PPA- Recon. Carousel (B073-218M) footprint_script dpr ST_MultiPoint ST_MultiPolygon
846P-200PPOP PPA - Prospect Park Oriental Pavilion B073-416M footprint_script dpr ST_MultiPoint ST_MultiPolygon
846P-200WCST PPA-Wellhouse Public Rest Room Renovation - (B073-209M) footprint_script dpr ST_MultiPoint ST_MultiPolygon
846P-200PPPH PPA- PROSPECT PARK PICNIC HOUSE RENOVATION B073-419M footprint_script dpr ST_MultiPoint ST_MultiPolygon
846P-200LFHH PPA- Restor. Lefferts Historic House B073-317M footprint_script dpr ST_MultiPoint ST_MultiPolygon
846P-313CPSW Central Park - Swedish Cottage M010-116M footprint_script dpr
846P-200BAND PPA- RC Prospect Park Bandshell - equipment Purch. footprint_script Algorithm ST_MultiPoint ST_MultiPolygon
126PV274PERI NYHOS - New York Hall of Science, Perimeter Landscaping footprint_script Facilities database ST_MultiPoint ST_MultiPoint

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: New
Development

Successfully merging this pull request may close these issues.

CPDB: clean bins
4 participants