WIP - DON'T MERGE: improves merge_fragments #190
base: hotfix/4.2.3
Conversation
* Sorts the scoring of matching the segment with the fragment, placing the highest score at the end.
* Introduces the concept of stale keys (`seg_id` and `frag_id`).
* Skips items with stale keys when iterating.
* The `frags_by_day` function changes the generator tuple element from a `list` to a `set`.
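For readers skimming the diff, here is a minimal, self-contained sketch of the pattern the first three bullets describe. The names `iter_best_matches`, `matches`, `stale_seg_ids`, and `stale_frag_ids` are illustrative, not the PR's actual identifiers:

```python
# Illustrative sketch only; identifiers here are hypothetical, not the PR's code.
def iter_best_matches(matches, stale_seg_ids, stale_frag_ids):
    """Yield matches best-score-first, skipping anything already consumed."""
    # Sort ascending by score so the highest score sits at the end of the list.
    matches.sort(key=lambda m: m["score"])
    while matches:
        match = matches.pop()  # popping from the end is O(1)
        # Skip entries whose segment or fragment was already matched (stale keys).
        if match["seg_id"] in stale_seg_ids or match["frag_id"] in stale_frag_ids:
            continue
        stale_seg_ids.add(match["seg_id"])
        stale_frag_ids.add(match["frag_id"])
        yield match
```

Marking keys stale and skipping them lazily during iteration avoids repeated linear-time removals from the candidate list.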
```diff
 for day, frags_by_day in by_day(frags):
-    yield day, [x["frag_id"] for x in frags_by_day]
+    yield day, {x["frag_id"] for x in frags_by_day}
```
@bitsofbits should it be a `list` instead of a `dict`? Because it will collect the frag_ids of the day.
@smpiano: it is actually a set, not a dict, here. We don't care about the order, and a set prevents duplicates (which I don't think we care about here) and, more importantly, allows items to be removed in constant time (as opposed to a list, where removal is linear time, I believe). And we remove items from this down on line 121.
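For context on the constant-time claim, a quick self-contained illustration (not from this PR) of the removal cost difference:

```python
# Removing an item from a set is O(1) on average; list.remove must scan
# linearly for the element and then shift the tail, which is O(n).
import timeit

frag_ids = [f"frag-{i}" for i in range(100_000)]

# list.remove: scans the whole list to find the last element, every time.
list_time = timeit.timeit(
    "xs.remove('frag-99999'); xs.append('frag-99999')",
    setup="xs = list(frag_ids)",
    globals={"frag_ids": frag_ids},
    number=1_000,
)

# set.discard: hashes the key and removes it directly.
set_time = timeit.timeit(
    "xs.discard('frag-99999'); xs.add('frag-99999')",
    setup="xs = set(frag_ids)",
    globals={"frag_ids": frag_ids},
    number=1_000,
)

print(f"list.remove: {list_time:.3f}s  set.discard: {set_time:.3f}s")
```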
I'm running a job with these changes for …
I've run the following jobs using all ssvids for day 2024-10-01.
At first view, I see the elapsed time differs from the run in Airflow; for that day, Airflow showed the segment step took ~6h. Memory consumption looks the same for both (left: hotfix/pipeline-2398, right: v4.2.3).

hotfix/pipeline-2398 command:

```sh
./scripts/run.sh segment --date_range='2024-10-01,2024-10-02' --segmenter_params='{"max_hours": 24}' --source=world-fishing-827.pipe_ais_sources_v20220628.pipe_nmea_normalized_ --msg_dest=scratch_matias_ttl_60_days.PIPE2398_messages_segmented_ --segment_dest=scratch_matias_ttl_60_days.PIPE2398_segments_ --fragment_tbl=scratch_matias_ttl_60_days.PIPE2398_fragments_ --sat_source=pipe_ais_sources_v20220628.pipe_nmea_normalized_ --sat_offset_dest=scratch_matias_ttl_60_days.PIPE2398_satellite_timing_offsets --norad_to_receiver_tbl=pipe_static.norad_to_receiver_v20230510 --sat_positions_tbl=satellite_positions_v20190208.satellite_positions_one_second_resolution_ --setup_file=./setup.py --sdk_container_image=gcr.io/world-fishing-827/github.com/globalfishingwatch/pipe-segment/worker:v4.2.3-hotfix-2398 --labels=environment=develop --labels=resource_creator=matias --labels=project=core_pipeline --labels=version=2398 --labels=step=segment --labels=stage=develop --project=world-fishing-827 -- --runner=dataflow --wait_for_job --project=world-fishing-827 --temp_location=gs://pipe-temp-us-central-ttl7/dataflow_temp --staging_location=gs://pipe-temp-us-central-ttl7/dataflow_staging --region=us-central1 --max_num_workers=600 --worker_machine_type=custom-1-71680-ext --disk_size_gb=50 --job_name=pipe-2398--segment-segment--20250115 --experiments=use_runner_v2 --no_use_public_ips --network=gfw-internal-network --subnetwork=regions/us-central1/subnetworks/gfw-internal-us-central1
```

v4.2.3 command:

```sh
./scripts/run.sh segment --date_range='2024-10-01,2024-10-02' --segmenter_params='{"max_hours": 24}' --source=world-fishing-827.pipe_ais_sources_v20220628.pipe_nmea_normalized_ --msg_dest=scratch_matias_ttl_60_days.v423_messages_segmented_ --segment_dest=scratch_matias_ttl_60_days.v423_segments_ --fragment_tbl=scratch_matias_ttl_60_days.v423_fragments_ --sat_source=pipe_ais_sources_v20220628.pipe_nmea_normalized_ --sat_offset_dest=scratch_matias_ttl_60_days.v423_satellite_timing_offsets --norad_to_receiver_tbl=pipe_static.norad_to_receiver_v20230510 --sat_positions_tbl=satellite_positions_v20190208.satellite_positions_one_second_resolution_ --setup_file=./setup.py --sdk_container_image=gcr.io/world-fishing-827/github.com/globalfishingwatch/pipe-segment/worker:v4.2.3 --labels=environment=develop --labels=resource_creator=matias --labels=project=core_pipeline --labels=version=v423 --labels=step=segment --labels=stage=develop --project=world-fishing-827 --runner=dataflow --wait_for_job --project=world-fishing-827 --temp_location=gs://pipe-temp-us-central-ttl7/dataflow_temp --staging_location=gs://pipe-temp-us-central-ttl7/dataflow_staging --region=us-central1 --max_num_workers=600 --worker_machine_type=custom-1-71680-ext --disk_size_gb=50 --job_name=v423--segment-segment--20241001 --experiments=use_runner_v2 --no_use_public_ips --network=gfw-internal-network --subnetwork=regions/us-central1/subnetworks/gfw-internal-us-central1
```
@smpiano I think you wouldn't see any of the changes in a run on 2 days, because both the CPU and memory issues are almost certainly due to the long history of fragments. When I ran this, the duration of …
Also, please note that we usually only run the segmenter on a single day (I also ran it on 2 days at a time, and I wasn't sure whether that's representative). From what I've seen (during the recent orchestration bug, which had the segmenter looking at 2 days instead of 1), the results are almost the same, but the performance numbers we see may not be precise.
@Chr96er I'll trigger the jobs with 10 days of history so the comparison is reproducible: 10 for v4.2.3 and 10 for this code change, to compare them later. I've also noticed that those 10 runs must be sequential, so each job has the history it needs inside the job; a sketch of a sequential driver follows the job lists below.

Jobs of v4.2.3:

Jobs of hotfix:
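A minimal sketch of how such a sequential driver could look (a hypothetical wrapper, not part of this PR; the remaining run.sh flags are elided, see the commands above):

```python
# Hypothetical driver (not from this PR): run the segment step one day at a
# time so each job can read the fragment history written by the previous day.
import subprocess
from datetime import date, timedelta

start = date(2024, 10, 1)
for offset in range(10):
    day = start + timedelta(days=offset)
    next_day = day + timedelta(days=1)
    subprocess.run(
        [
            "./scripts/run.sh", "segment",
            f"--date_range={day:%Y-%m-%d},{next_day:%Y-%m-%d}",
            # ... the remaining flags, as in the commands above ...
        ],
        check=True,  # abort the sequence if a day fails
    )
```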
@Chr96er UPDATE: I stopped the sequential run of the 10 jobs for each version. I decided to move on using this change, which lets us point the job at the already-produced tables and write the fragment results to a separate table (e.g., in a scratch dataset).
IMPORTANT: Do not merge. This PR exists to iterate on the code and gather feedback.
This is an extraction of the code that @bitsofbits was working on under the branch speed-up-createsegmap-423, which was based on support/4.2; it should be based on v4.2.3, since it is a hotfix for the currently running version.
Changes are:

* The `frags_by_day` function changes the generator tuple element from a `list` to a `set`.

cc: @Chr96er, who reported the situation.
Related: https://globalfishingwatch.atlassian.net/browse/PIPELINE-2398