
WIP - DONT MERGE: improves over the merge_fragments #190

Open
wants to merge 3 commits into base: hotfix/4.2.3
Conversation

smpiano
Contributor

@smpiano smpiano commented Jan 16, 2025

IMPORTANT: Do not merge; this is a PR to iterate on the code and gather feedback.

This is an extraction of the code that @bitsofbits was working on under the branch speed-up-createsegmap-423, which was based on support/4.2; it should instead be based on v4.2.3, since this is a hotfix for the currently running version.

Changes are:

  • Stripping the identity part from fragments to reduce the amount of data to iterate over.
  • Sorting the segment-to-fragment match scores, placing the highest score at the end.
  • Introducing the concept of stale_keys (seg_id and frag_id).
  • Skipping items with stale keys when iterating (see the sketch after this description).
  • The frags_by_day function now yields a set instead of a list in the generator tuple.

cc: @Chr96er, who reported the issue.

Related to: https://globalfishingwatch.atlassian.net/browse/PIPELINE-2398
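The sorted scores plus stale_keys read like a greedy matching loop. Below is a minimal sketch of that pattern; it is illustrative only and not the actual pipe-segment code, and segments, fragments and score are hypothetical stand-ins.

# Illustrative sketch only -- NOT the pipe-segment implementation.
# `segments`, `fragments` and `score` are hypothetical stand-ins.
def greedy_merge(segments, fragments, score):
    # Score every (segment, fragment) pair and sort ascending, so the best
    # match ends up at the end of the list and can be popped cheaply.
    scored = sorted(
        (score(seg, frag), seg["seg_id"], frag["frag_id"])
        for seg in segments
        for frag in fragments
    )

    stale_keys = set()   # seg_ids and frag_ids that were already matched
    matches = []
    while scored:
        _, seg_id, frag_id = scored.pop()             # highest score last
        if seg_id in stale_keys or frag_id in stale_keys:
            continue                                  # skip stale items
        matches.append((seg_id, frag_id))
        stale_keys.add(seg_id)
        stale_keys.add(frag_id)
    return matches

With a toy score such as lambda s, f: -abs(s["t"] - f["t"]), this pairs the closest segment and fragment first and skips anything whose key has already been consumed.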

* Sorts the segment-to-fragment match scores, placing the highest score at the end.
* Concept of stale_keys (seg_id and frag_id)
* Skips items with stale keys when iterating
* frags_by_day function changes generator tuple from list to dict.
@smpiano smpiano requested review from Chr96er and bitsofbits January 16, 2025 14:01
for day, frags_by_day in by_day(frags):
-    yield day, [x["frag_id"] for x in frags_by_day]
+    yield day, {x["frag_id"] for x in frags_by_day}
smpiano (Contributor Author)

@bitsofbits should this be a list instead of a dict? Because it will collect the frag_ids of the day.

bitsofbits (Contributor)

@smpiano : it is actually a set, not a dict, here. We don't care about the order, and this prevents duplicates (which I don't think we care about here) and, more importantly, allows items to be removed in constant time (as opposed to a list, which is linear time, I believe). And we remove items from this down on line 121.
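A toy micro-benchmark illustrating the removal-cost difference described above (unrelated to the actual pipeline data):

import timeit

setup = """
frag_ids_list = [f"frag-{i}" for i in range(100_000)]
frag_ids_set = set(frag_ids_list)
"""

# list.remove() scans the whole list (O(n)); set.discard() hashes the key (O(1) on average).
list_time = timeit.timeit('frag_ids_list.remove("frag-99999")', setup=setup, number=1)
set_time = timeit.timeit('frag_ids_set.discard("frag-99999")', setup=setup, number=1)
print(f"list.remove: {list_time:.2e} s   set.discard: {set_time:.2e} s")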

@smpiano
Contributor Author

smpiano commented Jan 17, 2025

I'm running a job with these changes for 2024-10-01, to compare later against another one that I've already run.
But first I want to share the profiling:

  • FYI: I was reading the ticket comments and also ran a profile locally (using only the hotkey 200000000) to compare the changes in merge_fragments:
    v4.2.3: 11.4 s (screenshot from 2025-01-17 13-34-09)
    hotfix/PIPELINE-2398: 4.07 s (screenshot from 2025-01-17 13-33-38)
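For reference, a minimal cProfile recipe for this kind of local comparison; everything below is a placeholder stand-in, and this is not necessarily how the screenshots above were produced:

import cProfile
import pstats

def merge_fragments(fragments):          # stand-in for the real entry point
    return sorted(fragments, key=lambda f: f["frag_id"])

fragments = [{"frag_id": f"frag-{i}"} for i in range(200_000)]  # fake input

profiler = cProfile.Profile()
profiler.enable()
merge_fragments(fragments)
profiler.disable()

# Print the 10 hottest call sites by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)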

@smpiano
Contributor Author

smpiano commented Jan 21, 2025

I've run the following jobs using all ssvids for day 2024-10-01.

At first glance the elapsed time is different from the run in Airflow; specifically, for that day the segment step took ~6h there.
Then:

  • Total vCPU time 67.453 vCPU hr vs 67.602 vCPU hr
  • Total memory time 4,721.729 GB hr vs 4,732.143 GB hr
  • Total HDD PD time 3,372.663 GB hr vs 3,380.102 GB hr
  • Total SSD PD time 0 GB hr vs 0 GB hr
  • Total Shuffle data processed 642.72 GB vs 652.69 GB

Airflow showed that the segment step took ~6h (screenshot from 2025-01-21 10-46-31).

Memory consumption looks the same for both; left: hotfix/pipeline-2398, right: v4.2.3 (screenshot from 2025-01-21 10-19-50).

hotfix/pipeline-2398 command:

./scripts/run.sh segment  --date_range='2024-10-01,2024-10-02'  --segmenter_params='{"max_hours": 24}'  --source=world-fishing-827.pipe_ais_sources_v20220628.pipe_nmea_normalized_    --msg_dest=scratch_matias_ttl_60_days.PIPE2398_messages_segmented_  --segment_dest=scratch_matias_ttl_60_days.PIPE2398_segments_  --fragment_tbl=scratch_matias_ttl_60_days.PIPE2398_fragments_  --sat_source=pipe_ais_sources_v20220628.pipe_nmea_normalized_  --sat_offset_dest=scratch_matias_ttl_60_days.PIPE2398_satellite_timing_offsets  --norad_to_receiver_tbl=pipe_static.norad_to_receiver_v20230510  --sat_positions_tbl=satellite_positions_v20190208.satellite_positions_one_second_resolution_  --setup_file=./setup.py  --sdk_container_image=gcr.io/world-fishing-827/github.com/globalfishingwatch/pipe-segment/worker:v4.2.3-hotfix-2398   --labels=environment=develop  --labels=resource_creator=matias  --labels=project=core_pipeline  --labels=version=2398  --labels=step=segment  --labels=stage=develop  --project=world-fishing-827  -- --runner=dataflow  --wait_for_job  --project=world-fishing-827  --temp_location=gs://pipe-temp-us-central-ttl7/dataflow_temp  --staging_location=gs://pipe-temp-us-central-ttl7/dataflow_staging  --region=us-central1  --max_num_workers=600  --worker_machine_type=custom-1-71680-ext  --disk_size_gb=50  --job_name=pipe-2398--segment-segment--20250115  --experiments=use_runner_v2  --no_use_public_ips  --network=gfw-internal-network  --subnetwork=regions/us-central1/subnetworks/gfw-internal-us-central1

v4.2.3 command:

./scripts/run.sh segment  --date_range='2024-10-01,2024-10-02'  --segmenter_params='{"max_hours": 24}'  --source=world-fishing-827.pipe_ais_sources_v20220628.pipe_nmea_normalized_    --msg_dest=scratch_matias_ttl_60_days.v423_messages_segmented_  --segment_dest=scratch_matias_ttl_60_days.v423_segments_  --fragment_tbl=scratch_matias_ttl_60_days.v423_fragments_  --sat_source=pipe_ais_sources_v20220628.pipe_nmea_normalized_  --sat_offset_dest=scratch_matias_ttl_60_days.v423_satellite_timing_offsets  --norad_to_receiver_tbl=pipe_static.norad_to_receiver_v20230510  --sat_positions_tbl=satellite_positions_v20190208.satellite_positions_one_second_resolution_  --setup_file=./setup.py  --sdk_container_image=gcr.io/world-fishing-827/github.com/globalfishingwatch/pipe-segment/worker:v4.2.3   --labels=environment=develop  --labels=resource_creator=matias  --labels=project=core_pipeline  --labels=version=v423  --labels=step=segment  --labels=stage=develop  --project=world-fishing-827  --runner=dataflow  --wait_for_job  --project=world-fishing-827  --temp_location=gs://pipe-temp-us-central-ttl7/dataflow_temp  --staging_location=gs://pipe-temp-us-central-ttl7/dataflow_staging  --region=us-central1  --max_num_workers=600  --worker_machine_type=custom-1-71680-ext  --disk_size_gb=50  --job_name=v423--segment-segment--20241001  --experiments=use_runner_v2  --no_use_public_ips  --network=gfw-internal-network  --subnetwork=regions/us-central1/subnetworks/gfw-internal-us-central1

@Chr96er

Chr96er commented Jan 21, 2025

@smpiano I think you wouldn't see any of the changes in a run on 2 days, because both the CPU and memory issues are almost certainly due to the long history of fragments. When I ran this, the duration of merge_fragments increased exponentially with each extra day, and it was super fast on the first few days. The same probably applies to memory. That's what I meant when I said you need about 10 days of segments to really see the effect of the changes.

  1. I would run this on a few more days so we can get an idea of performance and also have a quick look at the output.
  2. The next step would be to run this on the current production fragments/segments as an input (filtered for hotkeys only) and compare the run of the original code vs the adjustments.

Also, please note that we usually only run the segmenter on a single day (I also ran it on 2 days at a time, and I wasn't sure whether that's representative). From what I've seen (during the recent orchestration bug, which had the segmenter looking at 2 days instead of 1), the results are almost the same, but the performance numbers we see may not be precise.

@smpiano
Contributor Author

smpiano commented Jan 21, 2025

@Chr96er I'll trigger the jobs with 10 days of history so the comparison is reproducible: 10 for v4.2.3 and 10 for this code change, to compare them afterwards. Also, I've noticed those 10 runs need to be sequential, so that each job has the prior history available (see the sequencing sketch below).
I've checked the quota and we are at the limit, so this will take some time to complete.
(screenshot from 2025-01-21 12-39-05)
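A rough sketch of the sequencing idea (the run.sh flags are elided; only the one-day-at-a-time loop relying on --wait_for_job is the point):

import subprocess
from datetime import date, timedelta

EXTRA_FLAGS = []  # the remaining run.sh arguments from the commands above

start = date(2024, 10, 1)
for i in range(10):
    day = start + timedelta(days=i)
    date_range = f"{day},{day + timedelta(days=1)}"
    # Each call blocks until the Dataflow job finishes (--wait_for_job),
    # so day N+1 can read the tables written by day N.
    subprocess.run(
        ["./scripts/run.sh", "segment", f"--date_range={date_range}", *EXTRA_FLAGS],
        check=True,
    )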

Jobs of v4.2.3:

Jobs of hotfix:

@smpiano
Contributor Author

smpiano commented Jan 22, 2025

@Chr96er UPDATE: I stopped the sequential run of the 10 jobs for each version. I decided to move on using this change, which lets us point the job at the already produced tables and write the resulting fragments to a separate table (such as one in some scratch dataset).
Once it's integrated, the idea is to test with a single isolated change, e.g. the stripping of fragment identities and destinations (a hypothetical illustration follows), and then compare those jobs to see whether there is an improvement.
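Purely hypothetical illustration of what that isolated "stripping" change could look like; the field names are assumptions, not the real fragment schema:

# Hypothetical: drop heavy identity/destination fields from each fragment
# before the merge loop, keeping only what matching needs. Field names are
# assumed, not taken from the actual fragment schema.
IDENTITY_FIELDS = {"shipnames", "callsigns", "imos", "destinations"}

def strip_identity(frag):
    # The full fragment can be re-joined by frag_id after matching.
    return {k: v for k, v in frag.items() if k not in IDENTITY_FIELDS}

fragments = [{"frag_id": "f1", "timestamp": "2024-10-01", "shipnames": {"BOAT A": 3}}]  # toy input
slim_frags = [strip_identity(f) for f in fragments]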
