
WIP - DONT MERGE: improves over the merge_fragments #190

Open
wants to merge 3 commits into base: hotfix/4.2.3
Conversation

smpiano
Contributor

@smpiano smpiano commented Jan 16, 2025

IMPORTANT: Do not merge; this is a PR to iterate on the code and gather feedback.

This is an extraction of the code that @bitsofbits was working on under the branch speed-up-createsegmap-423, which was based on support/4.2; it should instead be based on v4.2.3, since this is a hotfix for the currently running version.

Changes are:

  • Stripping the identity part from fragments to reduce the amount of data to iterate over.
  • Sorting the segment-to-fragment match scores, placing the highest score at the end.
  • Introducing the concept of stale_keys (seg_id and frag_id).
  • Skipping items with stale keys when iterating (see the sketch after this description).
  • The frags_by_day function now yields a set instead of a list in the generator tuple.

cc: @Chr96er, who reported the issue.

Related to: https://globalfishingwatch.atlassian.net/browse/PIPELINE-2398
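The sorted scores plus stale_keys read like a greedy matching loop. Below is a minimal sketch of that pattern; it is illustrative only and not the actual pipe-segment code, and segments, fragments and score are hypothetical stand-ins.

# Illustrative sketch only -- NOT the pipe-segment implementation.
# `segments`, `fragments` and `score` are hypothetical stand-ins.
def greedy_merge(segments, fragments, score):
    # Score every (segment, fragment) pair and sort ascending, so the best
    # match ends up at the end of the list and can be popped cheaply.
    scored = sorted(
        (score(seg, frag), seg["seg_id"], frag["frag_id"])
        for seg in segments
        for frag in fragments
    )

    stale_keys = set()   # seg_ids and frag_ids that were already matched
    matches = []
    while scored:
        _, seg_id, frag_id = scored.pop()             # highest score last
        if seg_id in stale_keys or frag_id in stale_keys:
            continue                                  # skip stale items
        matches.append((seg_id, frag_id))
        stale_keys.add(seg_id)
        stale_keys.add(frag_id)
    return matches

With a toy score such as lambda s, f: -abs(s["t"] - f["t"]), this pairs the closest segment and fragment first and skips anything whose key has already been consumed.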

* Sorts the segment-to-fragment match scores, placing the highest score at the end.
* Concept of stale_keys (seg_id and frag_id)
* Skips items with stale keys when iterating
* frags_by_day function changes generator tuple from list to dict.
@smpiano smpiano requested review from Chr96er and bitsofbits January 16, 2025 14:01
for day, frags_by_day in by_day(frags):
-    yield day, [x["frag_id"] for x in frags_by_day]
+    yield day, {x["frag_id"] for x in frags_by_day}
smpiano (Contributor Author)

@bitsofbits should this be a list instead of a dict? Because it will collect the frag_ids of the day.

bitsofbits (Contributor)

@smpiano : it is actually a set, not a dict, here. We don't care about the order, and this prevents duplicates (which I don't think we care about here) and, more importantly, allows items to be removed in constant time (as opposed to a list, which is linear time, I believe). And we remove items from this down on line 121.
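A toy micro-benchmark illustrating the removal-cost difference described above (unrelated to the actual pipeline data):

import timeit

setup = """
frag_ids_list = [f"frag-{i}" for i in range(100_000)]
frag_ids_set = set(frag_ids_list)
"""

# list.remove() scans the whole list (O(n)); set.discard() hashes the key (O(1) on average).
list_time = timeit.timeit('frag_ids_list.remove("frag-99999")', setup=setup, number=1)
set_time = timeit.timeit('frag_ids_set.discard("frag-99999")', setup=setup, number=1)
print(f"list.remove: {list_time:.2e} s   set.discard: {set_time:.2e} s")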

@smpiano
Contributor Author

smpiano commented Jan 17, 2025

I'm running a job with these changes for 2024-10-01, to compare later against another one that I've already run.
But first I want to share the profiling:

  • FYI: I was reading the ticket comments and also ran a profile locally (using only the hotkey 200000000) to compare the changes in merge_fragments:
    v4.2.3: 11.4 s (screenshot from 2025-01-17 13-34-09)
    hotfix/PIPELINE-2398: 4.07 s (screenshot from 2025-01-17 13-33-38)
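For reference, a minimal cProfile recipe for this kind of local comparison; everything below is a placeholder stand-in, and this is not necessarily how the screenshots above were produced:

import cProfile
import pstats

def merge_fragments(fragments):          # stand-in for the real entry point
    return sorted(fragments, key=lambda f: f["frag_id"])

fragments = [{"frag_id": f"frag-{i}"} for i in range(200_000)]  # fake input

profiler = cProfile.Profile()
profiler.enable()
merge_fragments(fragments)
profiler.disable()

# Print the 10 hottest call sites by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)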

@smpiano
Contributor Author

smpiano commented Jan 21, 2025

I've run the following jobs using all ssvids for day 2024-10-01.

At first glance the elapsed time is different from the run in Airflow; specifically, for that day the segment step took ~6h there.
Then:

  • Total vCPU time 67.453 vCPU hr vs 67.602 vCPU hr
  • Total memory time 4,721.729 GB hr vs 4,732.143 GB hr
  • Total HDD PD time 3,372.663 GB hr vs 3,380.102 GB hr
  • Total SSD PD time 0 GB hr vs 0 GB hr
  • Total Shuffle data processed 642.72 GB vs 652.69 GB

Airflow showed that the segment step took ~6h (screenshot from 2025-01-21 10-46-31).

Memory consumption looks the same for both; left: hotfix/pipeline-2398, right: v4.2.3 (screenshot from 2025-01-21 10-19-50).

hotfix/pipeline-2398 command:

./scripts/run.sh segment  --date_range='2024-10-01,2024-10-02'  --segmenter_params='{"max_hours": 24}'  --source=world-fishing-827.pipe_ais_sources_v20220628.pipe_nmea_normalized_    --msg_dest=scratch_matias_ttl_60_days.PIPE2398_messages_segmented_  --segment_dest=scratch_matias_ttl_60_days.PIPE2398_segments_  --fragment_tbl=scratch_matias_ttl_60_days.PIPE2398_fragments_  --sat_source=pipe_ais_sources_v20220628.pipe_nmea_normalized_  --sat_offset_dest=scratch_matias_ttl_60_days.PIPE2398_satellite_timing_offsets  --norad_to_receiver_tbl=pipe_static.norad_to_receiver_v20230510  --sat_positions_tbl=satellite_positions_v20190208.satellite_positions_one_second_resolution_  --setup_file=./setup.py  --sdk_container_image=gcr.io/world-fishing-827/github.com/globalfishingwatch/pipe-segment/worker:v4.2.3-hotfix-2398   --labels=environment=develop  --labels=resource_creator=matias  --labels=project=core_pipeline  --labels=version=2398  --labels=step=segment  --labels=stage=develop  --project=world-fishing-827  -- --runner=dataflow  --wait_for_job  --project=world-fishing-827  --temp_location=gs://pipe-temp-us-central-ttl7/dataflow_temp  --staging_location=gs://pipe-temp-us-central-ttl7/dataflow_staging  --region=us-central1  --max_num_workers=600  --worker_machine_type=custom-1-71680-ext  --disk_size_gb=50  --job_name=pipe-2398--segment-segment--20250115  --experiments=use_runner_v2  --no_use_public_ips  --network=gfw-internal-network  --subnetwork=regions/us-central1/subnetworks/gfw-internal-us-central1

v4.2.3 command:

./scripts/run.sh segment  --date_range='2024-10-01,2024-10-02'  --segmenter_params='{"max_hours": 24}'  --source=world-fishing-827.pipe_ais_sources_v20220628.pipe_nmea_normalized_    --msg_dest=scratch_matias_ttl_60_days.v423_messages_segmented_  --segment_dest=scratch_matias_ttl_60_days.v423_segments_  --fragment_tbl=scratch_matias_ttl_60_days.v423_fragments_  --sat_source=pipe_ais_sources_v20220628.pipe_nmea_normalized_  --sat_offset_dest=scratch_matias_ttl_60_days.v423_satellite_timing_offsets  --norad_to_receiver_tbl=pipe_static.norad_to_receiver_v20230510  --sat_positions_tbl=satellite_positions_v20190208.satellite_positions_one_second_resolution_  --setup_file=./setup.py  --sdk_container_image=gcr.io/world-fishing-827/github.com/globalfishingwatch/pipe-segment/worker:v4.2.3   --labels=environment=develop  --labels=resource_creator=matias  --labels=project=core_pipeline  --labels=version=v423  --labels=step=segment  --labels=stage=develop  --project=world-fishing-827  --runner=dataflow  --wait_for_job  --project=world-fishing-827  --temp_location=gs://pipe-temp-us-central-ttl7/dataflow_temp  --staging_location=gs://pipe-temp-us-central-ttl7/dataflow_staging  --region=us-central1  --max_num_workers=600  --worker_machine_type=custom-1-71680-ext  --disk_size_gb=50  --job_name=v423--segment-segment--20241001  --experiments=use_runner_v2  --no_use_public_ips  --network=gfw-internal-network  --subnetwork=regions/us-central1/subnetworks/gfw-internal-us-central1

@Chr96er

Chr96er commented Jan 21, 2025

@smpiano I think you wouldn't see any of the changes in a run on 2 days, because both the CPU and memory issues are almost certainly due to the long history of fragments. When I ran this, the duration of merge_fragments increased exponentially with each extra day, and it was super fast on the first few days. The same probably applies to memory. That's what I meant when I said you need about 10 days of segments to really see the effect of the changes.

  1. I would run this on a few more days so we can get an idea of performance and also have a quick look at the output.
  2. The next step would be to run this on the current production fragments/segments as an input (filtered for hotkeys only) and compare the run of the original code vs the adjustments.

Also, please note that we usually only run the segmenter on a single day (I also ran it on 2 days at a time, and I wasn't sure whether that's representative). From what I've seen (during the recent orchestration bug, which had the segmenter looking at 2 days instead of 1), the results are almost the same, but the performance numbers we see may not be precise.

@smpiano
Contributor Author

smpiano commented Jan 21, 2025

@Chr96er I'll trigger the jobs with 10 days of history so the comparison is reproducible: 10 for v4.2.3 and 10 for this code change, to compare them afterwards. Also, I've noticed those 10 runs need to be sequential, so that each job has the prior history available (see the sequencing sketch below).
I've checked the quota and we are at the limit, so this will take some time to complete.
(screenshot from 2025-01-21 12-39-05)
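A rough sketch of the sequencing idea (the run.sh flags are elided; only the one-day-at-a-time loop relying on --wait_for_job is the point):

import subprocess
from datetime import date, timedelta

EXTRA_FLAGS = []  # the remaining run.sh arguments from the commands above

start = date(2024, 10, 1)
for i in range(10):
    day = start + timedelta(days=i)
    date_range = f"{day},{day + timedelta(days=1)}"
    # Each call blocks until the Dataflow job finishes (--wait_for_job),
    # so day N+1 can read the tables written by day N.
    subprocess.run(
        ["./scripts/run.sh", "segment", f"--date_range={date_range}", *EXTRA_FLAGS],
        check=True,
    )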

Jobs of v4.2.3:

Jobs of hotfix:

@smpiano
Contributor Author

smpiano commented Jan 22, 2025

@Chr96er UPDATE: I stopped the sequential run of the 10 jobs for each version. I decided to move on using this change, which lets us point the job at the already produced tables and write the resulting fragments to a separate table (such as one in some scratch dataset).
Once it's integrated, the idea is to test with a single isolated change, e.g. the stripping of fragment identities and destinations (a hypothetical illustration follows), and then compare those jobs to see whether there is an improvement.
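Purely hypothetical illustration of what that isolated "stripping" change could look like; the field names are assumptions, not the real fragment schema:

# Hypothetical: drop heavy identity/destination fields from each fragment
# before the merge loop, keeping only what matching needs. Field names are
# assumed, not taken from the actual fragment schema.
IDENTITY_FIELDS = {"shipnames", "callsigns", "imos", "destinations"}

def strip_identity(frag):
    # The full fragment can be re-joined by frag_id after matching.
    return {k: v for k, v in frag.items() if k not in IDENTITY_FIELDS}

fragments = [{"frag_id": "f1", "timestamp": "2024-10-01", "shipnames": {"BOAT A": 3}}]  # toy input
slim_frags = [strip_identity(f) for f in fragments]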
