
Improve load time (mpsc channel, post page-lock, directory buckets) #968

Merged: 1 commit into main from test-benchmark, Jan 28, 2025

Conversation

@eaypek-tfh (Collaborator) commented on Jan 27, 2025

Change

This PR improves load time via:

  1. Switching from returning a vector of streams polled with select_all and try_next to spawning a separate tokio task per get_object, with each task sending its items into a tokio mpsc channel. Instead of polling a dynamically changing list of streams, we now wait on a single channel (see the sketch after this list).
  2. Page-locking after the import instead of before: page-lock and import writes working on memory at the same time hurts performance. When we moved the page-lock to after the import, page-lock time dropped by about 70% while import time stayed almost the same.
  3. Infra change: directory buckets (S3 Express One Zone) give more stable S3 throughput, which used to vary a lot with general purpose buckets. Overall, this improved stage performance by around 20-25%.
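
A minimal sketch of the fan-in pattern from item 1, assuming the aws_sdk_s3 crate; the function name, channel capacity, and error handling are illustrative and not taken from the PR:

```rust
use aws_sdk_s3::Client;
use tokio::sync::mpsc;

// Fan-in sketch: one tokio task per object, all feeding a single mpsc
// channel, so the consumer polls one receiver instead of a dynamically
// shrinking list of streams.
async fn load_objects(client: &Client, bucket: &str, keys: Vec<String>) {
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(64); // capacity is illustrative

    for key in keys {
        let (client, bucket, tx) = (client.clone(), bucket.to_owned(), tx.clone());
        tokio::spawn(async move {
            // Download one object and push its bytes into the shared channel.
            if let Ok(out) = client.get_object().bucket(bucket).key(key).send().await {
                if let Ok(data) = out.body.collect().await {
                    // A send error only means the receiver is gone; ignore it.
                    let _ = tx.send(data.into_bytes().to_vec()).await;
                }
            }
        });
    }
    drop(tx); // the parent's handle must go for the receive loop to terminate

    while let Some(item) = rx.recv().await {
        // import `item` into memory here
        let _ = item;
    }
}
```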

Background & More Details

  • We need the page-lock for better batch performance, which is why we do not start processing records before it finishes. Page-locking 1M records in memory takes around 7.75 s, and parallelizing it does not help, as it is probably serialized at the OS level.
  • We used to start the page-lock in the background and immediately start the S3 and Aurora streams writing into memory. This indirectly hurt both the memory writes and the stream performance.
  • Since we moved the page-lock to after the import, the real S3 throughput became visible: we exceeded the AWS-defined network throughput by around 20-25% on our stage instances (the reordering is sketched after this list).
  • To avoid spending the whole page-lock time up front, we tried dividing it into chunks and writing into each memory chunk as soon as it was page-locked. However, chunked page-locks resulted in CUDA_ERROR_INVALID_VALUE errors, so we abandoned that approach.
  • With these changes, the current bottleneck appears to be memory writing.
  • Note that we could further save the time spent sending items to the tokio mpsc channel by writing directly into memory in the importer.
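
A rough sketch of the import-then-lock ordering described above. libc::mlock stands in for whatever page-lock call the codebase actually uses (it may be CUDA host registration instead), and import_all is a hypothetical placeholder for the S3/Aurora import:

```rust
// Hypothetical placeholder: streams records from S3/Aurora into `buf`.
fn import_all(_buf: &mut [u8]) {}

// Finish the import first, then page-lock the buffer in one pass, so the
// lock does not contend with the import's memory writes (the PR reports
// ~70% faster locking with this ordering).
fn import_then_lock(buf: &mut Vec<u8>) -> std::io::Result<()> {
    import_all(buf);
    let ret = unsafe { libc::mlock(buf.as_ptr() as *const libc::c_void, buf.len()) };
    if ret != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```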

@eaypek-tfh changed the title from "Test benchmark" to "Improve load time (mpsc channel, chunked page-lock, s3-first import, directory buckets)" on Jan 27, 2025
Review thread on this hunk:

```rust
    chunk_offset,
    chunk_offset + chunk_length
);
let size = chunk_length / device_manager.device_count();
```
Collaborator: Out of interest, why do we divide by the count?

Contributor: Because we actually don't allocate one blob of memory, but device_count many. Elements are assigned to devices by serial_id % device_count. This is mostly a leftover from when the memory was loaded onto the devices; we could also do it differently now.
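
A small illustration of that scheme, assuming the reply means modulo the device count; the helper names are hypothetical, not the project's API:

```rust
// Each record lands on the device selected by its serial id.
fn device_for(serial_id: usize, device_count: usize) -> usize {
    serial_id % device_count
}

// The chunk is split evenly across `device_count` per-device allocations,
// matching `let size = chunk_length / device_manager.device_count();`.
fn per_device_size(chunk_length: usize, device_count: usize) -> usize {
    chunk_length / device_count
}
```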

Review thread on this hunk:

```rust
    }));
}

drop(tx);
```
Contributor: @eaypek-tfh what's the reason for the manual drop?

@eaypek-tfh (author): Yeah, I don't need to drop it explicitly, as it would be implicitly dropped after all clones are dropped, right? I think I just wanted to make it obvious, but let me remove it 👍

Contributor: Yeah, should be all fine; if anything, this is a bit awkward since you drop it before the handles are awaited.
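
A self-contained demo of the behavior under discussion: recv() returns None only after the original Sender and every clone have been dropped, so if the parent keeps holding tx while looping on rx, the loop never terminates without the explicit drop:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(8);

    for i in 0..3 {
        let tx = tx.clone();
        tokio::spawn(async move {
            tx.send(i).await.unwrap();
            // the clone is dropped here when the task finishes
        });
    }
    // Without this, the receive loop below would wait forever on the
    // parent's still-live handle.
    drop(tx);

    while let Some(v) = rx.recv().await {
        println!("got {v}");
    }
    println!("channel closed"); // reached once all senders are gone
}
```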

@philsippl (Contributor) left a review comment: lgtm!

@eaypek-tfh force-pushed the test-benchmark branch 3 times, most recently from 51dbcaf to c4ce09b, on January 28, 2025 at 22:10
@eaypek-tfh changed the title from "Improve load time (mpsc channel, chunked page-lock, s3-first import, directory buckets)" to "Improve load time (mpsc channel, post page-lock, directory buckets)" on Jan 28, 2025
@eaypek-tfh merged commit 021fc02 into main on Jan 28, 2025; 13 checks passed.
@eaypek-tfh deleted the test-benchmark branch on January 28, 2025 at 22:44