
Improve load time (mpsc channel, post page-lock, directory buckets) #968

Merged: 1 commit into main from test-benchmark, Jan 28, 2025

Conversation

@eaypek-tfh (Collaborator) commented on Jan 27, 2025

Change

This PR improves load time via:

  1. Switching from returning a vector of streams polled with select_all and try_next to spawning a separate tokio task per get_object, with each task sending its items into a tokio mpsc channel. Instead of polling a dynamically changing list of streams, we now wait on a single channel (see the sketch after this list).
  2. Page-locking after the import instead of before: page-lock and import writes working on memory at the same time hurts performance. When we moved the page-lock to after the import, page-lock time dropped by about 70% while import time stayed almost the same.
  3. Infra change: directory buckets (S3 Express One Zone) give more stable S3 throughput, which used to vary a lot with general purpose buckets. Overall, this improved stage performance by around 20-25%.
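
A minimal sketch of the fan-in pattern from item 1, assuming the aws_sdk_s3 crate; the function name, channel capacity, and error handling are illustrative and not taken from the PR:

```rust
use aws_sdk_s3::Client;
use tokio::sync::mpsc;

// Fan-in sketch: one tokio task per object, all feeding a single mpsc
// channel, so the consumer polls one receiver instead of a dynamically
// shrinking list of streams.
async fn load_objects(client: &Client, bucket: &str, keys: Vec<String>) {
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(64); // capacity is illustrative

    for key in keys {
        let (client, bucket, tx) = (client.clone(), bucket.to_owned(), tx.clone());
        tokio::spawn(async move {
            // Download one object and push its bytes into the shared channel.
            if let Ok(out) = client.get_object().bucket(bucket).key(key).send().await {
                if let Ok(data) = out.body.collect().await {
                    // A send error only means the receiver is gone; ignore it.
                    let _ = tx.send(data.into_bytes().to_vec()).await;
                }
            }
        });
    }
    drop(tx); // the parent's handle must go for the receive loop to terminate

    while let Some(item) = rx.recv().await {
        // import `item` into memory here
        let _ = item;
    }
}
```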

Background & More Details

  • We need the page-lock for better batch performance, which is why we do not start processing records before it finishes. Page-locking 1M records in memory takes around 7.75 s, and parallelizing it does not help, as it is probably serialized at the OS level.
  • We used to start the page-lock in the background and immediately start the S3 and Aurora streams writing into memory. This indirectly hurt both the memory writes and the stream performance.
  • Since we moved the page-lock to after the import, the real S3 throughput became visible: we exceeded the AWS-defined network throughput by around 20-25% on our stage instances (the reordering is sketched after this list).
  • To avoid spending the whole page-lock time up front, we tried dividing it into chunks and writing into each memory chunk as soon as it was page-locked. However, chunked page-locks resulted in CUDA_ERROR_INVALID_VALUE errors, so we abandoned that approach.
  • With these changes, the current bottleneck appears to be memory writing.
  • Note that we could further save the time spent sending items to the tokio mpsc channel by writing directly into memory in the importer.
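
A rough sketch of the import-then-lock ordering described above. libc::mlock stands in for whatever page-lock call the codebase actually uses (it may be CUDA host registration instead), and import_all is a hypothetical placeholder for the S3/Aurora import:

```rust
// Hypothetical placeholder: streams records from S3/Aurora into `buf`.
fn import_all(_buf: &mut [u8]) {}

// Finish the import first, then page-lock the buffer in one pass, so the
// lock does not contend with the import's memory writes (the PR reports
// ~70% faster locking with this ordering).
fn import_then_lock(buf: &mut Vec<u8>) -> std::io::Result<()> {
    import_all(buf);
    let ret = unsafe { libc::mlock(buf.as_ptr() as *const libc::c_void, buf.len()) };
    if ret != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```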

@eaypek-tfh changed the title from "Test benchmark" to "Improve load time (mpsc channel, chunked page-lock, s3-first import, directory buckets)" on Jan 27, 2025
Review thread on this hunk:

```rust
    chunk_offset,
    chunk_offset + chunk_length
);
let size = chunk_length / device_manager.device_count();
```
Collaborator: Out of interest, why do we divide by the count?

Contributor: Because we actually don't allocate one blob of memory, but device_count many. Elements are assigned to devices by serial_id % device_count. This is mostly a leftover from when the memory was loaded onto the devices; we could also do it differently now.
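
A small illustration of that scheme, assuming the reply means modulo the device count; the helper names are hypothetical, not the project's API:

```rust
// Each record lands on the device selected by its serial id.
fn device_for(serial_id: usize, device_count: usize) -> usize {
    serial_id % device_count
}

// The chunk is split evenly across `device_count` per-device allocations,
// matching `let size = chunk_length / device_manager.device_count();`.
fn per_device_size(chunk_length: usize, device_count: usize) -> usize {
    chunk_length / device_count
}
```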

Review thread on this hunk:

```rust
    }));
}

drop(tx);
```
Contributor: @eaypek-tfh what's the reason for the manual drop?

@eaypek-tfh (author): Yeah, I don't need to drop it explicitly, as it would be implicitly dropped after all clones are dropped, right? I think I just wanted to make it obvious, but let me remove it 👍

Contributor: Yeah, should be all fine; if anything, this is a bit awkward since you drop it before the handles are awaited.
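
A self-contained demo of the behavior under discussion: recv() returns None only after the original Sender and every clone have been dropped, so if the parent keeps holding tx while looping on rx, the loop never terminates without the explicit drop:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(8);

    for i in 0..3 {
        let tx = tx.clone();
        tokio::spawn(async move {
            tx.send(i).await.unwrap();
            // the clone is dropped here when the task finishes
        });
    }
    // Without this, the receive loop below would wait forever on the
    // parent's still-live handle.
    drop(tx);

    while let Some(v) = rx.recv().await {
        println!("got {v}");
    }
    println!("channel closed"); // reached once all senders are gone
}
```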

@philsippl (Contributor) left a review comment: lgtm!

@eaypek-tfh force-pushed the test-benchmark branch 3 times, most recently from 51dbcaf to c4ce09b, on January 28, 2025 at 22:10
@eaypek-tfh changed the title from "Improve load time (mpsc channel, chunked page-lock, s3-first import, directory buckets)" to "Improve load time (mpsc channel, post page-lock, directory buckets)" on Jan 28, 2025
@eaypek-tfh merged commit 021fc02 into main on Jan 28, 2025; 13 checks passed.
@eaypek-tfh deleted the test-benchmark branch on January 28, 2025 at 22:44