feat: add S3Stage downloader #13784
Cool. I mostly have some questions about async, or rather about how we integrate the downloader into the S3Stage.
}
}

while !metadata.is_done() {
this should probably emit some traces?
have a few more questions about workers <> orchestrators
let client = Client::new();
let resp = client.head(url).send().await?;
let total_length: usize = resp
    .headers()
    .get(CONTENT_LENGTH)
    .and_then(|v| v.to_str().ok())
    .and_then(|s| s.parse().ok())
do we need any additional header setup?
no, that's it. just need to know the total size of the file
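For illustration, once the total size is known from the HEAD response, it can be split into inclusive byte ranges for ranged requests. This is only a sketch: `chunk_ranges`, `range_header`, and the `chunk_size` parameter are hypothetical names, not taken from the PR.

```rust
/// Splits `total_length` bytes into inclusive `(start, end)` byte ranges.
/// `chunk_size` is a hypothetical tuning parameter, not a value from the PR.
fn chunk_ranges(total_length: usize, chunk_size: usize) -> Vec<(usize, usize)> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    (0..total_length)
        .step_by(chunk_size)
        .map(|start| {
            // The last chunk may be shorter than `chunk_size`.
            let end = (start + chunk_size).min(total_length) - 1;
            (start, end)
        })
        .collect()
}

/// Formats a range as the value of an HTTP `Range` header (bounds are inclusive).
fn range_header(range: (usize, usize)) -> String {
    format!("bytes={}-{}", range.0, range.1)
}
```

A 10-byte file with 4-byte chunks would yield three ranges, the last one covering only the two remaining bytes.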
// Spawns the downloader task
self.spawn_fetch(input);
Will this always be called when we call the poll function? That seems wrong, because we don't track whether we've downloaded everything.
No, only when self.fetch_rx is None, meaning that there's no task running in the background. I should make it more explicit.
Now that I look at it, there's an issue inside spawn_fetch that will always cause it to spawn, when it should only spawn when there are files to be downloaded (maybe that's what you mean?).
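A minimal sketch of the guard discussed here, assuming a stripped-down stage. The `Stage` struct, its `pending` field, and `maybe_spawn_fetch` are hypothetical, and a `std` thread and channel stand in for the async task in the PR.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical stand-in for the stage; only the fields the guard needs.
struct Stage {
    /// `Some` while a background fetch is in flight.
    fetch_rx: Option<Receiver<String>>,
    /// Files that still need to be downloaded.
    pending: Vec<String>,
}

impl Stage {
    /// Spawns a background fetch only if none is running and work remains.
    /// Returns `true` if a task was spawned.
    fn maybe_spawn_fetch(&mut self) -> bool {
        if self.fetch_rx.is_some() || self.pending.is_empty() {
            return false;
        }
        let files = std::mem::take(&mut self.pending);
        let (tx, rx) = channel();
        thread::spawn(move || {
            for file in files {
                // A real task would download here; this only reports the name.
                let _ = tx.send(file);
            }
        });
        self.fetch_rx = Some(rx);
        true
    }
}
```

Checking both conditions in one place makes it explicit that polling again neither starts a second task nor spawns one when nothing is left to download.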
// Distribute chunk ranges to workers when they free up
while let Some(worker_msg) = orchestrator_rx.recv().await {
Is this more performant than using a single downloader? I'd assume downloading is bound by bandwidth, so it's unclear whether using multiple downloaders improves this.
Although in theory it's bound by bandwidth, servers often throttle individual connections. Using multiple connections helps.
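The worker and orchestrator pattern can be sketched roughly as follows. This is not the PR's code: it uses `std` threads and channels instead of async tasks, the `WorkerMsg` variants and `download_all` are hypothetical, and the "download" is simulated by echoing the range back.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical message a worker sends to the orchestrator.
enum WorkerMsg {
    /// Worker `id` is idle and can take a range.
    Free(usize),
    /// A worker finished the given byte range.
    Done((usize, usize)),
}

/// Hands out one range at a time to whichever worker reports itself free.
fn download_all(ranges: Vec<(usize, usize)>, workers: usize) -> Vec<(usize, usize)> {
    assert!(workers > 0, "need at least one worker");
    let (orchestrator_tx, orchestrator_rx) = channel::<WorkerMsg>();
    let mut worker_txs: Vec<Sender<Option<(usize, usize)>>> = Vec::new();
    let mut handles = Vec::new();

    for id in 0..workers {
        let (tx, rx) = channel::<Option<(usize, usize)>>();
        worker_txs.push(tx);
        let to_orchestrator = orchestrator_tx.clone();
        handles.push(thread::spawn(move || {
            let _ = to_orchestrator.send(WorkerMsg::Free(id));
            // `None` is the shutdown signal from the orchestrator.
            while let Ok(Some(range)) = rx.recv() {
                // A real worker would issue a ranged GET on its own connection here.
                let _ = to_orchestrator.send(WorkerMsg::Done(range));
                let _ = to_orchestrator.send(WorkerMsg::Free(id));
            }
        }));
    }
    // Drop the original sender so the loop below ends once all workers exit.
    drop(orchestrator_tx);

    let mut queue = ranges.into_iter();
    let mut completed = Vec::new();
    while let Ok(msg) = orchestrator_rx.recv() {
        match msg {
            WorkerMsg::Done(range) => completed.push(range),
            // Send the next range, or `None` when the queue is empty.
            WorkerMsg::Free(id) => {
                let _ = worker_txs[id].send(queue.next());
            }
        }
    }
    for handle in handles {
        handle.join().unwrap();
    }
    completed.sort();
    completed
}
```

Each worker would hold its own connection, so a per-connection throttle on the server limits one worker rather than the whole download.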
// Create channels for communication between workers and orchestrator
let (orchestrator_tx, orchestrator_rx) = unbounded_channel();
This is a bit confusing to me: why do we need multiple workers? Doesn't this download just one file?
S3Stage skeleton logic. It's incomplete and not enabled. TODOs can be disregarded for now.
S3Stage downloader. Downloads a file in parallel, allowing resumes after shutdowns.
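As a rough sketch of how resuming after a shutdown could work, progress can be tracked per chunk and persisted between runs. The `Metadata` layout and the `encode`/`decode` format below are assumptions made for illustration; only the `is_done` name appears in the diff.

```rust
/// Hypothetical progress record; the PR's actual metadata layout is not shown.
#[derive(Debug, PartialEq)]
struct Metadata {
    /// `done[i]` is true once chunk `i` has been written to disk.
    done: Vec<bool>,
}

impl Metadata {
    fn new(chunks: usize) -> Self {
        Self { done: vec![false; chunks] }
    }

    fn mark_done(&mut self, chunk: usize) {
        self.done[chunk] = true;
    }

    /// True once every chunk has been downloaded.
    fn is_done(&self) -> bool {
        self.done.iter().all(|d| *d)
    }

    /// Indices of chunks that still need downloading after a restart.
    fn pending(&self) -> Vec<usize> {
        (0..self.done.len()).filter(|i| !self.done[*i]).collect()
    }

    /// Encodes progress as a string of '0'/'1', e.g. for a sidecar file.
    fn encode(&self) -> String {
        self.done.iter().map(|d| if *d { '1' } else { '0' }).collect()
    }

    /// Restores progress; returns `None` if the input is malformed.
    fn decode(s: &str) -> Option<Self> {
        s.chars()
            .map(|c| match c {
                '1' => Some(true),
                '0' => Some(false),
                _ => None,
            })
            .collect::<Option<Vec<bool>>>()
            .map(|done| Self { done })
    }
}
```

On restart, decoding the saved record and requesting only the `pending()` chunks avoids re-downloading data that already reached disk.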