chore(dataobj): deduplicate decoder code across bucket and io.ReadSeeker #15945
Conversation
github.com/dustin/go-humanize v1.0.1
github.com/grafana/loki/v3 v3.3.2
)

require github.com/grafana/loki/v3 v3.3.2
@benclive Go told me I needed to run go mod tidy, let me know if anything here seems off.
ReadRange(ctx context.Context, offset int64, length int64) (io.ReadCloser, error)
}

type rangeDecoder struct {
This is more or less a copy/paste from the old bucket decoder but it uses rangeReader instead.
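For reference, a minimal sketch of what this kind of range-reading abstraction can look like; the interface method matches the snippet above, while readerAtRangeReader and its fields are hypothetical illustrations, not necessarily the PR's exact code:

package dataobj

import (
	"context"
	"fmt"
	"io"
)

// rangeReader abstracts reading byte ranges from a backing store such as an
// object storage bucket, a file, or an in-memory buffer. Implementations
// must tolerate multiple concurrent readers.
type rangeReader interface {
	ReadRange(ctx context.Context, offset int64, length int64) (io.ReadCloser, error)
}

// readerAtRangeReader (hypothetical) shows how an io.ReaderAt plus a known
// object size can satisfy rangeReader: io.ReaderAt permits concurrent reads
// at independent offsets, which io.ReadSeeker does not.
type readerAtRangeReader struct {
	size int64
	r    io.ReaderAt
}

func (rr readerAtRangeReader) ReadRange(_ context.Context, offset, length int64) (io.ReadCloser, error) {
	if offset+length > rr.size {
		return nil, fmt.Errorf("read of %d bytes at offset %d exceeds object size %d", length, offset, rr.size)
	}
	// io.NewSectionReader gives an independent view over [offset, offset+length).
	return io.NopCloser(io.NewSectionReader(rr.r, offset, length)), nil
}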
A previous comment identified that the code for BucketDecoder and ReadSeekerDecoder was extremely similar, and that the two could be deduplicated by introducing some kind of "range reader" interface. This commit introduces such an interface, which maps perfectly to bucket decoding. Implementations of the interface must be able to tolerate concurrent instances of readers, which io.ReadSeeker does not. To tolerate this while still allowing decoding of data objects that are either in-memory or backed by a file, ReadSeekerDecoder has been updated to ReaderAtDecoder.
Force-pushed from edc9d87 to cf745b6
LGTM! A couple of comments but nothing major.
return dataset.PageData(data), nil
}

// TODO(rfratto): this retrieves all pages for all columns individually; we
Does it make sense to increase the page size if we want to minimise roundtrips, or is the smaller page size desirable for other reasons?
Good question, I don't really know yet.
Increasing the page size would help reduce the number of downloads you need to get the whole dataset. On the other hand, the more rows in a page, the harder it can be to filter out entire pages based on their min/max value statistics, and the longer it'll take to scan through an individual page.
For example, right now we're seeing ~8M stream ID records fit into two pages. Most queries won't be able to filter out either of these pages based on their value ranges alone. This is an extreme example, but it applies to other columns too: the fewer pages there are, the more likely you are to have to download them to iterate through the data.
The page size also impacts memory usage when reading through a dataset. As long as a column has more than one page, the memory overhead of that column is the page size (4MB for us in dev). For our logs sections with 17 columns, that's 68MB per section per data object. (I'm seeing 5 log sections on average, so ~340MB per data object.)
We'll have to find the right tradeoff between TCO of reads and TCO/latency of object storage requests, but I definitely don't know what the right number is yet.
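To make the memory math above concrete, a quick back-of-the-envelope sketch; all constants are illustrative numbers from this thread, not configuration names:

package main

import "fmt"

func main() {
	const (
		pageSizeMB        = 4  // current page size in dev
		columnsPerSection = 17 // columns in a logs section
		sectionsPerObject = 5  // observed average log sections per data object
	)

	// While reading, each column with more than one page buffers roughly one
	// page at a time, so the per-section overhead is pageSize * columns.
	perSectionMB := pageSizeMB * columnsPerSection
	perObjectMB := perSectionMB * sectionsPerObject
	fmt.Printf("~%dMB per section, ~%dMB per data object\n", perSectionMB, perObjectMB)
	// Output: ~68MB per section, ~340MB per data object
}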
A previous comment identified that the code for BucketDecoder and ReadSeekerDecoder was extremely similar, and that the two could be deduplicated by introducing some kind of "range reader" interface.
This commit introduces such an interface, which maps perfectly to bucket decoding. Implementations of the interface must be able to tolerate concurrent instances of readers, which io.ReadSeeker does not. To tolerate this while still allowing decoding of data objects that are either in-memory or backed by a file, ReadSeekerDecoder has been updated to ReaderAtDecoder and accepts a size argument noting how large the object is.
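As a rough usage sketch: both in-memory buffers and files satisfy io.ReaderAt in the standard library, so a single decoder covers both cases. The ReaderAtDecoder signature below is an assumption based on the description above, not the exact API:

package main

import (
	"bytes"
	"io"
	"os"
)

// Hypothetical stand-in matching the description above; the real constructor
// in the PR may have a different signature and return type.
func ReaderAtDecoder(r io.ReaderAt, size int64) any { return nil }

func main() {
	// In-memory data object: *bytes.Reader implements io.ReaderAt.
	data := []byte("...") // stand-in for a data object held in memory
	_ = ReaderAtDecoder(bytes.NewReader(data), int64(len(data)))

	// File-backed data object: *os.File also implements io.ReaderAt, and
	// unlike io.ReadSeeker it allows concurrent reads at independent offsets.
	f, err := os.Open("object.dataobj") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}
	_ = ReaderAtDecoder(f, fi.Size())
}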