
chore(dataobj): deduplicate decoder code across bucket and io.ReadSeeker #15945

Merged

rfratto merged 2 commits into grafana:main from dataobj-dedupe-decoders on Jan 27, 2025

Conversation

rfratto
Member

@rfratto rfratto commented Jan 24, 2025

A previous comment identified that the code for BucketDecoder and ReadSeekerDecoder were extremely similar, and that they could be deduplicated by introducing some kind of "range reader" interface.

This commit introduces such an interface, which maps perfectly to bucket decoding. Implementations of the interface must be able to tolerate concurrent reader instances, which io.ReadSeeker cannot. To tolerate this while still allowing data objects that are either in-memory or backed by a file to be decoded, ReadSeekerDecoder has been updated to ReaderAtDecoder and accepts a size argument noting how large the object is.
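As a rough illustration of the approach described above, here is a minimal sketch of how an io.ReaderAt plus an explicit size can back such a range-reader interface. The rangeReader signature mirrors the ReadRange method shown in the diff further down; readerAtRangeReader and its fields are hypothetical names for this sketch, not the PR's actual types:

```go
package encoding

import (
	"context"
	"io"
)

// rangeReader matches the shape of the interface shown in the diff below:
// callers ask for an arbitrary byte range and get back an independent
// reader, so multiple ranges can be decoded concurrently.
type rangeReader interface {
	ReadRange(ctx context.Context, offset int64, length int64) (io.ReadCloser, error)
}

// readerAtRangeReader adapts an io.ReaderAt plus an explicit size to
// rangeReader. io.ReaderAt is safe for concurrent use, unlike
// io.ReadSeeker, which is why the decoder moves from ReadSeekerDecoder to
// ReaderAtDecoder. (Type and field names here are illustrative.)
type readerAtRangeReader struct {
	size int64
	r    io.ReaderAt
}

var _ rangeReader = readerAtRangeReader{}

func (rr readerAtRangeReader) ReadRange(_ context.Context, offset int64, length int64) (io.ReadCloser, error) {
	// Clamp the requested range to the known object size.
	if remaining := rr.size - offset; length > remaining {
		length = remaining
	}
	// SectionReader gives each call its own independent read cursor, so
	// concurrent ReadRange calls do not interfere with one another.
	return io.NopCloser(io.NewSectionReader(rr.r, offset, length)), nil
}
```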

@rfratto rfratto requested a review from a team as a code owner January 24, 2025 15:06
github.com/dustin/go-humanize v1.0.1
github.com/grafana/loki/v3 v3.3.2
)
require github.com/grafana/loki/v3 v3.3.2
Member Author

@benclive Go told me I needed to run go mod tidy, let me know if anything here seems off

ReadRange(ctx context.Context, offset int64, length int64) (io.ReadCloser, error)
}

type rangeDecoder struct {
Member Author

This is more or less a copy/paste from the old bucket decoder but it uses rangeReader instead.
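For comparison, the bucket-backed side of the same interface might look roughly like this. objectRangeGetter is a stand-in modeled on a typical object-store client that serves byte ranges by object name; it is not the exact bucket type the decoder uses:

```go
package encoding

import (
	"context"
	"io"
)

// objectRangeGetter is a stand-in for an object-store client that can serve
// byte ranges of a named object (hypothetical; the PR wires the real bucket
// client in here).
type objectRangeGetter interface {
	GetRange(ctx context.Context, name string, offset, length int64) (io.ReadCloser, error)
}

// bucketRangeReader adapts one object in a bucket to the rangeReader
// interface. Each ReadRange call issues an independent range request, so
// concurrent readers are safe by construction.
type bucketRangeReader struct {
	bucket objectRangeGetter
	path   string
}

func (rr bucketRangeReader) ReadRange(ctx context.Context, offset int64, length int64) (io.ReadCloser, error) {
	return rr.bucket.GetRange(ctx, rr.path, offset, length)
}
```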

A previous comment identified that the code for BucketDecoder and
ReadSeekerDecoder were extremely similar, and that they could be
deduplicated by introducing some kind of "range reader" interface.

This commit introduces such an interface, which maps perfectly to bucket
decoding. Implementations of the interface must be able to tolerate
concurrent reader instances, which io.ReadSeeker does not. To tolerate
this while still allowing data objects that are either in-memory or
backed by a file to be decoded, ReadSeekerDecoder has been updated to
ReaderAtDecoder.
@rfratto rfratto force-pushed the dataobj-dedupe-decoders branch from edc9d87 to cf745b6 Compare January 24, 2025 15:08
Contributor

@benclive benclive left a comment

LGTM! A couple of comments but nothing major.

pkg/dataobj/internal/encoding/decoder_range.go
return dataset.PageData(data), nil
}

// TODO(rfratto): this retrieves all pages for all columns individually; we
Contributor
Does it make sense to increase the page size if we want to minimise roundtrips, or is the smaller page size desirable for other reasons?

Member Author

Good question, I don't really know yet.

Increasing the page size would help reduce the number of downloads you need to get the whole dataset. On the other hand, the more rows in a page, the harder it can be to filter out entire pages based on their min/max value statistics, and the longer it'll take to scan through an individual page.

For example, right now we're seeing ~8M stream ID records fit into two pages. Most queries won't be able to filter out any of these pages based on their value ranges alone. This is an extreme example, but it applies to other columns too: the fewer pages there are, the more likely you are to have to download them to be able to iterate through the data.

The page size also impacts memory usage when reading through a dataset. As long as a column has more than one page, the memory overhead of that column is the page size (4MB for us in dev). For our logs sections with 17 columns, that's 68MB per section per data object. (I'm seeing 5 log sections on average, so ~340MB per data object.)

We'll have to find the right tradeoff between TCO of reads and TCO/latency of object storage requests, but I definitely don't know what the right number is yet.
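As a back-of-envelope check of the numbers above (assuming one buffered page per column at a time; all constants are the figures quoted in the comment):

```go
package main

import "fmt"

func main() {
	const (
		pageSizeMB        = 4  // page size used in dev
		logColumns        = 17 // columns per logs section
		sectionsPerObject = 5  // average log sections per data object
	)

	// Per-object read overhead is roughly pageSize x columns x sections.
	perSectionMB := pageSizeMB * logColumns
	perObjectMB := perSectionMB * sectionsPerObject

	fmt.Printf("~%dMB per section, ~%dMB per data object\n", perSectionMB, perObjectMB)
	// Output: ~68MB per section, ~340MB per data object
}
```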

@rfratto rfratto merged commit 5929b05 into grafana:main Jan 27, 2025
58 checks passed
@rfratto rfratto deleted the dataobj-dedupe-decoders branch January 27, 2025 13:37