
chore(dataobj): deduplicate decoder code across bucket and io.ReadSeeker #15945

Merged

rfratto merged 2 commits into grafana:main from dataobj-dedupe-decoders on Jan 27, 2025

Conversation

rfratto
Member

@rfratto rfratto commented Jan 24, 2025

A previous comment identified that the code for BucketDecoder and ReadSeekerDecoder were extremely similar, and that they could be deduplicated by introducing some kind of "range reader" interface.

This commit introduces such an interface, which maps perfectly to bucket decoding. Implementations of the interface must be able to tolerate concurrent reader instances, which io.ReadSeeker cannot. To tolerate this while still allowing data objects that are either in-memory or backed by a file to be decoded, ReadSeekerDecoder has been updated to ReaderAtDecoder and accepts a size argument noting how large the object is.
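As a rough illustration of the approach described above, here is a minimal sketch of how an io.ReaderAt plus an explicit size can back such a range-reader interface. The rangeReader signature mirrors the ReadRange method shown in the diff further down; readerAtRangeReader and its fields are hypothetical names for this sketch, not the PR's actual types:

```go
package encoding

import (
	"context"
	"io"
)

// rangeReader matches the shape of the interface shown in the diff below:
// callers ask for an arbitrary byte range and get back an independent
// reader, so multiple ranges can be decoded concurrently.
type rangeReader interface {
	ReadRange(ctx context.Context, offset int64, length int64) (io.ReadCloser, error)
}

// readerAtRangeReader adapts an io.ReaderAt plus an explicit size to
// rangeReader. io.ReaderAt is safe for concurrent use, unlike
// io.ReadSeeker, which is why the decoder moves from ReadSeekerDecoder to
// ReaderAtDecoder. (Type and field names here are illustrative.)
type readerAtRangeReader struct {
	size int64
	r    io.ReaderAt
}

var _ rangeReader = readerAtRangeReader{}

func (rr readerAtRangeReader) ReadRange(_ context.Context, offset int64, length int64) (io.ReadCloser, error) {
	// Clamp the requested range to the known object size.
	if remaining := rr.size - offset; length > remaining {
		length = remaining
	}
	// SectionReader gives each call its own independent read cursor, so
	// concurrent ReadRange calls do not interfere with one another.
	return io.NopCloser(io.NewSectionReader(rr.r, offset, length)), nil
}
```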

@rfratto rfratto requested a review from a team as a code owner January 24, 2025 15:06
github.com/dustin/go-humanize v1.0.1
github.com/grafana/loki/v3 v3.3.2
)
require github.com/grafana/loki/v3 v3.3.2
Member Author

@benclive Go told me I needed to run go mod tidy, let me know if anything here seems off

ReadRange(ctx context.Context, offset int64, length int64) (io.ReadCloser, error)
}

type rangeDecoder struct {
Member Author

This is more or less a copy/paste from the old bucket decoder but it uses rangeReader instead.
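For comparison, the bucket-backed side of the same interface might look roughly like this. objectRangeGetter is a stand-in modeled on a typical object-store client that serves byte ranges by object name; it is not the exact bucket type the decoder uses:

```go
package encoding

import (
	"context"
	"io"
)

// objectRangeGetter is a stand-in for an object-store client that can serve
// byte ranges of a named object (hypothetical; the PR wires the real bucket
// client in here).
type objectRangeGetter interface {
	GetRange(ctx context.Context, name string, offset, length int64) (io.ReadCloser, error)
}

// bucketRangeReader adapts one object in a bucket to the rangeReader
// interface. Each ReadRange call issues an independent range request, so
// concurrent readers are safe by construction.
type bucketRangeReader struct {
	bucket objectRangeGetter
	path   string
}

func (rr bucketRangeReader) ReadRange(ctx context.Context, offset int64, length int64) (io.ReadCloser, error) {
	return rr.bucket.GetRange(ctx, rr.path, offset, length)
}
```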

A previous comment identified that the code for BucketDecoder and
ReadSeekerDecoder were extremely similar, and that they could be
deduplicated by introducing some kind of "range reader" interface.

This commit introduces such an interface, which maps perfectly to bucket
decoding. Implementations of the interface must be able to tolerate
concurrent reader instances, which io.ReadSeeker does not. To tolerate
this while still allowing data objects that are either in-memory or
backed by a file to be decoded, ReadSeekerDecoder has been updated to
ReaderAtDecoder.
@rfratto rfratto force-pushed the dataobj-dedupe-decoders branch from edc9d87 to cf745b6 Compare January 24, 2025 15:08
Contributor

@benclive benclive left a comment

LGTM! A couple of comments but nothing major.

pkg/dataobj/internal/encoding/decoder_range.go
return dataset.PageData(data), nil
}

// TODO(rfratto): this retrieves all pages for all columns individually; we
Contributor
Does it make sense to increase the page size if we want to minimise roundtrips, or is the smaller page size desirable for other reasons?

Member Author

Good question, I don't really know yet.

Increasing the page size would help reduce the number of downloads you need to get the whole dataset. On the other hand, the more rows in a page, the harder it can be to filter out entire pages based on their min/max value statistics, and the longer it'll take to scan through an individual page.

For example, right now we're seeing ~8M stream ID records fit into two pages. Most queries won't be able to filter out any of these pages based on their value ranges alone. This is an extreme example, but it applies to other columns too: the fewer pages there are, the more likely you are to have to download them to be able to iterate through the data.

The page size also impacts memory usage when reading through a dataset. As long as a column has more than one page, the memory overhead of that column is the page size (4MB for us in dev). For our logs sections with 17 columns, that's 68MB per section per data object. (I'm seeing 5 log sections on average, so ~340MB per data object.)

We'll have to find the right tradeoff between TCO of reads and TCO/latency of object storage requests, but I definitely don't know what the right number is yet.
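As a back-of-envelope check of the numbers above (assuming one buffered page per column at a time; all constants are the figures quoted in the comment):

```go
package main

import "fmt"

func main() {
	const (
		pageSizeMB        = 4  // page size used in dev
		logColumns        = 17 // columns per logs section
		sectionsPerObject = 5  // average log sections per data object
	)

	// Per-object read overhead is roughly pageSize x columns x sections.
	perSectionMB := pageSizeMB * logColumns
	perObjectMB := perSectionMB * sectionsPerObject

	fmt.Printf("~%dMB per section, ~%dMB per data object\n", perSectionMB, perObjectMB)
	// Output: ~68MB per section, ~340MB per data object
}
```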

@rfratto rfratto merged commit 5929b05 into grafana:main Jan 27, 2025
58 checks passed
@rfratto rfratto deleted the dataobj-dedupe-decoders branch January 27, 2025 13:37