Write RowGroup total size and compressed size #580

justinas-marozas · 2024-12-17T14:07:42Z

I noticed that the row group total byte size in a file I wrote was suspiciously small compared to the column sizes. It turned out that ParquetRowGroupWriter sums the compressed column sizes when writing TotalByteSize. Changing TotalByteSize to sum the columns' TotalUncompressedSizes fixed it.

I also decided to write TotalCompressedSize for good measure.
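The intended relationship between the per-column sizes and the two row group totals can be sketched as below. This is only an illustration: `ColumnSizes` is a hypothetical stand-in record, not a Parquet.Net type, and the numbers are made up rather than measured from the test file.

```csharp
using System;
using System.Linq;

// Made-up per-column sizes in bytes, for illustration only.
ColumnSizes[] columns =
{
    new("timestamp", TotalUncompressedSize: 800_000, TotalCompressedSize: 60_000),
    new("id",        TotalUncompressedSize: 500_000, TotalCompressedSize: 40_000),
    new("value",     TotalUncompressedSize: 400_000, TotalCompressedSize: 150_000),
};

// Row group TotalByteSize: sum of the UNCOMPRESSED column sizes.
long totalByteSize = columns.Sum(c => c.TotalUncompressedSize);

// Row group TotalCompressedSize: sum of the COMPRESSED column sizes.
long totalCompressedSize = columns.Sum(c => c.TotalCompressedSize);

Console.WriteLine($"TotalByteSize       = {totalByteSize}");
Console.WriteLine($"TotalCompressedSize = {totalCompressedSize}");

// Hypothetical stand-in for a column chunk's size metadata (not a library type).
record ColumnSizes(string Name, long TotalUncompressedSize, long TotalCompressedSize);
```

With compression enabled, the bug described above would show up as TotalByteSize equal to the second sum instead of the first.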

Test Code

using System.IO.Compression;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

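// Fixed seed so the generated data is reproducible between runs.
var rand = new Random(123456);
// Small pools of distinct values: low-cardinality data compresses well,
// which makes the gap between compressed and uncompressed sizes obvious.
List<DateTime> dates = Enumerable.Range(1, 5).Select(x => new DateTime(2024, 12, x)).ToList();
List<string> ids = ["123", "234", "345", "456", "567", "678", "789", "890"];
List<int> values = Enumerable.Range(0, 1000).Select(_ => rand.Next(1000)).ToList();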

var schema = new ParquetSchema(
    new DataField<DateTime>("timestamp", nullable: false),
    new DataField<string>("id", nullable: false),
    new DataField<int>("value", nullable: false));

var timestampColumn = new DataColumn(
    schema.DataFields.First(f => f.Name == "timestamp"),
    Enumerable.Range(0, 100_000).Select(_ => dates[rand.Next(dates.Count)]).ToArray());
var idColumn = new DataColumn(
    schema.DataFields.First(f => f.Name == "id"),
    Enumerable.Range(0, 100_000).Select(_ => ids[rand.Next(ids.Count)]).ToArray());
var valuesColumn = new DataColumn(
    schema.DataFields.First(f => f.Name == "value"),
    Enumerable.Range(0, 100_000).Select(_ => values[rand.Next(values.Count)]).ToArray());

using var stream = File.Create("test.parquet");
using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, stream);

// Compress the column data so compressed and uncompressed sizes differ.
writer.CompressionMethod = CompressionMethod.Zstd;
writer.CompressionLevel = CompressionLevel.Optimal;

// Write all three columns into a single row group, in schema order.
using ParquetRowGroupWriter groupWriter = writer.CreateRowGroup();

await groupWriter.WriteColumnAsync(timestampColumn);
await groupWriter.WriteColumnAsync(idColumn);
await groupWriter.WriteColumnAsync(valuesColumn);
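To inspect the sizes that end up in the file, something like the following could be run as a separate program, after the writer above has been disposed and the footer flushed. It is a sketch under assumptions: it assumes that `ParquetReader` exposes the raw footer through a `Metadata` property, and that the `Parquet.Meta` types carry `RowGroups`, `Columns`, `MetaData`, `TotalByteSize`, `TotalCompressedSize` and `TotalUncompressedSize` members with these names. Check these against the Parquet.Net version in use.

```csharp
using System;
using System.IO;
using System.Linq;
using Parquet;
using Parquet.Meta;

using Stream input = File.OpenRead("test.parquet");
using ParquetReader reader = await ParquetReader.CreateAsync(input);

// Assumption: reader.Metadata is the raw file footer (FileMetaData).
foreach (RowGroup rowGroup in reader.Metadata!.RowGroups)
{
    // Recompute both totals from the per-column metadata.
    long sumUncompressed = rowGroup.Columns.Sum(c => c.MetaData!.TotalUncompressedSize);
    long sumCompressed = rowGroup.Columns.Sum(c => c.MetaData!.TotalCompressedSize);

    // After the fix, TotalByteSize should match the uncompressed sum
    // and TotalCompressedSize should match the compressed sum.
    Console.WriteLine($"TotalByteSize       = {rowGroup.TotalByteSize} (columns, uncompressed: {sumUncompressed})");
    Console.WriteLine($"TotalCompressedSize = {rowGroup.TotalCompressedSize} (columns, compressed: {sumCompressed})");
}
```

Before the fix, the first line would be expected to show TotalByteSize equal to the compressed sum rather than the uncompressed one.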

Before

After

aloneguid

This is really good, thank you!

fix rowgroup size metadata

3538462

aloneguid added this to the 5.0.3 milestone Dec 17, 2024

aloneguid approved these changes Dec 17, 2024
