[RNTuple] Wrong offset Index32/Index64 array when read from multiple pages
#312
@jpivarski this is a head-scratcher. There are a few special features of this file:
869 matches the total number of column records after appending the ones found in the footer to those of the header.
None of these directly explains why the offset/content would suddenly misalign, though. My understanding is that the field and column records in the footer extension should just be "appended" to the header ones, so in theory they shouldn't mess up the indexing of the storage columns, and everything I can read from the 1st cluster should continue to work in the 2nd cluster.
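The "append" assumption can be sketched as follows. This is only an illustration, not UnROOT's actual code: `header_columns` and `footer_extension_columns` are hypothetical stand-ins for the parsed column records.

```julia
# Hypothetical stand-ins for parsed column records (names only, for illustration).
header_columns = ["col_a", "col_b", "col_c"]
footer_extension_columns = ["col_d", "col_e"]

# Appending keeps every header record at its original position,
# so indices that were valid before the extension stay valid afterwards.
all_columns = vcat(header_columns, footer_extension_columns)

for (i, col) in enumerate(header_columns)
    @assert all_columns[i] == col
end
println(length(all_columns))
```

If this holds, a column index resolved against the header alone still points at the same record after the footer extension is taken into account.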
I think that might be a red herring:
# remember Julia is 1-based index
julia> rf.header.field_records[182:201]
20-element rVector{UnROOT.FieldRecord}:
UnROOT.FieldRecord(field_version=0x00000000, type_version=0xffffffff, parent_field_id=0x000000b5, struct_role=0x0002, flags=0x0000, repetition=0, field_name="AntiKt4TruthDressedWZJetsAux:", type_name="xAOD::JetAuxContainer_v1", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0xffffffff, parent_field_id=0x000000b5, struct_role=0x0002, flags=0x0000, repetition=0, field_name=":_0", type_name="xAOD::AuxContainerBase", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0xffffffff, parent_field_id=0x000000b6, struct_role=0x0002, flags=0x0000, repetition=0, field_name=":_0", type_name="SG::IAuxStore", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0xffffffff, parent_field_id=0x000000b7, struct_role=0x0002, flags=0x0000, repetition=0, field_name=":_0", type_name="SG::IConstAuxStore", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0xffffffff, parent_field_id=0x000000b6, struct_role=0x0002, flags=0x0000, repetition=0, field_name=":_1", type_name="SG::IAuxStoreIO", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0xffffffff, parent_field_id=0x000000b6, struct_role=0x0002, flags=0x0000, repetition=0, field_name=":_2", type_name="SG::IAuxStoreHolder", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0xffffffff, parent_field_id=0x000000b6, struct_role=0x0002, flags=0x0000, repetition=0, field_name=":_3", type_name="ILockable", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0x00000000, parent_field_id=0x000000b5, struct_role=0x0001, flags=0x0000, repetition=0, field_name="pt", type_name="std::vector<float>", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0x00000000, parent_field_id=0x000000bc, struct_role=0x0000, flags=0x0000, repetition=0, field_name="_0", type_name="float", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0x00000000, parent_field_id=0x000000b5, struct_role=0x0001, flags=0x0000, repetition=0, field_name="eta", type_name="std::vector<float>", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0x00000000, parent_field_id=0x000000be, struct_role=0x0000, flags=0x0000, repetition=0, field_name="_0", type_name="float", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0x00000000, parent_field_id=0x000000b5, struct_role=0x0001, flags=0x0000, repetition=0, field_name="phi", type_name="std::vector<float>", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0x00000000, parent_field_id=0x000000c0, struct_role=0x0000, flags=0x0000, repetition=0, field_name="_0", type_name="float", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0x00000000, parent_field_id=0x000000b5, struct_role=0x0001, flags=0x0000, repetition=0, field_name="m", type_name="std::vector<float>", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0x00000000, parent_field_id=0x000000c2, struct_role=0x0000, flags=0x0000, repetition=0, field_name="_0", type_name="float", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0x00000000, parent_field_id=0x000000b5, struct_role=0x0001, flags=0x0000, repetition=0, field_name="constituentLinks", type_name="std::vector<std::vector<ElementLink<DataVector<xAOD::IParticle> >>>", type_alias="xAOD::JetAuxContainer_v1::ConstituentLinks_t", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0x00000000, parent_field_id=0x000000c4, struct_role=0x0001, flags=0x0000, repetition=0, field_name="_0", type_name="std::vector<ElementLink<DataVector<xAOD::IParticle> >>", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0xffffffff, parent_field_id=0x000000c5, struct_role=0x0002, flags=0x0000, repetition=0, field_name="_0", type_name="ElementLink<DataVector<xAOD::IParticle> >", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0xffffffff, parent_field_id=0x000000c6, struct_role=0x0002, flags=0x0000, repetition=0, field_name=":_0", type_name="ElementLinkBase", type_alias="", field_desc="", )
UnROOT.FieldRecord(field_version=0x00000000, type_version=0x00000000, parent_field_id=0x000000c7, struct_role=0x0000, flags=0x0000, repetition=0, field_name="m_persKey", type_name="std::uint32_t", type_alias="SG::sgkey_t", field_desc="", )
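To make the 1-based remark concrete: the ids stored in the records are 0-based, so the record with id `n` sits at Julia index `n + 1`. A small sketch using ids from the listing above (`julia_index` is a helper name introduced here for illustration, not an UnROOT function):

```julia
# Field ids in the records are 0-based; Julia arrays are 1-based.
julia_index(id::Integer) = Int(id) + 1

# parent_field_id shared by the pt, eta, phi and m records in the listing:
println(julia_index(0x000000b5))  # 182, the first record of the slice 182:201

# parent_field_id of the first `_0` float record:
println(julia_index(0x000000bc))  # 189, the 8th entry of the slice, i.e. `pt`
```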
julia> collect(rnt.var"AntiKt4TruthDressedWZJetsAux:")
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000bc, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0011, nbits=0x0020, field_id=0x000000bd, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000be, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0011, nbits=0x0020, field_id=0x000000bf, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000c0, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0011, nbits=0x0020, field_id=0x000000c1, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000c2, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0011, nbits=0x0020, field_id=0x000000c3, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000c4, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000c5, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0014, nbits=0x0020, field_id=0x000000c8, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0014, nbits=0x0020, field_id=0x000000c9, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000ca, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000cb, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0011, nbits=0x0020, field_id=0x000000cc, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000bc, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0011, nbits=0x0020, field_id=0x000000bd, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000be, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0011, nbits=0x0020, field_id=0x000000bf, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000c0, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0011, nbits=0x0020, field_id=0x000000c1, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000c2, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0011, nbits=0x0020, field_id=0x000000c3, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000c4, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x000e, nbits=0x0040, field_id=0x000000c5, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0014, nbits=0x0020, field_id=0x000000c8, flags=0x00000000, first_ele_idx=0, )
field.columnrecord = UnROOT.ColumnRecord(type=0x0014, nbits=0x0020, field_id=0x000000c9, flags=0x00000000, first_ele_idx=0, )
none of the columns touched have
julia> rnt = LazyTree("./DAOD_TRUTH1.zprime125.rntuple.root", "RNT:CollectionTree", "AntiKt4TruthDressedWZJetsAux:");
julia> rnt.var"AntiKt4TruthDressedWZJetsAux:".rn.schema
RNTupleSchema with 1 top fields
└─ Symbol("AntiKt4TruthDressedWZJetsAux:") ⇒ Struct
├─ :m ⇒ Vector
│ ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=38)
│ └─ :content ⇒ Leaf{Float32}(col=39)
├─ :pt ⇒ Vector
│ ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=32)
│ └─ :content ⇒ Leaf{Float32}(col=33)
├─ :eta ⇒ Vector
│ ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=34)
│ └─ :content ⇒ Leaf{Float32}(col=35)
├─ :constituentWeights ⇒ Vector
│ ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=44)
│ └─ :content ⇒ Vector
│ ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=45)
│ └─ :content ⇒ Leaf{Float32}(col=46)
├─ :phi ⇒ Vector
│ ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=36)
│ └─ :content ⇒ Leaf{Float32}(col=37)
└─ :constituentLinks ⇒ Vector
├─ :offset ⇒ Leaf{UnROOT.Index64}(col=40)
└─ :content ⇒ Vector
├─ :offset ⇒ Leaf{UnROOT.Index64}(col=41)
└─ :content ⇒ Struct
└─ Symbol(":_0") ⇒ Struct
├─ :m_persKey ⇒ Leaf{UInt32}(col=42)
└─ :m_persIndex ⇒ Leaf{UInt32}(col=43)
let's manually check all the columns under the
julia> io = rnt.var"AntiKt4TruthDressedWZJetsAux:".rn.io;
julia> rnt.var"AntiKt4TruthDressedWZJetsAux:"[1]; # fill the pagelinks cache
julia> cluster_group = rnt.var"AntiKt4TruthDressedWZJetsAux:".rn.pagelinks[1];
julia> cluster_group.cluster_summaries[2] # we're interested in the second cluster
UnROOT.ClusterSummary(528, 1432)
julia> cluster_group.nested_page_locations[2][40] # this is col=40 in the schema, we converted already
1-element UnROOT.RNTupleListNoFrame{UnROOT.PageDescription}:
UnROOT.PageDescription(0x00000598, UnROOT.Locator(num_bytes=816, offset=0x0000000003f3eb4a, )
)
# ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=40)
julia> reinterpret(UnROOT.Index64, UnROOT.read_pagedesc(io, cluster_group.nested_page_locations[2][40], 64; split=true)) |> cumsum .|> Int
1432-element Vector{Int64}:
9
30
36
43
57
65
⋮
12906
12919
12930
julia> cluster_group.nested_page_locations[2][41] # this is col=41 in the schema, we converted already
2-element UnROOT.RNTupleListNoFrame{UnROOT.PageDescription}:
UnROOT.PageDescription(0x00002000, UnROOT.Locator(num_bytes=4740, offset=0x0000000003c6ab67, )
)
UnROOT.PageDescription(0x00001282, UnROOT.Locator(num_bytes=2770, offset=0x0000000003f3eea4, )
)
# this is split and delta encoded
# ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=41)
julia> reinterpret(UnROOT.Index64, UnROOT.read_pagedesc(io, cluster_group.nested_page_locations[2][41], 64; split=true)) |> cumsum .|> Int
12930-element Vector{Int64}:
30
34
52
66
...
209999
210008
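For readers unfamiliar with the "split and delta encoded" comment, here is a minimal sketch of decoding a single such page, as I understand the encoding. Split encoding stores the first byte of every element, then the second byte of every element, and so on. Delta encoding stores differences that a cumulative sum turns back into offsets. `split_bytes` and `unsplit_bytes` are illustrative helpers, not UnROOT functions, and the test input is built on the host, so no particular byte order is assumed.

```julia
# Undo byte-splitting: the input holds byte 0 of all elements, then byte 1 of all elements, ...
function unsplit_bytes(bytes::Vector{UInt8}, elsize::Int)
    n = length(bytes) ÷ elsize
    planes = reshape(bytes, n, elsize)   # column j = j-th byte of every element
    return vec(permutedims(planes))      # column i = all bytes of element i
end

# Inverse operation, used here only to build a test page.
function split_bytes(bytes::Vector{UInt8}, elsize::Int)
    n = length(bytes) ÷ elsize
    return vec(permutedims(reshape(bytes, elsize, n)))
end

deltas = Int64[30, 4, 18, 14]   # differences matching the first offsets printed above
page = split_bytes(collect(reinterpret(UInt8, deltas)), 8)

decoded = collect(reinterpret(Int64, unsplit_bytes(page, 8)))
offsets = cumsum(decoded)       # delta decoding within one page
println(offsets)                # [30, 34, 52, 66]
```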
julia> cluster_group.nested_page_locations[2][42]
8-element UnROOT.RNTupleListNoFrame{UnROOT.PageDescription}:
UnROOT.PageDescription(0x00004000, UnROOT.Locator(num_bytes=38, offset=0x000000000196af81, )
)
UnROOT.PageDescription(0x00004000, UnROOT.Locator(num_bytes=38, offset=0x0000000001f361e0, )
)
UnROOT.PageDescription(0x00004000, UnROOT.Locator(num_bytes=38, offset=0x00000000024cac93, )
)
UnROOT.PageDescription(0x00004000, UnROOT.Locator(num_bytes=38, offset=0x0000000002ac8a47, )
)
UnROOT.PageDescription(0x00004000, UnROOT.Locator(num_bytes=38, offset=0x00000000030d9f71, )
)
UnROOT.PageDescription(0x00004000, UnROOT.Locator(num_bytes=38, offset=0x00000000036576d6, )
)
UnROOT.PageDescription(0x00004000, UnROOT.Locator(num_bytes=38, offset=0x0000000003c710aa, )
)
UnROOT.PageDescription(0x00003648, UnROOT.Locator(num_bytes=38, offset=0x0000000003f3f9a0, )
)
julia> reinterpret(UInt32, UnROOT.read_pagedesc(io, cluster_group.nested_page_locations[2][42], 32; split=true))
128584-element reinterpret(UInt32, ::Vector{UInt8}):
0x2784318b
0x2784318b
0x2784318b
0x2784318b
0x2784318b
0x2784318b
0x2784318b
0x2784318b
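A quick consistency check on the page descriptions above. The first field of each `PageDescription`, printed in hex, appears to be the number of elements in that page (an assumption, based on the totals matching). Summed per column, these counts reproduce the lengths of the decoded arrays.

```julia
# Element counts copied from the PageDescription entries printed above.
col40 = [0x00000598]
col41 = [0x00002000, 0x00001282]
col42 = [fill(0x00004000, 7); 0x00003648]

total(pages) = sum(Int, pages)

println(total(col40))  # 1432
println(total(col41))  # 12930
println(total(col42))  # 128584
```

So the page lists themselves look complete: each column has exactly as many elements as the arrays read back from it.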
First cluster seems to match:
ROOT
In [39]: df = ROOT.RDataFrame("RNT:CollectionTree", "./DAOD_TRUTH1.zprime125.rntuple.root")
In [41]: df.GetColumnType("AntiKt4TruthDressedWZJetsAux:.constituentLinks.:_0.m_persKey")
Out[41]: 'ROOT::VecOps::RVec<ROOT::VecOps::RVec<std::uint32_t>>'
In [67]: list(list(list(df.Take['ROOT::VecOps::RVec<ROOT::VecOps::RVec<std::uint32_t>>']("AntiKt4TruthDressedWZJetsAux:.constituentLinks.:_0.m_persIndex"))[3])[2])
Out[67]: [748, 932, 936, 935, 934]
In [68]: list(list(list(df.Take['ROOT::VecOps::RVec<ROOT::VecOps::RVec<std::uint32_t>>']("AntiKt4TruthDressedWZJetsAux:.constituentLinks.:_0.m_persKey"))[3])[2])
Out[68]: [662974859, 662974859, 662974859, 662974859, 662974859]
In [62]: list(list(list(df.Take['ROOT::VecOps::RVec<ROOT::VecOps::RVec<std::uint32_t>>']("AntiKt4TruthDressedWZJetsAux:.constituentLinks.:_0.m_persIndex"))[527])[7])
Out[62]: [1575, 2481, 435, 2480, 2477, 2532]
In [63]: list(list(list(df.Take['ROOT::VecOps::RVec<ROOT::VecOps::RVec<std::uint32_t>>']("AntiKt4TruthDressedWZJetsAux:.constituentLinks.:_0.m_persKey"))[527])[7])
Out[63]: [662974859, 662974859, 662974859, 662974859, 662974859, 662974859]
UnROOT
julia> br = rnt.var"AntiKt4TruthDressedWZJetsAux:";
julia> [Int(x[1].m_persIndex) for x in br[4].constituentLinks[3]]
5-element Vector{Int64}:
748
932
936
935
934
julia> [Int(x[1].m_persKey) for x in br[4].constituentLinks[3]]
5-element Vector{Int64}:
662974859
662974859
662974859
662974859
662974859
julia> [Int(x[1].m_persIndex) for x in br[528].constituentLinks[8]]
6-element Vector{Int64}:
1575
2481
435
2480
2477
2532
julia> [Int(x[1].m_persKey) for x in br[528].constituentLinks[8]]
6-element Vector{Int64}:
662974859
662974859
662974859
662974859
662974859
662974859
Second cluster
ROOT
In [65]: list(list(list(df.Take['ROOT::VecOps::RVec<ROOT::VecOps::RVec<std::uint32_t>>']("AntiKt4TruthDressedWZJetsAux:.constituentLinks.:_0.m_persIndex"))[528])[0])
Out[65]:
[1709,
1122,
1132,
1808,
1807,
...
In [92]: list(list(list(df.Take['ROOT::VecOps::RVec<ROOT::VecOps::RVec<std::uint32_t>>']("AntiKt4TruthDressedWZJetsAux:.constituentLinks.:_0.m_persIndex"))[528+1432-1])[-1])
Out[92]: [1427, 900, 546, 849, 1433, 1425, 1431, 845, 964]
Julia
We crash here, so using debug output:
# @show inside second cluster
length(content) = 128584
[Int((x[1]).m_persIndex) for x = first(content, 5)] = [1709, 1122, 1132, 1808, 1807]
[Int((x[1]).m_persIndex) for x = last(content, 9)] = [1427, 900, 546, 849, 1433, 1425, 1431, 845, 964]
So it looks like we have all the data we want; somehow the indices are just not aligned. What about the total amount of data in the 2nd cluster?
# be warned: this is slow
a = []
In [116]: for i in range(528, 528+1432):
...: ns = list(list(df.Take['ROOT::VecOps::RVec<ROOT::VecOps::RVec<std::uint32_t>>']("AntiKt4TruthDressedWZJetsAux:.constituentLinks.:_0.m_persIndex"))[i])
...: for n in ns:
...: a.append(len(list(n)))
In [121]: sum(a)
Out[121]: 128584
Well, I'm not missing any data here, so I don't know what is going on.
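Since the content column has the expected 128584 elements, one more check is worth sketching. For a correctly decoded inner offset column, the offsets are cumulative list lengths, so the last offset should equal the number of content elements. The decoded col=41 printed earlier ends at 210008, not 128584. `offsets_consistent` is a helper name introduced here for illustration.

```julia
# Offsets are cumulative list lengths: they must be non-decreasing
# and the last one must equal the number of content elements.
function offsets_consistent(offsets, ncontent)
    return !isempty(offsets) && issorted(offsets) && last(offsets) == ncontent
end

println(offsets_consistent([30, 34, 52], 52))   # true
println(offsets_consistent([30, 34, 52], 40))   # false
println(offsets_consistent([210008], 128584))   # false: the two totals seen in this thread
```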
Given the content column is fine, I now suspect we're doing something wrong with the offset column:
field.offset_col = Leaf{UnROOT.Index64}(col=41)
And sure enough, this is that weird column we weren't able to read earlier, because we couldn't inflate 29 bytes into 131072.
Let's compare the offsets.
ROOT
In [8]: a[:10]
Out[8]: [30, 4, 18, 14, 5, 8, 10, 7, 8, 18]
UnROOT
Int.(first(offset, 10)) = [30, 34, 52, 66, 71, 79, 89, 96, 104, 122]
diff([0; Int.(first(offset, 10))]) = [30, 4, 18, 14, 5, 8, 10, 7, 8, 18]
So they start out agreeing!
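The comparison relies on `diff` and `cumsum` being inverses once a leading zero is prepended. Restated as a self-contained check, with both vectors copied from the output above:

```julia
offsets = [30, 34, 52, 66, 71, 79, 89, 96, 104, 122]   # UnROOT: cumulative offsets
lengths = [30, 4, 18, 14, 5, 8, 10, 7, 8, 18]          # ROOT: per-list lengths

# diff with a leading 0 turns offsets into lengths; cumsum goes the other way.
@assert diff([0; offsets]) == lengths
@assert cumsum(lengths) == offsets
println("first 10 entries agree")
```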
Compare the content of col=41
They appear to be the same at first glance; in fact, they start and end the same:
julia> length(my_offset) == length(ref_offset)
true
julia> first(my_offset, 10) == first(ref_offset, 10)
true
julia> last(my_offset, 10) == last(ref_offset, 10)
true
Ah, it's because of
Specifically, "counting is relative to the cluster" was not clear to me. But what it means is, if you have 2 pages of
first page: [30, 4, 18, 14, 5, 8, 10, 7, 8, 18, ..., 22, 1, 16, 14, 14, 12, 4, 7, 15, 24]
first page after cumsum: [..., 81317, 81318, 81334, 81348, 81362, 81374, 81378, 81385, 81400, 81424]
second page: [81428, 13, 9, 7, 5, 20, 21, 4, 8, 6, ..., 8, 3, 4, 9, 14, 18, 8, 14, 16, 6, 8, 7, 11, 9]
You can see that the second page doesn't start with 4; instead, it starts with a huge number. That number, 81428, is the last cumulative value of the first page (81424) plus 4, so the first element of the second page is already an offset relative to the cluster, not a delta from the previous page.
See also: root-project/root#14982
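The consequence for decoding, as I read the example above, is that the cumulative sum has to restart on every page: the first element of each page is already an offset relative to the cluster, and only the remaining elements are deltas. A minimal sketch with small made-up pages (`decode_per_page` and `decode_naive` are illustrative names, not UnROOT functions, and this is not necessarily how the fix was implemented):

```julia
# Two pages of one delta-encoded offset column (toy numbers).
# The second page starts with 66 = 52 + 14, not with the delta 14.
pages = [[30, 4, 18], [66, 5, 8]]

# Correct: cumulative sum within each page, then concatenate.
decode_per_page(pages) = reduce(vcat, cumsum.(pages))

# Wrong: concatenate first, then one cumulative sum across the page boundary.
decode_naive(pages) = cumsum(reduce(vcat, pages))

println(decode_per_page(pages))  # [30, 34, 52, 66, 71, 79]
println(decode_naive(pages))     # [30, 34, 52, 118, 123, 131]
```

With the naive version, the final offset overshoots by the last offset of the first page (131 = 79 + 52). The numbers in this thread fit that pattern: 210008 = 128584 + 81424.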