Do not manage to read a TTree with a structure of arrays of basic types #298

Open
peremato opened this issue Dec 20, 2023 · 17 comments

@peremato
Member

EDM4hep root files store in a tree called podio_metadata an object of the type

struct podio::CollectionIDTable {
  vector<unsigned int> m_collectionIDs;
  vector<string> m_names;
};

The following is a reproducer:

using UnROOT

struct CollectionIDTable
    collectionIDs::Vector{UInt32}
    names::Vector{String}
end

f = "Output_REC.root"

tfile = ROOTFile(f)
# tfile.customstructs["podio::CollectionIDTable"] = CollectionIDTable
meta = UnROOT.LazyTree(tfile, "podio_metadata", ["events___idTable"])

The test file can be downloaded from https://github.com/peremato/EDM4hep.jl/blob/main/examples/Output_REC.root

@Moelf
Member

Moelf commented Dec 20, 2023

@tamasgal this thing hits fID equals -2, I think we're missing something fundamental here

@tamasgal
Member

tamasgal commented Dec 20, 2023

Actually the only missing thing in this case is the leaf type support for vector<unsigned int> (see #299). I should have added those, so you can blame me ;) The vector<string> stuff is already supported. You don't need a custom streamer.

With #299 the following works (without it, reading the m_collectionIDs part fails):

julia> using UnROOT

julia> f = ROOTFile("/Users/tamasgal/Downloads/Output_REC.root")
ROOTFile with 3 entries and 51 streamers.
/Users/tamasgal/Downloads/Output_REC.root
├─ runs (TTree)
│  └─ "PARAMETERS"
├─ events (TTree)
│  ├─ "AllCaloHitContributionsCombined"
│  ├─ "_AllCaloHitContributionsCombined_particle"
│  ├─ "BeamCal_Hits"
│  ├─ "⋮"
│  ├─ "YokeEndcapCollection"
│  ├─ "_YokeEndcapCollection_contributions"
│  └─ "PARAMETERS"
└─ podio_metadata (TTree)
   ├─ "events___idTable"
   ├─ "events___CollectionTypeInfo"
   ├─ "runs___idTable"
   ├─ "runs___CollectionTypeInfo"
   ├─ "PodioBuildVersion"
   └─ "EDMDefinitions"


julia> LazyBranch(f, "podio_metadata/events___idTable/m_names")
1-element LazyBranch{SubArray{String, 1, Vector{String}, Tuple{UnitRange{Int64}}, true}, UnROOT.Offsetjagg, ArraysOfArrays.VectorOfVectors{String, Vector{String}, Vector{Int32}, Vector{Tuple{}}}}: 
 ["AllCaloHitContributionsCombined", "EventHeader", "BeamCalClusters", "BeamCalClusters_particleIDs", "BeamCalCollection", "BeamCalRecoParticles", "BeamCalRecoParticles_particleIDs", "BeamCal_Hits", "BuildUpVertices", "BuildUpVertices_RP"  …  "TightSelectedPandoraPFOs", "InnerTrackerBarrelHitsRelations", "InnerTrackerEndcapHitsRelations", "OuterTrackerBarrelHitsRelations", "OuterTrackerEndcapHitsRelations", "RefinedVertexJets_rel", "RelationCaloHit", "RelationMuonHit", "VXDEndcapTrackerHitRelations", "VXDTrackerHitRelations"]

julia> LazyBranch(f, "podio_metadata/events___idTable/m_collectionIDs")
1-element LazyBranch{SubArray{UInt32, 1, Vector{UInt32}, Tuple{UnitRange{Int64}}, true}, UnROOT.Offsetjagg, ArraysOfArrays.VectorOfVectors{UInt32, Vector{UInt32}, Vector{Int32}, Vector{Tuple{}}}}: 
 UInt32[0x3a25675d, 0xd793ab91, 0xf0d073dd, 0x1d19206c, 0xc298a348, 0xc29370d2, 0x3954b563, 0xd2b19e7b, 0xfd03f5d0, 0x310a0f04  …  0x5fa7cf93, 0x029be193, 0x743732ae, 0xc42bbbee, 0xd1211017, 0x8dac6bb6, 0x603a5016, 0xdf24625a, 0xbb4cff22, 0x178c9330]

julia> LazyTree(f, "podio_metadata", [Regex("events___idTable/(.*)") => s"\1"])
 Row │ m_names                                                    m_collectionIDs                                ⋯
     │ SubArray{String                                            SubArray{UInt32                                ⋯
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1   │ ["AllCaloHitContributionsCombined", "EventHeader", "BeamC  [975529821, 3616779153, 4040192989, 488185964, ⋯
                                                                                                  1 column omitted

@tamasgal
Member

Fixed in v0.10.21.

@peremato let me know if it works for you.

Btw. just a little bit of clarification: the custom parsing always applies to a branch and not a tree (or set of branches). It's usually needed when the split-level is low (so that one needs to deserialise compound structures) or if the type for a specific branch is simply not supported.

@Moelf
Member

Moelf commented Dec 20, 2023

huh, I don't know why this doesn't error due to fID== -2, maybe because custom struct logic doesn't hit that?

@tamasgal
Member

tamasgal commented Dec 20, 2023

How did you get the fID == -2 bubble up? Sorry for my ignorance, I have not looked closely enough 😆

@tamasgal
Member

tamasgal commented Dec 20, 2023

Ah I see:

julia> UnROOT.LazyTree(f, "podio_metadata", ["events___idTable"])
fID = -2   # <- added a @show here...
ERROR: BoundsError: attempt to access 2-element Vector{Any} at index [-1]
Stacktrace:
  [1] getindex(A::Vector{Any}, i1::Int64)
    @ Base ./essentials.jl:13
  [2] streamerfor(f::ROOTFile, branch::UnROOT.TBranchElement_10)
    @ UnROOT ~/Dev/UnROOT.jl/src/root.jl:161

Yes, that negative fID is weird. I have some notes on it but I have no solution yet.

EDIT: and yes, if you go to the deepest split level and there is an interpretation (like the one for vector<unsigned int>) you will not hit the logic with the fID

@tamasgal
Member

tamasgal commented Dec 20, 2023

In this case UnROOT.streamerfor needs to figure out the parser logic from the actual streamer, which is there, but it fails because of the lookup. The lookup in this case is not index-based (on fID); the streamer can instead be retrieved via its fName (below I also printed the available streamers).

It all boils down to bringing the automatic parser generation to this level, so that it works without using the split branches.

julia> UnROOT.streamerfor(f, "podio::CollectionIDTable")
e.streamer.fName = "TObject"
e.streamer.fName = "TCollection"
e.streamer.fName = "podio::GenericParameters"
e.streamer.fName = "pair<string,vector<int> >"
e.streamer.fName = "pair<string,vector<float> >"
e.streamer.fName = "pair<string,vector<string> >"
e.streamer.fName = "pair<string,vector<double> >"
e.streamer.fName = "vector<int>"
e.streamer.fName = "vector<float>"
e.streamer.fName = "edm4hep::CaloHitContributionData"
e.streamer.fName = "edm4hep::Vector3f"
e.streamer.fName = "podio::ObjectID"
e.streamer.fName = "edm4hep::CalorimeterHitData"
e.streamer.fName = "edm4hep::ClusterData"
e.streamer.fName = "edm4hep::ParticleIDData"
e.streamer.fName = "edm4hep::SimCalorimeterHitData"
e.streamer.fName = "edm4hep::ReconstructedParticleData"
e.streamer.fName = "edm4hep::VertexData"
e.streamer.fName = "edm4hep::EventHeaderData"
e.streamer.fName = "edm4hep::SimTrackerHitData"
e.streamer.fName = "edm4hep::Vector3d"
e.streamer.fName = "edm4hep::MCRecoTrackerHitPlaneAssociationData"
e.streamer.fName = "edm4hep::TrackerHitPlaneData"
e.streamer.fName = "edm4hep::Vector2f"
e.streamer.fName = "edm4hep::ObjectID"
e.streamer.fName = "edm4hep::MCParticleData"
e.streamer.fName = "edm4hep::Vector2i"
e.streamer.fName = "edm4hep::RecoParticleVertexAssociationData"
e.streamer.fName = "edm4hep::MCRecoCaloAssociationData"
e.streamer.fName = "edm4hep::TrackData"
e.streamer.fName = "edm4hep::TrackState"
e.streamer.fName = "edm4hep::Quantity"
e.streamer.fName = "podio::CollectionIDTable"
UnROOT.StreamerInfo(UnROOT.TStreamerInfo{UnROOT.TObjArray}("podio::CollectionIDTable", "", 0xe9251d6f, 1, UnROOT.TObjArray("", 0, Any[UnROOT.TStreamerSTL
  version: UInt16 0x0004
  fOffset: Int64 0
  fName: String "m_collectionIDs"
  fTitle: String ""
  fType: Int32 500
  fSize: Int32 24
  fArrayLength: Int32 0
  fArrayDim: Int32 0
  fMaxIndex: Array{Int32}((5,)) Int32[0, 0, 0, 0, 0]
  fTypeName: String "vector<unsigned int>"
  fXmin: Float64 0.0
  fXmax: Float64 0.0
  fFactor: Float64 0.0
  fSTLtype: Int32 1
  fCtype: Int32 13
, UnROOT.TStreamerSTL
  version: UInt16 0x0004
  fOffset: Int64 0
  fName: String "m_names"
  fTitle: String ""
  fType: Int32 500
  fSize: Int32 24
  fArrayLength: Int32 0
  fArrayDim: Int32 0
  fMaxIndex: Array{Int32}((5,)) Int32[0, 0, 0, 0, 0]
  fTypeName: String "vector<string>"
  fXmin: Float64 0.0
  fXmax: Float64 0.0
  fFactor: Float64 0.0
  fSTLtype: Int32 1
  fCtype: Int32 61
])), Set{Any}())

I need to study what uproot is doing with the negative fID, since it's able to get this right:

>>> import uproot

>>> f = uproot.open("/Users/tamasgal/Downloads/Output_REC.root")

>>> f["podio_metadata/events___idTable"]
<TBranchElement 'events___idTable' (2 subbranches) at 0x00010b58eb20>

>>> f["podio_metadata/events___idTable"].array()
<Array [{m_collectionIDs: [...], ...}] type='1 * {m_collectionIDs: var * ui...'>

@Moelf
Member

Moelf commented Dec 20, 2023

yeah, from my very quick look, uproot does not do anything with fID explicitly

@tamasgal
Member

Yes... I mean, obviously the information is sitting right in front of us ;) So in that case UnROOT should create the corresponding struct and add a readtype or whatever dynamically. That's what's missing.

@tamasgal
Member

It's just a bit weird that this works fine in so many cases 😆 :

return next_streamer.streamer.fElements.elements[fID + 1] # one-based indexing in Julia
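To make the failure mode concrete, here is a minimal sketch of why `fID = -2` produces the `BoundsError` at index `-1`, next to a name-based lookup of the kind discussed above. The `Element` type and the `by_index`/`by_name` helpers are hypothetical stand-ins written for this illustration; they are not UnROOT's real types or its actual fix.

```julia
# Hypothetical stand-in for a streamer element; not an UnROOT type.
struct Element
    fName::String
end

# Index-based lookup, mirroring `elements[fID + 1]` (one-based indexing).
by_index(elements, fID) = elements[fID + 1]

# Name-based fallback: an assumption about one possible approach.
function by_name(elements, name)
    i = findfirst(e -> e.fName == name, elements)
    return i === nothing ? nothing : elements[i]
end

elements = [Element("m_collectionIDs"), Element("m_names")]

# Non-negative fID values map onto valid indices.
first_name = by_index(elements, 0).fName

# fID = -2 maps to index -1, which is outside 1:2.
threw = try
    by_index(elements, -2)
    false
catch err
    err isa BoundsError
end

# Looking up by name does not depend on fID at all.
found = by_name(elements, "m_names")
```

The index-based path works whenever fID is a valid zero-based position, which may be why it succeeds in so many files; a negative fID needs a different route to the element.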

@peremato
Member Author

> Fixed in v0.10.21.
>
> @peremato let me know if it works for you.
>
> Btw. just a little bit of clarification: the custom parsing always applies to a branch and not a tree (or set of branches). It's usually needed when the split-level is low (so that one needs to deserialise compound structures) or if the type for a specific branch is simply not supported.

First, thanks very much @tamasgal. It works great once you know how to do it.

The way to select branches and leaves is still very confusing to me (perhaps due to a lack of proper documentation, or of prior knowledge of how ROOT files are organised). This works nicely:

julia> meta = UnROOT.LazyTree(tfile, "podio_metadata", [Regex("events___idTable/(.*)") => s"\1"])
 Row │ m_names                                                                                                  m_collectionIDs                                ⋯
     │ SubArray{String                                                                                          SubArray{UInt32                                ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1   │ ["AllCaloHitContributionsCombined", "EventHeader", "BeamCalClusters", "BeamCalClusters_particleIDs", "B  [975529821, 3616779153, 4040192989, 488185964, ⋯
                                                                                                                                                1 column omitted

but what I would naively do does not:

julia> meta = UnROOT.LazyTree(tfile, "podio_metadata", ["m_names", "m_collectionIDs"])
ERROR: MethodError: no method matching LazyBranch(::ROOTFile, ::Missing)

Closest candidates are:
  LazyBranch(::ROOTFile, ::AbstractString)
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:134
  LazyBranch(::ROOTFile, ::Union{UnROOT.TBranch, UnROOT.TBranchElement})
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:116

Stacktrace:
 [1] LazyBranch(f::ROOTFile, s::String)
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:134
 [2] LazyTree(f::ROOTFile, tree::UnROOT.TTree, treepath::String, branches::Vector{String}; sink::Type{LazyTree})
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:450
 [3] LazyTree
   @ ~/Development/UnROOT.jl/src/iteration.jl:432 [inlined]
 [4] LazyTree(f::ROOTFile, s::String, branches::Vector{String}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:393
 [5] LazyTree(f::ROOTFile, s::String, branches::Vector{String})
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:390
 [6] top-level scope
   @ REPL[6]:1

The following works, but the names of the columns are wrong:

julia> meta = UnROOT.LazyTree(tfile, "podio_metadata", ["events___idTable/m_names", "events___idTable/m_collectionIDs"])
 Row │ events___idTabl                                                                                          events___idTabl                                ⋯
     │ SubArray{UInt32                                                                                          SubArray{String                                ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1   │ [975529821, 3616779153, 4040192989, 488185964, 3264783176, 3264442578, 961852771, 3534855803, 424489518  ["AllCaloHitContributionsCombined", "EventHead ⋯
                                                                                                                                                1 column omitted

I also tried the naming convention used for the other tree, "events", with <branch>_<leaf>, but that does not work either. I see that for LazyBranch the convention is <branch>/<leaf>. Overall it is very confusing.
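The effect of the `Regex => SubstitutionString` pair can be tried on plain strings, without a ROOT file. The sketch below assumes that the pair is used to pick the matching branch paths and to rewrite each one into a column name, roughly as `replace` does in Base; how UnROOT applies it internally is not shown in this thread, so treat this as an illustration only. The path list is taken from the outputs above.

```julia
# Paths seen in this thread, plus a bare leaf name for comparison.
paths = [
    "events___idTable/m_names",
    "events___idTable/m_collectionIDs",
    "PodioBuildVersion",
    "m_names",
]

# The same pair as in the working LazyTree call.
rule = Regex("events___idTable/(.*)") => s"\1"

# Keep only the paths the pattern matches...
selected = filter(p -> occursin(first(rule), p), paths)

# ...and rewrite each one to the captured group.
columns = [replace(p, rule) for p in selected]
# columns == ["m_names", "m_collectionIDs"]
```

Note that the bare `"m_names"` does not match the pattern: the pattern describes the full `<branch>/<leaf>` path, and the capture group only determines the resulting column name.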

@tamasgal
Member

Yes, the problem is indeed that you need to know a little bit about the subtleties of the ROOT structure. As you can see, uproot also requires you to point to events___idTable, but then does the automatic RecArray creation from the sub-branches. This is of course something I'd like to have in UnROOT as well, but it requires a lot of restructuring. As always, you learn ROOT iteratively and early design decisions need to be changed quite often (I had so many iterations in UnROOT already 😆 ).

I really hope that I will find a longer time slot (2-4 weeks) next year to spend a significant amount of time on refactoring UnROOT.

>>> import uproot

>>> f = uproot.open("/Users/tamasgal/Downloads/Output_REC.root")

>>> f["podio_metadata/events___idTable"]
<TBranchElement 'events___idTable' (2 subbranches) at 0x00010b58eb20>

>>> f["podio_metadata/events___idTable"].array()
<Array [{m_collectionIDs: [...], ...}] type='1 * {m_collectionIDs: var * ui...'>

@tamasgal
Member

tamasgal commented Dec 21, 2023

Regarding the events tree, you do the same, but also here you need to provide the full path to the sub-branches:

julia> LazyTree(f, "events", [r"BeamCal_Hits/BeamCal_Hits.*\.(\w+)$" => s"\1"])
 Row │ time             x                energyError      energy           y   
     │ SubArray{Float3  SubArray{Float3  SubArray{Float3  SubArray{Float3  Sub 
─────┼──────────────────────────────────────────────────────────────────────────
 1   │ []               []               []               []               []  
 2   │ []               []               []               []               []  
 3   │ []               []               []               []               []  
 4   │ []               []               []               []               []  
 5   │ []               []               []               []               []  
 6   │ []               []               []               []               []  
 7   │ []               []               []               []               []  
 8   │ []               []               []               []               []  
 9   │ []               []               []               []               []  
 10  │ []               []               []               []               []  
 11  │ []               []               []               []               []  
 12  │ []               []               []               []               []  
 13  │ [0.0, 0.0,       [-8.2, -8.       [0.0, 0.0,       [0.0267, 0       [63 
 14  │ []               []               []               []               []  
 15  │ []               []               []               []               []  
 16  │ []               []               []               []               []  
 17  │ []               []               []               []               []  
 18  │ []               []               []               []               []  
 19  │ [0.0, 0.0]       [3.17, 3.2       [0.0, 0.0]       [0.0305, 0       [-1 
 20  │ []               []               []               []               []  
 21  │ []               []               []               []               []  
 22  │ [0.0, 0.0]       [151.0, 15       [0.0, 0.0]       [0.0128, 0       [-8 
                                                               
                                                    4 columns and 3 rows omitted
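The pattern above keeps only the last dot-separated component of each sub-branch name. A small sketch on plain strings shows what that means for nested members. The full paths below are assumptions reconstructed from the column names in this thread (for example `BeamCal_Hits.position.x`), and `has_collision` is a hypothetical helper, not part of UnROOT.

```julia
# The same pair as in the LazyTree call above.
rule = r"BeamCal_Hits/BeamCal_Hits.*\.(\w+)$" => s"\1"

# Assumed full sub-branch paths (not read from the file).
paths = [
    "BeamCal_Hits/BeamCal_Hits.energy",
    "BeamCal_Hits/BeamCal_Hits.time",
    "BeamCal_Hits/BeamCal_Hits.position.x",
    "BeamCal_Hits/BeamCal_Hits.position.y",
]

# The greedy `.*` swallows intermediate members such as `.position`,
# so only the final component survives.
short = [replace(p, rule) for p in paths]
# short == ["energy", "time", "x", "y"]

# Hypothetical check: two paths ending in the same component would
# produce the same column name.
has_collision(names) = length(unique(names)) < length(names)
```

Because intermediate members are dropped, it can be worth checking the resulting names for duplicates before relying on them.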

@peremato
Member Author

I was not doing this. If I do

julia> events = LazyTree(f, "events", ["BeamCal_Hits"])
 Row │ BeamCal_Hits_en            BeamCal_Hits_ti            BeamCal_Hits_en            BeamCal_Hits_po            BeamCal_Hits_po            BeamCal_Hits_po  ⋯
     │ SubArray{Float3            SubArray{Float3            SubArray{Float3            SubArray{Float3            SubArray{Float3            SubArray{Float3  ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1   │ []                         []                         []                         []                         []                         []               ⋯
 2   │ []                         []                         []                         []                         []                         []               ⋯
 3   │ []                         []                         []                         []                         []                         []               ⋯
 4   │ []                         []                         []                         []                         []                         []               ⋯
 5   │ []                         []                         []                         []                         []                         []               ⋯
 6   │ []                         []                         []                         []                         []                         []               ⋯
 7   │ []                         []                         []                         []                         []                         []               ⋯
 8   │ []                         []                         []                         []                         []                         []               ⋯
 9   │ []                         []                         []                         []                         []                         []               ⋯
 10  │ []                         []                         []                         []                         []                         []               ⋯
 11  │ []                         []                         []                         []                         []                         []               ⋯
 12  │ []                         []                         []                         []                         []                         []               ⋯
 13  │ [0.0, 0.0, 0.0, 0.0, 0.0,  [0.0, 0.0, 0.0, 0.0, 0.0,  [0.0267, 0.0214, 0.0853,   [3290.0, 3290.0, 3290.0,   [-8.2, -8.16, -1.92, 31.1  [63.1, 63.1, 66. ⋯
 14  │ []                         []                         []                         []                         []                         []               ⋯
 15  │ []                         []                         []                         []                         []                         []               ⋯
 16  │ []                         []                         []                         []                         []                         []               ⋯
 17  │ []                         []                         []                         []                         []                         []               ⋯
 18  │ []                         []                         []                         []                         []                         []               ⋯
 19  │ [0.0, 0.0]                 [0.0, 0.0]                 [0.0305, 0.0754]           [-3350.0, -3360.0]         [3.17, 3.21]               [-19.2, -19.2]   ⋯
 20  │ []                         []                         []                         []                         []                         []               ⋯
 21  │ []                         []                         []                         []                         []                         []               ⋯
 22  │ [0.0, 0.0]                 [0.0, 0.0]                 [0.0128, 0.00132]          [3360.0, 3380.0]           [151.0, 151.0]             [-86.8, -86.8]   ⋯
 23  │ [0.0]                      [0.0]                      [2.02f-6]                  [3390.0]                   [-62.9]                    [61.3]           ⋯

and the leaves get the name <branch>_<leaf>

julia> names(events)
8-element Vector{String}:
 "BeamCal_Hits_energyError"
 "BeamCal_Hits_time"
 "BeamCal_Hits_energy"
 "BeamCal_Hits_position_z"
 "BeamCal_Hits_position_x"
 "BeamCal_Hits_position_y"
 "BeamCal_Hits_cellID"
 "BeamCal_Hits_type"

@tamasgal
Member

I mean, technically we can do this LazyTree creation on the fly automatically but I could not come up with a way which works reliably, especially with all those funny (read weird) namings and dot-madness. So eventually we need to ask the user to provide the regex to help UnROOT make reasonable fieldnames like x instead of BeamCal_Hits.position.x which would anyways not be valid due to the dots, so it needs to be translated to BeamCal_Hits_position_x or so, but notice here that BeamCal_Hits is redundant, since the branch is already called like that. ROOT however still stores that with that prefix. BUT not always and I still don't know why. We have some logic in UnROOT which works quite OK but it will still give you funny names in many cases. That's why I introduced that regex-thing, which I highly abuse 😉 see here:

https://github.com/KM3NeT/KM3io.jl/blob/65318a1265fd6bfa064b06a5c4721711160e50f1/src/root/offline.jl#L164-L193

Actually that is basically the place where we would need to incorporate the original streamer, which tells you how to name them and how the hierarchy is structured, but it's quite complex and UnROOT then really would have to define those structs at runtime, which brings us to the...

...painful fact: if you let UnROOT define the structs, you will not be able to use those types in your own analysis code explicitly. Which means that of course Julia will happily pass you the instances, and your function will eat those types as well and everything is fine (and type-stable) but you will not be able to restrict or use those types to utilise multiple dispatch features since they are created on the fly and attached to the UnROOT namespace (that would technically be type piracy) and of course you will have to deal with dynamic dispatch all(?) the time.

That's why I kind of like that we simply use LazyTree, which is a highly parametric type, signalling that it's a universal thing (like a named tuple), but it allows you to hide your data in some container type and/or reinterpret it as your own types. So we force the use of a barrier in order to be able to make use of a solid type system. That's what I have shown in the presentation "KM3io.jl - Making UnROOT.jl comfortable for KM3NeT" (Tamas Gal).

On the other hand, you can of course provide your custom structs and make UnROOT utilise those, so you have full control and maximum efficiency. That's also shown in the presentation above, but of course requires more understanding of the underlying structures.

I use both techniques with great performance.
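As a rough sketch of the barrier idea, the generic columns can be converted once into a user-owned type, after which downstream functions dispatch on that type. Here a `NamedTuple` of vectors stands in for the columns of a `LazyTree` (an assumption for illustration; a real `LazyTree` is read lazily from a file), and `Hit`, `to_hits` and `total_energy` are hypothetical names.

```julia
# Stand-in for the columns of one tree (assumption, not a real LazyTree).
columns = (energy = [0.0267, 0.0214], time = [0.0, 0.0])

# A type owned by the analysis code rather than generated by UnROOT.
struct Hit
    energy::Float64
    time::Float64
end

# Barrier: turn the generic columns into user types in one place.
to_hits(cols) = [Hit(e, t) for (e, t) in zip(cols.energy, cols.time)]

# Downstream code can restrict its signature to the user's own type.
total_energy(hits::Vector{Hit}) = sum(h.energy for h in hits)

hits = to_hits(columns)
etot = total_energy(hits)
```

The conversion costs one pass over the data, but everything after `to_hits` sees a concrete, named type that multiple dispatch can work with.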

@tamasgal
Member

> I was not doing this. If I do

Yes that works too, if you are fine with the UnROOT naming ;)

@peremato
Member Author

Hi Tom. I agree we can do several things and hide the UnROOT level. If you want, have a look at what I have been doing with EDM4hep.jl. I am mapping a simple Julia type (isbits) to a set of columns in the LazyTree within a StructArray, in a recursive manner. This is very convenient and gives good performance for some use cases. There are some examples, like ttbar_digits.jl, to illustrate what you can do. I gave a presentation this week to the team developing this event model. It is very encouraging.
