Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for shape touching in Dask #966

Merged
merged 30 commits into from
Oct 11, 2023

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Oct 3, 2023

Warning
Depends upon #980

This PR adds support for the shape_touched tracking performed by Awkward typetracers in order to avoid reading all data buffers.

High level overview of this PR:

  1. _UprootRead et al. implement the new buffer projection interface
  2. Implements non-remapped reads as a trivial form mapping
  3. Replaces rehydration call with from_buffers

@agoose77 agoose77 marked this pull request as ready for review October 3, 2023 16:30
Comment on lines 783 to 806
def build_form_key_to_key(form: Form) -> dict[str, str | None]:
form_key_to_path: dict[str, str | None] = {}

def impl(form, column_path):
# Store columnar path
form_key_to_path[form.form_key] = column_path[-1] if column_path else None

if form.is_union:
for _i, entry in enumerate(form.contents):
impl(entry, column_path)
elif form.is_indexed:
impl(form.content, column_path)
elif form.is_list:
impl(form.content, column_path)
elif form.is_option:
impl(form.content, column_path)
elif form.is_record:
for field in form.fields:
impl(form.content(field), (*column_path, field))
elif form.is_unknown or form.is_numpy:
pass
else:
actual_form = self.rendered_form
raise AssertionError(form)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This simple recursion builds a mapping form form_key to the key (depth-0 field) in the TTree

Comment on lines 811 to 816
buffer_key: Final[str] = "{form_key}-{attribute}"

def parse_buffer_key(self, buffer_key: str) -> tuple[str, str]:
form_key, attribute = buffer_key.rsplit("-", maxsplit=1)
return form_key, attribute
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we decide upon a trivial form_key ←→ buffer_key encoding.

Comment on lines 817 to 825
def keys_for_buffer_keys(self, buffer_keys: set[str]) -> set[str]:
keys: set[str] = set()
for buffer_key in buffer_keys:
# Identify form key
form_key, attribute = buffer_key.rsplit("-", maxsplit=1)
# Identify key from form_key
keys.add(self._form_key_to_key[form_key])
return keys
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the new dask-awkward buffer projection, we implement a function that maps buffer keys (of our own choosing!) to the TTree keys.

Comment on lines 825 to 862

def load_buffers(
self, tree: HasBranches, keys: set[str], start: int, stop: int, options: Any
) -> Mapping[str, AwkArray]:
# First, let's read the arrays as a tuple (to associate with each key)
arrays = tree.arrays(
keys,
entry_start=start,
entry_stop=stop,
ak_add_doc=self.interp_options["ak_add_doc"],
ak_add_doc=options["ak_add_doc"],
how=tuple,
)

if self.original_form is not None:
awkward = uproot.extras.awkward()
dask_awkward = uproot.extras.dask_awkward()
awkward = uproot.extras.awkward()

# The subform generated by awkward.to_buffers() has different form keys
# from those used to perform buffer projection. However, the subform
# structure should be identical to the projection optimisation
# subform, as they're derived from `branch.interpretation.awkward_form`
# Therefore, we can correlate the subform keys using `expected_from_buffers`
container = {}
for key, array in zip(keys, arrays):
# First, convert the sub-array into buffers
ttree_subform, length, ttree_container = awkward.to_buffers(array)

# Load the associated projection subform
projection_subform = self._form.content(key)

# Correlate each TTree form key with the projection form key
for (src, src_dtype), (dst, dst_dtype) in zip(
ttree_subform.expected_from_buffers().items(),
projection_subform.expected_from_buffers(self.buffer_key).items(),
):
assert src_dtype == dst_dtype # Sanity check!
container[dst] = ttree_container[src]

return container
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the dask-awkward buffer projection interface, this function populates a dictionary of buffer keys (that match our remapped form, and hence, our buffer-key scheme) with the corresponding buffers loaded from the TTree.

The caller will fill in the blanks for non-read buffers.

Comment on lines 864 to 868
class TrivialFormMapping(ImplementsFormMapping):
def __call__(self, form: Form) -> tuple[Form, TrivialFormMappingInfo]:
new_form = form_with_unique_keys(form, "<root>")
return new_form, TrivialFormMappingInfo(new_form)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ImplementsFormMapping interface just provides us with a way to couple some state with a remapped form, hence it is very thin.

Comment on lines 926 to 934
@property
def meta(self) -> AwkArray:
awkward = uproot.extras.awkward()
high_level_form, form_info = self.form_mapping(self.base_form)
return awkward.typetracer.typetracer_from_form(
high_level_form,
highlevel=True,
behavior=form_info.behavior,
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meta is separate to the column projection, but enables the from_map function to derive the meta from a single source of truth.

@agoose77 agoose77 requested review from jpivarski and ioanaif October 3, 2023 16:38
@agoose77
Copy link
Collaborator Author

agoose77 commented Oct 3, 2023

Hey team (@jpivarski and @ioanaif), would you mind casting your eyes over this PR and see whether anything stands out. Particular with respect to the abstractions being built here.

I chose to avoid branching, and instead have form remapping be the base case, and treat non-remapping as a trivial remapper.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you've completely refactored the uproot.dask implementation. That's good—it needed it—but there isn't an easy way to review this from looking at the code. We'll have to do it by extensive tests.

Getting #968 and #969 into main and then merging them into this PR will be a first step, so that we have a good baseline to start with. Then we can get all of the required pieces into the same environment.

I take it there isn't an order in which we can apply these changes in successive releases of awkward, dask-awkward, uproot, and coffea to not break things while releasing one new version of each at a time? Does it need to be all at once? If so, then we should plan the new version numbers simultaneously, so that we can make the new version of each package require (at minimum) the new version of each of the other packages, so that either the packaging system or the runtime-ImportError message specifies the versions needed.

Here's a first attempt at that:

  • awkward (root dependency): 2.4.5
  • dask-awkward (depends on awkward): 2023.10.0
  • uproot (runtime dependency on dask-awkward): 5.1.0 (a new minor version because it will be easier to keep track of than a large number in the third digit, like 5.0.13)
  • coffea (depends on awkward, uproot, and dask-awkward): 2023.10.0.rc0

When everything tests well in a common environment, we should release the new, mutually compatible versions of all four packages in a short window of time, and all of their minimum requirements on the other packages in the group should be these numbers at the time of release.

During the window, it will be possible for someone to get a bad combination because the currently latest versions of the dependent packages don't put an upper cap on their requirements for the others. I think the best way to deal with that is just to make that window short (pull off the band-aid quickly).

@agoose77
Copy link
Collaborator Author

agoose77 commented Oct 3, 2023

I take it there isn't an order in which we can apply these changes in successive releases of awkward, dask-awkward, uproot, and coffea to not break things while releasing one new version of each at a time?

We can

  1. release coffea and bump the minimum version of dask-awkward
  2. release dask-awkward and bump the minimum version of awkward
  3. release awkward and uproot.

That should minimise the pain, I hope, as pip shouldn't give users newer coffea or dask-awkward until these dependencies exist. That is, assuming that our various release mechanisms don't e.g. require these packages to be on PyPI already.

@agoose77 agoose77 force-pushed the agoose77/wip-shape-tracking branch from 3860ffb to f63ca09 Compare October 4, 2023 11:30
@agoose77
Copy link
Collaborator Author

agoose77 commented Oct 6, 2023

@jpivarski and @ioanaif I've bumped the awkward version to 2.4.5+, which is strict but necessary to ensure that uproot.dask works. My view is awkward 2 users should not have any upper caps at this stage, do you feel OK with this change? If so, let's merge!

@agoose77
Copy link
Collaborator Author

agoose77 commented Oct 8, 2023

@jpivarski could we make another uproot pre-release at some point?

@jpivarski
Copy link
Member

5.1.0rc2 is another pre-release: https://pypi.org/project/uproot/#history

@jpivarski jpivarski changed the base branch from main to main-v510 October 11, 2023 13:14
@jpivarski jpivarski merged commit 6abbdc3 into main-v510 Oct 11, 2023
@jpivarski jpivarski deleted the agoose77/wip-shape-tracking branch October 11, 2023 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants