nested-pandas

An extension of pandas for efficient representation of nested associated datasets.

Nested-Pandas extends pandas with tooling and support for nested dataframes packed into the values of top-level dataframe columns. PyArrow is used internally to aid scalability and performance.

Nested-Pandas allows data like this:

[Figure: pandas dataframes]

To instead be represented like this:

[Figure: nestedframe]

Here the nested data is represented as nested dataframes:

   # Each row of "object_nf" now has its own sub-dataframe of matched rows from "source_df"
   object_nf.loc[0]["nested_sources"]

[Figure: sub-dataframe]

This allows powerful and straightforward operations, such as:

   # Compute the mean flux for each row of "object_nf"
   import numpy as np
   object_nf.reduce(np.mean, "nested_sources.flux")

[Figure: using reduce]
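
To make the idea concrete without relying on the library itself, the following is a minimal plain-pandas sketch that emulates the nested layout and the per-row reduce shown above. It is a conceptual illustration only, not how Nested-Pandas stores or processes data internally; the table contents and the names `nested_sources` and `mean_flux` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical flat source table: several measurements per object.
source_df = pd.DataFrame(
    {
        "object_id": [0, 0, 1, 1, 1, 2],
        "time": [1.0, 2.0, 1.0, 2.0, 3.0, 1.0],
        "flux": [1.0, 3.0, 2.0, 4.0, 6.0, 5.0],
    }
)

# Emulate nesting: one sub-dataframe of matched sources per object.
# (A plain dict stands in for the nested column; this is an assumption
# made for illustration, not the library's storage format.)
nested_sources = {
    object_id: group.drop(columns="object_id").reset_index(drop=True)
    for object_id, group in source_df.groupby("object_id")
}

# Emulate a per-row reduce: apply a function to one nested column per object.
mean_flux = pd.Series(
    {
        object_id: np.mean(sub_df["flux"])
        for object_id, sub_df in nested_sources.items()
    },
    name="mean_flux",
)

print(nested_sources[0])
print(mean_flux)
```

Each entry of `nested_sources` plays the role of one row's sub-dataframe, and `mean_flux` holds one value per object, which is the shape of result the `reduce` call above is described as producing.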

Nested-Pandas is motivated by time-domain astronomy use cases, where data typically come in two levels: information about astronomical objects, and an associated set of N measurements of each object. Nested-Pandas offers a performant and memory-efficient package for working with these types of datasets.

Its core advantages are:

  • hierarchical column access
  • efficient packing of nested information into inputs to custom user functions
  • avoiding costly groupby operations
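
For context on the last two points, here is a rough sketch of the flat-table workflow that a nested representation is meant to replace, written with plain pandas only. The tables, the column names, and the `light_curve_features` function are hypothetical; the point is that every per-object computation on flat data needs its own groupby pass followed by a join back to the object table.

```python
import pandas as pd

# Hypothetical object table: one row per astronomical object.
object_df = pd.DataFrame(
    {"ra": [10.0, 20.0, 30.0]},
    index=pd.Index([0, 1, 2], name="object_id"),
)

# Hypothetical source table: N measurements per object, keyed by object_id.
source_df = pd.DataFrame(
    {
        "object_id": [0, 0, 1, 1, 1, 2],
        "time": [1.0, 2.0, 1.0, 2.0, 3.0, 1.0],
        "flux": [1.0, 3.0, 2.0, 4.0, 6.0, 5.0],
    }
)


def light_curve_features(sources: pd.DataFrame) -> pd.Series:
    """Hypothetical user function that needs two source columns at once."""
    return pd.Series(
        {
            "duration": sources["time"].max() - sources["time"].min(),
            "amplitude": sources["flux"].max() - sources["flux"].min(),
        }
    )


# Flat workflow: regroup the sources by object, apply the function per group,
# then join the per-object results back onto the object table.
features = source_df.groupby("object_id")[["time", "flux"]].apply(light_curve_features)
result = object_df.join(features)

print(result)
```

With the sources already packed per object, the grouping step does not have to be repeated for each new function, which is the saving the list above refers to.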

This is a LINCC Frameworks project - find more information about LINCC Frameworks here.

Acknowledgements

This project is supported by Schmidt Sciences.