
RFC: In-memory sparse array interchange #840

Open
hameerabbasi opened this issue Sep 6, 2024 · 26 comments
Labels
API extension Adds new functions or objects to the API. Needs Discussion Needs further discussion. RFC Request for comments. Feature requests and proposed changes. topic: DLPack DLPack.

Comments

@hameerabbasi
Contributor

hameerabbasi commented Sep 6, 2024

Motivation

The sparse (also called PyData/Sparse) development team has been working on integration efforts with the ecosystem, most notably with SciPy, scikit-learn and others, with CuPy, PyTorch, JAX and TensorFlow also on the radar. One of the challenges we have been facing is the lack of (possibly zero-copy) interchange between the different sparse array implementations. We believe this may be a pain point for many sparse array implementations moving forward.

This mirrors an issue seen previously for dense arrays, where the DLPack protocol was one of the first things to be standardised. We're hoping to achieve community consensus for a similar problem.

Luckily, sparse array formats (with the possible exception of DOK) are usually collections of dense arrays underneath. In addition, this problem has already been solved for on-disk arrays by the binsparse specification. @willow-ahrens is a co-author of that spec, and is also a collaborator on the sparse work.

Proposal

We propose introducing a new protocol method on array-API compliant sparse array objects (such as those in sparse), together with a new library-level function, both described below.

__binsparse__

Returns a 2-tuple (binsparse_descriptor, constituent_arrays).

The first item is a dict equivalent to a parsed JSON binsparse descriptor of an array.

The second item is a list of __dlpack__ compatible arrays, which are the constituent arrays of the sparse array.
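For illustration, a hypothetical return value for a small 2-D CSR array could look like the sketch below. The descriptor keys follow the binsparse spec, but treat the exact names and values as illustrative only, and the array variables as placeholders for the underlying dense arrays.

# Illustrative sketch only -- the exact keys are defined by the binsparse spec,
# and pointers_to_1 / indices_1 / values are placeholders for the dense arrays.
binsparse_descriptor = {
    "binsparse": {
        "version": "0.1",
        "format": "CSR",
        "shape": [4, 6],
        "number_of_stored_values": 7,
        "data_types": {
            "pointers_to_1": "uint64",
            "indices_1": "uint64",
            "values": "float64",
        },
    }
}
# The constituent arrays, in the order named by the descriptor; each supports __dlpack__.
constituent_arrays = [pointers_to_1, indices_1, values]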

Introduction of a from_binsparse function.

If a library supports sparse arrays, its from_binsparse function should accept (zero-copy when possible) objects that follow this __binsparse__ protocol and have an equivalent sparse format within the library.

Pseudocode implementation

Here's a pseudocode example using two libraries, xp1 and xp2, both supporting sparse arrays:

# In library code:
xp2_sparray = xp2.from_binsparse(xp1_sparray, ...)

# This pseudocode impl is common between `xp1` and `xp2`
def from_binsparse(x: object, /, *, device: device | None = None, copy: bool | None = None) -> array:
    binsparse_impl = getattr(x, "__binsparse__", None)
    if binsparse_impl is None:
        raise TypeError(...)
    
    binsparse_descriptor, constituent_arrays = binsparse_impl()
    # Will raise an error if the format/descriptor is unsupported.
    sparse_type = _type_from_binsparse_descriptor(binsparse_descriptor)
    # Some reordering may need to happen somewhere to match internal order of constituent arrays
    my_constituent_arrays = tuple(
        from_dlpack(arr, device=device, copy=copy) for arr in constituent_arrays
    )
    return sparse_type.from_strided_arrays(my_constituent_arrays, shape=...)

Parallel implementation in sparse: pydata/sparse#764

Alternative solutions

There are formats for on-disk sparse-array interchange [1] [2], but none for in-memory interchange; binsparse is the one that comes closest to offering it.

Pinging possibly interested parties:

Updated on 2024.10.09 as agreed in #840 (comment).

@kgryte kgryte added RFC Request for comments. Feature requests and proposed changes. API extension Adds new functions or objects to the API. topic: DLPack DLPack. Needs Discussion Needs further discussion. labels Sep 6, 2024
@data-apis data-apis deleted a comment Sep 6, 2024
@pearu

pearu commented Sep 6, 2024

A couple of quick notes.

First, dlpack and binsparse are self-contained specifications: dlpack provides a protocol for sharing strided arrays between different array libraries, and binsparse provides a unified description of various sparse formats. This proposal tries to glue these together by implicitly extending the DLPack protocol, but that contradicts the semantics of DLPack, which "describes the memory layout of dense, strided, n-dimensional arrays". In addition, the Python Array API standard specifies that DLPack should not support sparse arrays.

My suggestion is to cook up a sparse array interchange protocol that may use dlpack and binsparse (as these are obviously relevant pieces of this problem) but in a cleaner way.

For instance, binsparse specifies that the binsparse-compatible object provides a 2-tuple (<sparse array descriptor>, <list of sparse array data as strided arrays>). So, the appropriate place to deploy the dlpack protocol is to require that the arrays in the <list of sparse array data as strided arrays> support the dlpack protocol.

For instance 2, introduce a "Python specification of binsparse" (similar to the one in dlpack) that consists of:

  1. a from_binsparse function that accepts objects that implement the binsparse protocol per 2. below
  2. a __binsparse__ method on the objects representing sparse arrays that will return a 2-tuple (<sparse array descriptor>, <list of sparse array data as strided arrays>) which is used by the from_binsparse function of the consumer library.

The specification should also include the C-side interface.

What do you think?

@hameerabbasi
Contributor Author

While I'm on board with the proposed __binsparse__ protocol, I'm of the opinion that sparse arrays should have as few user-facing API differences as possible compared to strided arrays, i.e. using sparse arrays should be as frictionless from the Array API side as using strided arrays. Allocating a different function based on sparse and dense arrays (from_dlpack vs from_binsparse) seems counter to that.

I'm not opposed to a C-side specification; I'd welcome it so that the benefits can extend beyond the Python ecosystem. We do have to be mindful that the binsparse specification requires key-value maps, and we would need to maintain their relative order. Perhaps an iterable of 2-tuples makes more sense in this case.

@pearu

pearu commented Sep 7, 2024

While I'm on board with the proposed __binsparse__ protocol, I'm of the opinion that sparse arrays should have as few user-facing API differences as possible compared to strided arrays, i.e. using sparse arrays should be as frictionless from the Array API side as using strided arrays.

Absolutely. I presume that users of sparse arrays are aware of the advantages and disadvantages of using a particular sparse format over other sparse formats or strided arrays. Using a particular sparse format is a choice of optimization method that must be supported by the user-facing API.

Allocating a different function based on sparse and dense arrays (from_dlpack vs from_binsparse) seems counter to that.

from_dlpack is specifically defined for strided arrays, both by the Python Array API and dlpack. So, I find abusing the from_dlpack function for sparse arrays more confusing than providing a new API function which would follow the same design pattern that dlpack introduced: the pair (from_foo, __foo__) provides a minimal API for sharing data between provider and consumer libraries that support a protocol foo.

Btw, if one considers strided and sparse arrays semantically equivalent (as in PyTorch, for instance), a more intuitive approach would be to use asarray instead of from_dlpack as a common API function for constructing strided or sparse arrays from provider library objects. That said, the Python Array API does not support that, and existing implementations (PyTorch, JAX, ...) do not mix the functionalities of asarray and from_dlpack. I don't suggest doing that here either.

@hameerabbasi
Contributor Author

hameerabbasi commented Sep 9, 2024

Btw, if one considers strided and sparse arrays semantically equivalent (as in PyTorch, for instance), a more intuitive approach would be to use asarray instead of from_dlpack as a common API function for constructing strided or sparse arrays from provider library objects.

While I agree with a common API, the disadvantage I find in this approach is that from_dlpack requires the producer to pass ownership to the consumer, whereas asarray requires one to maintain ownership of the consumed array. This, in turn, means that there's no way to have a zero-copy transfer with asarray implemented via the from_dlpack function; i.e. one would have to do something like def asarray(x, ...): return from_dlpack(copy(x), ...) somewhere to maintain the ownership semantics, incurring a copy. See #840 (comment)

My intention was to have an API that essentially could support zero-copy interchange across libraries, regardless of whether the consumed array was strided, sparse or something else.

@pearu

pearu commented Sep 9, 2024

While I agree with a common API, the disadvantage I find in this approach is that from_dlpack requires the producer to pass ownership to the consumer...

This is incorrect. When using from_dlpack(x), the producer keeps owning the memory of x. See https://dmlc.github.io/dlpack/latest/python_spec.html#semantics.

@hameerabbasi
Contributor Author

While I agree with a common API, the disadvantage I find in this approach is that from_dlpack requires the producer to pass ownership to the consumer...

This is incorrect. When using from_dlpack(x), the producer keeps owning the memory of x. See https://dmlc.github.io/dlpack/latest/python_spec.html#semantics.

Ah, in that case, yes: asarray can use from_dlpack and a possible from_binsparse, and still be zero-copy. Thanks for the correction; asarray seems like the right format-agnostic API for this.

@hameerabbasi
Contributor Author

I've updated the issue with @pearu's feedback.

@kgryte kgryte added this to Proposals Sep 16, 2024
@kgryte kgryte moved this to Stage 0 in Proposals Sep 16, 2024
@kgryte kgryte moved this from Stage 0 to Stage 1 in Proposals Sep 16, 2024
@willow-ahrens

I'd like to additionally CC the binsparse contributors so they're aware of this effort.
@BenBrock
@eriknw
@ivirshup
@jim22k
@DrTimothyAldenDavis

@kgryte
Contributor

kgryte commented Sep 16, 2024

@hameerabbasi Are there any other alternative sparse interchange protocols that we should be considering, or is binsparse the only game in town? Given that this RFC proposes standardizing a particular approach, I'd be interested in at least covering our bases and having an understanding of other prior art.

@hameerabbasi
Contributor Author

hameerabbasi commented Sep 16, 2024

There are other on-disk/text formats: 1 2

Neither of these is designed for in-memory interchange of data, and neither supports multiple sparse formats. I'll add these to the issue.

@willow-ahrens

willow-ahrens commented Sep 18, 2024

There's also the FROSTT format, which stores tensors in a minimal text encoding and has the same limitations as the above.

http://frostt.io/

@rgommers
Member

rgommers commented Sep 19, 2024

The relevant issue on the binsparse repo for an in-memory representation is GraphBLAS/binsparse-specification#4. That also mentions that Arrow has support for 2-D CSR/CSC/CSF/COO (no n-D, no chunking).

There's of course a bit of a chicken-and-egg problem here: it's hard to standardize anything that doesn't exist yet, but library implementors want to move ahead and implement something that is likely to be standardized later to avoid fragmentation or bc-breaking changes.

So here is what I think should happen:

  • This issue should work out what the desired Python API will be for sparse in-memory interchange (the proposed analogue to DLPack with a dunder method like __binsparse__ sounds about right; whether more is needed a la __dlpack_device__ is TBD I think).
  • The actual binsparse in-memory spec should be worked out on GraphBLAS/binsparse-specification#4.
    • Seems more appropriate to discuss there rather than here, since that's the project that aims to define this, has more history, and (I assume) will have additional constraints that aren't relevant for this repo, like alignment between the on-disk and in-memory metadata.
    • DLPack seems like a more appropriate choice than Arrow from the perspective of array libraries to me, due to broader support of relevant devices and dtypes, and being already in the array API standard and supported by all relevant array libraries. I can imagine that binsparse could decide to support both though, in case there's a real-world need for Arrow support.
  • Libraries should then go ahead and implement that.
  • Once we have multiple implementations that seem to work well, we move the Python API worked out in this issue into the standard.

@pearu

pearu commented Sep 19, 2024

PyTorch has sparse formats that the binsparse specification does not specify as pre-defined formats. For instance, there are BSR/BSC (blocked CSR/CSC) formats, hybrid sparse formats (COO/CSR/CSC/BSR/BSC with values being strided tensors), sparse formats with batch dimensions (CSR/CSC/BSR/BSC with indices being multidimensional), and batch/hybrid sparse formats (combinations of hybrid and batched sparse formats).
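For concreteness, a rough sketch of the kinds of layouts meant here, using standard torch constructors (the shapes are chosen arbitrarily; this is illustrative, not exhaustive):

import torch

# Blocked CSR (BSR): values are (nnz_blocks, 2, 2) blocks rather than scalars.
bsr = torch.sparse_bsr_tensor(
    torch.tensor([0, 1, 2]),   # crow_indices (one block per block-row)
    torch.tensor([0, 1]),      # col_indices
    torch.rand(2, 2, 2),       # blocked values
    size=(4, 4),
)

# Hybrid COO: each stored element is itself a dense vector of length 3.
hybrid = torch.sparse_coo_tensor(
    torch.tensor([[0, 1], [2, 0]]),  # 2-D indices for the two sparse dimensions
    torch.rand(2, 3),                # non-scalar values -> one dense dimension
    size=(2, 3, 3),
)

# Batched CSR: a leading batch dimension, with per-batch index tensors.
batched = torch.sparse_csr_tensor(
    torch.tensor([[0, 1, 2], [0, 1, 2]]),  # crow_indices per batch
    torch.tensor([[0, 1], [1, 2]]),        # col_indices per batch
    torch.rand(2, 2),                      # values per batch
    size=(2, 2, 3),
)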

So, for PyTorch, the usefulness of an in-memory sparse array interchange format based on the binsparse specification will depend on how easy it is to define these blocked/hybrid/batch sparse formats using binsparse custom format support, or better yet, on adding blocked/hybrid/batch sparse formats as pre-defined formats to the binsparse specification.

@pearu

pearu commented Sep 19, 2024

CC: @amjames (for torch.sparse)

@hameerabbasi
Contributor Author

hameerabbasi commented Sep 19, 2024

So, for PyTorch, the usefulness of an in-memory sparse array interchange format based on the binsparse specification will depend on how easy it is to define these blocked/hybrid/batch sparse formats using binsparse custom format support, or better yet, on adding blocked/hybrid/batch sparse formats as pre-defined formats to the binsparse specification.

Since __binsparse__ is meant to be mainly a cross-library format, its usefulness is also limited by whether other implementations support blocked/hybrid/batch formats. If PyTorch is the only implementation of these formats, then adding them to binsparse makes little sense, as there won't be another library to do the exchange with.

Incidentally, I do agree with the notion of blocked formats being supported. They are supported in the MLIR sparse_tensor dialect, and in SciPy as well. The cleanest way to support these seems to be an index map: a mapping from the 4-D virtual indices to the 2-D indices using only the operators +, -, *, //, %, although index maps aren't supported yet. This also leads to supporting the diagonal format. I'll raise an issue on the binsparse repo and link it here.

xref GraphBLAS/binsparse-specification#56
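To make the index-map idea concrete, here is a rough sketch (not part of binsparse today) of how a BSR-like layout with block shape (p, q) could map between the stored 4-D index and the logical 2-D index using only those operators:

# Sketch only: index maps for a BSR-like layout with block shape (p, q).
def virtual_to_logical(I, J, ii, jj, p, q):
    # 4-D stored index (block-row, block-col, in-block row, in-block col) -> 2-D logical index
    return I * p + ii, J * q + jj

def logical_to_virtual(i, j, p, q):
    # 2-D logical index -> 4-D stored index, using // and %
    return i // p, j // q, i % p, j % q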

@amjames

amjames commented Sep 23, 2024

PyTorch has sparse formats that the binsparse specification does not specify as pre-defined formats. For instance, there are BSR/BSC (blocked CSR/CSC) formats, hybrid sparse formats (COO/CSR/CSC/BSR/BSC with values being strided tensors), sparse formats with batch dimensions (CSR/CSC/BSR/BSC with indices being multidimensional), and batch/hybrid sparse formats (combinations of hybrid and batched sparse formats).

I want to say that I don't think there is a huge amount of value in specifically naming and tabulating different pre-defined formats like this. It is vastly more important that the standard be able to support generic layout definitions which have these features: non-scalar values (hybrid), non-scalar index lookup (blocked), and various dense/sparse levels.

Looking at the binsparse spec, it seems pretty close, with the ability to define custom formats, and I have some comments to add to the discussion there.

In general I support the idea of a sparse interchange protocol based on some kind of standard like binsparse; however, I do not think it is generic enough in its current form. That is a separate issue from the proposal here, though.

@hameerabbasi
Contributor Author

I have two things to add here:

  1. The GraphBLAS folks are now explicitly considering in-memory use-cases when designing the binsparse protocol; in particular the reshaping-transposition proposal is being considered, but it is as-yet unclear what form that will take.
  2. I feel that __binsparse__ and __binsparse_format__ should be separate methods; the reasoning behind this is exactly the same as __dlpack__ and __dlpack_device__:
    a. One might need to query the format before getting the constituent arrays out, as getting the arrays out could be an expensive operation for some cases.
    b. There are many parallels in design for needing to query the format (device) independent of the actual arrays in __dlpack__, such as erroring out early and requesting a specific format.

Please let me know what folks think of 2, especially @pearu who proposed the original move to a 1-method design.

@DrTimothyAldenDavis

This is an extremely important issue:

"One might need to query the format before getting the constituent arrays out, as getting the arrays out could be an expensive operation for some cases."

@pearu

pearu commented Dec 3, 2024

Recall that the __dlpack__ method returns a capsule that holds a structure containing all the information about the array: shape, dtype, strides, pointer to storage, etc. This makes previous array interface protocols (__array_interface__, __cuda_array_interface__), which expose the array info explicitly, unnecessary.

I'd consider __binsparse_format__ a protocol similar to __array_interface__ that exposes various information about the sparse format: shape information (sparse and dense dimensions), dtype, format name, and format-specific data that supports the dlpack protocol.

Assuming that the __binsparse__ protocol implementation will be similar to __dlpack__ protocol implementations, that is, that the __binsparse__ method returns a capsule holding a structure that contains information about the sparse array, __binsparse_format__ is unnecessary: the query about the sparse array format can be implemented as a query to the binsparse capsule object, similar to how https://github.com/pearu/pydlpack/ implements a query to a dlpack capsule, without the need to access the storage of data.

@hameerabbasi
Contributor Author

the query about the sparse array format can be implemented as a query to the binsparse capsule object

Right, this assumes an O(1) conversion to a binsparse capsule (or equivalent) is always possible. However, this may not be the case: Libraries may support more formats than are supported by the binsparse protocol, and may need to perform an additional conversion.

@pearu

pearu commented Dec 3, 2024

Libraries may support more formats than are supported by the binsparse protocol, and may need to perform an additional conversion.

I am not sure how __binsparse_format__ will be helpful here. If a library implements a sparse format that the binsparse protocol does not support, then the binsparse protocol will be useless anyway: the binsparse protocol cannot implement conversion for a format that it does not support; any necessary conversion should be done by the provider library.
If you disagree, please provide more details about the __binsparse_format__ specification, as I may be missing some points here.

@hameerabbasi
Contributor Author

hameerabbasi commented Dec 3, 2024

any necessary conversion should be done by the provider library

I think you've hit the nail on the head with this part -- a conversion might be necessary, which may make the cost of constructing a capsule O(n). Therefore, we're left with two options if we want to guarantee an O(1) conversion (both require an .asformat method which can accept binsparse descriptors):

  • Raise on unsupported formats. The consumer must choose whether to call .asformat on error.
  • Allow the consumer to query whether it supports the format via __binsparse_format__ and then raise or call .asformat as necessary (sketched below).
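A pseudocode sketch of the second option, in the same style as the example in the issue description; SUPPORTED_FORMATS, .asformat and the helper functions are hypothetical names, not part of any spec:

# Hypothetical consumer-side handshake; the helpers and names are illustrative only.
def from_binsparse(x, /, *, device=None, copy=None):
    descriptor = x.__binsparse_format__()  # O(1): metadata only, no constituent arrays
    if descriptor["binsparse"]["format"] not in SUPPORTED_FORMATS:
        # Either raise here, or ask the producer to convert to a format we understand.
        x = x.asformat(_closest_supported_descriptor(descriptor))
    descriptor, constituent_arrays = x.__binsparse__()  # possibly expensive
    return _build_from_binsparse(descriptor, constituent_arrays, device=device, copy=copy)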

@pearu

pearu commented Dec 3, 2024

An exception with a helpful message is better than a silent expensive conversion, imho.

@rgommers
Member

rgommers commented Dec 4, 2024

The two-method approach seems clearly preferred, indeed for the same reason as __dlpack_device__: it allows the consumer to introspect metadata for compatibility and "hand-shaking" (the consumer can change its request when it supports multiple devices/formats/etc.).

@pearu the analogy with __array_interface__ is not a good one. That method is numpy-specific, and the actual data is hence always already in memory and always in a single format (strided array on CPU). A __binsparse__ call, just like a __dlpack__ call, requires the data to be in memory in a supported format, so, for example, for lazily evaluated or custom formats not part of the protocol, a call to __binsparse__ may indeed require an expensive operation (and no, always raising an exception instead is definitely not the correct answer; there is always a need to do the expensive thing if no cheap thing is possible).

If anything, we've so far found a few corner cases where it would be helpful for more parts of the DLPack protocol to be introspectable separately from the actual "give me a capsule with a pointer to data in memory".

@pearu

pearu commented Dec 4, 2024

A binsparse call, just like a dlpack call, requires the data to be in memory in a supported format

Sparse tensors of any format can be modeled as a pair

(shape-dtype-format specs, (<a list of dlpack-compatible objects that hold the indices data and values>))

(this is how we think about sparse tensors in PyTorch, for instance). Notice that the __binsparse__ structure does not (need to) hold raw pointers to memory, unlike the __dlpack__ structure, which does. Hence, the __binsparse__ structure would be lazy by definition under this mental model.

Lazy access to sparse tensors is about accessing shape-dtype-format specs without accessing the data of indices and values. Notice that the __dlpack__ capsule that holds pointers to raw memory is constructed when the __dlpack__ method is called (this is the point where data needs to be materialized in memory, not earlier). So, lazy access to a sparse tensor's shape-dtype-format data is possible if one does not trigger the creation of __dlpack__ capsules (read: materialize data in memory) earlier than necessary.

In any case, the laziness property should be provided at the DLPack protocol level, not in the binsparse protocol, which just uses the dlpack protocol for accessing sparse tensor data.
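A minimal, self-contained sketch of this "lazy pair" mental model (hypothetical names, no spec implied): the descriptor is plain metadata, and the DLPack-capable arrays are only produced when the consumer actually asks for them.

from dataclasses import dataclass
from typing import Any, Callable, Sequence

# Hypothetical illustration of the "lazy pair" model described above.
@dataclass
class LazySparsePair:
    descriptor: dict                           # shape / dtype / format specs only
    materialize: Callable[[], Sequence[Any]]   # produces the index/value arrays on demand

    def format(self) -> dict:
        # O(1): only metadata is returned, no data is touched.
        return self.descriptor

    def arrays(self) -> Sequence[Any]:
        # Potentially expensive: only here are __dlpack__-capable arrays produced.
        return self.materialize()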

@rgommers
Member

rgommers commented Dec 4, 2024

Lazy access to sparse tensors is about accessing shape-dtype-format specs without accessing the data of indices and values.

This is the entire point of why two separate methods have to exist. Your answer agrees that there should be this separation; you're just moving the boundary between accessing metadata and data to a place where it doesn't exist (__dlpack__). Coupling the introduction of a binsparse protocol to not-yet-proposed and probably infeasible changes to __dlpack__ isn't a good idea. Separation of concerns is nice :)

Please either go with the 2-method approach for binsparse itself, or propose what a 1-method __binsparse__ should return to keep things lazy without any changes to DLPack.
