RFC: In-memory sparse array interchange #840
Comments
A couple of quick notes. First, dlpack and binsparse are self-contained specifications: dlpack provides a protocol for sharing strided arrays between different array libraries, and binsparse provides a unified description of various sparse formats. This proposal tries to glue these together by implicitly extending the DLPack protocol, but that contradicts the semantics of DLPack, which "describes the memory layout of dense, strided, n-dimensional arrays". In addition, the Python Array API standard specifies that DLPack should not support sparse arrays. My suggestion is to cook up a sparse array interchange protocol that may use dlpack and binsparse (as these are obviously relevant pieces of this problem) but in a cleaner way. For instance, binsparse specifies that the binsparse-compatible object provides a 2-tuple For instance 2, introduce "Python specification of binsparse" (similar to one in dlpack) that consists of
Also, the specification should include the C-side interface as well. What do you think? |
While I'm on board with the proposed I'm not opposed to a C-side specification, I'd welcome that so that the benefits can extend beyond the Python ecosystem. We do have to be mindful that the binsparse specification requires key-value maps, and we would need to maintain their relative order. Perhaps an iterable of 2-tuples makes more sense in this case. |
Absolutely. I presume that users of sparse arrays are aware of advantages and disadvantages of using a particular sparse format over other sparse formats or strided arrays. Using a particular sparse format is a choice of optimization method that must be supported by user-facing API.
Btw, if one considers strided and sparse arrays semantically equivalent (as in PyTorch, for instance), a more intuitive approach would be to use |
My intention was to have an API that essentially could support zero-copy interchange across libraries, regardless of whether the consumed array was strided, sparse or something else. |
This is incorrect. When using |
Ah, in that case; yes, |
I've updated the issue with @pearu's feedback. |
I'd like to additionally CC the binsparse contributors so they're aware of this effort. |
@hameerabbasi Are there any other alternative sparse interchange protocols that we should be considering, or is |
There's also the FROSTT format, which stores tensors in a minimal text encoding and has the same limitations as the above. |
The relevant issue on the There's of course a bit of a chicken-and-egg problem here: it's hard to standardize anything that doesn't exist yet, but library implementors want to move ahead and implement something that is likely to be standardized later to avoid fragmentation or bc-breaking changes. So here is what I think should happen:
|
PyTorch has sparse formats that the binsparse specification does not include among its pre-defined formats. For instance, there are BSR/BSC (blocked CSR/CSC) formats, hybrid sparse formats (COO/CSR/CSC/BSR/BSC with values being strided tensors), sparse formats with batch dimensions (CSR/CSC/BSR/BSC with indices being multidimensional), and batch/hybrid sparse formats (combinations of hybrid and batched sparse formats). So, for PyTorch, the usefulness of an in-memory sparse array interchange format based on the binsparse specification will depend on how easy it is to define these blocked/hybrid/batch sparse formats using binsparse custom format support, or better yet, on adding blocked/hybrid/batch sparse formats as pre-defined formats to the binsparse specification. |
CC: @amjames (for torch.sparse) |
Since Incidentally, I do agree with the notion of blocked formats being supported. They are supported in the MLIR |
I want to say that I don't think there is a huge amount of value in specifically naming and tabulating different pre-defined formats like this. It is vastly more important that the standard be able to support generic layout definitions which have these features: non-scalar values (hybrid), non-scalar index lookup(blocked), and various dense/sparse levels. Looking at the I in general support the idea of a sparse interchange protocol based on some kind of standard like |
I have two things to add here:
Please let me know what folks think of 2, especially @pearu who proposed the original move to a 1-method design. |
This is an extremely important issue: "One might need to query the format before getting the constituent arrays out, as getting the arrays out could be an expensive operation for some cases." |
Recall, I'd consider Assuming that |
Right, this assumes an O(1) conversion to a binsparse capsule (or equivalent) is always possible. However, this may not be the case: Libraries may support more formats than are supported by the binsparse protocol, and may need to perform an additional conversion. |
I am not sure how |
I think you've hit the nail on the head with this part -- a conversion might be necessary, which may make the cost of constructing a capsule O(n), and therefore we're left with two options if we want to guarantee an O(1) conversion (both require a
|
An exception with a helpful exception message is better than a silent expensive conversion, imho. |
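The trade-off discussed in the last few comments (query the format cheaply, and fail loudly rather than convert silently) can be sketched as below. This is only an illustration under stated assumptions: `ToyProducer`, its `EXPORTABLE` set, and the `__binsparse_descriptor__` method name are hypothetical and are not taken from the binsparse specification or from any existing library.

```python
class ToyProducer:
    """Hypothetical producer whose native format may lack a binsparse equivalent."""

    # Formats this toy library can hand out without converting (an assumption).
    EXPORTABLE = {"COO", "CSR", "CSC"}

    def __init__(self, fmt, arrays):
        self.fmt = fmt
        self.arrays = list(arrays)

    def __binsparse_descriptor__(self):
        # Metadata only: O(1), never touches or converts the stored data.
        return {"binsparse": {"format": self.fmt}}

    def __binsparse__(self):
        if self.fmt not in self.EXPORTABLE:
            # Fail loudly instead of silently performing an O(n) conversion.
            raise BufferError(
                f"format {self.fmt!r} has no zero-copy binsparse export; "
                "convert explicitly first"
            )
        return self.__binsparse_descriptor__(), self.arrays
```

With this split, a consumer can inspect the descriptor first and decide whether to request the data, ask the producer for an explicit conversion, or give up, without ever triggering a hidden copy.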
The two-method format seems clearly preferred, indeed for the same reason as @pearu the analogy with If anything, we've so far found a few corner cases where it would be helpful for more parts of the DLPack protocol to be introspectable separately from the actual "give me a capsule with a pointer to data in memory". |
Sparse tensors of any format can be modeled as a pair
(this is how we think about sparse tensors in PyTorch, for instance). Notice that Lazy access to sparse tensors is about accessing In any case, the laziness property should be provided at the DLPack protocol level, not in the binsparse protocol, which just uses the DLPack protocol for accessing sparse tensor data. |
This is the entire point of why two separate methods have to exist. Your answer agrees that there should be this separation, you're just moving the separation between accessing metadata and data to a place it doesn't exist ( Please either go with the 2-method approach for binsparse itself instead, or propose a way of what a 1-method |
Motivation
The sparse (also called PyData/Sparse) development team have been working on integration efforts with the ecosystem, most notably with SciPy, scikit-learn and others, with CuPy, PyTorch, JAX and TensorFlow also on the radar. One of the challenges we were facing was the lack of (possibly zero-copy) interchange between the different sparse array implementations. We believe this may be a pain point for many sparse array implementations moving forward.
This mirrors an issue seen for dense arrays previously, where the DLPack protocol was one of the first things to be standardised. We're hoping to achieve community consensus for a similar problem.
Luckily, all sparse array formats (with the possible exception of DOK) are usually collections of dense arrays underneath. In addition, this problem has been solved for on-disk arrays before by the binsparse specification. @willow-ahrens is a co-author of that spec, and is also a collaborator for the sparse work.
Proposal
We propose introducing two new methods to the array-API compliant sparse array objects (such as those in sparse), which are described below.
__binsparse__
Returns a 2-tuple (binsparse_descriptor, constituent_arrays).
The first item is a dict equivalent to a parsed JSON binsparse descriptor of an array.
The second item is a list of __dlpack__ compatible arrays, which are the constituent arrays of the sparse array.
Introduction of from_binsparse function.
If a library supports sparse arrays, its from_binsparse method should accept (zero-copy when possible) objects that follow this __binsparse__ protocol and have an equivalent sparse format within the library.
Pseudocode implementation
Here's a pseudocode example using two libraries, xp1 and xp2, both supporting sparse arrays:
Parallel implementation in sparse: pydata/sparse#764
Alternative solutions
There are formats for on-disk sparse-array interchange [1] [2]; but none for in-memory interchange. binsparse is the one that comes closest to offering in-memory interchange.
Pinging possibly interested parties:
(scipy.sparse)
(binsparse and finch-tensor/sparse)
(cupyx.sparse)
(torch.sparse)
Updated on 2024.10.09 as agreed in #840 (comment).
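As a rough illustration of the proposal above, the round trip between a producer's __binsparse__ and a consumer's from_binsparse might look like the following minimal sketch. It uses NumPy arrays as the `__dlpack__`-compatible constituents. The `ToyCOO` class is hypothetical, and the exact descriptor keys shown are an assumption made for illustration, not a statement of what the binsparse specification requires.

```python
import numpy as np


class ToyCOO:
    """Hypothetical producer: a 2-D COO array stored as three dense NumPy arrays."""

    def __init__(self, rows, cols, values, shape):
        self.rows = np.asarray(rows, dtype=np.int64)
        self.cols = np.asarray(cols, dtype=np.int64)
        self.values = np.asarray(values, dtype=np.float64)
        self.shape = tuple(shape)

    def __binsparse__(self):
        # First item: a dict equivalent to a parsed JSON binsparse descriptor.
        # The keys below are assumed for illustration only.
        descriptor = {
            "binsparse": {
                "version": "0.1",
                "format": "COO",
                "shape": list(self.shape),
                "number_of_stored_values": int(self.values.size),
                "data_types": {
                    "indices_0": "int64",
                    "indices_1": "int64",
                    "values": "float64",
                },
            }
        }
        # Second item: constituent arrays, each exposing __dlpack__.
        return descriptor, [self.rows, self.cols, self.values]


def from_binsparse(obj):
    """Hypothetical consumer: rebuild a ToyCOO by wrapping the buffers via DLPack."""
    descriptor, arrays = obj.__binsparse__()
    meta = descriptor["binsparse"]
    if meta["format"] != "COO":
        raise TypeError(f"unsupported binsparse format: {meta['format']!r}")
    # np.from_dlpack wraps the producer's memory instead of copying it.
    rows, cols, values = (np.from_dlpack(a) for a in arrays)
    return ToyCOO(rows, cols, values, meta["shape"])
```

In this sketch the consumer checks the descriptor before touching the data and then imports each constituent array through DLPack, so the rebuilt object shares memory with the original. A real consumer would dispatch on more formats and validate the descriptor more thoroughly.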