Mixed precision functionality #778
I think qfunction variation is the hardest part. Do we need to support different arguments to the qfunction having different precisions or can they be homogeneous (all arguments in matching precision)? Let's defer this part for later. As a first step, I would add functions like We can have a macro We then add a I kinda think We can have |
Just to make sure I'm clear: are you referring to the user-provided per-point QFunction, the libCEED QFunction interface, or both? Regarding |
We could make CeedElemRestriction have no opinion about types, in which case we would convert as needed. Then L-vectors, CeedBasis, and CeedQFunction would have a designated precision. The precision of E-vectors would be inferred to match the compute precision of CeedBasis (converting as needed to/from the L-vector precision). A simplification would be to make To make this incremental, I think it will be simpler to start with the L-vectors and get that merged, then work inward. |
I'm not sure I understand what your list is intended to represent. Cases where we use I like the idea of the element restriction being able to infer the output (or input, in transpose mode) type, but I'm not sure how it would work in practice. The |
int CeedElemRestrictionApply(CeedElemRestriction rstr, CeedTransposeMode t_mode,
                             CeedVector u, CeedVector ru, CeedRequest *request);
The |
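To make the "infer the type from the vector arguments" idea concrete, here is a minimal sketch in plain C (not the libCEED interface); every name in it is illustrative. A precision-agnostic restriction would read each vector's precision tag rather than stored state, and convert on the fly while gathering L-vector entries into the E-vector:
/* Illustrative only: dispatch on per-vector precision tags during the gather. */
typedef enum { PREC_SINGLE, PREC_DOUBLE } PrecSketch;

static void restriction_gather_sketch(long n_elem_nodes, const long *indices,
                                      PrecSketch in_prec, const void *l_vec,
                                      PrecSketch out_prec, void *e_vec) {
  for (long i = 0; i < n_elem_nodes; i++) {
    /* read in the input precision, using double as the common carrier */
    double val = (in_prec == PREC_DOUBLE) ? ((const double *)l_vec)[indices[i]]
                                          : (double)((const float *)l_vec)[indices[i]];
    /* write in the output precision */
    if (out_prec == PREC_DOUBLE) ((double *)e_vec)[i] = val;
    else                         ((float *)e_vec)[i]  = (float)val;
  }
}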
Right, I guess this raises another important question: how much do we want to limit the multiprecision functionality to things that happen inside a Though even if we are only considering the case where the Operator is setting the preferred element restriction output precision based on all the information it has, the function that lets you set the element restriction precision might still make more sense than relying on the input/output |
I figured I'd think cuda-gen could still propagate the desired (ephemeral) E-vector precisions from the Basis. One thing I'd like to facilitate is having a residual in double precision and Jacobian action in single, or the Jacobian in double and a preconditioner in single. |
Yes, it could, I just thought it might be preferable to have more consistency across the CUDA/HIP backends, such that they could all get/set this information in the same way. But cuda(hip)-gen is so different from the other backends already that perhaps that's not really a selling point. |
Do you think it'll be hard to handle consistently if CeedElemRestriction is precision-agnostic (uses its arguments rather than set as state)? My concern is that when we wire up several different operators, we'd like for the data (integers) of a |
I've been thinking about this, and I haven't yet come up with any problems (not that that means there couldn't be any, of course). I was initially concerned about |
Yeah, I don't see harm in a separate |
Okay, I'm trying to get a clear picture here of what a " Some potential "general principles":
Internally in the library, in backends, etc. we of course use these For the "sub"-operators, we could potentially treat them differently:
Does this follow more or less what you were thinking? |
Yeah, that's about right. I think I think the implementation becomes much simpler if Basis and QFunction always see homogeneous types. This saves needing to convert within those compute kernels and limits the number of variants that need to be written/maintained/compiled. |
I'm not sure I follow. Wouldn't this mean we have to have three interfaces for
By homogeneous types, do you mean just that input/output vectors are all the same type, or that they also match the precision of the computation? |
For vectors, I mean that we can make By homogeneous types, I mean that the input and output vectors would all have the same precision. A QFunction is free to convert internally (promoting or demoting) as it likes, though the cost of doing so may be nontrivial. |
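As an illustration of the "convert internally" option mentioned above, a per-point function could keep homogeneous float inputs and outputs but promote to double for the arithmetic. This is a generic C sketch with made-up names, not the libCEED QFunction signature:
/* Sketch: inputs/outputs are homogeneous (float), arithmetic is carried out in
 * double and demoted only on the final store. */
static void qf_mass_promote_sketch(int Q, const float *qdata, const float *u, float *v) {
  for (int i = 0; i < Q; i++) {
    double w  = (double)qdata[i];  /* promote locally */
    double ui = (double)u[i];
    v[i] = (float)(w * ui);        /* demote on write */
  }
}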
But we'd still have to have separate interfaces, correct? And a way for the user to specify at compile time whether to compile the C11 or C++ interface? And I'm still not sure how that would work within the library itself (other library functions that call |
We'd need to implement both the
#define CeedVectorGetArray(ceed, mtype, x) _Generic((x), \
  double **: CeedVectorGetArray_f64, \
  float **: CeedVectorGetArray_f32 \
)(ceed, mtype, x)
and we'd provide inline overloads for C++. Whether to provide these within It's true that we can't call these within the library (so long as we don't want a hard dependency on C11), but I think it's okay because we can be type-agnostic except for the numerical kernels, which need to specialize anyway. |
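For concreteness, the matching C++ side could be inline overloads that resolve on the pointer type at compile time. This sketch assumes the same CeedVectorGetArray_f64/CeedVectorGetArray_f32 variants named in the macro above and that the libCEED types are already declared; parameter names are illustrative:
#ifdef __cplusplus
// Overload resolution plays the role of _Generic: the array pointer type
// selects the typed variant. Assumes CeedVector, CeedMemType, and the
// _f64/_f32 variants are declared (e.g., via ceed.h).
static inline int CeedVectorGetArray(CeedVector vec, CeedMemType mtype, double **array) {
  return CeedVectorGetArray_f64(vec, mtype, array);
}
static inline int CeedVectorGetArray(CeedVector vec, CeedMemType mtype, float **array) {
  return CeedVectorGetArray_f32(vec, mtype, array);
}
#endif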
Oh! I think "inline" was the crucial thing I was missing for C++. Otherwise, I couldn't figure out how we could use |
This will make things more complicated, but I've been thinking a lot more about where the precision conversion should take place and whether or not we should limit the input/output types of the basis to match -- strictly from the GPU point of view (this is all different from CPU, as mentioned above). This all mostly applies to the non-fused backends. First, some ground rules: we'll assume we're memory bound, since we want to optimize performance for the tensor case. Thus, our main goal is reducing the high precision read/write actions to main memory, which will matter more than the lower-precision computation in terms of speedup. We will also consider two main cases of mixed-precision computation:
For clarity, we'll always use double and float to represent the high and low precisions. Let's consider the two types of input vectors to a
However, this won't necessarily be the best choice in at least two cases: one, when the input data can change between applications of the operator. Then, with every application, we pay the cost to read from double, convert to float, write to float. When it's time to use the data -- likely in the QFunction -- we read again in float. Whereas, if we didn't officially convert the "whole" vector, but had the QFunction kernel itself convert local data to float, then we only pay one double read cost at the time of QFunction application. (Again, this only applies if we can't just use the vector we already converted to float, because the data changed.) The second case is the "low-high" case, where the input to the QFunction will be float, but we want to convert it locally to double for the QFunction computation. If the This is assuming that we are still allowed to have different computation and input/output precisions for the QFunction, even if we specify that all inputs and outputs should be the same precision.
Let's consider the case where everything will be in double precision, except the QFunction. The output of the basis kernel for the active input will be an input for the QFunction, so it needs to be in float. If we apply the basis kernel as usual, call a conversion kernel/function (through the libCEED API), then call the QFunction kernel, we have a write in double + read in double/write in float + read in float. If the QFunction handles the conversion locally into workspace data, we still have write in double + read in float. If, however, the internal Q-vector created for the output of this input's E-vector already has its data allocated for float, and the basis kernel performs the write to float at its last step (converting from local double workspace that was used in the computation), then we only have write in float + read in float. Of course, if we want the basis and QFunction to be done in single precision, then we'd want the write to float happening on the output of the element restriction. Also consider the "low-high" case: here, we never want double precision This setup comes with several drawbacks, like the proliferation of kernels (compute precision + input precision + output precision) and needing extra local workspace memory. It also means a lot of code changes in every backend. In the MAGMA backend, we have tensor basis kernels with a lot of template parameters, so we're building a lot of kernels already... However, I'm not sure how we can see much gain on GPUs without letting the kernels handle the conversion locally, which requires doing it in the kernels themselves, unless I am missing something. All of this raises the question, I suppose, on how much this functionality matters for the non-fused GPU backends, since it's all considered in terms of the tensor basis case, where one would generally want to use cuda-gen, if possible. |
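A sketch of the "convert at the basis kernel's last write" idea above, written as plain C rather than a GPU kernel; B is assumed to be a row-major Q x P interpolation matrix and all names are illustrative:
/* Compute in a double workspace, but store the Q-vector directly as float so
 * it is only ever written and re-read in the lower precision. */
static void basis_interp_store_float_sketch(int P, int Q, const double *B /* Q x P, row-major */,
                                            const double *e_vec /* length P */,
                                            float *q_vec /* length Q */) {
  for (int q = 0; q < Q; q++) {
    double sum = 0.0;  /* accumulate in double registers/workspace */
    for (int p = 0; p < P; p++) sum += B[q * P + p] * e_vec[p];
    q_vec[q] = (float)sum;  /* single write, already in float */
  }
}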
Thanks for these detailed thoughts. My opinion is that conversion should happen in restriction and that when a vector (active or passive) does not match the compute precision, the restriction (or transpose) should be explicit even if it's otherwise the identity. We could have an interface to make a I think low-high is most important for speed with simple qfunctions. For very expensive qfunctions, there may be a benefit to lower precision (but numerical stability is usually not well thought out for very expensive qfunctions). Also, global vectors of If we were to decouple Basis and QFunction precision as a libCEED feature, we'd want to be able to do it automatically, not by forcing the conversion into the user's QFunction. But for the specific experiments I'd like to run to test/demonstrate benefits of stable numerics in QFunctions, I'm happy to manage the code that converts inside the QFunction. (That is, the QFunction author is entirely responsible for conversions, and libCEED doesn't need to do anything.) I think the ability to convert in Restriction is tractable without code/template explosions and would open up significant research and production options. So I would accept that scope and only consider further extensions if/when a compelling case is made for them. |
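As a concrete reading of "explicit even if it's otherwise the identity": when the restriction does no reordering, its apply reduces to a pure precision conversion. A minimal sketch with illustrative names, one function per direction:
/* Identity restriction that only converts precision (double L-vector <-> float working vector). */
static void identity_restrict_convert_sketch(long n, const double *l_vec, float *work_vec) {
  for (long i = 0; i < n; i++) work_vec[i] = (float)l_vec[i];  /* the conversion is the whole "restriction" */
}

static void identity_prolong_convert_sketch(long n, const float *work_vec, double *l_vec) {
  for (long i = 0; i < n; i++) l_vec[i] = (double)work_vec[i];
}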
One potential issue with always doing the conversion in the restriction is that the choice that minimizes memory movement depends on the mode: for high-low, it's better to convert in the element restriction (write to float, then read from float for basis), but for low-high, it'd be better to wait until the basis application and do a local conversion (write to float, read from float + convert to double in registers vs. write to double, read from double). That said, I have gone ahead and created two branches where I am playing around with conversion in the restriction for cuda-gen -- not in a way we'd want to merge, but to compare performance. The two versions are low-high and high-low, with everything after the conversion in the restriction happening in the same precision (and with basis data stored in the computation precision, i.e., the "second" precision). I think the exact location of the conversion to higher precision matters less for cuda-gen than the non-fused CUDA backends due to how the element restrictions work (reading into registers, not doing the entire element restriction as a separate kernel). The one snag here is that technically, not quite all computation happens in the "second" precision, due to the atomicAdd calls in the transpose element restriction needing to match the output (L-vector) precision. The addition/reduction of E-vector into L-vector isn't much computation compared to the basis or QFunction, but it is still a different situation than the input element restrictions, which just convert the data. |
Good points. I'd say the conversion is logically part of the restriction, though an implementation could choose to associate/fuse it differently. I think before we put too much thought into optimizing this pattern, we'll want some real-world tests. On Gauss-Lobatto nodes to Gauss quadrature points, the condition number of the basis interpolation matrix is less than 3 and the basis derivative matrix is around 10. So this part gives up about one digit of accuracy to perform in single precision versus double. I ultimately think we may consider methods that use fixed-point 32-bit integers for some of the intermediate representations, but I think it's premature until we have use cases showing where the extra bits are useful. And I think associating with restriction is entirely reasonable for these tests and performance with cuda-gen won't care. |
After some informal discussions on Slack, I have the following proposal for mixed-precision "rules" for
However, I realize this will cause problems, perhaps, with memory management. Right now we have a clear distinction in the backend data structs between arrays that have been allocated by libCEED vs. given as a pointer, to make sure we deallocate as necessary when the vector is destroyed. A memory management system like the one above could lead to a |
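One way a backend data struct could keep the allocated-by-libCEED vs. user-pointer distinction per precision is sketched below; this is illustrative only and not the current backend layout:
#include <stdbool.h>
#include <stdlib.h>

/* One pointer slot per precision, plus a flag recording whether libCEED
 * allocated it (and must free it) or the user supplied it. */
typedef struct {
  char *data[4];   /* indexed by a precision enum */
  bool  owned[4];  /* true => allocated by the library */
} VectorStorageSketch;

static void vector_storage_destroy_sketch(VectorStorageSketch *s) {
  for (int p = 0; p < 4; p++) {
    if (s->owned[p]) free(s->data[p]);  /* only free library-owned allocations */
    s->data[p]  = NULL;
    s->owned[p] = false;
  }
}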
I think this is a good model. Yes, the |
Rough code for what I was thinking of.
typedef enum {
  CEED_PRECISION_HALF = 0,
  CEED_PRECISION_SINGLE = 1,
  CEED_PRECISION_DOUBLE = 2,
  CEED_PRECISION_DOUBLE_DOUBLE = 3,
} CeedPrecisionType;
...
char *data_ptrs[4];
...
double *double_ptr = (double *)data_ptrs[CEED_PRECISION_DOUBLE]; |
// Set two pointers of user arrays to use
CeedVectorSetArray(v, CEED_MEM_HOST, CEED_PRECISION_HALF, CEED_USE_POINTER, &h_ptr);
CeedVectorSetArray(v, CEED_MEM_HOST, CEED_PRECISION_DOUBLE, CEED_USE_POINTER, &d_ptr);
// Copy to half precision
// Would contain data from d_ptr converted to half
CeedVectorTakeArray(v, CEED_MEM_HOST, CEED_PRECISION_HALF, &h_ptr); |
Did you want to create a separate Would a I was thinking something like
so the One reason I thought it would be better to stick with something like the (And maybe we should pick a different name for |
Oh, I was just scratching down some ideas. I wouldn't replicate two enums for basically the same thing, I just didn't remember the enum names offhand. I don't think we would want to pass around an object of different precision pointers divorced from their validity/staleness information, right? There is a lot of state management logic, and I'd think we would want to hide that inside of the CeedVector so that logic doesn't have to leak into other parts of the interface. Do we want to adopt the convention that all QFunction data will be in the same precision? That would simplify our internal needs. |
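For the validity/staleness bookkeeping hidden inside the CeedVector, one illustrative possibility (not existing libCEED code) is a bitmask recording which (memory type, precision) copies are current, so that granting write access to one copy invalidates all the others:
#define NUM_PRECISIONS 4

typedef struct {
  char    *host_data[NUM_PRECISIONS];
  char    *device_data[NUM_PRECISIONS];
  unsigned valid_mask;  /* bit p => host copy p current; bit (p + NUM_PRECISIONS) => device copy p current */
} VectorStateSketch;

/* After write access to one (memtype, precision) copy, only that copy is valid. */
static void mark_written_sketch(VectorStateSketch *v, int on_device, int prec) {
  v->valid_mask = 1u << (prec + (on_device ? NUM_PRECISIONS : 0));
}

/* Read access requires that copy to be valid (or a copy/conversion from a valid one). */
static int is_valid_sketch(const VectorStateSketch *v, int on_device, int prec) {
  return (v->valid_mask >> (prec + (on_device ? NUM_PRECISIONS : 0))) & 1u;
}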
I think it's reasonable for QFunctions to use homogeneous precision because
|
We should probably expect the user to explicitly set the precision of the context data? |
The context is just bytes. It can contain integers, strings, whatever. I was thinking of the input and output args. |
Right. It just occurred to me though that we have examples of using CeedScalar in context struct definitions all over the place, but that won't work if the user has context doubles but wants to run experiments with using Operator and QFunction evaluation in single precision. The codegen backends would reasonably expect to recycle the same source for any precision operators, right? |
Do you mean re: my comment about |
If I request a single pointer, then the CeedVector's state management logic knows exactly what I have access to and if I am reading or writing. But if we hand back multiple pointers, then how does the CeedVector's state management logic know which one I ultimately changed and which one is valid? I suppose we would be I guess I'm not seeing what passing this struct around gives us that we wouldn't get from just using an interface with a |
Hand back to what, where? Sorry, I was previously experimenting with an interface passing Though, I do like the idea of being able to access the proper precision from a variable rather than a name...(meaning the value given in |
I made some updates to the document based on our discussions today: I plan to use this branch for further tests or experiments in implementation. Also, anyone else should feel free to add/edit thoughts in the document on that branch. It's still a very rough draft of ideas and things we've talked about thus far. |
Are you thinking of this
That probably shouldn't have been put into this struct in the first place. It is a legacy from a hacky approach I had in the Active/Passive refactor years ago before we added reference counting on CeedVector access. Really it should be eliminated. |
To be clear, @jeremylt, was your suggestion of |
At least in my mind it makes sense to leave the data wrapped in CeedVector objects as much as possible and only handle raw pointers when we must. It isn't clear to me where this |
Well, two reasons, I guess. Originally, I thought we may have a case where the user or backend functions would want to interact directly with the multiprecision data "thing," whatever that was, in a way that would work for the other languages (not just relying on But being able to access the data through a variable (a value from the type enum) as you showed above would be very helpful for the purposes of type-agnostic backend code. So, I do see that as an advantage to |
I don't think we should ever allow a user to have more than one pointer at a time for write access. That breaks our knowledge of what data is valid. Internally, I could see it being useful to arrange how we store the 12 pointers we would need to track inside of a vector in some consistent way or if we wanted to repeat this idea inside of basis data for interp and grad matrices, but I really think that data should only pass outside of the CeedVector in a single pointer to a single array at a time. For our other languages, I would expect us to wrap the C API in whatever is ideomatic for that language so the user doesn't have to touch the C at all. For Rust we have to be explicit about the types so I would expect separate interfaces for getting a float or double array. I don't know about Python, but I'd expect using overloading on array types in Julia. |
Yeah, I was thinking it might be more useful in the
I was indeed thinking that we could use whatever storage is in |
Just a small note that it can be a struct containing an array, but shouldn't be a raw array (less type checking/harder to extend). |
@jedbrown do you mean you prefer sticking with something like
then if we have a member Or some other option? |
Like this is good in my opinion, I just want it inside a struct (like you suggest) rather than a raw array of pointers. Open question of whether these should be
char *p = array.values[scalar_type];
p + 100 * bytes_per_scalar; // valid
void *q = array.values[scalar_type];
q + 100 * bytes_per_scalar; // invalid!
|
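A sketch of the struct-wrapped table being suggested (names are illustrative): the members stay char * so the byte-offset arithmetic above remains valid, and the struct leaves room for extra fields later:
#include <stddef.h>

typedef struct {
  char *values[4];  /* indexed by the scalar-type enum */
} CeedScalarArraySketch;

/* Byte offset into whichever precision is selected; valid because char * arithmetic is defined. */
static char *scalar_array_at_sketch(CeedScalarArraySketch a, int scalar_type,
                                    size_t i, size_t bytes_per_scalar) {
  return a.values[scalar_type] + i * bytes_per_scalar;
}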
That's a good point about pointer arithmetic. Maybe an array of |
Based on our discussion yesterday (and previous discussion), I decided to make up some schematics to hopefully clearly show my interpretation of some options that have been proposed, and the pros/cons. In the following schematics:
Let's assume we can set the following precision options at an Operator level in libCEED:
If we start with the simpler option for computation precision, that all Basis operations match the QFunction precision, this is what would happen to float and double inputs/outputs with float computation precision: And this would be the double precision computation case: However, as mentioned in the discussion, the basis computations are generally very well-conditioned, and don't need higher precision for the sake of accuracy if the operator input is already in lower precision anyway. My experience is mostly in the GPU backends, where something like this could potentially have a benefit (especially if all inputs/outputs are float, but we want to do QFunction in double due to conditioning concerns): In this case, the little However, is this too confusing/surprising to have these different options based on configuration combinations of input/output precisions and computation precision? It would also be "asymmetric," as the float computation precision case would never worry about this (we'd always want to convert to float as soon as possible anyway). |
This issue is for discussions related to adding mixed precision capabilities to libCEED, including:
An idea from @jedbrown was to handle precision conversions through the CeedVector objects, which could be modified to store multiple pointers for different precisions, along with a tag to indicate which precision that vector's data currently uses.