Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Venado optimizations #755

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Venado optimizations #755

wants to merge 12 commits into from

Conversation

mewall
Copy link
Collaborator

@mewall mewall commented Jan 14, 2025

  • Modify bml_transpose() fortran API to match the C API
    o Add bml_transpose_new()
    o Change the tests to use the new API
  • Add methods to get the pointer for MAGMA arrays, for use in Fortran OpenACC and OpenMP offload
    o Write fortran wrapper for existing bml_get_data_ptr_dense()
    o Add new bml_get_ld_dense()
  • Add bml_set_N_dense() to change the size of a bml array that's already been allocated
    o This avoids unnecessary allocations and leads to substantial speedups
    o Unsafe method that's exposed in fortran for dense matrices only

Copy link
Collaborator

@nicolasbock nicolasbock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nicolasbock nicolasbock enabled auto-merge January 14, 2025 18:45
scripts/build_bml_cray.sh Outdated Show resolved Hide resolved
#include "magma_v2.h"
#endif

void bml_set_N_dense(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a use case for setting N after construction. Is it to avoid constructing a new matrix?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeanlucf22 Yes. This was addressed in the commit comment but the bullets needed fixing to clarify (now corrected). This is to avoid repeated memory allocations which became a bottleneck in MD simulations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks dangerous to me: you modify N without modifying anything else will lead to matrices in inconsistent state!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mention in the comments, it is unsafe. It's possible to change N so that the matrix will exceed its original allocation. It's exposed only as bml_set_N_dense(), for experts. It works. Can you explain how the matrix will be inconsistent? It's needed for an application, we need to figure out how to make this method available in some form.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If needed we can protect against exceeding initial size by adding an extra variable to the struct keeping track of the size of the originally allocated matrix. But it's less intrusive to leave the struct alone...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the performance issue comes from the allocation of the struct members domain and domain2:

bml_domain_t *domain;

I looked into it at some point, but the fix I tried was breaking qmd-progress.
These two struct members could be static or initialized as needed.
Increasing N would obviously lead to a memory allocation too small to hold an NxN matrix.

What is the use case? Is N decreasing and you want to reuse the allocation? Whatever it is, I think looking into domain and domain2 is a safer and better alternative.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is needing to use a MAGMA working array of differing size on different MD iterations, where max N is known. The problem is that creating a new array each time is slow, due to the GPU allocation. The solution is to allocate using max N once and to resize the array using bml_set_N_dense() as needed. How can I use domain and domain2 to do this? I'm not familiar with those.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand now.

I was not suggesting to use domain and domain2. I was just saying their allocation may be the main culprit when it comes to allocation time for a dense matrix. Maybe an issue to deal with another time.

Another suggestion: having a function resizeNoAlloc(int n) that would just change N if n<=N, otherwise would change N and reallocate memory? Having an extra struct member keeping track of allocated memory size would be good in that case.

@mewall
Copy link
Collaborator Author

mewall commented Jan 28, 2025 via email

@jeanlucf22
Copy link
Collaborator

Yes, I think such a function could be good. The dense case is essentially done, even for magma build, we’d just need to add the allocated size to the struct and add a check to make it work. Need to figure out the other matrix types, if they’ll be supported. Meanwhile, what do you think about merging the current function? Get Outlook for iOShttps://aka.ms/o0ukef
________________________________ From: Jean-Luc Fattebert @.> Sent: Monday, January 27, 2025 4:21:48 PM To: lanl/bml @.> Cc: Wall, Michael E @.>; Author @.> Subject: [EXTERNAL] Re: [lanl/bml] Venado optimizations (PR #755) @jeanlucf22 commented on this pull request.
________________________________ In src/C-interface/dense/bml_setters_dense.chttps://urldefense.com/v3/__https://github.com/lanl/bml/pull/755*discussion_r1931312603__;Iw!!Bt8fGhp8LhKGRg!FbQ06GCuZLlZD10ZOodErg20Q_XH3WAFVKL0R0Oc8HqI9Aed_3wlkpMOHRdkO-20l936fvvr2iZGDdyW8vBVVBy3$:
@@ -3,6 +3,22 @@
#include "bml_setters_dense.h" #include "bml_types_dense.h" +#ifdef BML_USE_MAGMA +#include "magma_v2.h" +#endif + +void bml_set_N_dense( I understand now. I was not suggesting to use domain and domain2. I was just saying their allocation may be the main culprit when it comes to allocation time for a dense matrix. Maybe an issue to deal with another time. Another suggestion: having a function resizeNoAlloc(int n) that would just change N if n<=N, otherwise would change N and reallocate memory? Having an extra struct member keeping track of allocated memory size would be good in that case. — Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/lanl/bml/pull/755*discussion_r1931312603__;Iw!!Bt8fGhp8LhKGRg!FbQ06GCuZLlZD10ZOodErg20Q_XH3WAFVKL0R0Oc8HqI9Aed_3wlkpMOHRdkO-20l936fvvr2iZGDdyW8vBVVBy3$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AA67VEIOEH32HB66PF4BJQ32M25QZAVCNFSM6AAAAABVFKYFIOVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDKNZWG42TCNBWGM__;!!Bt8fGhp8LhKGRg!FbQ06GCuZLlZD10ZOodErg20Q_XH3WAFVKL0R0Oc8HqI9Aed_3wlkpMOHRdkO-20l936fvvr2iZGDdyW8iWVuCBR$. You are receiving this because you authored the thread.Message ID: @.***>

I don't think it is a good idea to merge as is. That's what branches are for if you need it as is right away.

@jmohdyusof
Copy link
Collaborator

FWIW, for the ellpack format, you can probably make this work simply by making nnz(i) = 0 for i > n. Obviously M is only an upper bound already, and the loops would continue to run over all N rows, unless you decided to introduce another variable to truncate them.

@mewall
Copy link
Collaborator Author

mewall commented Jan 29, 2025 via email

@mewall
Copy link
Collaborator Author

mewall commented Jan 29, 2025 via email

@jeanlucf22
Copy link
Collaborator

Can you please provide a list of requirements for merging this into master?

________________________________ From: Jean-Luc Fattebert @.> Sent: Wednesday, January 29, 2025 12:25 PM To: lanl/bml @.> Cc: Wall, Michael E @.>; Author @.> Subject: [EXTERNAL] Re: [lanl/bml] Venado optimizations (PR #755) Yes, I think such a function could be good. The dense case is essentially done, even for magma build, we’d just need to add the allocated size to the struct and add a check to make it work. Need to figure out the other matrix types, if they’ll be supported. Meanwhile, what do you think about merging the current function? Get Outlook for iOShttps://aka.ms/o0ukefhttps://urldefense.com/v3/__https://aka.ms/o0ukef__;!!Bt8fGhp8LhKGRg!Cjdib0KzKOuP85xu8ga136DukYOoPPKVgcctLCsr3CufV8WwR9W6zwe798bqE7qu0gpHiural0miy5b47KSGeMqG$
________________________________ From: Jean-Luc Fattebert @.> Sent: Monday, January 27, 2025 4:21:48 PM To: lanl/bml @.> Cc: Wall, Michael E @.>; Author @.> Subject: [EXTERNAL] Re: [lanl/bml] Venado optimizations (PR #755https://urldefense.com/v3/__https://github.com/lanl/bml/pull/755__;!!Bt8fGhp8LhKGRg!Cjdib0KzKOuP85xu8ga136DukYOoPPKVgcctLCsr3CufV8WwR9W6zwe798bqE7qu0gpHiural0miy5b47HtHaUCk$) @jeanlucf22https://urldefense.com/v3/__https://github.com/jeanlucf22__;!!Bt8fGhp8LhKGRg!Cjdib0KzKOuP85xu8ga136DukYOoPPKVgcctLCsr3CufV8WwR9W6zwe798bqE7qu0gpHiural0miy5b47HKzc6_0$ commented on this pull request.
________________________________ In src/C-interface/dense/bml_setters_dense.chttps://urldefense.com/v3/#755*discussion_r1931312603;Iw!!Bt8fGhp8LhKGRg!FbQ06GCuZLlZD10ZOodErg20Q_XH3WAFVKL0R0Oc8HqI9Aed_3wlkpMOHRdkO-20l936fvvr2iZGDdyW8vBVVBy3$: @@ -3,6 +3,22 @@ #include "bml_setters_dense.h" #include "bml_types_dense.h" +#ifdef BML_USE_MAGMA +#include "magma_v2.h" +#endif + +void bml_set_N_dense( I understand now. I was not suggesting to use domain and domain2. I was just saying their allocation may be the main culprit when it comes to allocation time for a dense matrix. Maybe an issue to deal with another time. Another suggestion: having a function resizeNoAlloc(int n) that would just change N if n<=N, otherwise would change N and reallocate memory? Having an extra struct member keeping track of allocated memory size would be good in that case. — Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/#755*discussion_r1931312603;Iw!!Bt8fGhp8LhKGRg!FbQ06GCuZLlZD10ZOodErg20Q_XH3WAFVKL0R0Oc8HqI9Aed_3wlkpMOHRdkO-20l936fvvr2iZGDdyW8vBVVBy3$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AA67VEIOEH32HB66PF4BJQ32M25QZAVCNFSM6AAAAABVFKYFIOVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDKNZWG42TCNBWGM__;!!Bt8fGhp8LhKGRg!FbQ06GCuZLlZD10ZOodErg20Q_XH3WAFVKL0R0Oc8HqI9Aed_3wlkpMOHRdkO-20l936fvvr2iZGDdyW8iWVuCBR$. You are receiving this because you authored the thread.Message ID: @.> I don't think it is a good idea to merge as is. That's what branches are for if you need it as is right away. — Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/lanl/bml/pull/755*issuecomment-2622639216__;Iw!!Bt8fGhp8LhKGRg!Cjdib0KzKOuP85xu8ga136DukYOoPPKVgcctLCsr3CufV8WwR9W6zwe798bqE7qu0gpHiural0miy5b47BQYtol7$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AA67VEK7MTI6M5EYAESKTYT2NETI7AVCNFSM6AAAAABVFKYFIOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMRSGYZTSMRRGY__;!!Bt8fGhp8LhKGRg!Cjdib0KzKOuP85xu8ga136DukYOoPPKVgcctLCsr3CufV8WwR9W6zwe798bqE7qu0gpHiural0miy5b47E1cI_bk$. You are receiving this because you authored the thread.Message ID: @.>

Have the function keep the matrix in a consistent state: memory allocated <= N*ld

@mewall
Copy link
Collaborator Author

mewall commented Jan 30, 2025

Have the function keep the matrix in a consistent state: memory allocated <= N*ld

OK here's what I propose

A add a new variable "num_elems_allocated" to the following struct that holds the total number of elements in the dense matrix

struct bml_matrix_dense_t

If you can foresee any consequences for this modification elsewhere in the code, please let me know

Leave bml_set_N_dense() as a function to change N.

Fail using LOG_ERROR()

I will be unable to merge our PROGRESS optimizations into master without merging this into master, so please, if there is anything else that will hold this up, let me know now.

@jeanlucf22
Copy link
Collaborator

Have the function keep the matrix in a consistent state: memory allocated <= N*ld

OK here's what I propose

A add a new variable "num_elems_allocated" to the following struct that holds the total number of elements in the dense matrix

struct bml_matrix_dense_t

If you can foresee any consequences for this modification elsewhere in the code, please let me know

Leave bml_set_N_dense() as a function to change N.

Fail using LOG_ERROR()

I will be unable to merge our PROGRESS optimizations into master without merging this into master, so please, if there is anything else that will hold this up, let me know now.

Personally, I'd prefer another name for the function, such as "resize" which indicates the new matrix is actually in a working state, just in a different size. I would also prefer to reallocate when N larger, instead of failing.

Any other opinion?

@jmohdyusof
Copy link
Collaborator

jmohdyusof commented Jan 30, 2025

My 2c:
num_elements_allocated is somewhat confusing, is that NxN or nxn? It sounds like NxN, but your description sounds like the latter (which is more like n_in_use)

Is the proposal that resize is an asymmetric function that reallocates if N_new > N_old but just reuses the old data allocation if N_new <= N_old?

@mewall
Copy link
Collaborator Author

mewall commented Jan 30, 2025

My 2c: num_elements_allocated is somewhat confusing, is that NxN or nxn? It sounds like NxN, but your description sounds like the latter (which is more like n_in_use)

It's NxN, or Nxld, depending on CPU vs MAGMA. I'm not sure what lower case n refers to though...is there another complexity I'm not seeing?

Is the proposal that resize is an asymmetric function that reallocated in N_new > N_old but just reuses the old data allocation if N_new <= N_old?

That's my understanding. My preference would be to just do the thing asked for and fail otherwise. If there are two different things done under the hood depending on how it's called that creates more complexity when optimizing at the application level.

FWIW this is why I would prefer to merge what I've committed already: it does what it says, works, is completely isolated in the code, and nobody but an expert will use it. Now we're into different territory...

@mewall
Copy link
Collaborator Author

mewall commented Jan 30, 2025

Have the function keep the matrix in a consistent state: memory allocated <= N*ld

OK here's what I propose
A add a new variable "num_elems_allocated" to the following struct that holds the total number of elements in the dense matrix

struct bml_matrix_dense_t

If you can foresee any consequences for this modification elsewhere in the code, please let me know
Leave bml_set_N_dense() as a function to change N.
Fail using LOG_ERROR()
I will be unable to merge our PROGRESS optimizations into master without merging this into master, so please, if there is anything else that will hold this up, let me know now.

Personally, I'd prefer another name for the function, such as "resize" which indicates the new matrix is actually in a working state, just in a different size. I would also prefer to reallocate when N larger, instead of failing.

Any other opinion?

I like bml_set_N_dense() because that's what it does, and would prefer that the behavior is consistent from call to call...but I will do whatever is needed to get this merged!

@jmohdyusof
Copy link
Collaborator

In my comment above, n refers to the number of rows actually used (of the N allocated).
The confusion (to me) arises because in the past N==n, so things like set_N_dense are ambiguous unless we differentiate between them. Does set_N_dense currently always reallocate even if N_new <= N_old?

I have no particular opinion on this particular merge, just trying to avoid coming back to a code with confusing (to me) nomenclature.

@mewall
Copy link
Collaborator Author

mewall commented Jan 30, 2025

In my comment above, n refers to the number of rows actually used (of the N allocated). The confusion (to me) arises because in the past N==n, so things like set_N_dense are ambiguous unless we differentiate between them. Does set_N_dense currently always reallocate even if N_new <= N_old?

I have no particular opinion on this particular merge, just trying to avoid coming back to a code with confusing (to me) nomenclature.

There is no current bml_set_N, bml_set_N_dense it's a new addition. There's a bml_get_N() in the introspection code, I just reasoned it was natural to use bml_set_N for the setting. I think it's an OK choice as the variable is N in the struct. I'm not aware of a way to select a fewer number of rows/columns to use in a currently allocated matrix, but I might have missed something when I poked around looking for a way to do what I needed to.

@jmohdyusof
Copy link
Collaborator

I generally agree with the set/get duality for naming.
So in the case of reusing an allocation, you just set N_new < N_old but retain the same pointer to allocated memory, and keep track of N_original somewhere to make sure N_new <= N_original, otherwise reallocate?

@mewall
Copy link
Collaborator Author

mewall commented Jan 30, 2025

I generally agree with the set/get duality for naming. So in the case of reusing an allocation, you just set N_new < N_old but retain the same pointer to allocated memory, and keep track of N_original somewhere to make sure N_new <= N_original, otherwise reallocate?

That's right. The burden is currently on the programmer to ensure the method isn't called when N_new > N_original. Nothing in BML helps with that, AFAIK it would require keeping track of N_original in the struct (the call can happen many times with different values of N_new, so just comparing to the current N doesn't work).

@jmohdyusof
Copy link
Collaborator

In that case I agree that adding N_original (or N_allocated if you prefer) to the struct is the correct way to go.

@mewall
Copy link
Collaborator Author

mewall commented Jan 30, 2025

In that case I agree that adding N_original (or N_allocated if you prefer) to the struct is the correct way to go.

I'm OK with using N_original (linear size) instead of num_elems_allocated (total size of the array). That should work for either the CPU of MAGMA code path.

@jeanlucf22?

@jmohdyusof
Copy link
Collaborator

I definitely prefer N specification over num_elements.

@jeanlucf22
Copy link
Collaborator

In that case I agree that adding N_original (or N_allocated if you prefer) to the struct is the correct way to go.

I'm OK with using N_original (linear size) instead of num_elems_allocated (total size of the array). That should work for either the CPU of MAGMA code path.

@jeanlucf22?

Fine with N_allocated.

@mewall
Copy link
Collaborator Author

mewall commented Jan 30, 2025

In that case I agree that adding N_original (or N_allocated if you prefer) to the struct is the correct way to go.

I'm OK with using N_original (linear size) instead of num_elems_allocated (total size of the array). That should work for either the CPU of MAGMA code path.
@jeanlucf22?

Fine with N_allocated.

OK, it sounds like we have a consensus around N_allocated. I'll make the change and push a revision.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants