Role of LSYNC_TRANS #174

Open
samhatfield opened this issue Nov 19, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@samhatfield
Collaborator

samhatfield commented Nov 19, 2024

SETUP_TRANS0 has an option LSYNC_TRANS, which is described in TPM_GEN as "activate barriers in trmtol and trltom".

In the cpu subtree, this variable does nothing. The barriers in TRMTOL and TRLTOM have been commented out now for 11 years.

In the gpu subtree there are many uses for LSYNC_TRANS, e.g. in TRGTOL, LEINV, LTINV etc. In each case the variable controls whether an MPL_BARRIER is executed, and the time taken to satisfy the barrier is measured using GSTATS.
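For illustration only, the pattern described above (an optional barrier whose wait time is recorded under its own counter) can be sketched in Python. This is not ecTrans code: `threading.Barrier` stands in for MPL_BARRIER, a plain dictionary stands in for the GSTATS counters, and the names `run_rank`, `run` and `lsync` are hypothetical.

```python
import threading
import time
from collections import defaultdict


def run_rank(rank, work_seconds, lsync, barrier, counters):
    """One simulated rank: uneven work, optional timed barrier, then a timed region."""
    time.sleep(work_seconds)  # uneven work that precedes the region of interest

    if lsync:
        # Only executed when the (hypothetical) lsync flag is set.
        t0 = time.perf_counter()
        barrier.wait()  # stand-in for MPL_BARRIER
        counters[rank]["barrier"] += time.perf_counter() - t0

    t0 = time.perf_counter()
    time.sleep(0.01)  # stand-in for the region being measured
    counters[rank]["region"] += time.perf_counter() - t0


def run(work_per_rank, lsync):
    """Run all simulated ranks as threads and return their per-rank counters."""
    n = len(work_per_rank)
    barrier = threading.Barrier(n)
    counters = {rank: defaultdict(float) for rank in range(n)}
    threads = [
        threading.Thread(target=run_rank, args=(rank, work, lsync, barrier, counters))
        for rank, work in enumerate(work_per_rank)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counters


if __name__ == "__main__":
    for rank, timers in run([0.0, 0.3], lsync=True).items():
        print(rank, dict(timers))
```

With `lsync=True`, the rank that finishes its work early accumulates most of its waiting time under the separate `barrier` counter; with `lsync=False`, no barrier is executed and no such counter is recorded.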

However, it's not clear what all the different uses have in common. I initially thought that LSYNC_TRANS might be used to measure device<->host transfer times, but that doesn't seem to be right.

@lukasm91 could you give us some advice here? Do we need to review the uses of LSYNC_TRANS?

At the very least we should document this option properly, as the description in TPM_GEN is no longer valid.

samhatfield added the bug label on Nov 19, 2024
@lukasm91
Collaborator

lukasm91 commented Nov 19, 2024

On the GPU, this option has existed since before I started using ectrans, and I really appreciate having it. It is very likely not an option you want to have in operations, but besides that, whenever you are doing performance testing, LSYNC_TRANS and the related barriers are extremely useful, because you can be sure that you do not attribute load imbalance to the wrong counters.
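To make the attribution point concrete, here is a small deterministic model in Python. It is only a sketch under a simplifying assumption: the communication is a blocking collective that cannot complete until every rank has arrived. The function name `attribute_times` and the timer labels are hypothetical and do not correspond to actual GSTATS counters.

```python
def attribute_times(compute, comm, sync):
    """Model per-rank timers for one compute phase followed by one collective.

    compute: per-rank compute durations in seconds
    comm:    duration of the collective once every rank has arrived
    sync:    if True, a separately timed barrier precedes the collective
    """
    last_arrival = max(compute)
    report = []
    for c in compute:
        wait = last_arrival - c  # idle time caused purely by load imbalance
        if sync:
            # The barrier absorbs the wait, so the communication timer stays clean.
            report.append({"compute": c, "barrier": wait, "comm": comm})
        else:
            # Without the barrier, the wait is hidden inside the communication timer.
            report.append({"compute": c, "barrier": 0.0, "comm": wait + comm})
    return report


if __name__ == "__main__":
    for sync in (False, True):
        print("sync =", sync)
        for rank, timers in enumerate(attribute_times([1.0, 3.0], 0.5, sync)):
            print("  rank", rank, timers)
```

In this model, with compute times of 1.0 s and 3.0 s and a 0.5 s collective, the fast rank reports 2.5 s of "communication" without the barrier, but 2.0 s of barrier time and 0.5 s of communication with it.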

On GPU I tried to make the option as useful as possible. The idea is to make it possible to understand the performance of:
a) The communications (TRLTOG, TRGTOL, TRLTOM, TRMTOL), and if possible really only the communication, such that the expectation can be that those scale as much as possible. Packing/unpacking does not belong to the communication.
b) FFTs - because they are a major component, and they can have significant load imbalance
c) GEMMs - because they are a major component, and they can have significant load imbalance
d) The whole rest, usually relatively small, and not the major source of load imbalance.

On the CPU, I see that b and c may be problematic because they are inside the OpenMP loop, but rather than removing LSYNC_TRANS from the GPU code, I would suggest making those barriers as meaningful as possible for the CPU.

There is also NTRANS_SYNC_LEVEL. No strong opinion on this. I think that in a reasonable solution it is not needed: either we do the synchronizations or we don't, so there is no need for a level.

Does that help? If you have more questions let me know :)
