v2 pkpool #20
base: v2
Conversation
steal id = (rand (mod n-1)) + id + 1 (mod n) Current implementation has a lot of repeated calls to lc_pool_get_local that the compiler doesn't seem to fully get rid of. Maybe marking it with __attribute__((pure)) will help? Need to re-read docs to ensure we don't violate its constraints.
Populate p->context.poolid when allocating a packet and use lc_pool_put_to when returning it so that packets don't get stuck in the same pool. This reduces the number of retries needed, but may have other effects - needs testing.
src/medium.c (Outdated)

```diff
@@ -6,8 +6,7 @@
 lc_status lc_sendm(void* src, size_t size, int rank, int tag, lc_ep ep)
 {
   LC_POOL_GET_OR_RETN(ep->pkpool, p);
-  lci_pk_init(ep, (size > 1024) ? lc_pool_get_local(ep->pkpool) : -1,
-              LC_PROTO_DATA, p);
+  lci_pk_init(ep, lc_pool_get_local(ep->pkpool), LC_PROTO_DATA, p);
```
---
Note that in some benchmarks this can be an issue, since the cost of putting the data on a remote pool can be significant: it always guarantees a cache miss, versus returning it to its own pool, which is very cheap.
---
That's a fair point. In our use case, however, there is only ever one thread (per device, I guess) that can return packets. Is it then better to cause a cache miss now, or to allow a pool to empty, guaranteeing a cache miss later?
---
Agreed, there is no right or wrong here. Things like this need to be tuned and measured against application behavior. Hence it's better to give some flexibility for tuning: I would change 1024 to a constant, and if you set it to 0 the compiler will get rid of it.
Btw, note also that a cache miss on the progress thread trying to return to the original pool can be worse than one on the sender stealing. First, it delays progress; second, assuming we have a large number of threads doing sends, the cost is amortized, as you may have other threads doing useful work. Again, this may not matter in your case right now, since you have only one sender thread.
---
Right, I see your point.
The branch of PaRSEC I'm currently on is old and funnels all sends through the communication thread, but more recent versions can send from other threads as well, so this will become more relevant once I rebase.
---
I've remembered that, actually, the communication and progress threads are supposed to run on the same core (they're bound). This means that they should end up using the same pool. In PaRSEC/LCI, the user's main thread is actually the one to initialize LCI and thus fill the initial pool; the communication and progress threads are started later (and presumably run on a different core) and therefore start with an empty pool.
I'll run some more tests that discard the changes for medium and long sends but keep the remaining changes (most notably those in server_psm2.h).
Long-term, I'll probably introduce compile-time parameters for a) the size limit for returning medium-send packets and b) whether to return long-send packets.
---
It would be strange if they were using the same pool; please verify.
---
I'll verify tomorrow, but my intuition is that my statement above is correct. Both threads should be bound to the same core, so sched_getcpu should return the same value, leading to them using the same pool. This is PaRSEC-specific, though; in the future, if/when we enable using more than one device, we'll need better logic for thread placement (in PaRSEC).
---
Depending on the activation mode, the PaRSEC communication thread might not be bound to a core, but may instead be allowed to move around freely. You can be more restrictive and force a binding for the communication thread, in which case the resource will be assumed reserved and no computation thread will be allowed to use it.
---
I think the default (which I've been testing with) is to have it be bound. When we have a single progress thread, it makes sense to bind it to the communication thread, but if we have more (multiple NICs/hardware queues/"LCI devices") then the decision is less obvious. At that point, we'll want to introduce an MCA parameter to control the binding.
Vu noted that this should be tunable at compile time, as different applications/systems may want different behavior. Adds two compile-time parameter definitions in config.h:
- LC_PKT_RET_MED_SIZE: minimum size of a medium send for its packet to be returned to the sender's pool
- LC_PKT_RET_LONG: whether to return long-send packets to the sender's pool
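The corresponding config.h fragment might look like this (the default values shown are assumptions, not taken from the commit):

```c
/* Compile-time packet-return tunables (defaults shown are assumptions). */
#ifndef LC_PKT_RET_MED_SIZE
#define LC_PKT_RET_MED_SIZE 1024  /* min medium-send size to return the packet to the sender's pool */
#endif
#ifndef LC_PKT_RET_LONG
#define LC_PKT_RET_LONG 1         /* nonzero: return long-send packets to the sender's pool */
#endif
```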
If pool->npools == 1, there is no valid steal target: return the caller's own id. Also includes formatting fixes and changes discussed in #20.
Maybe reformulate this as
This lgtm now! Thanks, feel free to merge
081bab4 renames it:
Rename lc_pool_get_local to better reflect what it actually does.
Something here seems to break collectives (or at least barrier). I'm working on debugging this...
Collective in LCI v2? Do you configure it with sync? Can you reproduce it with any example code?
(Quoting Omri Mor, Jun 10, 2020:) Something here seems to break collectives (or at least barrier). I'm working on debugging this... The sends and receives seem to complete, but the synchronizer is never triggered. Let me know if you have any immediate thoughts?
Yes I did, and I just found the bug. It's a mistake of my own making: after finding a valid steal target, it would go on and try to steal from itself (line 114 in …). The collective examples all use a progress thread, which gets starved for packets, so the receive never completes.
461b6ed prevents steal attempts on the same core and, barring implementation issues, should be OK.
5eb5330 ends up with a better packet distribution, but may cause other issues; it might need more testing. I've only modified server_psm2.h, since server_ibv.h/server_ibv_helper.h already set p->context.poolid, and server_ofi.h is out of date and I'm not sure it compiles.