
v2 pkpool #20

Open: wants to merge 7 commits into v2
Conversation

omor1 (Member) commented Jun 5, 2020

461b6ed prevents steal attempts to the same core and, barring implementation issues, should be OK.
5eb5330 ends up with a better packet distribution, but may cause other issues; it might need more testing. I've only modified server_psm2.h, since server_ibv.h/server_ibv_helper.h already set p->context.poolid, and server_ofi.h is out of date and I'm not sure it compiles.

steal id = ((rand() mod (n-1)) + id + 1) mod n
The current implementation has a lot of repeated calls to lc_pool_get_local
that the compiler doesn't seem to fully eliminate.
Maybe marking it with __attribute__((pure)) will help?
Need to re-read the docs to ensure we don't violate its constraints.
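
A minimal sketch of the steal-target computation described above; the helper name and the use of rand() are illustrative stand-ins, not the actual pool.h code:

```c
#include <stdint.h>
#include <stdlib.h>

/* steal_id = ((rand() mod (n-1)) + id + 1) mod n
 * The offset lies in 1..n-1, so the result can never equal id,
 * i.e. a thread never attempts to steal from its own pool. */
static inline int32_t steal_target(int32_t npools, int32_t id)
{
  if (npools == 1) return id; /* no valid steal target */
  return (rand() % (npools - 1) + id + 1) % npools;
}
```

The npools == 1 guard mirrors the later commit in this PR ("If pool->npools == 1, there is no valid steal target: return self pid").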
omor1 (Member, Author) commented Jun 5, 2020

Whoops, I just realized that a bit of #19 snuck into 0af5316, let me fix that...

Populate p->context.poolid when allocating a packet
and use lc_pool_put_to when returning it
so that packets don't get stuck in the same pool.
This reduces the number of retries needed,
but may have other effects - needs testing.
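
A rough sketch of the allocate/return path this commit describes; the lc_pool_put/lc_pool_put_to signatures, the lc_packet context layout, and the -1 sentinel handling are assumptions based on names appearing in this PR, not the committed code:

```c
/* Allocation side (e.g. lc_sendm): record which pool the sender drew from. */
LC_POOL_GET_OR_RETN(ep->pkpool, p);
lci_pk_init(ep, lc_pool_get_local(ep->pkpool), LC_PROTO_DATA, p);

/* Return side (e.g. the progress thread completing the send): hand the
 * packet back to the recorded pool rather than the returning thread's own
 * pool, so packets don't get stuck in one pool. */
if (p->context.poolid >= 0)
  lc_pool_put_to(ep->pkpool, p, p->context.poolid);
else
  lc_pool_put(ep->pkpool, p);   /* no pool recorded: return locally */
```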
include/lc/pool.h (outdated; resolved)
src/medium.c (outdated)
@@ -6,8 +6,7 @@
 lc_status lc_sendm(void* src, size_t size, int rank, int tag, lc_ep ep)
 {
   LC_POOL_GET_OR_RETN(ep->pkpool, p);
-  lci_pk_init(ep, (size > 1024) ? lc_pool_get_local(ep->pkpool) : -1,
-              LC_PROTO_DATA, p);
+  lci_pk_init(ep, lc_pool_get_local(ep->pkpool), LC_PROTO_DATA, p);
Member:
Note that in some benchmarks this can be an issue, since the cost of putting the data on a remote pool can be significant: it always guarantees a cache miss, versus returning to the thread's own pool, which is very cheap.

omor1 (Member, Author):

That's a fair point. In our use case, however, there is only ever one thread (per device, I guess) that can return packets. Is it then better to cause a cache miss now, or to allow a pool to empty, guaranteeing a cache miss later?

Member:
Agreed, there is no right or wrong here. Things like this need to be tuned and measured against application behavior. Hence, giving some flexibility for tuning is better; I would change 1024 to a constant, and if you set it to 0, the compiler will get rid of it.

Btw, note also that a cache miss on the progress thread trying to return to the original pool can be worse than one on a sender stealing. Firstly, it delays progress; secondly, assuming we have a large number of threads doing sends, the cost is amortized since you may have other threads doing useful work. Again, this may not matter in your case right now, since you have only one sender thread.

omor1 (Member, Author):

Right, I see your point.

The branch of PaRSEC I'm currently on is old and funnels all sends through the communication thread, but more recent versions can send from other threads as well, so this will become more relevant in the future once I rebase.

omor1 (Member, Author):

Actually, I've remembered that the communication and progress threads are supposed to run on the same core (they're bound). This means that they should end up using the same pool. In PaRSEC/LCI, the user's main thread is actually the one that initializes LCI and thus fills the initial pool; the communication and progress threads are started later (and presumably run on a different core) and therefore start with an empty pool.

I'll do some more tests that discard the changes for medium and long sends, but keep the remaining changes (most notably those in server_psm2.h).

Long-term, I'll probably introduce some compile-time parameters for a) the size limit for returning medium-send packets and b) whether to return long-send packets.

Member:

Strange if they are using the same pool, please verify.

omor1 (Member, Author) commented Jun 9, 2020:

I'll verify tomorrow, but my intuition is that my above statement is correct. Both threads should be bound to the same core and so sched_getcpu should return the same value, leading to using the same pool. This is PaRSEC-specific though, and in the future if/when we enable using more than one device we'll need to have better logic for thread placement (in PaRSEC).

Collaborator:

Depending on the activation mode, the PaRSEC communication thread might not be bound to a core, but instead be allowed to freely move around. You can be more restrictive and force a binding for the communication thread, in which case the resource will be assumed reserved and no computation thread will be allowed to use it.

omor1 (Member, Author):

I think the default (which I've been testing with) is to have it be bound. When we have a single progress thread, it makes sense to bind it to the communication thread, but if we have more (multiple NICs/hardware queues/"LCI devices") then the decision is less obvious. At that point, we'll want to introduce an MCA parameter to control the binding.

include/lc/pool.h (outdated; resolved)
omor1 added 2 commits June 5, 2020 17:39
Vu noted that this should be tunable at compile time,
as different applications/systems may want different behavior.
Adds two compile-time parameter definitions in config.h:
  LC_PKT_RET_MED_SIZE: min size of med send to return to sender pool
  LC_PKT_RET_LONG: whether to return long send packet to sender pool
If pool->npools == 1, there is no valid steal target: return self pid.
Also includes formatting fixes and changes discussed in #20.
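
For reference, a hedged sketch of how the two compile-time parameters from the first commit above might look in config.h and be applied in lc_sendm; the default values and surrounding code here are assumptions, not the committed ones:

```c
/* config.h (sketch) */
#ifndef LC_PKT_RET_MED_SIZE
#define LC_PKT_RET_MED_SIZE 1024  /* min size of a medium send whose packet is
                                     returned to the sender's pool */
#endif
#ifndef LC_PKT_RET_LONG
#define LC_PKT_RET_LONG 1         /* return long-send packets to sender pool */
#endif

/* src/medium.c (sketch): record the sender's pool id only above the threshold,
 * otherwise keep the -1 sentinel so the packet is returned locally. */
lci_pk_init(ep,
            (size > LC_PKT_RET_MED_SIZE) ? lc_pool_get_local(ep->pkpool) : -1,
            LC_PROTO_DATA, p);
```

Per the review discussion above, setting the threshold to 0 effectively disables it and lets the compiler eliminate the branch.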
omor1 (Member, Author) commented Jun 9, 2020

Maybe reformulate this as lc_pool_steal_from(lc_pool* pool, int32_t pid) and lc_pool_get_steal_id(int32_t npools, int32_t pid)?
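
A sketch of that proposed split; only the two signatures come from the comment, and the return type, the caller, and the surrounding logic are illustrative assumptions:

```c
/* Proposed interfaces (signatures from the comment above). */
int32_t lc_pool_get_steal_id(int32_t npools, int32_t pid);
lc_packet* lc_pool_steal_from(lc_pool* pool, int32_t pid);  /* return type assumed */

/* Hypothetical caller: compute the victim once, then steal from it. */
static lc_packet* pool_try_steal(lc_pool* pool, int32_t npools, int32_t pid)
{
  int32_t victim = lc_pool_get_steal_id(npools, pid);
  if (victim == pid) return NULL;          /* npools == 1: nothing to steal */
  return lc_pool_steal_from(pool, victim);
}
```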

danghvu (Member) commented Jun 9, 2020

This LGTM now! Thanks, feel free to merge.

omor1 (Member, Author) commented Jun 10, 2020

081bab4 renames lc_pool_get_local to lc_pool_get_local_id
7ee652c implements #20 (comment)

omor1 added 2 commits June 10, 2020 17:10
Rename lc_pool_get_local to better reflect what it actually does.
omor1 (Member, Author) commented Jun 11, 2020

Something here seems to break collectives (or at least barrier); I'm working on debugging it.
The sends and receives seem to complete, but the synchronizer is never triggered. Let me know if you have any immediate thoughts.

danghvu (Member) commented Jun 11, 2020 via email

omor1 (Member, Author) commented Jun 11, 2020

Yes I did, and I just found the bug. It's a mistake of my own making: after finding a valid steal target, it would go on and try to steal from itself (line 114 in pool.h).

The collective examples all use a progress thread, which gets starved for packets and therefore the receive never completes.
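
A purely illustrative reconstruction of that control-flow mistake; apart from the interfaces proposed earlier in this PR, every identifier here (lc_pool_try_get, pool_get_or_steal) is a stand-in rather than the actual pool.h code:

```c
/* Illustrative only; lc_pool_try_get is a hypothetical local-pop helper. */
static lc_packet* pool_get_or_steal(lc_pool* pool, int32_t pid)
{
  lc_packet* p = lc_pool_try_get(pool, pid);   /* try our own pool first */
  if (p != NULL) return p;

  int32_t victim = lc_pool_get_steal_id(pool->npools, pid);
  if (victim == pid) return NULL;              /* npools == 1: nothing to steal */

  /* The bug: this call was made with `pid` instead of `victim`, so the
   * starved progress thread kept "stealing" from its own empty pool. */
  return lc_pool_steal_from(pool, victim);
}
```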
