Parallelize Evolution at the level of evolven3fit
#2168
Conversation
Please, let's fix it in EKO. Doing this just hides the problem (and potentially creates a huge problem if you happen to have many cores and not a lot of memory; this should definitely not use every core of the computer).
n3fit/src/evolven3fit/evolve.py (outdated)

@@ -22,6 +24,8 @@
     "level": logging.DEBUG,
 }

+NUM_CORES = multiprocessing.cpu_count()
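For orientation, a minimal sketch (not the code in this PR) of a less aggressive default: cap the worker count by what the machine actually has and let it be raised explicitly, instead of silently grabbing every core. The environment variable name is hypothetical, used only for illustration.

```python
import multiprocessing
import os

# Hypothetical override, for illustration only: default to a single core and
# never exceed what the machine actually provides.
NUM_CORES = min(
    int(os.environ.get("EVOLVEN3FIT_NCORES", "1")),
    multiprocessing.cpu_count(),
)
```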
It should not be the default and it should not use every available core. At the very least it should be limited by the number of cores set in the runcard and controlled by a --parallel flag.
But, in any case, please let's fix this in EKO.
It should not be the default and it should not use every available core. At the very least it should be limited by the number of cores set in the runcard and controlled by a --parallel flag.
For sure this can't be, because the number of cores in the runcard is usually set as low as possible on the cluster (in order to run as many replicas as possible), while for the evolution one would want to run on as many cores as possible.
Anyway this was not really intended to be merged in the first place but I'd keep it around as it is handy.
It depends on the cluster, of course, but in my experience the optimal number is actually the maximum that still lets you make full use of the memory of a node.
I agree that ultimately this should be fixed in EKO, but I don't see that happening soon for many reasons, mainly:
The number of cores to use (depending on the memory) can always be easily set by the user by defining some environment variable or using
The next change of eko will probably be a disaster for pineko, but I would hope it will be fine for just evolving a PDF.
The memory problem is driven by the operators. Doing it in eko ensures that you do one value of Q at a time. Here you end up with (cores * size of the full eko) in the worst case scenario; vectorizing the convolution means that the worst case scenario is (size of the full eko).
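As a rough illustration of the vectorization argument (this is not the eko API, and the operator layout (flavour_out, x_out, flavour_in, x_in) is assumed): applying a single operator to all replicas at once keeps only one copy of the operator in memory, independent of the number of cores.

```python
import numpy as np

n_rep, n_fl, n_x = 100, 14, 50
operator = np.random.rand(n_fl, n_x, n_fl, n_x)  # one Q^2 slice, held in memory exactly once
pdfs = np.random.rand(n_rep, n_fl, n_x)          # all replicas stacked together

# Contract flavour_in and x_in for every replica in a single call; the peak
# memory is (one operator) + (the stacked PDFs), not (cores * operator).
evolved = np.einsum("bjck,ack->abj", operator, pdfs)  # shape (n_rep, n_fl, n_x)
```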
The default should never be taking all cores. Programs that do that should be forbidden from running in any kind of shared system.
Actually no. The operator is only loaded once into memory and then shared by all the replicas. So it is not a function of the number of cores.
Yes, I agree with this.
I agree EKO is the correct place.
Actually, the main reason is lack of manpower, so please get involved in eko if that is possible 😇
There are a few breaking changes pending, which is why we will need some testing. My strategy so far has been to delay the next tag of eko until it is actually required, for this reason. However, if we need a tag now, we make a tag.
the proposed solution is just to adjust the
Unfortunately I have to admit that EKO is such a case 🙈 (and it does create problems - see Como). We should change that. Alessandro was always arguing in favour of dropping this parallelization level and maybe, maybe with Rust we may be able to do so.
Is it? I thought that EKO took only one if nothing was put in the runcard. Maybe that was the default that Andrea put in pineko/evolve then. In any case, for the purposes of the evolution EKO certainly uses only one, so it is fine.
We need it only if this is implemented, so that we can use it asap, but otherwise not really, I believe.
I'll try to implement the proper fix.
As I told @scarlehoff in some private exchanges, @giacomomagni and I initially wanted to implement the changes directly in EKO but didn't do so yet because of this:
and everything it will imply. If it were just about adding this change, it would be straightforward and quick. So this is just a much-needed dirty alternative for all the fits we need ASAP.
As already mentioned (here or somewhere else), I'm happy with merging this as long as it is not the default (a --parallel flag or something) and it doesn't take all cores. However, I would be extra happy with the proper fix :P
Just added a
Yes, of course!
Hi @Radonirinaunimi, is it fine if I rebase, squash and merge this PR? That way it will be easier to roll back once #2181 can be merged.
Yes, please do so!
Add `joblib` as a dependency
Add `joblib` to `conda-recipe`
Add `--ncores` flag to specify the number of cores used to parallelize evolution
Re-use `n-cores` flag and make sure negative values of `n_jobs` are not permitted
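A hedged sketch of what the listed changes could amount to, assuming joblib drives the per-replica loop; the function names below are illustrative and not the actual evolven3fit API. Negative values are rejected explicitly because joblib would otherwise interpret, e.g., n_jobs=-1 as "use all cores".

```python
from joblib import Parallel, delayed


def evolve_one_replica(replica_index):
    """Placeholder for evolving a single replica's exportgrid with the EKO."""
    ...


def evolve_replicas(replica_indices, ncores):
    # Reject non-positive values: joblib maps negative n_jobs to "all cores
    # (minus n-1)", which is exactly the behaviour being forbidden here.
    if ncores < 1:
        raise ValueError("the number of cores must be a positive integer")
    Parallel(n_jobs=ncores)(
        delayed(evolve_one_replica)(i) for i in replica_indices
    )
```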
Force-pushed from a1f5abd to 196201e.
Greetings from your nice fit 🤖!
Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
This is not really in the direction of NNPDF/eko#408 (specifically this comment) but the following naive parallelization already provides a huge improvement.