
Parallelize Evolution at the level of evolven3fit #2168

Merged: 1 commit into master, Oct 21, 2024

Conversation

@Radonirinaunimi (Member)

This is not really in the direction of NNPDF/eko#408 (specifically this comment), but the following naive parallelization already provides a huge improvement.

  • A fit with 120 replicas: ~5 min vs. ~20 min (numbers provided by @giacomomagni, not sure on how many cores)
  • A fit with 1500 replicas: ~20 min vs. ~4 hrs on 32 cores

@Radonirinaunimi added the enhancement label on Oct 10, 2024
@scarlehoff (Member)

Please, let's fix it in EKO. Doing this just hides the problem (and potentially creates a huge problem if you happen to have many cores and not a lot of memory; this should definitely not use every core of the computer).

@@ -22,6 +24,8 @@
"level": logging.DEBUG,
}

NUM_CORES = multiprocessing.cpu_count()
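
A minimal runnable sketch of the pattern this diff introduces: joblib distributing one evolution job per replica over `NUM_CORES` workers. The helper below is a hypothetical stand-in, not the actual evolven3fit code:

```python
import multiprocessing

from joblib import Parallel, delayed

NUM_CORES = multiprocessing.cpu_count()

def evolve_replica(replica_id):
    # Stand-in for the real per-replica work (load the exportgrid,
    # apply the EKO, write out the evolved member).
    return replica_id ** 2

# One job per replica; joblib distributes the jobs over the worker pool.
results = Parallel(n_jobs=NUM_CORES)(
    delayed(evolve_replica)(r) for r in range(120)
)
```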
Member (review comment):
It should not be the default and it should not use every available core. At the very least it should be limited by the number of cores set in the runcard and controlled by a --parallel flag.

But, in any case, please let's fix this in EKO.

@Radonirinaunimi (Member, Author):

> It should not be the default and it should not use every available core. At the very least it should be limited by the number of cores set in the runcard and controlled by a --parallel flag.

For sure this can't be, because the number of cores in the runcard is usually set as low as possible on the cluster (in order to run as many replicas as possible), while for the evolution one would want to run on as many cores as possible.

Anyway, this was not really intended to be merged in the first place, but I'd keep it around as it is handy.

Member:

It depends on the cluster, of course, but in my experience the optimal number is actually the maximum that lets you make full use of the memory of a node.

@Radonirinaunimi marked this pull request as draft on October 10, 2024 14:10
@Radonirinaunimi (Member, Author)

> Please, let's fix it in EKO. Doing this just hides the problem (and potentially creates a huge problem if you happen to have many cores and not a lot of memory; this should definitely not use every core of the computer).

I agree that ultimately this should be fixed in EKO, but I don't see that happening soon for several reasons, mainly:

  • the next release of EKO would likely introduce a few breaking changes with all the refactoring
  • it is not a given that one will not run into a similar memory problem, so the number of parallel evolutions would have to be controlled anyway by the end user

The number of cores to use (depending on the memory) can always be easily set by the user by defining some environment variable or using taskset.
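
For instance, something like `OMP_NUM_THREADS=8 taskset -c 0-7 evolven3fit evolve <fit_folder>` (an illustrative command; the evolven3fit arguments are placeholders) would pin the whole evolution to eight cores and cap the threads used by the numerical libraries underneath.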

@scarlehoff (Member)

The next change of eko will probably be a disaster for pineko, but I would hope it will be fine for just evolving a PDF.

> it is not a given that one will not run into a similar memory problem, so the number of parallel evolutions would have to be controlled anyway by the end user

The memory problem is driven by the operators. Doing it in eko ensures that you do one value of Q at a time. Here you end up with (cores * size of the full eko) in the worst-case scenario; vectorizing the convolution means that the worst-case scenario is (size of the full eko).

> The number of cores to use (depending on the memory) can always be easily set by the user by defining some environment variable or using taskset.

The default should never be to take all cores. Programs that do that should be forbidden from running on any kind of shared system.

@Radonirinaunimi (Member, Author)

> The memory problem is driven by the operators. Doing it in eko ensures that you do one value of Q at a time. Here you end up with (cores * size of the full eko) in the worst-case scenario; vectorizing the convolution means that the worst-case scenario is (size of the full eko).

Actually no. The operator is only loaded once into memory and then shared by all the replicas. So it is not a function of the number of cores.

> The default should never be to take all cores. Programs that do that should be forbidden from running on any kind of shared system.

Yes, I agree with this.

@felixhekhorn (Contributor)

> Please, let's fix it in EKO. Doing this just hides the problem (and potentially creates a huge problem if you happen to have many cores and not a lot of memory; this should definitely not use every core of the computer).

I agree, EKO is the correct place.

> I agree that ultimately this should be fixed in EKO, but I don't see that happening soon for several reasons, mainly:

Actually, the main reason is lack of manpower, so please get involved in eko if that is possible 😇

> • the next release of EKO would likely introduce a few breaking changes with all the refactoring

There are a few breaking changes pending; that's why we will need some testing. My strategy so far has been to delay the next tag of eko until it is actually required, for this reason. However, if we needed a tag now, we would make a tag.

> • it is not a given that one will not run into a similar memory problem, so the number of parallel evolutions would have to be controlled anyway by the end user

The proposed solution is just to adjust the einsum call: if(!) there is no copying going on somewhere, the memory footprint is the same as before, namely holding all replicas and holding the eko (a toy version is sketched at the end of this comment).

> The default should never be to take all cores. Programs that do that should be forbidden from running on any kind of shared system.

> Yes, I agree with this.

Unfortunately I have to admit that EKO is such a case 🙈 (and it does create problems, see Como). We should change that. Alessandro was always arguing in favour of dropping this parallelization level, and maybe, maybe with Rust we will be able to do so.
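
A toy illustration of the einsum adjustment discussed above, with made-up operator shapes (the real EKO indices and the evolven3fit code differ): vectorizing over a replica axis reproduces the per-replica loop while keeping a single copy of the operator in memory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 14 flavors, 50 x-grid points, 120 replicas.
nf, nx, nrep = 14, 50, 120
operator = rng.random((nf, nx, nf, nx))  # one evolution operator at fixed Q^2
pdfs = rng.random((nrep, nf, nx))        # all fitted replicas, stacked

# Replica-by-replica: many einsum calls, but only one copy of the operator.
looped = np.stack([np.einsum("aibj,bj->ai", operator, pdf) for pdf in pdfs])

# Vectorized: a single einsum with an extra replica axis "r";
# the worst-case memory stays at one operator plus all replicas.
vectorized = np.einsum("aibj,rbj->rai", operator, pdfs)

assert np.allclose(looped, vectorized)
```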

@scarlehoff (Member)

> Unfortunately I have to admit that EKO is such a case

Is it? I thought that EKO took only one core if nothing was put in the runcard. Maybe that was the default that Andrea put in pineko/evolve then.

In any case, for the purposes of the evolution EKO certainly uses only one core, so it is fine.

> However, if we needed a tag now, we would make a tag.

We need it only if this is implemented, so that we can use it asap, but otherwise not really, I believe.

@giacomomagni (Contributor)

I'll give implementing the proper fix a try.

@Radonirinaunimi (Member, Author)

Radonirinaunimi commented Oct 11, 2024

> Actually, the main reason is lack of manpower, so please get involved in eko if that is possible 😇

As I told @scarlehoff in some private exchanges, @giacomomagni and I initially wanted to implement the changes directly in EKO, but we haven't done so yet because of this:

> There are a few breaking changes pending; that's why we will need some testing. My strategy so far has been to delay the next tag of eko until it is actually required, for this reason.

and everything that will imply. If it were just about adding this change, it would be straightforward and quick.

So this is just a badly needed dirty alternative for all the fits we need ASAP.

@scarlehoff (Member)

scarlehoff commented Oct 11, 2024

> So this is just a badly needed dirty alternative for all the fits we need ASAP.

As already mentioned (here or somewhere else), I'm happy with merging this as long as it is not the default (a --parallel flag or something) and it doesn't take all cores.

However, I would be extra happy with the proper fix :P

@Radonirinaunimi (Member, Author)

> As already mentioned (here or somewhere else), I'm happy with merging this as long as it is not the default (a --parallel flag or something) and it doesn't take all cores.

Just added a --ncores flag to evolve to control this.

> However, I would be extra happy with the proper fix :P

Yes, of course!
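
As an illustration, the evolution could then be run as `evolven3fit evolve <fit_folder> --ncores 16`; the `--ncores` flag is the one added in this PR, while the positional arguments are placeholders.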

(Resolved review threads on n3fit/src/n3fit/scripts/evolven3fit.py and n3fit/src/evolven3fit/evolve.py, now outdated.)
@scarlehoff marked this pull request as ready for review on October 16, 2024 14:41
@scarlehoff mentioned this pull request on Oct 17, 2024
@scarlehoff (Member)

Hi @Radonirinaunimi, is it fine if I rebase, squash, and merge this PR?

That way it will be easier to roll back once #2181 can be merged.

@Radonirinaunimi (Member, Author)

> Hi @Radonirinaunimi, is it fine if I rebase, squash, and merge this PR?
>
> That way it will be easier to roll back once #2181 can be merged.

Yes, please do so!

Squashed commit:

  • Add `joblib` as a dependency
  • Add `joblib` to `conda-recipe`
  • Add `--ncores` flag to specify the number of cores used to parallelize the evolution
  • Re-use the `n-cores` flag and make sure negative values of `n_jobs` are not permitted
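
A minimal sketch of the kind of guard the last commit describes (the flag wiring is hypothetical; only the idea of rejecting negative values of `n_jobs`, which joblib would otherwise interpret as "all cores minus k", is taken from the commit message above):

```python
import argparse

parser = argparse.ArgumentParser(prog="evolven3fit")
# Hypothetical flag wiring, mirroring the commit message above.
parser.add_argument("--ncores", type=int, default=1)
args = parser.parse_args(["--ncores", "4"])

# Reject values that joblib would interpret as "all cores minus k".
if args.ncores < 1:
    parser.error(f"--ncores must be a positive integer, got {args.ncores}")
```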
@scarlehoff force-pushed the parallelize-evolution branch from a1f5abd to 196201e on October 21, 2024 12:54
@scarlehoff added the run-fit-bot label on Oct 21, 2024

Greetings from your nice fit 🤖! I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕, or better, a GPU 😉!

@scarlehoff merged commit 87daeb1 into master on Oct 21, 2024 (8 checks passed)
@scarlehoff deleted the parallelize-evolution branch on October 21, 2024 18:18