
Parallelize Evolution at the level of evolven3fit #2168

Merged: 1 commit into master, Oct 21, 2024

Conversation

@Radonirinaunimi (Member)

This is not really in the direction of NNPDF/eko#408 (specifically this comment), but the following naive parallelization already provides a huge improvement.

  • A fit with 120 replicas: ~5 min vs. ~20 min (numbers provided by @giacomomagni, not sure on how many cores)
  • A fit with 1500 replicas: ~20 min vs. ~4 hrs on 32 cores

@Radonirinaunimi added the enhancement label on Oct 10, 2024
@scarlehoff (Member)

Please, let's fix it in EKO. Doing this just hides the problem (and potentially creates a huge problem if you happen to have many cores and not a lot of memory; this should definitely not use every core of the computer).

@@ -22,6 +24,8 @@
"level": logging.DEBUG,
}

NUM_CORES = multiprocessing.cpu_count()
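
A minimal runnable sketch of the pattern this diff introduces: joblib distributing one evolution job per replica over `NUM_CORES` workers. The helper below is a hypothetical stand-in, not the actual evolven3fit code:

```python
import multiprocessing

from joblib import Parallel, delayed

NUM_CORES = multiprocessing.cpu_count()

def evolve_replica(replica_id):
    # Stand-in for the real per-replica work (load the exportgrid,
    # apply the EKO, write out the evolved member).
    return replica_id ** 2

# One job per replica; joblib distributes the jobs over the worker pool.
results = Parallel(n_jobs=NUM_CORES)(
    delayed(evolve_replica)(r) for r in range(120)
)
```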
Member (review comment):
It should not be the default and it should not use every available core. At the very least it should be limited by the number of cores set in the runcard and controlled by a --parallel flag.

But, in any case, please let's fix this in EKO.

@Radonirinaunimi (Member, Author):

> It should not be the default and it should not use every available core. At the very least it should be limited by the number of cores set in the runcard and controlled by a --parallel flag.

For sure this can't be, because the number of cores in the runcard is usually set as low as possible on the cluster (in order to run as many replicas as possible), while for the evolution one would want to run on as many cores as possible.

Anyway, this was not really intended to be merged in the first place, but I'd keep it around as it is handy.

Member:

It depends on the cluster, of course, but in my experience the optimal number is actually the maximum that lets you make full use of the memory of a node.

@Radonirinaunimi marked this pull request as draft on October 10, 2024 14:10
@Radonirinaunimi (Member, Author)

> Please, let's fix it in EKO. Doing this just hides the problem (and potentially creates a huge problem if you happen to have many cores and not a lot of memory; this should definitely not use every core of the computer).

I agree that ultimately this should be fixed in EKO, but I don't see that happening soon for several reasons, mainly:

  • the next release of EKO would likely introduce a few breaking changes with all the refactoring
  • it is not a given that one will not run into a similar memory problem, so the number of parallel evolutions would have to be controlled anyway by the end user

The number of cores to use (depending on the memory) can always be easily set by the user by defining some environment variable or using taskset.
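
For instance, something like `OMP_NUM_THREADS=8 taskset -c 0-7 evolven3fit evolve <fit_folder>` (an illustrative command; the evolven3fit arguments are placeholders) would pin the whole evolution to eight cores and cap the threads used by the numerical libraries underneath.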

@scarlehoff (Member)

The next change of eko will probably be a disaster for pineko, but I would hope it will be fine for just evolving a PDF.

> it is not a given that one will not run into a similar memory problem, so the number of parallel evolutions would have to be controlled anyway by the end user

The memory problem is driven by the operators. Doing it in eko ensures that you do one value of Q at a time. Here you end up with (cores * size of the full eko) in the worst-case scenario; vectorizing the convolution means that the worst-case scenario is (size of the full eko).

> The number of cores to use (depending on the memory) can always be easily set by the user by defining some environment variable or using taskset.

The default should never be to take all cores. Programs that do that should be forbidden from running on any kind of shared system.

@Radonirinaunimi (Member, Author)

> The memory problem is driven by the operators. Doing it in eko ensures that you do one value of Q at a time. Here you end up with (cores * size of the full eko) in the worst-case scenario; vectorizing the convolution means that the worst-case scenario is (size of the full eko).

Actually no. The operator is only loaded once into memory and then shared by all the replicas. So it is not a function of the number of cores.

> The default should never be to take all cores. Programs that do that should be forbidden from running on any kind of shared system.

Yes, I agree with this.

@felixhekhorn (Contributor)

> Please, let's fix it in EKO. Doing this just hides the problem (and potentially creates a huge problem if you happen to have many cores and not a lot of memory; this should definitely not use every core of the computer).

I agree, EKO is the correct place.

> I agree that ultimately this should be fixed in EKO, but I don't see that happening soon for several reasons, mainly:

Actually, the main reason is lack of manpower, so please get involved in eko if that is possible 😇

> • the next release of EKO would likely introduce a few breaking changes with all the refactoring

There are a few breaking changes pending; that's why we will need some testing. My strategy so far has been to delay the next tag of eko until it is actually required, for this reason. However, if we needed a tag now, we would make a tag.

> • it is not a given that one will not run into a similar memory problem, so the number of parallel evolutions would have to be controlled anyway by the end user

The proposed solution is just to adjust the einsum call: if(!) there is no copying going on somewhere, the memory footprint is the same as before, namely holding all replicas and holding the eko (a toy version is sketched at the end of this comment).

> The default should never be to take all cores. Programs that do that should be forbidden from running on any kind of shared system.

> Yes, I agree with this.

Unfortunately I have to admit that EKO is such a case 🙈 (and it does create problems, see Como). We should change that. Alessandro was always arguing in favour of dropping this parallelization level, and maybe, maybe with Rust we will be able to do so.
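
A toy illustration of the einsum adjustment discussed above, with made-up operator shapes (the real EKO indices and the evolven3fit code differ): vectorizing over a replica axis reproduces the per-replica loop while keeping a single copy of the operator in memory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 14 flavors, 50 x-grid points, 120 replicas.
nf, nx, nrep = 14, 50, 120
operator = rng.random((nf, nx, nf, nx))  # one evolution operator at fixed Q^2
pdfs = rng.random((nrep, nf, nx))        # all fitted replicas, stacked

# Replica-by-replica: many einsum calls, but only one copy of the operator.
looped = np.stack([np.einsum("aibj,bj->ai", operator, pdf) for pdf in pdfs])

# Vectorized: a single einsum with an extra replica axis "r";
# the worst-case memory stays at one operator plus all replicas.
vectorized = np.einsum("aibj,rbj->rai", operator, pdfs)

assert np.allclose(looped, vectorized)
```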

@scarlehoff (Member)

> Unfortunately I have to admit that EKO is such a case

Is it? I thought that EKO took only one core if nothing was put in the runcard. Maybe that was the default that Andrea put in pineko/evolve then.

In any case, for the purposes of the evolution EKO certainly uses only one core, so it is fine.

> However, if we needed a tag now, we would make a tag.

We need it only if this is implemented, so that we can use it asap, but otherwise not really, I believe.

@giacomomagni (Contributor)

I'll give implementing the proper fix a try.

@Radonirinaunimi (Member, Author)

Radonirinaunimi commented Oct 11, 2024

> Actually, the main reason is lack of manpower, so please get involved in eko if that is possible 😇

As I told @scarlehoff in some private exchanges, @giacomomagni and I initially wanted to implement the changes directly in EKO, but we haven't done so yet because of this:

> There are a few breaking changes pending; that's why we will need some testing. My strategy so far has been to delay the next tag of eko until it is actually required, for this reason.

and everything that will imply. If it were just about adding this change, it would be straightforward and quick.

So this is just a badly needed dirty alternative for all the fits we need ASAP.

@scarlehoff (Member)

scarlehoff commented Oct 11, 2024

> So this is just a badly needed dirty alternative for all the fits we need ASAP.

As already mentioned (here or somewhere else), I'm happy with merging this as long as it is not the default (a --parallel flag or something) and it doesn't take all cores.

However, I would be extra happy with the proper fix :P

@Radonirinaunimi (Member, Author)

> As already mentioned (here or somewhere else), I'm happy with merging this as long as it is not the default (a --parallel flag or something) and it doesn't take all cores.

Just added a --ncores flag to evolve to control this.

> However, I would be extra happy with the proper fix :P

Yes, of course!
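
As an illustration, the evolution could then be run as `evolven3fit evolve <fit_folder> --ncores 16`; the `--ncores` flag is the one added in this PR, while the positional arguments are placeholders.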

(Resolved review threads on n3fit/src/n3fit/scripts/evolven3fit.py and n3fit/src/evolven3fit/evolve.py, now outdated.)
@scarlehoff marked this pull request as ready for review on October 16, 2024 14:41
@scarlehoff mentioned this pull request on Oct 17, 2024
@scarlehoff (Member)

Hi @Radonirinaunimi, is it fine if I rebase, squash, and merge this PR?

That way it will be easier to roll back once #2181 can be merged.

@Radonirinaunimi (Member, Author)

> Hi @Radonirinaunimi, is it fine if I rebase, squash, and merge this PR?
>
> That way it will be easier to roll back once #2181 can be merged.

Yes, please do so!

Squashed commit:

  • Add `joblib` as a dependency
  • Add `joblib` to `conda-recipe`
  • Add `--ncores` flag to specify the number of cores used to parallelize the evolution
  • Re-use the `n-cores` flag and make sure negative values of `n_jobs` are not permitted
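
A minimal sketch of the kind of guard the last commit describes (the flag wiring is hypothetical; only the idea of rejecting negative values of `n_jobs`, which joblib would otherwise interpret as "all cores minus k", is taken from the commit message above):

```python
import argparse

parser = argparse.ArgumentParser(prog="evolven3fit")
# Hypothetical flag wiring, mirroring the commit message above.
parser.add_argument("--ncores", type=int, default=1)
args = parser.parse_args(["--ncores", "4"])

# Reject values that joblib would interpret as "all cores minus k".
if args.ncores < 1:
    parser.error(f"--ncores must be a positive integer, got {args.ncores}")
```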
@scarlehoff force-pushed the parallelize-evolution branch from a1f5abd to 196201e on October 21, 2024 12:54
@scarlehoff added the run-fit-bot label on Oct 21, 2024

Greetings from your nice fit 🤖! I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕, or better, a GPU 😉!

@scarlehoff merged commit 87daeb1 into master on Oct 21, 2024 (8 checks passed)
@scarlehoff deleted the parallelize-evolution branch on October 21, 2024 18:18