Make weight initialization reproducible #1923
Conversation
I've added a test; do you agree that this is what we want to test here? And do you have an idea why it's not finding the runcards? It works locally (if I remove the linux mark).
They should be in the regression folder, I think. Locally it works because I guess you are running
I have checked that when doing any of:
n3fit runcard.yaml 1 -r 3
n3fit runcard.yaml 2 -r 3
n3fit runcard.yaml 1 -r 2
the preprocessing weights, NN weights and PDF output right after creation are identical for replica number 2.
I'd say the missing check is that n3fit runcard.yaml 2 is also identical.
However, to avoid the "replica 1 problem" I would do instead:
n3fit runcard.yaml 1 -r 3
n3fit runcard.yaml 2 -r 3
n3fit runcard.yaml 3
And check that replica 3 is the same.
With the test_fit itself... what about using save: weights.h5 and checking that the weights for replica 2 are the same? If epochs = 1 or epochs = 0 (not sure) they will be the initial ones...
What is the "replica 1 problem"? I suppose it's related to why you started using varying replica numbers in the regression tests, but I never knew the reason for that.
Turns out that we were only testing the first replica, so after some of the multireplica stuff was merged it actually only worked for the first replica when running sequentially. Also, when seeding, if we are missing some seed we might not notice for replica 1 and only realise from replica 2 onwards.
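A toy sketch (all names and numbers hypothetical, not the n3fit code) of why a replica-1-only regression test can hide a missing per-replica seed:

```python
BASE_SEED = 100  # hypothetical base seed from a runcard

def replica_seed(replica, offset):
    # Per-replica seed: base plus a replica-dependent offset
    return BASE_SEED + offset * (replica - 1)

# Correct seeding vs. a bug where the per-replica offset is forgotten (offset=0)
correct = [replica_seed(r, offset=17) for r in (1, 2, 3)]
buggy = [replica_seed(r, offset=0) for r in (1, 2, 3)]

# Replica 1 is identical in both cases, so a test that only checks replica 1 passes...
print(correct[0] == buggy[0])  # True
# ...but from replica 2 onwards all the buggy seeds collapse onto the base seed
print(buggy[1] == buggy[2] == BASE_SEED)  # True
```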
This is tricky: running with 1 epoch will actually run 1 epoch, and running with 0 epochs generates lots of errors (first there's a check that stops you; turning that into a warning, the timer callback errors; fixing that, the stopping errors; etc.). Apart from your newly suggested
I found the issue: it's the constraints on the preprocessing weights. Removing them makes all the weights identical. Edit: I guess what is happening is that the constraint is applied across replicas, so one replica can influence the others. I'll try to rewrite it to be per replica.
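A rough NumPy illustration (not the actual n3fit constraint) of how a MaxNorm-style constraint computed across the replica axis couples replicas, while computing it per replica keeps them independent:

```python
import numpy as np

def max_norm(w, max_value=1.0, axis=None):
    # MaxNorm-style rescaling: if the L2 norm computed over `axis`
    # exceeds max_value, scale the weights down to that norm.
    norms = np.sqrt(np.sum(w ** 2, axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)
    return w * desired / np.maximum(norms, 1e-7)

# Toy weights with shape (replicas, features); replica 0 has large weights
w = np.array([[3.0, 4.0],
              [0.3, 0.4]])

coupled = max_norm(w, axis=None)  # one norm over ALL replicas
per_rep = max_norm(w, axis=1)     # one norm per replica

# Per replica: replica 1's small weights are untouched
print(np.allclose(per_rep[1], w[1]))   # True
# Across replicas: replica 1 gets rescaled because of replica 0's large norm
print(np.allclose(coupled[1], w[1]))   # False
```

This is why with a cross-replica constraint two fits of "the same" replica can diverge as soon as the other replicas in the run differ.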
The constraint was a simple fix, but still not passing. Checking the weight differences between my 3 ways of running with the temporary script below, I find only small relative differences.

```python
import h5py
import numpy as np

# testnr must be set to the pytest run number for this path to exist
testnr = 0
TEMPDIR = f"/private/var/folders/lt/xy7j0k1j4tdf_k8p_87tb6300000gn/T/pytest-of-aronjansen/pytest-{testnr}/test_multireplica_runs_quickca1"

# Open the replica-2 weight files produced by the three runs
weights = {}
for name in ['a', 'b', 'c']:
    weight_path = f"{TEMPDIR}/{name}/quickcard-parallel/nnfit/replica_2/weights.h5"
    weights[name] = h5py.File(weight_path, 'r')

def extract_all_weights(file):
    # Recursively turn an h5py file/group into a nested dict of arrays
    out = {}
    for key in file.keys():
        if isinstance(file[key], h5py.Group):
            out[key] = extract_all_weights(file[key])
        else:
            out[key] = file[key][()]
    return out

def diff(d1, d2):
    # Compute the mean relative difference of two nested dicts of arrays
    d = {}
    for key in d1:
        if isinstance(d1[key], dict):
            d[key] = diff(d1[key], d2[key])
        else:
            reldiff = np.mean(np.abs((d1[key] - d2[key]) / (d1[key] + d2[key])))
            d[key] = reldiff
            if reldiff > 1e-5:
                print(f"key: {key}, relative diff: {reldiff}, first: {d1[key]}, second: {d2[key]}")
            else:
                print(f"key: {key}, relative diff: {reldiff}")
    return d

a = extract_all_weights(weights['a'])
b = extract_all_weights(weights['b'])
c = extract_all_weights(weights['c'])

print("Comparing a and b:")
diff(a, b)
print("Comparing a and c:")
diff(a, c)
print("Comparing b and c:")
diff(b, c)
```
What if you set the learning rate to 0? (But anyway, having the weights, even without biases, being the same at epoch 1 is a good enough test of the initialization being the same, which is the goal here.)
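A one-line sanity check of the learning-rate-0 idea: for plain gradient descent, lr = 0 makes a training step a no-op, so the weights after one epoch are exactly the initial ones (a minimal sketch, not the n3fit training loop):

```python
import numpy as np

rng = np.random.default_rng(1234)  # arbitrary seed for the illustration
w = rng.normal(size=5)             # "initial weights"
grad = rng.normal(size=5)          # whatever gradient one epoch produces

lr = 0.0
w_after = w - lr * grad  # one SGD step with learning rate 0

print(np.array_equal(w_after, w))  # True: initialization is preserved
```

Note this only holds for optimizers whose update is purely gradient-times-rate; anything with a separate weight-decay term would still move the weights.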
Not sure whether it makes sense or is needed to change those 0 to 1. It might be that they never enter Keras alone (or that you wanted 0 so that if the user uses 0 they indeed get the Keras behaviour).
Leaving the comment just to make sure that those 0 are intended.
The base seed is always added to the replica seeds, which never results in a 0. I'm also ok with changing it to 1, but it will change all the regressions of course.
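A sketch of the scheme as described (the constants are hypothetical): since the base seed is added to every replica seed, no replica ever ends up with an effective seed of 0, which sidesteps the Keras seed=0 issue mentioned in the commit list:

```python
base_seed = 42               # hypothetical positive base seed from the runcard
replica_seeds = range(1, 4)  # hypothetical per-replica seeds

# Effective seed per replica: base + replica seed, never 0 for a positive base
effective = [base_seed + s for s in replica_seeds]
print(effective)                        # [43, 44, 45]
print(all(e != 0 for e in effective))   # True
```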
Then it is fine. I just wanted to make sure it was intended and not an oversight!
Ok, yes: by passing the base seed separately rather than taking it from the single-replica initializer, we now just override whatever random seed Keras chose (rather than adding the replica seed to it).
Thanks, lgtm.
Also update the fitbot with the latest one once it finishes running.
I don't see the fitbot results?
It is still running, but on a previous commit. It seems it cannot run on the commit of another bot.
Greetings from your nice fit 🤖 !
Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
- Remove option for replica seeds to be None in Preprocessing
- Uniformize naming of layers in NN
- Add test comparing 3 ways of running
- darwin -> linux
- Simplify quickcards
- Increase tolerance
- Change axis in weight constraint
- Add test on constraint
- Simplify constraint, tighten test
- Test weights only
- Revert "Simplify constraint, tighten test" (this reverts commit df8781f)
- Clarify error message
- Avoid duplicate checks
- Change test cases
- Avoid seed=0 issue with Keras 3
Co-authored-by: Juan M. Cruz-Martinez <[email protected]>
We're having some silly issue with one hyperopt test, only on Python 3.11, where the test fails on the phi2 hyperopt loss. In the CI it's
And why is it only Python 3.11 with the python installation? It might be that merging the mongodb stuff (which changed dependencies) introduced a dependency difference between conda and pip that in turn creates this discrepancy?
No idea... that could be, but it doesn't explain why locally I also get very different results (with the same environment, in Python 3.9). I also checked several seeds; the order of magnitude remains +7 for master and -10 here.
Is this the value of phi2? Does it start at the same point? I agree that's very weird.
I found the explanation. The seed was being set as an int, which then uses the same value for all replicas. I didn't look into the details, but I assume this causes the phi2 statistic to be vanishingly small; that must happen for any version. Still, that's larger than 0, so it's ok. For some reason (maybe, as you mention Juan, the addition of packages causing some version differences), in Python 3.11 it was exactly 0, thus failing the test, but the actual issue was that it was practically zero everywhere. I just changed the seed from an int to a list of 2 different ints, and locally it's now again order +7.
This is an attempt to address #1916, branching off of #1905 because it uses the MultiInitializer to initialize the preprocessing weights.
I have checked that when doing any of:
n3fit runcard.yaml 1 -r 3
n3fit runcard.yaml 2 -r 3
n3fit runcard.yaml 1 -r 2
the preprocessing weights, NN weights and PDF output right after creation are identical for replica number 2.
Even after rebasing on trvl-mask-layers, though, results do not remain identical.
I don't know where the difference is coming from, as the tr/vl masks and the invcovmats are also the same for replica 2.