Make weight initialization reproducible #1923
Conversation
I've added a test; do you agree that this is what we want to test here? And do you have an idea why it's not finding the runcards? It works locally (if I remove the linux mark).
They should be in the regression folder, I think. Locally it works because I guess you are running
I have checked that when doing any of:
n3fit runcard.yaml 1 -r 3
n3fit runcard.yaml 2 -r 3
n3fit runcard.yaml 1 -r 2
the preprocessing weights, NN weights and PDF output right after creation are identical for replica number 2.
I'd say the missing check is that n3fit runcard.yaml 2 is also identical.
However, to avoid the "replica 1 problem" I would do instead:
n3fit runcard.yaml 1 -r 3
n3fit runcard.yaml 2 -r 3
n3fit runcard.yaml 3
And check that replica 3 is the same.
With the test_fit itself... what about using save: weights.h5 and checking that the weights for replica 2 are the same? If epochs = 1 or epochs = 0 (not sure) they will be the initial ones...
What is the "replica 1 problem"? I suppose it's related to why you started using varying replica numbers in the regression tests, but I never knew the reason for that.
Turns out that we were only testing the first replica, so after some of the multireplica stuff was merged it actually only worked for the first replica when running sequentially. Also, when seeding, if we are missing some seed we might not notice for replica 1 and only realise from replica 2 onwards.
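A toy sketch (all names and numbers hypothetical, not the n3fit code) of why a replica-1-only regression test can hide a missing per-replica seed:

```python
BASE_SEED = 100  # hypothetical base seed from a runcard

def replica_seed(replica, offset):
    # Per-replica seed: base plus a replica-dependent offset
    return BASE_SEED + offset * (replica - 1)

# Correct seeding vs. a bug where the per-replica offset is forgotten (offset=0)
correct = [replica_seed(r, offset=17) for r in (1, 2, 3)]
buggy = [replica_seed(r, offset=0) for r in (1, 2, 3)]

# Replica 1 is identical in both cases, so a test that only checks replica 1 passes...
print(correct[0] == buggy[0])  # True
# ...but from replica 2 onwards all the buggy seeds collapse onto the base seed
print(buggy[1] == buggy[2] == BASE_SEED)  # True
```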
This is tricky: running with 1 epoch will actually run 1 epoch, and running with 0 epochs generates lots of errors (first there's a check that stops you; turning that into a warning, the timer callback errors; fixing that, the stopping errors; etc.). Apart from your newly suggested
I found the issue: it's the constraints on the preprocessing weights. Removing them makes all the weights identical. Edit: I guess what is happening is that the constraint is applied across replicas, so one replica can influence the others. I'll try to rewrite it to be per replica.
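A rough NumPy illustration (not the actual n3fit constraint) of how a MaxNorm-style constraint computed across the replica axis couples replicas, while computing it per replica keeps them independent:

```python
import numpy as np

def max_norm(w, max_value=1.0, axis=None):
    # MaxNorm-style rescaling: if the L2 norm computed over `axis`
    # exceeds max_value, scale the weights down to that norm.
    norms = np.sqrt(np.sum(w ** 2, axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)
    return w * desired / np.maximum(norms, 1e-7)

# Toy weights with shape (replicas, features); replica 0 has large weights
w = np.array([[3.0, 4.0],
              [0.3, 0.4]])

coupled = max_norm(w, axis=None)  # one norm over ALL replicas
per_rep = max_norm(w, axis=1)     # one norm per replica

# Per replica: replica 1's small weights are untouched
print(np.allclose(per_rep[1], w[1]))   # True
# Across replicas: replica 1 gets rescaled because of replica 0's large norm
print(np.allclose(coupled[1], w[1]))   # False
```

This is why with a cross-replica constraint two fits of "the same" replica can diverge as soon as the other replicas in the run differ.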
The constraint was a simple fix, but still not passing. Checking the weight differences between my 3 ways of running with the temporary script below, I find only small relative differences.

```python
import h5py
import numpy as np

# testnr must be set to the pytest run number for this path to exist
testnr = 0
TEMPDIR = f"/private/var/folders/lt/xy7j0k1j4tdf_k8p_87tb6300000gn/T/pytest-of-aronjansen/pytest-{testnr}/test_multireplica_runs_quickca1"

# Open the replica-2 weight files produced by the three runs
weights = {}
for name in ['a', 'b', 'c']:
    weight_path = f"{TEMPDIR}/{name}/quickcard-parallel/nnfit/replica_2/weights.h5"
    weights[name] = h5py.File(weight_path, 'r')

def extract_all_weights(file):
    # Recursively turn an h5py file/group into a nested dict of arrays
    out = {}
    for key in file.keys():
        if isinstance(file[key], h5py.Group):
            out[key] = extract_all_weights(file[key])
        else:
            out[key] = file[key][()]
    return out

def diff(d1, d2):
    # Compute the mean relative difference of two nested dicts of arrays
    d = {}
    for key in d1:
        if isinstance(d1[key], dict):
            d[key] = diff(d1[key], d2[key])
        else:
            reldiff = np.mean(np.abs((d1[key] - d2[key]) / (d1[key] + d2[key])))
            d[key] = reldiff
            if reldiff > 1e-5:
                print(f"key: {key}, relative diff: {reldiff}, first: {d1[key]}, second: {d2[key]}")
            else:
                print(f"key: {key}, relative diff: {reldiff}")
    return d

a = extract_all_weights(weights['a'])
b = extract_all_weights(weights['b'])
c = extract_all_weights(weights['c'])

print("Comparing a and b:")
diff(a, b)
print("Comparing a and c:")
diff(a, c)
print("Comparing b and c:")
diff(b, c)
```
What if you set the learning rate to 0? (But anyway, having the weights, even without biases, being the same at epoch 1 is a good enough test of the initialization being the same, which is the goal here.)
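A one-line sanity check of the learning-rate-0 idea: for plain gradient descent, lr = 0 makes a training step a no-op, so the weights after one epoch are exactly the initial ones (a minimal sketch, not the n3fit training loop):

```python
import numpy as np

rng = np.random.default_rng(1234)  # arbitrary seed for the illustration
w = rng.normal(size=5)             # "initial weights"
grad = rng.normal(size=5)          # whatever gradient one epoch produces

lr = 0.0
w_after = w - lr * grad  # one SGD step with learning rate 0

print(np.array_equal(w_after, w))  # True: initialization is preserved
```

Note this only holds for optimizers whose update is purely gradient-times-rate; anything with a separate weight-decay term would still move the weights.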
Not sure whether it makes sense or is needed to change those 0 to 1. It might be that they never enter Keras alone (or that you wanted 0 so that if the user uses 0 they indeed get the Keras behaviour).
Leaving the comment just to make sure that those 0 are intended.
The base seed is always added to the replica seeds, which never results in a 0. I'm also ok with changing it to 1, but it will change all the regressions of course.
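A sketch of the scheme as described (the constants are hypothetical): since the base seed is added to every replica seed, no replica ever ends up with an effective seed of 0, which sidesteps the Keras seed=0 issue mentioned in the commit list:

```python
base_seed = 42               # hypothetical positive base seed from the runcard
replica_seeds = range(1, 4)  # hypothetical per-replica seeds

# Effective seed per replica: base + replica seed, never 0 for a positive base
effective = [base_seed + s for s in replica_seeds]
print(effective)                        # [43, 44, 45]
print(all(e != 0 for e in effective))   # True
```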
Then it is fine. I just wanted to make sure it was intended and not an oversight!
Ok, yes: by passing the base seed separately rather than taking it from the single-replica initializer, we now just override whatever random seed Keras chose (rather than adding the replica seed to it).
Thanks, lgtm.
Also update the fitbot with the latest one once it finishes running.
I don't see the fitbot results?
It is still running, but on a previous commit. It seems it cannot run on the commit of another bot.
Greetings from your nice fit 🤖 !
Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
- Remove option for replica seeds to be None in Preprocessing
- Uniformize naming of layers in NN
- Add test comparing 3 ways of running
- darwin -> linux
- Simplify quickcards
- Increase tolerance
- Change axis in weight constraint
- Add test on constraint
- Simplify constraint, tighten test
- Test weights only
- Revert "Simplify constraint, tighten test" (this reverts commit df8781f)
- Clarify error message
- Avoid duplicate checks
- Change test cases
- Avoid seed=0 issue with Keras 3
Co-authored-by: Juan M. Cruz-Martinez <[email protected]>
We're having some silly issue with one hyperopt test, only on Python 3.11, where the test fails on the phi2 hyperopt loss. In the CI it's
And why is it only Python 3.11 with the python installation? It might be that merging the mongodb stuff (which changed dependencies) introduced a dependency difference between conda and pip that in turn creates this discrepancy?
No idea... that could be, but it doesn't explain why locally I also get very different results (with the same environment, in Python 3.9). I also checked several seeds; the order of magnitude remains +7 for master and -10 here.
Is this the value of phi2? Does it start at the same point? I agree that's very weird.
I found the explanation. The seed was being set as an int, which then uses the same value for all replicas. I didn't look into the details, but I assume this causes the phi2 statistic to be vanishingly small; that must happen for any version. Still, that's larger than 0, so it's ok. For some reason (maybe, as you mention Juan, the addition of packages causing some version differences), in Python 3.11 it was exactly 0, thus failing the test, but the actual issue was that it was practically zero everywhere. I just changed the seed from an int to a list of 2 different ints, and locally it's now again order +7.
This is an attempt to address #1916, branching off of #1905 because it uses the MultiInitializer to initialize the preprocessing weights.
I have checked that when doing any of:
n3fit runcard.yaml 1 -r 3
n3fit runcard.yaml 2 -r 3
n3fit runcard.yaml 1 -r 2
the preprocessing weights, NN weights and PDF output right after creation are identical for replica number 2.
Even after rebasing on trvl-mask-layers, though, results do not remain identical.
I don't know where the difference is coming from, as the tr/vl masks and the invcovmats are also the same for replica 2.