Parallel replicas with varying tr-vl masks #1788
validphys2/src/validphys/utils.py (outdated)
```python
if is_hashable(obj):
    return obj
elif isinstance(obj, Mapping):
    return frozenset([(immute(k), immute(v)) for k, v in obj.items()])
```
The dict keys are already hashable.
```diff
- return frozenset([(immute(k), immute(v)) for k, v in obj.items()])
+ return frozenset([(k, immute(v)) for k, v in obj.items()])
```
validphys2/src/validphys/utils.py (outdated)
```python
def wrapped(*args, **kwargs):
    args = tuple([immute(arg) for arg in args])
    kwargs = {k: immute(v) for k, v in kwargs.items()}
    return func(*args, **kwargs)
```
I feel this is rather dangerous as a general decorator, as it changes the meaning of the inputs in unexpected ways.
It may make more sense to do this closer to the runcard input, in which case the freezer becomes:
```python
def freeze(obj):
    if isinstance(obj, list):
        return tuple(freeze(ele) for ele in obj)
    if isinstance(obj, dict):
        # a tuple of (key, frozen value) pairs keeps the result hashable
        return tuple((k, freeze(v)) for k, v in obj.items())
    if isinstance(obj, set):
        return frozenset(obj)
    return obj
```
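For illustration, this is how the helper above could be used on a runcard-like nested structure (the fragment below is made up, not taken from an actual runcard):

```python
# hypothetical runcard fragment, only for illustration
runcard_fragment = {"dataset_inputs": [{"dataset": "NMC", "frac": 0.75}]}

frozen = freeze(runcard_fragment)
# lists become tuples and dicts become tuples of (key, value) pairs,
# so the result can be hashed, e.g. for caching
hash(frozen)
```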
Right, but where in the code should the freezer be called?
An input to a vp action is either:
- Some input coming from the runcard, which is not hashable in general.
- The output of a parsing or a production rule (like a `PDF` object), which really needs to be hashable already.
- The output of another action, which you really don't care about being hashable.
Hence you really only need to make the runcard inputs hashable, which would not change the semantics too much, and the best place to do that is close to where you are parsing the input. It is something I had around for reportengine "1.0", which is why I had the above function lying around.
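A minimal sketch of what that could look like, assuming a hypothetical load_runcard helper (the actual entry point in reportengine/validphys will differ): the only point is that freeze() is applied right after the raw YAML is parsed, so downstream actions receive hashable inputs without any change in semantics.

```python
import yaml  # assumes PyYAML is available

def load_runcard(path):
    """Hypothetical helper: parse the runcard and freeze it in one place."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    # freeze each top-level value before it is handed to the actions
    return {key: freeze(value) for key, value in raw.items()}
```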
validphys2/src/validphys/utils.py (outdated)
```python
def immute(obj: Any):
    # So that we don't infinitely recurse, since frozensets and tuples
    # are Sequences.
    if is_hashable(obj):
```
Computing hashes of potentially deeply nested objects can be expensive. Given that you are failing anyway, you might as well do nothing and deal with effectively the same error.
Indeed we could skip the hashing check. As a side note: I have no idea why tests are currently failing in this branch...
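For reference, a minimal sketch (illustrative only, not the code in this branch) of what the conversion could look like with the hashability check dropped:

```python
from collections.abc import Mapping

def immute(obj):
    """Recursively turn dicts, lists/tuples and sets into hashable
    counterparts, without checking hashability up front."""
    if isinstance(obj, Mapping):
        return frozenset((k, immute(v)) for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return tuple(immute(v) for v in obj)
    if isinstance(obj, set):
        return frozenset(obj)
    # anything else is passed through; if it is unhashable the caller
    # fails with effectively the same error as before
    return obj
```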
Looking through all the … So it seems safe to me. We might want to do a comparefits report when we know we have different cuts, to make sure that stays the same. There are several examples here (that use different kinds of cuts): https://vp.nnpdf.science/sK8VI_HITF-hrg2Cx-_lPQ==/ However, some comments: (2.) suggests to me that the …
I've manually changed the cuts in both fits to make sure they were different. Re the tests: I think you are seeing the problem that was fixed in #1805.
It looks like something is still wrong. I did a test with a single epoch, 10 parallel replicas, and it seems to me that the difference between baseline (sequential replicas) and the parallel case is growing with the replica number:
I expect some roundoff errors, but certainly not a systematic increase per replica. BTW, for the sequential mode the chi2 is identical to the master branch. Edit: a quick check with the basic parallel runcard shows no significant differences across replicas, so the problem appears to be only for the NNPDF4.0 runcard...
As already mentioned, I don't expect the numbers at a given epoch to be fully consistent/close. In the parallel-replicas case, even a very small difference at the start of the fit could lead to fluctuations during the training; it is not the case that the changes here keep the numerical values exactly the same. So the proper way to compare the results is by running proper fits. At the very least, for now we should compare master sequential vs trvl-mask sequential, and master sequential vs trvl-mask parallel. Does this make sense?
Well, I agree with @goord that the pattern is curious. If all the seeds are the same at the start of each replica, then the result should be the same for the sequential and parallel cases. It may be that there is some numerical instability, or perhaps the seeds are not the same because e.g. the optimizer/NN initialization/trvl split/whatever seed doesn't get reset for the parallel case(?), but then I still don't understand why not only the first but also the second and third replicas have quite good agreement, while this agreement has clearly deteriorated by replicas 9 and 10. If the seeds were not the same, I'd expect the deterioration at replica 2 to be equivalent to all later differences... Also, the edit that it's a problem for the NNPDF4.0 runcard but not for the basic runcard is odd. Is …
In this case, however, this could point to a bug in how the fit is being stopped: the first replica is stopped correctly, the second one ends up a bit farther away from the optimal point, and so on and so forth. The random seeds are the other thing that comes to mind; as @RoyStegeman said, it would make sense that they are the same for the 1st replica but then different for the rest. But I also fail to see why it should be more different for 5 than for 2 (and the first few replicas are very close...). The basic_runcard has …
Well, my first suspicion was that somewhere along the losses or observables, tensors somehow get accumulated along the replica dimension, but then again the basic runcard seems to reproduce the sequential fit...
But are all the important settings the same for the basic runcard and the NNPDF4.0 runcard? I.e. are the changes between the basic runcard you use limited to the choice of datasets, seeds (same seeds, just different values), and preprocessing, while keys such as … I'd like to understand what the important difference between the runcards is that causes the different behaviour.
Both have DIS and DY datasets, positivity constraints, and `same_trvl_per_replica` set to false to test the new functionality. The NNPDF4.0 runcard has integrability layers though, which the basic runcard doesn't have.
Regarding the single-epoch reproducibility test:
Edit: using TensorFlow 2.10 instead of 2.11 solves the reproducibility issue on the cluster too.
Great! Did you understand why? PS: could you please run …
Over the weekend I did an extensive set of regression tests between master (rev. db8b790) and trvl-mask-layers branch (latest rev.). To my untrained eye, results look more or less identical:
Thanks a lot @goord for these comparisons! They look great; as you said, the results are statistically identical. I also see that you've found a way around the TF version linked to the reproducibility issue (?). I guess that after some clean-ups (and …
TODO: the observation data rotation is currently not working correctly, because it applies a masked rotation matrix after the masking layer. This should change to applying the unmasked rotation to the observation data before the masking.
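A small numpy sketch of the intended ordering (hypothetical function and variable names, not the actual layer code): the full, unmasked rotation acts on the observation data first, and the tr/vl mask is applied only afterwards.

```python
import numpy as np

def rotate_then_mask(data, rotation, mask):
    """Apply the unmasked rotation to the full data vector, then select
    the masked (e.g. training) entries."""
    rotated = rotation @ data   # rotation acts on the full, unmasked vector
    return rotated[mask]        # masking happens after the rotation

# toy example: 5 data points, boolean tr/vl mask
data = np.arange(5.0)
rotation = np.eye(5)
mask = np.array([True, False, True, True, False])
print(rotate_then_mask(data, rotation, mask))
```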
When you try to run DIS_diagonal_l2reg_example.yml, it might be that not all datasets declared there exist anymore; you might need to remove some of them to test the runcard (as mentioned earlier today... it would be good to have all the examples tested...).
I refactored the LossInvCovmat to handle the diagonal case in an optimized way; I get no significant numerical changes after 4000 epochs. I ran the black tool, but I get a whole bunch of files I didn't touch that need reformatting... Should I ignore those?
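As an aside, a hedged sketch (not the actual LossInvCovmat code) of why the diagonal case can be optimised: with a diagonal inverse covariance matrix the chi2 reduces to a weighted sum of squared residuals, so the full matrix product can be skipped.

```python
import numpy as np

def chi2_full(residuals, invcovmat):
    # generic case: r^T C^{-1} r
    return residuals @ invcovmat @ residuals

def chi2_diagonal(residuals, invcov_diag):
    # diagonal case: sum_i r_i^2 * (C^{-1})_ii, an O(N) operation
    return np.sum(residuals**2 * invcov_diag)

# toy check that both give the same value when the covmat is diagonal
r = np.array([0.1, -0.2, 0.05])
diag = np.array([4.0, 1.0, 2.0])
assert np.isclose(chi2_full(r, np.diag(diag)), chi2_diagonal(r, diag))
```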
Is this ready for redoing the regressions?
No, we asked for a lot of stuff to be changed. I'd need to redo the review.
I've fixed what I could, but I also find anything related to the preprocessing of data very confusing.
I noticed I introduced a bug in 49904a3 (it got the number of replicas from the batch axis); I've just pushed a fix.
Is this ready now for redoing the regressions (once the tests pass)? It would be great if we could get this merged, and perhaps even #1936, so that we can turn on some faster hyperopt runs over the weekend to get some more data.
Just a final comment from me :P
@scarlehoff, about the flattening: indeed, I guess it was the covmat? You wrote that you would change your suggestion to a comment but then resolved it; should I add a comment there or not?
This is a continuation of pull request #1661, which implements parallel replicas with `same-trvl-seed=false` support. The branch has been migrated to this original repo and the latest master was merged.