[ENH] handling of random_state in clone #279
Comments
In case you are getting deep into the weeds and need to consider independent random numbers in a distributed computing environment, you may need to think about the PRNG itself and not just the seed. FYI: https://www.thesalmons.org/john/random123/papers/random123sc11.pdf
Ouch, not that deep into the weeds. I think we need to deal with the case of single location/env only, and leave pseudo-random seed handling for distributed environments to backends like
@ericjb, I have come to realize that we will indeed probably need a PRNG which guarantees pseudo-random independence with a tree-like hierarchy. As such, the linked paper is exactly of the kind I was looking for.
Pinging @johnsalmon, @moraesmark, @pbelevich regarding the paper: it would be great if there were a tool, possibly with Python bindings, which, for a tree-like structure of sampler objects, can generate independent pseudo-random seeds such that all samplers end up (mutually) independent pseudo-random, if any node in the tree does (the scope of this package is, in principle, all of ML pipelines).
What is unfortunate is that we do not know the size of the tree in advance, so a node doesn't know the number of its children, nor can it communicate with its parents. Otherwise, the solution would be "fairly trivial" by doing the following at the root:
1. compute the number of nodes
2. run a PRNG sequence generator
3. distribute the seeds, for any enumeration, across the tree
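For illustration, the three root-level steps above could look roughly like the sketch below, for the (unrealistic) case where the whole tree is known up front. The nested-dict tree layout and the names `count_nodes` and `assign_seeds` are made up for this example, and drawing distinct seeds from one stdlib `random.Random` stream is only a stand-in: it does not by itself guarantee the pseudo-independence properties discussed in the linked paper.

```python
import random


def count_nodes(tree):
    """Count nodes in a tree given as {name: {child_name: {...}}}."""
    return sum(1 + count_nodes(children) for children in tree.values())


def assign_seeds(tree, root_seed):
    """Return {path: seed} for every node, enumerated in pre-order."""
    n_nodes = count_nodes(tree)                      # step 1: size of the tree
    rng = random.Random(root_seed)                   # step 2: one seeded generator
    seeds = iter(rng.sample(range(2**32), n_nodes))  # distinct 32-bit seeds

    assigned = {}

    def visit(subtree, prefix):
        for name, children in subtree.items():
            path = prefix + (name,)
            assigned[path] = next(seeds)             # step 3: distribute by enumeration
            visit(children, path)

    visit(tree, ())
    return assigned


# hypothetical pipeline layout, purely for illustration
pipeline = {"pipeline": {"bagging": {"clone_0": {}, "clone_1": {}}, "forecaster": {}}}
seeds = assign_seeds(pipeline, root_seed=42)
```

The same `root_seed` always reproduces the same assignment, but adding or removing a node changes the enumeration and hence the seeds of later nodes, which is one reason this only works when the tree is fixed in advance.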
Getting a bit deeper in the woods, one option would be to convolve each call to a dependent random seed with:
That would ensure uniqueness, pseudo-randomness, and pseudo-independence, as long as no line of code contains more than one call, and no two files the dependent seed generator is called from are identical (e.g., something silly like near-empty
For reference and potential use in that, here is random code from stackoverflow that produces the line a function was called from:

    from inspect import currentframe


    def get_linenumber():
        # frame of this function; f_back is the frame of the caller
        cf = currentframe()
        # line number in the caller at which get_linenumber() was invoked
        return cf.f_back.f_lineno
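As a rough sketch of how the call-site idea might be combined with a parent seed: the list of ingredients to mix in is not preserved above, so using the caller's file name and line number here is an assumption, as are the function name `dependent_seed` and the choice of SHA-256 for mixing.

```python
import hashlib
from inspect import currentframe


def dependent_seed(parent_seed):
    """Derive a 32-bit seed from parent_seed and the caller's file name and line.

    Assumption for illustration: call site = (file name, line number).
    """
    caller = currentframe().f_back
    key = f"{parent_seed}:{caller.f_code.co_filename}:{caller.f_lineno}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


seed_a = dependent_seed(42)
seed_b = dependent_seed(42)  # called from a different line, so the key differs
```

Note the limitation already hinted at: two calls from the same line (for example inside a loop) hash the same key and therefore return the same seed.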
Opening an issue to discuss API design around a requirement where independent, yet random-state-fixed copies of an estimator need to be obtained.
An example would be the bootstrap clones discussed here: sktime/sktime#5823 - these should be statistically independent pseudo-random.
Currently, `clone` copies the `random_seed` 1:1, which results in:
- `random_seed=None` results in independent copies, but not pseudo-random fixed ones (each run gives different values)
- `random_seed` set results in pseudo-random fixed copies, but these are value-identical, not statistically independent

Neither meets the requirement above, because that would need to be both pseudo-random fixed and statistically independent (not value-identical).
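To make the two cases concrete, here is a minimal sketch with a toy class. `ToySampler` and its `draw` method are hypothetical stand-ins and not part of the package; only the 1:1 copying of the seed in `clone` mirrors the behaviour described above.

```python
import random


class ToySampler:
    """Hypothetical stand-in for an estimator with a random_seed parameter."""

    def __init__(self, random_seed=None):
        self.random_seed = random_seed

    def clone(self):
        # mimics the behaviour described above: the seed is copied 1:1
        return ToySampler(random_seed=self.random_seed)

    def draw(self, n=3):
        # random.Random(None) seeds itself from OS entropy or the clock
        rng = random.Random(self.random_seed)
        return [rng.random() for _ in range(n)]


# seed set: clones are reproducible, but value-identical
seeded = ToySampler(random_seed=42)
clone_a, clone_b = seeded.clone(), seeded.clone()
identical = clone_a.draw() == clone_b.draw()

# seed None: clones draw different values, but nothing is reproducible across runs
unseeded = ToySampler()
draws_1, draws_2 = unseeded.clone().draw(), unseeded.clone().draw()
```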
In light of the rework of `random_seed` functionality (see #268), it is worth discussing what this should even look like from the API perspective.

A key problem arises if multiple clones are needed: the number of clones needs to be known in advance, or at least the clones need to be sampled in a chain, to obtain dependent seeds which give rise to pseudo-random independent copies.
Further, we cannot change the default behaviour of `clone` and its current parameters, as it is an interface point of high importance.

Options I can think of:
- `clone(deep=True, random_seed="exact_copy", n_clones=None)`
- `clone_random(deep=True, n_clones=1)`
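Purely as a sketch of what the second option might do, written as a free function over a toy class rather than as a method: `ToySampler`, `derive_seed`, and the hash-based seed derivation are all assumptions for illustration, the `deep` argument is left out, and hashing `(seed, index)` gives distinct reproducible seeds without any formal guarantee of statistical independence.

```python
import hashlib
import random


class ToySampler:
    """Hypothetical stand-in for an estimator with a random_seed parameter."""

    def __init__(self, random_seed=None):
        self.random_seed = random_seed

    def draw(self, n=3):
        rng = random.Random(self.random_seed)
        return [rng.random() for _ in range(n)]


def derive_seed(parent_seed, index):
    """Map (parent_seed, index) to a 32-bit child seed via SHA-256 (assumed scheme)."""
    key = f"{parent_seed}:{index}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def clone_random(obj, n_clones=1):
    """Hypothetical: return n_clones copies with distinct, reproducible seeds."""
    if obj.random_seed is None:
        # nothing to fix: unseeded copies already differ from run to run
        return [ToySampler() for _ in range(n_clones)]
    return [ToySampler(derive_seed(obj.random_seed, i)) for i in range(n_clones)]


clones = clone_random(ToySampler(random_seed=42), n_clones=3)
child_seeds = [c.random_seed for c in clones]
```

Because each child seed depends only on the parent seed and the clone index, asking for more clones later does not change the seeds of the earlier ones, so `n_clones` would not have to be known in advance under this scheme.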
FYI @ericjb, @jmwhyte, @tpvasconcelos - since we all discussed either `clone` or `random_seed` recently.