Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Something is wrong with the seed #82

Closed
a-torgovitsky opened this issue Sep 8, 2020 · 8 comments
Closed

Something is wrong with the seed #82

a-torgovitsky opened this issue Sep 8, 2020 · 8 comments
Assignees
Labels
bug Something isn't working

Comments

@a-torgovitsky
Copy link
Collaborator

Something inside fsst, perhaps other routines, is messing with the seed in a fundamental way.

rm(list = ls())

devtools::load_all()
source("../lpinfer/example/dgp_mixedlogit.R")

dgp <- mixedlogit_dgp()
lpm <- lpmodel(A.obs = mixedlogit_Aobs(dgp),
               beta.obs = function(d) mixedlogit_betaobs(d, dgp),
               A.shp = rep(1, nrow(dgp$vdist)),
               beta.shp = 1,
               A.tgt = mixedlogit_Atgt_dfelast(dgp, w2eval = 1, eeval = -1))

set.seed(1)
data <- mixedlogit_draw(dgp, n = 2000)
print(estbounds(data = data, lpm))

set.seed(1)
data <- mixedlogit_draw(dgp, n = 2000)
print(estbounds(data = data, lpm)) # same

print(fsst(data = data, lpm, .5)) # the seed gets messed up in here?

set.seed(1)
data <- mixedlogit_draw(dgp, n = 2000)
print(estbounds(data = data, lpm)) # should be the same, but isn't

Gives this

Loading lpinfer
Estimated bounds: [0.38761, 0.54898] 
Estimated bounds: [0.38761, 0.54898] 
p-value (by data-driven 'lambda'): 0.88                                    
Estimated bounds: [0.47243, 0.47504]

I think we don't want anything in lpinfer to mess with the seed.
The user should just set that on their own.
I know we talked about this before, but can't remember what we decided.
Let me know if I am forgetting an important case.

@a-torgovitsky a-torgovitsky added the bug Something isn't working label Sep 8, 2020
@conroylau
Copy link
Owner

I cannot reproduce the exact bounds as above, but I can reproduce the problem where the estimated bounds after running fsst are different from the original ones.

The reason for this is due to the step of extracting .Random.seed.
In this step, I first set RNGkind("L'Ecuyer-CMRG") and then extract .Random.seed.
The reason for doing so is that the future.seed argument in the future_lapply function only accepts a logical, or an integer (of length 1 or 7), or a list of length(X) with pre-generated random seeds.
The default of RNGkind by the user may give a .Random.seed with length that is not equal to either 1 or 7, which would lead to an error if I pass it directly to the future.seed argument of future_lapply.
For instance, if the user uses RNGkind("Wichmann-Hill"), then the length of .Random.seed is 4:

RNGkind("Wichmann-Hill")
set.seed(2)
.Random.seed
#> [1] 10400 21758  7530 10264

Thus, I propose the following method in extracting the seed instead:

  • If the length of .Random.seed is 1 or 7, directly pass it to the future.seed argument of future_lapply.
  • Otherwise, pass the sum of the .Random.seed vector to the future.seed argument of future_lapply.

In this method, I do not need to change RNGkind while making sure that the object passed to the future.seed argument of future_lapply has length either 1 or 7.
May I know do you think this is okay? Thanks!

@a-torgovitsky
Copy link
Collaborator Author

I guess this is ok, but it seems like a hack, which means it is probably prone to causing more unforeseen problems, especially since neither of us is an expert on the specifics of parallel programming.

If we want to go this route, wouldn't it be easier just to generate a set of random seeds in serial first? These should be the same each draw if the user has set the seed. Then these can be passed to future_lapply with set.seed. That seems much easier than trying to pull things from the internal .Random.seed.


As I said in #72, I really think it would be a good idea to dig into the discussion about this issue in the future package. This is a complicated issue, but one that comes up in every parallel implementation, since it is essential for reproducibility. There is almost certainly a good solution that does not require messing around with internals in this way.

I looked at their GitHub issues and found many posts (both in their repo and elsewhere) that seem like they might provide clues:

So reading some of these would show us the right way to do things.

@conroylau
Copy link
Owner

I see. For generating a set of random seeds in serial, do you mean first generating a sequence of numbers in the beginning (say via runif) and then pass each of them to future_lapply as seeds when needed?

I will also read the discussions in the future and related packages to see if there is another way to do it. Thanks!

@a-torgovitsky
Copy link
Collaborator Author

I see. For generating a set of random seeds in serial, do you mean first generating a sequence of numbers in the beginning (say via runif) and then pass each of them to future_lapply as seeds when needed?

Yes exactly

@a-torgovitsky
Copy link
Collaborator Author

Although do they even need to be random numbers?
That would be something to check. Maybe they can be deterministic and the user will still get a different result if they change set.seed before running a procedure such as fsst. That would also be ok and much easier.

conroylau added a commit that referenced this issue Sep 9, 2020
@conroylau
Copy link
Owner

Just fixed the problem! After reading various posts, I find that the problem can be solved in a much straightforward way by simply setting future.seed = TRUE without altering the RNGkind as in my previous approach.

Using the updated code, the following result is obtained by running the above code:

Estimated bounds: [0.39067, 0.55121] 
Estimated bounds: [0.39067, 0.55121] 
p-value (by data-driven 'lambda'): 0.88                                                   
Estimated bounds: [0.39067, 0.55121] 

The three set of bounds will still be the same if I set other RNG types. For instance, if I set RNGkind("Wichmann-Hill") before running the code, the following result can be obtained:

Estimated bounds: [0.4576, 0.45883] 
Estimated bounds: [0.4576, 0.45883] 
p-value (by data-driven 'lambda'): 0.55                                                   
Estimated bounds: [0.4576, 0.45883] 

Thanks!

@a-torgovitsky
Copy link
Collaborator Author

That's great!
Is there a particular post that you read where this was clear?
I would like to check it out to make sure I understand.

@conroylau
Copy link
Owner

I find the third link that you shared above is one of the most relevant and updated posts.
It also mentions about the speed issue, which is quite interesting too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants