
Support blavaan #69

Open
martinmodrak opened this issue Mar 17, 2022 · 4 comments

Comments

@martinmodrak
Collaborator

@maugavilla discussed at #35

I am working to implement SBC with the blavaan package. Here we have pre-compiled Stan models, usually large models with a lot of parameters. Within blavaan I can generate data sets from the priors, so I could skip the generator function, for example. But I can't include my list of data sets, as it is not an SBC_datasets type object.

From this, two questions and possible additions:

how to add a list of data sets that was generated by another function? Or how to turn this list into an SBC_datasets object?
how to ask it to save only a few parameters? I am not interested in all the parameters from this large model

Appreciate any guidelines

So it appears both a blavaan-specific generator and a blavaan backend could be useful.

@martinmodrak
Collaborator Author

But to get things rolling for @maugavilla:

My answer would differ depending on the end goal: do you want to build a general integration of the blavaan and SBC packages? Or are you just trying to find the simplest possible way to run SBC for a single specific blavaan model?

I'll note that if you have a list where each element is something that can be passed as the data argument to blavaan, then that's exactly what is needed for the generated argument of SBC_datasets(). Additionally, you need the "true" values of all the unobserved variables stored as a draws_matrix object (the variables argument to SBC_datasets()). The SBC package will directly handle only the variables included here. Still, the underlying Stan fits will store all variables, so you may want to call compute_SBC with keep_fits = FALSE to reduce the memory footprint if you run a lot of simulations.
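To make that concrete, here is a minimal sketch (the object names `my_sim_list`, `true_values` and `some_backend` are hypothetical; the argument names follow the description above):

```r
library(SBC)
library(posterior)

# my_sim_list: hypothetical list where each element can be passed as the
#   `data` argument to blavaan
# true_values: hypothetical numeric matrix with one row per simulation and
#   one column per unobserved variable (e.g. "lambda[1]", "theta[2]", ...)
datasets <- SBC_datasets(
  variables = posterior::as_draws_matrix(true_values),
  generated = my_sim_list
)

# Later, once a backend for fitting exists:
# results <- compute_SBC(datasets, some_backend, keep_fits = FALSE)
```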

If more general support is the goal, then I think the most sensible way would be to wrap the already existing simulation code in a new type of generator object. If you want to generate data using Stan sampling while ignoring the likelihood (which appears to be what blavaan supports), then the way SBC_generator_brms is implemented might provide some hints (https://github.com/hyunjimoon/SBC/blob/master/R/datasets.R#L264).
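Not an implementation, just a sketch of the shape such a generator could take, assuming SBC dispatches the generate_datasets() generic on the generator's class (the way SBC_generator_brms does); simulate_from_prior() and the internal structure of its result are placeholders for whatever blavaan code actually does the simulation:

```r
library(SBC)
library(posterior)

# Constructor: stores everything needed to simulate later
SBC_generator_blavaan <- function(model, ...) {
  structure(list(model = model, args = list(...)),
            class = "SBC_generator_blavaan")
}

# Method for the generate_datasets() generic (assumed S3 dispatch,
# mirroring how SBC_generator_brms is wired up)
generate_datasets.SBC_generator_blavaan <- function(generator, n_sims) {
  # simulate_from_prior() is a placeholder: draw parameters from the prior
  # and generate one dataset, returning list(true_values = ..., data = ...)
  sims <- lapply(seq_len(n_sims), function(i) {
    simulate_from_prior(generator$model, generator$args)
  })
  SBC_datasets(
    variables = posterior::as_draws_matrix(
      do.call(rbind, lapply(sims, `[[`, "true_values"))
    ),
    generated = lapply(sims, `[[`, "data")
  )
}
```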

Similarly, one would then create a new backend, as discussed at https://hyunjimoon.github.io/SBC/articles/implementing_backends.html
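Again only a sketch: the generic names below (SBC_fit, SBC_fit_to_draws_matrix) and the SBC_fit signature are what I remember from that vignette, so please verify against it, and the blavaan-side calls (blavaan(), blavInspect(fit, "mcmc")) are approximate placeholders:

```r
SBC_backend_blavaan <- function(model_spec, ...) {
  structure(list(model_spec = model_spec, args = list(...)),
            class = "SBC_backend_blavaan")
}

SBC_fit.SBC_backend_blavaan <- function(backend, generated, cores) {
  # Fit the model to one simulated dataset; the exact call depends on how
  # the model is specified (blavaan()/bsem()/bcfa(), extra arguments, ...)
  do.call(blavaan::blavaan,
          c(list(backend$model_spec, data = generated), backend$args))
}

SBC_fit_to_draws_matrix.SBC_backend_blavaan <- function(fit) {
  # Placeholder: extract posterior draws from the blavaan fit and convert
  # them to a draws_matrix; blavInspect(fit, "mcmc") may be the right
  # accessor, but that is an assumption worth double-checking
  posterior::as_draws_matrix(blavaan::blavInspect(fit, "mcmc"))
}
```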

If you just need a quick hack to get going with a single model, then it might require less effort to use the built-in Stan backend and convert all the simulated datasets to the format blavaan uses for Stan (brms has the make_standata function for this; I am not sure whether blavaan exposes something similar).
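Purely as an analogy with brms (placeholder formula, and `my_sim_list` is the hypothetical list of simulated datasets from above), the conversion pattern would look roughly like this; whether blavaan exposes an equivalent of make_standata is the open question:

```r
# Convert each simulated dataset to the list format the Stan model expects,
# so those lists can be used as `generated` with a plain Stan backend
stan_data_list <- lapply(my_sim_list, function(d) {
  brms::make_standata(y ~ x, data = d)  # formula is a placeholder
})
```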

Does that make sense?

I'll definitely be happy to help you move further if something's unclear.

@maugavilla

@martinmodrak in the short term we are putting together working examples and figuring out how to evaluate priors in blavaan. But the idea is to integrate the use of the SBC package with blavaan.

I have saved the data sets in a list object as you suggest here. But I have not saved the "true" values; I will work on integrating this into my example.

With this I should have enough to get my working example running. Once I get this example working, I will try to develop the blavaan generator and backend.

Will probably get back to this thread when I start working on the generator and backend.

thank you

@maugavilla

Hi

As I have started working on this, I ran into an issue/doubt. In blavaan the parameters are estimated in the model/transformed parameters blocks, and then they are adjusted in the generated quantities block; the adjusted values are the ones exported for interpretation.

So, when doing SBC, should it be done with the unadjusted parameters, or with the adjusted ones from the generated quantities block?

Thanks

@martinmodrak
Collaborator Author

That's a good question! In most cases there won't be a general, straightforward answer, but several considerations should influence the decision. They boil down to: what is natural to simulate, how much you trust your GQ code, and whether the focus is on testing the blavaan package itself (which would likely be done somewhat centrally by people with good knowledge of the package, not very frequently, and with a lot of computational resources) or on letting users who mostly trust the package check that they didn't introduce an issue in their specific model (which would be less centralized, more frequent, and have fewer computational resources).

There is also one simple case: if the adjustment is monotonic (it preserves the ordering of all values, or reverses the ordering of all values), SBC will give identical results for the adjusted and the unadjusted value, so the choice is irrelevant.
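A tiny illustration of why (the numbers are made up for the demo): SBC only uses the rank of the true value among the posterior draws, and a strictly increasing transform cannot change that rank.

```r
set.seed(1)
posterior_draws <- rnorm(1000)
true_value <- 0.3
sum(posterior_draws < true_value)            # rank on the original scale
sum(exp(posterior_draws) < exp(true_value))  # identical rank after exp()
```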

Now for the considerations in detail:

What is simulated? It is usually beneficial to keep the simulator code very simple, so that it is unlikely to contain bugs (which would trigger false positives in SBC). Values that are more directly/easily represented in the way you simulate your data (or in the way you assume others will simulate their data) are thus usually better candidates for SBC.

Do you trust your GQ code? If the GQ code is very simple or you have other reasons to trust it (e.g. there is a simple way to test the GQ code directly), there is no strong motivation to include it in SBC. If SBC is your only shot at verifying the GQ code, it might be better to use the adjusted values.

Who will run SBC on your models, and why? When SBC is run as part of a modelling workflow (as opposed to package development), the user is usually quite constrained in the computational resources they can devote to SBC. We advise people to use the empirical_coverage function to see the potential remaining mismatch in coverage after running a given number of simulations (see the sketch after this list). This value is more useful if the variables used in SBC are directly meaningful to the user, so if modellers are the ones running SBC, that is an argument for including the adjusted variables. If, on the other hand, the focus is on package development, one can expect that empirical_coverage is less useful (you want to run a ton of simulations anyway) and that the user understands even the unadjusted quantities quite well.

In some cases, it might also make sense to include both the "raw" and the "transformed" or "adjusted" values. There is a potential risk of increasing false positives when including more quantities without a multiple-testing correction, but if the added quantities are closely correlated with others, the risk is low. If some form of multiple-testing correction is used, adding highly correlated quantities decreases power / increases false negatives. At this point we don't have a good understanding of how to do proper error control when there are dependencies between the variables used in SBC.
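As a concrete example of the empirical_coverage check mentioned above (I am recalling the signature from memory, so treat it as approximate; the width values are arbitrary):

```r
# Hypothetical usage, assuming `results` is the output of compute_SBC();
# `width` gives the central posterior interval widths whose coverage we
# want to bound given the number of simulations run so far
empirical_coverage(results$stats, width = c(0.5, 0.8, 0.95))
```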

Hope that's enough to help you make a good decision.
