Clarification on the calculation of ranks #57

Open
maxbiostat opened this issue Sep 24, 2021 · 6 comments

Comments

@maxbiostat

I'm adapting some of your (awesome!) code for use with trees and I stumbled upon something I don't quite understand. What is

# Source of SBC:::calculate_ranks_draws_matrix
calculate_ranks_draws_matrix <- function(params, dm)
{
    max_rank <- posterior::ndraws(dm)
    less_matrix <- sweep(dm, MARGIN = 2, STATS = params, FUN = "<")
    rank_min <- colSums(less_matrix)
    equal_matrix <- sweep(dm, MARGIN = 2, STATS = params, FUN = "==")
    rank_range <- colSums(equal_matrix)
    ranks <- rank_min + rdunif(posterior::nvariables(dm), a = 0, 
        b = rank_range)
    attr(ranks, "max_rank") <- max_rank
    ranks
}

doing, exactly?
This seems to imply that the ranks are random, which I don't understand.

@maxbiostat
Author

I'm guessing this resolves ties at random, and that's fine. I did just run into an example where this leads to essentially random rank histograms, but we can talk about that later.

@hyunjimoon
Owner

hyunjimoon commented Sep 25, 2021

I guess this is related to breaking ties for ranks. This is especially important for discrete parameters. See: https://github.com/hyunjimoon/SBC/wiki/SBC-FAQ#rank-smoothing
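For readers landing here, a minimal sketch of what random tie-breaking means for a single variable. This is base R only; `rank_with_random_ties` is a hypothetical helper written for illustration and is not the SBC package's API.

```r
# Rank of a simulated value among posterior draws, with ties broken at random.
# `rank_with_random_ties` is a hypothetical helper, not part of the SBC package.
rank_with_random_ties <- function(true_value, draws) {
  n_less  <- sum(draws < true_value)   # draws strictly below the simulated value
  n_equal <- sum(draws == true_value)  # draws tied with the simulated value
  # Ties carry no ordering information, so pick uniformly among the
  # n_equal + 1 positions the simulated value could take within the tied draws.
  n_less + sample.int(n_equal + 1, size = 1) - 1
}

set.seed(1)
draws <- c(0, 0, 1, 1, 1, 2)  # six posterior draws of a discrete quantity
ranks <- replicate(1000, rank_with_random_ties(1, draws))
range(ranks)                  # two draws are below 1 and three are tied with it
```

With no ties (for example a simulated value of 0.5 here) the rank is deterministic; the randomness only enters through the tied draws.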

@martinmodrak
Collaborator

@hyunjimoon is right. It also appears you are working with a somewhat old version of the code: the function (and especially the tie-breaking) is documented in more recent versions. Could you be more specific about the use case where this causes problems? My understanding was that the tie-breaking is pretty safe, as ties just imply a lack of information on ordering (and hence randomizing cannot hurt), but I might easily be mistaken. Would the other approach to tie-breaking (linked from the FAQ) make more sense for your use case?

Additionally, I just finished a vignette on how to connect new algorithms to the SBC package framework (https://hyunjimoon.github.io/SBC/articles/implementing_backends.html). Maybe trees require some additional support that is currently hard to achieve, but I wanted to show that, IMHO, it should not be impossible for you to work completely within the framework of this package (if you want to).

@maxbiostat
Author

Also appears you are working with a somewhat old version of the code

Weird. I just installed from source and the code on my machine remains the same. But you are right that the version in this repo has more information.

Could you be more specific about the use case where this causes problems?

Sure, but I wouldn't call it 'causing problems' so much as 'I'm not sure I understand what this means'.

The situation is that I have a discrete functional which has very limited variation in the MCMC draws:

image

This ends up leading to loads of ties when one computes the ranks. I attach a file that contains 100 runs with the posterior draws and the simulated (prior) draws of the Robinson-Foulds distance to an anchor tree, which is the functional in question.
Robinson-Foulds_distances.csv

To be clear, I think this is evidence that RF is a poor metric for SBC. But it might be interesting for you to have a look and think about what this means for discrete functionals in general.
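To make the effect concrete, here is a small simulated sketch (it does not use the attached CSV). It assumes an almost-binary quantity, modelled here as a 0/1 variable with probability `prob_one`, and a posterior that carries no more information than the prior. All names and numbers are assumptions chosen for illustration.

```r
set.seed(123)
n_sims   <- 200   # number of SBC simulations (assumed)
n_draws  <- 100   # posterior draws per fit (assumed)
prob_one <- 0.5   # assumed probability that the 0/1 quantity equals 1

tie_fraction <- numeric(n_sims)
for (i in seq_len(n_sims)) {
  simulated <- rbinom(1, size = 1, prob = prob_one)        # "prior" value
  draws     <- rbinom(n_draws, size = 1, prob = prob_one)  # stand-in posterior draws
  # Share of draws that are exactly tied with the simulated value
  tie_fraction[i] <- mean(draws == simulated)
}

mean(tie_fraction)  # average share of draws whose ordering is decided by tie-breaking
```

When a large share of the draws is tied, the rank mostly records which of the two values was simulated, and its position within that block is set by the random tie-breaking.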

We can continue this discussion via email/Discourse too, if you want.

@martinmodrak
Collaborator

To be clear, I think this is evidence that RF is a poor metric for SBC. But it might be interesting for you to have a look and think about what this means for discrete functionals in general.

Completely agree - since the RF in this case is almost a binary variable, it is close to the least informative variable one can use (not only for SBC). I think there are potential improvements for SBC with discrete variables if you can either:

  1. get an analytical/exact form of the prior distribution (i.e. the probability for each value)
  2. get the posterior distribution of probabilities for each value (not the discrete variable itself)

If both hold, you could probably use the probabilities for individual categories as continuous values for SBC and thus increase the information content. And I think one can do something useful even if just one holds. Both 1) and 2) commonly hold when you marginalize out discrete variables in Stan programs, but from quickly skimming the Wiki page for RF distance, I would be surprised if you could get 2)... Maybe 1) would be possible with a carefully chosen base tree?

In any case, this is just an unconfirmed hunch and I haven't tested whether the ideas actually work better than ranks in practice. Thinking more thoroughly about this and doing some experiments is on my SBC TODO list, but I admit it is currently in a relatively low position...

If you don't get both probabilities as continuous values, but the range of the discrete outcomes is bounded, one idea is to use something like the chi-squared test to check that the prior and all the posteriors bundled together have the same distribution. I have no idea/data on whether this would have more "power" than the rank approach. Another idea would be to convert the discrete values into numerical probabilities + uncertainty for each fit, somehow aggregate those, and compare them to the prior.
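A rough sketch of the chi-squared idea using `stats::chisq.test`, under stated assumptions: the support `values`, the probabilities, and the simulated draws are all made up for illustration. It keeps a single posterior draw per fit so that the counts are independent; pooling every draw from every fit would use more information but would break the independence assumption behind the test.

```r
set.seed(42)
values <- 0:3                     # assumed bounded support of the discrete quantity
probs  <- c(0.4, 0.3, 0.2, 0.1)   # assumed prior probabilities (illustration only)
n_sims <- 200                     # number of SBC simulations / fits (assumed)

# Simulated (prior) values, one per simulation
prior_draws <- sample(values, n_sims, replace = TRUE, prob = probs)
# Stand-in for one posterior draw per fit; drawn from the same distribution here,
# which is what a well-calibrated algorithm should produce on average
posterior_draws <- sample(values, n_sims, replace = TRUE, prob = probs)

# 2 x K table of counts; factor() keeps categories with zero counts
counts <- rbind(
  prior     = table(factor(prior_draws, levels = values)),
  posterior = table(factor(posterior_draws, levels = values))
)

test <- chisq.test(counts)  # test of homogeneity between the two rows
test$p.value
```

A small p-value would hint at a mismatch between the prior and the data-averaged posterior; how this compares in power to the rank approach is untested here.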

@TeemuSailynoja might have more to say about this, but not sure if he's working on discrete variables as well.

@hyunjimoon
Owner

hyunjimoon commented Sep 28, 2021

I think this is evidence that RF is a poor metric for SBC.

This is why I've been interested in metrics that compare the prior and posterior samples as a whole. Much information is lost in ranks. Another approach could be to approximate the parameters of interest with continuous hyperparameters (e.g. a Gaussian mixture) and compare the ranks of those. I wonder whether the hyperparameters' ranks can substitute for the ranks of a parameter in this case. By analogy, this is like comparing p instead of y for y ~ Bin(n, p), or comparing mu1, mu2 instead of theta when theta is discrete and mu ~ Gaussian mixture (mu1, mu2, sigma). This PR is pushing forward in this direction (especially this file). I welcome all forms of feedback and collaboration here, on Discourse, or by mail!
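A toy sketch of the "compare p instead of y" analogy, assuming a conjugate Beta-Binomial setup so the posterior can be sampled exactly in base R. The discrete quantity `z`, the sizes, and the Beta(1, 1) prior are assumptions for illustration and are not taken from the PR mentioned above.

```r
set.seed(7)
n_sims  <- 200   # number of SBC simulations (assumed)
n_draws <- 100   # posterior draws per fit (assumed)
n_obs   <- 20    # binomial trials in the observed data (assumed)
size    <- 3     # trials for the discrete derived quantity z (assumed)

ties_p <- integer(n_sims)
ties_z <- integer(n_sims)
for (i in seq_len(n_sims)) {
  p_true <- rbeta(1, 1, 1)            # continuous parameter drawn from a Beta(1, 1) prior
  z_true <- rbinom(1, size, p_true)   # discrete quantity z ~ Bin(size, p)
  y      <- rbinom(1, n_obs, p_true)  # observed data y ~ Bin(n_obs, p)

  # Exact conjugate posterior for p, then z drawn conditionally on each p draw
  p_draws <- rbeta(n_draws, 1 + y, 1 + n_obs - y)
  z_draws <- rbinom(n_draws, size, p_draws)

  ties_p[i] <- sum(p_draws == p_true)  # ties for the continuous quantity
  ties_z[i] <- sum(z_draws == z_true)  # ties for the discrete quantity
}

c(mean_ties_p = mean(ties_p), mean_ties_z = mean(ties_z))
```

Ranks of `p` involve essentially no ties, while ranks of `z` rely heavily on tie-breaking, which is the sense in which the continuous quantity retains more information.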
