latent_values_incremental_logp can call base_relation->incorporate with latent values that are out of distribution's set of valid values #216

Closed
ThomasColthurst opened this issue Sep 30, 2024 · 2 comments

@ThomasColthurst

Lines 110-112 of transition_latent_value.hh, inside latent_values_incremental_logp, are:

```cpp
for (const ValueType& v : latent_values) {
  // Incorporate the latent/noisy values for this candidate.
  base_relation->incorporate_to_cluster(base_items, v);
```

When the distribution of the base_relation is something like StringCat, this can crash when the latent value is outside the distribution's set of valid values (for example, a string that is not on the StringCat's list of valid strings).
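To make the failure mode concrete, here is a minimal, self-contained sketch (not the repo's actual StringCat implementation; the class and its members are illustrative) of a categorical-over-strings distribution whose incorporate asserts set membership, so an out-of-set latent value aborts:

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy stand-in for a StringCat-like distribution: it only accepts
// observations drawn from a fixed set of valid strings.
struct ToyStringCat {
  std::set<std::string> valid;

  void incorporate(const std::string& x) {
    // A real distribution would update sufficient statistics here; the point
    // is that an out-of-set value trips the assertion and crashes.
    assert(valid.count(x) == 1 && "value not in the distribution's support");
  }
};

int main() {
  ToyStringCat cat{{"low", "medium", "high"}};
  cat.incorporate("high");     // fine: in the support
  cat.incorporate("highest");  // proposed latent value outside the support: aborts
}
```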

The latent values are computed on lines 51-57 of transition_latent_value.hh:
```cpp
std::vector<ValueType> latent_value_candidates;
for (auto [name, rel] : noisy_relations) {
  for (const auto& [i, em] : rel->get_emission_clusters()) {
    latent_value_candidates.push_back(
        em->propose_clean(all_noisy_observations, prng));
  }
}
```

We need to augment this procedure so that it takes the base_relation's distribution into account. One possibility would be to add a "nearest" method to distributions/base.hh::Distribution that returns the nearest valid value given a possibly invalid one. This method can default to the identity function, and we can override it for the distributions that need it. (A sketch of this interface follows the list below.)

(I believe the distributions that would need it are:
DistributionAdapter
DistributionAdapter
DirichletCategorical
StringCat
StringNat
)
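A minimal sketch of the proposed interface, assuming a templated Distribution base class roughly along the lines of distributions/base.hh (the names and signatures below are assumptions, not the repo's actual API): nearest() defaults to the identity, and a StringCat-style subclass overrides it to map a proposed value onto its support.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Sketch only: names and signatures are assumptions, not the actual API.
template <typename SampleType>
class Distribution {
 public:
  virtual ~Distribution() = default;

  // Default: treat the proposed value as already valid and return it unchanged.
  virtual SampleType nearest(const SampleType& x) const { return x; }
};

// A StringCat-style distribution over a fixed list of strings overrides
// nearest() to snap a possibly-invalid proposal onto its support, here by
// picking the valid string with the longest common prefix (a stand-in for
// whatever string distance the real override would use).
class StringCatSketch : public Distribution<std::string> {
 public:
  explicit StringCatSketch(std::vector<std::string> strings)
      : strings_(std::move(strings)) {
    assert(!strings_.empty());
  }

  std::string nearest(const std::string& x) const override {
    const std::string* best = &strings_[0];
    size_t best_score = 0;
    for (const std::string& s : strings_) {
      if (s == x) return s;  // already a valid value
      size_t k = 0;
      while (k < s.size() && k < x.size() && s[k] == x[k]) ++k;
      if (k >= best_score) {
        best_score = k;
        best = &s;
      }
    }
    return *best;
  }

 private:
  std::vector<std::string> strings_;
};
```

The candidate loop on lines 51-57 could then run each em->propose_clean(...) result through the base relation's nearest() before pushing it back, so that latent_values_incremental_logp never incorporates an out-of-support value.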

Assigning to Emily to approve this design or suggest another. Assigning to Thomas to implement.

@emilyfertig

I think this design sounds good. In the future it might be nice to return the top k nearest values, or to sample from a distribution centered around the nearest one, but what you proposed is a good initial solution. (For distributions that take a finite number of values, it would be straightforward to compute the logp of the noisy observation given each valid value, and then sample from that or return the top k; a rough sketch is below.) I'll think more about other approaches that we (or someone else) could consider longer term.
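For the finite-support case, a rough sketch of the idea (the names here are illustrative, not the repo's API): score every valid base value by the emission log-probability of the noisy observation given that value, then sample from the normalized weights (or keep the top k).

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <string>
#include <vector>

// Stand-in for the emission model's log-likelihood of the noisy observation
// given a candidate clean value; the real code would call the Emission
// cluster's logp instead.
double emission_logp(const std::string& clean, const std::string& noisy) {
  return clean == noisy ? 0.0 : -1.0;  // dummy likelihood for the sketch
}

// Sample one clean value from a finite support, weighted by how well each
// valid value explains the noisy observation.
std::string sample_clean(const std::vector<std::string>& support,
                         const std::string& noisy, std::mt19937& prng) {
  std::vector<double> logps;
  for (const std::string& v : support) {
    logps.push_back(emission_logp(v, noisy));
  }

  // Convert log-probabilities to normalized weights, subtracting the max
  // for numerical stability.
  double max_lp = *std::max_element(logps.begin(), logps.end());
  std::vector<double> weights;
  for (double lp : logps) {
    weights.push_back(std::exp(lp - max_lp));
  }

  std::discrete_distribution<size_t> pick(weights.begin(), weights.end());
  return support[pick(prng)];
}
```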

@ThomasColthurst
Author

Fixed in #224
