latent_values_incremental_logp can call base_relation->incorporate with latent values that are out of distribution's set of valid values #216

Closed
ThomasColthurst opened this issue Sep 30, 2024 · 2 comments

@ThomasColthurst

Lines 110-112 of transition_latent_value.hh, inside latent_values_incremental_logp, are:

```cpp
for (const ValueType& v : latent_values) {
  // Incorporate the latent/noisy values for this candidate.
  base_relation->incorporate_to_cluster(base_items, v);
```

When the distribution of the base_relation is something like StringCat, this can crash when the latent value is outside the distribution's set of valid values (for example, a string that is not on the StringCat's list of valid strings).
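To make the failure mode concrete, here is a minimal, self-contained sketch (not the repo's actual StringCat implementation; the class and its members are illustrative) of a categorical-over-strings distribution whose incorporate asserts set membership, so an out-of-set latent value aborts:

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy stand-in for a StringCat-like distribution: it only accepts
// observations drawn from a fixed set of valid strings.
struct ToyStringCat {
  std::set<std::string> valid;

  void incorporate(const std::string& x) {
    // A real distribution would update sufficient statistics here; the point
    // is that an out-of-set value trips the assertion and crashes.
    assert(valid.count(x) == 1 && "value not in the distribution's support");
  }
};

int main() {
  ToyStringCat cat{{"low", "medium", "high"}};
  cat.incorporate("high");     // fine: in the support
  cat.incorporate("highest");  // proposed latent value outside the support: aborts
}
```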

The latent values are computed on lines 51-57 of transition_latent_value.hh:
```cpp
std::vector<ValueType> latent_value_candidates;
for (auto [name, rel] : noisy_relations) {
  for (const auto& [i, em] : rel->get_emission_clusters()) {
    latent_value_candidates.push_back(
        em->propose_clean(all_noisy_observations, prng));
  }
}
```

We need to augment this procedure so that it takes the base_relation's distribution into account. One possibility would be to add a "nearest" method to distributions/base.hh::Distribution that returns the nearest valid value given a possibly invalid one. This method can default to the identity function, and we can override it for the distributions that need it. (A sketch of this interface follows the list below.)

(I believe the distributions that would need it are:
DistributionAdapter
DistributionAdapter
DirichletCategorical
StringCat
StringNat
)
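A minimal sketch of the proposed interface, assuming a templated Distribution base class roughly along the lines of distributions/base.hh (the names and signatures below are assumptions, not the repo's actual API): nearest() defaults to the identity, and a StringCat-style subclass overrides it to map a proposed value onto its support.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Sketch only: names and signatures are assumptions, not the actual API.
template <typename SampleType>
class Distribution {
 public:
  virtual ~Distribution() = default;

  // Default: treat the proposed value as already valid and return it unchanged.
  virtual SampleType nearest(const SampleType& x) const { return x; }
};

// A StringCat-style distribution over a fixed list of strings overrides
// nearest() to snap a possibly-invalid proposal onto its support, here by
// picking the valid string with the longest common prefix (a stand-in for
// whatever string distance the real override would use).
class StringCatSketch : public Distribution<std::string> {
 public:
  explicit StringCatSketch(std::vector<std::string> strings)
      : strings_(std::move(strings)) {
    assert(!strings_.empty());
  }

  std::string nearest(const std::string& x) const override {
    const std::string* best = &strings_[0];
    size_t best_score = 0;
    for (const std::string& s : strings_) {
      if (s == x) return s;  // already a valid value
      size_t k = 0;
      while (k < s.size() && k < x.size() && s[k] == x[k]) ++k;
      if (k >= best_score) {
        best_score = k;
        best = &s;
      }
    }
    return *best;
  }

 private:
  std::vector<std::string> strings_;
};
```

The candidate loop on lines 51-57 could then run each em->propose_clean(...) result through the base relation's nearest() before pushing it back, so that latent_values_incremental_logp never incorporates an out-of-support value.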

Assigning to Emily to approve this design or suggest another. Assigning to Thomas to implement.

@emilyfertig

I think this design sounds good. In the future it might be nice to return the top k nearest values, or to sample from a distribution centered around the nearest one, but what you proposed is a good initial solution. (For distributions that take a finite number of values, it would be straightforward to compute the logp of the noisy observation given each valid value, and then sample from that or return the top k; a rough sketch is below.) I'll think more about other approaches that we (or someone else) could consider longer term.
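For the finite-support case, a rough sketch of the idea (the names here are illustrative, not the repo's API): score every valid base value by the emission log-probability of the noisy observation given that value, then sample from the normalized weights (or keep the top k).

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <string>
#include <vector>

// Stand-in for the emission model's log-likelihood of the noisy observation
// given a candidate clean value; the real code would call the Emission
// cluster's logp instead.
double emission_logp(const std::string& clean, const std::string& noisy) {
  return clean == noisy ? 0.0 : -1.0;  // dummy likelihood for the sketch
}

// Sample one clean value from a finite support, weighted by how well each
// valid value explains the noisy observation.
std::string sample_clean(const std::vector<std::string>& support,
                         const std::string& noisy, std::mt19937& prng) {
  std::vector<double> logps;
  for (const std::string& v : support) {
    logps.push_back(emission_logp(v, noisy));
  }

  // Convert log-probabilities to normalized weights, subtracting the max
  // for numerical stability.
  double max_lp = *std::max_element(logps.begin(), logps.end());
  std::vector<double> weights;
  for (double lp : logps) {
    weights.push_back(std::exp(lp - max_lp));
  }

  std::discrete_distribution<size_t> pick(weights.begin(), weights.end());
  return support[pick(prng)];
}
```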

@ThomasColthurst
Author

Fixed in #224
