"Bone" works well when autocomplete=true but breaks when autocomplete=false #142

gaurav · 2024-02-02T21:13:21Z

When autocomplete is set to true, we set the query parameter query to e.g. (water) OR (water*) so that we can include cases where the search query is incomplete (e.g. (bloo) OR (bloo*) lets us find blood).

When autocomplete is false, we set query to just (water). In most cases this works fine, but sometimes it produces very different results. Compare:

We can restore the previous results by repeating the search query twice as before, i.e. (bone) OR (bone). I tried that out in a branch and confirmed that it does work:

NameResolution/api/server.py

Line 301 in 8006348

query = f"({string_lc_escaped}) OR ({string_lc_escaped})"
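For reference, here is a minimal sketch of the three query shapes discussed above. The function names `escape_solr` and `build_query`, the `duplicate` flag, and the exact list of escaped characters are assumptions for illustration. This is not the actual code in server.py.

```python
import re

# Characters with special meaning in Solr/Lucene query syntax.
# Assumption: the real server.py may escape a different set of characters.
_SOLR_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_solr(text: str) -> str:
    """Backslash-escape query-syntax characters (hypothetical helper)."""
    return _SOLR_SPECIAL.sub(r"\\\1", text)


def build_query(string: str, autocomplete: bool, duplicate: bool = False) -> str:
    """Build the Solr query string for a search term (hypothetical helper)."""
    term = escape_solr(string.lower())
    if autocomplete:
        # Exact term plus prefix match, so "bloo" can still find "blood".
        return f"({term}) OR ({term}*)"
    if duplicate:
        # Workaround from this issue: repeat the term without the wildcard.
        return f"({term}) OR ({term})"
    return f"({term})"


print(build_query("bone", autocomplete=True))
print(build_query("bone", autocomplete=False))
print(build_query("bone", autocomplete=False, duplicate=True))
```

The `duplicate=True` branch corresponds to the one-line change quoted above.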

Looking at the explain output, it looks like both queries set off a search for "bone bone", which might be pulling the correct UBERON term higher up:

    "rawquerystring":"(bone) OR (bone*)",
    "querystring":"(bone) OR (bone*)",
    "parsedquery":"+(DisjunctionMaxQuery((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone)) DisjunctionMaxQuery((names:bone* | preferred_name:bone* | (preferred_name_exactish:bone*)^10.0))) DisjunctionMaxQuery(((names:\"bone bone\")^2.0 | (preferred_name:\"bone bone\")^3.0))",
    "parsedquery_toString":"+((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone) (names:bone* | preferred_name:bone* | (preferred_name_exactish:bone*)^10.0)) ((names:\"bone bone\")^2.0 | (preferred_name:\"bone bone\")^3.0)",

    "rawquerystring":"(bone) OR (bone)",
    "querystring":"(bone) OR (bone)",
    "parsedquery":"+(DisjunctionMaxQuery((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone)) DisjunctionMaxQuery((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone))) DisjunctionMaxQuery(((names:\"bone bone\")^2.0 | (preferred_name:\"bone bone\")^3.0))",
    "parsedquery_toString":"+((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone) (names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone)) ((names:\"bone bone\")^2.0 | (preferred_name:\"bone bone\")^3.0)",
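To reproduce this kind of explain output, something like the following sketch should work. It assumes the backend is Solr using the edismax parser. The `qf` and `pf` values are inferred from the parsed queries above rather than copied from server.py, and `explain_params` is a hypothetical helper name. `debug=query` is the standard Solr parameter that adds `rawquerystring` and `parsedquery` to the response.

```python
from urllib.parse import urlencode


def explain_params(query: str) -> dict:
    """Solr request parameters that ask for the parsed query (hypothetical helper)."""
    return {
        "q": query,
        "defType": "edismax",
        # Query fields and boosts, inferred from the parsed query above.
        "qf": "names preferred_name_exactish^10 preferred_name",
        # Phrase fields and boosts, inferred from the "bone bone" clauses above.
        "pf": "names^2 preferred_name^3",
        # Ask Solr to include query-parsing debug information in the response.
        "debug": "query",
    }


for q in ("(bone) OR (bone*)", "(bone) OR (bone)"):
    # Append the printed string to the Solr select URL of your instance.
    print(urlencode(explain_params(q)))
```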

So the real solution here would be to improve the search so that we don't need to duplicate terms. The clique count I'm currently testing might help with that, but the scores might also be different enough that that doesn't make a difference. If that's the case, we'll need to be smarter about this.


This PR combines several improvements to search, results and filtering:

* It updates the search query so that it no longer duplicates the search term when doing an autocomplete query (see #142).
* This breaks hyphenated search terms in the autocomplete query, and I can't figure out why. For now, I've set it up so that we replace special characters with spaces in the autocomplete query (i.e. beta-secretase becomes `(beta secretase*)`) but we escape special characters in the non-autocomplete query (i.e. beta-secretase becomes `(beta\-secretase*)`), since that still appears to work. I'll dig into this more deeply in #146.
* It adds taxon and clique identifier count to the values indexed during data loading.
* It incorporates clique identifier count into the returned results as well as into the boosting and sorting of those results. It also tweaks the boosting values used in query fields and phrase fields.
* It adds an `only_taxa` input field that allows filtering results to a list of NCBITaxon taxon identifiers (note that this will only work for terms that have taxon information, which at the moment is only cliques containing NCBIGene identifiers).

gaurav · 2024-10-16T01:57:26Z

Differences between the two searches are still present but are less pronounced.

However, we still do a bad job with "bone", because we return a whole bunch of bone-related diseases before we return UBERON:0002481 "bone tissue". This is presumably because those have a much higher clique identifier count (16 for "solitary bone cyst", 52 for "osteosarcoma", 18 for "bone osteosarcoma" and so on, as compared to 6 for bone tissue).
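To make the suspected effect concrete, here is a small sketch that isolates the clique identifier count using the numbers quoted above. The real ranking also depends on the Solr relevance score and the configured boosts, so this is only an illustration of why a count-based boost would push "bone tissue" down, not a model of the actual scoring.

```python
# Clique identifier counts quoted in this comment.
clique_counts = {
    "solitary bone cyst": 16,
    "osteosarcoma": 52,
    "bone osteosarcoma": 18,
    "bone tissue": 6,
}

# Sort by count alone, highest first (assumption: every other factor is equal).
ranked = sorted(clique_counts.items(), key=lambda item: item[1], reverse=True)

for label, count in ranked:
    print(f"{count:3d}  {label}")
```

On count alone, "bone tissue" sorts last of the four.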

You can override this by providing a Biolink type (like biolink:AnatomicalEntity), but this is still not ideal.
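A sketch of what such a request could look like. The base URL is a placeholder, and the `/lookup` path and the `string` and `biolink_type` parameter names are assumptions that should be checked against the API documentation of the instance you are querying. Only `autocomplete` is named in this issue.

```python
from typing import Optional
from urllib.parse import urlencode

# Placeholder base URL; substitute the NameRes instance you are querying.
BASE_URL = "https://nameres.example.org"


def lookup_url(string: str, autocomplete: bool = False,
               biolink_type: Optional[str] = None) -> str:
    """Build a lookup URL, optionally restricted to one Biolink type (hypothetical helper)."""
    params = {"string": string, "autocomplete": str(autocomplete).lower()}
    if biolink_type:
        # Assumed parameter name for the Biolink type filter.
        params["biolink_type"] = biolink_type
    return f"{BASE_URL}/lookup?{urlencode(params)}"


print(lookup_url("bone", biolink_type="biolink:AnatomicalEntity"))
```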

Not sure how to fix this, though.

gaurav mentioned this issue Apr 23, 2024

Improve search, results and filtering by adding taxon and clique identifier count information #143

Merged

gaurav added this to the NameRes May 2024 milestone May 18, 2024

gaurav mentioned this issue May 22, 2024

Searching for BRCA1 in autocomplete=true mode gives a lot of bad matches #149

Closed

gaurav modified the milestones: NameRes May 2024, NameRes - issues needing investigation Oct 16, 2024
