Adding linear retriever to support weighted sums of sub-retrievers #120222

Merged
merged 78 commits into from
Jan 28, 2025

@pmpailis pmpailis commented Jan 15, 2025

This PR adds a new linear retriever to facilitate hybrid search. It linearly combines the results of its sub-retrievers and computes the final score of each document as the weighted sum of the scores from each sub-component.

Each sub-component can specify the following elements:

  • retriever -> specifies how we will compute the top documents
  • normalizer -> specifies how we want to normalize the top documents for this retriever (so that we can ensure that all scores fall within the same range)
  • weight -> the weight applied to the normalized score in the final weighted sum computation

Pagination works the same way as in the rrf retriever, i.e. we compute the global top rank_window_size docs and pagination is only available within these bounds.
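To illustrate the pagination bounds, here is a minimal Python sketch. It is not the Elasticsearch implementation; the `paginate` helper and the document ids are hypothetical and only model the behavior described above, where pages are cut from the top `rank_window_size` documents.

```python
def paginate(ranked_doc_ids, from_, size, rank_window_size):
    """Return the page [from_, from_ + size) cut from the top rank_window_size docs."""
    # Only the global top rank_window_size documents are eligible for any page.
    window = ranked_doc_ids[:rank_window_size]
    return window[from_:from_ + size]


# 25 hypothetical documents, already sorted by final score (best first).
ranked = [f"doc{i}" for i in range(1, 26)]

page1 = paginate(ranked, from_=0, size=4, rank_window_size=10)
page3 = paginate(ranked, from_=8, size=4, rank_window_size=10)
beyond = paginate(ranked, from_=12, size=4, rank_window_size=10)

print(page1)   # first four docs of the window
print(page3)   # truncated: the window ends at the 10th doc
print(beyond)  # empty: the offset is past the window
```

In this sketch a request that starts past `rank_window_size` simply returns nothing, even though more documents matched.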

So, working through an example, let's say that we perform a hybrid search query where:

  • we want to run a simple string query through a standard retriever, and normalize the scores to a [0, 1] range
  • we want to run a knn search through the knn retriever, without normalizing its scores
  • we want to compute the final score as score = 1.5 * standard + 2.5 * knn
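As a rough worked example of the arithmetic above, the following Python sketch applies min-max normalization to the standard retriever's scores and leaves the knn scores untouched before taking the weighted sum. The raw scores are made up, the `minmax` helper is hypothetical, and its handling of the case where all scores are equal is an assumption rather than something this PR states.

```python
def minmax(scores):
    """Rescale a {doc_id: score} map to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: when all scores are equal, map every doc to 1.0.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


# Hypothetical raw scores returned by each sub-retriever.
standard_raw = {"a": 12.0, "b": 8.0, "c": 4.0}
knn_raw = {"a": 0.5, "b": 0.75, "c": 0.25}

# The standard retriever is normalized with minmax; knn is used as-is.
standard_norm = minmax(standard_raw)

# score = 1.5 * standard + 2.5 * knn
final = {doc: 1.5 * standard_norm[doc] + 2.5 * knn_raw[doc] for doc in standard_raw}
ranking = sorted(final, key=final.get, reverse=True)

print(final)
print(ranking)
```

With these numbers, document "a" gets 1.5 * 1.0 + 2.5 * 0.5 = 2.75, "b" gets 1.5 * 0.5 + 2.5 * 0.75 = 2.625, and "c" gets 1.5 * 0.0 + 2.5 * 0.25 = 0.625.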

Sample syntax:

GET /retrievers_example/_search
{
    "retriever": {
        "linear": {
            "retrievers": [
                {
                    "retriever": {
                        "standard": {
                            "query": {
                                "simple_query_string": {
                                    "query": "artificial intelligence in medicine",
                                    "fields": [
                                        "text"
                                    ]
                                }
                            }
                        }
                    },
                    "weight": 1.5,
                    "normalizer": "minmax"
                },
                {
                    "retriever": {
                        "knn": {
                            "field": "vector",
                            "query_vector": [
                                0.23,
                                0.67,
                                0.89
                            ],
                            "k": 3,
                            "num_candidates": 5
                        }
                    },
                    "weight": 2.5
                }
            ],
            "rank_window_size": 10
        }
    }
}

@pmpailis pmpailis added the >enhancement, :Search Relevance/Ranking, :Search Relevance/Search, and v8.18.0 labels Jan 16, 2025
@elasticsearchmachine

Hi @pmpailis, I've created a changelog YAML for you.

@pmpailis pmpailis added the auto-backport Automatically create backport pull requests when merged label Jan 16, 2025

@benwtrent benwtrent left a comment

Looking much better. I have a concern around testing:

Do we have a test that specifically exercises the path when the different retrievers return different doc IDs? (e.g. they match non-overlapping doc sets).

@pmpailis

Do we have a test that specifically exercises the path when the different retrievers return different doc IDs? (e.g. they match non-overlapping doc sets).

Added a test to account for this in ea1787f
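The actual test lives in the commit referenced above. Purely as an illustration of the scenario being discussed, a sketch in Python could look like the following. The `linear_combine` helper is hypothetical, and it assumes that a document missing from a sub-retriever's results contributes 0 for that sub-retriever, which this conversation does not confirm.

```python
def linear_combine(results, weights):
    """Weighted sum of per-retriever scores, keyed by doc id."""
    combined = {}
    for scores, weight in zip(results, weights):
        for doc, score in scores.items():
            # Assumption: a doc absent from a sub-retriever adds nothing for it.
            combined[doc] = combined.get(doc, 0.0) + weight * score
    return combined


def test_non_overlapping_doc_sets():
    # The two sub-retrievers match completely disjoint documents.
    standard = {"d1": 1.0, "d2": 0.5}
    knn = {"d3": 0.5, "d4": 0.25}

    combined = linear_combine([standard, knn], [1.5, 2.5])

    # Every doc from either sub-retriever must appear in the final result.
    assert set(combined) == {"d1", "d2", "d3", "d4"}
    # Each doc is scored only by the retriever that returned it.
    assert combined == {"d1": 1.5, "d2": 0.75, "d3": 1.25, "d4": 0.625}
    # The final ranking interleaves docs from both sub-retrievers.
    ranking = sorted(combined, key=combined.get, reverse=True)
    assert ranking == ["d1", "d3", "d2", "d4"]


test_non_overlapping_doc_sets()
```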


@benwtrent benwtrent left a comment

:shipit: :godmode:

@pmpailis pmpailis merged commit 375814d into elastic:main Jan 28, 2025
16 checks passed
@elasticsearchmachine

💔 Backport failed

Branch 8.x: Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 120222

Labels
auto-backport, backport pending, >enhancement, :Search Relevance/Ranking, Team:Search Relevance, v8.18.0, v9.0.0
7 participants