Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove unnecessary override code #2695

Conversation

daixque
Copy link

@daixque daixque commented Nov 11, 2024

I tested below code to test DenseVectorStrategy:

strategy = DenseVectorStrategy(
    model_id=".multilingual-e5-small_linux-x86_64",
    hybrid=True,
    text_field="title",   
)
store = ElasticsearchStore(
    es_connection=elasticsearch_client,
    index_name=INDEX,
    vector_query_field="content_semantic.inference.chunks.embeddings",
    query_field="content",
    strategy=strategy,
)

But I realized text_field="title" parameter in the DenseVectorStrategy constructor doesn't affect since it is overridded in:

https://github.com/elastic/elasticsearch-py/blob/main/elasticsearch/helpers/vectorstore/_sync/vectorstore.py#L83-L86

And also there is no vector_field in all of strategy classes.

This PR intends to:

  • Override text_field field only when it is not provided as part of constructor
  • Remove unnecessary override code since if hasattr(retrieval_strategy, "vector_field") is always False

Copy link

A documentation preview will be available soon.

Request a new doc build by commenting
  • Rebuild this PR: run docs-build
  • Rebuild this PR and all Elastic docs: run docs-build rebuild

run docs-build is much faster than run docs-build rebuild. A rebuild should only be needed in rare situations.

If your PR continues to fail for an unknown reason, the doc build pipeline may be broken. Elastic employees can check the pipeline status here.

@miguelgrinberg
Copy link
Contributor

miguelgrinberg commented Nov 11, 2024

Wouldn't your problem be addressed if you move the text field to the vector store?

strategy = DenseVectorStrategy(
    model_id=".multilingual-e5-small_linux-x86_64",
    hybrid=True,
)
store = ElasticsearchStore(
    es_connection=elasticsearch_client,
    index_name=INDEX,
    vector_query_field="content_semantic.inference.chunks.embeddings",
    query_field="title",   
    strategy=strategy,
)

And also, note that query_field does not exist in our vector store class, it only appears in the Langchain integration, and is in fact converted to text_field when is passed up to the base class. Basically in this example you were running a hybrid search in which the BM25 and KNN sides were using different fields.

So I think you need to first of all decide if you want to use title or content for your search, and then pass the chosen field in the store class as query_field.

Unfortunately this way these constructors were designed is extremely convoluted, which is really the source of the problem you are having. Your fix addresses the way in which you are using this integration but does not do much to improve the overall design, in fact you may not have realized that you are searching two different fields.

@daixque
Copy link
Author

daixque commented Nov 12, 2024

Hi @miguelgrinberg , thanks for your comment. It is understood. Currently I configured with the same way with your suggestion. Below is just my comments.

If I set query_field="title" in ElasticsearchStore, it will pass to the LLM as part of the prompt and it is not my preference. I'd like to keep the content for the prompt.

My first intention is DenseVectorStrategy provides text_field parameter but its value which set in the constructor never used. I was confused by this behavior, so I'd like to avoid it if possible. If DenseVectorStrategy is only for ElasticsearchStore and it always override text_field, then I feel the constructor parameter doesn't make sense.

Second, if we can set the text_field for DenseVectorStrategy, then we can tune the query as I show in first example, title for BM25, embedding field which comes from content for knn, and full content for query for the LLM prompt. I found this configuration works well. But maybe for further tuning, we should create our own Strategy sub-class.

FYI, here is my (partial) mapping

{
  "properties": {
    "content": {
      "type": "text",
      "copy_to": [
        "content_semantic"
      ],
      "analyzer": "kuromoji"
    },
    "content_semantic": {
      "type": "semantic_text",
      "inference_id": "my-e5-endpoint",
      "model_settings": {
        "task_type": "text_embedding",
        "dimensions": 384,
        "similarity": "cosine",
        "element_type": "float"
      }
    },
    "title": {
      "type": "text",
      "analyzer": "kuromoji"
    }
  }
}

@miguelgrinberg
Copy link
Contributor

miguelgrinberg commented Nov 12, 2024

Right, so you are taking advantage of a design flaw in these classes, and running a hybrid search that uses different fields for the BM25 and kNN portions, which as far as I can see is not an intended usage. You are suggesting that text_field is being ignored, but from my understanding of the design of this integration that is actually the correct behavior, because you are passing query_field in the vectorstore class which takes precendence, so it makes sense for this field to be used in your searches.

If you wanted to keep doing this, then you do not need to change anything, just do this:

strategy = DenseVectorStrategy(
    model_id=".multilingual-e5-small_linux-x86_64",
    hybrid=True,
)
store = ElasticsearchStore(
    es_connection=elasticsearch_client,
    index_name=INDEX,
    vector_query_field="content_semantic.inference.chunks.embeddings",
    query_field="content",   
    strategy=strategy,
)
strategy.text_field = 'title'  # override the field used in the kNN search

@daixque
Copy link
Author

daixque commented Nov 12, 2024

@miguelgrinberg
Ah, indeed it will work but bit tricky 😇
Anyway thank you so much for your feedback! Now I'm clear.

@daixque daixque closed this Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants