feat: Hybrid Retrieval #1398

Open
wants to merge 9 commits into
base: master

yiyiyi0817
Member

Description

Adds hybrid retrieval, which combines auto retrieval with BM25 retrieval.

Motivation and Context

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of example)

More Tasks

A future improvement is to separate the chunking and processing steps from the original vector-based and BM25 retrieval code. Chunks could then be numbered uniformly, which would replace the current version's deduplication by string matching.
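To make the deduplication point concrete, here is a minimal sketch of fusing two ranked lists with reciprocal rank fusion (RRF), keyed on the chunk text because the chunks carry no shared ID. The function name `rrf_fuse`, the constant `k=60`, and the dict layout are assumptions for illustration, not the code in this PR.

```python
from collections import defaultdict


def rrf_fuse(vector_hits, bm25_hits, k=60, top_k=3):
    """Fuse two ranked lists of chunk dicts with reciprocal rank fusion.

    Chunks are matched across lists by their exact text (string matching),
    which is the stand-in for a shared chunk ID.
    """
    scores = defaultdict(float)
    for hits in (vector_hits, bm25_hits):
        seen = set()  # ignore repeats of the same text within one list
        for rank, hit in enumerate(hits, start=1):
            text = hit["text"]
            if text in seen:
                continue
            seen.add(text)
            # Each list contributes 1 / (k + rank) for the chunk.
            scores[text] += 1.0 / (k + rank)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [{"text": t, "rrf_score": s} for t, s in ranked[:top_k]]


# Hypothetical results from the two retrievers, best match first.
vector_hits = [{"text": "chunk A"}, {"text": "chunk B"}, {"text": "chunk C"}]
bm25_hits = [{"text": "chunk B"}, {"text": "chunk D"}, {"text": "chunk A"}]
fused = rrf_fuse(vector_hits, bm25_hits)
```

With uniformly numbered chunks, the dictionary key could be the chunk number instead of the text, so two chunks with identical text but different positions would no longer be merged.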

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

Wendong-Fan added this to the Sprint 20 milestone on Jan 5, 2025
Wendong-Fan changed the title from "RAG: Hybrid Retrieval" to "feat: Hybrid Retrieval" on Jan 5, 2025
Wendong-Fan modified the milestones: Sprint 20, Sprint 19 on Jan 5, 2025

from camel.types import EmbeddingModelType, StorageType


class HybridRetriever:
Collaborator

Hi! Maybe extending BaseRetriever here would help maintain consistency.

Member Author

My earlier thinking was that HybridRetriever and AutoRetriever are similar higher-level classes rather than base retrieval components. I noticed that AutoRetriever does not inherit from BaseRetriever either, so I'm not sure whether both of them should. WDYT?

Collaborator

Thank you for pointing this out. What do you think, @Wendong-Fan?

Member

Thanks, @liuxukun2000 and @yiyiyi0817! The BaseRetriever is a minimal, abstract base class, while the HybridRetriever extends both VectorRetriever and BM25Retriever, operating at a higher level. However, we can still inherit from BaseRetriever to implement the process method.
I think it would be better not to include AutoRetriever directly within HybridRetriever. AutoRetriever is a simple implementation designed to allow users to quickly run our RAG pipeline. Its primary purpose is to provide an easy entry point for users to try our RAG functionality, so it should remain at the top level for user interaction. HybridRetriever should depend solely on VectorRetriever and BM25Retriever. Perhaps later, we can consider integrating HybridRetriever into AutoRetriever for an enhanced user experience. WDYT?
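One way to picture the layering proposed above is the sketch below: a hybrid class that implements the base interface and depends only on two sub-retrievers. Every class here (`BaseRetrieverSketch`, `ToyRetriever`, `HybridRetrieverSketch`) is a hypothetical stand-in written for this example; the method signatures and scoring functions are assumptions and do not mirror the library's actual classes.

```python
from abc import ABC, abstractmethod


class BaseRetrieverSketch(ABC):
    """Hypothetical stand-in for a minimal abstract retriever interface."""

    @abstractmethod
    def process(self, chunks):
        """Index the given chunks."""

    @abstractmethod
    def query(self, query, top_k=2):
        """Return up to top_k chunks for the query."""


class ToyRetriever(BaseRetrieverSketch):
    """Toy sub-retriever that ranks chunks with a caller-supplied score."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.chunks = []

    def process(self, chunks):
        self.chunks = list(chunks)

    def query(self, query, top_k=2):
        ranked = sorted(
            self.chunks, key=lambda c: self.score_fn(query, c), reverse=True
        )
        return ranked[:top_k]


class HybridRetrieverSketch(BaseRetrieverSketch):
    """Depends only on two sub-retrievers, not on a top-level helper."""

    def __init__(self, vector_retriever, bm25_retriever):
        self.vector_retriever = vector_retriever
        self.bm25_retriever = bm25_retriever

    def process(self, chunks):
        # Both sub-retrievers index the same chunks.
        self.vector_retriever.process(chunks)
        self.bm25_retriever.process(chunks)

    def query(self, query, top_k=2):
        # Plain ordered union; score fusion is omitted to keep this short.
        merged = []
        for retriever in (self.vector_retriever, self.bm25_retriever):
            for chunk in retriever.query(query, top_k):
                if chunk not in merged:
                    merged.append(chunk)
        return merged


def word_overlap(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def shorter_is_better(query, chunk):
    return -len(chunk)  # arbitrary second signal, for illustration only


hybrid = HybridRetrieverSketch(
    ToyRetriever(word_overlap), ToyRetriever(shorter_is_better)
)
hybrid.process(
    ["KAUST visiting student program", "campus housing", "student visa rules"]
)
results = hybrid.query("visiting student")
```

Because the hybrid class satisfies the same interface as its parts, a top-level entry point could later wrap it without the hybrid class needing to know about that entry point.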

Member Author

@WenDong OK, I agree with you.

contents=self.content_input_path,
top_k=vector_retriever_top_k,
similarity_threshold=vector_retriever_similarity_threshold,
return_detailed_info=True,
Collaborator

Should this be return_detailed_info=return_detailed_info?

Member Author

Thanks for the comment. return_detailed_info=True is fixed here on purpose, so that detailed results are always obtained from auto_retriever. The return_detailed_info parameter of the query function only controls whether the final output contains detailed results that include the RRF score. This part may be refactored later to use vector_retriever instead.
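The distinction between the two flags can be sketched as follows: the inner call always asks for detailed results, and the caller's flag only shapes the final return value. The functions `vector_search` and `hybrid_query` and the sample hits are hypothetical stand-ins, not the code under review.

```python
def vector_search(query, return_detailed_info=False):
    """Hypothetical stand-in for the inner retriever call."""
    hits = [
        {"text": "chunk A", "similarity score": 0.9},
        {"text": "chunk B", "similarity score": 0.7},
    ]
    context = hits if return_detailed_info else [h["text"] for h in hits]
    return {"Original Query": query, "Retrieved Context": context}


def hybrid_query(query, return_detailed_info=False):
    # The inner call is fixed to True: scores are needed for ranking
    # regardless of what the caller wants back.
    inner = vector_search(query, return_detailed_info=True)
    all_retrieved_info = inner["Retrieved Context"]

    # The caller's flag decides only the shape of the final output.
    return {
        "Original Query": query,
        "Retrieved Context": (
            all_retrieved_info
            if return_detailed_info
            else [item["text"] for item in all_retrieved_info]
        ),
    }


plain = hybrid_query("What is KAUST?")
detailed = hybrid_query("What is KAUST?", return_detailed_info=True)
```

Passing the caller's flag straight through to the inner call would drop the scores whenever the caller asks for plain text, which is why the two values are kept separate in this sketch.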

return assistant_response.msg.content


print(single_agent("What is it like to be a visiting student at KAUST?"))
Collaborator

It would be better to include the expected output for single_agent("What is it like to be a visiting student at KAUST?") at the end of the file, consistent with other examples.

"Original Query": query,
"Retrieved Context": text_retrieved_info,
}
if return_detailed_info:
Collaborator

The logic in this section can be simplified to avoid redundancy.

Suggested change
- if return_detailed_info:
+ retrieved_info = {
+     "Original Query": query,
+     "Retrieved Context": (
+         all_retrieved_info
+         if return_detailed_info
+         else [item['text'] for item in all_retrieved_info]
+     ),
+ }
+ return retrieved_info

Member Author

yiyiyi0817 left a comment

Thanks for your review. I will push a new version soon.
