-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: document search sources string resolver #264
base: main
Are you sure you want to change the base?
feat: document search sources string resolver #264
Conversation
Code Coverage Summary
Diff against main
Results for commit: bf56aea Minimum allowed coverage is ♻️ This comment has been updated with latest results |
- Basic file URI ingestion - Wildcard pattern matching
6be3c3f
to
1b7bd93
Compare
d3b1e38
to
4505521
Compare
Trivy scanning results. .venv/lib/python3.10/site-packages/PyJWT-2.9.0.dist-info/METADATA (secrets)Total: 1 (MEDIUM: 1, HIGH: 0, CRITICAL: 0) MEDIUM: JWT (jwt-token) .venv/lib/python3.10/site-packages/litellm/llms/huggingface/huggingface_llms_metadata/hf_text_generation_models.txt (secrets)Total: 1 (MEDIUM: 0, HIGH: 0, CRITICAL: 1) CRITICAL: HuggingFace (hugging-face-access-token) .venv/lib/python3.10/site-packages/litellm/proxy/_types.py (secrets)Total: 1 (MEDIUM: 1, HIGH: 0, CRITICAL: 0) MEDIUM: Slack (slack-web-hook) |
documents: Either: | ||
- A URI string (e.g., "gcs://bucket/*") to specify source location(s), or | ||
- A sequence of documents or metadata of the documents to ingest | ||
URI format depends on the source type, for example: | ||
- "file:///path/to/files/*.txt" | ||
- "gcs://bucket/folder/*" | ||
- "huggingface://dataset/split/row" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was very hard for me to parse and understand, since it looks like a one long list and the examples are not next to what they are examples for.
Maybe something liek this:
documents: Either: | |
- A URI string (e.g., "gcs://bucket/*") to specify source location(s), or | |
- A sequence of documents or metadata of the documents to ingest | |
URI format depends on the source type, for example: | |
- "file:///path/to/files/*.txt" | |
- "gcs://bucket/folder/*" | |
- "huggingface://dataset/split/row" | |
documents: Either: | |
- A sequence of `Document`, `DocumentMetadata`, or `Source` objects | |
- A source-specific URI string (e.g., "gcs://bucket/*") to specify source location(s), for example: | |
- "file:///path/to/files/*.txt" | |
- "gcs://bucket/folder/*" | |
- "huggingface://dataset/split/row" |
document_processor: The document processor to use. If not provided, the document processor will be | ||
determined based on the document metadata. | ||
""" | ||
if isinstance(documents, str): | ||
from ragbits.document_search.documents.source_resolver import SourceResolver |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the conditional import? I think in a long run having the import run once during the start of the process will be more efficient, and will help avoid hard to debug errors.
|
||
sources = await SourceResolver.resolve(documents) | ||
else: | ||
sources = documents # type: ignore[assignment] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid having to disable type-checking, just add the typing (Sequence[DocumentMeta | Document | Source]
) to the first assignment of sources
super().__init_subclass__(**kwargs) | ||
Source._registry[cls.class_identifier()] = cls | ||
if cls.protocol is not None: | ||
from ragbits.document_search.documents.source_resolver import SourceResolver |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again - we never have done this (conditional imports inside methods based on whether a particular feature is enabled) before in ragbits - switching to similar convention would require a discussion. I think there are clear cons.
if not path_obj.is_absolute(): | ||
# For relative paths, use current directory as base | ||
path_obj = Path.cwd() / path_obj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure this is is the default behavior of the Path
class, so I believe this conditional is not necessary
path_obj = Path.cwd() / path_obj | ||
|
||
if "*" in str(path_obj): | ||
# If path contains wildcards, use its parent as base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems confusing to me, could you explain? I don't get how this code can lead to support for cases explained in the method's docstring (e.g. simple "*.py" or patterns with "?").
To be honest, in my mind the whole method could just be this one line:
return [cls(path=f) for f in Path().glob(pattern)]
document_search = DocumentSearch.from_config(CONFIG) | ||
|
||
# Test ingesting from URI with wildcard | ||
await document_search.ingest(f"file://{temp_dir}/test*.txt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to test other supported syntax options, for example file://{temp_dir}/test?.txt
, file://*.txt
, file://**/test?.txt
This PR should probably also include some documentation |
Solves #221