feature(embeddings): vectorize corpora corpus #15
PR Reviewer Guide 🔍 (Review updated until commit 5b9209d)
PR Code Suggestions ✨ Latest suggestions up to 5b9209d
Previous suggestions: suggestions up to commit e632d02
@@ -4,4 +4,7 @@
-r packages/corpora_client/test-requirements.txt
-r packages/corpora_proj/requirements.txt
-r packages/corpora_ai_openai/requirements.txt
# TODO: decide if these should be isolated
corpora_ai is probably a good namespace for these.
TODO: we will learn more when we go to packaging ...
from typing import Union

from langchain_text_splitters import (
    PythonCodeTextSplitter,
TODO: this one is pretty meh. It should work harder to split at good places rather than trying to meet the character-count specification. I'd rather have variable-length chunks with natural breaks than fixed-length chunks with nonsense splits.
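To make the preference for natural breaks concrete, here is a minimal sketch using only the standard library. It is not the PR's splitter and not the langchain_text_splitters API: the function name `split_at_natural_breaks` and the `soft_limit` parameter are hypothetical. It cuts only at blank lines that precede a top-level line, so a chunk may run past the soft limit instead of being cut mid-function.

```python
import re


def split_at_natural_breaks(source: str, soft_limit: int = 800) -> list[str]:
    """Group top-level blocks into chunks of roughly soft_limit characters.

    A single block is never cut, so a chunk can exceed soft_limit.
    """
    # A natural break is a blank line followed by a non-indented line.
    # Blank lines inside an indented body are not treated as breaks.
    blocks = [b for b in re.split(r"\n\s*\n(?=\S)", source) if b.strip()]

    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        # Close the current chunk before it would grow past the soft limit.
        if current and size + len(block) > soft_limit:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

With a small `soft_limit`, each top-level function (together with its decorators) ends up in its own chunk, whatever its length.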
/review
/describe
Persistent review updated to latest commit 5b9209d
PR Description updated to latest commit (5b9209d)
I skimped on the tests a little, but we will harden them later once we finish some of the core features and stabilize, or I'll take it up in the next PR.
I agree. I'll figure out something better later.
PR Type
enhancement, tests, documentation
Description
langchain-text-splitters and tiktoken.
Changes walkthrough 📝
10 files
admin.py
Refactor admin fieldsets and remove vector fields
py/packages/corpora/admin.py
vector_of_summary field from CorpusTextFileAdmin.
vector field from SplitAdmin.
0007_alter_split_vector.py
Migration to update vector field in Split model
py/packages/corpora/migrations/0007_alter_split_vector.py
vector field in Split model.
pgvector.django.vector.VectorField with 1536 dimensions.
0008_alter_corpustextfile_vector_of_summary.py
Migration to update vector_of_summary field in CorpusTextFile
py/packages/corpora/migrations/0008_alter_corpustextfile_vector_of_summary.py
vector_of_summary field in CorpusTextFile model.
pgvector.django.vector.VectorField with 1536 dimensions.
models.py
Enhance models with vectorization and content splitting
py/packages/corpora/models.py
vector_of_summary and vector fields to 1536 dimensions.
tasks.py
Implement tasks for summarization and vectorization
py/packages/corpora/tasks.py
count_tokens.py
Add token counting utility function
py/packages/corpora_ai/count_tokens.py
tiktoken.
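As an illustration of what a tiktoken-based token counter could look like, here is a hedged sketch. The real count_tokens.py is not shown on this page, so the signature, the default encoding name `cl100k_base`, and the regex fallback are all assumptions. `tiktoken.get_encoding` and `Encoding.encode` are the library calls used.

```python
import re


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken, or approximate if it is unavailable.

    Assumption: the fallback is a rough word/punctuation count and will
    not match the tokenizer exactly.
    """
    try:
        import tiktoken  # third-party dependency added by this PR

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # tiktoken is missing, its encoding data could not be loaded, or
        # the text contains special tokens that encode() rejects.
        return len(re.findall(r"\w+|[^\w\s]", text))
```

A count like this is typically checked against the embedding model's input limit before a chunk is sent for vectorization.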
llm_interface.py
Update LLM interface with embedding and summary methods
py/packages/corpora_ai/llm_interface.py
generate_embedding to get_embedding.
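A minimal sketch of how an interface with embedding and summary methods might be laid out. Only the method name `get_embedding` comes from this PR. The class names `LLMBaseInterface` and `FakeLLMClient` and the method `get_summary` are hypothetical, and the fake client returns deterministic placeholder vectors rather than real embeddings.

```python
from abc import ABC, abstractmethod


class LLMBaseInterface(ABC):
    """Hypothetical provider-neutral interface (names are assumptions)."""

    @abstractmethod
    def get_embedding(self, text: str) -> list[float]:
        """Return an embedding vector for text."""

    @abstractmethod
    def get_summary(self, text: str) -> str:
        """Return a short summary of text."""


class FakeLLMClient(LLMBaseInterface):
    """Offline stand-in for tests; it does not call any model."""

    def __init__(self, dimensions: int = 1536) -> None:
        self.dimensions = dimensions

    def get_embedding(self, text: str) -> list[float]:
        # Deterministic placeholder vector with the configured length.
        vector = [0.0] * self.dimensions
        for index, char in enumerate(text):
            vector[index % self.dimensions] += ord(char) / 1000.0
        return vector

    def get_summary(self, text: str) -> str:
        # Placeholder "summary": the first line, truncated to 80 characters.
        stripped = text.strip()
        return stripped.splitlines()[0][:80] if stripped else ""
```

Callers depend on the abstract interface, so a provider client can be swapped for the fake one in tests without touching the calling code.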
prompts.py
Introduce summarization prompt message
py/packages/corpora_ai/prompts.py
split.py
Add utility for text splitting based on file type
py/packages/corpora_ai/split.py
type.
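The idea of choosing a splitting strategy by file type can be sketched as a lookup on the file suffix. This is not the contents of split.py and does not use langchain_text_splitters. The table `SEPARATORS_BY_SUFFIX`, the separator lists, and both function names are hypothetical.

```python
from pathlib import Path

# Hypothetical separator preferences per file type, most specific first.
SEPARATORS_BY_SUFFIX = {
    ".py": ["\nclass ", "\ndef ", "\n\n", "\n"],
    ".md": ["\n# ", "\n## ", "\n\n", "\n"],
}
DEFAULT_SEPARATORS = ["\n\n", "\n", " "]


def separators_for(path: str) -> list[str]:
    """Pick the separator list from the file suffix (case-insensitive)."""
    return SEPARATORS_BY_SUFFIX.get(Path(path).suffix.lower(), DEFAULT_SEPARATORS)


def split_text(path: str, text: str) -> list[str]:
    """Split on the first separator for this file type that occurs in text."""
    for separator in separators_for(path):
        if separator in text:
            head, *rest = text.split(separator)
            # Re-attach the keyword or heading marker to the chunk it opens.
            parts = [head] + [separator.lstrip("\n") + piece for piece in rest]
            return [part for part in parts if part.strip()]
    return [text] if text.strip() else []
```

For example, `split_text("a.py", source)` prefers class and function boundaries, while an unknown suffix falls back to paragraph breaks.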
llm_client.py
Update method name for embedding generation
py/packages/corpora_ai_openai/llm_client.py
generate_embedding to get_embedding.
2 files
test_provider_loader.py
Enhance test for OpenAI provider loading
py/packages/corpora_ai/test_provider_loader.py
test_llm_client.py
Adjust tests for updated embedding method
py/packages/corpora_ai_openai/test_llm_client.py
2 files
celery-tasks.md
Document Celery task methods and usage
md/notes/celery-tasks.md
practical-embeddings-tutorial.md
Add tutorial on embeddings and dimensionality strategies
md/notes/practical-embeddings-tutorial.md
text-embedding-3-small.
trade-offs.
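The tutorial text itself is not reproduced on this page, so the following is only an illustrative sketch of one common dimensionality strategy: keep the leading components of an embedding and re-normalize to unit length, trading some accuracy for smaller storage. The function names are hypothetical and the vectors are toy values, not model output.

```python
import math


def shorten_embedding(vector: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` components and rescale to unit length."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale in an all-zero vector
    return [x / norm for x in head]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors shortened this way only fit a database column declared with the same number of dimensions, so the column size (1536 in this PR) and the chosen length have to agree.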
1 file
requirements.txt
Update requirements with new dependencies
py/requirements.txt
langchain-text-splitters and tiktoken dependencies.