
feature(embeddings): vectorize corpora corpus #15

Merged: 13 commits merged into main from vectorize on Nov 5, 2024
Conversation

@skyl (Owner) commented Nov 4, 2024

PR Type

enhancement, tests, documentation


Description

  • Enhanced models to support 1536-dimensional vectors and added methods for summarization and vectorization.
  • Implemented tasks for processing text files, generating summaries, and vectorizing content.
  • Added utility functions for text splitting and token counting.
  • Updated tests to reflect changes in method names and added new tests for OpenAI provider loading.
  • Documented Celery task methods and provided a comprehensive tutorial on embeddings.
  • Updated dependencies to include langchain-text-splitters and tiktoken.

Changes walkthrough 📝

Relevant files

Enhancement (10 files)

admin.py: Refactor admin fieldsets and remove vector fields
py/packages/corpora/admin.py (+6/-4)
  • Removed vector_of_summary field from CorpusTextFileAdmin.
  • Removed vector field from SplitAdmin.
  • Adjusted fieldsets for better organization.

0007_alter_split_vector.py: Migration to update vector field in Split model
py/packages/corpora/migrations/0007_alter_split_vector.py (+24/-0)
  • Added migration to alter the vector field in the Split model.
  • Updated the field to use pgvector.django.vector.VectorField with 1536 dimensions.

0008_alter_corpustextfile_vector_of_summary.py: Migration to update vector_of_summary field in CorpusTextFile
py/packages/corpora/migrations/0008_alter_corpustextfile_vector_of_summary.py (+24/-0)
  • Added migration to alter the vector_of_summary field in the CorpusTextFile model.
  • Updated the field to use pgvector.django.vector.VectorField with 1536 dimensions.

models.py: Enhance models with vectorization and content splitting
py/packages/corpora/models.py (+79/-2)
  • Updated vector_of_summary and vector fields to 1536 dimensions.
  • Added methods for summarizing and vectorizing content.
  • Introduced content splitting functionality.

tasks.py: Implement tasks for summarization and vectorization
py/packages/corpora/tasks.py (+38/-14)
  • Added tasks for generating summaries and vectors.
  • Implemented content splitting task.
  • Enhanced tarball processing to trigger new tasks.
count_tokens.py: Add token counting utility function
py/packages/corpora_ai/count_tokens.py (+20/-0)
  • Introduced function to count tokens using tiktoken.
  • Supports token counting for specific models.
llm_interface.py: Update LLM interface with embedding and summary methods
py/packages/corpora_ai/llm_interface.py (+26/-1)
  • Renamed generate_embedding to get_embedding.
  • Added method for generating text summaries.

prompts.py: Introduce summarization prompt message
py/packages/corpora_ai/prompts.py (+9/-0)
  • Added system message for summarization prompts.

split.py: Add utility for text splitting based on file type
py/packages/corpora_ai/split.py (+41/-0)
  • Added utility to determine the appropriate text splitter based on file type.
  • Supports Python and Markdown file splitting.
llm_client.py: Update method name for embedding generation
py/packages/corpora_ai_openai/llm_client.py (+1/-1)
  • Renamed generate_embedding to get_embedding.

Tests (2 files)

test_provider_loader.py: Enhance test for OpenAI provider loading
py/packages/corpora_ai/test_provider_loader.py (+1/-1)
  • Updated test to check for a missing OpenAI API key.

test_llm_client.py: Adjust tests for updated embedding method
py/packages/corpora_ai_openai/test_llm_client.py (+7/-7)
  • Updated tests to reflect changes in method names.

Documentation (2 files)

celery-tasks.md: Document Celery task methods and usage
md/notes/celery-tasks.md (+217/-0)
  • Added detailed notes on Celery task methods.
  • Included examples and usage notes for task management.

practical-embeddings-tutorial.md: Add tutorial on embeddings and dimensionality strategies
md/notes/practical-embeddings-tutorial.md (+102/-0)
  • Added tutorial on embeddings with text-embedding-3-small.
  • Discussed strategies for different corpora and dimensionality trade-offs.
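Embedding tutorials like the one above generally come down to comparing vectors, and the standard comparison is cosine similarity. A pure-stdlib sketch (real code would more likely use numpy, or pgvector's distance operators directly in SQL):

```python
# Cosine similarity between two equal-length embedding vectors (e.g. the
# 1536-dim output of text-embedding-3-small). Stdlib-only for illustration.
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0, which is why embeddings of semantically similar texts rank close together.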
Dependencies (1 file)

requirements.txt: Update requirements with new dependencies
py/requirements.txt (+3/-0)
  • Added langchain-text-splitters and tiktoken dependencies.

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

github-actions bot commented Nov 4, 2024

PR Reviewer Guide 🔍

(Review updated until commit 5b9209d)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Code Smell
The methods get_and_save_summary and get_and_save_vector_of_summary in the CorpusTextFile class load the LLM provider directly within the method body. Consider injecting the dependency or using a service layer to improve testability and separation of concerns.

Code Smell
The process_tarball function in tasks.py directly calls generate_summary_task and split_file_task for each file. Consider handling exceptions or failures in these tasks to ensure the robustness and reliability of the task-processing pipeline.

Possible Bug
In get_text_splitter, the function defaults to CharacterTextSplitter if no specific splitter is found. Ensure that this default is appropriate for all file types that might be processed, as it may produce unexpected results for unsupported formats.

github-actions bot commented Nov 4, 2024

PR Code Suggestions ✨

Latest suggestions up to 5b9209d
Explore these optional code suggestions:

Possible issue
Add exception handling for potential errors during LLM provider loading

Consider handling exceptions that might occur during the load_llm_provider call to prevent the application from crashing if the provider fails to load.

py/packages/corpora/models.py [88-89]

-llm = load_llm_provider()
-summary = llm.get_summary(self._get_text_representation())
+try:
+    llm = load_llm_provider()
+    summary = llm.get_summary(self._get_text_representation())
+except Exception as e:
+    # Handle exception, e.g., log error or set a default summary

Suggestion importance [1-10]: 8

Why: Adding exception handling for the LLM provider loading is crucial to prevent application crashes and ensure robustness. This suggestion directly addresses a potential runtime issue, making it highly relevant and impactful.
Enhancement
Ensure the input text is a valid non-empty string before generating embeddings

Validate that the text parameter is not only non-empty but also a valid string to prevent unexpected errors during embedding generation.

py/packages/corpora_ai_openai/llm_client.py [29-30]

-if not text:
-    raise ValueError("Input text must not be empty.")
+if not isinstance(text, str) or not text.strip():
+    raise ValueError("Input text must be a non-empty string.")

Suggestion importance [1-10]: 7

Why: Enhancing the validation of the text parameter to check for a valid non-empty string improves robustness and prevents potential errors during embedding generation. This suggestion is a valuable enhancement to the code's reliability.
Update package versions to the latest stable releases for improved compatibility and security

Consider specifying a more recent version for langchain-text-splitters and tiktoken to ensure compatibility with the latest features and security patches.

py/requirements.txt [8-9]

-langchain-text-splitters==0.3.2
-tiktoken==0.8.0
+langchain-text-splitters==0.3.3
+tiktoken==0.8.1

Suggestion importance [1-10]: 5

Why: Updating package versions can enhance compatibility and security by incorporating the latest features and patches. However, the suggestion assumes newer versions exist without verifying their availability or compatibility with the existing codebase, which slightly reduces its impact.

Previous suggestions

Suggestions up to commit e632d02
Possible bug
Check for None when extracting files from a tarball to prevent errors

Ensure that the tar.extractfile(member) result is checked for None before attempting to read, to prevent a potential AttributeError.

py/packages/corpora/tasks.py [16-17]

-file_content = (
-    tar.extractfile(member).read().decode("utf-8", errors="replace")
-)
+extracted_file = tar.extractfile(member)
+if extracted_file is not None:
+    file_content = extracted_file.read().decode("utf-8", errors="replace")

Suggestion importance [1-10]: 8

Why: This suggestion prevents a potential AttributeError by ensuring that the extracted file is not None before attempting to read it. This is a critical fix to avoid runtime errors when processing tarball files.
Validate the response structure from the OpenAI API to prevent errors

Validate that the response from the OpenAI API contains the expected data structure before accessing elements to prevent a potential IndexError.

py/packages/corpora_ai_openai/llm_client.py [29-31]

-return response.data[0].embedding
+if response.data and len(response.data) > 0:
+    return response.data[0].embedding
+else:
+    raise ValueError("Unexpected response structure from OpenAI API")

Suggestion importance [1-10]: 8

Why: Validating the response structure from the OpenAI API before accessing its elements prevents a potential IndexError, ensuring that the application handles unexpected API responses gracefully. This is an important improvement for error handling.
Possible issue
Add exception handling when loading the LLM provider to improve robustness

Consider handling potential exceptions when calling load_llm_provider() to ensure the application can gracefully handle any issues with loading the LLM provider.

py/packages/corpora/models.py [84-86]

-llm = load_llm_provider()
-summary = llm.get_summary(self._get_text_representation())
+try:
+    llm = load_llm_provider()
+    summary = llm.get_summary(self._get_text_representation())
+except Exception as e:
+    # Handle exception, e.g., log error or set a default summary

Suggestion importance [1-10]: 7

Why: Adding exception handling when loading the LLM provider increases the robustness of the application by preventing crashes due to unforeseen errors during the provider loading process. This is a valuable enhancement for maintaining application stability.
Enhancement
Add logging to track the execution of task chains for better monitoring

Consider logging the start and completion of each task in the task chain to aid in debugging and monitoring task execution.

py/packages/corpora/tasks.py [28-31]

+logger.info(f"Starting task chain for corpus file {corpus_file.id}")
 chain(
     generate_summary_task.s(corpus_file.id),
     split_file_task.s(corpus_file.id),
 ).apply_async()
+logger.info(f"Task chain for corpus file {corpus_file.id} completed")

Suggestion importance [1-10]: 5

Why: Adding logging for task chain execution can aid in debugging and monitoring, providing insights into task progress and completion. However, it is a moderate enhancement as it does not directly affect functionality or correctness.

@@ -4,4 +4,7 @@
-r packages/corpora_client/test-requirements.txt
-r packages/corpora_proj/requirements.txt
-r packages/corpora_ai_openai/requirements.txt
# TODO: decide if these should be isolated

@skyl (Owner, Author) commented:

probably corpora_ai is a good namespace for these ..

@skyl (Owner, Author) commented:

TODO: we will learn more when we go to packaging ...

from typing import Union

from langchain_text_splitters import (
    PythonCodeTextSplitter,

@skyl (Owner, Author) commented:

TODO: this one is pretty meh ... should work harder to split at good places more than try to meet the Character specification - I'd rather have variable length with natural breaks than hard length with non-sense splits.

@skyl marked this pull request as ready for review November 5, 2024 01:17

@skyl (Owner, Author) commented Nov 5, 2024

/review

@skyl (Owner, Author) commented Nov 5, 2024

/describe

github-actions bot commented Nov 5, 2024

Persistent review updated to latest commit 5b9209d

github-actions bot commented Nov 5, 2024

PR Description updated to latest commit (5b9209d)

@skyl (Owner, Author) commented Nov 5, 2024

I skimped on the tests a little bit, but we will harden them later once we finish some of the core features and stabilize... or I'll take it in the next PR ...

The methods get_and_save_summary and get_and_save_vector_of_summary in the CorpusTextFile class directly load the LLM provider within the method. Consider injecting the dependency or using a service layer to improve testability and separation of concerns.

I agree ... I'll figure out something better later.

@skyl merged commit 9042b2a into main Nov 5, 2024
2 checks passed
@skyl deleted the vectorize branch November 5, 2024 01:21