
feature(embeddings): vectorize corpora corpus #15

Merged: 13 commits merged into main from vectorize on Nov 5, 2024
Conversation

@skyl (Owner) commented Nov 4, 2024

PR Type

enhancement, tests, documentation


Description

  • Enhanced models to support 1536-dimensional vectors and added methods for summarization and vectorization.
  • Implemented tasks for processing text files, generating summaries, and vectorizing content.
  • Added utility functions for text splitting and token counting.
  • Updated tests to reflect changes in method names and added new tests for OpenAI provider loading.
  • Documented Celery task methods and provided a comprehensive tutorial on embeddings.
  • Updated dependencies to include langchain-text-splitters and tiktoken.

Changes walkthrough 📝

Relevant files

Enhancement (10 files)

admin.py: Refactor admin fieldsets and remove vector fields
py/packages/corpora/admin.py (+6/-4)
  • Removed vector_of_summary field from CorpusTextFileAdmin.
  • Removed vector field from SplitAdmin.
  • Adjusted fieldsets for better organization.

0007_alter_split_vector.py: Migration to update vector field in Split model
py/packages/corpora/migrations/0007_alter_split_vector.py (+24/-0)
  • Added migration to alter the vector field in the Split model.
  • Updated the field to use pgvector.django.vector.VectorField with 1536 dimensions.

0008_alter_corpustextfile_vector_of_summary.py: Migration to update vector_of_summary field in CorpusTextFile
py/packages/corpora/migrations/0008_alter_corpustextfile_vector_of_summary.py (+24/-0)
  • Added migration to alter the vector_of_summary field in the CorpusTextFile model.
  • Updated the field to use pgvector.django.vector.VectorField with 1536 dimensions.

models.py: Enhance models with vectorization and content splitting
py/packages/corpora/models.py (+79/-2)
  • Updated vector_of_summary and vector fields to 1536 dimensions.
  • Added methods for summarizing and vectorizing content.
  • Introduced content splitting functionality.

tasks.py: Implement tasks for summarization and vectorization
py/packages/corpora/tasks.py (+38/-14)
  • Added tasks for generating summaries and vectors.
  • Implemented content splitting task.
  • Enhanced tarball processing to trigger new tasks.
count_tokens.py: Add token counting utility function
py/packages/corpora_ai/count_tokens.py (+20/-0)
  • Introduced function to count tokens using tiktoken.
  • Supports token counting for specific models.
llm_interface.py: Update LLM interface with embedding and summary methods
py/packages/corpora_ai/llm_interface.py (+26/-1)
  • Renamed generate_embedding to get_embedding.
  • Added method for generating text summaries.

prompts.py: Introduce summarization prompt message
py/packages/corpora_ai/prompts.py (+9/-0)
  • Added system message for summarization prompts.

split.py: Add utility for text splitting based on file type
py/packages/corpora_ai/split.py (+41/-0)
  • Added utility to determine the appropriate text splitter based on file type.
  • Supports Python and Markdown file splitting.
llm_client.py: Update method name for embedding generation
py/packages/corpora_ai_openai/llm_client.py (+1/-1)
  • Renamed generate_embedding to get_embedding.

Tests (2 files)

test_provider_loader.py: Enhance test for OpenAI provider loading
py/packages/corpora_ai/test_provider_loader.py (+1/-1)
  • Updated test to check for a missing OpenAI API key.

test_llm_client.py: Adjust tests for updated embedding method
py/packages/corpora_ai_openai/test_llm_client.py (+7/-7)
  • Updated tests to reflect changes in method names.

Documentation (2 files)

celery-tasks.md: Document Celery task methods and usage
md/notes/celery-tasks.md (+217/-0)
  • Added detailed notes on Celery task methods.
  • Included examples and usage notes for task management.

practical-embeddings-tutorial.md: Add tutorial on embeddings and dimensionality strategies
md/notes/practical-embeddings-tutorial.md (+102/-0)
  • Added tutorial on embeddings with text-embedding-3-small.
  • Discussed strategies for different corpora and dimensionality trade-offs.
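Embedding tutorials like the one above generally come down to comparing vectors, and the standard comparison is cosine similarity. A pure-stdlib sketch (real code would more likely use numpy, or pgvector's distance operators directly in SQL):

```python
# Cosine similarity between two equal-length embedding vectors (e.g. the
# 1536-dim output of text-embedding-3-small). Stdlib-only for illustration.
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0, which is why embeddings of semantically similar texts rank close together.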
Dependencies (1 file)

requirements.txt: Update requirements with new dependencies
py/requirements.txt (+3/-0)
  • Added langchain-text-splitters and tiktoken dependencies.

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

github-actions bot commented Nov 4, 2024

PR Reviewer Guide 🔍

(Review updated until commit 5b9209d)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Code Smell
The methods get_and_save_summary and get_and_save_vector_of_summary in the CorpusTextFile class load the LLM provider directly within the method body. Consider injecting the dependency or using a service layer to improve testability and separation of concerns.

Code Smell
The process_tarball function in tasks.py directly calls generate_summary_task and split_file_task for each file. Consider handling exceptions or failures in these tasks to ensure the robustness and reliability of the task-processing pipeline.

Possible Bug
In get_text_splitter, the function defaults to CharacterTextSplitter if no specific splitter is found. Ensure that this default is appropriate for all file types that might be processed, as it may produce unexpected results for unsupported formats.

github-actions bot commented Nov 4, 2024

PR Code Suggestions ✨

Latest suggestions up to 5b9209d
Explore these optional code suggestions:

Possible issue
Add exception handling for potential errors during LLM provider loading

Consider handling exceptions that might occur during the load_llm_provider call to prevent the application from crashing if the provider fails to load.

py/packages/corpora/models.py [88-89]

-llm = load_llm_provider()
-summary = llm.get_summary(self._get_text_representation())
+try:
+    llm = load_llm_provider()
+    summary = llm.get_summary(self._get_text_representation())
+except Exception as e:
+    # Handle exception, e.g., log error or set a default summary

Suggestion importance [1-10]: 8

Why: Adding exception handling for the LLM provider loading is crucial to prevent application crashes and ensure robustness. This suggestion directly addresses a potential runtime issue, making it highly relevant and impactful.
Enhancement
Ensure the input text is a valid non-empty string before generating embeddings

Validate that the text parameter is not only non-empty but also a valid string to prevent unexpected errors during embedding generation.

py/packages/corpora_ai_openai/llm_client.py [29-30]

-if not text:
-    raise ValueError("Input text must not be empty.")
+if not isinstance(text, str) or not text.strip():
+    raise ValueError("Input text must be a non-empty string.")

Suggestion importance [1-10]: 7

Why: Enhancing the validation of the text parameter to check for a valid non-empty string improves robustness and prevents potential errors during embedding generation. This suggestion is a valuable enhancement to the code's reliability.
Update package versions to the latest stable releases for improved compatibility and security

Consider specifying a more recent version for langchain-text-splitters and tiktoken to ensure compatibility with the latest features and security patches.

py/requirements.txt [8-9]

-langchain-text-splitters==0.3.2
-tiktoken==0.8.0
+langchain-text-splitters==0.3.3
+tiktoken==0.8.1

Suggestion importance [1-10]: 5

Why: Updating package versions can enhance compatibility and security by incorporating the latest features and patches. However, the suggestion assumes newer versions exist without verifying their availability or compatibility with the existing codebase, which slightly reduces its impact.

Previous suggestions

Suggestions up to commit e632d02
Possible bug
Check for None when extracting files from a tarball to prevent errors

Ensure that the tar.extractfile(member) result is checked for None before attempting to read, to prevent a potential AttributeError.

py/packages/corpora/tasks.py [16-17]

-file_content = (
-    tar.extractfile(member).read().decode("utf-8", errors="replace")
-)
+extracted_file = tar.extractfile(member)
+if extracted_file is not None:
+    file_content = extracted_file.read().decode("utf-8", errors="replace")

Suggestion importance [1-10]: 8

Why: This suggestion prevents a potential AttributeError by ensuring that the extracted file is not None before attempting to read it. This is a critical fix to avoid runtime errors when processing tarball files.
Validate the response structure from the OpenAI API to prevent errors

Validate that the response from the OpenAI API contains the expected data structure before accessing elements to prevent a potential IndexError.

py/packages/corpora_ai_openai/llm_client.py [29-31]

-return response.data[0].embedding
+if response.data and len(response.data) > 0:
+    return response.data[0].embedding
+else:
+    raise ValueError("Unexpected response structure from OpenAI API")

Suggestion importance [1-10]: 8

Why: Validating the response structure from the OpenAI API before accessing its elements prevents a potential IndexError, ensuring that the application handles unexpected API responses gracefully. This is an important improvement for error handling.
Possible issue
Add exception handling when loading the LLM provider to improve robustness

Consider handling potential exceptions when calling load_llm_provider() to ensure the application can gracefully handle any issues with loading the LLM provider.

py/packages/corpora/models.py [84-86]

-llm = load_llm_provider()
-summary = llm.get_summary(self._get_text_representation())
+try:
+    llm = load_llm_provider()
+    summary = llm.get_summary(self._get_text_representation())
+except Exception as e:
+    # Handle exception, e.g., log error or set a default summary

Suggestion importance [1-10]: 7

Why: Adding exception handling when loading the LLM provider increases the robustness of the application by preventing crashes due to unforeseen errors during the provider loading process. This is a valuable enhancement for maintaining application stability.
Enhancement
Add logging to track the execution of task chains for better monitoring

Consider logging the start and completion of each task in the task chain to aid in debugging and monitoring task execution.

py/packages/corpora/tasks.py [28-31]

+logger.info(f"Starting task chain for corpus file {corpus_file.id}")
 chain(
     generate_summary_task.s(corpus_file.id),
     split_file_task.s(corpus_file.id),
 ).apply_async()
+logger.info(f"Task chain for corpus file {corpus_file.id} completed")

Suggestion importance [1-10]: 5

Why: Adding logging for task chain execution can aid in debugging and monitoring, providing insights into task progress and completion. However, it is a moderate enhancement as it does not directly affect functionality or correctness.

@@ -4,4 +4,7 @@
-r packages/corpora_client/test-requirements.txt
-r packages/corpora_proj/requirements.txt
-r packages/corpora_ai_openai/requirements.txt
# TODO: decide if these should be isolated

@skyl (Owner, Author) commented:

probably corpora_ai is a good namespace for these ..

@skyl (Owner, Author) commented:

TODO: we will learn more when we go to packaging ...

from typing import Union

from langchain_text_splitters import (
    PythonCodeTextSplitter,

@skyl (Owner, Author) commented:

TODO: this one is pretty meh ... should work harder to split at good places more than try to meet the Character specification - I'd rather have variable length with natural breaks than hard length with non-sense splits.

@skyl marked this pull request as ready for review November 5, 2024 01:17

@skyl (Owner, Author) commented Nov 5, 2024

/review

@skyl (Owner, Author) commented Nov 5, 2024

/describe

github-actions bot commented Nov 5, 2024

Persistent review updated to latest commit 5b9209d

github-actions bot commented Nov 5, 2024

PR Description updated to latest commit (5b9209d)

@skyl (Owner, Author) commented Nov 5, 2024

I skimped on the tests a little bit, but we will harden them later once we finish some of the core features and stabilize... or I'll take it in the next PR ...

The methods get_and_save_summary and get_and_save_vector_of_summary in the CorpusTextFile class directly load the LLM provider within the method. Consider injecting the dependency or using a service layer to improve testability and separation of concerns.

I agree ... I'll figure out something better later.

@skyl merged commit 9042b2a into main Nov 5, 2024
2 checks passed
@skyl deleted the vectorize branch November 5, 2024 01:21