OAI File Storage Metadata management #26

Open
dahifi opened this issue May 22, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@dahifi
Member

dahifi commented May 22, 2024

The OAI Platform files UX is pretty lacking. https://platform.openai.com/storage/files

Right now it shows the following information in the UX:
image

We would like a better interface to manage files between our local file systems, git repos, Chainlit front end, &c, and to make sure we're not loading the same file in multiple places. Uploaded files often end up with random file names, and most of the other endpoints, such as vector stores and annotation file citations, refer to files by their OAI file ID. We need a way to manage all this better.

The vector storage has a slightly better interface:
image

It at least shows us the file names (most of the time) and now shows what assistants and threads a datastore is attached to.

I'm not sure I want to rebuild the OAI UX for all this, but we do need to check file hashes on upload, and to record some sort of summary or descriptive details about each file and why it was added. These can be used for rollups and the like.
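As a rough illustration of the hash check mentioned above, here is a minimal sketch that computes a content hash for a local file before upload. The function name `sha256_of_file` and the choice of SHA-256 are assumptions for illustration; the issue does not specify a hash algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so large uploads do not need to fit in memory.
    SHA-256 is an assumed choice; any stable content hash would work.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical bytes produce the same digest regardless of their names, which is what makes the hash usable as a duplicate key even when file names are random.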

We might also use some of this metadata to store the original download URL or source. This is crucial when we start building ingestion pipelines for YouTube videos and other data sources that aren't natively supported by retrieval.

@dahifi dahifi added the enhancement New feature or request label May 22, 2024
@dahifi
Member Author

dahifi commented Jun 17, 2024

Requirements for OAI File Metadata Management

Description

Enhance the metadata management of files uploaded to the OAI Platform to improve UX and prevent duplication. This includes adding checks for file uploads using file hashes and storing comprehensive metadata such as the original download URL.

Acceptance Criteria

  1. Metadata Enhancements

    • Store additional metadata for each file, including:
      • Original download URL
      • File hash (to check for duplicates)
      • Upload timestamp
      • File size
      • User who uploaded the file
    • Ensure metadata is retrievable via the Chainlit front end and API.
  2. Duplicate File Handling

    • Implement file hash checks during the upload process.
    • Prevent upload of duplicate files based on hash comparison.
    • Provide a clear message to the user if a duplicate file is detected, suggesting actions (e.g., use the existing file or rename).
  3. User Interface Improvements

    • Enhance the Chainlit front end to display the additional metadata.
    • Allow users to filter and search files based on metadata attributes (e.g., file name, upload date, uploader).
  4. Backend Adjustments

    • Update the backend to handle and store the new metadata fields.
    • Ensure existing files are retroactively updated with the new metadata where possible.
  5. Documentation and Testing

    • Update the documentation to reflect the changes in metadata management.
    • Write and execute unit tests to ensure the correct functionality of the new features.
    • Perform user acceptance testing (UAT) to verify the improved UX and duplicate handling.
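To make criteria 2 and 3 more concrete, here is a minimal sketch of duplicate detection with a user-facing message, plus filtering by metadata attributes. It assumes metadata records are plain dicts with hypothetical keys (`file_id`, `name`, `sha256`, `uploader`, `uploaded_at`); none of these names come from the existing codebase.

```python
from datetime import date


def find_duplicate(sha256, records):
    """Return the first record whose hash matches, or None if there is none."""
    return next((r for r in records if r["sha256"] == sha256), None)


def duplicate_message(new_name, existing):
    """Build the message shown to the user when a duplicate is detected."""
    return (
        f"'{new_name}' has the same content as '{existing['name']}' "
        f"(file ID {existing['file_id']}). Use the existing file, "
        "or rename and re-upload if a separate copy is intended."
    )


def filter_records(records, *, name_contains=None, uploader=None, uploaded_since=None):
    """Filter records by file name substring, uploader, and earliest upload date."""
    result = []
    for r in records:
        if name_contains and name_contains.lower() not in r["name"].lower():
            continue
        if uploader and r["uploader"] != uploader:
            continue
        if uploaded_since and r["uploaded_at"] < uploaded_since:
            continue
        result.append(r)
    return result


# Hypothetical sample data, for illustration only.
records = [
    {"file_id": "file-aaa", "name": "Handbook.pdf", "sha256": "abc123",
     "uploader": "alice", "uploaded_at": date(2024, 5, 22)},
    {"file_id": "file-bbb", "name": "notes.md", "sha256": "def456",
     "uploader": "bob", "uploaded_at": date(2024, 6, 17)},
]

existing = find_duplicate("abc123", records)
if existing is not None:
    print(duplicate_message("upload_8f3a.pdf", existing))
```

The same helpers could back both the API and the Chainlit front end, so that the duplicate message and the search results stay consistent between the two.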

@dahifi
Member Author

dahifi commented Jun 17, 2024

Initial plan for enhancing metadata management, handling duplicates, and improving the backend:

High-Level Plan

  1. Metadata Enhancements

    • Create a FileMetadata class to store additional metadata attributes.
    • Implement methods to calculate file hashes and retrieve file metadata.
  2. Duplicate File Handling

    • Calculate file hashes during the upload process.
    • Check for existing files with the same hash and prevent duplicates.
  3. User Interface Improvements

    • Ensure API endpoints return the necessary metadata for the front-end to consume.
    • Update the front-end to display and handle new metadata fields.
  4. Backend Adjustments

    • Update the database schema to include new metadata fields.
    • Implement scripts to update existing files with the new metadata.
  5. Documentation and Testing

    • Update API documentation to reflect changes.
    • Write and execute unit tests to ensure proper functionality.
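One possible shape for the `FileMetadata` class and the schema change in items 1 and 4 is sketched below. The field names, the `file_metadata` table, and the use of SQLite are all assumptions for illustration; the plan does not name a database or a schema.

```python
import sqlite3
from dataclasses import astuple, dataclass
from typing import Optional


@dataclass
class FileMetadata:
    """Assumed metadata fields, mirroring the acceptance criteria."""
    file_id: str                      # OAI file ID
    name: str
    sha256: str                       # content hash used for duplicate checks
    size_bytes: int
    uploaded_at: str                  # ISO 8601 timestamp
    uploaded_by: str
    source_url: Optional[str] = None  # original download URL, if known


# Column order matches the dataclass field order so astuple() can be inserted directly.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_metadata (
    file_id     TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    sha256      TEXT NOT NULL UNIQUE,
    size_bytes  INTEGER NOT NULL,
    uploaded_at TEXT NOT NULL,
    uploaded_by TEXT NOT NULL,
    source_url  TEXT
)
"""


def init_db(conn):
    """Create the metadata table if it does not exist yet."""
    conn.execute(SCHEMA)


def save(conn, meta):
    """Insert a record; return False if the hash or file ID is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO file_metadata VALUES (?, ?, ?, ?, ?, ?, ?)",
                astuple(meta),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def find_by_hash(conn, sha256):
    """Return the stored FileMetadata for a hash, or None."""
    row = conn.execute(
        "SELECT file_id, name, sha256, size_bytes, uploaded_at, uploaded_by, source_url "
        "FROM file_metadata WHERE sha256 = ?",
        (sha256,),
    ).fetchone()
    return FileMetadata(*row) if row else None
```

Putting the `UNIQUE` constraint on the hash column lets the database enforce the duplicate rule, so the upload path only needs to handle a failed insert rather than run a separate lookup first.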
