
Enhancement Request: Make chunk_size and chunk_overlap Configurable #1010

Open
dhiaaeddine16 opened this issue Jan 16, 2025 · 0 comments
Labels
enhancement New feature or request


dhiaaeddine16 commented Jan 16, 2025

Hi,

Currently, the chunk_size and chunk_overlap parameters in the TokenTextSplitter are hardcoded to 200 and 20, respectively. This reduces flexibility for users who need different chunk sizes for different tokenization tasks, especially when working with large datasets or specific language models.

Proposed Solution

  • Make both chunk_size and chunk_overlap configurable so users can define custom values for these parameters, gaining finer control over how text is split into chunks. This would be particularly useful for users with specialized tokenization needs, or for models that require specific chunk configurations.
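The change could look something like the following sketch: a splitter whose constructor accepts chunk_size and chunk_overlap instead of hardcoding 200 and 20. The class and tokenization here are illustrative only (whitespace splitting stands in for a real tokenizer), not the project's actual TokenTextSplitter implementation.

```python
class TokenTextSplitter:
    """Sketch of a token splitter with configurable parameters.

    Defaults mirror the currently hardcoded values (200 / 20).
    """

    def __init__(self, chunk_size: int = 200, chunk_overlap: int = 20):
        if chunk_overlap >= chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split(self, text: str) -> list[str]:
        # Whitespace tokenization is a placeholder for a real tokenizer.
        tokens = text.split()
        step = self.chunk_size - self.chunk_overlap
        chunks = []
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + self.chunk_size]))
            if start + self.chunk_size >= len(tokens):
                break
        return chunks


# Example: small chunks with a one-token overlap between neighbours.
splitter = TokenTextSplitter(chunk_size=4, chunk_overlap=1)
print(splitter.split("a b c d e f g h i j"))
```

Validating that chunk_overlap is strictly smaller than chunk_size at construction time avoids an infinite or zero-progress loop when the step size would be non-positive.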

Reference

A similar concern was raised in issue #488, where users experienced slow processing or failures when handling large files. Improving chunking flexibility could help mitigate some of these performance bottlenecks, especially for larger datasets.

Benefits

  • Greater flexibility for handling diverse datasets.
  • Ability to optimize processing performance by tuning chunk sizes to the language model's needs.
  • Better handling of large files through custom chunking configurations.