Currently, the `chunk_size` and `chunk_overlap` parameters in the `TokenTextSplitter` are hardcoded to `200` and `20`, respectively. This limits flexibility for users who need different chunk sizes for different tokenization tasks, especially when working with large datasets or with specific language models.
## Proposed Solution
We propose making both `chunk_size` and `chunk_overlap` configurable. This would let users define custom values for these parameters and give them more control over how text is split into chunks, which is particularly useful for specialized tokenization needs or for models that require specific chunk configurations.
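As a rough illustration, here is a minimal sketch of what the configurable interface could look like. The class name `TokenTextSplitter` and the default values come from this issue; the tiktoken-based internals, the `encoding_name` parameter, and the `split_text` method are assumptions made for the sake of the example, not the library's actual implementation.

```python
from typing import List

import tiktoken  # assumed tokenizer backend; the real splitter may use another


class TokenTextSplitter:
    """Illustrative splitter where chunk_size and chunk_overlap are
    constructor arguments instead of the hardcoded 200 and 20."""

    def __init__(self, chunk_size: int = 200, chunk_overlap: int = 20,
                 encoding_name: str = "cl100k_base"):
        if chunk_overlap >= chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.encoding = tiktoken.get_encoding(encoding_name)

    def split_text(self, text: str) -> List[str]:
        # Slide a window of chunk_size tokens, advancing by
        # (chunk_size - chunk_overlap) so consecutive chunks overlap.
        tokens = self.encoding.encode(text)
        step = self.chunk_size - self.chunk_overlap
        chunks = []
        for start in range(0, len(tokens), step):
            window = tokens[start:start + self.chunk_size]
            chunks.append(self.encoding.decode(window))
            if start + self.chunk_size >= len(tokens):
                break  # the final window already reaches the end of the text
        return chunks
```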
## Reference
A similar concern was raised in issue #488, where users experienced slow processing or failures when handling large files. Making chunking configurable could help mitigate some of those performance bottlenecks, especially for larger datasets.
## Benefits
- Greater flexibility for handling diverse datasets.
- The ability to tune processing performance by adjusting chunk sizes to a language model's needs (see the usage example below).
- Better handling of large files via custom chunking configurations.
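Assuming the sketch above, a user processing a large file for a long-context model could then tune the parameters instead of being locked to 200/20. The file name and values below are purely illustrative:

```python
# Hypothetical usage: larger chunks with proportionally more overlap,
# e.g. for a model with a long context window.
splitter = TokenTextSplitter(chunk_size=1024, chunk_overlap=128)

with open("large_file.txt", encoding="utf-8") as f:  # hypothetical input file
    chunks = splitter.split_text(f.read())

print(f"Split into {len(chunks)} chunks of up to 1024 tokens each")
```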