Librispeech documentation, clarification on format #4185
Also cc @lhoestq here
The documentation in the code is definitely outdated - thanks for letting me know, I'll remove it in #4184. You're exactly right
So, again to clarify: On disk, only the raw flac file content is stored? Is this also the case after And is it simple to also store it re-encoded as ogg or mp3 instead?
Hey, Sorry yeah I was just about to look into this! We actually had an outdated version of Librispeech ASR that didn't save any files, but instead converted the audio files to a byte string, which was then decoded on-the-fly. This however is not very user-friendly so we recently decided to instead show the full path of the audio files with the I'm currently changing this for Librispeech here: #4184.
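To make the "byte string decoded on-the-fly" idea concrete, here is a minimal sketch using only the Python standard library. It is not the code used by the datasets library: WAV (via the wave module) stands in for flac, and the names encode_wav and LazyAudio are hypothetical helpers made up for this illustration.

```python
import io
import struct
import wave


def encode_wav(samples, sample_rate=16000):
    """Encode 16-bit mono PCM samples into WAV file bytes (stand-in for flac)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        writer.setframerate(sample_rate)
        writer.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


class LazyAudio:
    """Hypothetical record: stores only encoded bytes, decodes on access."""

    def __init__(self, encoded):
        self.encoded = encoded  # this is all that would be kept on disk

    @property
    def array(self):
        # Decoding happens here, every time the samples are requested.
        with wave.open(io.BytesIO(self.encoded), "rb") as reader:
            n_frames = reader.getnframes()
            frames = reader.readframes(n_frames)
        return list(struct.unpack(f"<{n_frames}h", frames))


samples = [0, 1000, -1000, 32767, -32768]
example = LazyAudio(encode_wav(samples))
decoded = example.array  # looks "already decoded" to the caller
```

From the caller's side, example.array looks like a ready-made sample array even though only the encoded bytes are stored, which is why inspecting a loaded example alone does not reveal the on-disk format.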
Sure, I would expect that
A follow-up question: I wonder whether a Parquet dataset is maybe more what we actually want to have? (Following also my comment here: #4184 (comment).) Because I think we actually would prefer to embed the data content in the dataset. So, instead of Related is also the doc update in #4193.
Therefore you can directly reload a dataset saved with Parquet files are used for cold storage: to use memory mapping on a Parquet dataset, you first have to convert it to Arrow. We use Parquet to reduce the I/O when pushing/downloading data from the Hugging Face Hub. When you load a Parquet file from the Hub, it is converted to Arrow on the fly during the download.
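The difference between a memory-mappable layout and a cold-storage layout can be sketched with the standard library alone. This is only an analogy under stated assumptions: a raw binary file stands in for Arrow and a gzip file stands in for Parquet; neither is the real format, and the file names are made up for the example.

```python
import gzip
import mmap
import os
import struct
import tempfile

# 1000 little-endian 32-bit integers as toy "dataset" content.
values = list(range(1000))
raw = struct.pack(f"<{len(values)}i", *values)

tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "data.bin")        # stand-in for Arrow
packed_path = os.path.join(tmpdir, "data.bin.gz")  # stand-in for Parquet

with open(raw_path, "wb") as f:
    f.write(raw)
with gzip.open(packed_path, "wb") as f:
    f.write(raw)

# Uncompressed layout: map the file and read item 500 directly by offset,
# without reading the rest of the file into memory.
with open(raw_path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        (item_mapped,) = struct.unpack_from("<i", mapped, 500 * 4)

# Compressed layout: the content has to be converted (decompressed) first,
# and only then can it be indexed.
with gzip.open(packed_path, "rb") as f:
    unpacked = f.read()
(item_unpacked,) = struct.unpack_from("<i", unpacked, 500 * 4)
```

Both reads return the same item; the point is only that the compressed file needs a conversion step before random access, which mirrors the Parquet-to-Arrow conversion described above.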
datasets/datasets/librispeech_asr/librispeech_asr.py
Line 53 in cd3ce34
Is this still true?
In my case, ds["train.100"] returns: and taking the first instance yields:
The audio array seems to be already decoded. So such convert/decode code as mentioned in the doc is wrong?
But I wonder, is it actually stored as flac on disk, and the decoding is done on-the-fly? Or was it decoded already during the preparation and is stored as raw samples on disk?
Note that I also used datasets.load_dataset("librispeech_asr", "clean").save_to_disk(...) and then datasets.load_from_disk(...) in this example. Does this change anything on how it is stored on disk?
A small related question: Actually I would prefer to even store it as mp3 or ogg on disk. Is this easy to convert?