
Librispeech documentation, clarification on format #4185

Open
albertz opened this issue Apr 20, 2022 · 8 comments
Comments

albertz commented Apr 20, 2022

Note that in order to limit the required storage for preparing this dataset, the audio
is stored in the .flac format and is not converted to a float32 array. To convert the audio
file to a float32 array, please make use of the .map() function as follows:

import soundfile as sf
def map_to_array(batch):
    speech_array, _ = sf.read(batch["file"])
    batch["speech"] = speech_array
    return batch
dataset = dataset.map(map_to_array, remove_columns=["file"])

Is this still true?

In my case, ds["train.100"] returns:

Dataset({
    features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
    num_rows: 28539
})

and taking the first instance yields:

{'file': '374-180298-0000.flac',
 'audio': {'path': '374-180298-0000.flac',
  'array': array([ 7.01904297e-04,  7.32421875e-04,  7.32421875e-04, ...,
         -2.74658203e-04, -1.83105469e-04, -3.05175781e-05]),
  'sampling_rate': 16000},
 'text': 'CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED',
 'speaker_id': 374,
 'chapter_id': 180298,
 'id': '374-180298-0000'}

The audio array seems to be decoded already, so the convert/decode code from the doc quoted above is wrong?

But I wonder, is it actually stored as flac on disk, and the decoding is done on-the-fly? Or was it decoded already during the preparation and is stored as raw samples on disk?

Note that I also used datasets.load_dataset("librispeech_asr", "clean").save_to_disk(...) and then datasets.load_from_disk(...) in this example. Does this change anything on how it is stored on disk?

A small related question: Actually I would prefer to even store it as mp3 or ogg on disk. Is this easy to convert?


albertz commented Apr 20, 2022

(@patrickvonplaten )

patrickvonplaten commented

Also cc @lhoestq here

patrickvonplaten commented

The documentation in the code is definitely outdated - thanks for letting me know, I'll remove it in #4184 .

You're exactly right: the audio array is already decoded from the audio file to the correct waveform. This is done on the fly, which is also why one should not do ds["audio"]["array"][0] (this decodes all dataset samples), but instead ds[0]["audio"]["array"]. See: https://huggingface.co/docs/datasets/audio_process#audio-datasets


albertz commented Apr 20, 2022

So, again to clarify: On disk, only the raw flac file content is stored? Is this also the case after save_to_disk?

And is it simple to also store it re-encoded as ogg or mp3 instead?

patrickvonplaten commented

Hey,

Sorry, yeah, I was just about to look into this! We actually had an outdated version of Librispeech ASR that didn't save any files, but instead converted the audio files to a byte string, which was then decoded on-the-fly. This however is not very user-friendly, so we recently decided to instead expose the full path of the audio files via the path parameter.

I'm currently changing this for Librispeech here: #4184 .
You should be able to see the audio file in the original flac format under path then. I don't think it's a good idea to convert to MP3 out-of-the-box, but we could maybe think about some kind of convert function for audio datasets cc @lhoestq ?


albertz commented Apr 20, 2022

I don't think it's a good idea to convert to MP3 out-of-the-box, but we could maybe think about some kind of convert function for audio datasets cc @lhoestq ?

Sure, I would expect that load_dataset("librispeech_asr") gives you the original (not re-encoded) data, either flac or already decoded. Such re-encoding logic should then be a separate generic function, so that I could do something like dataset.reencode_as_ogg(**ogg_encode_opts).save_to_disk(...).


albertz commented Apr 21, 2022

A follow-up question: I wonder whether a Parquet dataset is maybe more what we actually want to have? (Following also my comment here: #4184 (comment).) Because I think we actually would prefer to embed the data content in the dataset.

So, instead of save_to_disk/load_from_disk, we would use to_parquet/from_parquet? Is there any downside? Are Arrow files more efficient?

Related is also the doc update in #4193.


lhoestq commented Apr 21, 2022

save_to_disk saves the dataset as an Arrow file, which is the format we use to load a dataset using memory mapping. This way the dataset does not fill your RAM, but is read from your disk instead.

Therefore you can directly reload a dataset saved with save_to_disk using load_from_disk.

Parquet files are used for cold storage: to use memory mapping on a Parquet dataset, you first have to convert it to Arrow. We use Parquet to reduce the I/O when pushing/downloading data from the Hugging Face Hub. When you load a Parquet file from the Hub, it is converted to Arrow on the fly during the download.
