
Librispeech documentation, clarification on format #4185

Open
albertz opened this issue Apr 20, 2022 · 8 comments
Comments

albertz commented Apr 20, 2022

Note that in order to limit the required storage for preparing this dataset, the audio
is stored in the .flac format and is not converted to a float32 array. To convert the audio
file to a float32 array, please make use of the .map() function as follows:

import soundfile as sf
def map_to_array(batch):
    speech_array, _ = sf.read(batch["file"])
    batch["speech"] = speech_array
    return batch
dataset = dataset.map(map_to_array, remove_columns=["file"])

Is this still true?

In my case, ds["train.100"] returns:

Dataset({
    features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
    num_rows: 28539
})

and taking the first instance yields:

{'file': '374-180298-0000.flac',
 'audio': {'path': '374-180298-0000.flac',
  'array': array([ 7.01904297e-04,  7.32421875e-04,  7.32421875e-04, ...,
         -2.74658203e-04, -1.83105469e-04, -3.05175781e-05]),
  'sampling_rate': 16000},
 'text': 'CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED',
 'speaker_id': 374,
 'chapter_id': 180298,
 'id': '374-180298-0000'}

The audio array seems to be decoded already, so the convert/decode code from the doc quoted above is wrong?

But I wonder, is it actually stored as flac on disk, and the decoding is done on-the-fly? Or was it decoded already during the preparation and is stored as raw samples on disk?

Note that I also used datasets.load_dataset("librispeech_asr", "clean").save_to_disk(...) and then datasets.load_from_disk(...) in this example. Does this change anything on how it is stored on disk?

A small related question: Actually I would prefer to even store it as mp3 or ogg on disk. Is this easy to convert?


albertz commented Apr 20, 2022

(@patrickvonplaten )

patrickvonplaten commented

Also cc @lhoestq here

patrickvonplaten commented

The documentation in the code is definitely outdated - thanks for letting me know, I'll remove it in #4184 .

You're exactly right: the audio array is already decoded from the audio file to the correct waveform. This is done on the fly, which is also why one should not do ds["audio"]["array"][0] (this decodes all dataset samples), but instead ds[0]["audio"]["array"]. See: https://huggingface.co/docs/datasets/audio_process#audio-datasets


albertz commented Apr 20, 2022

So, again to clarify: On disk, only the raw flac file content is stored? Is this also the case after save_to_disk?

And is it simple to also store it re-encoded as ogg or mp3 instead?

patrickvonplaten commented

Hey,

Sorry, yeah, I was just about to look into this! We actually had an outdated version of Librispeech ASR that didn't save any files, but instead converted the audio files to a byte string, which was then decoded on-the-fly. This however is not very user-friendly, so we recently decided to instead expose the full path of the audio files via the path parameter.

I'm currently changing this for Librispeech here: #4184 .
You should be able to see the audio file in the original flac format under path then. I don't think it's a good idea to convert to MP3 out-of-the-box, but we could maybe think about some kind of convert function for audio datasets cc @lhoestq ?


albertz commented Apr 20, 2022

I don't think it's a good idea to convert to MP3 out-of-the-box, but we could maybe think about some kind of convert function for audio datasets cc @lhoestq ?

Sure, I would expect that load_dataset("librispeech_asr") gives you the original (not re-encoded) data, either flac or already decoded. Such re-encoding logic should then be a separate generic function, so that I could do something like dataset.reencode_as_ogg(**ogg_encode_opts).save_to_disk(...).


albertz commented Apr 21, 2022

A follow-up question: I wonder whether a Parquet dataset is maybe more what we actually want to have? (Following also my comment here: #4184 (comment).) Because I think we actually would prefer to embed the data content in the dataset.

So, instead of save_to_disk/load_from_disk, we would use to_parquet/from_parquet? Is there any downside? Are Arrow files more efficient?

Related is also the doc update in #4193.


lhoestq commented Apr 21, 2022

save_to_disk saves the dataset as an Arrow file, which is the format we use to load a dataset using memory mapping. This way the dataset does not fill your RAM, but is read from your disk instead.

Therefore you can directly reload a dataset saved with save_to_disk using load_from_disk.

Parquet files are used for cold storage: to use memory mapping on a Parquet dataset, you first have to convert it to Arrow. We use Parquet to reduce the I/O when pushing/downloading data from the Hugging Face Hub. When you load a Parquet file from the Hub, it is converted to Arrow on the fly during the download.
