Substantial update: We 1) abandon the explicit prediction of the F0 curve; 2) increase the receptive field of the denoiser; 3) make the linguistic encoder more robust. As a result, 1) the synthesized recordings are more natural in terms of pitch; 2) the pipeline is simpler.
In short, we let the generative model capture the dynamics of the F0 curve, instead of constraining log-domain F0 with an MSE loss as before.
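To make the change concrete, here is a minimal PyTorch sketch (not the repo's actual code; `denoiser`, `cond`, and the target tensors are hypothetical placeholders) of the diffusion objective that now carries pitch implicitly, next to the explicit log-F0 MSE term that was dropped:

```python
import torch
import torch.nn.functional as F

def q_sample(mel, alpha_bar_t, noise):
    # Forward diffusion: blend the clean mel with Gaussian noise at step t.
    # alpha_bar_t is the cumulative noise-schedule product, a scalar tensor in (0, 1).
    return alpha_bar_t.sqrt() * mel + (1.0 - alpha_bar_t).sqrt() * noise

def diffusion_loss(denoiser, mel, cond, alpha_bar_t):
    # Current objective: the denoiser only predicts the injected noise, so
    # pitch dynamics are modeled jointly with the rest of the spectrogram.
    noise = torch.randn_like(mel)
    noisy = q_sample(mel, alpha_bar_t, noise)
    return F.mse_loss(denoiser(noisy, cond), noise)

def dropped_f0_loss(pred_log_f0, log_f0, voiced_mask):
    # Previous objective added this explicit MSE constraint on log-domain F0
    # over voiced frames. It is removed in this update.
    diff = (pred_log_f0 - log_f0) * voiced_mask
    return (diff ** 2).sum() / voiced_mask.sum().clamp(min=1)
```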
- First, we would like to remind you that the MIDI version is not included in our AAAI paper; the camera-ready version of the paper will not be changed. Thus, the authors make no warranties regarding this part of the code/experiments.
- Second, there are many differences in model structure, especially in the melody frontend.
- Third, thanks to the Opencpop team for releasing their SVS dataset with MIDI labels on Jan. 20, 2022. (Also thanks to my co-author Yi Ren, who applied for the dataset and did some preprocessing work for this part.)
a) For the PopCS dataset: WIP. We may release the MIDI labels of PopCS in the future and update this part.
b) For the Opencpop dataset: please strictly follow the instructions of Opencpop. We have no right to grant you access to Opencpop.
The pipeline below is designed for the Opencpop dataset:
a) Download and extract Opencpop, then create a link to the dataset folder: ln -s /xxx/opencpop data/raw/
b) Run the following scripts to pack the dataset for training/inference.
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/midi/cascade/opencs/aux_rel.yaml
# `data/binary/opencpop-midi-dp` will be generated.
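For reference, here is a small sketch of what the binarizer consumes: each line of Opencpop's transcriptions.txt is pipe-separated (utterance id, lyric, phonemes, MIDI notes, note durations, phoneme durations, slur flags). The field names below are my own; consult Opencpop's documentation for the authoritative spec.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    item_name: str
    text: str
    phonemes: list
    notes: list
    note_durs: list
    ph_durs: list
    is_slur: list

def read_transcriptions(path):
    # Parse Opencpop's pipe-separated transcription file into segments.
    segments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            name, text, phs, notes, note_durs, ph_durs, slurs = line.strip().split("|")
            segments.append(Segment(
                item_name=name,
                text=text,
                phonemes=phs.split(),
                notes=notes.split(),
                note_durs=[float(d) for d in note_durs.split()],
                ph_durs=[float(d) for d in ph_durs.split()],
                is_slur=[int(s) for s in slurs.split()],
            ))
    return segments
```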
We provide the pre-trained model of HifiGAN-Singing, which is specially designed for SVS with the NSF mechanism.
Also, please unzip the pre-trained vocoder and this pendant for the vocoder into the checkpoints directory before training your acoustic model.
(Update: you can also move a ckpt with more training steps into this vocoder directory.)
This singing vocoder is trained on ~70 hours of singing data and can be viewed as a universal vocoder.
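The core idea of the NSF mechanism is to drive the vocoder with a sine excitation derived from F0 rather than pure noise. Below is a minimal sketch of that source signal (not the HifiGAN-Singing implementation; the sample rate and noise level are assumed values):

```python
import numpy as np

def sine_excitation(f0, hop_size=128, sr=24000, noise_std=0.003):
    # Upsample frame-level F0 (Hz, 0 for unvoiced) to sample level.
    f0_sample = np.repeat(np.asarray(f0, dtype=np.float64), hop_size)
    voiced = f0_sample > 0
    # Integrate instantaneous frequency to get phase, then take the sine.
    phase = 2.0 * np.pi * np.cumsum(f0_sample / sr)
    source = np.sin(phase) * voiced
    # Unvoiced regions fall back to low-level noise.
    source += noise_std * np.random.randn(len(f0_sample)) * (~voiced)
    return source.astype(np.float32)
```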
export MY_DS_EXP_NAME=0228_opencpop_ds100_rel
.
|--data
    |--raw
        |--opencpop
            |--segments
                |--transcriptions.txt
                |--wavs
|--checkpoints
    |--MY_DS_EXP_NAME (optional)
    |--0109_hifigan_bigpopcs_hop128 (vocoder)
        |--model_ckpt_steps_1512000.ckpt
        |--config.yaml
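Before launching training, a quick sanity check of the layout above can save a failed run. This helper is my own, not part of the repo:

```python
import os

EXPECTED = [
    "data/raw/opencpop/segments/transcriptions.txt",
    "data/raw/opencpop/segments/wavs",
    "checkpoints/0109_hifigan_bigpopcs_hop128/config.yaml",
]

def check_layout(root="."):
    # Report any missing piece of the expected directory tree.
    missing = [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]
    for p in missing:
        print("missing:", p)
    return not missing

if __name__ == "__main__":
    raise SystemExit(0 if check_layout() else 1)
```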
# training
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME --reset
# inference
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME --reset --infer
We also provide:
- the pre-trained model of DiffSinger;
They can be found here. Remember to put the pre-trained models in the checkpoints directory.
a) The HifiGAN-Singing vocoder is trained on our vocoder dataset and the training set of PopCS; Opencpop is an out-of-domain dataset (unseen speaker). This may cause a deterioration in audio quality, and we are considering fine-tuning this vocoder on the training set of Opencpop.
b) In this version of the code, we use the melody frontend ([lyric + MIDI] -> [ph_dur]) to predict phoneme durations; the F0 curve is implicitly predicted together with the mel-spectrogram. See the sketch after this list.
c) An example of generated audio. More generated audio demos can be found in DiffSinger.
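As referenced in b), here is a hedged sketch of the melody frontend's interface ([lyric + MIDI] -> [ph_dur]): embed phonemes and MIDI pitches, encode the sequence, and regress per-phoneme log-durations, FastSpeech-style. Module names and sizes are illustrative, not the repo's actual architecture:

```python
import torch
import torch.nn as nn

class MelodyFrontend(nn.Module):
    def __init__(self, n_phonemes=64, n_pitches=128, hidden=256):
        super().__init__()
        self.ph_emb = nn.Embedding(n_phonemes, hidden)
        self.midi_emb = nn.Embedding(n_pitches, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.dur_head = nn.Linear(2 * hidden, 1)

    def forward(self, ph_ids, midi_ids):
        # (batch, seq) -> (batch, seq, hidden): sum the two embeddings.
        x = self.ph_emb(ph_ids) + self.midi_emb(midi_ids)
        h, _ = self.encoder(x)
        # Predict one log-duration per phoneme.
        return self.dur_head(h).squeeze(-1)

# Usage: durations in frames, exponentiated and rounded at inference time.
model = MelodyFrontend()
ph = torch.randint(0, 64, (1, 10))
midi = torch.randint(0, 128, (1, 10))
log_dur = model(ph, midi)
frames = torch.clamp(torch.round(torch.exp(log_dur)), min=1)
```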