
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism


Substantial update: We 1) abandon the explicit prediction of the F0 curve; 2) increase the receptive field of the denoiser; 3) make the linguistic encoder more robust. As a result, 1) the synthesized recordings are more natural in terms of pitch, and 2) the pipeline is simpler.

In short, we now let the generative model capture the dynamics of the F0 curve, instead of constraining log-domain F0 with an MSE loss as before.

DiffSinger (MIDI version SVS)

  • First, we would like to remind you that the MIDI version is not included in our AAAI paper. The camera-ready version of the paper won't be changed, so the authors make no warranties regarding this part of the code/experiments.
  • Second, there are many differences in model structure from the version described in the paper, especially in the melody frontend.
  • Third, thanks to the Opencpop team for releasing their SVS dataset with MIDI labels on Jan. 20, 2022. (Also thanks to my co-author Yi Ren, who applied for the dataset and did some preprocessing work for this part.)

0. Data Acquisition

a) For the PopCS dataset: WIP. We may release the MIDI labels of PopCS in the future and update this part.

b) For the Opencpop dataset: Please strictly follow the instructions of Opencpop. We have no right to grant you access to Opencpop.

The pipeline below is designed for the Opencpop dataset:

1. Preparation

Data Preparation

a) Download and extract Opencpop, then create a link to the dataset folder: ln -s /xxx/opencpop data/raw/
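For example, assuming Opencpop was extracted to /path/to/opencpop (a placeholder; adjust to your actual location):

mkdir -p data/raw
ln -s /path/to/opencpop data/raw/opencpop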

b) Run the following scripts to pack the dataset for training/inference.

export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/midi/cascade/opencs/aux_rel.yaml

# `data/binary/opencpop-midi-dp` will be generated.
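A quick sanity check after binarization (the output folder name comes from the config above):

ls data/binary/opencpop-midi-dp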

Vocoder Preparation

We provide a pre-trained HifiGAN-Singing model, which is specially designed for SVS with the NSF mechanism.

Please unzip the pre-trained vocoder and its pendant file into the checkpoints directory before training your acoustic model.

(Update: you can also put a checkpoint with more training steps into this vocoder directory.)

This singing vocoder is trained on ~70 hours of singing data, so it can be viewed as a universal vocoder.
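A possible unpacking sequence, assuming the downloaded archive is named 0109_hifigan_bigpopcs_hop128.zip (the actual file names depend on the download links above):

mkdir -p checkpoints
unzip 0109_hifigan_bigpopcs_hop128.zip -d checkpoints/
# checkpoints/0109_hifigan_bigpopcs_hop128/ should now contain config.yaml
# and a model_ckpt_steps_*.ckpt file, matching the layout shown below.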

Exp Name Preparation

export MY_DS_EXP_NAME=0228_opencpop_ds100_rel
.
|--data
    |--raw
        |--opencpop
            |--segments
                |--transcriptions.txt
                |--wavs
|--checkpoints
    |--MY_DS_EXP_NAME (optional)
    |--0109_hifigan_bigpopcs_hop128 (vocoder)
        |--model_ckpt_steps_1512000.ckpt
        |--config.yaml

2. Training Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME --reset  
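Optionally, you can monitor training with TensorBoard, assuming the training logs are written under the experiment's checkpoint directory:

tensorboard --logdir checkpoints/$MY_DS_EXP_NAME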

3. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME --reset --infer
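The synthesized wavs are saved under the experiment's checkpoint directory (the exact subfolder name may differ between code versions):

ls checkpoints/$MY_DS_EXP_NAME/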

We also provide:

  • the pre-trained model of DiffSinger.

It can be found here.

Remember to put the pre-trained models in the checkpoints directory.

4. Some Issues

a) HifiGAN-Singing is trained on our vocoder dataset and the training set of PopCS. Opencpop is an out-of-domain dataset (unseen speaker), which may degrade audio quality; we are considering fine-tuning this vocoder on the training set of Opencpop.

b) in this version of the code, we use the melody frontend ([lyric + MIDI] -> [ph_dur]) to predict phoneme durations. The F0 curve is implicitly predicted together with the mel-spectrogram.

c) example generated audio: more generated audio demos can be found in DiffSinger.