Substantial update: We 1) abandon the explicit prediction of the F0 curve; 2) increase the receptive field of the denoiser; 3) make the linguistic encoder more robust. As a result, 1) the synthesized recordings are more natural in terms of pitch; 2) the pipeline is simpler.
In short, we let the generative model capture the dynamics of the F0 curve, instead of constraining log-domain F0 with an MSE loss as before.
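To make the change concrete, here is a minimal PyTorch sketch (not the repo's actual code; `denoiser`, `cond`, and the target tensors are hypothetical placeholders) of the diffusion objective that now carries pitch implicitly, next to the explicit log-F0 MSE term that was dropped:

```python
import torch
import torch.nn.functional as F

def q_sample(mel, alpha_bar_t, noise):
    # Forward diffusion: blend the clean mel with Gaussian noise at step t.
    # alpha_bar_t is the cumulative noise-schedule product, a scalar tensor in (0, 1).
    return alpha_bar_t.sqrt() * mel + (1.0 - alpha_bar_t).sqrt() * noise

def diffusion_loss(denoiser, mel, cond, alpha_bar_t):
    # Current objective: the denoiser only predicts the injected noise, so
    # pitch dynamics are modeled jointly with the rest of the spectrogram.
    noise = torch.randn_like(mel)
    noisy = q_sample(mel, alpha_bar_t, noise)
    return F.mse_loss(denoiser(noisy, cond), noise)

def dropped_f0_loss(pred_log_f0, log_f0, voiced_mask):
    # Previous objective added this explicit MSE constraint on log-domain F0
    # over voiced frames. It is removed in this update.
    diff = (pred_log_f0 - log_f0) * voiced_mask
    return (diff ** 2).sum() / voiced_mask.sum().clamp(min=1)
```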
- First, we would like to remind you that the MIDI version is not included in our AAAI paper; the camera-ready version of the paper will not be changed. Thus, the authors make no warranties regarding this part of the code/experiments.
- Second, there are many differences in model structure, especially in the melody frontend.
- Third, thanks to the Opencpop team for releasing their SVS dataset with MIDI labels on Jan. 20, 2022. (Also thanks to my co-author Yi Ren, who applied for the dataset and did some preprocessing work for this part.)
a) For the PopCS dataset: WIP. We may release the MIDI labels of PopCS in the future and update this part.
b) For the Opencpop dataset: please strictly follow the instructions of Opencpop. We have no right to grant you access to Opencpop.
The pipeline below is designed for the Opencpop dataset:
a) Download and extract Opencpop, then create a link to the dataset folder: ln -s /xxx/opencpop data/raw/
b) Run the following scripts to pack the dataset for training/inference.
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/midi/cascade/opencs/aux_rel.yaml
# `data/binary/opencpop-midi-dp` will be generated.
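For reference, here is a small sketch of what the binarizer consumes: each line of Opencpop's transcriptions.txt is pipe-separated (utterance id, lyric, phonemes, MIDI notes, note durations, phoneme durations, slur flags). The field names below are my own; consult Opencpop's documentation for the authoritative spec.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    item_name: str
    text: str
    phonemes: list
    notes: list
    note_durs: list
    ph_durs: list
    is_slur: list

def read_transcriptions(path):
    # Parse Opencpop's pipe-separated transcription file into segments.
    segments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            name, text, phs, notes, note_durs, ph_durs, slurs = line.strip().split("|")
            segments.append(Segment(
                item_name=name,
                text=text,
                phonemes=phs.split(),
                notes=notes.split(),
                note_durs=[float(d) for d in note_durs.split()],
                ph_durs=[float(d) for d in ph_durs.split()],
                is_slur=[int(s) for s in slurs.split()],
            ))
    return segments
```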
We provide the pre-trained model of HifiGAN-Singing, which is specially designed for SVS with the NSF mechanism.
Also, please unzip the pre-trained vocoder and this pendant for the vocoder into the checkpoints directory before training your acoustic model.
(Update: you can also move a ckpt with more training steps into this vocoder directory.)
This singing vocoder is trained on ~70 hours of singing data and can be viewed as a universal vocoder.
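The core idea of the NSF mechanism is to drive the vocoder with a sine excitation derived from F0 rather than pure noise. Below is a minimal sketch of that source signal (not the HifiGAN-Singing implementation; the sample rate and noise level are assumed values):

```python
import numpy as np

def sine_excitation(f0, hop_size=128, sr=24000, noise_std=0.003):
    # Upsample frame-level F0 (Hz, 0 for unvoiced) to sample level.
    f0_sample = np.repeat(np.asarray(f0, dtype=np.float64), hop_size)
    voiced = f0_sample > 0
    # Integrate instantaneous frequency to get phase, then take the sine.
    phase = 2.0 * np.pi * np.cumsum(f0_sample / sr)
    source = np.sin(phase) * voiced
    # Unvoiced regions fall back to low-level noise.
    source += noise_std * np.random.randn(len(f0_sample)) * (~voiced)
    return source.astype(np.float32)
```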
export MY_DS_EXP_NAME=0228_opencpop_ds100_rel
.
|--data
    |--raw
        |--opencpop
            |--segments
                |--transcriptions.txt
                |--wavs
|--checkpoints
    |--MY_DS_EXP_NAME (optional)
    |--0109_hifigan_bigpopcs_hop128 (vocoder)
        |--model_ckpt_steps_1512000.ckpt
        |--config.yaml
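Before launching training, a quick sanity check of the layout above can save a failed run. This helper is my own, not part of the repo:

```python
import os

EXPECTED = [
    "data/raw/opencpop/segments/transcriptions.txt",
    "data/raw/opencpop/segments/wavs",
    "checkpoints/0109_hifigan_bigpopcs_hop128/config.yaml",
]

def check_layout(root="."):
    # Report any missing piece of the expected directory tree.
    missing = [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]
    for p in missing:
        print("missing:", p)
    return not missing

if __name__ == "__main__":
    raise SystemExit(0 if check_layout() else 1)
```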
# training
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME --reset
# inference
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME --reset --infer
We also provide:
- the pre-trained model of DiffSinger;
They can be found here. Remember to put the pre-trained models in the checkpoints directory.
a) The HifiGAN-Singing vocoder is trained on our vocoder dataset and the training set of PopCS; Opencpop is an out-of-domain dataset (unseen speaker). This may cause a deterioration in audio quality, and we are considering fine-tuning this vocoder on the training set of Opencpop.
b) In this version of the code, we use the melody frontend ([lyric + MIDI] -> [ph_dur]) to predict phoneme durations; the F0 curve is implicitly predicted together with the mel-spectrogram. See the sketch after this list.
c) An example of generated audio. More generated audio demos can be found in DiffSinger.
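As referenced in b), here is a hedged sketch of the melody frontend's interface ([lyric + MIDI] -> [ph_dur]): embed phonemes and MIDI pitches, encode the sequence, and regress per-phoneme log-durations, FastSpeech-style. Module names and sizes are illustrative, not the repo's actual architecture:

```python
import torch
import torch.nn as nn

class MelodyFrontend(nn.Module):
    def __init__(self, n_phonemes=64, n_pitches=128, hidden=256):
        super().__init__()
        self.ph_emb = nn.Embedding(n_phonemes, hidden)
        self.midi_emb = nn.Embedding(n_pitches, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.dur_head = nn.Linear(2 * hidden, 1)

    def forward(self, ph_ids, midi_ids):
        # (batch, seq) -> (batch, seq, hidden): sum the two embeddings.
        x = self.ph_emb(ph_ids) + self.midi_emb(midi_ids)
        h, _ = self.encoder(x)
        # Predict one log-duration per phoneme.
        return self.dur_head(h).squeeze(-1)

# Usage: durations in frames, exponentiated and rounded at inference time.
model = MelodyFrontend()
ph = torch.randint(0, 64, (1, 10))
midi = torch.randint(0, 128, (1, 10))
log_dur = model(ph, midi)
frames = torch.clamp(torch.round(torch.exp(log_dur)), min=1)
```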