Initial commit
keonlee9420 committed Sep 24, 2021
0 parents commit 33d53c7
Showing 79 changed files with 332,292 additions and 0 deletions.
120 changes: 120 additions & 0 deletions .gitignore
@@ -0,0 +1,120 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

__pycache__
.vscode
.DS_Store

# MFA
montreal-forced-aligner/

# data, checkpoint, and models
raw_data/
output/
*.npy
TextGrid/
hifigan/*.pth.tar
*.out
deepspeaker/pretrained_models/*
11 changes: 11 additions & 0 deletions CITATION.cff
@@ -0,0 +1,11 @@
cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Lee"
  given-names: "Keon"
  orcid: "https://orcid.org/0000-0001-9028-1018"
title: "Comprehensive-Transformer-TTS"
version: 0.1.0
doi: ___
date-released: 2021-08-25
url: "https://github.com/keonlee9420/Comprehensive-Transformer-TTS"
44 changes: 44 additions & 0 deletions Dockerfile
@@ -0,0 +1,44 @@
FROM nvcr.io/nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04
ARG UID
ARG USER_NAME

WORKDIR /workspace

RUN apt-get update && apt-get install -y --no-install-recommends \
    apt-utils \
    build-essential \
    ca-certificates \
    curl \
    cmake \
    ffmpeg \
    git \
    python3-pip \
    python3-setuptools \
    python3-dev \
    sudo \
    ssh \
    unzip \
    vim \
    wget && rm -rf /var/lib/apt/lists/*

# RUN curl -o /tmp/miniconda.sh -sSL http://repo.continuum.io/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh && \
# bash /tmp/miniconda.sh -bfp /usr/local && \
# rm -rf /tmp/miniconda.sh
# RUN conda update -y conda

COPY requirements.txt requirements.txt

RUN pip3 install --upgrade pip setuptools wheel
RUN pip3 install -r requirements.txt

RUN adduser $USER_NAME --uid $UID --quiet --gecos "" --disabled-password && \
    echo "$USER_NAME ALL=(root) NOPASSWD:ALL" > /etc/sudoers.d/$USER_NAME && \
    chmod 0440 /etc/sudoers.d/$USER_NAME

RUN echo "PasswordAuthentication yes" >> /etc/ssh/sshd_config
RUN echo "PermitEmptyPasswords yes" >> /etc/ssh/sshd_config
RUN echo "UsePAM no" >> /etc/ssh/sshd_config

USER $USER_NAME

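# Likely intended for TensorBoard instances (6006 is TensorBoard's default port)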
EXPOSE 6006 6007 6008 6009
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2021 Keon Lee

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
163 changes: 163 additions & 0 deletions README.md
@@ -0,0 +1,163 @@
# Comprehensive-Transformer-TTS - PyTorch Implementation

A Non-Autoregressive Transformer-based TTS, supporting a family of SOTA transformer architectures with both supervised and unsupervised duration modeling. This project grows with the research community, **aiming to achieve the ultimate TTS**.

### Transformers
- [x] [Fastformer: Additive Attention Can Be All You Need](https://arxiv.org/abs/2108.09084) (Wu et al., 2021)
- [ ] [Long-Short Transformer: Efficient Transformers for Language and Vision](https://arxiv.org/abs/2107.02192) (Zhu et al., 2021)
- [x] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100) (Gulati et al., 2020)
- [ ] [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) (Kitaev et al., 2020)
- [x] [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017)

### Supervised Duration Modeling
- [x] [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558) (Ren et al., 2020)

### Unsupervised Duration Modeling
- [x] [One TTS Alignment To Rule Them All](https://arxiv.org/abs/2108.10447) (Badlani et al., 2021)

### Transformer Performance Comparison on LJSpeech (1× TITAN RTX 24 GB, batch size 16)
| Model | Memory Usage | Training Time (1K steps) |
| --- | --- | --- |
| Fastformer (lucidrains') | 10531MiB / 24220MiB | 4m 25s |
| Fastformer (wuch15's) | 10515MiB / 24220MiB | 4m 45s |
| Long-Short Transformer | - | - |
| Conformer | 18903MiB / 24220MiB | 7m 4s |
| Reformer | - | - |
| Transformer | 7909MiB / 24220MiB | 4m 51s |

Toggle the type of building block with
```yaml
# In model.yaml
block_type: "transformer" # ["transformer", "fastformer", "conformer"]
```
Toggle the type of duration modeling with
```yaml
# In model.yaml
duration_modeling:
  learn_alignment: True # True for unsupervised modeling, False for supervised modeling
```
# Quickstart
***DATASET*** refers to the name of a dataset, such as `LJSpeech` or `VCTK`, in the following sections.

## Dependencies
You can install the Python dependencies with
```
pip3 install -r requirements.txt
```
A `Dockerfile` is also provided for Docker users.
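For Docker users, a possible build-and-run sequence is sketched below; the image name `comprehensive-tts`, the bind mount, and the published port are arbitrary choices, and `--gpus all` assumes the NVIDIA Container Toolkit is installed.
```
docker build --build-arg UID=$(id -u) --build-arg USER_NAME=$(whoami) -t comprehensive-tts .
docker run --gpus all -it --rm -v $(pwd):/workspace -p 6006:6006 comprehensive-tts
```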
## Inference
You have to download the [pretrained models](https://drive.google.com/drive/folders/1xEOVbv3PLfGX8EgEkzg1014c9h8QMxQ-?usp=sharing) and put them in `output/ckpt/DATASET/`. The provided model is trained on LJSpeech with unsupervised duration modeling and transformer building blocks.
For a **single-speaker TTS**, run
```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET
```
For a **multi-speaker TTS**, run
```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
```
The dictionary of learned speakers can be found at `preprocessed_data/DATASET/speakers.json`, and the generated utterances will be put in `output/result/`.
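For convenience, the sketch below shows one way to look up a speaker ID from that dictionary; the VCTK path and the assumed layout (speaker name mapped to an integer ID, e.g. `"p225"`) are illustrative assumptions, so adjust them to your data.
```python
import json

# Load the learned-speaker dictionary written during preprocessing.
with open("preprocessed_data/VCTK/speakers.json") as f:
    speakers = json.load(f)  # assumed layout: {"p225": 0, "p226": 1, ...}

print(f"{len(speakers)} speakers found")
speaker_id = speakers["p225"]  # value to pass as --speaker_id (speaker name assumed to exist)
```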
## Batch Inference
Batch inference is also supported; try
```
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
```
to synthesize all utterances in `preprocessed_data/DATASET/val.txt`.
## Controllability
The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios.
For example, one can increase the speaking rate by 20% and decrease the volume by 20% with
```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8
```
Add ***--speaker_id SPEAKER_ID*** for a multi-speaker TTS.
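Pitch can be adjusted in the same way; assuming the pitch ratio is exposed as `--pitch_control`, mirroring the duration and energy flags above, raising the pitch by 20% would look like
```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --pitch_control 1.2
```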
# Training
## Datasets
The supported datasets are
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): a **single-speaker** English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
- [VCTK](https://datashare.ed.ac.uk/handle/10283/3443): The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (**multi-speaker TTS**) with various accents. Each speaker reads about 400 sentences selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive.

Any other **single-speaker TTS** dataset (e.g., [Blizzard Challenge 2013](https://www.synsig.org/index.php/Blizzard_Challenge_2013)) or **multi-speaker TTS** dataset (e.g., [LibriTTS](https://openslr.org/60/)) can be added by following the LJSpeech or VCTK recipe, respectively. Moreover, **your own language and dataset** can be adapted by following [Expressive-FastSpeech2](https://github.com/keonlee9420/Expressive-FastSpeech2).
## Preprocessing
- For a **multi-speaker TTS** with an external speaker embedder, download the [ResCNN Softmax+Triplet pretrained model](https://drive.google.com/file/d/1F9NvdrarWZNktdX9KlRYWWHDwRkip_aP) of [philipperemy's DeepSpeaker](https://github.com/philipperemy/deep-speaker) for the speaker embedding and place it in `./deepspeaker/pretrained_models/`.
- Run
```
python3 prepare_align.py --dataset DATASET
```
for some preparations.
For the forced alignment, [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Pre-extracted alignments for the datasets are provided [here](https://drive.google.com/drive/folders/1fizpyOiQ1lG2UDaMlXnT3Ll4_j6Xwg7K?usp=sharing).
You have to unzip the files into `preprocessed_data/DATASET/TextGrid/`. Alternatively, you can [run the aligner by yourself](https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html), as sketched below.
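If you run MFA yourself, one possible invocation is sketched below (MFA 2.x command style; the corpus directory, lexicon path, and `english` acoustic model name are placeholders that depend on your MFA version and setup):
```
mfa align raw_data/DATASET lexicon/librispeech-lexicon.txt english preprocessed_data/DATASET/TextGrid
```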
After that, run the preprocessing script by
```
python3 preprocess.py --dataset DATASET
```
## Training
Train your model with
```
python3 train.py --dataset DATASET
```
Useful options:
- [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html) is supported via the `--use_amp` argument.
- Single-node multi-GPU training is assumed. To use a specific GPU, prefix the above command with `CUDA_VISIBLE_DEVICES=<GPU_ID>`, as in the example below.
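For example, to train on a single GPU with mixed precision enabled (GPU 0 and LJSpeech are used here purely for illustration):
```
CUDA_VISIBLE_DEVICES=0 python3 train.py --dataset LJSpeech --use_amp
```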
# TensorBoard
Use
```
tensorboard --logdir output/log
```
to serve TensorBoard on your localhost.
<!-- The loss curves, synthesized mel-spectrograms, and audios are shown.
![](./img/tensorboard_loss.png)
![](./img/tensorboard_spec.png)
![](./img/tensorboard_audio.png) -->
# Notes
- Both phoneme-level and frame-level variance are supported in both supervised and unsupervised duration modeling.
- Note that there are no pre-extracted phoneme-level variance features in unsupervised duration modeling.
- Convolutional embedding is used, as in [StyleSpeech](https://github.com/keonlee9420/StyleSpeech), for phoneme-level variance in unsupervised duration modeling. Otherwise, bucket-based embedding is used, as in [FastSpeech2](https://github.com/ming024/FastSpeech2).
- Unsupervised duration modeling at the phoneme level takes longer than at the frame level, since the additional computation of phoneme-level variance happens at runtime.
- There are two speaker-embedding options for the **multi-speaker TTS** setting: training a speaker embedder from scratch or using a pre-trained [philipperemy's DeepSpeaker](https://github.com/philipperemy/deep-speaker) model (as [STYLER](https://github.com/keonlee9420/STYLER) did). You can toggle between them in the config (between `'none'` and `'DeepSpeaker'`); see the sketch after this list.
- DeepSpeaker on the VCTK dataset shows clear separation among speakers. The following figure shows a t-SNE plot of the extracted speaker embeddings.
<p align="center">
<img src="./preprocessed_data/VCTK/spker_embed_tsne.png" width="40%">
</p>
- For the vocoder, **HiFi-GAN** and **MelGAN** are supported.
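A minimal sketch of the speaker-embedder toggle is given below; the config file and key name (`model.yaml`, `speaker_embedder`) are assumptions based on similar recipes, so check the actual config for the exact location.
```yaml
# Assumed key name; set "none" to train a speaker embedder from scratch.
speaker_embedder: "DeepSpeaker" # ["none", "DeepSpeaker"]
```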
# Citation
Please cite this repository using the "[Cite this repository](https://github.blog/2021-08-19-enhanced-support-citations-github/)" feature in the **About** section (top right of the main page).
# References
- [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2)
- [wuch15's Fastformer](https://github.com/wuch15/Fastformer)
- [lucidrains' fast-transformer-pytorch](https://github.com/lucidrains/fast-transformer-pytorch)
- [sooftware's conformer](https://github.com/sooftware/conformer)
- [NVIDIA's NeMo](https://github.com/NVIDIA/NeMo): special thanks to [Onur Babacan](https://github.com/babua) and [Rafael Valle](https://github.com/rafaelvalle) for the unsupervised duration modeling.
3 changes: 3 additions & 0 deletions audio/__init__.py
@@ -0,0 +1,3 @@
import audio.tools
import audio.stft
import audio.audio_processing
