Official implementation of A Robust Text-to-Speech System in Bangla with Stochastic Duration Predictor
Authors: Mushahid Intesum*, Abdullah Ibne Masud*, Md Ashraful Islam**, Dr Md Rezaul Karim.
*Equal contribution. **Corresponding Author.
Text-to-speech (TTS), a field aiming to produce natural speech from text, is a prominent area of research in speech, language, and machine learning, with broad industrial applications. Despite Bangla being the seventh most spoken language globally, there is a significant shortage of high-quality audio data for TTS, automatic speech recognition, and other audio-related natural language processing tasks. To address this gap, we have compiled a meticulously curated single-speaker Bangla audio dataset. Following extensive preprocessing, our dataset contains more than 20 hours of clean audio featuring a diverse array of genres and sources, supplemented by novel metrics, including categorization based on sentence complexity, distribution of tense and person, as well as quantitative measurements such as word count, unique word count, and compound letter count. Our dataset, along with its distinctive evaluation metrics, fills a significant void in the evaluation of Bangla audio datasets, rendering it a valuable asset for future research. Additionally, we propose a novel TTS model employing diffusion and a duration predictor. Our model integrates a Stochastic Duration Predictor (SDP) to enhance alignment between input text and speech duration, alongside a context prediction network for improved word pronunciation. The SDP aims to emulate the variability observed in human speech, where the same sentence may be pronounced with different durations. This addition facilitates the generation of more natural-sounding audio samples with improved duration characteristics. Through blind subjective analysis using the Mean Opinion Score (MOS), we demonstrate that our proposed model improves upon the quality of the state-of-the-art GradTTS model.
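To make the stochastic-duration idea concrete, here is a toy sketch (not the paper's implementation; the class and variable names are made up) in which per-token log-durations are sampled from a predicted Gaussian, so repeated calls on the same text produce different total durations:

```python
# Illustrative sketch only: a stochastic duration predictor samples
# per-token durations instead of predicting a single deterministic value.
import torch
import torch.nn as nn

class ToySDP(nn.Module):
    def __init__(self, hidden_dim=192):
        super().__init__()
        # Predict mean and log-std of the log-duration for each text token.
        self.proj = nn.Linear(hidden_dim, 2)

    def forward(self, text_hidden):                   # (batch, tokens, hidden_dim)
        mean, log_std = self.proj(text_hidden).chunk(2, dim=-1)
        noise = torch.randn_like(mean)                # a new sample on every call
        log_dur = mean + noise * log_std.exp()
        return torch.clamp(log_dur.exp(), min=1.0)    # durations in frames

sdp = ToySDP()
h = torch.randn(1, 12, 192)                           # hidden states for 12 text tokens
print(sdp(h).sum().item(), sdp(h).sum().item())       # two different total durations
```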
Install all Python package requirements:
pip install -r requirements.txt
Note: the code is tested on Python 3.10.9.
You can download the BnTTS dataset (22 kHz) from here.
Put the necessary HiFi-GAN checkpoints into the `checkpts` folder in the root directory (note: in `inference.py` you can change the default HiFi-GAN path).
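As a quick check that the checkpoint is in place and readable, something like the following can be used; the file name `checkpts/hifigan.pt` is an assumption (use whichever checkpoint file you downloaded), and the actual loading logic lives in `inference.py`:

```python
# Hypothetical sanity check that a HiFi-GAN checkpoint loads;
# the exact file name under checkpts/ is an assumption.
import torch

ckpt = torch.load("checkpts/hifigan.pt", map_location="cpu")
print(type(ckpt), list(ckpt.keys())[:5] if isinstance(ckpt, dict) else "")
```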
- Create a text file with the sentences you want to synthesize, like `resources/filelists/synthesis.txt`.
- For single-speaker synthesis, set `params.n_spks=1`.
- Run the `inference.py` script, providing the path to the text file, the path to the model checkpoint, the number of iterations to be used for reverse diffusion (default: 10), and the speaker id if you want to perform multispeaker inference:

python inference.py -f <your-text-file> -c <bn-tts-checkpoint> -t <number-of-timesteps>

- Check the folder called `out` for the generated audios (a small sanity-check sketch follows this list).
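For example, a quick way to inspect the generated files (assuming the `soundfile` package is installed; the dataset and vocoder operate at 22 kHz):

```python
# Hypothetical sanity check over the generated audios in out/.
import glob
import soundfile as sf  # assumption: soundfile is installed

for path in sorted(glob.glob("out/*.wav")):
    audio, sr = sf.read(path)
    print(f"{path}: {len(audio) / sr:.2f} s at {sr} Hz")
```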
- Make filelists of your audio data like the ones included in the `resources/filelists` folder. Make a new folder named `data` and put the audio files inside `wavs` and the text files inside `text` folders, respectively.
- Make a `metadata.txt` file that has each audio file name and the corresponding text (a helper sketch for building it follows this list). An example line is:

1233456|this is the text

- Set the experiment configuration in the `params.py` file.
- Specify your GPU device and run the training script:

python train.py

- To track your training process, run a TensorBoard server on any available port:

tensorboard --logdir=YOUR_LOG_DIR --port=8888

During training, all logging information and checkpoints are stored in `YOUR_LOG_DIR`, which you can specify in `params.py` before training.
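Here is a minimal sketch (not part of the repository) for assembling `metadata.txt` from the `data/wavs` and `data/text` layout described above; the assumptions that each `.wav` has a matching `.txt` with the same stem and that the metadata file lives under `data/` are mine:

```python
# Hypothetical helper to build metadata.txt ("name|text" per line) from
# the data/wavs and data/text folders; file locations are assumptions.
from pathlib import Path

wav_dir = Path("data/wavs")
text_dir = Path("data/text")

lines = []
for wav_path in sorted(wav_dir.glob("*.wav")):
    txt_path = text_dir / (wav_path.stem + ".txt")  # assumed one .txt per .wav
    if txt_path.exists():
        text = txt_path.read_text(encoding="utf-8").strip()
        lines.append(f"{wav_path.stem}|{text}")

Path("data/metadata.txt").write_text("\n".join(lines), encoding="utf-8")
print(f"Wrote {len(lines)} entries to data/metadata.txt")
```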
If you'd like to run the SDP+Context model, simply comment out the parts under the GradTTSSDP model comments and uncomment the parts under GradTTSSDPContext in `train.py` and `inference.py`.