Official implementation of A Robust Text-to-Speech System in Bangla with Stochastic Duration Predictor
Authors: Mushahid Intesum*, Abdullah Ibne Masud*, Md Ashraful Islam**, Dr Md Rezaul Karim.
*Equal contribution. **Corresponding Author.
Text-to-speech (TTS), a field aiming to produce natural speech from text, is a prominent area of research in speech, language, and machine learning, with broad industrial applications. Despite Bangla being the seventh most spoken language globally, there is a significant shortage of high-quality audio data for TTS, automatic speech recognition, and other audio-related natural language processing tasks. To address this gap, we have compiled a meticulously curated single-speaker Bangla audio dataset. Following extensive preprocessing, our dataset contains more than 20 hours of clean audio featuring a diverse array of genres and sources, supplemented by novel metrics, including categorization based on sentence complexity, distribution of tense and person, as well as quantitative measurements such as word count, unique word count, and compound letter count. Our dataset, along with its distinctive evaluation metrics, fills a significant void in the evaluation of Bangla audio datasets, rendering it a valuable asset for future research. Additionally, we propose a novel TTS model employing diffusion and a duration predictor. Our model integrates a Stochastic Duration Predictor (SDP) to enhance alignment between input text and speech duration, alongside a context prediction network for improved word pronunciation. The SDP aims to emulate the variability observed in human speech, where the same sentence may be pronounced with different durations. This addition facilitates the generation of more natural-sounding audio samples with improved duration characteristics. Through blind subjective analysis using the Mean Opinion Score (MOS), we demonstrate that our proposed model improves upon the quality of the state-of-the-art GradTTS model.
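To make the stochastic-duration idea concrete, here is a toy sketch (not the paper's implementation; the class and variable names are made up) in which per-token log-durations are sampled from a predicted Gaussian, so repeated calls on the same text produce different total durations:

```python
# Illustrative sketch only: a stochastic duration predictor samples
# per-token durations instead of predicting a single deterministic value.
import torch
import torch.nn as nn

class ToySDP(nn.Module):
    def __init__(self, hidden_dim=192):
        super().__init__()
        # Predict mean and log-std of the log-duration for each text token.
        self.proj = nn.Linear(hidden_dim, 2)

    def forward(self, text_hidden):                   # (batch, tokens, hidden_dim)
        mean, log_std = self.proj(text_hidden).chunk(2, dim=-1)
        noise = torch.randn_like(mean)                # a new sample on every call
        log_dur = mean + noise * log_std.exp()
        return torch.clamp(log_dur.exp(), min=1.0)    # durations in frames

sdp = ToySDP()
h = torch.randn(1, 12, 192)                           # hidden states for 12 text tokens
print(sdp(h).sum().item(), sdp(h).sum().item())       # two different total durations
```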
Install all Python package requirements:
pip install -r requirements.txt
Note: the code is tested on Python 3.10.9.
You can download the BnTTS dataset (22 kHz) from here.
Put the necessary HiFi-GAN checkpoints into the `checkpts` folder in the root directory (note: in `inference.py` you can change the default HiFi-GAN path).
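As a quick check that the checkpoint is in place and readable, something like the following can be used; the file name `checkpts/hifigan.pt` is an assumption (use whichever checkpoint file you downloaded), and the actual loading logic lives in `inference.py`:

```python
# Hypothetical sanity check that a HiFi-GAN checkpoint loads;
# the exact file name under checkpts/ is an assumption.
import torch

ckpt = torch.load("checkpts/hifigan.pt", map_location="cpu")
print(type(ckpt), list(ckpt.keys())[:5] if isinstance(ckpt, dict) else "")
```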
- Create a text file with the sentences you want to synthesize, like `resources/filelists/synthesis.txt`.
- For single-speaker synthesis, set `params.n_spks=1`.
- Run the `inference.py` script, providing the path to the text file, the path to the model checkpoint, the number of iterations to be used for reverse diffusion (default: 10), and the speaker id if you want to perform multispeaker inference:

python inference.py -f <your-text-file> -c <bn-tts-checkpoint> -t <number-of-timesteps>

- Check the folder called `out` for the generated audios (a small sanity-check sketch follows this list).
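For example, a quick way to inspect the generated files (assuming the `soundfile` package is installed; the dataset and vocoder operate at 22 kHz):

```python
# Hypothetical sanity check over the generated audios in out/.
import glob
import soundfile as sf  # assumption: soundfile is installed

for path in sorted(glob.glob("out/*.wav")):
    audio, sr = sf.read(path)
    print(f"{path}: {len(audio) / sr:.2f} s at {sr} Hz")
```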
- Make filelists of your audio data like the ones included in the `resources/filelists` folder. Make a new folder named `data` and put the audio files inside `wavs` and the text files inside `text` folders, respectively.
- Make a `metadata.txt` file that has each audio file name and the corresponding text (a helper sketch for building it follows this list). An example line is:

1233456|this is the text

- Set the experiment configuration in the `params.py` file.
- Specify your GPU device and run the training script:

python train.py

- To track your training process, run a TensorBoard server on any available port:

tensorboard --logdir=YOUR_LOG_DIR --port=8888

During training, all logging information and checkpoints are stored in `YOUR_LOG_DIR`, which you can specify in `params.py` before training.
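Here is a minimal sketch (not part of the repository) for assembling `metadata.txt` from the `data/wavs` and `data/text` layout described above; the assumptions that each `.wav` has a matching `.txt` with the same stem and that the metadata file lives under `data/` are mine:

```python
# Hypothetical helper to build metadata.txt ("name|text" per line) from
# the data/wavs and data/text folders; file locations are assumptions.
from pathlib import Path

wav_dir = Path("data/wavs")
text_dir = Path("data/text")

lines = []
for wav_path in sorted(wav_dir.glob("*.wav")):
    txt_path = text_dir / (wav_path.stem + ".txt")  # assumed one .txt per .wav
    if txt_path.exists():
        text = txt_path.read_text(encoding="utf-8").strip()
        lines.append(f"{wav_path.stem}|{text}")

Path("data/metadata.txt").write_text("\n".join(lines), encoding="utf-8")
print(f"Wrote {len(lines)} entries to data/metadata.txt")
```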
If you'd like to run the SDP+Context model, simply comment out the parts under the GradTTSSDP model comments and uncomment the parts under GradTTSSDPContext in `train.py` and `inference.py`.