SABER is a PyTorch project, currently under research, that aims to provide easily reproducible baselines for automatic speech recognition using semi-supervised learning. Contributions are welcome.
SABER consists of the following components:
- Several state-of-the-art models, including a MixNet-based variant of NVIDIA's QuartzNet.
- Ranger (RAdam + Lookahead) optimizer, which removes the need for the long learning-rate warmup used in the SpecAugment paper.
- Mish activation function (sketched below).
- Data augmentations: SpecNoise, SpecAugment, SpecSparkle (a Cutout-inspired variant), and SpecBlur (a novel approach). Augmentation parameters increase linearly over training in a curriculum-based schedule (sketched below).
- Aggregation Cross-Entropy (ACE) loss in place of CTC loss for easier training (sketched below).
- Unsupervised Data Augmentation (UDA) as the means of semi-supervised learning (sketched below).
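Mish is defined as `x * tanh(softplus(x))`; a minimal PyTorch sketch (recent PyTorch versions also ship a built-in `torch.nn.Mish`):

```python
import torch
import torch.nn.functional as F

class Mish(torch.nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.tanh(F.softplus(x))
```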
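The curriculum schedule is just a linear ramp of augmentation strength over training; a minimal sketch, where the ramp length and the way strength scales a SpecAugment mask width are assumptions, not the repo's actual parameters:

```python
def augmentation_strength(step: int, ramp_steps: int = 10_000) -> float:
    """Linearly ramp augmentation strength from 0 to 1 over ramp_steps updates (assumed length)."""
    return min(step / ramp_steps, 1.0)

# Example: scale a SpecAugment time-mask width by the current strength.
base_time_mask = 100
time_mask_width = int(base_time_mask * augmentation_strength(step=2_500))  # -> 25
```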
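ACE replaces CTC's per-frame alignment with a cross-entropy between aggregated class counts, following the Aggregation Cross-Entropy paper; a hedged sketch (the shapes and blank-count convention follow my reading of the paper, not necessarily this repo's code):

```python
import torch

def ace_loss(probs: torch.Tensor, label_counts: torch.Tensor) -> torch.Tensor:
    """Aggregation Cross-Entropy loss.

    probs:        (T, B, C) per-frame softmax outputs, class 0 = blank.
    label_counts: (B, C) occurrence count of each class in the reference;
                  by the paper's convention, label_counts[:, 0] = T - label length.
    """
    T = probs.size(0)
    agg = probs.sum(dim=0) / T      # aggregated per-class probability mass, (B, C)
    ref = label_counts.float() / T  # normalized reference counts, (B, C)
    return -(ref * torch.log(agg + 1e-10)).sum(dim=1).mean()
```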
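UDA's semi-supervised term is a consistency loss on unlabeled audio: predictions on an augmented copy are pushed toward predictions on the clean copy; a minimal sketch, where `model` and `augment` are assumed callables rather than names from this repo:

```python
import torch
import torch.nn.functional as F

def uda_consistency_loss(model, unlabeled, augment) -> torch.Tensor:
    # Target distribution from the clean input, held fixed (no gradient).
    with torch.no_grad():
        target = F.softmax(model(unlabeled), dim=-1)
    # Prediction on the augmented copy (e.g. SpecAugment applied to the features).
    pred_log = F.log_softmax(model(augment(unlabeled)), dim=-1)
    # KL(target || prediction), averaged over the batch.
    return F.kl_div(pred_log, target, reduction="batchmean")
```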
Requirements:
- aria2c (used by the download scripts)
- Python 3.x
- the Python libraries listed in requirements.txt (install with `pip install -r requirements.txt`)
Download the LibriSpeech and Common Voice datasets using the download scripts, changing the dir parameter as per your configuration:
```sh
sh download_scripts/download_librispeech.sh
sh download_scripts/extract_librispeech_tars.sh
sh download_scripts/download_common_voice.sh
sh download_scripts/extract_common_voice_tars.sh
```
Set up the SentencePiece vocabulary and build the LMDB datasets:
```sh
sh dataset_scripts/librispeech_all_lines.sh
sh dataset_scripts/librispeech_sentencepiece_model.sh
```
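For reference, the SentencePiece training step can also be driven from Python; the input file name and vocabulary size below are assumptions, the actual values live in the shell script above:

```python
import sentencepiece as spm

# Train a subword model on the corpus of transcript lines
# (file name and vocab_size are illustrative assumptions).
spm.SentencePieceTrainer.train(
    input="librispeech_all_lines.txt",
    model_prefix="librispeech_sp",   # writes librispeech_sp.model / .vocab
    vocab_size=1000,
)

# Encode a transcript into subword ids with the trained model.
sp = spm.SentencePieceProcessor(model_file="librispeech_sp.model")
print(sp.encode("hello world", out_type=int))
```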
```sh
OMP_NUM_THREADS="1" OPENBLAS_NUM_THREADS="1" python3 -W ignore -m dataset_scripts.create_librispeech_lmdb
OMP_NUM_THREADS="1" OPENBLAS_NUM_THREADS="1" python3 -W ignore -m dataset_scripts.create_commonvoice_lmdb
OMP_NUM_THREADS="1" OPENBLAS_NUM_THREADS="1" python3 -W ignore -m dataset_scripts.create_airtel_lmdb
OMP_NUM_THREADS="1" OPENBLAS_NUM_THREADS="1" python3 -W ignore -m dataset_scripts.create_airtelpayments_lmdb
```
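Each create_*_lmdb script ultimately writes (audio, transcript) samples into an LMDB key-value store; a minimal sketch of that pattern with the `lmdb` package, where the key scheme and serialization format are assumptions:

```python
import lmdb
import pickle

# Open (and create) the environment; map_size is the maximum DB size in bytes.
env = lmdb.open("librispeech_train.lmdb", map_size=1 << 40)

# Illustrative samples; the real scripts read these from the extracted dataset.
samples = [{"audio": "path/to/audio0.flac", "text": "hello world"}]

with env.begin(write=True) as txn:
    for idx, sample in enumerate(samples):
        # One pickled record per integer key (assumed scheme).
        txn.put(str(idx).encode(), pickle.dumps(sample))
    txn.put(b"num_samples", str(len(samples)).encode())
env.close()
```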
Modify utils/config.py as per your configuration and run:

```sh
OMP_NUM_THREADS="1" CUDA_VISIBLE_DEVICES="0,1,2" python3.6 train.py
```
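What to change in utils/config.py is repo-specific; purely as a hypothetical illustration (none of these field names are confirmed by the source), pointing the config at your setup would look something like:

```python
# utils/config.py -- hypothetical field names, for illustration only.
lmdb_root = "/data/lmdb"                      # where the LMDB datasets were written
sentencepiece_model = "librispeech_sp.model"  # subword model from the previous step
batch_size = 32                               # per-GPU batch size
learning_rate = 1e-3                          # Ranger base learning rate
```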
References:
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
- Jasper: An End-to-End Convolutional Neural Acoustic Model
- SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
- Improved Regularization of Convolutional Neural Networks with Cutout
- On the Variance of the Adaptive Learning Rate and Beyond
- Aggregation Cross-Entropy for Sequence Recognition
- MixMatch: A Holistic Approach to Semi-Supervised Learning
- MixConv: Mixed Depthwise Convolutional Kernels
- Unsupervised Data Augmentation for Consistency Training
- Cyclical Learning Rates for Training Neural Networks
- Cycle-consistency training for end-to-end speech recognition
- RandAugment: Practical automated data augmentation with a reduced search space
- Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition