Notebooks for "AI 응용-음성" (AI Applications: Speech)
Tutorial for Python and data science packages
- Python review
- numpy
- matplotlib
- PyTorch Tensor
Audio file handling using torchaudio
- Load audio file (torchaudio.load)
- Feature extraction (Mel-spectrogram, MFCC)
Audio MNIST classification using MLP (torch.nn.Linear)
Simple exercise for Connectionist Temporal Classification (model training using CTC loss)
Exercise using OpenAI Whisper and Gradio
QuartzNet model fine-tuning with NeMo (English to Korean)
Exercise for WFST using k2
- C, L, G transducers
- composition, determinization
Wav2Vec 2.0 model fine-tuning with Hugging Face (English to Korean)
Whisper model fine-tuning with Hugging Face (English to Korean)
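
As a quick orientation for the CTC exercise listed above, here is a minimal pure-Python sketch of the two ideas behind CTC: the collapse rule (merge repeated symbols, then drop blanks) and the forward recursion that sums the probability of every frame-level path collapsing to a target. It is not taken from the course notebooks. The function names `ctc_collapse` and `ctc_neg_log_likelihood` and the choice of index 0 for the blank are assumptions made for illustration, and the code works in plain probabilities rather than log space, so it is only suitable for tiny toy inputs.

```python
import math


def ctc_collapse(path, blank=0):
    """Collapse a frame-level path: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return out


def ctc_neg_log_likelihood(probs, target, blank=0):
    """Negative log-probability of `target` under per-frame distributions.

    probs[t][k] is the probability of symbol k at frame t.
    target is a list of label indices without blanks.
    """
    # Extended target: a blank before, between, and after the labels.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_frames, num_states = len(probs), len(ext)

    # alpha[s]: total probability of all path prefixes ending in state s.
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]  # stay in the same state
            if s >= 1:
                total += alpha[s - 1]  # advance by one state
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end in the last label or in the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(likelihood)


if __name__ == "__main__":
    print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))
    uniform = [[0.5, 0.5], [0.5, 0.5]]  # 2 frames, symbols: blank and one label
    print(ctc_neg_log_likelihood(uniform, [1]))
```

With two frames and uniform probabilities over a blank and one label, three of the four possible paths collapse to the single-label target, so the likelihood is 0.75. In practice a framework implementation such as torch.nn.CTCLoss, which works on log-probabilities, would be used instead of this sketch.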
References
- Huang, X. D., Acero, A., & Hon, H. W. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development - Chapter 9. Pearson Education. (https://dl.acm.org/doi/book/10.5555/560905#cited-by-sec)
- Uday Kamath, John Liu, and James Whitaker (2019). Deep Learning for NLP and Speech Recognition. Springer. (https://www.amazon.com/Deep-Learning-NLP-Speech-Recognition/dp/3030145980)
- Introduction to Deep Learning: MIT 6.S191. (2023). Lecture2 – Deep Sequence Modeling. (http://introtodeeplearning.com/slides/6S191_MIT_DeepLearning_L2.pdf)
- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. arXiv preprint arXiv:1409.3215. (https://arxiv.org/abs/1409.3215)
- Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078. (https://arxiv.org/abs/1406.1078)
- Stanford University. (2022). CS224N: Natural Language Processing with Deep Learning: Lecture7 – Machine Translation, Sequence-to-Sequence and Attention. (https://web.stanford.edu/class/cs224n/slides/cs224n-2022-lecture07-nmt.pdf)
- Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473. (https://arxiv.org/abs/1409.0473)
- Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML'06: Proceedings of the 23rd international conference on Machine learning. (https://www.cs.toronto.edu/~graves/icml_2006.pdf)
- Olah, C., & Carter, S. (2017). Distill: Sequence Modeling With CTC. (https://distill.pub/2017/ctc/)
- Bluche, T. (N.D.). The intriguing blank label in CTC. [Blog post]. (https://www.tbluche.com/ctc_and_blank.html)
- Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762. (https://arxiv.org/abs/1706.03762)
- Olah, C. (2015). Understanding LSTM Networks. [Blog post]. (https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- Stanford University. (2022). CS224N: Natural Language Processing with Deep Learning: Lecture9 – Transformers. (https://web.stanford.edu/class/cs224n/slides/cs224n-2022-lecture09-transformers.pdf)
- Jurafsky, D., & Martin, J. H. (2022). Speech and Language Processing: Chapter9 – RNNs and LSTMs. (https://web.stanford.edu/~jurafsky/slp3/)
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR. (https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)
- Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv preprint arXiv:1607.06450. (https://arxiv.org/abs/1607.06450)
- Alammar, J. (N.D.). The Illustrated Transformer. [Blog post]. (https://jalammar.github.io/illustrated-transformer/)
- Bloem, P. (N.D.). TRANSFORMERS FROM SCRATCH. [Blog post]. (https://peterbloem.nl/blog/transformers)
- Gulati, A., Qin, J., Chiu, C. C., et al. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv preprint arXiv:2005.08100. (https://arxiv.org/abs/2005.08100)
- Hinton, G., Deng, L., Yu, D., et al. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine. (https://www.cs.toronto.edu/~hinton/absps/DNN-2012-proof.pdf)
- Young, S. J., & Woodland, P. C. (1994). Tree-Based State Tying for High Accuracy Acoustic Modelling. Proc. ARPA Spoken Language Systems Technology Workshop. (https://aclanthology.org/H94-1062.pdf)
- Barsky, M. (N.D.). Vancouver Island University: Data Mining Lab 3 - Classifiers: toy example of decision tree. (http://csci.viu.ca/~barskym/teaching/DM_LABS/LAB_3/Lab3_decisiontreeexample.pdf)
- Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks. Advances in Neural Information Processing Systems 19. (https://proceedings.neurips.cc/paper/2006/file/5da713a690c067105aeb2fae32403405-Paper.pdf)
- Graves, A., Mohamed, A. R., & Hinton, G. (2013). SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. (http://www.cs.toronto.edu/~hinton/absps/RNN13.pdf)
- American National Corpus (ANC). (N.D.). ANC Second Release: frequency data. (https://anc.org/data/anc-second-release/frequency-data/)
- Sekine, S. (2010). On-Demand Information Extraction and Linguistic Knowledge Acquisition. New York University. (https://nlp.cs.nyu.edu/sekine/papers/10spring.pdf)
- Gillick, L., & Cox, S. J. (1991). Some statistical issues in the comparison of speech recognition algorithms. ICASSP-91: International Conference on Acoustics, Speech, and Signal Processing. (https://www.researchgate.net/publication/2360210_Comparison_Of_Part-Of-Speech_And_Automatically_Derived_Category-Based_Language_Models_For_Speech_Recognition)
- Mohri, M., Pereira, F., & Riley, M. (2008). SPEECH RECOGNITION WITH WEIGHTED FINITE-STATE TRANSDUCERS. Springer Handbook of Speech Processing. (https://cs.nyu.edu/~mohri/pub/hbka.pdf)
- Panayotov, V. (2012). Decoding graph construction in Kaldi: A visual walkthrough. [Blog post]. (http://vpanayotov.blogspot.com/2012/06/kaldi-decoding-graph-construction.html)
- Lecture on Weighted Finite State Transducers in Automatic Speech Recognition. (N.D.). Brno University of Technology. (http://www.fit.vutbr.cz/study/courses/ZRE/public/pred/13_wfst_sid_lid/zre_lecture_asr_wfst.pdf)