Research #2

Closed
thewh1teagle opened this issue Jun 21, 2024 · 0 comments

thewh1teagle commented Jun 21, 2024

  • Is it a problem if we use different voices for training?
  • Will we be able to control what the final voice sounds like?
  • What's the best model to choose, both in terms of ease of training and potential?

Dataset:

Ivrit.ai
https://huggingface.co/datasets/ivrit-ai/audio-labeled

RoboShaul
https://www.openslr.org/134/

Inspiration:

TTS with voice cloning
https://www.youtube.com/watch?v=jPFDtB3kKkY

https://localai.io/features/text-to-audio/
https://github.com/Sharonio/roboshaul

Working model for TTS in Hebrew
https://gist.github.com/thewh1teagle/6d477f91d3f3fb7380b6fb3d839dda2e

Open source tts projects:

Up to date
https://github.com/dykyivladk1/tacotron
https://github.com/ttaoREtw/Tacotron-pytorch

Roboshaul 1st place
https://github.com/maxmelichov/Text-To-speech

Updated, but needs emotional voices dataset
https://github.com/netease-youdao/EmotiVoice

Promising, but the project is shut down
https://github.com/coqui-ai/TTS

Promising
https://github.com/metavoiceio/metavoice-src

https://github.com/rhasspy/piper

https://github.com/snakers4/silero-models

https://github.com/neonbjb/tortoise-tts

https://github.com/espnet/espnet

https://github.com/NVIDIA/NeMo

https://github.com/espeak-ng/espeak-ng

https://github.com/huggingface/parler-tts

https://github.com/fishaudio/fish-speech

https://github.com/PaddlePaddle/PaddleSpeech

https://github.com/myshell-ai/MeloTTS

https://github.com/Plachtaa/VALL-E-X

https://github.com/collabora/WhisperSpeech

https://github.com/slp-rl/HebTTS

https://github.com/speechbrain/speechbrain

Guides

Serious
https://medium.com/@peechapp/text-to-speech-models-part-1-intro-little-theory-and-math-0ffa5d3e0e3f

rhasspy/piper#51

Papers

https://pages.cs.huji.ac.il/adiyoss-lab/HebTTS/

Hardware

https://vast.ai/

Chat questions

https://discord.com/channels/1087775482688323656/1090298218107130001/1254800478487969853

TTS communities

metavoice
https://discord.gg/ShDqyA3m

suno
https://discord.gg/PTP3GD8h

espnet
https://discord.gg/MCbETmFs

fishaudio
https://discord.gg/wqxyePyj

vall-e
https://discord.gg/wnDuKHma

huggingface
https://discord.gg/hugging-face-879548962464493619

speechbrain
https://discord.gg/rEBtaXrJ

Voice conversion

https://github.com/IAHispano/Applio

2024-07-12

  1. Find a good, open dataset from audiobooks
  2. Use a voice changer to convert the voice to an openly licensed voice
  3. Train on it
  4. Aim for quality close to vits-ljs, which was trained on 24 hours of audio.
    The MMS multilingual TTS (which includes Hebrew) is based on it.

metavoiceio/metavoice-src#70

2024-07-14

  1. Prepare saspeech
  2. Enhance the voice with applio and change it to something that sounds better
  3. Train on Tacotron2

2024-07-15

The closest project that works well: https://github.com/nipponjo/tts-arabic-pytorch

2024-07-16

  1. Collect Audio: Gather 10-20 hours of clean audio from a single native Hebrew speaker.
  2. Transcribe Audio: Accurately transcribe the audio.
  3. Normalize Transcriptions: Convert numbers and symbols to Hebrew words.
  4. Add Nikud: Annotate transcriptions with Nikud (vowel symbols).
  5. Transliterate: Convert Hebrew with Nikud to Roman/Latin script.
  6. Create Spectrograms: Generate spectrograms from the audio files using tools like Librosa.
  7. Split Dataset: Divide into training (80-90%) and testing (10-20%) sets.
  8. Pretrained Model: Use a pretrained Tacotron2 model.
  9. Fine-Tune: Fine-tune the model on the Hebrew dataset with a GPU.
  10. Evaluate and Adjust: Test, listen, and adjust based on performance.
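Step 6 (spectrogram creation) can be sketched with plain numpy, standing in for what librosa's stft/melspectrogram helpers compute. The window, hop, and FFT sizes below are common Tacotron2 defaults, not values taken from this thread:

```python
import numpy as np

def spectrogram(y, n_fft=1024, hop_length=256):
    """Magnitude spectrogram via a Hann-windowed STFT (what librosa.stft
    does under the hood). Returns shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack([y[i * hop_length : i * hop_length + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# 1 second of a 440 Hz tone at 22.05 kHz (Tacotron2's usual sample rate)
sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(y)
print(spec.shape)  # (513, 83)
```

In practice you would also map this onto a mel filterbank and take the log (librosa.feature.melspectrogram handles both), since Tacotron2 trains on log-mel spectrograms rather than raw magnitudes.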

NVIDIA/tacotron2#321 (comment)

https://www.youtube.com/watch?v=EWp6UitlnDo

https://m.youtube.com/watch?v=e71H--vxRvo

2024-07-18

First training running.
Steps done:

  1. Collect saspeech_gold_standard_v1.0.tar.gz
  2. Prepare the dataset:
  • Convert numbers to words using num2words
  • Add vowel points (nikud) using nakdimon
  • Keep only two columns from metadata.csv (wav id, sentence)
  • Transliterate using hebrew-transliteration via JSPyBridge
  • Split into train_data and validation_data (hold out 5% for validation)
  3. Prepare tacotron2:
  • Clone tacotron2 and migrate it to PyTorch v2
  • Init the submodule (waveglow)
  • Update the symbols used for training in symbols.py
  4. Point tacotron2 to training_data and validation_data and start training
  5. On Colab, use an A100. Load the dataset from Google Drive, and don't forget to save checkpoints back to Google Drive, as the session may be lost.
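The metadata filtering and train/validation split above can be sketched as follows. This assumes an LJSpeech-style pipe-delimited metadata.csv; the num2words, nakdimon, and hebrew-transliteration steps are assumed to have already been applied to the sentences, and the file names and `wavs/` layout are illustrative, not taken from this thread:

```python
import csv
import random

def prepare_filelists(metadata_path, train_path, val_path, val_frac=0.05, seed=42):
    """Keep only the (wav id, sentence) columns from metadata.csv and split
    them into train/validation filelists in Tacotron2's 'path|text' format."""
    with open(metadata_path, encoding="utf-8") as f:
        rows = [(r[0], r[1]) for r in csv.reader(f, delimiter="|")]
    # Shuffle deterministically, then hold out val_frac for validation
    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * val_frac))
    val, train = rows[:n_val], rows[n_val:]
    for path, subset in ((train_path, train), (val_path, val)):
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(f"wavs/{wav_id}.wav|{text}\n" for wav_id, text in subset)
    return len(train), len(val)
```

Tacotron2's hparams then point `training_files` and `validation_files` at the two generated filelists.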

Costs:
100 units = 32ILS

~1,000 iterations per hour

2025-01-03

Kokoro

https://huggingface.co/hexgrad/Kokoro-82M/discussions/10#6773226c5a14f2e615632359
