Research #2

Closed
thewh1teagle opened this issue Jun 21, 2024 · 0 comments

thewh1teagle commented Jun 21, 2024

  • Is it a problem if we use different voices for training?
  • Will we be able to control what the final voice sounds like?
  • What's the best model to choose, both in terms of ease of training and potential?

Dataset:

Ivrit.ai
https://huggingface.co/datasets/ivrit-ai/audio-labeled

RoboShaul
https://www.openslr.org/134/

Inspiration:

TTS with voice cloning
https://www.youtube.com/watch?v=jPFDtB3kKkY

https://localai.io/features/text-to-audio/
https://github.com/Sharonio/roboshaul

Working model for TTS in Hebrew
https://gist.github.com/thewh1teagle/6d477f91d3f3fb7380b6fb3d839dda2e

Open source tts projects:

Up to date
https://github.com/dykyivladk1/tacotron
https://github.com/ttaoREtw/Tacotron-pytorch

Roboshaul 1st place
https://github.com/maxmelichov/Text-To-speech

Updated, but needs emotional voices dataset
https://github.com/netease-youdao/EmotiVoice

Promising, but the project is shut down
https://github.com/coqui-ai/TTS

Promising
https://github.com/metavoiceio/metavoice-src

https://github.com/rhasspy/piper

https://github.com/snakers4/silero-models

https://github.com/neonbjb/tortoise-tts

https://github.com/espnet/espnet

https://github.com/NVIDIA/NeMo

https://github.com/espeak-ng/espeak-ng

https://github.com/huggingface/parler-tts

https://github.com/fishaudio/fish-speech

https://github.com/PaddlePaddle/PaddleSpeech

https://github.com/myshell-ai/MeloTTS

https://github.com/Plachtaa/VALL-E-X

https://github.com/collabora/WhisperSpeech

https://github.com/slp-rl/HebTTS

https://github.com/speechbrain/speechbrain

Guides

Serious
https://medium.com/@peechapp/text-to-speech-models-part-1-intro-little-theory-and-math-0ffa5d3e0e3f

rhasspy/piper#51

Papers

https://pages.cs.huji.ac.il/adiyoss-lab/HebTTS/

Hardware

https://vast.ai/

Chat questions

https://discord.com/channels/1087775482688323656/1090298218107130001/1254800478487969853

TTS communities

metavoice
https://discord.gg/ShDqyA3m

suno
https://discord.gg/PTP3GD8h

espnet
https://discord.gg/MCbETmFs

fishaudio
https://discord.gg/wqxyePyj

vall-e
https://discord.gg/wnDuKHma

huggingface
https://discord.gg/hugging-face-879548962464493619

speechbrain
https://discord.gg/rEBtaXrJ

Voice conversion

https://github.com/IAHispano/Applio

2024-07-12

  1. Find a good, open dataset from audiobooks
  2. Use a voice changer to convert the voice to an openly licensed voice
  3. Train on it
  4. Aim for quality close to vits-ljs, which was trained on 24 hours of audio.
    The MMS multilingual TTS (which includes Hebrew) is based on it.

metavoiceio/metavoice-src#70

2024-07-14

  1. Prepare saspeech
  2. Enhance the voice with applio and change it to something that sounds better
  3. Train on Tacotron2

2024-07-15

The closest project that works well: https://github.com/nipponjo/tts-arabic-pytorch

2024-07-16

  1. Collect Audio: Gather 10-20 hours of clean audio from a single native Hebrew speaker.
  2. Transcribe Audio: Accurately transcribe the audio.
  3. Normalize Transcriptions: Convert numbers and symbols to Hebrew words.
  4. Add Nikud: Annotate transcriptions with Nikud (vowel symbols).
  5. Transliterate: Convert Hebrew with Nikud to Roman/Latin script.
  6. Create Spectrograms: Generate spectrograms from the audio files using tools like Librosa.
  7. Split Dataset: Divide into training (80-90%) and testing (10-20%) sets.
  8. Pretrained Model: Use a pretrained Tacotron2 model.
  9. Fine-Tune: Fine-tune the model on the Hebrew dataset with a GPU.
  10. Evaluate and Adjust: Test, listen, and adjust based on performance.
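Step 6 (spectrogram creation) can be sketched with plain numpy, standing in for what librosa's stft/melspectrogram helpers compute. The window, hop, and FFT sizes below are common Tacotron2 defaults, not values taken from this thread:

```python
import numpy as np

def spectrogram(y, n_fft=1024, hop_length=256):
    """Magnitude spectrogram via a Hann-windowed STFT (what librosa.stft
    does under the hood). Returns shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack([y[i * hop_length : i * hop_length + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# 1 second of a 440 Hz tone at 22.05 kHz (Tacotron2's usual sample rate)
sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(y)
print(spec.shape)  # (513, 83)
```

In practice you would also map this onto a mel filterbank and take the log (librosa.feature.melspectrogram handles both), since Tacotron2 trains on log-mel spectrograms rather than raw magnitudes.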

NVIDIA/tacotron2#321 (comment)

https://www.youtube.com/watch?v=EWp6UitlnDo

https://m.youtube.com/watch?v=e71H--vxRvo

2024-07-18

First training running.
Steps done:

  1. Collect saspeech_gold_standard_v1.0.tar.gz
  2. Prepare the dataset:
  • Convert numbers to words using num2words
  • Add vowel points (nikud) using nakdimon
  • Keep only two columns from metadata.csv (wav id, sentence)
  • Transliterate using hebrew-transliteration via JSPyBridge
  • Split into train_data and validation_data (hold out 5% for validation)
  3. Prepare tacotron2:
  • Clone tacotron2 and migrate it to PyTorch v2
  • Init the submodule (waveglow)
  • Update the symbols used for training in symbols.py
  4. Point tacotron2 to training_data and validation_data and start training
  5. On Colab, use an A100. Load the dataset from Google Drive, and don't forget to save checkpoints back to Google Drive, as the session may be lost.
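The metadata filtering and train/validation split above can be sketched as follows. This assumes an LJSpeech-style pipe-delimited metadata.csv; the num2words, nakdimon, and hebrew-transliteration steps are assumed to have already been applied to the sentences, and the file names and `wavs/` layout are illustrative, not taken from this thread:

```python
import csv
import random

def prepare_filelists(metadata_path, train_path, val_path, val_frac=0.05, seed=42):
    """Keep only the (wav id, sentence) columns from metadata.csv and split
    them into train/validation filelists in Tacotron2's 'path|text' format."""
    with open(metadata_path, encoding="utf-8") as f:
        rows = [(r[0], r[1]) for r in csv.reader(f, delimiter="|")]
    # Shuffle deterministically, then hold out val_frac for validation
    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * val_frac))
    val, train = rows[:n_val], rows[n_val:]
    for path, subset in ((train_path, train), (val_path, val)):
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(f"wavs/{wav_id}.wav|{text}\n" for wav_id, text in subset)
    return len(train), len(val)
```

Tacotron2's hparams then point `training_files` and `validation_files` at the two generated filelists.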

Costs:
100 units = 32ILS

~1,000 iterations per hour

2025-01-03

Kokoro

https://huggingface.co/hexgrad/Kokoro-82M/discussions/10#6773226c5a14f2e615632359
