Hi! First of all, many thanks for the effort in this project, Tortoise is so cool! I'm training Tortoise on my own, and I'm running into an issue with the conditioning latent that I wanted to discuss with you.

Basically, Tortoise pronounces the input text, but in the middle it says some unintelligible words. Only about 5% to 15% of the time is the output speech clearly equal to the input text. When training the autoregressive and diffusion models, I'm using the same reference clip for both the conditioning latent and the target signal in the loss. Because of this, I strongly suspect that the model learns to encode some of the phonetic content in the conditioning latent, and therefore at inference, when I use a different reference clip, phonetic content from it leaks into the final signal.

For that reason, I'm thinking about fine-tuning my current autoregressive and diffusion checkpoints, but this time randomly sampling different reference clips from the target speaker ID, to force the model to disentangle phonetic content. I guess that should be enough to preserve identity information in the latent vector, but I'm wondering whether it will discard prosody information, since that also won't match the target signal.

Did you encounter similar problems while developing Tortoise, and if so, what measures did you take to ameliorate them? Many thanks in advance and cheers! :)
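P.S. For concreteness, here's roughly the sampling change I'm considering for fine-tuning (where `clips_by_speaker` and `sample_conditioning_clip` are just my own dataset helpers, nothing from the Tortoise codebase):

```python
import random

def sample_conditioning_clip(speaker_id, target_clip, clips_by_speaker):
    # All clips belonging to this speaker, excluding the one used as the training target.
    candidates = [c for c in clips_by_speaker[speaker_id] if c != target_clip]
    # Fall back to the target clip only if the speaker has a single clip.
    return random.choice(candidates) if candidates else target_clip
```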
Hey there,

That's exciting - I'm really glad to hear you're making progress on this!

When I trained Tortoise, all of my conditioning clips were different from the target audio. I did this by building a "similarity list" across my entire dataset. Generally speaking, I built these similarities by finding instances of the same speaker speaking in a clip close to the target clip. As a specific example, let's say I had an hour-long podcast: I would chop it up into 10-second clips. For each clip I would generate a latent encoding with a contrastive model trained to minimize the dot product between latents of the same speaker saying something. I'd then compute a dot-product matrix across all of the clips in the podcast and pick the 3 clips with the lowest dot product for each clip. These would be my conditioning inputs in training.

I think alternative schemes for generating these similarities are worth pursuing. I did not try out voice ID models like you describe, but I think that would be a great approach. Another possible option would be to use the target speech, but randomly cropped.

You mentioned concerns about preserving prosody and tone. I think if you generally pull your conditioning clips from adjacent audio, you'll capture those elements most of the time.

BTW - in early models, I did what you are describing and used the target speech as the conditioning clip as well. I remember that it performed poorly whenever given text that was not aligned with the conditioning clip at inference. I don't remember hearing exactly what you are describing, though. One way you could test your theory would be to check whether the model always produces good outputs when you feed in the text paired with the conditioning input. If that works, I'd say there's a high chance that this is your problem.
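For reference, the clip-selection step could be sketched roughly like this (here `encode_clip` stands in for the contrastive speaker encoder and isn't code from the repo, just an illustration of the procedure above):

```python
import numpy as np

def build_similarity_list(clips, encode_clip, k=3):
    # One latent per 10-second clip, shape (num_clips, latent_dim).
    latents = np.stack([encode_clip(c) for c in clips])

    # Dot-product matrix across every pair of clips in the recording.
    sims = latents @ latents.T

    # Exclude each clip from its own candidate list.
    np.fill_diagonal(sims, np.inf)

    # The contrastive model is trained to minimize the dot product for clips
    # from the same speaker, so the k lowest scores are the conditioning candidates.
    return {i: np.argsort(sims[i])[:k].tolist() for i in range(len(clips))}
```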