Hi! First of all, many thanks for the effort in this project, Tortoise is so cool! I'm training Tortoise on my own, and I'm running into an issue with the conditioning latent that I wanted to discuss with you.

Basically, Tortoise pronounces the input text, but in the middle it says some unintelligible words. Only about 5% to 15% of the time is the output speech clearly equal to the input text. When training the autoregressive and diffusion models, I'm using the same reference clip for both the conditioning latent and the target signal in the loss. Because of this, I strongly suspect that the model learns to encode some of the phonetic content in the conditioning latent, and therefore at inference, when I use a different reference clip, phonetic content from it leaks into the final signal.

For that reason, I'm thinking about fine-tuning my current autoregressive and diffusion checkpoints, but this time randomly sampling different reference clips from the target speaker ID, to force the model to disentangle phonetic content. I guess that should be enough to preserve identity information in the latent vector, but I'm wondering whether it will discard prosody information, since that also won't match the target signal.

Did you encounter similar problems while developing Tortoise, and if so, what measures did you take to ameliorate them? Many thanks in advance and cheers! :)
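P.S. For concreteness, here's roughly the sampling change I'm considering for fine-tuning (where `clips_by_speaker` and `sample_conditioning_clip` are just my own dataset helpers, nothing from the Tortoise codebase):

```python
import random

def sample_conditioning_clip(speaker_id, target_clip, clips_by_speaker):
    # All clips belonging to this speaker, excluding the one used as the training target.
    candidates = [c for c in clips_by_speaker[speaker_id] if c != target_clip]
    # Fall back to the target clip only if the speaker has a single clip.
    return random.choice(candidates) if candidates else target_clip
```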
Hey there,

That's exciting - I'm really glad to hear you're making progress on this!

When I trained Tortoise, all of my conditioning clips were different from the target audio. I did this by building a "similarity list" across my entire dataset. Generally speaking, I built these similarities by finding instances of the same speaker speaking in a clip close to the target clip. As a specific example, let's say I had an hour-long podcast: I would chop it up into 10-second clips. For each clip I would generate a latent encoding with a contrastive model trained to minimize the dot product between latents of the same speaker saying something. I'd then compute a dot-product matrix across all of the clips in the podcast and pick the 3 clips with the lowest dot product for each clip. These would be my conditioning inputs in training.

I think alternative schemes for generating these similarities are worth pursuing. I did not try out voice ID models like you describe, but I think that would be a great approach. Another possible option would be to use the target speech, but randomly cropped.

You mentioned concerns about preserving prosody and tone. I think if you generally pull your conditioning clips from adjacent audio, you'll capture those elements most of the time.

BTW - in early models, I did what you are describing and used the target speech as the conditioning clip as well. I remember that it performed poorly whenever given text that was not aligned with the conditioning clip at inference. I don't remember hearing exactly what you are describing, though. One way you could test your theory would be to check whether the model always produces good outputs when you feed in the text paired with the conditioning input. If that works, I'd say there's a high chance that this is your problem.
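For reference, the clip-selection step could be sketched roughly like this (here `encode_clip` stands in for the contrastive speaker encoder and isn't code from the repo, just an illustration of the procedure above):

```python
import numpy as np

def build_similarity_list(clips, encode_clip, k=3):
    # One latent per 10-second clip, shape (num_clips, latent_dim).
    latents = np.stack([encode_clip(c) for c in clips])

    # Dot-product matrix across every pair of clips in the recording.
    sims = latents @ latents.T

    # Exclude each clip from its own candidate list.
    np.fill_diagonal(sims, np.inf)

    # The contrastive model is trained to minimize the dot product for clips
    # from the same speaker, so the k lowest scores are the conditioning candidates.
    return {i: np.argsort(sims[i])[:k].tolist() for i in range(len(clips))}
```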