Conditional latent sampling strategy at training #151

Answered by neonbjb
gcambara asked this question in Q&A

Hey there,
That's exciting - I'm really glad to hear you're making progress on this!

When I trained Tortoise, all of my conditioning clips were different from the target audio. I did this by building a "similarity list" across my entire dataset. Generally speaking, I built these similarities by finding instances of the same speaker speaking in a clip close to the target clip. As a specific example, let's say I had an hour-long podcast: I would chop it up into 10-second clips. For each clip I would compute a latent encoding using a contrastive model trained to maximize the dot product of latents of the same speaker saying something. I'd then generate a dot product matrix across all…
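
For anyone who wants to try the idea, here is a minimal sketch of a "similarity list", not the actual Tortoise training code. It assumes each clip already has a latent vector from some speaker-contrastive model and that a higher dot product means "more likely the same speaker". The answer above is cut off after the dot product matrix, so the top-k selection and the random pick are assumptions, and the names `build_similarity_list`, `sample_conditioning`, and `top_k` are hypothetical.

```python
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_similarity_list(latents, top_k=2):
    """Map each clip index to the indices of its top_k most similar other clips.

    Assumption: a higher dot product means the clips are more similar.
    """
    n = len(latents)
    # Pairwise dot-product matrix over all clips.
    sim = [[dot(latents[i], latents[j]) for j in range(n)] for i in range(n)]
    similar = {}
    for i in range(n):
        # Exclude the clip itself so the conditioning clip never equals the target.
        candidates = [j for j in range(n) if j != i]
        candidates.sort(key=lambda j: sim[i][j], reverse=True)
        similar[i] = candidates[:top_k]
    return similar


def sample_conditioning(similar, target_idx, rng=random):
    """Pick one conditioning clip index for the given target clip."""
    return rng.choice(similar[target_idx])


# Toy latents: clips 0 and 1 stand in for one speaker, clips 2 and 3 for another.
latents = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]
similarity_list = build_similarity_list(latents, top_k=1)
conditioning_idx = sample_conditioning(similarity_list, target_idx=0)
```

On a real dataset the full matrix grows quadratically with the number of clips, so you would probably compute it in batches or per source recording rather than all at once.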

Answer selected by gcambara