
Questions regarding creation of training data for new taxa #917

Open
DanJeffries opened this issue Dec 20, 2024 · 11 comments

@DanJeffries

DanJeffries commented Dec 20, 2024

Hi DV DeVs,

I am starting a new issue here, though it is somewhat related to our discussion in issue #904.

I am trying to train a model for use in a fish species, and am following the procedure laid out in the mosquito example here.

To create the training data, we sequenced several trios for our target species so that we can use Mendelian expectations to identify loci for training. My questions here relate to this process. I am not having a problem as such; I am just trying to make sure that my strategy is correct, and that the training data are as good as they can be. So your input would be very much appreciated!

In the mosquito blog post, you outline the general approach as follows:

"To generate examples, we look for positions that we can confidently determine the transmitted allele from each parent, but do not look at the evidence in the child. This allows us to represent sites that may be difficult in the child, but which have a known label."

This approach makes a lot of sense to me, however I am unclear exactly how you implemented it. I think easiest is to first outline what I am doing, and then point out the steps that I am unclear on.

I am doing initial SNP calling for the training samples using GATK4. To isolate confident HOM_REF calls, I filter the parent GVCFs for positions which are HOM_REF in both parents, with a minimum depth of 20 and at least 18 reads supporting the REF allele. I am not checking the evidence in the offspring. When I check the outcome of this filter, things look good: I have only HOM_REF calls in the parents, and the offspring have almost all HOM_REF calls. There are, however, some loci in the offspring which are called as variants by the initial GATK4 genotyping. These could be errors in the parents, but given that the parent genotypes are well supported, I think they are more likely errors in the offspring. The allele depth ratios for these loci seem to support this too. See plots below. Row 1: male parent; Row 2: female parent; Row 3: offspring.

[Figure: allele depth ratio plots for the filtered HOM_REF loci. Row 1: male parent; Row 2: female parent; Row 3: offspring.]

So this seems good to me, and if my understanding of the general approach is correct, I want to keep these erroneous calls in the offspring (in this case false negative HOM_REF calls, or false positive HET/HOM_ALT calls), as they will help the model learn what difficult-to-call HOM_REF sites might look like. I read in the DeepVariant docs that "variant calls within these regions and not in the truth_vcf are considered false positives". So am I right in thinking that the appropriate way to deal with these loci would be to include them in the --confident_regions bed file, and make sure they are not in the truth set? And that by doing this I will be including known false positive variant calls (or false negative HOM_REF calls) in the training set?
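
For concreteness, here is a minimal sketch of the kind of parental HOM_REF filter I describe above, using pysam. The file names and thresholds are illustrative, and I am assuming per-site records; real GVCF hom-ref blocks (with an END tag) would need to be expanded separately, which is elided here.

```python
# Sketch: collect positions that are confidently HOM_REF in both parents,
# based only on parental evidence (the offspring is deliberately not consulted).
from pysam import VariantFile

MIN_DP = 20        # minimum depth in each parent
MIN_REF_AD = 18    # minimum reads supporting the REF allele

def confident_hom_ref(path):
    """Yield (chrom, pos) for per-site records passing the HOM_REF filter."""
    for rec in VariantFile(path):
        for sample in rec.samples.values():
            try:
                gt, dp, ad = sample["GT"], sample["DP"], sample["AD"]
            except KeyError:
                continue
            if gt == (0, 0) and dp is not None and dp >= MIN_DP \
                    and ad is not None and ad[0] >= MIN_REF_AD:
                yield rec.chrom, rec.pos

father = set(confident_hom_ref("father.g.vcf.gz"))  # hypothetical paths
mother = set(confident_hom_ref("mother.g.vcf.gz"))

with open("hom_ref_sites.bed", "w") as bed:
    for chrom, pos in sorted(father & mother):
        bed.write(f"{chrom}\t{pos - 1}\t{pos}\n")    # BED: 0-based, half-open
```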

I have a similar query regarding erroneous calls in the --truth_variants set. Based on the initial GATK4 calls, I filter for loci where the parents are either 0/0 in the father and 1/1 in the mother, or vice versa, or 1/1 in both parents. Again, minimum depth and allele depth ratios are used as filters, as well as genotype quality, strand bias, etc. I get the expected outcome, with parents being either 0/0 or 1/1, and offspring being either 0/1 or 1/1; see below.

[Figure: allele depth ratio plots for the filtered truth loci (parents 0/0 or 1/1; offspring 0/1 or 1/1).]

However, again there are some locus calls in the offspring that do not match the Mendelian expectation, e.g. father 0/0, mother 1/1, offspring 0/0. Sticking to the philosophy of including training loci based only on the evidence in the parents, I should include such loci. But in this case I do not know exactly how to communicate to DV that the locus is a false negative HET call, as the make_examples label based on the GATK4 call would be wrong. Should I filter out loci whose genotypes don't match Mendelian expectations from the --truth_variants? Or does DV somehow know that the label is wrong for such loci? Or should I impute genotypes for these loci?
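
For reference, a sketch of the Mendelian check I am describing, assuming a joint trio VCF with hypothetical sample names "father", "mother", and "child"; the DP/AD/GQ/strand-bias filters mentioned above are elided, and genotypes are assumed unphased.

```python
# Sketch: flag truth candidates whose offspring genotype violates the
# expectation implied by confidently called parental genotypes.
from pysam import VariantFile

EXPECTED = {  # (father GT, mother GT) -> Mendelian-expected child GT
    ((0, 0), (1, 1)): (0, 1),
    ((1, 1), (0, 0)): (0, 1),
    ((1, 1), (1, 1)): (1, 1),
}

for rec in VariantFile("trio.joint.vcf.gz"):   # hypothetical path
    key = (rec.samples["father"]["GT"], rec.samples["mother"]["GT"])
    if key not in EXPECTED:
        continue
    child_gt = rec.samples["child"]["GT"]
    if child_gt != EXPECTED[key]:
        # Mendelian violation: the GATK label for the child is suspect here.
        print(rec.chrom, rec.pos, key, "child:", child_gt)
```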

Your comments on the above would be very helpful, and any other suggested improvements to the training data would be very welcome.

Thanks!

Dan

@kishwarshafin
Collaborator

Hi @DanJeffries,

Thank you for such a detailed issue. I will try to answer most of your questions first:

"So am I right in thinking that the appropriate way to deal with these loci would be to include them in the --confident_regions bed file, and make sure they are not in the truth set?"

There are two scenarios that can happen:

  • You removed a variant from the truth set but it is actually a TP: DV in that case will erroneously tag it 0/0.
  • You removed the region from the confident bed: DV will not consider this region for training, so the variants will be excluded.

A more conservative approach of excluding any region that seems confusing works best; you can then add regions back and see if it improves.
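
For illustration, a minimal sketch of that conservative exclusion. The interval representation and the padding value are assumptions on my part; in practice a tool like bedtools subtract does the same job.

```python
# Sketch: remove confusing loci (plus some padding) from the confident
# regions. Intervals are (chrom, start, end), 0-based half-open, and both
# lists are assumed sorted by chrom then start.
PAD = 50  # bp of context to drop around each confusing locus (illustrative)

def subtract(confident, confusing):
    padded = [(c, max(0, s - PAD), e + PAD) for c, s, e in confusing]
    out = []
    for chrom, start, end in confident:
        for pc, ps, pe in padded:
            if pc != chrom or pe <= start or ps >= end:
                continue  # no overlap with the current confident interval
            if ps > start:
                out.append((chrom, start, ps))  # keep the piece before it
            start = max(start, pe)              # resume after the bad chunk
        if start < end:
            out.append((chrom, start, end))     # keep any trailing piece
    return out

# e.g. subtract([("chr1", 0, 1000)], [("chr1", 400, 410)])
# -> [("chr1", 0, 350), ("chr1", 460, 1000)]
```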

"And that by doing this I will be including known false positive variant calls (or false negative HOM_REF) in the training set?"

  • As replied above, there's a chance that the training set will have falsely labeled variants.

"Should I filter out loci whose genotypes don't match Mendelian expectations from the --truth_variants? Or does DV somehow know that the label is wrong for such loci? Or should I impute genotypes for these loci?"

  • DV does not have the ability to check Mendelian expectations during training, so you need to do it before training.

Your approach seems very comprehensive, so I can't think of anything off the top of my head to improve it. I would suggest using DV for the initial variant generation, as our GQ values are well calibrated, so you can more confidently create the confident bed. Other than that, nothing comes to mind. I will bring this issue up in our internal meeting to see if others have any better ideas.

@DanJeffries
Author

Hi @kishwarshafin,

Thanks a lot for your response. Good to hear that my approach makes sense.

Can I just clarify on this point:

"Should I filter out loci whose genotypes don't match Mendelian expectations from the --truth_variants? Or does DV somehow know that the label is wrong for such loci? Or should I impute genotypes for these loci?"

"DV does not have the ability to check Mendelian expectations during training, so you need to do it before training."

So if DV has no way of knowing that these loci have erroneous labels, how should I deal with them? Is it better to:

  1. Remove them from the truth variants completely?
  2. Impute genotypes? E.g. if the GATK calls are 0/0, 1/1, and 0/0 for the mother, father, and offspring respectively (and the parent calls are reliable), then I could change the offspring genotype to 0/1. This would allow me to include problematic heterozygous calls in the training, but with the correct label. The same goes for the other genotype categories (a sketch of this rule follows below).
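
A sketch of the imputation rule I have in mind for option 2, with genotypes as pysam-style allele-index tuples; the mapping is just the Mendelian expectation, and the function name is my own.

```python
# Sketch of option 2: overwrite the offspring genotype with the one implied
# by the (trusted) parental genotypes, so the difficult example stays in
# training but carries a Mendelian-consistent label.
IMPUTE = {
    ((0, 0), (1, 1)): (0, 1),  # opposite homozygotes -> obligate het child
    ((1, 1), (0, 0)): (0, 1),
    ((1, 1), (1, 1)): (1, 1),  # both hom-alt -> hom-alt child
    ((0, 0), (0, 0)): (0, 0),  # both hom-ref -> hom-ref child
}

def impute_child_gt(mother_gt, father_gt):
    """Return the Mendelian-expected child GT, or None if uninformative."""
    return IMPUTE.get((mother_gt, father_gt))

assert impute_child_gt((0, 0), (1, 1)) == (0, 1)  # the example above
```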

Option 2 would seem to me to be a good way of helping the model learn how to deal with problematic loci. But I have no real intuition for how much this will affect the training, as these loci really are a small fraction of the total training data.

Time is not infinite on this project, so if it is unlikely to make a large difference I would perhaps just remove them.

Thanks

Dan

@kishwarshafin
Collaborator

Hi @DanJeffries,

I would start with "remove them from the truth variants completely", and I would also remove the regions from the confident bed, to make sure you are not falsely labeling any variants that are actually true. So just excluding them from training would be the right call.

Imputation would be a great solution, but a good and quick start is to exclude them from training and get a model. Then you can add more regions to the training and see if it measures up to the baseline model you have.

@DanJeffries
Author

Hi @kishwarshafin, and happy new year!

Ok, thanks for the advice. I will proceed as you suggest: leaving false positive variants in the confident regions bed, so they are seen by DV, but leaving false negative 0/1 and 1/1 variants out of both the truth set and the confident regions.

It would be interesting to compare with and without the imputation, but I will see if I have time for this.

Thanks for the help!

Cheers

Dan

@DanJeffries
Author

Hello again,

I am just reopening this issue to ask a quick question related to the same training run, which I am still working on. I followed the suggestion in the case study here to downsample the data, in the hope of making the model more robust to low-coverage loci. My approach was essentially to double the set of examples, so that each position is represented by both a standard-coverage and a low-coverage (--downsample_fraction=0.5) example. However, I now realise I do not understand exactly how DV deals with this information. Am I potentially confusing the training by including more than one example for the same locus?

Thanks as usual!

Dan

@pichuan
Collaborator

pichuan commented Jan 14, 2025

@DanJeffries When you run make_examples with --downsample_fraction=0.5, the flag gets passed into third_party/nucleus/io/sam_reader.cc, which will keep 50% of the reads. The core logic can be found here:

https://github.com/google/deepvariant/blob/r1.8/third_party/nucleus/io/sam_reader.cc#L797-L808

So, suppose you start with a 50x BAM and do one make_examples run with --downsample_fraction=0.5; that is conceptually the same as if you had used a 25x BAM as input.

If you prefer to fully control which reads are sampled, you can also use other tools to make a 25x BAM first, and then run make_examples on each BAM without the --downsample_fraction flag. That way, you'll have a better understanding of, and control over, which reads are kept in your lower-coverage BAM.
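
For example, a sketch of how one might build that lower-coverage BAM up front with pysam, hashing the read name so both mates of a pair make the same keep/drop decision (the same idea samtools view -s uses); the file names are illustrative.

```python
# Sketch: create a ~25x BAM from a 50x BAM before running make_examples,
# instead of relying on --downsample_fraction.
import zlib
import pysam

FRACTION = 0.5  # target fraction of read pairs to keep

with pysam.AlignmentFile("full_50x.bam", "rb") as src, \
     pysam.AlignmentFile("half_25x.bam", "wb", template=src) as dst:
    for read in src:
        # Hash of the name, not a per-read coin flip, so mates stay together.
        if zlib.crc32(read.query_name.encode()) / 2**32 < FRACTION:
            dst.write(read)
```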

I hope that answers your question.

@DanJeffries
Author

hi @kishwarshafin, apologies, perhaps I was unclear with my question. I understand what the downsampling is doing; my question is about how DeepVariant deals with multiple examples from the same locus.

In my current pipeline I make examples for the full dataset (e.g. 50x BAMs). I then make examples again, for all loci, with --downsample_fraction=0.5. Each locus is therefore present in the set of training examples twice: once at full coverage and once at half coverage. My question is, will this create confusion for DeepVariant? Or is each example completely independent in the training, regardless of whether examples share positions?

Thanks, Dan

@kishwarshafin
Collaborator

hi @DanJeffries,

Examples from the same loci at different coverages would not create major overfitting issues, as coverage titration creates enough difference between the examples. If you think of a normal distribution and want to create examples that have some deviation, coverage in this case provides the deviation you require. So please use fractions that are distant enough from each other, and overfitting should not be an issue.

@DanJeffries
Author

Ok great, thanks @kishwarshafin

@DanJeffries
Author

DanJeffries commented Jan 20, 2025

hi @kishwarshafin (again),

I have an additional question relating to my training efforts. I think it makes sense to keep them all on this general thread.

I am still not confident that I am making the training data properly. My preliminary training runs still show some strange issues, including very low hom-ref recall (e.g. <40%, regardless of training hyperparameters) and strange GQ distributions (different from those of the human WGS model).

Currently I am testing the theory that it has to do with INDELs. I am exploring several things, but one specific question relates to exactly how truth INDELs should be included in the confident regions bed file. With SNPs it is simple: just convert the POS of the SNP in the VCF to the interval POS-1 to POS in the bed file. With INDELs, however, it is less clear to me how I should do this. For example, imagine a locus with an insertion in the query: REF = A, ALT = ATTGA, with a START POS in the VCF of, say, 1000. Currently I am not dealing with these loci any differently than SNPs, so positions 999-1000 would be added to the confident regions bed. But the query in this case contains a 4 bp insertion. Should I instead add positions 999-1004 to the bed file? (The conventions I am weighing are sketched below.)
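
To make the question concrete, here is a sketch of the interval conventions I am weighing, using hypothetical helper functions; which one, if either, DeepVariant expects is exactly what I am unsure about.

```python
# Sketch of the interval options for the example record above:
# POS = 1000, REF = "A", ALT = "ATTGA" (a 4 bp insertion).

def snp_style(pos):
    """What I do now for everything: one base, BED 0-based half-open."""
    return pos - 1, pos                            # -> (999, 1000)

def ref_span(pos, ref):
    """Cover the record's reference footprint; for an insertion this is
    still just the anchor base, for a deletion it spans len(REF)."""
    return pos - 1, pos - 1 + len(ref)             # -> (999, 1000) here

def alt_padded(pos, ref, alt):
    """Pad by the longer allele instead: the 999-1004 option above."""
    return pos - 1, pos - 1 + max(len(ref), len(alt))  # -> (999, 1004)

assert snp_style(1000) == (999, 1000)
assert ref_span(1000, "A") == (999, 1000)
assert alt_padded(1000, "A", "ATTGA") == (999, 1004)
```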

In general, it would be great to have some guidelines in the documentation at some point on how to prepare the truth variants and confident sites for non-model organisms. Unlike those working on model species, those of us working on other organisms don't have access to these sorts of pre-made resources, and it can be difficult to reverse-engineer the methods for making them.

Lastly, I wonder if you would be willing to take a look at my conf_regions.bed and truth_variants.vcf at some point; this might be a quicker way (for all of us) of figuring out what is going on.

Thanks for your help as always

Dan

@DanJeffries reopened this Jan 20, 2025
@kishwarshafin
Collaborator

Hi @DanJeffries, if you can upload the files I can take a look and then suggest the best way to create the bed file.
