ABH format assumes non-missing, polymorphic SNPs between parents #3

Closed
carolyncaron opened this issue Apr 27, 2017 · 4 comments

@carolyncaron
Member

If the parents are identical or missing, the script will happily translate all SNPs across the individuals to match. For example, if Parent A is missing, all individuals that are also missing will be assigned allele A.

For now, we've ensured the files we're distributing to our group have polymorphic non-missing calls for the parents. But we should not necessarily be making this assumption and thus we should handle this situation properly in case we run into it in the future.

@carolyncaron carolyncaron self-assigned this Apr 27, 2017
@carolyncaron
Member Author

I fixed this bug, but because I was lazy I didn't create a pull request first, and now I regret it. :-(

Requesting review from @laceysanderson. All changes are in commit cc808c4

How I fixed the problem

When looking at the genotype calls for each parent, there is now a check for whether the alleles are missing (".") or different (implying heterozygosity). In either case, a message is printed to the terminal indicating the SNP site and which parent had the problematic genotype. I then continue to the next iteration of the while loop, which skips processing of that line in the input file, so no conversion for that line will exist in the output file.
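
To make the check easier to review, here is a minimal Python sketch of the logic described above. It is not the code from the commit. The function names `parent_problem` and `filter_sites` are hypothetical, and the sketch assumes the parents sit in the first two sample columns of a tab-delimited VCF (0-based indices 9 and 10) with GT as the first FORMAT subfield.

```python
def parent_problem(gt_field):
    """Return why a parent call is unusable ("missing"/"heterozygous"), or None."""
    # Keep only the GT subfield, e.g. "0/1:35:99" -> "0/1".
    gt = gt_field.split(":")[0]
    # Treat phased ("|") and unphased ("/") calls the same way.
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return None


def filter_sites(lines, maternal_col=9, paternal_col=10):
    """Split VCF data lines into kept lines and skip messages.

    Assumption: the parent columns are the first two sample columns.
    """
    kept, messages = [], []
    for line in lines:
        if line.startswith("#"):
            continue  # header lines are not converted
        fields = line.rstrip("\n").split("\t")
        problems = []
        for name, col in (("maternal", maternal_col), ("paternal", paternal_col)):
            reason = parent_problem(fields[col])
            if reason:
                problems.append(f"{name} parent is {reason}")
        if problems:
            # Report the site and skip it, so it never reaches the output.
            messages.append(f"Skipping {fields[0]}:{fields[1]}: " + "; ".join(problems))
            continue
        kept.append(line)
    return kept, messages
```

Returning the messages instead of printing them is only a choice made for this sketch so the behaviour is easy to assert on; the actual script prints to the terminal.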

How to test

At present, the vcf_filter module uses the first two sample columns of a VCF as the maternal and paternal parents, respectively. Through the administration page, upload a VCF file to the module with known missing (./.) or heterozygous (0/1 or 1/0) calls in either of these columns. Then select your file and choose "A/B format" as your Export format. Run the Tripal job manually and check the output for messages about skipped SNP sites where either of the first two columns had a missing or heterozygous call. You can download the output file to confirm these sites were omitted.

@laceysanderson
Member

Did some research on ABH format with the intent of deciding whether I could find an alternative to your approach of removing lines where the parents are missing or heterozygous.

  • The TASSEL GenosToABH plugin takes the same approach you do when converting to ABH.
  • ParentChecker uses statistics based on the assumption of a RIL to infer the parent.
  • Another option could be to use X or N to signify a non-parental allele. This could allow a researcher to inspect each such site and correct the calls.

I'm not sure I'm recommending either of the alternatives... Perhaps the best approach would be to let the researcher choose, by providing ABH-format-specific options to either exclude sites with missing/heterozygous parents or denote non-parental alleles with X. This would definitely need an additional issue and could be considered an enhancement. In the meantime, I would suggest clarifying our approach in the description of the ABH format.
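
For discussion, here is a rough Python sketch of what such an option could look like. Nothing in it exists in the module: `to_abh`, `parent_allele` and the `on_bad_parent` parameter are hypothetical names, and using "-" as the missing code is an assumption made for the sketch.

```python
def parent_allele(gt_field):
    """Return the allele of a homozygous call, or None if missing/heterozygous."""
    alleles = set(gt_field.split(":")[0].replace("|", "/").split("/"))
    if len(alleles) == 1 and "." not in alleles:
        return alleles.pop()
    return None


def to_abh(parent_a, parent_b, samples, on_bad_parent="exclude"):
    """Convert one site's sample genotypes to A/B/H codes.

    on_bad_parent="exclude": return None so the caller skips the site.
    on_bad_parent="mark":    keep the site, coding non-parental calls as "X".
    """
    if on_bad_parent not in ("exclude", "mark"):
        raise ValueError(f"unknown on_bad_parent value: {on_bad_parent!r}")
    a, b = parent_allele(parent_a), parent_allele(parent_b)
    if (a is None or b is None) and on_bad_parent == "exclude":
        return None
    calls = []
    for gt_field in samples:
        alleles = set(gt_field.split(":")[0].replace("|", "/").split("/"))
        if "." in alleles:
            calls.append("-")  # assumed missing code
        elif alleles == {a}:
            calls.append("A")
        elif alleles == {b}:
            calls.append("B")
        elif alleles == {a, b}:
            calls.append("H")
        else:
            calls.append("X")  # cannot be attributed to either parent
    return calls
```

With `on_bad_parent="mark"` and a missing Parent A, only calls matching Parent B get a parental code and everything else becomes X, which a researcher could then inspect by hand.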

@laceysanderson
Member

laceysanderson commented May 9, 2017

This worked as expected on LR-86 and LR-70. Unfortunately, both had already been filtered to be bi-allelic, and neither seems to contain a site where both parents are the same (I expected such a site to show up as a row with no B). When I tried to test converting the UCDavis set to ABH, which would have had both of these issues :-), I ran into #7 and the job seemed to hang.

Concerns (to be addressed in a pull request!):

  • Sites with 3 alleles. I suspect that the third allele is currently just changed to missing and the site is kept. I feel this is very misleading and that these sites should also be skipped.
  • Sites where both parents are the same. Have you tested this? I would expect to see both parents as A and no B's in the row.
  • Document on the VCF filter page that this export format is restricted to bi-allelic sites with no missing calls in the parents.
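
As a starting point for the first two concerns, a small Python sketch that flags such sites might look like the following. `site_flags` is a hypothetical helper, not something in the module. It assumes a tab-split VCF data line with the parents in the first two sample columns, and it only flags sites, leaving the decision to skip them to the caller.

```python
def site_flags(fields, maternal_col=9, paternal_col=10):
    """Return a set of flags for one tab-split VCF data line."""
    flags = set()
    # The ALT column (index 4) lists alternate alleles separated by commas,
    # so a comma means the site has three or more alleles in total.
    if "," in fields[4]:
        flags.add("multiallelic")

    def alleles(gt_field):
        return set(gt_field.split(":")[0].replace("|", "/").split("/"))

    mat, pat = alleles(fields[maternal_col]), alleles(fields[paternal_col])
    both_called = "." not in (mat | pat)
    # Both parents homozygous for the same allele: no B calls are possible.
    if both_called and len(mat) == 1 and mat == pat:
        flags.add("parents_identical")
    return flags
```

A caller could skip any site whose flag set is non-empty and print the flags in the skip message, in the same way missing and heterozygous parents are reported now.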

laceysanderson added a commit that referenced this issue Apr 19, 2018
Addresses issue #3: ABH format assumes non-missing, polymorphic SNPs between parents
@laceysanderson
Member

Fixed in PR
