As single-cell datasets are really sparse, it's important to handle missing values in a way that doesn't consume too much memory. Currently, CellSNP labels missing entries with ".:.:.:.:.:." (11 characters, so 11 bytes per entry at best). I would strongly suggest using an empty string instead of that stub. I have been processing the output of CellSNP, and when I manually replaced all occurrences of ".:.:.:.:.:." with an empty string, the file size dropped from 25.6 GB to 2.5 GB. This is dramatic. Not only does this choice of missing-value filler waste memory, it also makes the file harder to process with some convenient tools in Python/R.
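For anyone who wants to try the same replacement, here is a minimal sketch of how it could be done in a streaming fashion. It assumes a gzipped, tab-separated VCF whose sample columns start at the tenth column; the names `compact_line` and `compact_vcf` are made up for this example and are not part of CellSNP. Only whole sample fields equal to the placeholder are touched, so a lone "." in the ID or QUAL column is left alone. An empty field may not be accepted by every VCF reader, so a single "." can be passed as `replacement` instead.

```python
import gzip

# Placeholder that CellSNP writes for a missing entry (as quoted in this issue).
MISSING = ".:.:.:.:.:."


def compact_line(line, replacement=""):
    """Replace sample fields that equal MISSING; header lines pass through."""
    if line.startswith("#"):
        return line
    body = line.rstrip("\n")
    newline = line[len(body):]  # keep the original line ending, if any
    fields = body.split("\t")
    # Assumption: columns 0-8 are CHROM..FORMAT, samples start at index 9.
    fields[9:] = [replacement if f == MISSING else f for f in fields[9:]]
    return "\t".join(fields) + newline


def compact_vcf(src_path, dst_path, replacement=""):
    """Stream a gzipped VCF line by line so the file never sits in memory."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            dst.write(compact_line(line, replacement))
```

Calling `compact_vcf("in.vcf.gz", "out.vcf.gz")` (hypothetical file names) would write the compacted copy next to the original rather than editing it in place.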
Very good point. The reason we used ".:.:.:.:.:." is to keep the same format (i.e., the same number of tags) even when a value is missing. I will check whether common R/Python packages for processing VCF files are compatible with "." for missing values. If so, this will indeed save a lot of space.
Alternatively, from v0.1.6 onwards, CellSNP supports saving the AD, DP, and OTH tags as sparse matrices.
To do so, please use -O OUT_DIR instead of -o OUT_FILE.vcf.gz.
You can also use sparseVCF.py to convert an existing VCF.gz file into sparse matrices.
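To illustrate why a sparse layout helps here, the sketch below collects only the non-missing values of one tag as (row, column, value) triplets, which is the coordinate form most sparse-matrix libraries accept. This is not how sparseVCF.py works internally; the function name `to_triplets` is hypothetical, and the code assumes each value of the chosen tag is a single integer and that the FORMAT column (ninth column) lists the tag names.

```python
def to_triplets(lines, tag="DP"):
    """Return (variant_row, sample_col, value) for non-missing `tag` entries."""
    triplets = []
    row = 0
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta and header lines
        fields = line.rstrip("\n").split("\t")
        keys = fields[8].split(":")  # FORMAT column, e.g. "GT:AD:DP:..."
        if tag in keys:
            idx = keys.index(tag)
            for col, sample in enumerate(fields[9:]):
                values = sample.split(":")
                # Missing entries are "." (or absent); they are simply not stored.
                if idx < len(values) and values[idx] not in (".", ""):
                    # Assumption: the tag holds one integer per sample.
                    triplets.append((row, col, int(values[idx])))
        row += 1
    return triplets
```

With mostly missing data, the triplet list grows with the number of observed entries rather than with variants times cells, which is where the space saving comes from.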