Is your feature request related to a problem? Please describe
The current tabix/bgzip module does two things:
Given an uncompressed file, return a bgzip-compressed file.
Given a compressed file, return an uncompressed file, even if the input file was already bgzip-compressed.
Although this ensures that the output of tabix/bgzip will always be accepted by samtools/faidx, I find this behavior counter-intuitive and counter-productive: it strips the compression from bgzip-compressed files that samtools/faidx can already read, which costs both disk space and speed when the files are large genome sequences.
Describe the solution you'd like
I propose to solve the problem by testing the compression method and acting accordingly for each of them.
I added support for bzip2 and xz decompression too, but this is not essential and can be removed if there is no interest in it. That said, I verified that these tools are in the htslib biocontainer.
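To illustrate what "testing the compression method" could look like, here is a minimal Python sketch that classifies a file from its leading bytes. It is not the code from the proposed patch. The function name `detect_compression` and the `ACTIONS` table are hypothetical, and the per-format actions are an assumption about what "acting accordingly" would mean. The magic numbers themselves come from the gzip, BGZF, bzip2 and xz formats.

```python
import struct


def detect_compression(header: bytes) -> str:
    """Classify leading file bytes as 'bgzf', 'gzip', 'bzip2', 'xz' or 'none'."""
    if header[:3] == b"BZh":
        return "bzip2"
    if header[:6] == b"\xfd7zXZ\x00":
        return "xz"
    if header[:2] != b"\x1f\x8b":
        return "none"
    # gzip member: byte 3 is FLG, and bit 0x04 (FEXTRA) announces an extra field.
    if len(header) >= 12 and header[3] & 0x04:
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = header[12:12 + xlen]
        pos = 0
        # Walk the extra subfields; BGZF stores its block size in a 'BC' subfield.
        while pos + 4 <= len(extra):
            (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
            if extra[pos:pos + 2] == b"BC" and slen == 2:
                return "bgzf"
            pos += 4 + slen
    return "gzip"


# Assumed mapping from detected format to action (not confirmed by the patch).
ACTIONS = {
    "bgzf": "keep the file as-is",
    "gzip": "decompress, then recompress with bgzip",
    "bzip2": "decompress, then recompress with bgzip",
    "xz": "decompress, then recompress with bgzip",
    "none": "compress with bgzip",
}
```

Reading the first few dozen bytes of the input is enough, for example `detect_compression(open(path, "rb").read(64))`.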
Describe alternatives you've considered
I also proposed a separate module in case the behaviour of tabix/bgzip cannot be changed.
It is important to note that my proposal breaks the tests and use cases that do not check for potential collision between the file basename and the input metadata ID. In the current version of the module, the files are guaranteed to be renamed. After my patch is applied, if foo.fasta.gz is given as input and it is already bgzipped, the module will try to create a symbolic link to itself, which will cause an error. I would appreciate suggestions on how to better handle this.
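As one possible way to handle the collision, the linking step could compare the input and output names first and reuse the input when they are identical. The Python sketch below only illustrates that idea. The function `stage_bgzipped` is hypothetical and not part of any existing module, and a real Nextflow module would also need its output declaration to match whichever file is returned.

```python
from pathlib import Path


def stage_bgzipped(src: Path, dest: Path) -> Path:
    """Expose an already-bgzipped `src` under the name `dest` without copying."""
    same_name = dest.name == src.name
    same_dir = dest.parent.resolve() == src.parent.resolve()
    if same_name and same_dir:
        # Names collide: linking would point the file at itself, so reuse the input.
        return src
    if dest.is_symlink() or dest.exists():
        dest.unlink()
    dest.symlink_to(src.resolve())
    return dest
```

With `foo.fasta.gz` as input and an output name that is also `foo.fasta.gz`, the function returns the input path untouched. With any other output name it creates the symbolic link as before.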
As discussed on Slack, I think this is better as a new module, as it preserves the behaviour of the widely used tabix/bgzip.
There should probably be some consideration of whether it really makes sense to keep the new module under tabix if the scope is so wide, but I guess it may make sense if all recompression is done with bgzip.
Possible new names:
tabix/recompress
tabix/ensurecompressed
tabix/ensurebgz
I like the use of htsfile and it appears that bzip2 and xz support would not require any modification to the standard docker image used by the tabix/bgzip process.
(We could perhaps also replace file-type detection in tabix/bgzip with htsfile to be more robust to odd extensions?)
I would be pleased to make a PR based on #7433, which I would update to use htsfile instead of file. Regarding the name of the module, my preference would be to have it under samtools/* and have it use the same containers as the other samtools modules, as the typical use is to prepare input for samtools/faidx. That said, I will of course use whichever name reaches consensus in this issue.
Additional context
tabix/bgzf module maintainers are @JoseEspinosa, @drpatelh, @maxulysse, @nvnieuwk.