Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Ensure tabix/bgzip always returns a bgzipped file. #7444

Open
charles-plessy opened this issue Feb 6, 2025 · 2 comments
Open

[FEATURE] Ensure tabix/bgzip always returns a bgzipped file. #7444

charles-plessy opened this issue Feb 6, 2025 · 2 comments

Comments

@charles-plessy
Copy link
Contributor

Is your feature request related to a problem? Please describe

The current tabix/bgzip module does two things:

  • Given an uncompressed file, return a bgzip-compressed file.
  • Given a compressed file, return an uncompressed file, even if the input file was already bgip-compressed.

Although it ensures that the output of tabix/bgzip will always be accepted by samtools/faidx, I find this behavior counter-intuitive and also counter-productive, since it removes the compression of bgzip-compressed files that are already good for samtools/faidx, which can make a big difference in terms of space and speed efficiency when the files are large genomes sequences.

Describe the solution you'd like

I propose to solve the problem by testing the compression method and acting accordingly for each of them.

--- a/modules/nf-core/tabix/bgzip/main.nf
+++ b/modules/nf-core/tabix/bgzip/main.nf
@@ -21,16 +21,27 @@ process TABIX_BGZIP {
     script:
     def args = task.ext.args ?: ''
     prefix   = task.ext.prefix ?: "${meta.id}"
-    in_bgzip = ["gz", "bgz", "bgzf"].contains(input.getExtension())
-    extension = in_bgzip ? input.getBaseName().tokenize(".")[-1] : input.getExtension()
-    output   = in_bgzip ? "${prefix}.${extension}" : "${prefix}.${extension}.gz"
-    command = in_bgzip ? '-d' : ''
+    in_zip   = ["bz2", "gz", "bgz", "bgzf", "xz"].contains(input.getExtension())
+    extension = in_zip ? input.getBaseName().tokenize(".")[-1] : input.getExtension()
+    output    = "${prefix}.${extension}.gz"
     // Name the index according to $prefix, unless a name has been requested
     if ((args.matches("(^| )-i\\b") || args.matches("(^| )--index(\$| )")) && !args.matches("(^| )-I\\b") && !args.matches("(^| )--index-name\\b")) {
         args = args + " -I ${output}.gzi"
     }
     """
-    bgzip $command -c $args -@${task.cpus} $input > ${output}
+    FILE_TYPE=\$(htsfile $input)
+    case "\$FILE_TYPE" in
+        *BGZF-compressed*)
+            ln -s $input ${output} ;;
+        *gzip-compressed*)
+            zcat  $input | bgzip -c $args -@${task.cpus} > ${output} ;;
+        *bzip2-compressed*)
+            bzcat $input | bgzip -c $args -@${task.cpus} > ${output} ;;
+        *XZ-compressed*)
+            xzcat $input | bgzip -c $args -@${task.cpus} > ${output} ;;
+        *)
+            bgzip -c $args -@${task.cpus} $input > ${output} ;;
+    esac
 
     cat <<-END_VERSIONS > versions.yml
     "${task.process}":

I added support for bzip2 and xz decompression too, but this is not essential and can be removed if there is no interest for. This said I verified that these tools are in the htslib biocontainer.

Describe alternatives you've considered

I also proposed a separate module in case the behaviour of tabix/bgzip can not be changed.

It is important to note that my proposal breaks the tests and use cases that do not check for potential collision between the file basename and the input metadata ID. In the current version of the module, the files are guaranteed to be renamed. After my patch is applied, if foo.fasta.gz is given as input and it is already bgzipped, the module will try to create a symbolic link to itself, which will cause an error. I would appreciate suggestions on how to better handle this.

Additional context

tabix/bgzf module maintainers are @JoseEspinosa , @drpatelh , @maxulysse , @nvnieuwk .

@awgymer
Copy link
Contributor

awgymer commented Feb 6, 2025

As discussed in slack I think this is better as a new module as it preserves the behaviour of the widely used tabix/bgzip.

Probably should be some consideration of whether it really makes sense to keep the new module under tabix if the scope is so wide but I guess if all recompression is with bgzip it may make sense.

Possible new names:

  • tabix/recompress
  • tabix/ensurecompressed
  • tabix/ensurebgz

I like the use of htsfile and it appears that bzip2 and xz support would not require any modification to the standard docker image used by the tabix/bgzip process.

(We could perhaps also replace file-type detection in tabix/bgzip with htsfile to be more robust to odd extensions?)

@charles-plessy
Copy link
Contributor Author

I would be pleased to make a PR based on #7433 that I would upgrade to the use of htsfile instead of file. Regarding the name of the module, my preference would be to have it under samtools/* and have it use the same containers as the other samtools modules as the typical use is to prepare input for samtools/faidx. This said, I will of course use the name that will get consensus in this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants