Merge pull request #14 from COMBINE-lab/develop

Develop
COMBINE-lab · Jan 7, 2023 · commit fa48fe6
2 parents 39e552b + 815def0

Showing 7 changed files with 1,276 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -13,20 +13,20 @@ The `pyroe` package provides useful functions for analyzing single-cell or singl
 ## Installation
 The `pyroe` package can be accessed from its [github repository](https://github.com/COMBINE-lab/pyroe), installed via [`pip`](https://pip.pypa.io/en/stable/). To install the `pyroe` package via `pip` use the command:
 
-```
+```sh
 pip install pyroe
 ```
 
 To make use of the `load_fry` function (which, itself, installs [scanpy](https://scanpy.readthedocs.io/en/stable/)), you should also be sure to install the package with the `scanpy` extra:
 
-```
+```sh
 pip install pyroe[scanpy]
 ```
 
 Alternatively, `pyroe` can be installed via `bioconda`, which will automatically install the variant of the package including `load_fry`, and will
 also install `bedtools` to enable faster construction of the *splici* reference (see below).  This installation can be performed with the command:
 
-```
+```sh
 conda install pyroe
 ```
 
@@ -39,7 +39,7 @@ The USA mode in alevin-fry requires a special index reference, which is called t
 
 Following is an example of calling the `pyroe` to make the *splici* index reference. The final flank length is calculated as the difference between the read length and the flank_trim_length, i.e., 5-2=3. This function allows you to add extra spliced and unspliced sequences to the *splici* index, which will be useful when some unannotated sequences, such as mitochondrial genes, are important for your experiment. **Note** : to make `pyroe` work more quickly, it is recommended to have the latest version of [`bedtools`](https://bedtools.readthedocs.io/en/latest/) ([Aaron R. Quinlan and Ira M. Hall, 2010](https://doi.org/10.1093/bioinformatics/btq033)) installed.
 
-```
+```sh
 pyroe make-splici extdata/small_example_genome.fa extdata/small_example.gtf 5 splici_txome \
       --flank-trim-length 2 --filename-prefix transcriptome_splici --dedup-seqs
 ```
@@ -90,6 +90,47 @@ optional arguments:
 
 The *splici* index of a given species consists of the transcriptome of the species, i.e., the spliced transcripts, and the intronic sequences of the species. Within a gene, if the flanked intronic sequences overlap with each other, the overlapped intronic sequences will be collapsed as a single intronic sequence to make sure each base will appear only once in the intronic sequences. For more detailed information, please check the section S2 in the supplementary file of [alevin-fry manuscript](https://www.biorxiv.org/content/10.1101/2021.06.29.450377v2).
 
+## Prepare spliceu index for quantification with alevin-fry
+
+Recently, [He et al.](https://www.biorxiv.org/content/10.1101/2023.01.04.522742v1) introduced the <ins>*splice*</ins>d+<ins>*u*</ins>nspliced (_spliceu_) index in alevin-fry, which requires a _spliceu_ transcriptome reference. The command for building a _spliceu_ reference is similar to the one for building a _splici_ reference:
+
+```sh
+pyroe make-spliceu extdata/small_example_genome.fa extdata/small_example.gtf spliceu_txome \
+--filename-prefix transcriptome_spliceu
+```
+
+### Full usage
+```
+usage: pyroe make-spliceu [-h] [--filename-prefix FILENAME_PREFIX]
+                          [--extra-spliced EXTRA_SPLICED]
+                          [--extra-unspliced EXTRA_UNSPLICED]
+                          [--bt-path BT_PATH] [--no-bt] [--dedup-seqs]
+                          [--write-clean-gtf]
+                          genome-path gtf-path output-dir
+
+positional arguments:
+  genome-path           The path to a genome fasta file.
+  gtf-path              The path to a gtf file.
+  output-dir            The output directory where Spliceu reference files
+                        will be written.
+
+options:
+  -h, --help            show this help message and exit
+  --filename-prefix FILENAME_PREFIX
+                        The file name prefix of the generated output files.
+  --extra-spliced EXTRA_SPLICED
+                        The path to an extra spliced sequence fasta file.
+  --extra-unspliced EXTRA_UNSPLICED
+                        The path to an extra unspliced sequence fasta file.
+  --bt-path BT_PATH     The path to bedtools v2.30.0 or greater.
+  --no-bt               A flag indicates whether bedtools will be used for
+                        generating Spliceu reference files.
+  --dedup-seqs          A flag indicates whether identical sequences will be
+                        deduplicated.
+  --write-clean-gtf     A flag indicates whether a clean gtf will be written
+                        if encountered invalid records.
+```
+
 ## Processing alevin-fry quantification result
 
 The quantification result of alevin-fry can be loaded into python by the `load_fry()` function. This function takes a output directory returned by `alevin-fry quant` command as the minimum input, and load the quantification result as an `AnnData` object. When processing USA mode result, it assumes that the data comes from a single-cell RNA-sequencing experiment. If one wants to process single-nucleus RNA-sequencing data or prepare the single-cell data for RNA-velocity analysis, the `output_format` argument should be set as `snRNA` or `velocity` correspondingly. One can also define customized output format, see the Full Usage section for detail.
@@ -176,7 +217,7 @@ We provide two python functions:
 - `load_processed_quant()` can fetch the quantification result of one or more available dataset as `fetch_processed_quant()`, and load them into python as `AnnData` objects. We also provide a CLI for fetching quantification results.
 
 
-```bash
+```sh
 pyroe fetch-quant 1 3 6
 ```
 

diff --git a/bin/pyroe b/bin/pyroe
@@ -2,7 +2,7 @@
 
 import logging
 
-from pyroe import make_splici_txome
+from pyroe import make_splici_txome, make_spliceu_txome
 from pyroe import fetch_processed_quant
 from pyroe import convert
 from pyroe import id_to_name
@@ -22,12 +22,15 @@ if __name__ == "__main__":
     parser.add_argument(
         "-v", "--version", action="version", version=f"pyroe {__version__}"
     )
+
     subparsers = parser.add_subparsers(
         title="subcommands",
         dest="command",
         description="valid subcommands",
         help="additional help",
     )
+
+    # make-splici
     parser_makeSplici = subparsers.add_parser(
         "make-splici", help="Make splici reference"
     )
@@ -101,6 +104,63 @@ if __name__ == "__main__":
         help="A flag indicates whether a clean gtf will be written if encountered invalid records.",
     )
 
+    # make-spliceu
+    parser_makeSpliceu = subparsers.add_parser(
+        "make-spliceu", help="Make spliceu reference"
+    )
+    parser_makeSpliceu.add_argument(
+        "genome_path",
+        metavar="genome-path",
+        type=str,
+        help="The path to a genome fasta file.",
+    )
+    parser_makeSpliceu.add_argument(
+        "gtf_path", metavar="gtf-path", type=str, help="The path to a gtf file."
+    )
+    parser_makeSpliceu.add_argument(
+        "output_dir",
+        metavar="output-dir",
+        type=str,
+        help="The output directory where Spliceu reference files will be written.",
+    )
+    parser_makeSpliceu.add_argument(
+        "--filename-prefix",
+        type=str,
+        default="spliceu",
+        help="The file name prefix of the generated output files.",
+    )
+    parser_makeSpliceu.add_argument(
+        "--extra-spliced",
+        type=str,
+        help="The path to an extra spliced sequence fasta file.",
+    )
+    parser_makeSpliceu.add_argument(
+        "--extra-unspliced",
+        type=str,
+        help="The path to an extra unspliced sequence fasta file.",
+    )
+    parser_makeSpliceu.add_argument(
+        "--bt-path",
+        type=str,
+        default="bedtools",
+        help="The path to bedtools v2.30.0 or greater.",
+    )
+    parser_makeSpliceu.add_argument(
+        "--no-bt",
+        action="store_true",
+        help="A flag indicates whether bedtools will be used for generating Spliceu reference files.",
+    )
+    parser_makeSpliceu.add_argument(
+        "--dedup-seqs",
+        action="store_true",
+        help="A flag indicates whether identical sequences will be deduplicated.",
+    )
+    parser_makeSpliceu.add_argument(
+        "--write-clean-gtf",
+        action="store_true",
+        help="A flag indicates whether a clean gtf will be written if encountered invalid records.",
+    )
+
     # parse available datasets
     available_datasets = fetch_processed_quant()
     epilog = "\n".join(
@@ -212,6 +272,19 @@ if __name__ == "__main__":
             no_flanking_merge=args.no_flanking_merge,
             write_clean_gtf=args.write_clean_gtf,
         )
+    elif args.command == "make-spliceu":
+        make_spliceu_txome(
+            genome_path=args.genome_path,
+            gtf_path=args.gtf_path,
+            output_dir=args.output_dir,
+            filename_prefix=args.filename_prefix,
+            extra_spliced=args.extra_spliced,
+            extra_unspliced=args.extra_unspliced,
+            dedup_seqs=args.dedup_seqs,
+            no_bt=args.no_bt,
+            bt_path=args.bt_path,
+            write_clean_gtf=args.write_clean_gtf,
+        )
     elif args.command == "fetch-quant":
         fetch_processed_quant(
             dataset_ids=args.dataset_ids,
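
Review note: the dispatch above reads `args.filename_prefix`, `args.no_bt`, and so on, although the options are declared with hyphens. The following is a minimal, standalone sketch of that mapping using only `argparse`; it reproduces a subset of the `make-spliceu` arguments from this diff and is not the actual `bin/pyroe` script. The `build_parser` helper is a hypothetical name introduced only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild a subset of the make-spliceu subparser (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pyroe")
    subparsers = parser.add_subparsers(dest="command")
    p = subparsers.add_parser("make-spliceu", help="Make spliceu reference")
    # Positionals: the dest uses underscores, the metavar only changes the help text.
    p.add_argument("genome_path", metavar="genome-path", type=str)
    p.add_argument("gtf_path", metavar="gtf-path", type=str)
    p.add_argument("output_dir", metavar="output-dir", type=str)
    # Optionals: argparse converts "--filename-prefix" to the attribute "filename_prefix".
    p.add_argument("--filename-prefix", type=str, default="spliceu")
    p.add_argument("--bt-path", type=str, default="bedtools")
    p.add_argument("--no-bt", action="store_true")
    p.add_argument("--dedup-seqs", action="store_true")
    return parser


# Same command line as the README example added in this pull request.
args = build_parser().parse_args(
    [
        "make-spliceu",
        "extdata/small_example_genome.fa",
        "extdata/small_example.gtf",
        "spliceu_txome",
        "--filename-prefix",
        "transcriptome_spliceu",
    ]
)
print(args.command, args.output_dir, args.filename_prefix, args.no_bt)
# prints: make-spliceu spliceu_txome transcriptome_spliceu False
```

Because `--no-bt` and `--dedup-seqs` use `action="store_true"`, they default to `False` and bedtools is used unless `--no-bt` is passed.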

diff --git a/setup.cfg b/setup.cfg
@@ -1,6 +1,6 @@
 [metadata]
 name = pyroe
-version = 0.6.4
+version = 0.7.0
 author = Dongze He, Rob Patro
 author_email = [email protected], [email protected]
 description = utilities of alevin-fry

diff --git a/src/pyroe/__init__.py b/src/pyroe/__init__.py
@@ -1,7 +1,7 @@
-__version__ = "0.6.4"
+__version__ = "0.7.0"
 
 from pyroe.load_fry import load_fry
-from pyroe.make_splici_txome import make_splici_txome
+from pyroe.make_txome import make_splici_txome, make_spliceu_txome
 from pyroe.fetch_processed_quant import fetch_processed_quant
 from pyroe.load_processed_quant import load_processed_quant
 from pyroe.ProcessedQuant import ProcessedQuant
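
Review note: with `make_spliceu_txome` now exported from the package, the README command can presumably also be run from Python. The sketch below assumes only what this diff shows: the keyword names come from the call in `bin/pyroe`, and the `None`, `False`, and `"bedtools"` values mirror the argparse defaults rather than a documented function signature. The call is guarded so that nothing runs when `pyroe` or the example files are missing.

```python
from pathlib import Path

# Keyword names are copied from the make_spliceu_txome(...) call in bin/pyroe.
# Paths are the example inputs from the README hunk; the remaining values
# mirror the CLI defaults and are assumptions about sensible arguments.
spliceu_kwargs = {
    "genome_path": "extdata/small_example_genome.fa",
    "gtf_path": "extdata/small_example.gtf",
    "output_dir": "spliceu_txome",
    "filename_prefix": "transcriptome_spliceu",
    "extra_spliced": None,
    "extra_unspliced": None,
    "dedup_seqs": False,
    "no_bt": False,
    "bt_path": "bedtools",
    "write_clean_gtf": False,
}

try:
    # Exported by src/pyroe/__init__.py as of this pull request (version 0.7.0).
    from pyroe import make_spliceu_txome
except ImportError:
    make_spliceu_txome = None  # pyroe is missing or predates make-spliceu

# Only run when both example input files are actually present.
inputs_present = all(
    Path(spliceu_kwargs[key]).is_file() for key in ("genome_path", "gtf_path")
)

if make_spliceu_txome is not None and inputs_present:
    make_spliceu_txome(**spliceu_kwargs)
else:
    print("skipping: pyroe or the example input files are not available")
```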