diff --git a/index.html b/index.html
index 97371b8..a92f160 100644
--- a/index.html
+++ b/index.html
@@ -269,27 +269,18 @@
oarfish
oarfish
is a program, written in rust
, for quantifying transcript-level expression from long-read (i.e. Oxford Nanopore cDNA and direct RNA, and PacBio) sequencing technologies. oarfish
requires a sample of sequencing reads aligned to the transcriptome (currently not to the genome). It handles multi-mapping reads by allocating them probabilistically with an expectation-maximization (EM) algorithm.
There are many methods and programs that exist for transcript discovery or
-identification of novel transcripts using long-read RNA sequencing data;
-oarfish
does not tackle this problem. Rather, oarfish
is focused entirely
-on accurate quantification of transcripts. Of course, if you wish to add
-transcripts to the catalog to be quantified, you can perform discovery upstream
-of oarfish
, and then quantify the newly discovered transcripts with oarfish
.
Details about oarfish
's method
oarfish
evaluates alignments of the sequencing reads against the transcriptome. For reads that align to multiple transcripts, oarfish
attempts to resolve their allocation probabilistically using an iterative algorithm (expectation maximization — EM).
oarfish
optionally employs many filters to help discard alignments that may reduce quantification accuracy. Currently, the set of filters applied in oarfish
is directly derived from the NanoCount
1 tool; both the filters that exist, and the way their values are set (with the exception of the --three-prime-clip
filter, which is not set by default in oarfish
but is in NanoCount
).
Additionally, oarfish
provides options to make use of coverage profiles derived from the aligned reads to improve quantification accuracy. The use of this coverage model is enabled with the --model-coverage
flag. If this flag is passed, then oarfish
will apply the coverage model to help further ascertain the origin of each read, which can lead to improved quantification accuracy.
oarfish
is a program, written in rust
, for quantifying transcript-level expression from long-read (i.e. Oxford Nanopore cDNA and direct RNA, and PacBio) sequencing technologies. oarfish
requires a sample of sequencing reads aligned to the transcriptome (currently not to the genome). It handles multi-mapping reads by allocating them probabilistically with an expectation-maximization (EM) algorithm.
It optionally employs many filters to help discard alignments that may reduce quantification accuracy. Currently, the set of filters applied in oarfish
is directly derived from the NanoCount
1 tool; both the filters that exist, and the way their values are set (with the exception of the --three-prime-clip
filter, which is not set by default in oarfish
but is in NanoCount
).
Additionally, oarfish
provides options to make use of coverage profiles derived from the aligned reads to improve quantification accuracy. The use of this coverage model is enabled with the --model-coverage
flag. You can read more about oarfish
2 in the preprint. Please cite the preprint if you use oarfish
in your work or analysis.
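As a rough illustration of the probabilistic allocation described above, the sketch below runs a toy EM over multi-mapping reads. It is not oarfish's implementation: the function name `em_allocate`, the read representation and the uniform starting point are assumptions made for this example, and the real model also draws on alignment scores and, optionally, coverage.

```python
def em_allocate(reads, n_targets, max_iter=1000, tol=1e-3):
    """Toy EM for multi-mapping reads (illustrative only, not oarfish's code).

    reads: list of reads; each read is a list of (target_index, weight)
           pairs, one pair per alignment that survived filtering.
    Returns the estimated number of reads assigned to each target.
    """
    counts = [len(reads) / n_targets] * n_targets  # uniform starting point
    for _ in range(max_iter):
        updated = [0.0] * n_targets
        for alignments in reads:
            # E-step: split the read among its targets in proportion to
            # current abundance times the alignment weight.
            denom = sum(counts[t] * w for t, w in alignments)
            if denom == 0.0:
                continue
            for t, w in alignments:
                updated[t] += counts[t] * w / denom
        # M-step is implicit: the summed fractions become the new counts.
        change = max(abs(a - b) for a, b in zip(counts, updated))
        counts = updated
        if change < tol:
            break
    return counts


# Six reads unique to target 0, two unique to target 1, two ambiguous.
toy_reads = [[(0, 1.0)]] * 6 + [[(1, 1.0)]] * 2 + [[(0, 1.0), (1, 1.0)]] * 2
estimates = em_allocate(toy_reads, n_targets=2)
```

In this toy input the ambiguous reads end up mostly on target 0, because the uniquely mapping reads already indicate that it is the more abundant of the two.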
The usage message can be displayed by passing -h
at the command line.
accurate transcript quantification from long-read RNA-seq data
+A fast, accurate and versatile tool for long-read transcript quantification.
Usage: oarfish [OPTIONS] --alignments <ALIGNMENTS> --output <OUTPUT>
Options:
- --quiet be quiet (i.e. don't output log messages that aren't at least warnings)
- --verbose be verbose (i.e. output all non-developer logging messages)
- -a, --alignments <ALIGNMENTS> path to the file containing the input alignments
- -o, --output <OUTPUT> location where output quantification file should be written
- -t, --threads <THREADS> maximum number of cores that the oarfish can use to obtain binomial probability [default: 1]
- -h, --help Print help
- -V, --version Print version
+ --quiet
+ be quiet (i.e. don't output log messages that aren't at least warnings)
+ --verbose
+ be verbose (i.e. output all non-developer logging messages)
+ -a, --alignments <ALIGNMENTS>
+ path to the file containing the input alignments
+ -o, --output <OUTPUT>
+ location where output quantification file should be written
+ -j, --threads <THREADS>
+ maximum number of cores that the oarfish can use to obtain binomial probability [default: 1]
+ --num-bootstraps <NUM_BOOTSTRAPS>
+ number of bootstrap replicates to produce to assess quantification uncertainty [default: 0]
+ -h, --help
+ Print help
+ -V, --version
+ Print version
filters:
--filter-group <FILTER_GROUP>
@@ -462,15 +445,15 @@ Details about oarfish
's meth
-q, --short-quant <SHORT_QUANT>
location of short read quantification (if provided)
-You can set the various filters using the options listed above. Alternatively, oarfish
exposes a --filter-group
option. This filter-group option applies a collection of values to the filter options at once. Currently, the available filter groups are nanocount-filters
which seeks to match the filter parameters of NanoCount
as closely as possible. It's worth noting that, like NanoCount
, these filters are primarily designed for direct RNA-seq data, since negative-strand alignments (which may be expected in ONT cDNA sequencing) will be discarded. Likewise, the no-filters
flag disables as many filters as possible, leaving it entirely up to the quantification algorithm to determine the best placement of each read, and having to account for each read even if all of the alignments are of rather poor quality.
-Input
The input should be a bam
format file, with reads aligned using minimap2
against the transcriptome. That is, oarfish
does not currently handle spliced alignment to the genome. Further, the output alignments should be name sorted (the default order produced by minimap2
should be fine). Specifically, oarfish
relies on the existence of the AS
tag in the bam
records that encodes the alignment score in order to obtain the score for each alignment (which is used in probabilistic read assignment), and the score of the best alignment, overall, for each read.
-Note: The actual characteristics required by oarfish
are that the provided alignments align each read to the transcriptome, and report all alignments that should be considered for each read (i.e. multimappings are allowed). Likewise, the alignments for a given read should be adjacent in the input bam
file, and each valid alignment should have an AS
tag. If you'd like support for an alternative aligner that meets these requirements, please reach out (e.g. open a GitHub issue), and we'd be happy to consider adding support.
+Inferential Replicates
+oarfish
has the ability to compute inferential replicates of its quantification estimates. This is performed by bootstrap sampling of the original read mappings, and subsequently performing inference under each resampling. These inferential replicates allow assessing the variance of the point estimate of transcript abundance, and can lead to improved differential analysis at the transcript level, if using a differential testing tool that takes advantage of this information. The generation of inferential replicates is controlled by the --num-bootstraps
argument to oarfish
. The default value is 0
, meaning that no inferential replicates are generated. If you set this to some value greater than 0
, then the requested number of inferential replicates will be generated. If you generate inferential replicates, it is recommended to run oarfish
with multiple threads, since replicate generation is highly parallelized. Finally, if replicates are generated, they are written to a Parquet
file whose name starts with the specified output stem and ends with infreps.pq
.
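The resampling idea can be sketched as follows. This is a minimal illustration under stated assumptions, not oarfish's code: `bootstrap_replicates` and `fractional_counts` are hypothetical names, and the stand-in estimator simply splits each read by its alignment weights where oarfish would re-run its full inference.

```python
import random
import statistics


def fractional_counts(reads, n_targets):
    """Stand-in estimator: split each read in proportion to its weights."""
    counts = [0.0] * n_targets
    for alignments in reads:
        total = sum(w for _, w in alignments)
        for t, w in alignments:
            counts[t] += w / total
    return counts


def bootstrap_replicates(reads, n_targets, num_bootstraps,
                         infer=fractional_counts, seed=0):
    """Resample reads with replacement and re-run inference on each resample."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(num_bootstraps):
        resampled = rng.choices(reads, k=len(reads))
        replicates.append(infer(resampled, n_targets))
    return replicates


# Each read is a list of (target_index, weight) pairs (made-up toy data).
toy_reads = [[(0, 1.0)]] * 6 + [[(1, 1.0)]] * 2 + [[(0, 1.0), (1, 1.0)]] * 2
reps = bootstrap_replicates(toy_reads, n_targets=2, num_bootstraps=20)
# Per-target variance across replicates summarises inferential uncertainty.
variances = [statistics.variance([rep[t] for rep in reps]) for t in range(2)]
```

Passing `num_bootstraps=0` to this sketch returns an empty list, mirroring the default described above in which no replicates are produced.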
Output
The --output
option passed to oarfish
corresponds to a path prefix (this prefix can contain the path separator character, and if it refers to a directory that does not yet exist, that directory will be created). Based on this path prefix, say P
, oarfish
will create the following files:
P.meta_info.json
- a JSON format file containing information about relevant parameters with which oarfish
was run, and other relevant information from the processed sample apart from the actual transcript quantifications.
P.quant
- a tab-separated file listing the quantified targets, as well as information about their length and other metadata. The num_reads
column provides the estimate of the number of reads originating from each target.
+P.infreps.pq
- a Parquet
table where each row is a transcript and each column is an inferential replicate, containing the estimated counts for each transcript under each computed inferential replicate.
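For downstream use, the quantification table can be read with any tab-separated parser. The sketch below is a minimal example and makes assumptions: only the `num_reads` column is named in the text above, so the `tname` and `len` headers in the toy data are made up, and the helper `read_num_reads` is hypothetical. It treats the first column as the target name.

```python
import csv
import io


def read_num_reads(handle):
    """Map each target name to its num_reads value from a P.quant-style TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    name_column = reader.fieldnames[0]  # assumption: first column names the target
    return {row[name_column]: float(row["num_reads"]) for row in reader}


# Made-up stand-in for the contents of a P.quant file.
example = io.StringIO("tname\tlen\tnum_reads\ntxA\t1500\t7.5\ntxB\t900\t2.5\n")
counts = read_num_reads(example)
total = sum(counts.values())
```

With a real run, the same function could be given a file handle instead, for example `open("P.quant", newline="")`, where `P` is the prefix passed to --output.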
References
@@ -479,6 +462,9 @@ References
-
Josie Gleeson, Adrien Leger, Yair D J Prawer, Tracy A Lane, Paul J Harrison, Wilfried Haerty, Michael B Clark, Accurate expression quantification from nanopore direct RNA sequencing with NanoCount, Nucleic Acids Research, Volume 50, Issue 4, 28 February 2022, Page e19, https://doi.org/10.1093/nar/gkab1129 ↩
+-
+
Zahra Zare Jousheghani, Rob Patro. Oarfish: Enhanced probabilistic modeling leads to improved accuracy in long read transcriptome quantification, bioRxiv 2024.02.28.582591; doi: https://doi.org/10.1101/2024.02.28.582591 ↩
+
diff --git a/search/search_index.json b/search/search_index.json
index 4c3eb90..ee7ab5d 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"oarfish: transcript quantification from long-read RNA-seq data","text":""},{"location":"#about-oarfish","title":"About oarfish
","text":"oarfish
is a program, written in rust
, for quantifying transcript-level expression from long-read (i.e. Oxford Nanopore cDNA and direct RNA, and PacBio) sequencing technologies. oarfish
requires a sample of sequencing reads aligned to the transcriptome (currently not to the genome). It handles multi-mapping reads by allocating them probabilistically with an expectation-maximization (EM) algorithm.
There are many methods and programs that exist for transcript discovery or identification of novel transcripts using long-read RNA sequencing data; oarfish
does not tackle this problem. Rather, oarfish
is focused entirely on accurate quantification of transcripts. Of course, if you wish to add transcripts to the catalog to be quantified, you can perform discovery upstream of oarfish
, and then quantify the newly discovered transcripts with oarfish
.
"},{"location":"#details-about-oarfishs-method","title":"Details about oarfish
's method","text":"oarfish
evaluates alignments of the sequencing reads against the transcriptome. For reads that align to multiple transcripts, oarfish
attempts to resolve their allocation probabilistically using an iterative algorithm (expectation maximization \u2014 EM).
oarfish
optionally employs many filters to help discard alignments that may reduce quantification accuracy. Currently, the set of filters applied in oarfish
is directly derived from the NanoCount
1 tool; both the filters that exist, and the way their values are set (with the exception of the --three-prime-clip
filter, which is not set by default in oarfish
but is in NanoCount
).
Additionally, oarfish
provides options to make use of coverage profiles derived from the aligned reads to improve quantification accuracy. The use of this coverage model is enabled with the --model-coverage
flag. If this flag is passed, then oarfish
will apply the coverage model to help further ascertain the origin of each read, which can lead to improved quantification accuracy.
The usage message can be displayed by passing -h
at the command line.
accurate transcript quantification from long-read RNA-seq data\n\nUsage: oarfish [OPTIONS] --alignments <ALIGNMENTS> --output <OUTPUT>\n\nOptions:\n --quiet be quiet (i.e. don't output log messages that aren't at least warnings)\n --verbose be verbose (i.e. output all non-developer logging messages)\n -a, --alignments <ALIGNMENTS> path to the file containing the input alignments\n -o, --output <OUTPUT> location where output quantification file should be written\n -t, --threads <THREADS> maximum number of cores that the oarfish can use to obtain binomial probability [default: 1]\n -h, --help Print help\n -V, --version Print version\n\nfilters:\n --filter-group <FILTER_GROUP>\n [possible values: no-filters, nanocount-filters]\n -t, --three-prime-clip <THREE_PRIME_CLIP>\n maximum allowable distance of the right-most end of an alignment from the 3' transcript end [default: 4294967295]\n -f, --five-prime-clip <FIVE_PRIME_CLIP>\n maximum allowable distance of the left-most end of an alignment from the 5' transcript end [default: 4294967295]\n -s, --score-threshold <SCORE_THRESHOLD>\n fraction of the best possible alignment score that a secondary alignment must have for consideration [default: 0.95]\n -m, --min-aligned-fraction <MIN_ALIGNED_FRACTION>\n fraction of a query that must be mapped within an alignemnt to consider the alignemnt valid [default: 0.5]\n -l, --min-aligned-len <MIN_ALIGNED_LEN>\n minimum number of nucleotides in the aligned portion of a read [default: 50]\n -n, --allow-negative-strand\n allow both forward-strand and reverse-complement alignments\n\ncoverage model:\n --model-coverage apply the coverage model\n -b, --bins <BINS> number of bins to use in coverage model [default: 10]\n\nEM:\n --max-em-iter <MAX_EM_ITER>\n maximum number of iterations for which to run the EM algorithm [default: 1000]\n --convergence-thresh <CONVERGENCE_THRESH>\n maximum number of iterations for which to run the EM algorithm [default: 0.001]\n -q, --short-quant 
<SHORT_QUANT>\n location of short read quantification (if provided)\n
You can set the various filters using the options listed above. Alternatively, oarfish
exposes a --filter-group
option. This filter-group option applies a collection of values to the filter options at once. Currently, the available filter groups are nanocount-filters
which seeks to match the filter parameters of NanoCount
as closely as possible. It's worth noting that, like NanoCount
, these filters are primarily designed for direct RNA-seq data, since negative-strand alignments (which may be expected in ONT cDNA sequencing) will be discarded. Likewise, the no-filters
flag disables as many filters as possible, leaving it entirely up to the quantification algorithm to determine the best placement of each read, and having to account for each read even if all of the alignments are of rather poor quality.
"},{"location":"#input","title":"Input","text":"The input should be a bam
format file, with reads aligned using minimap2
against the transcriptome. That is, oarfish
does not currently handle spliced alignment to the genome. Further, the output alignments should be name sorted (the default order produced by minimap2
should be fine). Specifically, oarfish
relies on the existence of the AS
tag in the bam
records that encodes the alignment score in order to obtain the score for each alignment (which is used in probabilistic read assignment), and the score of the best alignment, overall, for each read.
Note: The actual characteristics required by oarfish
are that the provided alignments align each read to the transcriptome, and report all alignments that should be considered for each read (i.e. multimappings are allowed). Likewise, the alignments for a given read should be adjacent in the input bam
file, and each valid alignment should have an AS
tag. If you'd like support for an alternative aligner that meets these requirements, please reach out (e.g. open a GitHub issue), and we'd be happy to consider adding support.
"},{"location":"#output","title":"Output","text":"The --output
option passed to oarfish
corresponds to a path prefix (this prefix can contain the path separator character, and if it refers to a directory that does not yet exist, that directory will be created). Based on this path prefix, say P
, oarfish
will create 2 files:
P.meta_info.json
- a JSON format file containing information about relevant parameters with which oarfish
was run, and other relevant information from the processed sample apart from the actual transcript quantifications. P.quant
- a tab-separated file listing the quantified targets, as well as information about their length and other metadata. The num_reads
column provides the estimate of the number of reads originating from each target.
"},{"location":"#references","title":"References","text":" -
Josie Gleeson, Adrien Leger, Yair D J Prawer, Tracy A Lane, Paul J Harrison, Wilfried Haerty, Michael B Clark, Accurate expression quantification from nanopore direct RNA sequencing with NanoCount, Nucleic Acids Research, Volume 50, Issue 4, 28 February 2022, Page e19, https://doi.org/10.1093/nar/gkab1129 \u21a9
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"oarfish: transcript quantification from long-read RNA-seq data","text":""},{"location":"#basic-usage","title":"Basic usage","text":"oarfish
is a program, written in rust
, for quantifying transcript-level expression from long-read (i.e. Oxford Nanopore cDNA and direct RNA, and PacBio) sequencing technologies. oarfish
requires a sample of sequencing reads aligned to the transcriptome (currently not to the genome). It handles multi-mapping reads by allocating them probabilistically with an expectation-maximization (EM) algorithm.
It optionally employs many filters to help discard alignments that may reduce quantification accuracy. Currently, the set of filters applied in oarfish
is directly derived from the NanoCount
1 tool; both the filters that exist, and the way their values are set (with the exception of the --three-prime-clip
filter, which is not set by default in oarfish
but is in NanoCount
).
Additionally, oarfish
provides options to make use of coverage profiles derived from the aligned reads to improve quantification accuracy. The use of this coverage model is enabled with the --model-coverage
flag. You can read more about oarfish
2 in the preprint. Please cite the preprint if you use oarfish
in your work or analysis.
The usage message can be displayed by passing -h
at the command line.
A fast, accurate and versatile tool for long-read transcript quantification.\n\nUsage: oarfish [OPTIONS] --alignments <ALIGNMENTS> --output <OUTPUT>\n\nOptions:\n --quiet\n be quiet (i.e. don't output log messages that aren't at least warnings)\n --verbose\n be verbose (i.e. output all non-developer logging messages)\n -a, --alignments <ALIGNMENTS>\n path to the file containing the input alignments\n -o, --output <OUTPUT>\n location where output quantification file should be written\n -j, --threads <THREADS>\n maximum number of cores that the oarfish can use to obtain binomial probability [default: 1]\n --num-bootstraps <NUM_BOOTSTRAPS>\n number of bootstrap replicates to produce to assess quantification uncertainty [default: 0]\n -h, --help\n Print help\n -V, --version\n Print version\n\nfilters:\n --filter-group <FILTER_GROUP>\n [possible values: no-filters, nanocount-filters]\n -t, --three-prime-clip <THREE_PRIME_CLIP>\n maximum allowable distance of the right-most end of an alignment from the 3' transcript end [default: 4294967295]\n -f, --five-prime-clip <FIVE_PRIME_CLIP>\n maximum allowable distance of the left-most end of an alignment from the 5' transcript end [default: 4294967295]\n -s, --score-threshold <SCORE_THRESHOLD>\n fraction of the best possible alignment score that a secondary alignment must have for consideration [default: 0.95]\n -m, --min-aligned-fraction <MIN_ALIGNED_FRACTION>\n fraction of a query that must be mapped within an alignemnt to consider the alignemnt valid [default: 0.5]\n -l, --min-aligned-len <MIN_ALIGNED_LEN>\n minimum number of nucleotides in the aligned portion of a read [default: 50]\n -n, --allow-negative-strand\n allow both forward-strand and reverse-complement alignments\n\ncoverage model:\n --model-coverage apply the coverage model\n -b, --bins <BINS> number of bins to use in coverage model [default: 10]\n\nEM:\n --max-em-iter <MAX_EM_ITER>\n maximum number of iterations for which to run the EM algorithm [default: 
1000]\n --convergence-thresh <CONVERGENCE_THRESH>\n maximum number of iterations for which to run the EM algorithm [default: 0.001]\n -q, --short-quant <SHORT_QUANT>\n location of short read quantification (if provided)\n
The input should be a bam
format file, with reads aligned using minimap2
against the transcriptome. That is, oarfish
does not currently handle spliced alignment to the genome. Further, the output alignments should be name sorted (the default order produced by minimap2
should be fine). Specifically, oarfish
relies on the existence of the AS
tag in the bam
records that encodes the alignment score in order to obtain the score for each alignment (which is used in probabilistic read assignment), and the score of the best alignment, overall, for each read.
"},{"location":"#inferential-replicates","title":"Inferential Replicates","text":"oarfish
has the ability to compute inferential replicates of its quantification estimates. This is performed by bootstrap sampling of the original read mappings, and subsequently performing inference under each resampling. These inferential replicates allow assessing the variance of the point estimate of transcript abundance, and can lead to improved differential analysis at the transcript level, if using a differential testing tool that takes advantage of this information. The generation of inferential replicates is controlled by the --num-bootstraps
argument to oarfish
. The default value is 0
, meaning that no inferential replicates are generated. If you set this to some value greater than 0
, then the requested number of inferential replicates will be generated. If you generate inferential replicates, it is recommended to run oarfish
with multiple threads, since replicate generation is highly parallelized. Finally, if replicates are generated, they are written to a Parquet
file whose name starts with the specified output stem and ends with infreps.pq
.
"},{"location":"#output","title":"Output","text":"The --output
option passed to oarfish
corresponds to a path prefix (this prefix can contain the path separator character, and if it refers to a directory that does not yet exist, that directory will be created). Based on this path prefix, say P
, oarfish
will create the following files:
P.meta_info.json
- a JSON format file containing information about relevant parameters with which oarfish
was run, and other relevant information from the processed sample apart from the actual transcript quantifications. P.quant
- a tab-separated file listing the quantified targets, as well as information about their length and other metadata. The num_reads
column provides the estimate of the number of reads originating from each target. P.infreps.pq
- a Parquet
table where each row is a transcript and each column is an inferential replicate, containing the estimated counts for each transcript under each computed inferential replicate.
"},{"location":"#references","title":"References","text":" -
Josie Gleeson, Adrien Leger, Yair D J Prawer, Tracy A Lane, Paul J Harrison, Wilfried Haerty, Michael B Clark, Accurate expression quantification from nanopore direct RNA sequencing with NanoCount, Nucleic Acids Research, Volume 50, Issue 4, 28 February 2022, Page e19, https://doi.org/10.1093/nar/gkab1129 \u21a9
-
Zahra Zare Jousheghani, Rob Patro. Oarfish: Enhanced probabilistic modeling leads to improved accuracy in long read transcriptome quantification, bioRxiv 2024.02.28.582591; doi: https://doi.org/10.1101/2024.02.28.582591 \u21a9
"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index 13f545b..3ca4a27 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ