Skip to content

Latest commit

 

History

History
132 lines (87 loc) · 12.9 KB

README.md

File metadata and controls

132 lines (87 loc) · 12.9 KB

Documentation

AppendNCBIGeneDescriptionColumn

AppendNCBIGeneDescriptionColumn(ExcelDataFilePath, ExcelDataFileEnsemblIDColumnName = "ensembl_gene_id")

Appends a column of NCBI gene descriptions to an Excel sheet with a column (named ExcelDataFileEnsemblIDColumnName) containing Ensembl IDs. Depends on EnsemblID2Entrez.

BioMartGOFilter.Nfurzeri

BioMartGOFilter.Nfurzeri(GO.CSV, CombineFruitFlyHomology = TRUE, CombineHumanHomology = TRUE, CombineMouseHomology = TRUE, CombineNematodeHomology = TRUE, CombineZebrafishHomology = TRUE)

write.gmt(GOList, OutputFilePath = "custom.gmt")

Use the biomaRt package to get all Nothobranchius furzeri genes with GO term annotations in GO.CSV [including all child terms (is_a, regulates, etc.)]. CombineFruitFlyHomology/CombineHumanHomology/CombineMouseHomology/CombineNematodeHomology/CombineZebrafishHomology allows complementation using the gene homology [to fly (Drosophila melanogaster)/human/mouse (Mus musculus)/nematode (Caenorhabditis elegans)/zebrafish (Danio rerio)] information. Note that this only works for Ensembl 113 (released on October 18th, 2024) or later.

The output is a list in which the name of each element is the Ensembl ID of a N. furzeri gene and the content of each element is the GO term annotations of that gene (supplemented with homology information). This list can then be converted into a .gmt file (for the gene set enrichment analysis) with write.gmt. For example, the following script generates a .gmt file of all GO term annotations of all N. furzeri genes:

BP <- BioMartGOFilter.Nfurzeri("GO:0008150")
CC <- BioMartGOFilter.Nfurzeri("GO:0005575")
MF <- BioMartGOFilter.Nfurzeri("GO:0003674")
write.gmt(c(BP, CC, MF))
# file.show("custom.gmt")

CorrelateOmics

CorrelateOmics(ProteomicsDataFilePath, UniProtIDColumnName = "Protein IDs", To = "Ensembl", GeneNameColumnName = "Gene name", ProteomicsColumnsToCalculateMean, TranscriptomicsDataFilePath, EnsemblIDColumnName = "ensembl_gene_id", TranscriptomicsColumnsToCalculateMean, RefreshGeneNames = TRUE)

CorrelateOmics.log2FoldChange(ProteomicsDataFilePath, UniProtIDColumnName = "Protein IDs", To = "Ensembl", GeneNameColumnName = "Gene name", Proteomicslog2FoldChangeColumn, TranscriptomicsDataFilePath, EnsemblIDColumnName = "ensembl_gene_id", Transcriptomicslog2FoldChangeColumn, RefreshGeneNames = TRUE)

plotCorrelateOmics(DataFrame, Alpha = 0.1, HighlightGeneNameRegex, HighlightAlpha = 1, HighlightColor = "#C40233", HighlightSize = 2.5)

plotCorrelateOmics.log2FoldChange(DataFrame, Alpha = 0.1, HighlightGeneNameRegex, HighlightAlpha = 1, HighlightColor = "#C40233", HighlightSize = 2.5)

CorrelateOmics and CorrelateOmics.log2FoldChange link proteomics data in ProteomicsDataFilePath and transcriptomics data in TranscriptomicsDataFilePath (only proteins/genes with a one-to-one mapping will be included). If RefreshGeneNames is set as FALSE, the result of CorrelateOmics will be a data frame with 5 columns: "logTranscriptomicsMean", "logTranscriptomicsStdev", "logProteomicsMean", "logProteomicsStdev", and "GeneName" (copied directly from the GeneNameColumnName column in ProteomicsDataFilePath) and the result of CorrelateOmics.log2FoldChange will be a data frame with 3 columns: "Transcriptomicslog2FoldChange", "Proteomicslog2FoldChange", and "GeneName". If RefreshGeneNames is set as TRUE, EnsemblID2Entrez (see below) will be deployed to re-download gene names from the NCBI Gene database, appending an additional CurrentEntrezGeneName column to the output data frame. The row names of the data frame are the corresponding Ensembl IDs. CorrelateOmics and CorrelateOmics.log2FoldChange pass the To input variable solely to UniProtKBAC2EnsemblID (To should be set as "WormBase" instead of the default "Ensembl" when working with C. elegans datasets).

Subsequently, plotCorrelateOmics and plotCorrelateOmics.log2FoldChange can plot the output data frame of CorrelateOmics and CorrelateOmics.log2FoldChange, respectively. To highlight certain genes, specify them by their NCBI gene names using the HighlightGeneNameRegex.

EnsemblID2Entrez

EnsemblID2Entrez(EnsemblID, Output = c("Accession", "ID", "Description", "Name"))

Converts a single Ensembl ID to its corresponding NCBI Entrez accession(s)/ID(s)/description(s)/name(s) using the rentrez package. If the mapping exists, the output will be a string; otherwise, the output will be "". This works better than using biomaRt because the mapping is more complete. And unlike using org.*.eg.db, this works for all species.

Note: the default genome assembly of Nothobranchius furzeri on Ensembl is Nfu_20140520 while OrthoDB and the NCBI have opted for the newer UI_Nfuz_MZM_1.0 and NfurGRZ-RIMD1 genomes as their defaults, respectively (the UI_Nfuz_MZM_1.0 assembly has less unknown base pairs and more annotated genes owing to the long-read sequencing method, but the Nfu_20140520 assembly has a slightly higher BUSCO score; both NfurGRZ-RIMD1 and Nfu_20140520 are based on the commonly used GRZ-AD strain). This may cause differences in annotations. The UniProt reference proteome UP000694548 is also based on Nfu_20140520, making it convenient to correlate omics.

EnsemblIDFilter

EnsemblIDFilter(ExcelDataFilePath, BioMartExportFilePaths = NA, PassedEnsemblIDVector = NA, ExcelDataFileEnsemblIDColumnName = "ensembl_gene_id", BioMartExportEnsemblIDColumnName, ReAdjustPValues = TRUE, PValueColumnName = "pvalue", AdjustedPValueColumnName = "padj")

Filters a Flaski RNAseq pipeline output Excel sheet (ExcelDataFilePath) based on an vector of desired Ensembl IDs

  • in PassedEnsemblIDVector or

  • compiled from exported Ensembl BioMart TSV files whose paths are specified in BioMartExportFilePaths, if BioMartExportFilePaths is not NA (note: this overrides the input variable PassedEnsemblIDVector).

FindUniqueGenes.EnsemblID

FindUniqueGenes.EnsemblID(TargetSpecies, CheckHomologySpecies = c("drerio", "kmarmoratus"))

Identifies genes of the TargetSpecies (returns a vector of their Ensembl IDs) without a homolog in CheckHomologySpecies.

GOFilter

GOFilter(ExcelDataFilePath, GOVector, godir, GOTermColumnName = "GO_id", ReAdjustPValues = TRUE, PValueColumnName = "pvalue", AdjustedPValueColumnName = "padj")

Filters a Flaski RNAseq pipeline output Excel sheet (ExcelDataFilePath) based on the desired GO terms in GOVector [including all child terms (is_a, regulates, etc.) defined by godir].

Note: Ensembl BioMart provides a built-in functionality to filter genes by GO term annotations (see the figure below; all child terms will also be included), which is better because a fresh download from Ensembl BioMart will reflect the most up-to-date GO term annotations. See BioMartGOFilter.Nfurzeri and EnsemblIDFilter.

plotlog2ReadDistribution

plotlog2ReadDistribution(ExcelDataFilePath, DataColumns)

Plots the smoothed empirical distribution function of all normalized reads (each gene in each sample; compiled from columns whose IDs/names are in DataColumns) to help determine a threshold to filter genes with valid expression and a meaningful fold-change. This step is helpful when picking genes for further functional studies but dispensable if only bioinformatic analyses (like a gene set enrichment analysis) are to be done.

samples.beeswarm

samples.beeswarm(GeneNameRegex, ExcelDataFilePath, GeneNameColumnName = "gene_name", ColumnOffset = 2, Group1RepNum, Group2RepNum, GroupTags, Colours = c("black", "red"), Standardized = 1, Breaks = 10, PointSize = 0.75, LineWidth = 0.5, AsteriskSignificance = TRUE, PValueColumnName = "pvalue", PValueDigit = 2)

Plots a beeswarm plot of sample reads of gene(s) whose name(s) match the GeneNameRegex. If Standardized == 1 (or Standardized == 2), the sample reads of each gene will be standardized by the mean of sample reads of group 1 (or 2) of each gene; otherwise, no standardization will be performed. The plot is automatically saved as a time-tagged .png file in the working directory.

SRX2SRR

SRX2SRR(SRXSheetFilePath, SRXColumnName = "SRX")

Converts a column (from an Excel sheet SRXSheetFilePath) of experiment numbers into corresponding run numbers (printed directly onto the console alongside the sequencing format employed). Note that this critical sequencing format info is not available in SRA_Accessions.tab (which allows batch searching using the corresponding SRP/PRJNA accession number directly) or SRA_Run_Members.tab (which allows batch searching using the corresponding SRP accession number) on the FTP site.

Recommended alternative: the SRA Run Selector tool provided by the NCBI (→ tutorial). We can get an overview of all the project's associated datasets by searching for the corresponding SRP/PRJNA/GSE accession number (see the figure attached below as an example for an overview of all datasets in Hussein et al., Developmental Cell, 2020).

UniProtKBAC2EnsemblID

UniProtKBAC2EnsemblID(UniProtKBAC.CSV, Wait = 5, To = "Ensembl")

UniProtKBAC2EnsemblID utilizes the UniProt REST API (which is more complete than biomaRt) to convert UniProtKB accession IDs in UniProtKBAC.CSV into the Ensembl IDs of corresponding genes (note: since Ensembl inherits the WormBase IDs for C. elegans genes, To should be set as "WormBase" instead of the default "Ensembl" when converting C. elegans genes). Once a job is submitted, the status will be inquired every Wait seconds until the job is finished. The downloaded output is then parsed into a matrix with two columns named uniprotsptrembl and ensembl_gene_id (consistent with BioMart). Each row corresponds to a mapping. If a protein/peptide is mapped to more than one Ensembl gene ID, multiple rows will share the same UniProtKB accession ID in column 1 but possess different Ensembl IDs in column 2.

volcano.ma

volcano.ma(Data, PlotType = "ma", HighlightIDs = NA, GeneNameColumnName = "gene_name", IDColumnName = "ensembl_gene_id", log2FoldChangeColumnName = "log2FoldChange", Invertlog2FoldChange = FALSE, abslog2FoldChangeThreshold = 1, abslog2FoldChangeLimit, baseMeanColumnName = "baseMean", log2baseMeanLowerLimit, log2baseMeanUpperLimit, AdjustedPValueColumnName = "padj", SignificanceThreshold = 0.01, negativelog10AdjustedPValueLimit, LineWidth = 0.25, Alpha = 1, NSAlpha = 0.1, UpColor = "#FFD300", DownColor = "#0087BD", HighlightColor = "#C40233", HighlightSize = 2.5, log2FoldChangeLabel, log2FoldChangeTickDistance = 1, log10AdjustedPValueTickDistance = 5)

Plots a volcano plot (PlotType = "volcano") or an MA plot (PlotType = "ma") and highlight genes with an ID in HighlightIDs. Points beyond limits (defined by ±abslog2FoldChangeLimit, log2baseMeanLowerLimit, log2baseMeanUpperLimit, and negativelog10AdjustedPValueLimit; ignored if NA) will be coerced onto the border.

Note while using plotly::ggplotly to view and interact with the graph: the axis titles should be adjusted to avoid an error. For the volcano plot, use

Plot <- Plot + xlab("log2(fold change)") + ylab("-log10(adjusted p)")
ggplotly(Plot)

For the MA plot, use

Plot <- Plot + xlab("log2(base mean)") + ylab("log2(fold change)")
ggplotly(Plot)