Skip to content

6. Importing and manipulating bins

kbseah edited this page Jun 28, 2018 · 5 revisions

With gbtools you can define bins interactively from plots, define custom bins from lists of contigs, or import the results of 3rd-party binning tools.

Basic operations have already been discussed in part 4. In this part, we describe some more advanced features.

6a. Define bins from lists of contigs

Let's say you've manually chosen a list of contigs to comprise a new genome bin. How do you make a gbtbin object from this list?

The contig names should be in a character vector in R. For this example, the list of contigs is bin1.contigNames, we are working with a gbt object called d, and we want to create a new gbtbin object called d.bin1:

d.bin1 <- gbtbin(shortlist=bin1.contigNames,x=d,slice=NA)

If you have the list in an external file (one contig name per line) that you wish to import into R, you can use the base function scan() to first import the data into R:

bin1.contigNames <- scan(file="/path/to/contig_list",what=character())

6b. Import multiple bins at the same time from Fasta files

Many third-party binning tools will produce, as their final result, a set of Fasta files, each representing a separate bin.

You could write your own scripts to parse the Fasta headers, and then import them manually one-by-one into R using scan() and gbtbin() functions, but this rapidly becomes tedious.

Fasta files representing genome bins can be parsed and imported into R with an accessory Perl script and the importBins() function in gbtools.

In the accessory_scripts/ folder is the Perl script parse_bin_fasta_files.pl that will parse the Fasta files. You will first need to make a new file that lists each of the Fasta file paths and the name that you want to assign to that bin. This bin name will be used as the object name in R, so don't use names starting with numbers.

For example, you have a set of files bin1.fasta, bin2.fasta etc. and you want to call them bin_alpha, bin_beta, then create a tab-separated file with the filenames in the first column and the bin names in the second column, which would look something like:

bin1.fasta  bin_alpha
bin2.fasta  bin_beta
bin3.fasta  bin_gamma
...         ...

From this input table (in this example, this file is called input_table), generate a table of individual contig names and their corresponding bins, for import with gbtools. Suppose you want this output table to be called contig_bin_table:

perl parse_bin_fasta_files.pl -i input_table -o contig_bin_table
perl parse_bin_fasta_files.pl -i input_table > contig_bin_table # also works

Start R, load gbtools, and define your gbt object with the parent metagenome (if not already defined, in this example, called d). Then you can import all these multiple bins defined in that contig_bin_table simultaneously to separate gbtbin objects in your workspace:

importBins(x=d,file="/path/to/contig_bin_table",to.list=FALSE)
ls() # see if the new bin objects are created

If everything went well, you should see with the list function ls() the new objects bin_alpha, bin_beta, etc. that have been created. Typing the names or using summary(bin_alpha) should give a summary of each bin.

Of course, if you have more than just a handful of bins, you might want to keep your workspace tidy and instead create a single new object -- a list of gbtbin objects:

d.bins1 <- importBins (x=d, file="/path/to/contig_bin_table",to.list=TRUE) # to.list is TRUE by default
ls()

To see summaries of your individual bins, either type d.bins1[[1]] (for the first bin), or with the name of the bin: d.bins1$bin_alpha

6c. Plotting multiple bins in a single overlay

With the points() function you can overlay single bins onto a GC-coverage or differential coverage plot. However, if you have multiple bins and want to plot them simultaneously to compare them, it can be time-consuming to do each manually.

In gbtools is the function multiBinPlot() that allows you to do just this.

The function takes the usual plotting parameters: slice=, cutoff=, log=, xlab=, xlim= etc. (see section 4). However, the option of coloring plot by taxonomic markers and GC% value is disabled, because that would interfere with the overlay colors used for the bins.

Suppose your bin objects are called bin_alpha, bin_beta, etc., and the gbt object is called d, then:

multiBinPlot(d,bins=list(bin_alpha,bin_beta,bin_gamma),slice=1) # GC-coverage plot
multiBinPlot(d,bins=list(bin_alpha,bin_beta,bin_gamma),slice=c(1,2)) # Differential coverage plot

Note that the bins= parameter takes a list rather than c() argument. (If you have a vector containing the names of gbtbin objects, you can use the base function mget to convert a list of object names to a list of objects.)

If you imported the bins to a single list object already, then just supply the name of the list (see above):

multiBinPlot(d,bins=d.bins1,binNames=names(d.bins1),slice=1)

You can add a legend that shows which bin is which; you can either specify your own bin names, or if this is not given, then it automatically numbers them sequentially as "bin 1", "bin 2" etc. in the legend (in the order they were supplied in the list):

multiBinPlot(d,bins=list(bin_alpha,bin_beta,bin_gamma), slice=1, legend=TRUE) # Automatic sequential numbering in legend
multiBinPlot(d,bins=list(bin_alpha,bin_beta,bin_gamma), slice=1, legend=TRUE, binNames=c("alpha","beta","gamma")) # Supply your own names

The bins are colored by automatically-generated rainbow colors using the rainbow() function in R. You can specify your own custom colors if you wish, but make sure the number of colors in the vector matches the number of bins you want to plot, otherwise it reverts to the default.

multiBinPlot(d,bins=list(bin_alpha,bin_beta,bin_gamma),slice=1,cols=rainbow(3)) # Identical behavior as the default
multiBinPlot(d,bins=list(bin_alpha,bin_beta,bin_gamma),slice=1,cols=heat.colors(3)) # Use a different built-in R palette
multiBinPlot(d,bins=list(bin_alpha,bin_beta,bin_gamma),slice=1,cols=c("#00A600FF","#ECB176FF","#F2F2F2FF")) # Specify colors manually in hex RBGA format

6d. Comparing two sets of bins

You might want to compare the results from two different binning tools or pipelines on the same metagenome. You have created a gbt object d, and using the importBins() function above, you have created two sets of gbtbin objects corresponding to the pipelines that you want to compare, called d.bins1 and d.bins2.

To create a table of the fraction of overlap between these two bins:

tabOverlapBins(d.bins1,d.bins2,binNames.x=names(d.bins1),binNames.y=names(d.bins2))

This returns a matrix with the row names corresponding to d.bins1 and the columns to d.bins2. Each value in the table is the fraction of overlap.

By default, the overlap is calculated in terms of number of bases. If you want to calculate overlap in number of contigs, set weight=FALSE.

For each (x,y) entry in the table, the fraction is taken as the number of overlapping bases or contigs divided by the total number in the bin for that row (x). For example, if d.bins1 comprises the following bins: bin_alpha, bin_beta, bin_gamma, and d.bins2 comprises bin_one, bin_two, bin_three, and d.bins1 are in the rows of the table (as in the example command above), then the fraction of overlap for bin_alpha and bin_one is simply:

(no. bases in bin_alpha AND bin_one) / (total bases in bin_alpha)

To flip this around, specify by="y" instead. To report the raw number of overlapping bases or contigs, specify by="raw"

6e. Merging two sets of bins

After comparing two sets of bins you might want to merge the bins. This can be done manually on individual pairs of bins with the add() command, but suppose you want to combine two sets together en masse with a certain overlap threshold.

d.mergedBins <- mergeOverlapBins(d.bins1, d.bins2, binNames.x=names(d.bins1), binNames.y=names(d.bins2), by="y", mergeto="x", threshold="0.8", out="mergedBin")

Notice the new parameter mergeto=. Suppose you find that bin set d.bins2 is more fragmented, whereas d.bins1 is more inclusive. You therefore want to merge those bins in d.bins2 that have at least 80% of their bases in a corresponding bin in d.bins1 into those bins in d.bins1. Therefore, by="y" because the overlap fraction values are calculated with respect to d.bins2, but mergeto="x" because the merger is with respect to d.bins1. Generally speaking, these two valeus should always be set to the opposite of each other.

This returns a list of gbtbin objects containing the merged bins, and outputs to the console the details of which bins were merged to create those new bins.

Next step

Advanced- Adding custom variables for plotting or analysis