This repository has been archived by the owner on Aug 10, 2022. It is now read-only.

Using arlecore_Fst_batch.pl script #1

Open
cklinge2 opened this issue Aug 8, 2014 · 11 comments

@cklinge2

cklinge2 commented Aug 8, 2014

I am attempting to use the arlecore_Fst_batch.pl script. I am, however, getting stuck trying to figure out how to make an appropriate count file in the Mothur format. If using ITEP to find 'core' genes and make alignments for input into arlecore, what do the rows and the "total" column represent in the Mothur count file?

Thanks in advance!

@nick-youngblut
Owner

An example of the count file format can be found at: http://www.mothur.org/wiki/Count_File.
Each column except for 'total' will be treated as a separate population by arlecore. The 'total' column is simply the sum of each taxon's abundances across all populations. Each row is a different taxon (genome), so the first value in each row should be a taxon name (the FIG ID if using ITEP). The '-delimiter' flag in arlecore_Fst_batch.pl can be used to remove any additional info in gene names (e.g., PEG ID or annotations).
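
As a rough illustration of that layout (not part of arlecore_Fst_batch.pl), here is a minimal Python sketch that writes a count table from a taxon-to-population mapping. The function name `build_count_table` and the example taxa are hypothetical, and it assumes each genome belongs to exactly one population with an abundance of 1.

```python
def build_count_table(taxon_to_pop):
    """Return mothur-style count table text: one row per taxon, one column per population."""
    pops = sorted(set(taxon_to_pop.values()))
    # Header: sequence/taxon name, total, then one column per population
    lines = ["\t".join(["Representative_Sequence", "total"] + pops)]
    for taxon, pop in taxon_to_pop.items():
        # Assumption: a genome counts once, in its own population only
        counts = [1 if p == pop else 0 for p in pops]
        lines.append("\t".join([taxon, str(sum(counts))] + [str(c) for c in counts]))
    return "\n".join(lines) + "\n"


# Hypothetical example: two genomes in Pop1, one in Pop2
example = {"fig|395491.1": "Pop1", "fig|395491.2": "Pop1", "fig|395491.3": "Pop2"}
print(build_count_table(example), end="")
```

The 'total' value is computed from the population columns, so the two always agree.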

@cklinge2
Author

Great, thanks for the information.

To clarify with a simplified example, let’s say I’m interested in two microbial populations that have 2 ‘core’ genes (with the first 2 strains from Pop1 and the remaining 3 strains from Pop2).
Would the count file look something like this, with identical columns between the 2 gene clusters?

arlecore_Fst_batch.pl -count mothur_countfile.txt -min 2 1 nt_alignment1_pal2nal.fasta nt_alignment2_pal2nal.fasta > Fst_res.txt

Representative_Sequence total Pop1 Pop2
fig|395491.1.peg.4227 1 1 0
fig|395491.2.peg.6069 1 1 0
fig|395491.3.peg.5806 1 0 1
fig|395491.4.peg.260 1 0 1
fig|395491.5.peg.3162 1 0 1

fig|395491.1.peg.6517 1 1 0
fig|395491.2.peg.2784 1 1 0
fig|395491.3.peg.6613 1 0 1
fig|395491.4.peg.895 1 0 1
fig|395491.5.peg.5988 1 0 1

@nick-youngblut
Owner

You almost have the count file correct. The 'total' and 'Pop[12]' columns look good; however, the 'Representative_Sequence' values should be the taxon names (e.g., 'fig|395491.1' or 'Methanosarcina_mazei_Go1'). Including the PEG IDs makes the count file specific to one gene cluster, but the point of the script is to run it on many gene clusters. This is where the '-delimiter' flag comes in: to remove the PEG IDs from the gene cluster fasta sequence names, you can use '-delimiter ".peg." '. This will strip the PEG ID off the sequence names so that they can be mapped to the taxon names in the count file.
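
To make the mapping concrete, here is a small Python sketch of the idea. It is only an illustration and not the script's actual code; the function name `taxon_from_seq_name` is hypothetical, and it assumes everything from the first occurrence of the delimiter onward is dropped.

```python
def taxon_from_seq_name(seq_name, delimiter=".peg."):
    """Return the part of a sequence name that precedes the delimiter."""
    # Names without the delimiter are returned unchanged
    return seq_name.split(delimiter, 1)[0]


# Two genes from different clusters map to the same taxon name
for name in ["fig|395491.1.peg.4227", "fig|395491.1.peg.6517"]:
    print(name, "->", taxon_from_seq_name(name))
```

With names reduced this way, one count file keyed on taxon names can serve every gene cluster.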

@cklinge2
Author

Ah ok, makes perfect sense!

After adjusting my files and running the script as follows, I'm getting an error (maybe something wrong with delimiter flag?):

arlecore_Fst_batch.pl -count mothur_countfile_1.txt -min 2 1 -delimiter ".peg." 1387_nt_alignment_pal2nal.fasta 3357_nt_alignment_pal2nal.fasta > Fst_res_test.txt

/home/gtl-shared-2/ckling2-big/ITEP/clusterDbAnalysis/1387_nt_alignment_pal2nal.fasta Did_not_pass_-min
Mothur error: '
Removing group: Cstrains because all sequences have been removed.

Removing group: Nstrains because all sequences have been removed.
[ERROR]: fig|395491.21.peg.1816__1 is not in your count table. Please correct.

mothur > quit()
' at /usr/local/bin/arlecore_Fst_batch.pl line 387.

Here are truncated versions of my count file and first fasta alignment:

Representative_Sequence total Cstrains Nstrains
fig|395491.21 1 0 1
fig|395491.17 1 1 0

fig|395491.21.peg.1816
GTGCAGCAGAACATCGCCCATCTGCCGGCCGCCGACCGCGAGGCGATCGCAGCCTATCTG
AAGGCGGTGCCGGGCCAT---------------
fig|395491.17.peg.10165
GTGCAGCAGAACATCGCCCATCTGCCGACCGCCGACCGCGAGGCGATCGCCGCCTATCTG
AAGGCCGTGCCGGGACGC---------------

@nick-youngblut
Owner

Are you missing the '>' from the sequence names in your fasta files? That would explain the error, since the script wouldn't be able to find any sequences in the fasta files.
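
One way to check this independently of the script is a quick sanity check along these lines. This is a hypothetical Python sketch, not something shipped with arlecore_Fst_batch.pl; the function names are made up, and it assumes sequence names are reduced to taxon names by cutting at the delimiter.

```python
def fasta_taxa(fasta_text, delimiter=".peg."):
    """Collect taxon names from '>' header lines, cut at the delimiter."""
    taxa = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            if name:
                taxa.append(name.split(delimiter, 1)[0])
    return taxa


def count_table_taxa(count_text):
    """Return the first column of a count table, skipping the header row."""
    rows = [line.split() for line in count_text.splitlines() if line.strip()]
    return {row[0] for row in rows[1:]}


def missing_taxa(fasta_text, count_text, delimiter=".peg."):
    """List fasta taxa that have no row in the count table."""
    known = count_table_taxa(count_text)
    return [t for t in fasta_taxa(fasta_text, delimiter) if t not in known]
```

An empty list from `fasta_taxa` would suggest the '>' characters are missing, while a non-empty list from `missing_taxa` points to names that do not match the count file.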

@cklinge2
Author

Nope, the '>' symbols are there; they just got removed when I posted them on GitHub. (This may or may not be useful, but the script seems to work fine for a single gene cluster with PEG IDs included in the count file.)

@nick-youngblut
Owner

OK. The bug should be fixed. Keep in mind that the script was tested with ARLECORE v 3.5.1.3 (17.09.11), and other versions may not work with the default *ars and *arp files produced by the script.

@cklinge2
Author

Everything seems to be working now, thanks for all the assistance!

@cklinge2
Author

Hi Nick,

The script's help text says that a custom .ars file can be provided with the -ars flag. I have created my own .ars file in the Windows version of Arlequin, but even with different settings the script only prints out pairwise Fst estimates. Is there a straightforward way to perform additional arlecore tests (aside from Fst, e.g., Tajima's D) on multiple gene alignments using the arlecore_Fst_batch.pl script?

@cklinge2 cklinge2 reopened this Oct 27, 2014
@nick-youngblut
Owner

For my requirements, I only needed the script to parse out the Fst values (and P values) from the arlecore output. Making the parser cover everything in the arlecore output would be a lot more involved. However, if you need just one item such as Tajima's D, I could probably code that without too much trouble.

@cklinge2
Author

cklinge2 commented Nov 6, 2014

It would be great to get estimates of Tajima's D (and P values), so if it's not a big project, I would definitely get some good use out of that code.
Many thanks!
