# /////////////////////////////////////////////////////////
# What is this?
# This script package is a machine learning tool named "DomSign" (for the
# algorithm, see ).
# To run this tool on your own machine, we assume that you are working in a
# standard Linux environment with Python 2.8+ installed.
# The basic function of this tool is to predict the EC numbers of enzymes
# hierarchically, using a machine learning approach based on Pfam-A domain
# signatures.
# With extension in mind, the scripts are designed to generalize to other
# machine learning input labels, so other protein signatures can also be
# used. However, only a training dataset based on Pfam-A domain signatures is
# provided in this package. In all other cases, users need to prepare their
# own training dataset according to the data format requirements shown
# below. In principle, as long as the input labels come from one unified
# system (in our work, Pfam-A domain signatures) in both the query and the
# reference (training) dataset, the scripts in this package can use them to
# predict EC numbers.
# Hence, we define a unified protein signature system consisting of k
# signatures (Sign-i, i=1~k). This protein signature system is used for
# prediction in machine learning. In our work, we use Pfam-A domain
# signatures as the specific signature system. This system extracts the
# Pfam-A (version 26.0) domain signature of every protein, that is, its
# Pfam-A domain architecture without regard to domain order or recurrence.
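# As an illustration of "domain architecture without regard to order or
# recurrence", the following minimal sketch collapses an ordered list of
# domain hits into a signature. The function name and the example
# identifiers are hypothetical; this is not taken from the DomSign scripts.

```python
def domain_signature(domains):
    """Collapse an ordered list of domain hits into a signature.

    Repeated domains and their order are discarded, and the result is
    sorted with Python's default string ordering.
    """
    return sorted(set(domains))


# Hypothetical architecture: one domain occurs twice, order is arbitrary.
architecture = ["PF08240", "PF00107", "PF00107"]
print(domain_signature(architecture))  # ['PF00107', 'PF08240']
```

# Two proteins with the same domains in a different order, or with a
# different number of repeats, end up with the same signature.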
# /////////////////////////////////////////////////////////
# This machine learning model consists of two steps.
# ////////////////////////////////////////
# The first step is to differentiate between non-enzymes and enzymes. This
# step requires one file, called the "specific enzyme signature" file. This
# dataset should be organized in a format like this:
# Sign-iSign-jSign-l\n
# Sign-iSign-l\n
# ...
# All of the signatures above are defined by the user to represent enzymes
# rather than non-enzymes. The signatures on each line should be sorted with
# the default Python list sort (that is, in the alphabetical order of their
# characters). For the method used to construct this dataset from Pfam-A
# domain signatures, please see .
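# The sorting requirement can be checked with a few lines of Python. This is
# a minimal sketch with a hypothetical helper name and hypothetical
# signature names. Note that Python's default sort compares strings
# character by character, so uppercase letters sort before lowercase ones.

```python
def is_default_sorted(signatures):
    """Return True if the signatures are already in default sort order."""
    return list(signatures) == sorted(signatures)


# "Z" sorts before "a" under Python's default string comparison.
print(sorted(["adh_short", "Zn_clus"]))              # ['Zn_clus', 'adh_short']
print(is_default_sorted(["adh_short", "Zn_clus"]))   # False
```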
# /////////////////////////////////////////
# The second step is a machine learning approach that predicts the EC number
# hierarchically. This step requires one file, called the "training dataset".
# This dataset should be organized in a format like this:
# proteinID\tEC1,EC2\tSign-i,Sign-j,Sign-l\n
# proteinID\tEC1\tSign-i\n
# proteinID\tEC2\tSign-i,Sign-j\n
# ...
# Here, each EC number should be given either as a complete four-digit number
# (EC=x.x.x.x), as an incomplete number (EC=x.x.-.-), or as "Non-enzyme". The
# protein signature system should be the same as the one used in the first
# step.
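# A training dataset can be checked against this format before use. The
# sketch below is an assumption-laden illustration, not part of the package:
# the function names are hypothetical, and it assumes that every specified
# EC field is a plain integer and that "-" may only appear in trailing
# positions.

```python
def is_valid_ec(ec):
    """Accept "Non-enzyme", x.x.x.x, or a number with trailing "-" fields."""
    if ec == "Non-enzyme":
        return True
    fields = ec.split(".")
    if len(fields) != 4 or not fields[0].isdigit():
        return False
    seen_dash = False
    for field in fields[1:]:
        if field == "-":
            seen_dash = True
        elif not field.isdigit() or seen_dash:
            # Reject non-numeric fields and digits that follow a "-".
            return False
    return True


def parse_training_line(line):
    """Split one line into (proteinID, [EC, ...], [signature, ...])."""
    protein_id, ec_field, sign_field = line.rstrip("\n").split("\t")
    ecs = ec_field.split(",")
    bad = [ec for ec in ecs if not is_valid_ec(ec)]
    if bad:
        raise ValueError("unexpected EC label(s): %s" % ", ".join(bad))
    return protein_id, ecs, sign_field.split(",")


print(parse_training_line("P1\t1.1.1.1,2.7.-.-\tSign-i,Sign-j\n"))
```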
# /////////////////////////////////////////////////////////////
# This package has two main functions: prediction and cross-validation.
# ///////////////////////////////////////////
# For prediction
# You should provide three files: the query, the "specific enzyme signature"
# file described in the first step, and the "training dataset" described in
# the second step.
# All three files should use one unified protein signature system.
# For the Pfam-A protein signature system, we provide one "training dataset"
# called " " under the directory working_directory/reference/ of this
# package.
# Likewise, we also provide one "specific enzyme domain signature" file
# called " " under the directory working_directory/reference/ of this
# package.
# The query data should be organized in this format:
# proteinID\tSign-i,Sign-j,Sign-l\n
# proteinID\tSign-j\n
# proteinID\tSign-l\n
# ...
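# Before running a prediction, it can help to check that a query file
# follows this format. The following is a minimal sketch with a hypothetical
# function name; it only checks the tab and comma structure and makes no
# claim about what the DomSign scripts themselves accept or reject.

```python
def check_query_lines(lines):
    """Yield (line_number, problem) for lines that do not fit the format."""
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            yield number, "expected 2 tab-separated fields, found %d" % len(fields)
            continue
        protein_id, sign_field = fields
        if not protein_id or "" in sign_field.split(","):
            yield number, "empty protein ID or signature"


example = ["P1\tSign-i,Sign-j\n", "P2 Sign-j\n", "P3\tSign-l,\n"]
for number, problem in check_query_lines(example):
    print("line %d: %s" % (number, problem))
```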
# To run this function, type a command like:
# dir/DomSing.tool/DomSign.prediction.sh -i dir/query -r dir/"training_dataset"
# -e "specific_enzyme_signature" -s (specificity threshold, a number between
# 0.5 and 1.0, default 0.8; for details, see ) -o dir/output_file
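# When the prediction script is called from another program, the argument
# list can be assembled and the threshold checked first. This is a sketch
# only: the helper name and all paths are hypothetical, the flags are those
# listed above, and treating the 0.5 and 1.0 bounds as inclusive is an
# assumption.

```python
def build_prediction_command(script, query, training, enzyme_signature,
                             output, specificity=0.8):
    """Return the argument list for a DomSign.prediction.sh call."""
    if not 0.5 <= specificity <= 1.0:
        raise ValueError("specificity threshold must be between 0.5 and 1.0")
    return [script,
            "-i", query,
            "-r", training,
            "-e", enzyme_signature,
            "-s", str(specificity),
            "-o", output]


# Hypothetical paths, for illustration only.
command = build_prediction_command(
    "dir/DomSing.tool/DomSign.prediction.sh",
    "dir/query", "dir/training_dataset", "specific_enzyme_signature",
    "dir/output_file")
print(" ".join(command))
```

# The resulting list can be passed to subprocess.call(command) from the
# Python standard library.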
# The output file will also be located in the query directory, in this format:
# proteinID\tEC1\n
# proteinID\tEC2\n
# ...
# Here too, every EC shown above is either a complete four-digit number
# (EC=x.x.x.x), an incomplete number (EC=x.x.-.-), or "Non-enzyme".
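# The output can be read back and summarized by how many EC digits each
# prediction specifies. This is a minimal sketch with hypothetical function
# names; collecting the ECs in a list per protein is an assumption that
# covers the case where a protein ID appears on more than one line.

```python
def ec_level(ec):
    """Number of specified EC digits: 0 for "Non-enzyme", 4 for x.x.x.x."""
    if ec == "Non-enzyme":
        return 0
    return sum(1 for field in ec.split(".") if field != "-")


def read_predictions(lines):
    """Map each protein ID to the list of EC labels predicted for it."""
    predictions = {}
    for line in lines:
        protein_id, ec = line.rstrip("\n").split("\t")
        predictions.setdefault(protein_id, []).append(ec)
    return predictions


output = ["P1\t1.1.1.1\n", "P2\t2.7.-.-\n", "P3\tNon-enzyme\n"]
for protein_id, ecs in sorted(read_predictions(output).items()):
    print(protein_id, [(ec, ec_level(ec)) for ec in ecs])
```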
# ////////////////////////////////////////////
# For cross-validation
# Perhaps you are interested in developing another kind of protein signature
# system to predict enzyme function, or you simply want to reproduce some of
# the results in our paper. In both cases, you should run cross-validation
# on a reliable dataset.
# This function is a little different from prediction. First of all, you
# need to prepare two basic files and, optionally, two additional files.
# The two basic files are the "training dataset" and the "specific enzyme
# signature" file, which have exactly the same format as in the prediction
# module.
# If you want to test in a so-called "homolog unavailable" scenario, in which
# some query homologs are removed from the reference in every fold of
# cross-validation to simulate the situation where a simple BLAST search does
# not work well, you need to prepare two additional files: the "training
# dataset blast all against all" file (-outfmt 6) and the "training
# dataset.fasta" file. The format of the former should be derived from the
# -outfmt 6 format of the standard blast or blast+ package. The latter should
# be a FASTA file in either nucleotide or amino acid format, matching the
# format used in the "training dataset blast all against all" file.
# To run this function, type a command like:
# dir/DomSing.tool/DomSign.crossVal.sh -r dir/"training_dataset" -e
# "specific_enzyme_signature" -s (specificity threshold, a number between
# 0.5 and 1.0, default 0.8; for details, see ) -o output_file_name -f
# (fold of cross-validation) -m (number of folds to conduct; for example, if
# you set 1000-fold cross-validation, you can choose here to conduct only
# 100 of them)
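# The relationship between -f and -m can be pictured as follows: the data
# are split into f folds, but only m of those folds are actually used as
# test sets. This is a generic k-fold sketch with hypothetical names; how
# DomSign.crossVal.sh assigns proteins to folds is not described here, so
# the random assignment below is an assumption.

```python
import random


def assign_folds(protein_ids, n_folds, seed=0):
    """Shuffle the IDs and deal them into n_folds groups."""
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def iter_cross_validation(protein_ids, n_folds, n_conducted, seed=0):
    """Yield (training IDs, test IDs) for the first n_conducted folds."""
    folds = assign_folds(protein_ids, n_folds, seed)
    for index in range(min(n_conducted, n_folds)):
        test = folds[index]
        train = [p for j, fold in enumerate(folds) if j != index for p in fold]
        yield train, test


ids = ["P%d" % i for i in range(10)]
for train, test in iter_cross_validation(ids, n_folds=5, n_conducted=2):
    print(len(train), len(test))
```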
# Then, the program will ask whether to test in the "homolog unavailable"
# scenario. If so, you need to enter the absolute paths of the two additional
# files from the keyboard.
# Subsequently, you will be asked to enter the thresholds (query coverage and
# identity) for the BLAST-based reference purification from the keyboard.
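# A filter on query coverage and identity over an all-against-all BLAST
# table can be sketched as below. This is an illustration under stated
# assumptions, not the package's own code: it assumes the default BLAST+
# -outfmt 6 column order (qseqid, sseqid, pident, length, mismatch, gapopen,
# qstart, qend, sstart, send, evalue, bitscore), computes coverage from a
# single alignment, takes sequence lengths as a dictionary built from the
# FASTA file, and uses hypothetical names and threshold units.

```python
def homologs_to_remove(blast_lines, query_lengths, min_identity, min_coverage):
    """Map each query ID to the subject IDs that pass both thresholds.

    min_identity is a percentage (0-100); min_coverage is a fraction (0-1).
    """
    removed = {}
    for line in blast_lines:
        cols = line.rstrip("\n").split("\t")
        query, subject = cols[0], cols[1]
        if query == subject:
            continue  # skip self-hits
        identity = float(cols[2])
        q_start, q_end = int(cols[6]), int(cols[7])
        coverage = (abs(q_end - q_start) + 1) / float(query_lengths[query])
        if identity >= min_identity and coverage >= min_coverage:
            removed.setdefault(query, set()).add(subject)
    return removed


hits = [
    "A\tB\t95.0\t90\t4\t0\t1\t90\t1\t90\t1e-50\t180\n",
    "A\tC\t40.0\t30\t18\t0\t1\t30\t5\t34\t1e-3\t40\n",
]
print(homologs_to_remove(hits, {"A": 100}, min_identity=60.0, min_coverage=0.8))
```

# In each fold, the subjects listed for a query would then be dropped from
# the reference before that query is predicted.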
# Finally, one txt file will be written to the query directory, providing
# all the performance information for this cross-validation test. The
# evaluation is based on a statistical hierarchical metric system designed in
# . This system is designed to provide high-resolution evaluation of results
# for hierarchical labels such as EC numbers.
# /////////////////////////////////////////////
# For any question about content, please contact
# For any technical question, please contact [email protected]
# This tool was developed by the Kurokawa & Nakashima & Yamada lab at the
# Tokyo Institute of Technology and released as an additional file in the
# publication ""
# All rights reserved