yamada-lab/domsign
# /////////////////////////////////////////////////////////
# What is this?
# This script package is a machine learning tool called "DomSign" (for the
# algorithm, see ).
# To run this tool on your own machine, we assume that you are in a standard
# Linux environment with python 2.8+ installed.
# The basic function of this tool is to hierarchically predict the EC number
# of enzymes through a machine learning approach based on Pfam-A domain
# signatures.
# To allow for extension, the script is designed to generalize to other
# machine learning input labels, so other protein signatures can also be
# used. However, only a training dataset based on Pfam-A domain signatures is
# provided in this package. In all other cases, users need to prepare their
# own training dataset according to the data format requirements shown
# below. Thus, in principle, if the input labels come from one unified system
# (in our work, Pfam-A domain signatures) in both the query and the reference
# (training) dataset, they can be used to predict EC numbers with the scripts
# in this package.
# Hence, we define a unified protein signature system consisting of k
# signatures (Sign-i, i=1~k). This protein signature system is used for
# prediction in machine learning. In our work, we use Pfam-A domain
# signatures as the specific signature system. This system extracts the
# Pfam-A (version 26.0) domain signature of every protein, which means the
# Pfam-A domain architecture without considering domain order or recurrence.
# /////////////////////////////////////////////////////////
# This machine learning model consists of two steps.
# ////////////////////////////////////////
# The first step is to differentiate between non-enzymes and enzymes. This
# step requires one file called "specific enzyme signature". This dataset
# should be organized in a format like this:
#     Sign-iSign-jSign-l\n
#     Sign-iSign-l\n
#     ...
# All of the signatures above are defined by the user to represent enzymes
# rather than non-enzymes. Each signature should be sorted according to the
# default python list sort (i.e. the alphabetical order of the characters).
# For the construction method of this dataset with Pfam-A domain signatures,
# please check .
# /////////////////////////////////////////
# The second step is a machine learning approach that predicts the EC number
# hierarchically. This step requires one file called "training dataset".
# This dataset should be organized in a format like this:
#     proteinID\tEC1,EC2\tSign-i,Sign-j,Sign-l\n
#     proteinID\tEC1\tSign-i\n
#     proteinID\tEC2\tSign-i,Sign-j\n
#     ...
# Here, each EC number should be given in the complete four-digit format
# (EC=x.x.x.x), in the incomplete format (EC=x.x.-.-) or as "Non-enzyme". The
# protein signature system should be the same as that used in the first step.
# /////////////////////////////////////////////////////////////
# This package has two main functions: prediction and cross-validation.
# ///////////////////////////////////////////
# For prediction
# You should provide three files: the query, the "specific enzyme signature"
# described in the first step and the "training dataset" described in the
# second step.
# All three files should use one unified protein signature system.
# For the Pfam-A protein signature system, we provide one "training dataset"
# called " " under the directory working_direcotry/reference/ of this
# package.
# Likewise, we also provide one "specific enzyme domain signature" called "
# " under the directory working_direcotry/reference/ of this package.
# The query data should be organized in this format:
#     proteinID\tSign-i,Sign-j,Sign-l\n
#     proteinID\tSign-j\n
#     proteinID\tSign-l\n
#     ...
# For this function, type in a command like:
# dir/DomSing.tool/DomSign.prediction.sh -i dir/query -r dir/"traning_dataset"
# -e "specific_enzyme_signature" -s (specificity threshold, a number between
# 0.5 and 1.0, default 0.8; for details, see ) -o dir/output_file
# The output file will also be located in the query directory, with the data
# format:
#     proteinID\tEC1\n
#     proteinID\tEC2\n
#     ...
# Here too, every EC shown above is in the complete four-digit (EC=x.x.x.x),
# incomplete (EC=x.x.-.-) or "Non-enzyme" format.
# ////////////////////////////////////////////
# For cross-validation
# Perhaps you are interested in developing another kind of protein signature
# system to predict enzyme function, or you simply want to reproduce some of
# the results in our paper. In both cases, you should run cross-validation on
# a reliable dataset.
# This function is a little different from prediction. First of all, you
# need to prepare two basic files and, optionally, two additional files.
# The two basic files are the "training dataset" and the "specific enzyme
# signature", which have the same format as in the prediction module.
# If you want to test in a so-called "homolog unavailable" scenario, which
# means removing some homologs of the query from the reference in every fold
# of cross-validation to simulate the situation where simple blast does not
# work well, you need to prepare the two additional files: the "training
# dataset blast all against all" file (-outfmt 6) and the "training
# dataset.fasta" file. The former should be in the -outfmt 6 format of the
# standard blast or blast+ package. The latter fasta file should be in either
# nucleotide or amino acid format, matching the format used in the "training
# dataset blast all against all" file.
# For this function, type in a command like:
# dir/DomSing.tool/DomSign.crossVal.sh -r dir/"traning_dataset" -e
# "specific_enzyme_signature" -s (specificity threshold, a number between
# 0.5 and 1.0, default 0.8; for details, see ) -o output_file_name -f
# (fold of cross-validation) -m (number of folds conducted; for example, if
# you set 1000-fold cross-validation, you can choose here to conduct only
# 100 of them)
# Then the program will ask whether to test in the "homolog unavailable"
# scenario. If so, you need to type in the absolute paths of the two
# additional files from the keyboard.
# Subsequently, you will be asked to type in the thresholds (query coverage
# and identity) for blast-based reference purification.
# Finally, one txt file will be written to the query directory, providing
# all the performance information for this cross-validation test. The
# evaluation is based on a statistical hierarchical metric system designed in
# . This system is designed to provide high-resolution result evaluation for
# hierarchical labels such as EC numbers.
# /////////////////////////////////////////////
# For any content question, please contact
# For any technique question, please contact [email protected]
# This tool is developed by Kurokawa&Nakashima&Yamada lab in Tokyo Institute
# of Technology and released as additional file in publication ""
# All rights reserved
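# /////////////////////////////////////////////
# Example: preparing input files
# Users who build their own "training dataset" or query file may find it
# useful to check the tab/comma layout described above before running the
# tool. The following is only a minimal sketch under stated assumptions, not
# part of the DomSign scripts: the function names (make_signature,
# is_valid_label, parse_training_line, parse_query_line) are hypothetical,
# and the accessions in the usage note are placeholders.

```python
# Hypothetical helpers for checking DomSign-style input lines.
# None of these names come from the DomSign package itself.


def make_signature(domains):
    """Turn a domain architecture into a signature: ignore domain order and
    recurrence, then apply the default list sort."""
    return sorted(set(domains))


def is_valid_label(label):
    """Accept "Non-enzyme", complete (x.x.x.x) or incomplete (x.x.-.-) ECs."""
    if label == "Non-enzyme":
        return True
    parts = label.split(".")
    if len(parts) != 4 or not parts[0].isdigit():
        return False
    seen_dash = False
    for part in parts[1:]:
        if part == "-":
            seen_dash = True
        elif seen_dash or not part.isdigit():
            # A digit after a dash, or a non-numeric field, is rejected.
            return False
    return True


def parse_training_line(line):
    """Parse one line: proteinID<TAB>EC1,EC2<TAB>Sign-i,Sign-j."""
    protein_id, ecs, signs = line.rstrip("\n").split("\t")
    labels = ecs.split(",")
    for label in labels:
        if not is_valid_label(label):
            raise ValueError("unexpected EC label: %r" % label)
    return protein_id, labels, make_signature(signs.split(","))


def parse_query_line(line):
    """Parse one line: proteinID<TAB>Sign-i,Sign-j."""
    protein_id, signs = line.rstrip("\n").split("\t")
    return protein_id, make_signature(signs.split(","))
```

# For example, parse_query_line("P1\tPF00002,PF00001,PF00002\n") yields the
# protein ID together with a signature in which the repeated placeholder
# domain appears once and the entries are sorted.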
About
DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe