Dataset

Protein sequences from all plants were downloaded from the UniProt database (release 2013_02) (http://www.uniprot.org) by searching with [keywords: SUBCELLULAR LOCATION AND reviewed: yes]. Sequences whose annotations were marked as ‘PROBABLE’, ‘POSSIBLE’, or ‘BY SIMILARITY’ were discarded. This left 16494 protein sequences annotated to 14 different single- and multi-label subcellular localizations, as detailed below. The number of sequences dropped sharply to 6892 after a sequence identity cutoff of 30% was applied with BlastClust (Table 1). To avoid homology bias in machine learning, a 25% or 30% sequence identity cutoff is needed to guarantee that no protein in the benchmark dataset shares more than that identity with any other sequence in the dataset. The cutoff was applied within each class as well as across classes. About 10% of the data was then set aside for later independent testing of the models; testing on independent data that is not used during training has been reported to be the best benchmark for comparing prediction models. Finally, fragment proteins with fewer than 50 amino acids were discarded, leaving 5879 protein sequences for training and 629 sequences for independent testing.
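The last two steps described above (setting aside about 10% of the data, then dropping fragments shorter than 50 amino acids) can be sketched as follows. This is a minimal illustration, not the pipeline used to build the dataset: the function names are hypothetical, and splitting per class with a seeded shuffle is an assumption, since the text does not say how the held-out sequences were chosen.

```python
import random

MIN_LENGTH = 50          # fragments with fewer residues are discarded (from the text)
HOLDOUT_FRACTION = 0.10  # "about 10%" is set aside (from the text)


def split_by_class(records, holdout_fraction=HOLDOUT_FRACTION, seed=0):
    """Split (label, sequence) records into training and independent sets.

    Assumption: the hold-out is drawn separately within each class.
    """
    rng = random.Random(seed)
    by_label = {}
    for label, seq in records:
        by_label.setdefault(label, []).append(seq)

    train, independent = [], []
    for label, seqs in by_label.items():
        seqs = seqs[:]          # copy so the caller's list is not shuffled
        rng.shuffle(seqs)
        n_holdout = round(len(seqs) * holdout_fraction)
        independent += [(label, s) for s in seqs[:n_holdout]]
        train += [(label, s) for s in seqs[n_holdout:]]
    return train, independent


def drop_fragments(records, min_length=MIN_LENGTH):
    """Keep only records whose sequence has at least min_length residues."""
    return [(label, s) for label, s in records if len(s) >= min_length]
```

A call such as `train, independent = split_by_class(records)` followed by `drop_fragments(train)` and `drop_fragments(independent)` mirrors the order given in the text: split first, filter second.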


Type         | Subcellular location  | # sequences retrieved | # sequences after redundancy check (40% cut-off) | Training dataset | Training dataset filtered | Independent dataset
Single Label | Plastid               | 11302                 | 2979                                             | 2678             | 2468                      | 248
             | Cytoplasm             | 739                   | 403                                              | 361              | 351                       | 40
             | Extracellular         | 237                   | 186                                              | 166              | 140                       | 14
             | Nucleus               | 734                   | 636                                              | 571              | 568                       | 63
             | Mitochondrion         | 759                   | 537                                              | 481              | 447                       | 52
             | Cell Membrane         | 1256                  | 927                                              | 830              | 829                       | 92
             | Golgi Apparatus       | 277                   | 229                                              | 204              | 204                       | 23
             | Endoplasmic Reticulum | 393                   | 320                                              | 285              | 280                       | 29
             | Vacuole               | 260                   | 198                                              | 176              | 176                       | 20
             | Peroxisome            | 80                    | 63                                               | 57               | 57                        | 6
             | Cell Wall             | 52                    | 47                                               | 42               | 37                        | 5
Dual Label   | Mito-Plastid          | 141                   | 133                                              | 118              | 118                       | 13
             | Cyto-Nucleus          | 210                   | 196                                              | 175              | 170                       | 20
             | Cyto-Golgi            | 54                    | 38                                               | 34               | 34                        | 4
Total        |                       | 16494                 | 6892                                             | 6178             | 5879                      | 629


About

Identifying the subcellular location of proteins has long been an area of interest in protein science, and it has been studied extensively with computational methods over the past few decades. However, most methods focus on predicting a single location. The proposed approach predicts 11 single locations (cell membrane, cell wall, plastid, cytoplasm, endoplasmic reticulum, extracellular, Golgi apparatus, mitochondrion, nucleus, peroxisome and vacuole) and three significant multi-location classes (cytoplasm-nucleus, mitochondrion-plastid and cytoplasm-Golgi apparatus). Various sequence-derived features based on composition and physicochemical properties, such as amino acid composition, pseudo amino acid composition, dipeptide composition and hybrids of these, are used to represent each protein.

Here is a brief description of the sequence features used by the application.

(AAC): Amino Acid Composition. A 20-dimensional descriptor holding the percentage of each amino acid in the whole sequence.
(Dipep): Dipeptide Composition. A 400-dimensional descriptor holding the percentage of each pair of adjacent amino acids in the whole sequence.
(PseAAC): Pseudo Amino Acid Composition. A 30-dimensional descriptor holding the AAC information plus 10 more descriptors based on the hydrophobicity and hydrophilicity values of the amino acids present.
(NCC): N-Center-C terminal Composition. A 60-dimensional descriptor holding the AAC of each of three subsequences of the full-length sequence: N-terminal is the first 25 amino acids, C-terminal is the last 25 amino acids, and Center is the rest of the amino acids.
(CTDC): Composition according to the hydrophobicity, normalized van der Waals volume, polarity, and polarizability attributes. 21-dimensional descriptor (protr R package).
(CTDT): Transition according to the hydrophobicity, normalized van der Waals volume, polarity, and polarizability attributes. 21-dimensional descriptor (protr R package).
(QSO): Quasi Order Descriptor, normalized occurrence of amino acids. 100-dimensional descriptor (protr R package).
(PseAACNCCDipep): Hybrid of Pseudo AAC, N-Center-C terminal (3 parts) and Dipeptide Composition. 490-dimensional descriptor.
(NCCDipepCTDCCTDTQSO): Hybrid of N-Center-C terminal (3 parts), Dipeptide Composition, Composition and Transition (Dubchak) and Quasi Order Descriptor. 602-dimensional descriptor.
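To make the composition-based descriptors concrete, here is a minimal Python sketch of AAC, Dipep and NCC as they are described above. It is an illustration only: the application itself relies on R (protr) and MATLAB-derived scripts, the function names are hypothetical, and details such as the residue ordering and how non-standard residues are handled are assumptions. The hybrid sizes quoted above (30 + 60 + 400 = 490 and 60 + 400 + 21 + 21 + 100 = 602) are consistent with simple concatenation of the component vectors, but the concatenation order is not stated.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is an assumption)
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac(seq):
    """Amino acid composition: 20 percentages, one per standard residue."""
    n = len(seq)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * seq.count(a) / n for a in AMINO_ACIDS]


def dipep(seq):
    """Dipeptide composition: 400 percentages, one per adjacent residue pair."""
    total = len(seq) - 1  # number of adjacent pairs in the sequence
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(max(total, 0)):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


def ncc(seq, terminal_len=25):
    """N-Center-C composition: AAC of the first 25, middle, and last 25 residues."""
    n_part = seq[:terminal_len]
    center = seq[terminal_len:-terminal_len]
    c_part = seq[-terminal_len:]
    return aac(n_part) + aac(center) + aac(c_part)
```

For a sequence longer than 50 residues, `len(aac(seq))`, `len(dipep(seq))` and `len(ncc(seq))` are 20, 400 and 60, matching the descriptor sizes listed above. A sequence of 50 residues or fewer leaves the Center part empty, which is one reason a minimum length is needed.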

Author

This application uses the R protr package (Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. (2015). protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 31 (11), 1857-1859.) to extract the CTDC, CTDT, QSO and Dipep features. The support vector machine step uses the e1071 R package. The application was built by Cristian Loaiza, graduate student in the Plant, Soils and Climate Department at Utah State University. The backend is based on scripts developed in MATLAB by former postdoc Sitanshu Sekhar Sahu under the supervision of Dr Rakesh Kaundal.

Help

You can either upload a multi-FASTA file or paste your sequence(s) in the box. Then select the prediction method you want to use and click the Submit button. When the result is ready, you can download it as a file that can be opened in a spreadsheet (e.g. Excel), or clear the result to submit a new set of sequences.

Minimum number of amino acids required per sequence

Sequences must be longer than 50 amino acids, because some feature representations need at least that length to calculate their values.

Sequence number threshold

Up to 1000 sequences can be submitted at a time. This threshold may change after user feedback, because run time varies substantially with the prediction strategy: the AAC support vector machine approach is the fastest, while the hybrids and the BLAST strategy are much slower and consume more computational resources.

Where do the sequences used to train the SVM model come from?

In the Dataset tab you will find a table giving the query run against the UniProtKB database, the number of sequences per model, and links to download the training set and the independent test set.

Problems/Suggestions about Plant-mSubP

If you can't get something to work, or you want to suggest something that would make this application work better, please send an email to crissloaiza@gmail.com.
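Before submitting, it can help to check a multi-FASTA input against the two limits given in this Help section (sequences longer than 50 amino acids, at most 1000 sequences). The sketch below is one way to do that locally in Python; it is not the application's own validation code, and the function names and message wording are hypothetical.

```python
MIN_LENGTH_EXCLUSIVE = 50  # sequences must be longer than this (from the Help text)
MAX_SEQUENCES = 1000       # current submission limit (from the Help text)


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from multi-FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(text):
    """Parse the input and list every way it violates the stated limits."""
    records = parse_fasta(text)
    problems = []
    if len(records) > MAX_SEQUENCES:
        problems.append(
            f"{len(records)} sequences submitted; the limit is {MAX_SEQUENCES}"
        )
    for header, seq in records:
        if len(seq) <= MIN_LENGTH_EXCLUSIVE:
            problems.append(
                f"{header}: {len(seq)} residues; more than {MIN_LENGTH_EXCLUSIVE} required"
            )
    return records, problems
```

An empty `problems` list means the input satisfies both limits; otherwise each entry names the offending sequence header or the total count.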