Plant subcellular location predictor

Submit

Upload query protein sequence(s) in FASTA format:

Paste sequences below (FASTA only)

Select prediction module

AAC

Dipep

PseAACNCCDipep

NCCDipepCTDCCTDTQSO

Protein Blast (similarity search)

Download

Download Blast Alignments

Dataset

Protein sequences from all plants were downloaded from the UniProt database (release 2013_02, http://www.uniprot.org) using the query [keywords: SUBCELLULAR LOCATION AND reviewed: yes]. Sequences whose annotations were marked as ‘PROBABLE’, ‘POSSIBLE’, or ‘BY SIMILARITY’ were discarded. This left 16494 protein sequences, annotated to 14 different single- and multi-label subcellular localizations, as detailed below. The number of sequences dropped sharply to 6892 after a sequence identity cutoff of 30% (Table 1) was applied using BlastClust. To avoid homology bias in machine learning, a sequence identity cutoff of 25 or 30% is needed to guarantee that no protein in the benchmark dataset shares more than this level of identity with any other sequence in the dataset. The cutoff was applied both within and across classes. About 10% of the data was then set aside for later independent testing of the models. Testing on independent datasets that are not used in the machine learning process has been reported to be the best benchmark for comparing the performance of prediction models. We also discarded fragment proteins shorter than 50 amino acids. The final dataset contains 5879 protein sequences for training and 629 sequences for independent testing.
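The hold-out and fragment-filtering steps described above can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `split_and_filter`, the per-class split, the rounding rule, and the random seed are all assumptions made for illustration. Only the 10% hold-out fraction and the 50-residue minimum come from the text, and the filter is applied to the training set only because the table has a "Training dataset filtered" column.

```python
import random
from collections import defaultdict

MIN_LENGTH = 50          # fragments shorter than this are discarded (from the text)
HOLDOUT_FRACTION = 0.10  # "about 10%" is set aside for independent testing


def split_and_filter(records, holdout_fraction=HOLDOUT_FRACTION,
                     min_length=MIN_LENGTH, seed=0):
    """Split (seq_id, label, sequence) records into training and independent sets.

    Hypothetical helper: the split is done per label so that every class
    contributes to the independent set, which is an assumption, not a
    documented detail of the original workflow.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[1]].append(rec)

    training, independent = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_holdout = round(len(group) * holdout_fraction)
        independent.extend(group[:n_holdout])
        training.extend(group[n_holdout:])

    # Drop fragment proteins from the training set only.
    training = [rec for rec in training if len(rec[2]) >= min_length]
    return training, independent
```

Splitting per class rather than over the pooled data keeps small classes such as Cell Wall or Cyto-Golgi represented in the independent set.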

Type	Subcellular location	# sequences retrieved	# sequences after redundancy check (40% cut-off)	Training dataset	Training dataset filtered	Independent dataset
Single Label	Plastid	11302	2979	2678	2468	248
	Cytoplasm	739	403	361	351	40
	Extracellular	237	186	166	140	14
	Nucleus	734	636	571	568	63
	Mitochondrion	759	537	481	447	52
	Cell Membrane	1256	927	830	829	92
	Golgi Apparatus	277	229	204	204	23
	Endoplasmic Reticulum	393	320	285	280	29
	Vacuole	260	198	176	176	20
	Peroxisome	80	63	57	57	6
	Cell Wall	52	47	42	37	5
Dual Label	Mito-Plastid	141	133	118	118	13
	Cyto-Nucleus	210	196	175	170	20
	Cyto-Golgi	54	38	34	34	4
	Total	16494	6892	6178	5879	629

About

Identifying the subcellular location of proteins has long been an area of interest in protein science, and it has been studied extensively with computational methods over the past few decades. However, most methods focus on single-location prediction. The proposed approach successfully predicts 11 single locations (cell membrane, cell wall, plastid, cytoplasm, endoplasmic reticulum, extracellular, Golgi apparatus, mitochondrion, nucleus, peroxisome and vacuole) and three significant multi-location classes (cytoplasm-nucleus, mitochondrion-plastid and cytoplasm-Golgi apparatus). Various sequence-derived features based on composition and physicochemical properties, such as amino acid composition, pseudo amino acid composition, dipeptide composition, and hybrids of these, are used to represent each protein.

Here is a brief description of the sequence features used by the application.

(AAC): Amino Acid Composition. A 20-dimensional descriptor giving the percentage of each amino acid in the whole sequence.

(Dipep): Dipeptide Composition. A 400-dimensional descriptor giving the percentage of each pair of consecutive amino acids in the whole sequence.

(PseAAC): Pseudo Amino Acid Composition. A 30-dimensional descriptor containing the AAC information plus 10 additional descriptors based on the hydrophobicity and hydrophilicity values of the amino acids present.

(NCC): N-Center-C terminal Composition. A 60-dimensional descriptor containing the AAC of each of three subsequences of the full-length sequence: the N-terminal part is the first 25 amino acids, the C-terminal part is the last 25 amino acids, and the Center is the remaining amino acids.

(CTDC): Composition with respect to the hydrophobicity, normalized van der Waals volume, polarity, and polarizability attributes. A 21-dimensional descriptor (protr R package).

(CTDT): Transition with respect to the hydrophobicity, normalized van der Waals volume, polarity, and polarizability attributes. A 21-dimensional descriptor (protr R package).

(QSO): Quasi Order Descriptor, the normalized occurrence of amino acids. A 100-dimensional descriptor (protr R package).

(PseAACNCCDipep): Hybrid of Pseudo AAC, N-Center-C terminal (3 parts) and Dipeptide Composition. A 490-dimensional descriptor.

(NCCDipepCTDCCTDTQSO): Hybrid of N-Center-C terminal (3 parts), Dipeptide Composition, Composition and Transition (Dubchak) and Quasi Order Descriptor. A 602-dimensional descriptor.
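The three composition-based descriptors (AAC, Dipep and NCC) are simple enough to sketch in a few lines of Python. This is an illustrative sketch only, not the application's code (the application uses R and the protr package); the function names, the residue ordering, and the handling of non-standard residues are assumptions.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is an assumption)
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac(seq):
    """20-dimensional amino acid composition, in percent of sequence length."""
    n = len(seq)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * seq.count(a) / n for a in AMINO_ACIDS]


def dipep(seq):
    """400-dimensional dipeptide composition, in percent of consecutive pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        return [0.0] * len(DIPEPTIDES)
    counts = Counter(pairs)
    # Pairs containing non-standard residues are counted in the total but
    # have no slot of their own (an assumption of this sketch).
    return [100.0 * counts[d] / len(pairs) for d in DIPEPTIDES]


def ncc(seq, terminal_len=25):
    """60-dimensional N-Center-C composition: AAC of the first 25 residues,
    of the middle part, and of the last 25 residues, concatenated."""
    if len(seq) <= 2 * terminal_len:
        raise ValueError("sequence must be longer than 50 residues for NCC")
    n_part = seq[:terminal_len]
    center = seq[terminal_len:-terminal_len]
    c_part = seq[-terminal_len:]
    return aac(n_part) + aac(center) + aac(c_part)


def ncc_dipep(seq):
    """Hybrid vector by simple concatenation (60 + 400 = 460 values)."""
    return ncc(seq) + dipep(seq)
```

The hybrid modules follow the same concatenation idea, which is why their sizes add up: 30 + 60 + 400 = 490 for PseAACNCCDipep and 60 + 400 + 21 + 21 + 100 = 602 for NCCDipepCTDCCTDTQSO. PseAAC, CTDC, CTDT and QSO are left out of the sketch because they depend on physicochemical property tables.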

Author

This application uses the R protr package (Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. (2015). protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 31 (11), 1857-1859.) to extract the CTDC, CTDT, QSO and Dipep features. The support vector machine step uses the e1071 R package. The application was built by Cristian Loaiza, graduate student in the Plants, Soils and Climate Department at Utah State University. The backend is based on scripts developed in MATLAB by former postdoc Sitanshu Sekhar Sahu under the supervision of Dr. Rakesh Kaundal.

Help

You can either upload a multi-FASTA file or paste your sequence(s) in the box. Then select the prediction method you want to use and click the Submit button. When the result is ready, you can download it as a file that can be opened in a spreadsheet program (e.g., Excel), or clear the result to submit a new set of sequences.

Minimum number of amino acids required per sequence

Sequences must be longer than 50 amino acids, because some feature representations need at least that length to be calculated.

Sequence number threshold

Up to 1000 sequences can be submitted at a time. This threshold may change based on user feedback, because run time varies substantially with the prediction strategy: the AAC support vector machine approach is the fastest, whereas the hybrid modules and the BLAST strategy are much slower and consume more computational resources.

Where do the sequences used to train the SVM model come from?

The Dataset tab contains a table with the query performed against the UniProtKB database, the number of sequences per model, and a link to download the training set and the independent test set.

Problems/suggestions about Plant-mSubP

If you can't get something to work, or you want to suggest something that would make this application work better, please send an email to crissloaiza@gmail.com.
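The input rules in this section (FASTA input, sequences longer than 50 amino acids, at most 1000 sequences per submission) can be checked locally before uploading. The following is a minimal Python sketch under those stated limits; it is not part of the application, and the function names `parse_fasta` and `check_submission` are hypothetical.

```python
MIN_LENGTH = 50       # sequences must be longer than this (from the Help text)
MAX_SEQUENCES = 1000  # current per-submission limit (may change)


def parse_fasta(text):
    """Return a list of (header, sequence) pairs parsed from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(text):
    """Split records into (accepted, too_short) and enforce the size limit."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    if len(records) > MAX_SEQUENCES:
        raise ValueError("more than %d sequences submitted" % MAX_SEQUENCES)
    accepted = [(h, s) for h, s in records if len(s) > MIN_LENGTH]
    too_short = [h for h, s in records if len(s) <= MIN_LENGTH]
    return accepted, too_short
```

Calling `check_submission` on the contents of a multi-FASTA file returns the records that meet the length requirement together with the headers of those that do not, so short sequences can be removed before submission.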