Datasets used for model development for the Arabidopsis thaliana and Pseudomonas syringae system

The predictive capability of an SVM model depends on the quality of the positive and negative control datasets used for the study. The following datasets were used for training and testing:

Positive Set:

We collected well-curated, experimentally supported interactions to use as positive controls for the support vector machine (SVM) models. The largest Arabidopsis-Pseudomonas interaction dataset available to date comes from the experiment of Mukhtar et al., 2011, which contains 153 PPIs. We collected a further 13 PPIs from the HPIDB database and another 21 from various databases such as BIND, DIP, MINT, and iRefIndex. In total, 187 experimental PPIs were collected; after removing duplicate pairs, 166 unique pairs were used as the positive dataset. To reduce redundancy among the pairs, we ran CD-HIT at a 40% cutoff on the dataset, and the resulting 34 PPIs were kept as the training dataset. The remaining 132 PPIs were used as an independent test set.
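The merging and duplicate-removal step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `dedupe_pairs`, the case-insensitive matching of identifiers, and the toy identifiers are all assumptions made for the example.

```python
def dedupe_pairs(*sources):
    """Merge PPI lists and drop repeated (host, pathogen) pairs.

    Assumption: two records are duplicates when both identifiers match
    after trimming whitespace and ignoring case. First-seen order is kept.
    """
    seen = set()
    unique = []
    for source in sources:
        for host, pathogen in source:
            key = (host.strip().upper(), pathogen.strip().upper())
            if key not in seen:
                seen.add(key)
                unique.append(key)
    return unique


# Toy identifiers for illustration only (not the real datasets).
large_screen = [("AT1G01010", "HopA1"), ("AT2G02020", "AvrPto")]
hpidb_like = [("at1g01010", "hopa1"), ("AT3G03030", "HopM1")]
other_dbs = [("AT3G03030", "HopM1")]

positives = dedupe_pairs(large_screen, hpidb_like, other_dbs)
print(len(positives), "unique pairs")
```

Applied to the real sources, the same idea would reduce the 187 collected PPIs to the 166 unique pairs reported above, provided duplicates are defined the same way.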

Negative Set:

We compiled a set of keywords related to intraspecies and interspecies interaction by carefully searching the literature (leucine, coiled, resistance, kinase, binding, disease, defense, defensin, interaction, receptor, etc.). We searched for both the intraspecies and interspecies keywords in the sequence annotations of the whole Arabidopsis proteome (35386 proteins) obtained from TAIR, and 10048 proteins were selected as potential candidates for interaction. We then searched the Swiss-Prot knowledgebase for Arabidopsis with the interspecies keywords, and 13832 proteins were collected as positive hits. We also added the Arabidopsis proteins from the positive dataset described above as candidates for interaction with Pseudomonas. After these processing steps, 21458 unique Arabidopsis proteins had been collected as potential positive candidates for interaction. To prepare a cleaner negative control dataset, we identified homologs of these positive candidates among the remaining 13928 proteins using BLAST with an E-value < 10^-4. After removing these positive-like candidates, the remaining 5955 proteins were taken as the negative control dataset.
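The keyword screen and the homolog removal described above can be sketched as below. This is a hedged illustration under stated assumptions: the function names, the substring-based keyword matching, and the toy annotations are hypothetical, and the BLAST results are assumed to be in the standard 12-column tabular format (E-value in the eleventh column) with the remaining proteins as queries against the positive candidates.

```python
KEYWORDS = ("leucine", "coiled", "resistance", "kinase", "binding",
            "disease", "defense", "defensin", "interaction", "receptor")


def keyword_hits(annotations, keywords=KEYWORDS):
    """Return IDs whose annotation contains any keyword (case-insensitive)."""
    return {pid for pid, text in annotations.items()
            if any(k in text.lower() for k in keywords)}


def homolog_queries(blast_lines, max_evalue=1e-4):
    """Return query IDs with at least one hit whose E-value is below max_evalue.

    Assumes tab-separated BLAST output with the query ID in column 1
    and the E-value in column 11.
    """
    hits = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) < max_evalue:
            hits.add(fields[0])
    return hits


# Toy data for illustration only.
annotations = {
    "P1": "Leucine-rich repeat receptor kinase",
    "P2": "Ribosomal protein L5",
    "P3": "Unknown protein",
    "P4": "Disease resistance protein",
}
blast_lines = [
    "P2\tP1\t45.0\t200\t100\t2\t1\t200\t5\t204\t1e-30\t150",
    "P3\tP4\t25.0\t50\t35\t1\t10\t60\t20\t70\t0.5\t20.1",
]

positive_candidates = keyword_hits(annotations)
remaining = set(annotations) - positive_candidates
negatives = remaining - homolog_queries(blast_lines)
print(sorted(negatives))
```

In this toy case P2 is dropped from the negatives because it has a strong hit to a positive candidate, which mirrors the removal of positive-like proteins from the remaining 13928.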

Since proteins localized in the bacterial cytoplasm may not be involved in interaction, all proteins of Pseudomonas (all three pathovars: tomato DC3000, phaseolicola, and syringae) were processed with PSORTb 3.0, a widely used tool for predicting protein localization in bacteria (www.psort.org/psortb), and those predicted as cytoplasmic or cytoplasmic membrane were considered negative candidates. The remaining proteins were considered positive candidates. We also searched the whole proteome of all three Pseudomonas pathovars against the effector database (http://www.effectors.org/), an integrated database of secreted bacterial proteins. Proteins identified as secreted were considered positive candidates for interaction. Combining these two steps gave a positive dataset, and the rest were considered negatives. The positive dataset was then searched with BLAST against the negative candidates to remove homologs of the positives from the negative dataset. Next, we searched the negative candidate proteins of Pseudomonas with the interaction-related keywords (listed above) and removed the hits from the negative dataset. The proteins remaining after these steps were taken as the negative control. In total, 3383 Pseudomonas proteins make up the negative dataset.
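The first two Pseudomonas steps (localization filter plus secreted-protein lookup) can be sketched as a simple partition. This is an assumption-laden illustration: the function name, the localization label strings, and the protein IDs are hypothetical and do not reproduce the exact output format of the localization tool or the effector database.

```python
# Assumed label spellings, compared after removing spaces and lowercasing.
NEGATIVE_LOCALIZATIONS = {"cytoplasmic", "cytoplasmicmembrane"}


def split_by_localization(localization, secreted_ids):
    """Partition proteins into positive and negative candidates.

    A protein is a negative candidate only if its predicted localization
    is cytoplasmic or cytoplasmic membrane AND it was not identified as
    secreted; every other protein is a positive candidate.
    """
    positives, negatives = set(), set()
    for pid, label in localization.items():
        norm = label.replace(" ", "").lower()
        if norm in NEGATIVE_LOCALIZATIONS and pid not in secreted_ids:
            negatives.add(pid)
        else:
            positives.add(pid)
    return positives, negatives


# Toy data for illustration only.
localization = {
    "prot1": "Cytoplasmic",
    "prot2": "Cytoplasmic Membrane",
    "prot3": "Extracellular",
    "prot4": "Unknown",
}
secreted = {"prot2"}

pos, neg = split_by_localization(localization, secreted)
print(sorted(pos), sorted(neg))
```

The later homolog and keyword filters would then be applied to the negative set produced here.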

Note: The positive and negative datasets can be provided upon request to the authors.