Datasets
Host-pathogen protein-protein interactions (HPIs) were retrieved primarily from two sources: HPIDB, a database containing interactions from a wide range of disease systems, and the comprehensive literature review presented in the implementation of the PredHPI prediction server. The two datasets were merged, and duplicates were removed based on both the identifiers and the sequences of the proteins.
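The merge-and-deduplicate step can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the `Interaction` record, its field names, and the toy entries are hypothetical, and the sketch assumes that a pair counts as a duplicate when it matches an earlier pair on either the identifier pair or the sequence pair.

```python
from typing import Iterable, List, NamedTuple


class Interaction(NamedTuple):
    # Hypothetical record layout; the real column names are not given in the text.
    host_id: str
    pathogen_id: str
    host_seq: str
    pathogen_seq: str


def merge_and_deduplicate(*sources: Iterable[Interaction]) -> List[Interaction]:
    """Concatenate sources, keeping the first occurrence of each interaction.

    Assumption: a record is a duplicate if its (host, pathogen) identifier pair
    OR its (host, pathogen) sequence pair has already been seen.
    """
    seen_ids = set()
    seen_seqs = set()
    merged = []
    for source in sources:
        for rec in source:
            id_key = (rec.host_id, rec.pathogen_id)
            seq_key = (rec.host_seq, rec.pathogen_seq)
            if id_key in seen_ids or seq_key in seen_seqs:
                continue  # same pair already present under these IDs or sequences
            seen_ids.add(id_key)
            seen_seqs.add(seq_key)
            merged.append(rec)
    return merged


# Toy data standing in for the two sources (not real entries).
hpidb = [
    Interaction("P1", "Q1", "MKT", "MAA"),
    Interaction("P2", "Q2", "MLL", "MGG"),
]
predhpi = [
    Interaction("P1", "Q1", "MKT", "MAA"),  # duplicate by identifiers
    Interaction("P9", "Q9", "MLL", "MGG"),  # duplicate by sequences
    Interaction("P3", "Q3", "MVV", "MCC"),  # new interaction
]
merged = merge_and_deduplicate(hpidb, predhpi)
```

Checking sequences as well as identifiers catches the same protein pair deposited under different accessions in the two sources.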
Subsequently, the whole dataset was separated according to whether the host protein's organism was human, animal, or plant. Each subset was then divided into training and testing datasets in order to build our deep learning models.
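The grouping and splitting described above might look like the sketch below. The `host_group` key, the 20% test fraction, and the fixed seed are assumptions made for illustration; the text does not state the split ratio or how the host category was recorded.

```python
import random
from collections import defaultdict


def split_by_host(records, group_key="host_group"):
    """Group records by host category (e.g. human, animal, plant)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    return dict(groups)


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and return (train, test).

    The 0.2 test fraction is an assumed value, not one taken from the text.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy records: 5 human, 3 animal and 2 plant host interactions.
records = (
    [{"pair": f"human_{i}", "host_group": "human"} for i in range(5)]
    + [{"pair": f"animal_{i}", "host_group": "animal"} for i in range(3)]
    + [{"pair": f"plant_{i}", "host_group": "plant"} for i in range(2)]
)
groups = split_by_host(records)
splits = {name: train_test_split(recs) for name, recs in groups.items()}
```

Splitting each host group separately keeps the three datasets independent, so a model trained on one host category is never evaluated on another.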
Machine learning models also require negative data. Hypothetically, among all possible protein pairs, the ratio of non-interactions to true interactions should be large, meaning that many more non-interactions exist for each true interaction. To represent this natural imbalance in our models, we gathered a large non-interaction dataset from here. This negative dataset was merged with each of the human, animal, and plant datasets to generate our final datasets.
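As a rough sketch of how a negative set can be combined with a positive set, the snippet below labels interacting pairs as 1 and non-interacting pairs as 0 and reports the resulting class ratio. The function names and toy pairs are hypothetical, and dropping negatives that collide with a known positive is an added safeguard assumed here, not a step described in the text.

```python
def build_labeled_dataset(positives, negatives):
    """Return (pair, label) tuples: 1 for interactions, 0 for non-interactions.

    Assumption: a negative pair that also appears among the positives is
    discarded so that no pair carries both labels.
    """
    positive_set = set(positives)
    labeled = [(pair, 1) for pair in positives]
    labeled += [(pair, 0) for pair in negatives if pair not in positive_set]
    return labeled


def negative_to_positive_ratio(labeled):
    """Number of negative examples per positive example."""
    n_pos = sum(1 for _, label in labeled if label == 1)
    n_neg = sum(1 for _, label in labeled if label == 0)
    if n_pos == 0:
        raise ValueError("dataset contains no positive examples")
    return n_neg / n_pos


# Toy (host_id, pathogen_id) pairs, not real data.
positives = [("H1", "P1"), ("H2", "P2")]
negatives = [("H1", "P2"), ("H2", "P1"), ("H3", "P1"), ("H1", "P1")]

labeled = build_labeled_dataset(positives, negatives)
ratio = negative_to_positive_ratio(labeled)
```

The same call would be repeated once per host category, pairing the shared negative set with the human, animal, and plant positives in turn.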
Datasets are available upon request. If you would like to obtain these datasets for your own experiments, please send an email to crissloaiza@gmail.com or rkaundal@usu.edu.