Nitrification is an important two-step microbial transformation in the nitrogen cycle, as it is the only natural process that produces nitrate within a system. The functional annotation of nitrification-related enzymes has a broad range of applications, including metagenomics and industrial biotechnology. Determining enzyme function experimentally is prohibitively costly in time and resources. Accurate computational prediction of nitrification-related enzymes has therefore become increasingly important.
In this study, we developed an alignment-free computational approach that identifies nitrification-related enzymes from the sequence alone. We propose deepNEC, a novel end-to-end approach to feature selection and classification model training for the prediction of nitrification-related enzymes. The algorithm is built on deep learning, a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from raw input data. The encoded raw protein sequence is used as input, and sequential and convolutional features are extracted from it for classification, rather than relying on traditional alignment-based methods. Two large datasets of protein sequences, enzymes and non-enzymes, were used to train the models with protein sequence features such as amino acid composition (AAC), dipeptide composition (DPC), composition, transition and distribution (CTD), NMBroto, conjoint triad and quasi-sequence-order. K-fold cross-validation and independent testing were performed to validate model training. deepNEC uses a three-tier approach for prediction: in phase I, it predicts whether a query sequence is an enzyme or a non-enzyme; in phase II, it predicts whether the enzyme is an oxidoreductase or a non-oxidoreductase; in phase III, it classifies the enzymes predicted as oxidoreductases in phase II into the various nitrification-related enzyme classes.
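The dipeptide composition feature and the three-tier decision flow described above can be illustrated with a minimal Python sketch. It is not the deepNEC implementation: the function names (`dipeptide_composition`, `three_tier_predict`), the restriction to the 20 standard amino acids, and the use of one feature vector for all three phases are assumptions made for illustration, and the three classifiers are passed in as hypothetical callables standing in for trained models.

```python
from itertools import product
from typing import Callable, List

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def dipeptide_composition(seq: str) -> List[float]:
    """Return the 400-dimensional dipeptide composition (DPC) of a sequence.

    Each entry is the fraction of overlapping residue pairs equal to one
    dipeptide. Pairs containing non-standard residues are skipped.
    """
    seq = seq.upper()
    counts = {dp: 0 for dp in DIPEPTIDES}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[dp] / total for dp in DIPEPTIDES]


def three_tier_predict(
    seq: str,
    is_enzyme: Callable[[List[float]], bool],          # hypothetical phase I model
    is_oxidoreductase: Callable[[List[float]], bool],  # hypothetical phase II model
    nitrification_class: Callable[[List[float]], str], # hypothetical phase III model
) -> str:
    """Cascade a query through the three phases, stopping at the first negative."""
    features = dipeptide_composition(seq)
    if not is_enzyme(features):
        return "non-enzyme"
    if not is_oxidoreductase(features):
        return "non-oxidoreductase enzyme"
    return nitrification_class(features)


if __name__ == "__main__":
    # Stand-in classifiers; real models would be trained on labelled sequences.
    label = three_tier_predict(
        "MKTAYIAKQR",
        is_enzyme=lambda f: True,
        is_oxidoreductase=lambda f: True,
        nitrification_class=lambda f: "example-class",
    )
    print(label)
```

The cascade only invokes a later phase when the earlier one is positive, so a sequence rejected in phase I never reaches the nitrification-class model.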
Among all features, the DPC+NMBroto hybrid feature gave the best prediction performance in phase I (accuracy of 96.15% in k-fold training and 93.43% in independent testing; MCC of 0.92 and 0.87, respectively). In phase II, the DPC feature gave the best prediction performance (accuracy of 99.82% in training and 94% in independent testing; MCC of 0.99 and 0.88, respectively). In phase III, the DPC feature gave the best prediction performance for 13 nitrification-related enzyme classes. We have also implemented a homology-based method to remove false negatives.
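For readers unfamiliar with the two reported metrics, the following sketch shows how accuracy and the Matthews correlation coefficient (MCC) are computed from a binary confusion matrix. It uses the standard textbook definitions. The function names are illustrative, and the counts in the usage lines are made-up values, not results from this study.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). Returns 0.0 when the denominator is zero,
    a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(accuracy(tp=90, tn=85, fp=15, fn=10))
    print(mcc(tp=90, tn=85, fp=15, fn=10))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it informative when the enzyme and non-enzyme classes are imbalanced.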