RESEARCH PAPER
Prediction of protein subcellular localization using support vector machine with the choice of proper kernel
 
Department of Computer Science and Engineering, University of Rajshahi, Rajshahi, Bangladesh
 
 
Submission date: 2016-11-15
Final revision date: 2017-01-15
Acceptance date: 2017-01-27
Publication date: 2017-07-18
 
 
BioTechnologia 2017;98(2):85-96
 
ABSTRACT
Predicting the subcellular locations of proteins can provide useful hints about their functions, help explain the mechanisms of some diseases, and ultimately support the development of novel drugs. As the number of newly discovered proteins has been growing exponentially, laboratory-based experiments to determine the location of an uncharacterized protein in a living cell have become both expensive and time-consuming. Consequently, computational methods are being developed as an alternative to help biologists select target proteins and design related experiments. However, protein subcellular localization prediction remains a complicated and challenging problem, particularly when query proteins have multi-label characteristics, i.e. when they exist simultaneously in more than one subcellular location or move between two or more subcellular locations. To address this problem, several types of subcellular localization prediction methods with different levels of accuracy have been proposed. The support vector machine (SVM) has been employed to provide potential solutions for the prediction of protein subcellular localization. However, the practicability of SVM is affected by the difficulty of selecting an appropriate kernel and of selecting the parameters of that kernel. A literature survey has shown that most researchers apply the radial basis function (RBF) kernel to build an SVM-based subcellular localization prediction system. Surprisingly, many other kernel functions have not yet been applied to the prediction of protein subcellular localization. The nature of this classification problem, however, calls for trying different SVM kernels to ensure an optimal result.
From this viewpoint, this paper applies different kernels for SVM in protein subcellular localization prediction to determine which kernel performs best. We evaluated our system on a combined dataset containing 5447 single-localized proteins (originally published as part of the Höglund dataset) and 3056 multi-localized proteins (originally published as part of the DBMLoc set). This dataset was used by Briesemeister et al. in their extensive comparison of multilocalization prediction systems. The experimental results indicate that the system based on SVM with the Laplace kernel, termed LKLoc, not only achieves a higher accuracy than the systems using other kernels but also shows significantly better results than those obtained from other top systems (MDLoc, BNCs, YLoc+). The source code of this prediction system is available upon request.
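To illustrate how the RBF and Laplace kernels mentioned above differ, the following minimal sketch computes both on a few toy vectors. It is not the LKLoc implementation: the abstract does not state the exact kernel formula or parameters used, so the Laplace kernel is assumed here to take one common form, exp(-gamma * ||x - y||) with the unsquared Euclidean norm (some libraries use the L1 norm instead). The function names, the `gamma` value, and the sample vectors are all hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def laplace_kernel(x, y, gamma=1.0):
    """Laplace kernel (assumed form): exp(-gamma * ||x - y||), unsquared Euclidean norm."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-gamma * dist)


def gram_matrix(kernel, samples, gamma=1.0):
    """Pairwise kernel values, the matrix an SVM solver works with."""
    return [[kernel(a, b, gamma) for b in samples] for a in samples]


# Hypothetical feature vectors; real inputs would be protein-derived features.
samples = [[0.0, 0.0], [3.0, 4.0], [0.3, 0.4]]

rbf_gram = gram_matrix(rbf_kernel, samples)
laplace_gram = gram_matrix(laplace_kernel, samples)

for name, gram in (("RBF", rbf_gram), ("Laplace", laplace_gram)):
    print(name)
    for row in gram:
        print("  " + "  ".join(f"{value:.4f}" for value in row))
```

Because the Laplace kernel depends on the distance rather than its square, it decays more slowly than the RBF kernel for distant points and more quickly for very close ones (at equal `gamma`). This difference in similarity profile is one reason the choice of kernel can change classifier behaviour.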
REFERENCES (57)
1.
Bannai H., Tamada Y., Maruyama O., Nakai K., Miyano S. (2002) Extensive feature detection of N-terminal protein sorting signals. Bioinformatics 18(2): 298-305.
 
2.
Brady S., Shatkay H. (2008) EpiLoc: a (working) text-based system for predicting protein subcellular location. Pacific Symp. Biocomput. 13: 604-615.
 
3.
Blum T., Briesemeister S., Kohlbacher O. (2009) MultiLoc2: integrating phylogeny and Gene Ontology terms improves subcellular protein localization prediction. BMC Bioinformatics 10(1): 1.
 
4.
Ben-Hur A., Weston J. (2010) A user’s guide to support vector machines. Meth. Mol. Biol. 609: 223-239.
 
5.
Briesemeister S., Rahnenführer J., Kohlbacher O. (2010a) Going from where to why – interpretable prediction of protein subcellular localization. Bioinformatics 26(9): 1232-1238.
 
6.
Briesemeister S., Rahnenführer J., Kohlbacher O. (2010b) YLoc – an interpretable web server for predicting subcellular localization. Nucl. Acids Res. 38(suppl 2): W497-W502.
 
7.
Chou K.C., Cai Y.D. (2002) Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277(48): 45765-45769.
 
8.
Chou K.C., Cai Y.D. (2003) Prediction and classification of protein subcellular location-sequence order effect and pseudo amino acid composition. J. Cell. Biochem. 90(6): 1250-1260.
 
9.
Chou K.C., Shen H.B. (2006) Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers. J. Proteome Res. 5(8): 1888-1897.
 
10.
Chou K.C., Shen H.B. (2007a) Recent progress in protein subcellular location prediction. Anal. Biochem. 370(1): 1-16.
 
11.
Chou K.C., Shen H.B. (2007b) Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. J. Proteome Res. 6(5): 1728-1734.
 
12.
Chou K.C., Shen H.B. (2010) Cell-PLoc 2.0: an improved package of web-servers for predicting subcellular localization of proteins in various organisms. Nat. Sci. 2(10): 1090.
 
13.
Du P., Xu C. (2013) Predicting multisite protein subcellular locations: progress and challenges. Exp. Rev. Proteom. 10(3): 227-237.
 
14.
Emanuelsson O., Nielsen H., Brunak S., von Heijne G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300(4): 1005-1016.
 
15.
Fyshe A., Liu Y., Szafron D., Greiner R., Lu P. (2008) Improving subcellular localization prediction using text classification and the gene ontology. Bioinformatics 24(21): 2512-2517.
 
16.
Gu Q., Ding Y.S., Jiang X.Y., Zhang T.L. (2010) Prediction of subcellular location apoptosis proteins with ensemble classifier and feature selection. Amino Acids 38(4): 975-983.
 
17.
Höglund A., Dönnes P., Blum T., Adolph H.W., Kohlbacher O. (2006) MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics 22(10): 1158-1165.
 
18.
Horton P., Park K.J., Obayashi T., Fujita N., Harada H., Adams-Collier C.J., Nakai K. (2007) WoLF PSORT: protein localization predictor. Nucl. Acids Res. 35(suppl 2): W585-W587.
 
19.
Huang W.L., Tung C.W., Ho S.W., Hwang S.F., Ho S.Y. (2008) ProLoc-GO: utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization. BMC Bioinformatics 9(1): 1.
 
20.
He J., Gu H., Liu W. (2012) Imbalanced multi-modal multi-label learning for subcellular localization prediction of human proteins with both single and multiple sites. PloS One 7(6): e37155.
 
21.
Hasan M.A.M., Nasser M., Pal B. (2013) On the KDD’99 Dataset: support vector machine based intrusion detection system (IDS) with different kernels. Intern. J. Electron. Commun. Comp. Eng. 4(4): 1164-1170.
 
22.
Hasan M.A.M., Nasser M., Pal B., Ahmad S. (2014) Support vector machine and random forest modeling for intrusion detection system (IDS). J. Int. Learn. Syst. Appl. 6(1): 45.
 
23.
Hasan M.A.M., Nasser M., Pal B., Ahmad S., Molla M.K.I. (2015) Prediction of multi-label protein subcellular location using support vector machine with proper kernel selection. Second International Conference on Theory and Application of Statistics 32.
 
24.
King B.R., Guda C. (2007) ngLOC: an n-gram-based Bayesian method for estimating the subcellular proteomes of eukaryotes. Genome Biol. 8(5): R68.
 
25.
Lu Z., Szafron D., Greiner R., Lu P., Wishart D.S., Poulin B., Eisner R. (2004) Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics 20(4): 547-556.
 
26.
Lee K., Chuang H.Y., Beyer A., Sung M.K., Huh W.K., Lee B., Ideker T. (2008) Protein networks markedly improve prediction of subcellular localization in multiple eukaryotic species. Nucl. Acids Res. 36(20): e136-e136.
 
27.
Lin H.N., Chen C.T., Sung T.Y., Ho S.Y., Hsu W.L. (2009) Protein subcellular localization prediction of eukaryotes using a knowledge-based approach. BMC Bioinformatics 10(15): 1.
 
28.
Li L.Q., Kuang H., Zhang Y., Zhou Y., Wang K.F., Wan Y. (2011) Prediction of eukaryotic protein subcellular multilocalisation with a combined KNN-SVM ensemble classifier. J. Comput. Biol. Bioinform. Res. 3(2): 15-24.
 
29.
Li L., Zhang Y., Zou L., Li C., Yu B., Zheng X., Zhou Y. (2012) An ensemble classifier for eukaryotic protein subcellular location prediction using gene ontology categories and amino acid hydrophobicity. PLoS One 7(1): e31057.
 
30.
Mak M.W., Guo J., Kung S.Y. (2008) PairProSVM: protein subcellular localization based on local pairwise profile alignment and SVM. IEEE/ACM Trans. Comput. Biol. Bioinform. 5(3): 416-422.
 
31.
Nakai K., Kanehisa M. (1991) Expert system for predicting protein localization sites in gram negative bacteria. Proteins 11(2): 95-110.
 
32.
Nakashima H., Nishikawa K. (1994) Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol. 238(1): 54-61.
 
33.
Nielsen H., Engelbrecht J., Brunak S., Heijne G.V. (1997) A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int. J. Neural Syst. 8(05n06): 581-599.
 
34.
Nair R., Rost B. (2002) Inferring sub-cellular localization through automated lexical analysis. Bioinformatics 18(suppl. 1): S78-S86.
 
35.
Park K.J., Kanehisa M. (2003) Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 19(13): 1656-1663.
 
36.
Petsalaki E.I., Bagos P.G., Litou Z.I., Hamodrakas S.J. (2006) PredSL: a tool for the N-terminal sequence-based prediction of protein subcellular localization. Genom. Proteom. Bioinform. 4(1): 48-55.
 
37.
Schölkopf B., Smola A.J. (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press.
 
38.
Scott M.S., Thomas D.Y., Hallett M.T. (2004) Predicting subcellular localization via protein motif co-occurrence. Genome Res. 14(10a): 1957-1966.
 
39.
Shatkay H., Höglund A., Brady S., Blum T., Dönnes P., Kohlbacher O. (2007) SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics 23(11): 1410-1417.
 
40.
Shen H.B., Yang J., Chou K.C. (2007) Euk-PLoc: an ensemble classifier for large-scale eukaryotic protein subcellular location prediction. Amino Acids 33(1): 57-67.
 
41.
Shin C.J., Wong S., Davis M.J., Ragan M.A. (2009) Protein-protein interaction as a predictor of subcellular location. BMC Syst. Biol. 3(1): 1.
 
42.
Simha R., Shatkay H. (2014) Protein (multi-) location prediction: using location inter-dependencies in a probabilistic framework. Algorithms Mol. Biol. 9(1): 1.
 
43.
Simha R., Briesemeister S., Kohlbacher O., Shatkay H. (2015) Protein (multi-) location prediction: utilizing interdependencies via a generative model. Bioinformatics 31(12): i365-i374.
 
44.
Tsoumakas G., Katakis I., Vlahavas I. (2009) Mining multilabel data. Data mining and knowledge discovery handbook. Springer US.
 
45.
Vapnik V.N. (1995) The nature of statistical learning theory. Springer-Verlag New York.
 
46.
Wang X., Li G.Z., Liu J.M., Zhao R.W. (2011) Multi-label learning for protein subcellular location prediction. Bioinformatics and Biomedicine (BIBM), 2011 IEEE International Conference 282-285.
 
47.
Wan S., Mak M.W., Kung S.Y. (2012) mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines. BMC Bioinformatics 13(1): 1.
 
48.
Wan S., Mak M.W., Kung S.Y. (2013) GOASVM: a subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou’s pseudo-amino acid composition. J. Theor. Biol. 323: 40-48.
 
49.
Wan S., Mak M.W., Kung S.Y. (2014) HybridGO-Loc: mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins. PloS One 9(3): e89545.
 
50.
Wan S., Mak M.W., Kung S.Y. (2015) mPLR-Loc: An adaptive decision multi-label classifier based on penalized logistic regression for protein subcellular localization prediction. Anal. Biochem. 473: 14-27.
 
51.
Wang X., Zhang J., Li G.Z. (2015) Multi-location gram-positive and gram-negative bacterial protein subcellular localization using gene ontology and multi-label classifier ensemble. BMC Bioinformatics 16(Suppl 12): S1.
 
52.
Xiao X., Wu Z.C., Chou K.C. (2011a) A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites. PloS One 6(6): e20592.
 
53.
Xiao X., Wu Z.C., Chou K.C. (2011b) iLoc-Virus: A multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. J. Theor. Biol. 284(1): 42-51.
 
54.
Yang W.Y., Lu B.L., Yang Y. (2006) A comparative study on feature extraction from protein sequences for subcellular localization prediction. Computational Intelligence and Bioinformatics and Computational Biology, 2006 IEEE Symposium 1-8.
 
55.
Yu C.S., Cheng C.W., Su W.C., Chang K.C., Huang S.W., Hwang J.K., Lu C.H. (2014) CELLO2GO: a web server for protein subCELlular LOcalization prediction with functional gene ontology annotation. PloS One 9(6): e99368.
 
56.
Zou L., Wang Z., Huang J. (2007) Prediction of subcellular localization of eukaryotic proteins using position-specific profiles and neural network with weighted inputs. J. Genet. Genomics 34(12): 1080-1087.
 
57.
Zhang S., Xia X., Shen J., Zhou Y., Sun Z. (2008) DBMLoc: a Database of proteins with multiple subcellular localizations. BMC Bioinformatics 9(1): 127.
 
eISSN:2353-9461
ISSN:0860-7796