TY - GEN
T1 - Unsupervised identification of redundant domain entries in InterPro database using clustering techniques
AU - Rifaioğlu, Ahmet Süreyya
AU - Doğan, Tunca
AU - Can, Tolga
N1 - Publisher Copyright: Copyright 2015 ACM.
PY - 2015/9/9
Y1 - 2015/9/9
N2 - InterPro is a widely used database that integrates functional signatures provided by different protein sequence annotation databases with manual curation in order to present a comprehensive database of functional sequence annotation. However, the integration of the signatures causes inconsistent and/or redundant annotations in some cases. In this study, we propose an unsupervised method for the automatic detection of inconsistent and redundant entries in the InterPro database. Two clustering methods, the Markov Cluster Algorithm (MCL) and hierarchical clustering, are employed to investigate the extent to which these entries can be detected. Results show that a considerable proportion (~75%) of the redundant entries can be identified. The future goal is to develop a system that identifies redundant and inconsistent signatures with very high performance using supervised machine learning techniques. The findings of the study may help InterPro curators fix the problematic entries. They may also serve curators as a road map before the integration of new signatures.
KW - Clustering
KW - HMM alignments
KW - Hidden Markov models
KW - Protein sequence databases
KW - Redundancy analysis
KW - Similarity detection
UR - https://www.scopus.com/pages/publications/84963579180
U2 - 10.1145/2808719.2811430
DO - 10.1145/2808719.2811430
M3 - Conference contribution
AN - SCOPUS:84963579180
T3 - BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
SP - 505
EP - 506
BT - BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
PB - Association for Computing Machinery, Inc
T2 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2015
Y2 - 9 September 2015 through 12 September 2015
ER -