FiBRE aims to develop methods that improve biomedical text mining. The volume of biomedical information published in the scientific literature is growing quickly, making it difficult for researchers to keep up with new publications. The literature is an important source of data for populating databases; however, manual extraction of relevant data by expert database curators is an arduous and time-consuming process. Text mining can provide tools that speed up and improve this process.
In this context, we developed a chemical entity recognition system that follows a machine learning approach based on an implementation of Conditional Random Fields (CRF). The recognized chemical entities can then be mapped to the ChEBI ontology through a resolution module, which uses a lexical similarity method to propose the compound that has been identified. Knowing which terms are present in a document can help validate the entity identification process. The underlying assumption is that a text fragment such as a paragraph usually has a limited scope and therefore normally contains entities that are related to each other. With this in mind, we developed an algorithm that takes as input the entities mapped to ChEBI within a text fragment and searches for relationships between them. For each input entity, the output is a validation score and the most similar entity in the text fragment, obtained by calculating the semantic similarity between the recognized compounds over the ChEBI ontology. The score can be used for validation or filtering, depending on whether it is high or low, and the most similar entity can serve as evidence to support the decision to keep or discard that entity.
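The resolution and validation steps described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the toy ontology, the term names, and the function names `resolve`, `similarity` and `validate` are hypothetical; lexical similarity is approximated with the standard library's `difflib.SequenceMatcher`; and semantic similarity is assumed to be the Jaccard overlap of ancestor sets, since the text does not state which measures are actually used.

```python
from difflib import SequenceMatcher

# Hypothetical toy ontology (term -> is_a parents); not real ChEBI data.
PARENTS = {
    "root": [],
    "alcohol": ["root"],
    "ethanol": ["alcohol"],
    "methanol": ["alcohol"],
    "metal": ["root"],
    "gold": ["metal"],
}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the toy ontology."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def resolve(mention):
    """Propose the ontology term whose name is lexically closest to the mention.

    Assumption: character-level similarity from difflib stands in for the
    lexical similarity method mentioned in the text.
    """
    def lexical(name):
        return SequenceMatcher(None, mention.lower(), name).ratio()

    best = max(PARENTS, key=lexical)
    return best, lexical(best)


def similarity(a, b):
    """Assumed semantic similarity: Jaccard overlap of the two ancestor sets."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def validate(entities):
    """Map each entity to (validation score, most similar other entity).

    The score is the highest similarity to any other entity in the fragment.
    """
    result = {}
    for entity in entities:
        others = [other for other in entities if other != entity]
        if not others:
            result[entity] = (0.0, None)  # nothing to compare against
            continue
        best = max(others, key=lambda other: similarity(entity, other))
        result[entity] = (similarity(entity, best), best)
    return result


if __name__ == "__main__":
    mentions = ["Ethanol", "methanol", "gold"]
    mapped = [resolve(mention)[0] for mention in mentions]
    for entity, (score, closest) in validate(mapped).items():
        print(entity, round(score, 2), closest)
```

In this toy fragment, "ethanol" and "methanol" support each other with a score of 0.5, while "gold" only reaches 0.2, so a threshold between those values would flag "gold" as the entity to review.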
A machine learning method has also been developed for automatically filtering entity recognition errors. The idea is that, although most recognitions may be correct, the entity recognition process will always produce a small number of errors. We therefore developed a method that applies machine learning to the system's own recognition results in order to filter out its errors. This method has been used in Bioalma's Text Detective gene identification tool.
Available at: http://www.lasige.di.fc.ul.pt/webtools/ice/
- Period: 1-Jan-2008 to 31-Dec-2011
- SFRH/BD/36015/2007, Doctoral research scholarship for Tiago Grego
Tiago Grego, Francisco Pinto, Francisco Couto, LASIGE: using Conditional Random Fields and ChEBI ontology. Proceedings of the International Workshop on Semantic Evaluation (SemEval2013) 2013.
Tiago Grego, Francisco Couto 2013: Enhancement of Chemical Entity Identification in Text Using Semantic Similarity Validation. PLOS ONE 8(5), e62984.
Tiago Grego, Francisco Pinto, Francisco Couto, Identifying Chemical Entities based on ChEBI. Proceedings of ICBO 2012.
Tiago Grego, Catia Pesquita, Hugo Bastos, Francisco M. Couto 2012: Chemical Entity Recognition and Resolution to ChEBI. ISRN Bioinformatics (2012), 9.
Tiago Grego, Piotr Pezik, Francisco Couto, Dietrich Rebholz-Schuhmann 2009: Identification of Chemical Entities in Patent Documents. Lecture Notes in Computer Science (5518), 942-949.
Francisco Couto, Tiago Grego, Catia Pesquita, Hugo Bastos, R. Torres, P. Sanchez, L. Pascual, C. Blaschke, Identifying Bioentity Recognition Errors of Rule-Based Text-Mining Systems. IEEE Third International Conference on Digital Information Management (ICDIM) 2008.
Francisco Couto, Tiago Grego, Rafael Torres, Pablo Sanchez, Leandro Pascual, Christian Blaschke, Filtering Bioentity Recognition Errors in Bioliterature using a Case-based Approach. BioLINK SIG at 15th International Conference on Intelligent Systems for Molecular Biology / 6th European Conference on Computational Biology (ISMB/ECCB), July 2007.