Rembrandt



This resource is no longer supported by the XLDB Group. Please check the new location.



REMBRANDT (Reconhecimento de Entidades Mencionadas Baseado em Relações e ANálise Detalhada do Texto, i.e. Named Entity Recognition Based on Relations and Detailed Text Analysis) is a language-dependent named-entity recognition (NER) system that uses Wikipedia as a raw knowledge resource and explores the Wikipedia document structure to classify all kinds of named entities in the text. By using Wikipedia, Rembrandt obtains additional knowledge about every named entity. This knowledge can be useful for understanding the context and detecting relationships with other named entities, and Rembrandt uses it to contextualise and classify surrounding named entities in the text.

Description

One example of this additional knowledge in practice is the use of Wikipedia page categories to derive implicit geographic evidence for each named entity. Rembrandt treats category strings as text sentences and searches them for place names in the same way it does for ordinary text. The captured place names are then listed as implicit geographic evidence for the given named entity.
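The category-scanning step described above can be sketched as follows. This is a minimal illustration, not Rembrandt's actual code (which is written in Groovy): the `GAZETTEER` list and the function name `implicit_geographic_evidence` are assumptions made for the example, and a plain gazetteer lookup stands in for Rembrandt's own place-name recognition.

```python
import re

# Hypothetical gazetteer; a real system would rely on a much larger resource.
GAZETTEER = ["New York City", "New York", "United States", "Portugal", "Lisbon"]


def implicit_geographic_evidence(categories):
    """Return place names found in category strings, in order of first appearance."""
    # Longer names come first so that "New York City" wins over "New York".
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    found = []
    for category in categories:
        # Each category string is scanned as if it were an ordinary sentence.
        for match in pattern.finditer(category):
            place = match.group(1)
            if place not in found:
                found.append(place)
    return found


categories = [
    "Skyscrapers in New York City",
    "Office buildings in the United States",
]
print(implicit_geographic_evidence(categories))
# prints ['New York City', 'United States']
```

Sorting the gazetteer by length is a simple way to prefer the most specific place name when one name is a prefix of another.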

Classification categories

Rembrandt currently classifies named entities using the 9 main categories and 47 sub-categories defined by the second edition of HAREM, a NER system evaluation contest for Portuguese. The main categories are: PERSON, ORGANIZATION, PLACE, DATETIME, VALUE, ABSTRACTION, EVENT, THING and MASTERPIECE. Rembrandt handles vagueness in named entities by tagging them with more than one category or sub-category.
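One way to represent an entity that carries more than one category is sketched below. The class names `Classification` and `NamedEntity` are hypothetical and do not come from Rembrandt's code base; only the nine main category names are taken from the text above. The choice of PLACE and ORGANIZATION for "Portugal" is an illustrative assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The nine main categories listed above.
MAIN_CATEGORIES = {
    "PERSON", "ORGANIZATION", "PLACE", "DATETIME", "VALUE",
    "ABSTRACTION", "EVENT", "THING", "MASTERPIECE",
}


@dataclass(frozen=True)
class Classification:
    category: str
    subcategory: Optional[str] = None  # sub-category names are not modelled here

    def __post_init__(self):
        if self.category not in MAIN_CATEGORIES:
            raise ValueError(f"unknown main category: {self.category}")


@dataclass
class NamedEntity:
    text: str
    classifications: List[Classification] = field(default_factory=list)

    def add(self, category, subcategory=None):
        """Attach a classification, ignoring exact duplicates."""
        candidate = Classification(category, subcategory)
        if candidate not in self.classifications:
            self.classifications.append(candidate)

    def is_vague(self):
        """A vague entity carries more than one classification."""
        return len(self.classifications) > 1


# Illustrative assumption: a country name read both as a place and as an organization.
entity = NamedEntity("Portugal")
entity.add("PLACE")
entity.add("ORGANIZATION")
print(entity.is_vague())
# prints True
```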

REMBRANDT's core strategy

The REMBRANDT classification strategy relies on mapping each named entity to a Wikipedia page and then analysing its document structure, links and categories in search of suggestive evidence. Rembrandt also relies on manually crafted rules that capture internal and external evidence of named entities, for both Portuguese and English texts. These rules are used to classify named entities that were not mapped to a Wikipedia page, or were mapped to a page with insufficient information, and to contextualise named entities that have a different meaning in context (for example, in "I live in Portugal street", the named entity Portugal designates a street, not a country).

The classification is best illustrated by following how an example named entity, Empire State Building, is handled. The English Wikipedia page of the Empire State Building (en.wikipedia.org/wiki/Empire_State_Building) is labelled with 10 categories, such as Skyscrapers in New York City and Office buildings in the United States. With this information, Rembrandt classifies the named entity as PLACE/HUMAN/CONSTRUCTION. In the hypothetical case that this named entity could not be mapped to a Wikipedia page, internal evidence rules, such as the presence of the term Building at the end, can still classify it as PLACE/HUMAN/CONSTRUCTION. Finally, external evidence rules check the context in which the named entity occurs, ensuring that it is not being used in another sense (for example, as a hypothetical movie, street or restaurant name).

For the detection of implicit geographic evidence, the categories Skyscrapers in New York City and Office buildings in the United States are handled by Rembrandt as additional text, and the place names New York City and United States are captured and listed as implicit geographic evidence associated with the named entity Empire State Building.
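The order of evidence in the Empire State Building walkthrough can be sketched as a small cascade. This is a simplified illustration under stated assumptions, not Rembrandt's rule engine: the cue tables `CATEGORY_CUES`, `INTERNAL_SUFFIXES` and `CONTEXT_CUES`, the `Result` type and the `classify` function are all hypothetical, and the external check only reports that the context suggests a different sense without choosing a new label.

```python
from typing import NamedTuple, Optional

CONSTRUCTION = "PLACE/HUMAN/CONSTRUCTION"

# Hypothetical cue tables; the real rules are hand-crafted and far richer.
CATEGORY_CUES = {"skyscrapers": CONSTRUCTION, "office buildings": CONSTRUCTION}
INTERNAL_SUFFIXES = {"Building": CONSTRUCTION}
CONTEXT_CUES = {"movie", "street", "restaurant"}


class Result(NamedTuple):
    label: Optional[str]
    evidence: str  # "wikipedia", "internal", "external" or "none"


def classify(entity, wikipedia_categories=(), before="", after=""):
    label, evidence = None, "none"

    # 1. Evidence from the categories of the mapped Wikipedia page, if any.
    for category in wikipedia_categories:
        for cue, cue_label in CATEGORY_CUES.items():
            if label is None and category.lower().startswith(cue):
                label, evidence = cue_label, "wikipedia"

    # 2. Internal evidence: a telling last word in the entity itself.
    words = entity.split()
    if label is None and words and words[-1] in INTERNAL_SUFFIXES:
        label, evidence = INTERNAL_SUFFIXES[words[-1]], "internal"

    # 3. External evidence: a neighbouring word signalling another sense.
    neighbours = before.split()[-1:] + after.split()[:1]
    if {w.strip(".,;:").lower() for w in neighbours} & CONTEXT_CUES:
        return Result(None, "external")

    return Result(label, evidence)


categories = ["Skyscrapers in New York City", "Office buildings in the United States"]
print(classify("Empire State Building", categories))
print(classify("Empire State Building"))
print(classify("Portugal", before="I live in", after="street"))
```

The first call is resolved from Wikipedia categories, the second falls back to the internal "Building" rule, and the third is flagged by its context, mirroring the "Portugal street" example.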

REMBRANDT evaluation

A first prototype of Rembrandt recently participated in the second edition of HAREM. It obtained an F-measure of 56.7% on the full NER task, ranking 2nd out of 10 systems, and an F-measure of 62.5% on the PLACE-only scenario, ranking 1st out of 8 systems. It also achieved the best results in the ReRelEM subtask, an entity relation detection task.

Disclaimer

Rembrandt was developed by Nuno Cardoso as part of his PhD work. Rembrandt is free to use and is released under the GPL. Feel free to test it and walk through the code, although the manual is still being written. Rembrandt is written in Groovy. To run, it requires a computer with a Java 6 VM environment, some JAR and Perl module dependencies, a database, and Perl >5.8.0. Rembrandt is compatible with Hadoop, so processing can be distributed among several machines.

Future of REMBRANDT

Version 0.9 is in preparation. It will have:

  • A client-server architecture, for web services.
  • Fewer dependencies, namely through an adaptation of the Linguateca tokenizer into Java.
  • DBpedia usage, with the ability to classify over the DBpedia ontology.
  • A better rule engine.

Downloads


The current version is Rembrandt 0.8.3. Please refer to the Rembrandt download page.

Other dependencies and downloads


  • Wikipedia's database for Portuguese (MySQL dump, Gzip, 185MB)
  • Wikipedia's database for English: still not available. Contact me for more details on building your own English database.
  • SQL EN links information: still not available.

How to Cite


Please cite:

Nuno Cardoso, REMBRANDT - Reconhecimento de Entidades Mencionadas Baseado em Relações e ANálise Detalhada do Texto. In Cristina Mota & Diana Santos (eds.). Desafios na avaliação conjunta do reconhecimento de entidades mencionadas: O Segundo HAREM. Linguateca. 2008
