Renoir

From XLDB

Renoir. Taken from the EN Wikipedia page of Renoir

RENOIR (REMBRANDT's Extended NER On Interactive Retrievals) is an interactive tool for performing semantic queries. RENOIR makes extensive use of REMBRANDT, a named entity recognition module that explores Wikipedia's document structure, links and categories to identify and classify named entities (NEs) in texts written in Portuguese and English.

The goal is to research new semantically-flavored query approaches for describing information needs, and to take advantage of a semantic retrieval engine in order to perform more complex queries, such as "Portuguese rivers that bathe cities with more than 150,000 people" or "Places where Goethe lived".

RENOIR is developed by Nuno Cardoso in the scope of his PhD. RENOIR will soon be released under the GPL. Feel free to join in: ask for a copy of the snapshot JARs, test it and walk through the code, although the manual is still being written. RENOIR is written in Groovy, and it requires a computer with a Java 6 VM environment, some JAR and Perl module dependencies, a database and a Perl version newer than 5.8.0.

Motivation for RENOIR

Classic IR limitations

RENOIR aims to provide a testbed for experimenting with new ways to query search engines. The typical interaction between users and search engines is confined to submitting a query string and receiving a result list. For simple queries, such as navigational queries (that is, looking for a specific named page, such as the EU homepage) or hot topic queries (for instance, news about results in the Olympic Games), this search model suffices to retrieve documents that the user is likely to find relevant.

Nonetheless, for more elaborate information needs, the user may have a hard time expressing the need in a few terms (for example, "I want to read biographies about painters who died before they were 40 years old"). The query string may reflect this difficulty by being vague or ambiguous, or too strict for term-matching search engines. Searches commonly become a sequence of trial-and-error interactions in which the user refines the query each time, struggling to find the magic keyword list that will retrieve the documents that contain the information.

These failed searches happen mostly because search engines do not understand what we want. Search engines are normally built on statistical models in which terms are the matching units, and term frequencies over documents and collections serve as approximate measures of relevance. There is no effort to understand the subjects at stake before performing the retrieval.

Towards semantic queries

As we become more dependent on search tools to satisfy our needs, information retrieval is being challenged to meet the demand for more semantically-aware searches. For instance, there is special interest in question answering (QA) systems, which aim to respond with simple facts or entities to user queries that are intended to have a specific answer, as in "what is the name of the Prime Minister of Canada". Geographic information retrieval (GIR) is also drawing attention, as a way to better handle searches that have a geographic area of interest, as in "prison riots in Brazil". These research areas have a common trend: they rely on natural language techniques to parse text and extract knowledge from it, and use this knowledge to reason about the answer.

One example is the way search engines deal with terms that might refer to different entities, depending on the context. For example, given the queries Jack London and Hotels London, it is clear to a human that the term London is used in different contexts. For the search engine, the term London is treated in the same way in both queries, even though the first one does not refer to the city, because search engines are not able to understand the true meaning of the entities at stake.

Now, if the search engine allowed some sort of semantic labels in its query syntax, so that it were possible to issue queries such as PERSON:Jack London or Hotels LOCAL:London, we could better describe our search context in the query string. Assuming that the search engine had previously processed all documents, recognizing all named entities and disambiguating their meaning, the retrieval algorithm could use this semantic layer on documents and queries to confine the retrieval to the documents with London in the desired context, which would likely improve the retrieval results and the search experience.
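The idea of confining retrieval to an entity in a given context can be sketched in a few lines of Python. This is only an illustration, not RENOIR's implementation: the `DOCS` collection, the document ids and the `retrieve` function are hypothetical, and the annotations are assumed to have been produced beforehand by a named entity recognizer.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical pre-annotated collection: document id -> set of
# (category, entity) pairs found in that document.
Annotations = Dict[str, Set[Tuple[str, str]]]

DOCS: Annotations = {
    "doc1": {("PERSON", "Jack London")},
    "doc2": {("LOCAL", "London")},
    "doc3": {("LOCAL", "London"), ("PERSON", "Jack London")},
}


def retrieve(docs: Annotations, category: str, entity: str) -> List[str]:
    """Return ids of documents that mention `entity` with the given category."""
    return sorted(doc_id for doc_id, nes in docs.items()
                  if (category, entity) in nes)


# Only documents where London was recognized as a place are returned.
places = retrieve(DOCS, "LOCAL", "London")
people = retrieve(DOCS, "PERSON", "Jack London")
```

A plain term match on London would return all three documents; the labelled lookup leaves out doc1, where London only occurs as part of a person's name.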

The queries can be even more complex, and can even be used as questions to be answered. For example, if we are interested in information about the rivers of Morocco, we could issue a query such as RIVER:? LOCAL:Morocco, and let the system reason that all documents referring to rivers and also to places in Morocco are relevant. This shows how we can switch from a plain list of terms to a query in which semantic labels also convey significant search criteria.
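A labelled query string such as the ones above could be split into its parts as follows. This is a minimal sketch under stated assumptions, since the page does not define the query grammar: a label is taken to be an uppercase word followed by a colon, its value runs until the next label or the end of the string, and a value of `?` marks an open slot to be answered. The names `QueryPart` and `parse_query` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class QueryPart(NamedTuple):
    label: Optional[str]  # e.g. "PERSON"; None for unlabelled terms
    value: Optional[str]  # None marks an open slot written as "?"


# Assumption: a label is an uppercase word immediately followed by a colon.
_LABEL = re.compile(r"\b([A-Z]+):")


def parse_query(query: str) -> List[QueryPart]:
    """Split a query into unlabelled terms and (label, value) parts."""
    parts: List[QueryPart] = []
    matches = list(_LABEL.finditer(query))

    # Anything before the first label is treated as plain terms.
    head_end = matches[0].start() if matches else len(query)
    head = query[:head_end].strip()
    if head:
        parts.append(QueryPart(None, head))

    # Each label owns the text up to the next label (or end of string).
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(query)
        value = query[m.end():end].strip()
        parts.append(QueryPart(m.group(1), None if value == "?" else value))
    return parts


print(parse_query("RIVER:? LOCAL:Morocco"))
print(parse_query("Hotels LOCAL:London"))
```

Under these assumptions, RIVER:? LOCAL:Morocco yields an open RIVER slot plus a LOCAL constraint, which a retrieval component could then try to fill.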

Integration of RENOIR

RENOIR was initially designed to work with REMBRANDT, profiting from the semantic layer that REMBRANDT generates when recognizing named entities in the document. This information can be mapped to Wikipedia pages and further knowledge can be extracted from it, so that we can build a web of knowledge that pushes the boundaries of search once again to another level. This is in fact one of the goals of the GikiP evaluation pilot task, where RENOIR was used to answer questions such as Places where Goethe lived or French bridges built between 1980 and 1990.

Nonetheless, RENOIR is meant to be a testbed for developing semantic queries, and to interact with any kind of machine-understandable knowledge that can assist the retrievals. As such, REMBRANDT is just an annotation tool that feeds RENOIR with classified named entities. Other interesting projects can be coupled to RENOIR, namely the DBpedia.org project. DBpedia aims to extract knowledge from Wikipedia in the form of RDF triples that relate two concepts, such as Norway part-of Europe, or Bear is-a mammal. They claim to have 100 million triples, and have a UI that allows queries in SPARQL, an SQL-like query language for RDF data.
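To make the subject-predicate-object statements above concrete, here is a small Python sketch that stores the two examples from the text as tuples and looks them up by pattern. It does not use DBpedia or any RDF library; `TRIPLES` and `match` are hypothetical names introduced only for illustration.

```python
from typing import Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# The two examples from the text, stored as plain tuples.
TRIPLES: List[Triple] = [
    ("Norway", "part-of", "Europe"),
    ("Bear", "is-a", "mammal"),
]


def match(triples: Iterable[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every field that is not None."""
    pattern = (subject, predicate, obj)
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]


# "What is part of Europe?": the subject is left open.
print(match(TRIPLES, predicate="part-of", obj="Europe"))
```

The fields left as `None` play roughly the role that variables play in a SPARQL pattern: they match anything, and the returned triples show what they were bound to.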

Evaluation of RENOIR

RENOIR participated in the GikiP evaluation pilot task, organized by Linguateca in the scope of the GeoCLEF task of the CLEF evaluation conference. The task involved 15 questions, available in Portuguese, English and German, and required retrieving Wikipedia documents that satisfy the questions. RENOIR achieved an average precision of 0.554 and a score of 10.946 for the overall results in Portuguese and English. Although the run was generated semi-automatically, RENOIR is also intended to be fully automatic, as well as interactive when needed.

Details on RENOIR

RENOIR works over a collection of Wikipedia documents, allowing the execution of query procedures, that is, groups of pipelined actions that describe a given information need. A query action can be performed automatically (RENOIR performs the action alone), semi-automatically (the action is supervised) or manually (the action is carried out by hand).

  1. Retrieval actions
    1. SEARCH TERM: Performs a simple term query search, and returns a list of Wikipedia documents.
    2. SEARCH CATEGORY: Searches the Wikipedia dumps for documents with the given Wikipedia category, and returns a list of Wikipedia documents.
    3. SEARCH INLINKS: Searches the Wikipedia snapshots for documents that link to a given Wikipedia document.
  2. Mapping actions
    1. MAP DOC: Maps a document from the Wikipedia dump to its counterpart in another collection.
    2. MAP NE: Maps a NE to its corresponding document in the Wikipedia collection.
  3. Annotation actions
    1. REMBRANDT: Annotates selected Wikipedia document(s) with Rembrandt, generating lists of NEs for each document.
    2. REMB. DOC TO NE: Invokes Rembrandt to classify the title of a given Wikipedia document, generating the respective NE.
  4. Filtering actions
    1. FILT. NE BY TYPE: Filters a list of NEs by a given classification category, generating a subset of NEs.
    2. FILT. DOC BY TERM: Filters a list of Wikipedia documents by whether they contain (or do not contain) a given term/pattern.
    3. FILT. DOC BY EVAL: Filters a list of Wikipedia documents by evaluating a condition on a given subset of NEs. For instance, whether the document has a number NE greater than 1000, or whether it has a place name NE within Europe.
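The way such actions chain into a query procedure can be sketched as follows. This is a toy illustration, not RENOIR's code (which is written in Groovy): every function name, the `CATEGORIES` and `ANNOTATIONS` tables, the document names and the numbers are made up, and the annotation step merely looks up canned results where REMBRANDT would do real work.

```python
from typing import Callable, Dict, List, Tuple

NE = Tuple[str, object]  # (category, value)

# Made-up stand-ins for a Wikipedia collection and for annotator output.
CATEGORIES: Dict[str, List[str]] = {
    "Example category": ["Doc A", "Doc B", "Doc C"],
}
ANNOTATIONS: Dict[str, List[NE]] = {
    "Doc A": [("NUMBER", 2000), ("LOCAL", "Europe")],
    "Doc B": [("NUMBER", 500)],
    "Doc C": [("NUMBER", 10), ("NUMBER", 1500)],
}


def search_category(category: str) -> List[str]:
    """Retrieval action: documents that carry the given category."""
    return list(CATEGORIES.get(category, []))


def annotate(docs: List[str]) -> Dict[str, List[NE]]:
    """Annotation action: a list of NEs for each document (canned here)."""
    return {doc: ANNOTATIONS.get(doc, []) for doc in docs}


def filter_ne_by_type(nes: List[NE], category: str) -> List[NE]:
    """Filtering action: keep only the NEs of one category."""
    return [ne for ne in nes if ne[0] == category]


def filter_docs_by_eval(annotated: Dict[str, List[NE]], category: str,
                        condition: Callable[[object], bool]) -> List[str]:
    """Filtering action: keep documents where some NE of `category` passes."""
    return [doc for doc, nes in annotated.items()
            if any(condition(value)
                   for _, value in filter_ne_by_type(nes, category))]


# A three-step procedure: retrieve, annotate, then filter by a condition.
docs = search_category("Example category")
annotated = annotate(docs)
result = filter_docs_by_eval(annotated, "NUMBER", lambda v: v > 1000)
print(result)
```

Each step consumes the output of the previous one, which is what makes it possible to run some steps automatically and to supervise or replace others by hand.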

Downloads


Soon.

Other dependencies and downloads


Quick start


To come soon.
