GeoSSM Geographic Similarity Calculator
Given two ontology concepts, Semantic Similarity Measures (SSM) return a numerical value reflecting the closeness in meaning between them. Different approaches are available to quantify the semantic similarity between concepts of an ontology represented as a directed acyclic graph (DAG), such as Geo-Net-PT. The part-of relationship structures the geographic ontology as a DAG, which is the typical structure to which SSM can be applied.
One technique commonly used for ontologies represented as a DAG is Information Content (IC), which measures how specific and informative a term is.
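The text does not say which IC-based measure GeoSSM uses. As an illustration only, the sketch below implements two well-known IC-based measures, Resnik (the IC of the most informative common ancestor) and Lin (that value normalized by the IC of both concepts), over a small part-of DAG. The concept names, the DAG and the IC values are all hypothetical.

```python
# Hypothetical part-of DAG: each concept maps to the concepts it is part of.
PARENTS = {
    "root": [],
    "m1": ["root"],   # e.g. a municipality
    "m2": ["root"],   # another municipality
    "p1": ["m1"],     # a parish in m1
    "p2": ["m1"],     # another parish in m1
    "s1": ["m2"],     # a street in m2
}

# Hypothetical IC values: the more specific a concept, the higher its IC.
IC = {"root": 0.0, "m1": 2.0, "m2": 2.0, "p1": 4.0, "p2": 4.0, "s1": 5.0}


def ancestors(concept, parents):
    """Return the concept itself plus every concept reachable via part-of."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def resnik(a, b, parents, ic):
    """IC of the most informative common ancestor of a and b."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return max((ic[c] for c in common), default=0.0)


def lin(a, b, parents, ic):
    """Resnik similarity normalized to [0, 1] by the IC of both concepts."""
    denom = ic[a] + ic[b]
    return 2 * resnik(a, b, parents, ic) / denom if denom > 0 else 0.0


print(lin("p1", "p2", PARENTS, IC))  # 0.5: both parishes share m1
print(lin("p1", "s1", PARENTS, IC))  # 0.0: only the root is shared
```

Concepts that share a specific ancestor score higher than concepts that only share the root, which is the behaviour an SSM over a part-of DAG is expected to show.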
To calculate the Information Content of each concept in Geo-Net-PT, we used the WPT 05. Each geographic concept in Geo-Net-PT is associated with a name, which is represented in three variations:
- capitalized.
- lowercase.
- simple ASCII.
ex: Dão-Lafões, dão-lafões, dao-lafoes
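The three forms in the example above can be derived from the capitalized name alone. The following is a minimal sketch, assuming the simple ASCII form is the lowercase name with diacritics removed; the function name `name_variations` is hypothetical and is not part of Geo-Net-PT.

```python
import unicodedata


def name_variations(name):
    """Return (capitalized, lowercase, simple ASCII) forms of a name."""
    lower = name.lower()
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lower)
    # Dropping the combining marks leaves only the base letters.
    simple_ascii = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return name, lower, simple_ascii


print(name_variations("Dão-Lafões"))
# ('Dão-Lafões', 'dão-lafões', 'dao-lafoes')
```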
We calculated two versions of the Information Content for each concept. One version is based on the number of occurrences of the capitalized version of the concept's geographic name in n-grams extracted from the WPT 05. We could only count frequencies for names of up to 5 words, since we only extracted word n-grams up to the fifth order.
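The n-gram counting step can be sketched as follows. This is an illustration under assumptions, not the pipeline actually used: it takes an already tokenized text, matches names case-sensitively (mirroring the use of the capitalized form), and the function name and the sample sentence are made up.

```python
from collections import Counter

MAX_ORDER = 5  # word n-grams were only extracted up to the fifth order


def count_name_occurrences(tokens, names, max_order=MAX_ORDER):
    """Count how often each name appears as a word n-gram in tokens."""
    wanted = {tuple(name.split()) for name in names}
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram in wanted:
                counts[" ".join(gram)] += 1
    return counts


# Made-up sentence, used only to exercise the function.
tokens = "A rua de Santa Catarina fica longe de Lisboa e de Santa Catarina".split()
counts = count_name_occurrences(tokens, ["Lisboa", "Santa Catarina"])
print(counts["Santa Catarina"], counts["Lisboa"])  # 2 1
```

A name longer than `max_order` words can never match any extracted n-gram, so its count stays at zero, which is the limitation described above.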
The other version is based on the number of documents in WPT 05 that contain at least one reference to the geographic name of a concept. We indexed the Portuguese documents of WPT 05 in Lucene and queried the index with the capitalized version of each geographic name to obtain the number of documents containing it. For both approaches we used all the geographic names, with the exception of postal codes. Geo-Net-PT contains 77 748 unique names, not counting postal codes.
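Either count (n-gram occurrences or matching documents) can be turned into an IC value. The exact formula used for Geo-Net-PT is not given here, so the sketch below assumes the usual definition IC = -log(p), with p estimated from the count and a total, plus add-one smoothing so that names with zero occurrences still get a finite value. The function name, the smoothing and the sample numbers are all assumptions.

```python
import math


def information_content(count, total):
    """IC = -log(p), with p estimated as (count + 1) / (total + 1).

    count: occurrences (or documents) for one geographic name.
    total: all occurrences (or all indexed documents).
    The add-one smoothing is an assumption, not taken from the source.
    """
    if count < 0 or total <= 0 or count > total:
        raise ValueError("expected 0 <= count <= total and total > 0")
    p = (count + 1) / (total + 1)
    return -math.log(p)


# Made-up counts: a frequent name is less informative than a rare one.
frequent = information_content(50_000, 1_000_000)
rare = information_content(12, 1_000_000)
print(frequent < rare)  # True
```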
Semantic Similarity Measures
An SSM function takes two concepts and returns a value reflecting how related they are. For instance, having extracted the terms Lisboa and Santa Catarina from a document, the following associations to concepts in Geo-Net-PT can be made:
- Lisboa is a municipality (ID1)
- Lisboa is a place in the municipality of Monção (ID2)
- Santa Catarina is a civil parish in the municipality of Lisboa (ID3)
- Santa Catarina is a street in the municipality of Porto (ID4)
Calculating the semantic similarity for each pair of concepts with names (Lisboa, Santa Catarina), we obtain:
- SSM(ID1,ID3) = 0.58
- SSM(ID1,ID4) = 0.06
- SSM(ID2,ID3) = 0.06
- SSM(ID2,ID4) = 0.14
For this example, the pair Lisboa (ID1) and Santa Catarina (ID3) has the highest value, meaning that these two concepts are the most geographically related.
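Choosing the most related pair among the candidate concepts can be sketched as below. The scores are the four values listed above; the lookup table stands in for a real SSM function, and the names `best_pair` and `lookup_ssm` are hypothetical.

```python
from itertools import product

# SSM values from the example above, keyed by concept pair.
SCORES = {
    ("ID1", "ID3"): 0.58,
    ("ID1", "ID4"): 0.06,
    ("ID2", "ID3"): 0.06,
    ("ID2", "ID4"): 0.14,
}


def lookup_ssm(a, b):
    """Stand-in for a real SSM function: reads the precomputed value."""
    return SCORES[(a, b)]


def best_pair(candidates_a, candidates_b, ssm):
    """Return the pair of candidate concepts with the highest similarity."""
    return max(product(candidates_a, candidates_b), key=lambda pair: ssm(*pair))


lisboa = ["ID1", "ID2"]          # candidate concepts for "Lisboa"
santa_catarina = ["ID3", "ID4"]  # candidate concepts for "Santa Catarina"
print(best_pair(lisboa, santa_catarina, lookup_ssm))  # ('ID1', 'ID3')
```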
Geo-Net-PT Semantic Similarity Calculator
Daniel Amoedo, Medidas de Semelhança Semântica Aplicadas às Ontologias Geográficas. Master Thesis, University of Lisbon, Faculty of Sciences, September 2010.
David Batista, Mário J. Silva, Francisco Couto, Bibek Behera, Geographic Signatures for Semantic Retrieval. 6th Workshop on Geographic Information Retrieval, Zurich, Switzerland, 18-19 February 2010.