Towards a Generic Approach for Schema Matcher Selection: Leveraging User Pre- and Post-match Effort for Improving Quality and Time Performance

Fabien Duchateau 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Interoperability between applications or bridges between data sources are required to allow optimal information exchanges. Yet, some processes needed to bring this integra- tion cannot be fully automatized due to their complexity. One of these processes is called matching and it has now been studied for years. It aims at discovering semantic corre- spondences between data sources elements and is still largely performed manually. Thus, deploying large data sharing systems requires the (semi-)automatization of this matching process. Many schema matching tools were designed to discover mappings between schemas. However, some of these tools intend to fulfill matching tasks with specific criteria, like a large scale scenario or the discovery of complex mappings. And contrary to ontology alignment research field, there is no common platform to evaluate them. The abundance of schema matching tools, added to the two previously mentioned issues, does not facil- itate the choice, by an user, of the most appropriate tool to match a given scenario. In this dissertation, our first contribution deals with a benchmark, XBenchMatch, to evaluate schema matching tools. It consists of several schema matching scenarios, which features one or more criteria. Besides, we have designed new measures to evaluate the quality of integrated schemas and the user post-match effort. This study and analysis of existing matching tools enables a better understanding of the matching process. Without external resources, most matching tools are mainly not able to detect a mapping between elements with totally dissimilar labels. On the contrary, they cannot infirm a mapping between elements with similar labels. Our second contribu- tion, BMatch, is a matching tool which includes a structural similarity measure and it aims at solving these issues by only using the schema structure. Terminological measures en- able the discovery of mappings whose schema elements share similar labels. Conversely, structural measures, based on cosine measure, detects mappings when schema elements have the same neighbourhood. BMatch's second aspect aims at improving the time per- formance by using an indexing structure, the B-tree, to accelerate the schema matching process. We empirically demonstrate the benefits and the limits of our approach. Like most schema matching tools, BMatch uses an aggregation function to combine similarity values, thus implying several drawbacks in terms of quality and performance. Tuning the parameters is another burden for the user. To tackle these issues, MatchPlanner introduces a new method to combine similarity measures by relying on decision trees. As decision trees can be learned, parameters are automatically tuned and similarity measures are only computed when necessary. We show that our approach provides an increase in terms of matching quality and better time performance with regards to other matching tools. We also present the possibility to let users choose a preference between precision and recall. Even with tuning capabilities, schema matching tools are still not generic enough to provide acceptable quality results for most schema matching scenarios. We finally extend MatchPlanner by proposing a factory of schema matchers, named YAM (for Yet Another Matcher). This tool brings more flexibility since it generates an 'a la carte' matcher for a given schema matching scenario. Indeed, schema matchers can be seen as machine learn- ing classifiers since they classify pairs of schema elements either as relevant or irrelevant. Thus, the best matcher in terms of matching quality is built and selected from a set of different classifiers. We also show impact on the quality when user provides some inputs, namely a list of expert mappings and a preference between precision and recall.
Document type :
Theses
Complete list of metadatas

Cited literature [134 references]  Display  Hide  Download

https://tel.archives-ouvertes.fr/tel-00436547
Contributor : Fabien Duchateau <>
Submitted on : Friday, November 27, 2009 - 2:53:17 AM
Last modification on : Friday, May 17, 2019 - 11:41:30 AM
Long-term archiving on : Tuesday, October 16, 2012 - 3:00:42 PM

Identifiers

  • HAL Id : tel-00436547, version 1

Citation

Fabien Duchateau. Towards a Generic Approach for Schema Matcher Selection: Leveraging User Pre- and Post-match Effort for Improving Quality and Time Performance. Human-Computer Interaction [cs.HC]. Université Montpellier II - Sciences et Techniques du Languedoc, 2009. English. ⟨tel-00436547⟩

Share

Metrics

Record views

523

Files downloads

619