Schema Matching and Integration in Large Scale Scenarios

Khalid Saleem 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Semantic matching of schemas in heterogeneous data sharing systems is time consuming and error prone. The dissertation presents a new, robust, automatic method which integrates a large set of domain-specific schemas, represented as tree structures, based upon semantic correspondences among them. The method also creates the mappings from the source schemas to the integrated schema. In addition, the dissertation gives an automatic technique to compute complex matchings between two schemas.

Existing mapping tools employ semi-automatic techniques for mapping two schemas at a time. In a large-scale scenario, where data sharing involves a large number of data sources, such techniques are not suitable. Semi-automatic matching requires user intervention to finalize each mapping. Although this provides the flexibility to compute the best possible mapping, it slows down the whole matching process. The dissertation first gives a detailed discussion of the state of the art in schema matching. We summarize the deficiencies of the currently available tools and techniques in meeting the requirements of large-scale schema matching scenarios. Our approach, PORSCHE (Performance ORiented SCHEma mediation), is set against these shortcomings, and its advantages are supported by experimental results.

The algorithms associated with PORSCHE first cluster the tree nodes based on linguistic label similarity. They then apply a tree mining technique using node ranks calculated during depth-first traversal. This minimises the target node search space and improves time performance, which makes the technique suitable for large-scale data sharing. PORSCHE implements a hybrid approach which, in parallel, incrementally creates an integrated schema encompassing all schema trees and defines mappings from the contributing schemas to the integrated schema. The approach discovers 1:1 mappings for integration and mediation purposes. Experiments on real and synthetic data sets show that PORSCHE scales in time performance for large-scale scenarios. The quality of the mappings and the integrity of the integrated schema are also verified by the experimental evaluation.
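The two steps named above, label clustering and depth-first node ranks, can be illustrated with a minimal sketch. This is not the PORSCHE implementation: the tuple-based tree layout, the function names, the 0.8 threshold and the use of `difflib.SequenceMatcher` as the label similarity measure are all assumptions made for illustration. The rank scheme shown is ordinary pre-order numbering with a subtree end marker, which lets an ancestor test run in constant time.

```python
from difflib import SequenceMatcher

# Hypothetical tree layout: a node is (label, [child nodes]).

def assign_ranks(tree):
    """Number nodes in pre-order depth-first order.

    Each node records its own rank and the rank of the last node
    in its subtree, so ancestor checks need no further traversal.
    """
    nodes = []

    def visit(node, depth):
        label, children = node
        idx = len(nodes)
        nodes.append({"label": label, "rank": idx, "depth": depth, "last": idx})
        for child in children:
            visit(child, depth + 1)
        nodes[idx]["last"] = len(nodes) - 1  # last rank inside this subtree

    visit(tree, 0)
    return nodes

def is_descendant(ancestor, node):
    """True if `node` lies strictly inside the subtree of `ancestor`."""
    return ancestor["rank"] < node["rank"] <= ancestor["last"]

def label_similarity(x, y):
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

def cluster_labels(labels, threshold=0.8):
    """Greedy clustering: a label joins the first cluster whose
    representative (first member) is similar enough, else starts a new one."""
    clusters = []
    for label in labels:
        for cluster in clusters:
            if label_similarity(label, cluster[0]) >= threshold:
                cluster.append(label)
                break
        else:
            clusters.append([label])
    return clusters
```

With ranks in place, a candidate target node can be restricted to the subtree of an already matched ancestor by comparing integers, which is one way the search space can shrink.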

Moreover, we present CMPV (Complex Match Proposition and Validation), a technique for discovering complex matches (1:n, n:1 and n:m) between two schemas, validated by mini-taxonomies. The complex match proposition part is an extended version of the schema matching part of PORSCHE. The mini-taxonomies are extracted from a large set of domain-specific metadata instances represented as tree structures. To support this idea, we propose a framework called ExSTax (Extracting Structurally Coherent Mini-Taxonomies), based on frequent sub-tree mining. It extends the tree mining method of PORSCHE. We further utilise the ExSTax technique to extract a reliable domain-specific taxonomy.
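As a rough illustration of the frequent sub-tree mining idea, the sketch below counts parent-child label pairs across a set of trees and keeps those that occur in at least a minimum number of trees. It covers only the smallest case (sub-trees of two nodes) and is not the ExSTax framework; the tree layout, function names and `min_support` parameter are hypothetical.

```python
from collections import Counter

# Hypothetical tree layout: a node is (label, [child nodes]).

def edges(tree):
    """Yield every (parent label, child label) pair in the tree."""
    label, children = tree
    for child in children:
        yield (label, child[0])
        yield from edges(child)

def frequent_edges(trees, min_support):
    """Return the parent-child pairs present in at least `min_support` trees."""
    counts = Counter()
    for tree in trees:
        counts.update(set(edges(tree)))  # count each pair once per tree
    return {edge for edge, count in counts.items() if count >= min_support}
```

Pairs that survive the support threshold could serve as building blocks for larger candidate sub-trees, in the usual level-wise style of frequent pattern mining.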
Contributor : Khalid Saleem
Submitted on : Monday, January 12, 2009 - 5:49:00 PM
Last modification on : Thursday, May 24, 2018 - 3:59:21 PM
Long-term archiving on: Tuesday, June 8, 2010 - 7:45:01 PM


  • HAL Id : tel-00352352, version 1



Khalid Saleem. Schema Matching and Integration in Large Scale Scenarios. Computer Science [cs]. Université Montpellier II - Sciences et Techniques du Languedoc, 2008. English. ⟨tel-00352352⟩


