VDB-MR: MapReduce-Based Distributed Data Integration Using Virtual Database

Abstract

Data Integration is becoming very important in many commercial applications and scientific research. A lot of algorithms and systems have been proposed and developed to address related issues from different aspects. Virtual database systems are well-recognized as one of the effective solutions of data integration. The existing execution modules in virtual database systems are very ineffective. MapReduce (MR) is a new computing model for parallel processing and has a good performance on large-scale data execution. In this paper, we propose a new distributed data integration system, called VDB-MR, which is based on the MapReduce technology, to efficiently integrate data from heterogeneous data sources. With VDB-MR, a unified view (i.e., a single virtual database) of multiple databases can be provided to users. We also conducted a series of experiments to evaluate VDB-MR by comparing it with an open source data integration system OGSA-DAI and two DBMSs in parallel. Experiment results show that VDB-MR significantly outperforms OGSA-DAI and the DBMSs in parallel.

Publication
Future Generation Computer Systems, 2010, Vol. 26, No. 8, pp.1418-1425