Thesis Type: Doctorate

Institution Of The Thesis: Yildiz Technical University, Graduate School of Natural and Applied Sciences, Turkey

Approval Date: 2019

Thesis Language: English


Consultant: Veli Hakkoymaz


The development of database systems technology has established its own theoretical foundations and enabled a large number of applications. Similarly, the growth of computer networks has made it possible to connect multiple computers for the interchange of data and resources. Because a centralized data management system must serve many users concurrently, data providers can no longer concentrate all data at one large mainframe site. Heavy network traffic and reduced efficiency have forced data to be split across many sites, each with its own storage and local processing capabilities. This led to the development of Distributed Databases (DDBs), which play a significant role today, when dependence on reliable and accurate data has become a necessity. Innovations in hardware, software, protocols, storage and networks have changed business requirements by making DDBs a feasible and practical choice. The strength of distributed databases lies in their ability to deliver interconnected data from any physically separate site to any other site. A Distributed Database Management System (DDBMS) belongs to the class of application software that manages a distributed database and offers transparent access to multiple users across multiple sites by integrating parallelism and modularity. Although DDBs are efficient, their design faces many practical limitations in selecting efficient methods for the fragmentation, allocation and replication of data.

This thesis focuses on developing efficient solutions to DDB design issues. Its main aim is to propose effective schemes for data fragmentation, allocation and replication that improve query processing performance in DDBs. The first approach addresses the fact that empirical data about queries cannot effectively guide fragmentation at the initial design stage of a distributed database, where efficiency can be estimated only from the design itself and the network communication cost between sites. To resolve this issue, an improved hierarchical agglomerative clustering (IHAC) algorithm is proposed to derive a semantic fragmentation of the distributed database. The traditional hierarchical agglomerative clustering algorithm constructs the data representation matrix from data counts or frequencies in order to select and compute similarity measures, whereas IHAC constructs the matrix by considering all data objects. This improves the clustering of the data objects, so that data fragmentation can be achieved efficiently.
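The clustering idea behind this kind of fragmentation can be sketched with plain hierarchical agglomerative clustering. The example below is not the IHAC algorithm of the thesis. It is a minimal illustration that assumes each data object is described by a set of attributes, uses Jaccard similarity with average linkage, and stops at a chosen number of fragments. All names (`rows`, `agglomerate`, `num_fragments`) are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_similarity(c1, c2, rows):
    """Average-linkage similarity between two clusters of object ids."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(jaccard(rows[i], rows[j]) for i, j in pairs) / len(pairs)


def agglomerate(rows, num_fragments):
    """Merge the two most similar clusters until num_fragments remain."""
    clusters = [[obj_id] for obj_id in sorted(rows)]
    while len(clusters) > num_fragments:
        # combinations() yields index pairs (a, b) with a < b
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_similarity(clusters[p[0]], clusters[p[1]], rows),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical data objects, each described by the attributes it carries.
rows = {
    "r1": {"name", "city"},
    "r2": {"name", "city", "phone"},
    "r3": {"video", "length"},
    "r4": {"video", "length", "codec"},
}
fragments = agglomerate(rows, 2)
print(fragments)
```

Each resulting cluster is a candidate fragment: objects that share most of their attributes end up together, so a query touching those attributes is likely to be served by a single fragment.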

The second approach addresses the performance degradation in DDBs caused by the communication cost of remote query access and data retrieval. This cost can be reduced through an efficient data allocation approach that lets queries retrieve data from sites that are accessible at low cost. For this purpose, the Chicken Swarm Optimization (CSO) algorithm is used, which formulates the Data Allocation Problem (DAP) as an optimization problem of choosing, for each data fragment, the sites that incur the minimum communication cost. The CSO algorithm then selects the sites for each data fragment without creating much overhead or diverting data routes. This improves the overall distributed database design and, in turn, supports high-quality replication.
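The objective that such an allocation method minimizes can be sketched as follows. This example does not implement CSO. It assumes a simple cost model (access frequency multiplied by link cost, one copy per fragment) and replaces the swarm search with an exhaustive search, which is only practical for very small instances. The names `access_freq`, `link_cost` and `best_allocation` are hypothetical.

```python
from itertools import product


def allocation_cost(assignment, access_freq, link_cost):
    """Total communication cost of an allocation.

    assignment[f]     -> site hosting fragment f
    access_freq[f][s] -> how often site s accesses fragment f
    link_cost[s][h]   -> cost of one transfer between site s and site h
    """
    total = 0
    for frag, host in enumerate(assignment):
        for site, freq in enumerate(access_freq[frag]):
            total += freq * link_cost[site][host]
    return total


def best_allocation(access_freq, link_cost):
    """Exhaustive baseline: try every fragment-to-site assignment."""
    num_sites = len(link_cost)
    candidates = product(range(num_sites), repeat=len(access_freq))
    return min(candidates, key=lambda a: allocation_cost(a, access_freq, link_cost))


# Two fragments, three sites (illustrative numbers only).
access_freq = [[10, 1, 0], [0, 2, 8]]
link_cost = [[0, 4, 9], [4, 0, 3], [9, 3, 0]]

best = best_allocation(access_freq, link_cost)
print(best, allocation_cost(best, access_freq, link_cost))
```

The search space grows as the number of sites raised to the number of fragments, which is why a metaheuristic is a natural fit once the instance is larger than a toy example.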

The third approach considers the problem of optimal replica selection and placement. First, the snapshot replication and merge replication processes are illustrated for suitable databases. The MGSO approach is then employed to select the number and location of replicas to place in the network. This approach uses the random patterns of read-write requests in a dynamic window mechanism for replication, and models the replication problem as a multi-objective optimization problem that is solved using MGSO.
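The tension between objectives in replica placement can be sketched with a window of recent requests. This example does not implement MGSO or the window mechanism of the thesis. It assumes a fixed-size window, that reads go to the nearest replica, and that writes must reach every replica. The class `RequestWindow` and the function `replica_objectives` are hypothetical.

```python
from collections import deque


class RequestWindow:
    """Keeps only the most recent `size` read/write requests."""

    def __init__(self, size):
        self.events = deque(maxlen=size)

    def record(self, site, kind):
        if kind not in ("read", "write"):
            raise ValueError("kind must be 'read' or 'write'")
        self.events.append((site, kind))

    def read_write_counts(self, site):
        reads = sum(1 for s, k in self.events if s == site and k == "read")
        writes = sum(1 for s, k in self.events if s == site and k == "write")
        return reads, writes


def replica_objectives(replica_sites, window, all_sites, link_cost):
    """Return (read_cost, write_cost) for a candidate replica set.

    Assumed model: a read is served by the nearest replica,
    a write must be propagated to every replica.
    """
    read_cost = 0
    write_cost = 0
    for site in all_sites:
        reads, writes = window.read_write_counts(site)
        read_cost += reads * min(link_cost[site][r] for r in replica_sites)
        write_cost += writes * sum(link_cost[site][r] for r in replica_sites)
    return read_cost, write_cost


link_cost = [[0, 4, 9], [4, 0, 3], [9, 3, 0]]
window = RequestWindow(size=4)
for site, kind in [(0, "read"), (2, "read"), (2, "read"), (1, "write"), (0, "read")]:
    window.record(site, kind)  # the oldest request falls out of the window

one_replica = replica_objectives([0], window, range(3), link_cost)
two_replicas = replica_objectives([0, 2], window, range(3), link_cost)
print(one_replica, two_replicas)
```

Adding a replica lowers the read cost and raises the write cost, so neither candidate dominates the other. A multi-objective optimizer searches for good trade-offs between such objectives rather than a single minimum.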

The proposed techniques are evaluated in a Hadoop cluster environment using dedicated master-slave machines. The evaluations are performed on a large dataset drawn from three major sources (Twitter, Facebook and YouTube) that contains text, audio and video files of varying sizes. The evaluation and comparison results show that the techniques proposed in this thesis perform better than the fragmentation, allocation and replication techniques they were compared against. Hence, it can be concluded that this work significantly enhances the design of DDBs by solving the problems of data fragmentation, data allocation and data replication.