Development and Verification of Distributed (Federated) Data Models

Introduction

Distributed databases and federated databases are two types of data management systems that handle large and heterogeneous data sources in a distributed environment. They offer many benefits, such as improved availability, scalability, and data quality, but they also face many challenges, such as increased complexity, reduced consistency, and network failures. This essay will explore how these systems are developed and verified, focusing on their design, implementation, performance, architecture, characteristics, and verification methods.

Body

Development of Distributed Databases

A distributed database (DDB) is a single logical database stored across multiple networked nodes or sites [3]. The nodes or sites may be far apart or on the same LAN. The distributed database system manages the DDB and lets users and applications access and manipulate the data.

The design of a DDB involves data fragmentation, data allocation, data replication, and data distribution [4]. Data fragmentation splits the data into smaller fragments that can be stored at different nodes or sites based on some criterion (e.g., horizontal fragmentation by rows or vertical fragmentation by columns). Data allocation assigns the fragments to the nodes or sites (e.g., static or dynamic allocation). Data replication creates and maintains multiple copies of the same fragment at different nodes or sites to improve availability, fault tolerance, throughput, latency, or scalability. Data distribution determines how the data is accessed and manipulated across the nodes or sites (e.g., centralized or decentralized distribution).

The implementation of a DDB requires components such as a data dictionary, a query processor, a transaction manager, a concurrency control component, and a recovery manager. The data dictionary stores the metadata about the structure, location, fragmentation, allocation, replication, and distribution of the data in the DDB. The query processor translates queries from users or applications into subqueries that can be executed at different nodes or sites. The transaction manager coordinates the execution of transactions across the nodes or sites while ensuring atomicity, consistency, isolation, and durability (the ACID properties). The concurrency control component prevents conflicts (e.g., lost updates, dirty reads, unrepeatable reads) and deadlocks (e.g., circular waiting) among concurrent transactions. The recovery manager restores the data to a consistent state after a failure by using techniques such as logging, checkpointing, rollback, and commit protocols.
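
The fragmentation step described above can be illustrated with a small sketch. It is only a minimal illustration under stated assumptions: the sample relation, its column names, and the helper functions are hypothetical and do not come from any particular DDB product. It shows that horizontal fragments are recombined by union and vertical fragments by a join on the shared key.

```python
from typing import Callable

Row = dict[str, object]

# Hypothetical sample relation; names and values are illustrative only.
CUSTOMERS: list[Row] = [
    {"id": 1, "name": "Ada", "region": "EU", "balance": 120.0},
    {"id": 2, "name": "Bo", "region": "US", "balance": 75.5},
    {"id": 3, "name": "Cy", "region": "EU", "balance": 10.0},
]


def horizontal_fragment(rows: list[Row], predicate: Callable[[Row], bool]) -> list[Row]:
    """Keep whole rows that satisfy the predicate (a selection)."""
    return [dict(r) for r in rows if predicate(r)]


def vertical_fragment(rows: list[Row], key: str, columns: list[str]) -> list[Row]:
    """Keep a subset of columns, always including the key (a projection)."""
    return [{key: r[key], **{c: r[c] for c in columns}} for r in rows]


def union(*fragments: list[Row]) -> list[Row]:
    """Reconstruct a relation from its horizontal fragments."""
    return [dict(r) for fragment in fragments for r in fragment]


def join_on_key(left: list[Row], right: list[Row], key: str) -> list[Row]:
    """Reconstruct a relation from two vertical fragments via the shared key."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key[l[key]]} for l in left if l[key] in right_by_key]


# Horizontal: one fragment per region, e.g. stored at a site in that region.
eu_rows = horizontal_fragment(CUSTOMERS, lambda r: r["region"] == "EU")
us_rows = horizontal_fragment(CUSTOMERS, lambda r: r["region"] == "US")

# Vertical: profile columns at one site, financial columns at another.
profile = vertical_fragment(CUSTOMERS, "id", ["name", "region"])
finance = vertical_fragment(CUSTOMERS, "id", ["balance"])
```

Both fragmentations are lossless in this sketch because the horizontal predicates cover every row exactly once and every vertical fragment keeps the key.
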

The performance of a DDB depends on factors such as the degree of data fragmentation and the data allocation strategy, both of which affect queries and transactions. Fine-grained fragmentation may increase overhead, because more fragments must be located and recombined, while coarse-grained fragmentation may reduce parallelism and load balancing. Similarly, static allocation may be more efficient for stable workloads, while dynamic allocation adapts better to changing workloads.
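
Static allocation combined with replication can be sketched as follows. This is an assumed round-robin placement chosen for simplicity, not a strategy prescribed by the cited sources; the fragment names, site names, and function names are hypothetical.

```python
from collections import Counter


def allocate(fragments: list[str], nodes: list[str], replication_factor: int) -> dict[str, list[str]]:
    """Statically place each fragment on `replication_factor` distinct nodes, round-robin."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and the number of nodes")
    placement = {}
    for i, fragment in enumerate(fragments):
        placement[fragment] = [nodes[(i + k) % len(nodes)] for k in range(replication_factor)]
    return placement


def node_load(placement: dict[str, list[str]]) -> dict[str, int]:
    """Count how many fragment copies each node stores."""
    return dict(Counter(node for replicas in placement.values() for node in replicas))


def reachable_fragments(placement: dict[str, list[str]], failed_nodes: set[str]) -> set[str]:
    """Fragments that still have at least one replica on a surviving node."""
    return {
        fragment
        for fragment, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    }


placement = allocate(["F1", "F2", "F3"], ["site_a", "site_b", "site_c"], replication_factor=2)
```

With two copies per fragment, every fragment in this sketch stays reachable when a single site fails, at the cost of storing and updating twice as much data. A dynamic strategy would recompute the placement as the workload changes.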

Development of Federated Databases

A federated database (FD) is a system that integrates autonomous databases into a single logical database with a unified view of the data [2]. The autonomous databases can be heterogeneous (having different data models, schemas, or semantics) or homogeneous (having the same data model, schema, or semantics). The federated database system (FDBS) manages the FD and provides an interface for users and applications. To understand an FD, we need to consider:

• The local and global databases. A local database is managed by a local DBMS and has its own data model, schema and semantics. A global database is derived from the local databases by applying integration techniques and has a global data model, schema, semantics, and interface. The federated schema defines the structure and meaning of the data in the FD. The federated query is a query in terms of the federated schema that can be executed over the FD. The federated transaction is a transaction that accesses or modifies data in different local databases while ensuring ACID properties.

• The architecture of an FD. It involves three layers: the local layer (the local databases and their local DBMSs), the integration layer (the global database and its FDBS), and the application layer (the users and applications that interact with the FD).

• The characteristics of an FD. They involve three aspects: autonomy (the degree of independence of the local databases from each other and from the FDBS), heterogeneity (the degree of difference among the local databases in their data models, schemas, semantics, etc.), and distribution (the degree of physical dispersion of the local databases across networked nodes or sites). For example, full autonomy means each local database retains control over its own data, while partial autonomy means the FDBS controls some aspects of it. Syntactic heterogeneity means the local databases use different languages or formats for their data, while semantic heterogeneity means the local databases attach different meanings or interpretations to their data. Geographic distribution means the local databases are in different regions or countries, while organizational distribution means the local databases belong to different entities or domains.
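
The relationship between local schemas, the federated schema, and a federated query can be sketched with two in-memory SQLite databases standing in for autonomous local databases. Everything here is an illustrative assumption: the table names, column names, sample rows, and the `MAPPINGS` dictionary are hypothetical, and the sketch covers read-only queries only, not federated transactions.

```python
import sqlite3


def make_local_a() -> sqlite3.Connection:
    """Hypothetical local database A: full names, salary as a decimal amount."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (emp_id INTEGER, full_name TEXT, salary REAL)")
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?, ?)",
        [(1, "Ada Lovelace", 5200.0), (2, "Bo Chen", 3100.0)],
    )
    return conn


def make_local_b() -> sqlite3.Connection:
    """Hypothetical local database B: split names, pay stored in cents."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staff (id INTEGER, first TEXT, last TEXT, pay_cents INTEGER)")
    conn.executemany(
        "INSERT INTO staff VALUES (?, ?, ?, ?)",
        [(7, "Cy", "Diaz", 450000), (8, "Di", "Evans", 280000)],
    )
    return conn


# Integration layer: each mapping expresses a local schema in terms of the
# federated schema (name, salary), resolving naming and scale differences.
MAPPINGS = {
    "site_a": "SELECT full_name AS name, salary AS salary FROM employees",
    "site_b": "SELECT first || ' ' || last AS name, pay_cents / 100.0 AS salary FROM staff",
}


def federated_query(sources: dict[str, sqlite3.Connection], min_salary: float) -> list[dict]:
    """Run one query, phrased against the federated schema, at every local database."""
    results = []
    for site, conn in sources.items():
        sql = f"SELECT name, salary FROM ({MAPPINGS[site]}) AS v WHERE salary >= ?"
        for name, salary in conn.execute(sql, (min_salary,)):
            results.append({"source": site, "name": name, "salary": salary})
    return sorted(results, key=lambda row: row["name"])


sources = {"site_a": make_local_a(), "site_b": make_local_b()}
rows = federated_query(sources, min_salary=4000)
```

The local databases keep their own schemas and data, which reflects their autonomy; only the mappings in the integration layer know how each one relates to the federated schema.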

Verification of Distributed (Federated) Databases

Distributed (federated) databases store and manage data across multiple heterogeneous sources, such as servers, networks, platforms, or organizations. They aim to give users a unified and consistent view of the data, regardless of its location and format. However, verifying the data in these databases is challenging, because it requires addressing several issues that arise from the distributed and heterogeneous nature of the sources.

One issue is data heterogeneity [1], which means the differences in data models, schemas, semantics, and formats across sources. For instance, one source may use a relational data model while another uses a non-relational one. Another source may have a different name, definition, or type for the same attribute or entity. A third source may use a different unit, scale, or format for the same value or measurement. To verify the data, these differences need to be resolved or reconciled by using techniques like schema matching, schema mapping, ontology alignment, or data transformation.

Another issue is data security [1], which means the protection of the data from unauthorized access or modification. For example, a source may contain sensitive or confidential data that should not be exposed or leaked to others. Another source may grant different access rights or privileges to different users or roles. A third source may use different encryption or authentication mechanisms for securing the data. To verify the data, these security requirements need to be enforced or preserved by using techniques like encryption, hashing, digital signatures, certificates, or tokens.

A third issue is data privacy [1], which means the preservation of the confidentiality and anonymity of the data and its owners. For instance, a source may contain personal or identifiable data that should not be disclosed or linked to other data. Another source may be subject to different privacy policies or regulations for collecting, storing, processing, or sharing the data. A third source may hold data whose owners have expressed different preferences or consent for how their data is used or accessed. To verify the data, these privacy preferences need to be respected by using techniques like anonymization, pseudonymization, masking, aggregation, or differential privacy.

A fourth issue is data governance [1], which means the definition and enforcement of policies and standards for data quality, integrity, and usage. For example, a source may have a different level of quality or accuracy for its data. Another source may have different rules or constraints for ensuring the integrity or validity of its data. A third source may have different norms or expectations for how its data is used or consumed. To verify the data, these policies and standards need to be defined and enforced by using techniques like metadata management, data quality assessment, data cleansing, data auditing, or data lineage.

A fifth issue is data provenance [1], which means tracking and documenting the origin, history, and lineage of the data from different sources. To capture and record this information, techniques such as provenance models, graphs, queries, or annotations are used.

To verify the data across multiple heterogeneous sources, modern federated database systems use data federation. This allows users to access and query data from multiple sources without moving or copying it. Systems cited in this context, with varying degrees of federation support, include Amazon Redshift (a data warehouse that offers federated queries), Apache Ignite (a distributed in-memory database), and Apache Hadoop (a distributed storage and processing framework); Google Fusion Tables, another example, was retired in 2019. Data federation can improve data quality and accessibility by reducing data duplication, inconsistency, and the latency introduced by copying data, and by providing a unified and customizable view of the data.
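
Two of the techniques named above, hashing for integrity and pseudonymization for privacy, can be sketched with Python's standard library. This is a minimal illustration, not a complete verification scheme: the record layout, the function names, and the way the secret key is supplied are all assumptions made for the example.

```python
import hashlib
import hmac
import json


def record_digest(record: dict) -> str:
    """SHA-256 digest of a canonical JSON encoding, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_record(record: dict, expected_digest: str) -> bool:
    """Check that a record received from a source matches a previously published digest."""
    return hmac.compare_digest(record_digest(record), expected_digest)


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    The same identifier and key always give the same pseudonym, so records
    can still be linked, but the pseudonym does not reveal the identifier.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical record and key, for illustration only.
original = {"patient_id": "P-1001", "weight_kg": 70.5}
published_digest = record_digest(original)

tampered = {"patient_id": "P-1001", "weight_kg": 75.0}
shared = {**original, "patient_id": pseudonymize(original["patient_id"], b"example-secret-key")}
```

Here `verify_record(original, published_digest)` succeeds while the same check on `tampered` fails, and `shared` can be exchanged between sources without exposing the original identifier. A plain digest only detects modification when the digest itself is obtained through a trusted channel; digital signatures, which the text also mentions, address that gap.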

Conclusion

In this essay, we discussed the development and verification of distributed (federated) data models, which enable data integration and management across multiple heterogeneous sources. We explored how these models are developed using techniques such as fragmentation, allocation, replication, and distribution, and how their data is verified in the face of heterogeneity, security, privacy, governance, and provenance issues, with data federation providing a unified view of the sources. We also gave some examples of systems and technologies associated with data federation. We conclude that distributed (federated) data models pose significant challenges and opportunities for data management in a distributed environment, and that they have important implications for the status and role of data in society.

References

[1] Federated database management system issues. https://www.geeksforgeeks.org/federated-database-management-system-issues/. Accessed: 2023-10-21.

[2] Federated database system. https://en.wikipedia.org/wiki/Federated_database_system. Accessed: 2023-10-21.

[3] Alexander Fridman. Distributed database architecture: What is it? https://www.influxdata.com/blog/distributed-database-architecture-what-is-it/. Accessed: 2023-10-21.

[4] Bosko Marijan. What is a distributed database? Features, Benefits & Drawbacks. https://phoenixnap.com/kb/distributed-database. Accessed: 2023-10-21.