The evolution of data modeling techniques: from relational to NoSQL

Data processing plays a key role in decision-making and technology development. Effective data management, including storage and processing, is critical to the correct operation of software used in nearly every sphere of human activity.

The process of creating a formalized representation of data, structures, and relationships between various data elements is called data modeling. Data modeling is used to ensure consistency, stability, and scalability of information systems when business processes or requirements change.

Data models define the structure and meaning of data that can be used for exchange between applications or systems, as well as for their integration and joint use within a single database. Data modeling helps organize information in such a way that it is useful and understandable to all stakeholders. High-quality data models allow for[1]:

Minimizing data duplication and system integration costs.
Improving data consistency and reducing the number of errors when transferring data between systems.
Ensuring the stability of information systems when data requirements change.
Reducing development and maintenance costs through data standardization.

Data modeling has a long history, beginning in the 1960s, when network and hierarchical models were proposed for the first database management systems. The main task of these early models was to organize and structure data stored in file systems. Systems such as the Integrated Data Store (IDS), developed at General Electric and later marketed by Honeywell, and IBM’s Information Management System (IMS) represented data as linked records organized into network or tree-like (hierarchical) structures, respectively.

The hierarchical model represented data in a tree-like structure, where each data element had only one parent, creating a strict hierarchy. Network models, such as those proposed by CODASYL, allowed for more complex relationships between data, where one element could be associated with multiple parent elements. These models enabled the organization of large volumes of data and supported some basic operations for their processing but had limitations. They were too rigid in their structure, complicating changes and system modernization.[2]
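The structural difference between the two models can be sketched in a few lines of Python. This is only an illustration: the class names `HierarchicalRecord` and `NetworkRecord` and the sample records are hypothetical and do not reflect how IMS or CODASYL systems were actually implemented.

```python
class HierarchicalRecord:
    """A record in a tree: at most one parent, any number of children."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def add_child(self, child):
        # The hierarchy is strict: a record cannot sit under two parents.
        if child.parent is not None:
            raise ValueError(f"{child.name} already has a parent")
        child.parent = self
        self.children.append(child)


class NetworkRecord:
    """A record in a network: it may be a member of several owners."""

    def __init__(self, name):
        self.name = name
        self.owners = []
        self.members = []

    def connect(self, member):
        # Record the link in both directions so it can be navigated either way.
        self.members.append(member)
        member.owners.append(self)


# Hierarchical: the employee belongs to exactly one department.
department = HierarchicalRecord("Engineering")
employee = HierarchicalRecord("Alice")
department.add_child(employee)

# Network: the same employee is linked to a department and to a project.
dept = NetworkRecord("Engineering")
project = NetworkRecord("Migration")
worker = NetworkRecord("Alice")
dept.connect(worker)
project.connect(worker)
```

In the hierarchical sketch, attaching `employee` to a second department raises an error, while in the network sketch `worker.owners` holds both the department and the project.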

Often, finding the necessary data required writing complex procedural code. For example, a simple query about the number of employees retiring in the next three years required expertise and programming skills. Additionally, when changing the database structure, it was necessary to rewrite existing programs. Therefore, such databases were costly to operate, as they required low-level programming and constant support when data changed.

In 1970, Edgar F. Codd proposed the relational data model, which became a fundamentally new approach in database management. Codd suggested representing all information in the form of simple tabular structures — relations — and accessing data via a high-level declarative query language. Instead of writing algorithms for sequential record retrieval, the programmer only needed to specify the conditions that the desired data should meet. Then the database management system, using a query optimizer, transformed these conditions into an efficient access algorithm. This approach significantly simplified working with databases compared to the navigational languages of previous models. High-level declarative query languages increased programmer productivity and allowed end-users to interact with data independently without deep programming knowledge.
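The contrast between record-at-a-time code and a declarative query can be illustrated with the retirement example mentioned earlier. The sketch below uses Python's built-in `sqlite3` module; the table layout, the sample employees, the fixed `CURRENT_YEAR`, and the retirement age of 65 are all assumptions made for illustration.

```python
import sqlite3

RETIREMENT_AGE = 65  # assumption for illustration
CURRENT_YEAR = 2024  # fixed so the example is reproducible

# Hypothetical sample data: (name, birth_year)
EMPLOYEES = [
    ("Alice", 1960),
    ("Bob", 1985),
    ("Carol", 1961),
    ("Dan", 1950),
]


def count_retiring_procedural(records):
    """Navigational style: visit every record and test it by hand."""
    count = 0
    for _name, birth_year in records:
        retirement_year = birth_year + RETIREMENT_AGE
        if CURRENT_YEAR < retirement_year <= CURRENT_YEAR + 3:
            count += 1
    return count


def count_retiring_declarative(records):
    """Declarative style: state the condition and let the DBMS plan the access."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE employee (name TEXT, birth_year INTEGER)")
        conn.executemany("INSERT INTO employee VALUES (?, ?)", records)
        (count,) = conn.execute(
            """
            SELECT COUNT(*) FROM employee
            WHERE birth_year + :age > :year
              AND birth_year + :age <= :year + 3
            """,
            {"age": RETIREMENT_AGE, "year": CURRENT_YEAR},
        ).fetchone()
        return count
    finally:
        conn.close()
```

Both functions return the same count for the sample data. The difference is that the SQL version says nothing about how records are traversed, so the storage layout or indexes can change without rewriting the query.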

Throughout the 1970s, the research community actively studied the concept of relational DBMSs. High-level query languages were developed to facilitate the use of systems by both programmers and end-users. Theories and algorithms for query optimization allowed transforming high-level queries into efficient data access plans, comparable in performance to manually written code. The theory of data normalization was also formulated, helping to eliminate redundancy and logical anomalies in databases.
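A small example may help show what normalization buys. The sketch below, again using `sqlite3`, stores the same facts once in a denormalized table and once in two normalized tables; the table and column names are hypothetical and chosen only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Denormalized: the department's city is repeated for every employee.
    CREATE TABLE employee_flat (name TEXT, dept_name TEXT, dept_city TEXT);
    INSERT INTO employee_flat VALUES
        ('Alice', 'Engineering', 'Berlin'),
        ('Bob',   'Engineering', 'Berlin');

    -- Normalized: the city is stored once, in the department row.
    CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE employee (name TEXT, dept_id INTEGER REFERENCES department(id));
    INSERT INTO department VALUES (1, 'Engineering', 'Berlin');
    INSERT INTO employee VALUES ('Alice', 1), ('Bob', 1);
    """
)

# Update anomaly: changing only one copy leaves the flat table inconsistent.
conn.execute("UPDATE employee_flat SET dept_city = 'Munich' WHERE name = 'Alice'")
flat_cities = {
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT dept_city FROM employee_flat WHERE dept_name = 'Engineering'"
    )
}

# In the normalized schema there is a single place to update.
conn.execute("UPDATE department SET city = 'Munich' WHERE id = 1")
normalized_cities = {
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT d.city FROM employee e JOIN department d ON d.id = e.dept_id"
    )
}
```

After the updates, `flat_cities` contains two conflicting cities for the same department, while `normalized_cities` contains one.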

In addition, algorithms were created for distributing data on physical media to minimize access costs to records. Methods for buffer management and indexing were developed to accelerate information retrieval and processing. An important contribution was also the development of transaction theory. The concept of a transaction ensured atomicity and consistency of data operations, which was critically important for system reliability. Researchers developed methods for managing concurrent data access and recovery after failures, enhancing the reliability and resilience of databases. Prototypes of relational DBMSs created during these studies became the foundation for many commercial systems that emerged in the 1980s.
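Atomicity can be demonstrated with a minimal sketch in which a transfer between two accounts either applies both updates or neither. It relies on the `sqlite3` connection acting as a context manager that commits on success and rolls back when an exception escapes; the `account` table and the `transfer` function are hypothetical names introduced for this example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst`; undo both updates on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
with conn:
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])

ok = transfer(conn, 1, 2, 30)       # succeeds and is committed
failed = transfer(conn, 1, 2, 500)  # would overdraw account 1, so it is rolled back
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

The second transfer executes both `UPDATE` statements before the check fails, yet neither change is visible afterwards, which is the behavior the transaction concept guarantees.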

As a result of this research, relational databases became widespread. Today, commercial relational DBMSs are available on hardware platforms ranging from personal computers to mainframes and have become the de facto standard in data management.[3]

Despite the widespread adoption and advantages of the relational model, over time, its limitations began to manifest, especially in the context of growing data volumes and modern application requirements. One of the main problems of relational databases was insufficient scalability when processing large volumes of unstructured and semi-structured data. The strict data schema and the need for predefined structures complicated adaptation to rapidly changing business requirements and hindered working with dynamic data.

Moreover, relational databases faced performance issues when distributing the load across multiple servers. Vertical scaling, which means upgrading the hardware of a single server, had its limits and became economically inefficient. Horizontal scaling, which means adding new servers to the system, was difficult because data had to be synchronized and transaction integrity maintained across a distributed environment.[4]

In an attempt to overcome the limitations of relational databases, object-oriented databases began to be developed in the 1980s. These systems aimed to combine the advantages of object-oriented programming with database capabilities, allowing data to be stored as objects with support for inheritance, encapsulation, and polymorphism. Object-oriented databases offered a more flexible data model capable of better reflecting complex real-world structures.

However, despite their potential, object-oriented databases did not gain widespread adoption. This was due to the lack of standardization, complexity in integrating with existing systems, and insufficient performance compared to relational DBMSs. As a result, the industry continued to seek new approaches to data management capable of meeting the needs of rapidly developing technologies.[5]

With the advent of web technologies, social networks, mobile devices, and the Internet of Things, data volumes began to grow exponentially. There arose a need for systems capable of efficiently processing big data, ensuring high availability, and quickly adapting to changes. Traditional relational databases could not fully meet these requirements due to scalability limitations and rigid data schemas.

In response to these challenges, NoSQL (Not Only SQL) databases emerged in the early 2000s. NoSQL databases are non-relational data management systems that offer alternative models for storing and processing data. They were developed to solve specific problems of scalability, performance, and flexibility when working with diverse and rapidly changing data. There are several main types of NoSQL databases[6]:

Key-Value Stores: The simplest systems, in which data is stored as “key-value” pairs and lookups by key are typically very fast. They are ideal for cases where data access is performed by a unique key and complex queries are not required. Examples: Redis, Riak.[7]
Document-Oriented Databases: Store data as documents, usually in JSON or XML format. This allows storing complex, semi-structured data and performing queries based on the content of documents. They provide flexibility in working with changing data schemas. Examples: MongoDB, CouchDB.[8]
Columnar Databases: Organize data by columns or column families rather than by rows. Column-oriented storage makes analytical queries that read a few columns across many rows more efficient, while wide-column stores such as Apache Cassandra and HBase group related columns into families and are well suited to distributed systems and big data. Examples: Apache Cassandra, HBase.[9]
Graph Databases: Designed for storing and processing data presented as graphs. They are effective when working with data containing complex relationships, such as in social networks or recommendation systems. Examples: Neo4j, OrientDB.[10]
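The four models listed above can be compared side by side using plain Python data structures. This is a conceptual sketch only: it does not use the client API of Redis, MongoDB, Cassandra, or Neo4j, and the helper functions `find`, `read_family`, and `friends_of_friends` are hypothetical names invented for the illustration.

```python
from collections import defaultdict

# Key-value: an opaque value looked up by a unique key, with no querying by content.
kv_store = {}
kv_store["session:42"] = b"opaque-session-bytes"

# Document: self-describing records whose fields may differ between documents.
documents = [
    {"_id": 1, "name": "Alice", "skills": ["sql", "python"]},
    {"_id": 2, "name": "Bob", "address": {"city": "Berlin"}},
]


def find(collection, **criteria):
    """Return documents whose top-level fields match all criteria."""
    return [
        doc
        for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# Column-family layout: row key -> column family -> column -> value.
wide_rows = {
    "user:1": {"profile": {"name": "Alice"}, "activity": {"last_login": "2024-10-20"}},
    "user:2": {"profile": {"name": "Bob", "city": "Berlin"}},
}


def read_family(rows, family):
    """Read one column family across all rows, skipping rows that lack it."""
    return {key: row[family] for key, row in rows.items() if family in row}


# Graph: nodes connected by edges, queried by traversing relationships.
follows = defaultdict(set)
follows["Alice"].add("Bob")
follows["Bob"].add("Carol")


def friends_of_friends(graph, start):
    """People two hops away from `start`, excluding direct contacts."""
    direct = graph.get(start, set())
    result = set()
    for person in direct:
        result |= graph.get(person, set())
    return result - direct - {start}
```

For example, `find(documents, name="Bob")` matches on document content, `read_family(wide_rows, "activity")` touches only one column family, and `friends_of_friends(follows, "Alice")` follows two edges to reach Carol.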

NoSQL databases offer flexibility in choosing data models and scaling. They support horizontal scaling by adding new nodes to the system without significantly complicating the architecture. Moreover, many NoSQL systems provide high availability and fault tolerance, which is especially important for modern web applications and services operating 24/7.[11]
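One common way to spread data across nodes is to hash each key and use the hash to pick a node. The sketch below shows the simplest variant, modulo partitioning; the node names and the functions `node_for` and `place` are hypothetical. Note that with plain modulo partitioning, adding a node changes the assignment of many existing keys, which is why production systems often use consistent hashing or similar schemes instead.

```python
import hashlib


def node_for(key, nodes):
    """Pick a node for `key` from a stable hash of the key."""
    # A cryptographic digest is used only because it is stable across runs,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def place(keys, nodes):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


nodes = ["node-a", "node-b", "node-c"]      # hypothetical cluster
keys = [f"user:{i}" for i in range(1000)]   # hypothetical keys
placement = place(keys, nodes)
```

Every key lands on exactly one node, and the same key always maps to the same node as long as the node list is unchanged.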

However, it is worth noting that NoSQL databases are not a replacement for relational DBMSs but complement them by offering alternative solutions for specific tasks. The choice between relational and NoSQL databases depends on the specific project requirements, the nature of the data, and the necessary functions.[12]

Thus, the evolution of data modeling methods from relational to NoSQL databases reflects the industry’s drive toward more flexible, scalable, and high-performance data management systems. The emergence of NoSQL became a response to new challenges associated with big data and high loads, opening up new opportunities for the development and implementation of innovative applications.

References

West, M. (2010). Developing High Quality Data Models. [online] Available at: [https://www.researchgate.net/publication/286610894_Developing_High_Quality_Data_Models](https://www.researchgate.net/publication/286610894_Developing_High_Quality_Data_Models) [Accessed 20 Oct. 2024]. “Definition of data modeling, its tasks and functions”
Navathe, S.B. (1992). Evolution of data modeling for databases. Communications of the ACM, [online] 35(9), pp.112–123. doi:[https://doi.org/10.1145/130994.131001](https://doi.org/10.1145/130994.131001). “The history of data modeling”
Silberschatz, A., Stonebraker, M. and Ullman, J. eds., (1991). Database systems: achievements and opportunities. Communications of the ACM, 34(10), pp.110–120. doi:[https://doi.org/10.1145/125223.125272](https://doi.org/10.1145/125223.125272). “The history of relational databases”
Jatana, N., Puri, S., Ahuja, M., Kathuria, I. and Gosain, D. (2012). A Survey and Comparison of Relational and Non-Relational Database. [online] Available at: [https://www.semanticscholar.org/paper/A-Survey-and-Comparison-of-Relational-and-Database-Jatana-Puri/2791b66e550e4fc0193333f5d97c0de33128e13b](https://www.semanticscholar.org/paper/A-Survey-and-Comparison-of-Relational-and-Database-Jatana-Puri/2791b66e550e4fc0193333f5d97c0de33128e13b). “Disadvantages of relational databases”
Kim, W. (1990). Object-oriented databases: definition and research directions. IEEE Transactions on Knowledge and Data Engineering, 2(3), pp.327–341. doi:[https://doi.org/10.1109/69.60796](https://doi.org/10.1109/69.60796). “Description of object-oriented databases”
Davoudian, A., Chen, L. and Liu, M. (2018). A Survey on NoSQL Stores. ACM Computing Surveys, [online] 51(2), pp.1–43. doi:[https://doi.org/10.1145/3158661](https://doi.org/10.1145/3158661). “The history of NoSQL databases, their varieties”
Seeger, M. (2009). Key-Value stores: a practical overview. [online] Available at: [https://blog.marc-seeger.de/assets/papers/Ultra_Large_Sites_SS09-Seeger_Key_Value_Stores.pdf](https://blog.marc-seeger.de/assets/papers/Ultra_Large_Sites_SS09-Seeger_Key_Value_Stores.pdf). “Description and examples of Key-value databases”
Vera, H., Boaventura, W., Holanda, M., Guimarães, V. and Hondo, F. (n.d.). Data Modeling for NoSQL Document-Oriented Databases. [online] Available at: [https://suriweb.com.ar/wp/tecnologia/wp-content/uploads/sites/18/2019/07/Data-Modeling-for-No-Sql.pdf](https://suriweb.com.ar/wp/tecnologia/wp-content/uploads/sites/18/2019/07/Data-Modeling-for-No-Sql.pdf) [Accessed 22 Oct. 2024]. “Description and examples of Document-Oriented databases”
Liu, Z., Hsiao, H.-I. and Chen, Y. (2011). Efficient and Scalable Data Evolution with Column Oriented Databases. [online] Available at: [https://web.njit.edu/~ychen/edbt11.pdf](https://web.njit.edu/~ychen/edbt11.pdf) [Accessed 22 Oct. 2024]. “Description and application of Column Oriented databases”
Armbruster, S. (2013). Tutorial Neo4j. [online] Available at: [http://nosqlroadshow.com/dl/NoSQL-Munich-2013/Presentations/neo4j_Stefan_Armbruster_Tutorial.pdf](http://nosqlroadshow.com/dl/NoSQL-Munich-2013/Presentations/neo4j_Stefan_Armbruster_Tutorial.pdf) [Accessed 22 Oct. 2024]. “Description and application of graph databases”
Leavitt, N. (2010). Will NoSQL Databases Live Up to Their Promise? Computer, 43(2), pp.12–14. doi:[https://doi.org/10.1109/mc.2010.58](https://doi.org/10.1109/mc.2010.58). “NoSQL pros and cons”
Nayak, A., Poriya, A. and Poojary, D. (2013). Type of NOSQL Databases and its Comparison with Relational Databases. International Journal of Applied Information Systems (IJAIS), [online] 5(4). Available at: [https://research.ijais.org/volume5/number4/ijais12-450888.pdf](https://research.ijais.org/volume5/number4/ijais12-450888.pdf). “Comparison of NoSQL and relational databases”