Modern NoSQL Data Lake Modeling Techniques

NoSQL data lakes are a leading component of modern data engineering, providing a powerful tool for storing, managing, and analyzing large amounts of structured, semi-structured, and unstructured data. Unlike traditional relational databases, NoSQL data lakes do not enforce a fixed schema, which allows greater flexibility and scalability in handling diverse and evolving data types. This essay discusses modern modeling techniques for NoSQL data lake solutions.

It is hard to imagine a data lake solution without NoSQL databases. A data lake is an approach in which data is additionally stored in its raw format. The basic idea is to keep enterprise data at each step of the transformation: beginning with the raw format in which the data enters the data lake and ending with the data marts built after several transformation steps.

A data lake can be used as a centralized repository where organizations efficiently store raw data in its original format. NoSQL databases are therefore a good fit for data lakes: their adaptable schema simplifies storing a diverse range of data types with different degrees of structure.

One of the primary features of NoSQL databases is their schema-less design. Unlike traditional relational databases, NoSQL databases do not require a predefined schema. This flexibility allows organizations to collect and store data without extensive upfront planning, making data lakes ideal for handling ever-changing and diverse data sources.

Data modeling techniques differ depending on the type of NoSQL database. The following sections go through some of the most popular ones.

Document-oriented databases store data in a flexible, JSON-like format. Each document can have its own structure, allowing for easy adaptation to different data types and sources. This makes them a popular choice for data lake modeling, as data can be loaded and stored without first defining its structure. There are several approaches to modeling data in document-oriented databases [1].

The first is document-based modeling with reference relationships, where data is organized into documents and each document represents an entity. Relationships between documents are stored as links or references, which the application resolves to retrieve the related data.

The second is embedded-document modeling, where a single document contains other documents nested inside it. Such structures can be used to represent relationships and hierarchies in the data.
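The two approaches can be contrasted in a minimal sketch. Plain Python dictionaries stand in for documents here; the collection names, field names, and the `resolve_customer` helper are hypothetical and not taken from any particular database API.

```python
# Hypothetical in-memory "collection" of customer documents, keyed by id.
customers = {
    "c1": {"_id": "c1", "name": "Ada"},
}

# Reference modeling: the order stores only the id of the customer.
orders_referenced = [
    {"_id": "o1", "customer_id": "c1", "total": 40},
]

# Embedded modeling: the customer document is nested inside the order.
orders_embedded = [
    {"_id": "o1", "customer": {"_id": "c1", "name": "Ada"}, "total": 40},
]


def resolve_customer(order, customers_by_id):
    """Resolve a reference on the application side (a second lookup)."""
    return customers_by_id[order["customer_id"]]


# With references, reading the customer name needs an extra lookup.
name_via_reference = resolve_customer(orders_referenced[0], customers)["name"]

# With embedding, the name is available from the order document itself.
name_via_embedding = orders_embedded[0]["customer"]["name"]
```

In this sketch the referenced form keeps a single copy of the customer, at the cost of a second lookup, while the embedded form reads everything in one step but duplicates the customer in every order.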

Document-oriented databases are a great choice for cases that require flexibility and fast, continual development. For example, when user profile data evolves, a new field can be added to new documents without changing the ones already stored. The most popular examples of document-oriented databases are MongoDB [2] and Amazon DocumentDB [3].
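The user profile case can be illustrated with a small sketch. It assumes profiles are plain dictionaries and that the reading code supplies a default for documents written before the field existed; the `timezone` field and the `timezone_of` helper are hypothetical.

```python
# Two profile documents in the same collection with different shapes.
profiles = [
    {"user_id": 1, "name": "Ada"},                         # written before the change
    {"user_id": 2, "name": "Linus", "timezone": "UTC+2"},  # written after the change
]


def timezone_of(profile, default="UTC"):
    """Read the newer field, falling back to a default for older documents."""
    return profile.get("timezone", default)


timezones = [timezone_of(p) for p in profiles]
```

The older document is left untouched; only the code that reads profiles needs to tolerate the missing field.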

Key-value stores are a type of NoSQL database where each data item is associated with a unique key. The key can be a single field or a composite key constructed from an id and one or more additional fields. The choice of key mainly affects query performance, so it should follow the main business use cases of the database. The value can be of any data type, including complex structures, binary files, or even another key-value pair.
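A minimal sketch of composite keys follows. A plain Python dict stands in for the store; the `make_key` helper, the `user` prefix, and the `:` separator are hypothetical conventions chosen for illustration. The prefix lookup is also an assumption: not every key-value store supports it efficiently.

```python
def make_key(user_id, *parts):
    """Build a composite key such as 'user:42:session:abc'."""
    return ":".join(["user", str(user_id), *map(str, parts)])


store = {}  # stand-in for a key-value store
store[make_key(42)] = {"name": "Ada"}                       # structured value
store[make_key(42, "session", "abc")] = {"ip": "10.0.0.1"}  # nested key-value pairs
store[make_key(42, "avatar")] = b"\x89PNG"                  # binary value


def keys_with_prefix(kv, prefix):
    """Return all keys that start with the given prefix, sorted."""
    return sorted(k for k in kv if k.startswith(prefix))


user_42_keys = keys_with_prefix(store, "user:42:")
```

Putting the user id first in the key reflects the assumed access pattern: everything belonging to one user shares a prefix and can be fetched together.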

Unlike traditional relational databases, key-value databases do not require a predefined structure, which offers more flexibility and a performance advantage. Because they do not need placeholders for missing optional values, key-value databases are a lighter solution that requires fewer resources. These features suit large databases with simple access patterns, such as caching, storing and managing user sessions, ad serving, and recommendations.

Examples of key-value stores include Amazon DynamoDB [4] and Redis [5].

Unlike the traditional relational approach, column-family stores organize information by columns rather than by rows. Each column family may contain several columns, and each column can hold multiple versions of a value, distinguished by timestamps.

This design allows storing huge amounts of data with varying structure, which makes it a good fit for systems that require high scalability and performance.
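The layout described above (row keys, column families, columns, and timestamped versions) can be sketched with nested dictionaries. This is only an in-memory illustration; the `ColumnFamilyTable` class and its `put` and `get` methods are hypothetical and do not mirror the API of any real store.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy model: row key -> column family -> column -> [(timestamp, value)]."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def put(self, row_key, family, column, value, timestamp):
        """Add a new version of a cell and keep versions ordered by timestamp."""
        versions = self._rows[row_key][family][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0])

    def get(self, row_key, family, column, at=None):
        """Return the latest value, or the latest one with timestamp <= at."""
        versions = self._rows.get(row_key, {}).get(family, {}).get(column, [])
        candidates = [v for t, v in versions if at is None or t <= at]
        return candidates[-1] if candidates else None


table = ColumnFamilyTable()
table.put("sensor-1", "readings", "temperature", 21.5, timestamp=100)
table.put("sensor-1", "readings", "temperature", 22.0, timestamp=200)
table.put("sensor-1", "meta", "location", "hall A", timestamp=100)

latest = table.get("sensor-1", "readings", "temperature")
earlier = table.get("sensor-1", "readings", "temperature", at=150)
```

Rows do not need the same columns: `sensor-1` has a `location` column in the `meta` family, while another row could omit it entirely.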

Column-family stores are commonly used for time-series data such as sensor readings and event logs. Examples of column-family stores include Apache Cassandra [6] and HBase [7].

Graph databases are increasingly used in data lakes when relationships between data points are essential. They are used in modeling complex relationships, such as social networks or supply chain systems, allowing for efficient querying and analysis.

In this approach, entities are modeled as nodes connected by relationships, and both nodes and relationships can have properties that describe them. Querying essentially means traversing the graph in search of related nodes or patterns. The most popular graph database is Neo4j [8].
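Traversal-based querying can be illustrated with a small property graph held in memory. The node names, the `FOLLOWS` and `WORKS_WITH` relationship types, and the `reachable` helper are hypothetical; a graph database would express the same query in its own query language.

```python
from collections import deque

# Nodes and relationships both carry properties.
nodes = {
    "alice": {"label": "Person", "city": "Oslo"},
    "bob": {"label": "Person", "city": "Riga"},
    "carol": {"label": "Person", "city": "Lyon"},
    "dave": {"label": "Person", "city": "Kyiv"},
}
edges = [
    ("alice", "FOLLOWS", "bob", {"since": 2021}),
    ("bob", "FOLLOWS", "carol", {"since": 2022}),
    ("carol", "FOLLOWS", "dave", {"since": 2023}),
    ("alice", "WORKS_WITH", "dave", {"since": 2020}),
]


def neighbors(node_id, rel_type):
    """Targets of outgoing relationships of the given type."""
    return [dst for src, rel, dst, _props in edges
            if src == node_id and rel == rel_type]


def reachable(start, rel_type, max_depth):
    """Breadth-first traversal: nodes reachable within max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors(node, rel_type):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, depth + 1))
    return found


two_hops = reachable("alice", "FOLLOWS", max_depth=2)
```

Here `two_hops` contains the people Alice follows directly or through one intermediary, which is the kind of question that would need repeated joins in a relational model.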

To create a well-functioning data lake with NoSQL databases, a robust data transformation process is critical. Data must be collected from various sources, such as social media and enterprise applications, and transformed into a suitable format for storage and analysis.
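As a rough sketch of such a transformation step, the example below keeps each incoming payload unchanged in a raw zone and writes a normalized document to a curated zone. The two sources, their field names, and the `from_social`, `from_crm`, and `ingest` functions are all hypothetical; real pipelines would typically run in a dedicated tool rather than in a single function.

```python
import json
from datetime import datetime, timezone


def from_social(raw):
    """Normalize a hypothetical social media payload."""
    return {
        "source": "social",
        "user": raw["handle"].lstrip("@").lower(),
        "text": raw["message"],
        "created_at": datetime.fromtimestamp(raw["ts"], tz=timezone.utc).isoformat(),
    }


def from_crm(raw):
    """Normalize a hypothetical enterprise application payload."""
    return {
        "source": "crm",
        "user": raw["UserName"].lower(),
        "text": raw["Note"],
        "created_at": raw["CreatedAt"],
    }


TRANSFORMERS = {"social": from_social, "crm": from_crm}


def ingest(events):
    """Store every payload as-is in the raw zone and normalized in the curated zone."""
    raw_zone, curated_zone = [], []
    for source, payload in events:
        raw_zone.append(json.dumps(payload))               # original format kept
        curated_zone.append(TRANSFORMERS[source](payload))  # common format
    return raw_zone, curated_zone


events = [
    ("social", {"handle": "@Ada", "message": "hello", "ts": 1700000000}),
    ("crm", {"UserName": "Linus", "Note": "call back", "CreatedAt": "2023-11-15T09:00:00+00:00"}),
]
raw_zone, curated_zone = ingest(events)
```

Keeping the raw payload next to the normalized document means the transformation can be re-run later if the target format changes.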

Data engineers use many tools to build such pipelines, and some companies even develop their own ELT frameworks to speed up the data engineering process. Among the most popular tools for transforming and loading data are Apache NiFi [9] and Apache Kafka [10].

It is worth mentioning that enterprises rarely rely on only one of the described approaches when designing their NoSQL databases. They combine them into more complex designs that fulfill their business needs.

Overall, modern NoSQL data lake modeling techniques have changed the way organizations manage and use data. With the right data ingestion and transformation processes, data governance, and query tools, NoSQL data lakes enable organizations to realize the full potential of their data and make decisions in a more data-driven way.

[1] Data Modeling for NoSQL Document-Oriented Databases, Harley Vera, Wagner Boaventura, Maristela Holanda, Valeria Guimaraes, Fernanda Hondo, 2015.

[2] https://www.mongodb.com/document-databases

[3] https://aws.amazon.com/ru/documentdb/

[4] https://aws.amazon.com/ru/dynamodb/

[5] https://redis.io/

[6] https://cassandra.apache.org/_/cassandra-basics.html

[7] https://hbase.apache.org/

[8] https://neo4j.com/

[9] https://nifi.apache.org/

[10] https://kafka.apache.org/