Table of Contents

Data modeling languages for data warehouses and data lakes

By — Vadim Zhdanov

Introduction

All data modeling methods can be divided into 2 groups: structural and process. The relational paradigm, relevant for typical QCD, focuses on defining entities and relationships between them, in order to then represent these concepts in the form of related tables using ER-diagrams (Entity-Relationship). Data Flow Diagrams (DFDs) are used to describe data flows in business analysis. DFD diagrams clearly show the exchange of information between data warehouses (not in the sense of QCD, but rather, source databases, consumers and other storage locations), internal processes of the system and entities external to it. This dynamic method can be used in the conceptual design of Big Data solutions, when the main sources and consumers of data, as well as their processing processes, in particular, ETL, are determined. Such visualization will be useful when designing modern QCD integrated with data lakes and cloud services of advanced Big Data analytics. Returning to the structural design of DWH, here are the 3 most common modeling techniques:

Инструментально все эти методы реализованы в виде различных CASE-средств, которые позволяют создавать ER-диаграммы и проверять их корректность. Часто они входят в состав платформ для создания DWH, например, как это сделано в SAP BW∕4 HANA.

Multi-dimensional modeling method

A multidimensional model is visually represented using a cube (or, in the case of more than three dimensions, a hypercube).

Basic concepts of the method

Multivariate modeling is a method of modeling and visualizing data as a set of numerical or linguistic metrics or parameters that describe general aspects of an organization's operations. Typically, multidimensional modeling focuses on numerical data such as sales, balance sheet, profit, weight, or objects that can be recalculated, such as articles, patents, books. The multivariate modeling method is based on the following basic concepts: facts, attributes, dimensions, parameters (metrics), hierarchy, granularity. A fact is a collection of related data items containing metrics and descriptive data. Each fact usually represents a data element that numerically describes an organization's activities, business transactions or events that can be used to analyze the organization's activities or business processes. An attribute is a description of the characteristics of a real object of the domain. Typically, an attribute contains a pre-known value that characterizes the fact. Measurement is the interpretation of a fact from some point of view in the real world. Dimensions contain text values ​​that are strongly related in meaning to each other. Dimensions are usually represented as axes of a multidimensional space, the points of which are the facts associated with them. Dimensions are specified by listing their members. Dimension Member - A unique name or identifier used to determine the position of an element. Often, the members of a dimension are in a part-whole or parent-child relationship, which allows one or more hierarchies to be introduced on a dimension. Each hierarchy can have multiple levels of hierarchy. Each member of a dimension must belong to only one level of the hierarchy, thus generating a split into disjoint subsets. A parameter, metric or indicator is a numerical characteristic of a fact that determines the effectiveness of an organization's activities or business actions in terms of measurement. As a rule, a metric contains a previously unknown value of a fact characteristic. The specific values ​​of the metric are described using variables.

Facts

From the point of view of the relationship between dimensions and facts, the latter can be divided into the following classes:

Table of facts

Facts in a multidimensional model are usually presented in the form of a fact table. In the logical entity-relationship model, a fact table is represented by an entity whose attributes are facts (metrics or descriptions) and a composite key that links the fact table to the dimension tables with a one-to-many relationship. Fact tables are divided into three main categories based on the level of detail of the facts.

The main characteristics of the fact table are as follows.

  1. The fact table contains numerical parameters (metrics).
  2. Each fact table has a composite key made up of the primary keys of the dimension tables.
  3. The fact table has, as a rule, a small number of fields, no more than 20.
  4. The data in the fact table has the following properties:
    1. numerical parameters are used for aggregation and summation;
    2. data values ​​must have the properties of additivity or semi-additivity with respect to dimensions in order to be summarized;
    3. All data in the fact table must be uniquely identified through the keys of the dimension tables.

Measurement

The main characteristics of the measurement table are as follows.

Basic schemas

There are several schemas for multidimensional data modeling. Two of these are considered basic: the star schema and the snowflake schema. In more complex cases, so-called “multi-star” or multiple fact table schemas are used. A star schema has one fact table and multiple dimension tables. Dimension tables are denormalized. A snowflake schema has one fact table and multiple normalized dimension tables.

IDEF1X

IDEF1X is a data modeling language for developing semantics of data models. IDEF1X is used to form graphical representations of information models that reflect the structure and semantics of information within an environment or system. This modeling technique is used to model data in a standard, consistent, and predictable manner in order to manage it as a resource. The main goals of the IDEF1X standard are to provide:

The main purpose of IDEF1X is to support integration. The integration approach focuses on capturing, managing and using a single semantic definition of a data resource called a “conceptual schema”. A “conceptual schema” provides a single, integrated definition of data within an enterprise that is not biased towards any single use of the data and is independent of how the data is physically stored or accessed. The main purpose of a conceptual schema is to provide a consistent definition of values ​​and relationships between data that can be used to integrate, share, and manage data integrity. A conceptual diagram should have three important characteristics:

DATA VAULT

Data Vault is one of the data modeling methods used in QCD design. To maintain simplicity of design and ensure maximum flexibility of QCD, a minimum of basic concepts are used:

Model building rules

The Data Vault model is created according to the following algorithm:

  1. defining Hubs based on key business entities and their use in the subject area;
  2. identifying Links by identifying possible relationships between business keys - and understanding the context of its work;
  3. Definition of Satellites - modeling the context of each business entity and transaction (Links) connecting the Hubs;
  4. Simulation of point-in-time tables derived from Satellites.

In this case, you must adhere to the following rules:

  1. Hub keys cannot be migrated to other Hubs, so as not to violate the flexibility and extensibility of the Data Vault modeling technique;
  2. Business keys and primary keys of the Hub never change;
  3. Hubs are connected only using entities of the Link type;
  4. The link must link at least two Hubs and can link with other entities of the Link type;
  5. Surrogate keys can be used in Hubs and Links, but not in Satellites;
  6. Hub Keys always migrate to Links and a child Satellite with contextual data;
  7. The satellite can be linked to Hubs and Links;
  8. The satellite always contains a Load Date Time Stamp or a numeric foreign key that refers to a stand-alone table with a sequence of timestamps, such as a calendar;
  9. If the Hub has 2 or more Satellites, you can create a point-in-time table for the convenience of join operations;
  10. Satellites record only changes without duplicating lines;
  11. Data is distributed according to the structure of the Satellites based on the type of information and the rate of its change.

Conclusion

Since historically DWHs evolved from relational DBMSs, they also inherited the design approaches adopted in this area. In particular, the use of 3NF - the third normal form, when each non-key attribute of the table should provide information about only the full key and nothing more. However, this application of this approach showed its following disadvantages:

To get around these shortcomings, in the second half of the 1980s, the star schema was used in data modeling, convenient for storing multidimensional indicators. Later they began to use the “star” scheme. This model is also characterized by a number of problems that complicate its application in modern QCD:

For this, the Data Vault method is now used as the most promising.

Sources

  1. https://ru.wikipedia.org/wiki/Схема_снежинки
  2. https://ru.wikipedia.org/wiki/Схема_звезды