By Lokhmatikov Sergey splokhmatikov@edu.hse.ru
Nowadays, with the rise of the digital age and deep-learning-powered applications, data is being generated at an astonishing rate. The growth of technologies such as social media, online shops, and mobile devices has led to an explosion of data from various sources. According to IDC research, the global data volume is expected to reach 175 zettabytes by 2025 [1], an immense increase from the 33 zettabytes of 2018.
Businesses today have to tackle the problem of processing and analyzing these vast volumes of data, since decisions based on big-data analysis can be extremely beneficial and provide a necessary advantage over competitors. To deal with this problem, Data Warehousing (DWH) is used. This tool has become crucial for managing and analyzing immense datasets. A data warehouse provides a repository where data from different sources is collected and stored, and it also implements tools for efficient querying and data analysis.
In order to move data into warehouses and process it, it is common to use one of two main frameworks: Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT). ETL involves extracting data from source systems, transforming it to fit analytical needs, and then loading it into the data warehouse (for example, a ClickHouse database). ELT, on the other hand, loads raw data into the warehouse first and performs transformations there. The choice between ETL and ELT can significantly impact the performance and efficiency of data processing. So which approach is more beneficial after all? What key factors should developers consider when choosing between the frameworks?
History
The concept of ETL systems has been around for several decades. However, it should be understood as a general architecture or framework rather than a specific product or implementation. Before the early 1990s, companies that needed such a solution most often developed their own ETL pipelines. These solutions were limited to a specific organization, lacked scalability and compatibility, and required a lot of effort to code. By the mid-1990s, software companies such as Microsoft began offering enterprise ETL tools and platforms with user-friendly interfaces for designing data workflows [2]. This shift led to broad adoption of ETL systems and allowed smaller companies to take up DWH-related practices. Nowadays ETL has become a crucial instrument of business intelligence, used for identifying errors within companies, discovering new opportunities, and overall company management [3].
Extract, Load, Transform (ELT) systems emerged slightly later and started to gain popularity around the 2010s. The rapid growth of data storage and processing capabilities resulted in a shift to the ELT architecture in enterprise DWH systems. Powerful OLAP and OLTP databases, the Hadoop Distributed File System, and other tools provided the necessary infrastructure to store large volumes of raw data and perform batch processing within the target system [4]. This architecture allowed companies to handle big data more effectively, using the computational capabilities of modern DWHs to transform data just in time. ELT became most relevant in the field of real-time analytics, where flexibility and scalability are crucial.
ETL
In describing the algorithms here and below, let us abstract from specific tools and their implementations, since our aim is to compare architectures and approaches to handling vast amounts of data.
ETL stands for “Extract, Transform, Load”: the three stages of dealing with data. According to [5], here is a brief description of each step:
1. Extract: Data is collected from multiple heterogeneous sources such as OLTP databases, CRM systems, and files on disk [6]. The extraction process is adapted to different data formats in order to maintain broad format support.
2. Transform: The extracted data undergoes transformation to meet the target system's requirements: cleaning, normalization, aggregation, and enrichment. All transformations are aimed at a single target: making the data from the various sources easier to merge into a harmonious whole.
3. Load: The transformed data is then loaded into the target data warehouse or database. The loading process ensures data integrity and optimizes loading performance to handle large volumes efficiently.
The ETL process is typically executed in batch mode, often during off-peak hours, to minimize the impact on source systems and networks. After all the process steps, developers and analysts end up with aggregated, cleaned, clustered, and indexed data ready for efficient querying.
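The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the in-memory list stands in for a real OLTP or CRM source, and an SQLite database stands in for the target warehouse; all names are hypothetical.

```python
import sqlite3

# --- Extract: pull raw rows from a (hypothetical) source system ---
raw_orders = [
    {"id": 1, "customer": " Alice ", "amount": "120.50", "country": "de"},
    {"id": 2, "customer": "Bob",     "amount": "80.00",  "country": "DE"},
    {"id": 3, "customer": "Carol",   "amount": None,     "country": "fr"},  # dirty row
]

# --- Transform: clean, normalize, and validate BEFORE loading,
# so only consistent records ever reach the warehouse ---
def transform(rows):
    clean = []
    for r in rows:
        if r["amount"] is None:          # drop invalid records
            continue
        clean.append({
            "id": r["id"],
            "customer": r["customer"].strip(),      # trim whitespace
            "amount": float(r["amount"]),           # cast to numeric
            "country": r["country"].upper(),        # normalize codes
        })
    return clean

# --- Load: insert only the validated rows into the target table ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (:id, :customer, :amount, :country)",
    transform(raw_orders),
)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 200.5
```

Note that the warehouse never sees the dirty third row: in ETL, validation happens in the middle stage, outside the target system.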
ELT
In ELT, the traditional steps of ETL are rearranged to fit better with modern data warehousing needs [7]:
1. Extract: This part is much the same as in ETL. Data is taken from its source, such as databases or applications, without any changes yet. It is captured in its “raw” form, exactly as it was in the source.
2. Load: Once the raw data is extracted, it is loaded directly into the data warehouse, without any cleaning or modifications. This stage lets businesses get data into the system quickly and manage huge amounts of information. The idea is that, instead of slowing down to process the data first, you keep everything in its natural state and handle transformations later.
3. Transform: In the ELT process, the real processing of data, such as cleaning or aggregating, happens inside the data warehouse after everything is already loaded. The data warehouse itself usually has plenty of processing power to handle complex calculations and transformations on large datasets. This setup makes it easier to process data for specific analyses without waiting for everything to be transformed beforehand.
ELT is especially useful for modern data warehousing because of the growth in cloud technology, where there’s a lot of computational power. It lets analysts and data scientists work with data directly in the warehouse, doing things like filtering and organizing the data only when they need it.
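The same toy pipeline can be rearranged into ELT form to show the contrast. Again this is only a sketch with assumed names and SQLite standing in for a warehouse: the raw rows, dirty values and all, land in a staging table first, and the cleaning is expressed as SQL that runs inside the target system.

```python
import sqlite3

raw_orders = [
    (1, " Alice ", "120.50", "de"),
    (2, "Bob",     "80.00",  "DE"),
    (3, "Carol",   None,     "fr"),  # dirty row is loaded as-is
]

conn = sqlite3.connect(":memory:")
# --- Load: raw data goes straight into a staging table, no cleaning at all ---
conn.execute("CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount TEXT, country TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# --- Transform: cleaning and casting happen inside the warehouse, in SQL,
# only when an analysis actually needs the curated table ---
conn.execute("""
    CREATE TABLE orders AS
    SELECT id,
           TRIM(customer)       AS customer,
           CAST(amount AS REAL) AS amount,
           UPPER(country)       AS country
    FROM raw_orders
    WHERE amount IS NOT NULL
""")
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 200.5
```

The end result is the same curated table as in the ETL sketch, but the raw data remains available in `raw_orders`, so a different analysis can later derive its own view of it without another extraction run.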
Comparison
The comparison of the ETL and ELT architectures has gone through multiple stages. There is no simple answer as to which is better, but it is worth looking through the strengths and weaknesses of each for a more specific comparison [7] [8] [9]:
Table 1. Benefits and drawbacks of ETL and ELT architecture.
As we can see, ETL and ELT are each suited to different use cases depending on an organisation's specific needs. ETL is the traditional approach; it is particularly beneficial for companies that need strict data quality and consistency before analysis, as the transformation phase occurs before loading, ensuring that only clean, validated data enters the warehouse. It is ideal for data pipelines with complex, predefined transformations.
ELT, on the other hand, loads raw data into the data warehouse before any transformation takes place. This architecture benefits companies working with incremental and iterative methodologies. For example, a cross-functional agile development team would benefit from ELT DWH pipelines, since they feature on-demand data processing even on large datasets. Because the transformation is done within the data warehouse, modern cloud-based enterprise solutions are highly advisable and will be more cost-effective than traditional ETL tools. ELT is best for companies looking for flexibility, as business rules and transformations can be modified without reloading large volumes of data.
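One way to picture this flexibility is to express the transformation as a view over the raw table, so that a changed business rule only redefines the view and never touches the loaded data. This is a minimal SQLite sketch; the table, the `significant_orders` rule, and the thresholds are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(1, 120.5), (2, 80.0), (3, 5.0)])

# Business rule v1: orders above 10 count as "significant"
conn.execute("CREATE VIEW significant_orders AS "
             "SELECT * FROM raw_orders WHERE amount > 10")
v1 = conn.execute("SELECT COUNT(*) FROM significant_orders").fetchone()[0]  # 2

# The rule changes: the threshold is raised to 100. Only the view is
# redefined; the raw data is never re-extracted or reloaded.
conn.execute("DROP VIEW significant_orders")
conn.execute("CREATE VIEW significant_orders AS "
             "SELECT * FROM raw_orders WHERE amount > 100")
v2 = conn.execute("SELECT COUNT(*) FROM significant_orders").fetchone()[0]  # 1
print(v1, v2)
```

In an ETL pipeline, the same rule change would typically require rerunning the transformation over the source data and reloading the result.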
All in all, a company should go for ETL when data quality is a top priority and the data has to go through complex transformations before any business decisions can be based on it. ELT, however, is recommended when flexibility, speed, and scalability are in high demand, allowing businesses to handle large data volumes more effectively within the storage system.
In today's world of exponentially growing data flows, businesses have to deal with enormous amounts of data generated every second, and organizations often acquire DWHs in order to store, process, and transform all related data. As the traditional method of delivering data to a warehouse, ETL is preferable when data quality and consistency are top priorities. It helps maintain high-quality validation and verification of datasets, which is crucial for reliable analysis. However, it might not be as flexible and scalable when dealing with very large and rapidly changing datasets or in real-time environments [10].
On the other hand, ELT takes advantage of modern DWH computational power. By loading raw data first and transforming it within the system, ELT provides greater flexibility and scalability. This approach is more suitable for businesses that need to process large datasets quickly and adapt to rapidly changing requirements. Real-time analytics and agile development benefit significantly from the ELT data-delivery framework. However, it may pose challenges in terms of data governance and security due to storing unprocessed data.
In the end, there is no universal go-to solution. The choice between ETL and ELT when designing a DWH architecture depends on various factors, including the specific needs and capabilities of the organization. Companies should weigh their data requirements and restrictions, considering factors like data volume, processing speed, required data quality, and scalability. By selecting the right approach, businesses can make relevant decisions more easily, conduct A/B experiments [11], and maintain a necessary advantage over their competitors.
1. Rydning, D.R.J.G.J., Reinsel, J. and Gantz, J., 2018. The digitization of the world from edge to core. _Framingham: International Data Corporation_, _16_.
2. Simitsis, A., Skiadopoulos, S. and Vassiliadis, P., 2023, March. The History, Present, and Future of ETL Technology. In _DOLAP_.
3. Titirisca, A., 2013. ETL as a Necessity for Business Architectures. _Database Systems Journal_, _4_(2).
4. ChatGPT-4o, 2024. “When did ELT appear?” https://chatgpt.com
5. Seenivasan, D., 2022. ETL vs ELT: Choosing the right approach for your data warehouse. _International Journal for Research Trends and Innovation (www.ijrti.org), ISSN_.
6. Tayade, D.M., 2019. Comparative study of ETL and E-LT in data warehousing. _Int. Res. J. Eng. Technol_, _6_.
7. Zode, M., The Evolution of ETL. _Retrieved on_, _6_(06).
8. Rambabu, V.P., Althati, C. and Selvaraj, A., 2023. ETL vs. ELT: Optimizing Data Integration for Retail and Insurance Analytics. _Journal of Computational Intelligence and Robotics_, _3_(1).
9. Macura, M., 2014. Integration of data from heterogeneous sources using ETL technology. _Computer Science_, _15_.
10. Sabtu, A., Mohd Azmi, N.F., Amir Sjarif, N.N., Ismail, S.A., Mohd Yusop, O., Sarkan, H. and Chuprat, S., 2017. The challenges of Extract, Transform and Load (ETL) for data integration in near real-time environment. _Journal of Theoretical & Applied Information Technology_, _95_(22).
11. Mantri, A., 2022. Role of Data Engineering ELT/ETL in A/B Testing Experimentation. _Journal of Scientific and Engineering Research_, _9_(2). https://doi.org/10.5281/zenodo.12798426