The best data infrastructure for your company: Data Warehouse vs. Data Lake
As we’ve seen in a previous article, a fundamental step in defining the Data Strategy of your company is the choice of the central data management architecture.
This decision is influenced by the characteristics of the data you want to analyze, the peculiarities and business logic of their sources, and the goals of your Data Strategy.
Let’s examine the most popular architectures employed for data storage and management: Data Warehouses and Data Lakes. First, we need to understand what each infrastructure consists of and what advantages it offers, so that we can weigh the differences and decide which one best fits our needs.
Data Warehouse (DWH)
A Data Warehouse is an infrastructure that contains all enterprise data (i.e., data coming from operational sources such as ERP, CRM, SCM, website…), storing them in a clear and orderly manner after Extract, Transform and Load (ETL) procedures have been applied. In other words, the Data Warehouse is the beating heart of your enterprise data management system: it prepares data for business analysts and executives while preserving maximum data quality and historical depth.
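To make the ETL idea concrete, here is a minimal sketch in Python. It is an illustration under our own assumptions, not a production pipeline: the `raw_orders` rows, the `orders` table and the function names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical raw rows, as they might be exported from a CRM (illustrative only).
raw_orders = [
    {"order_id": "1001", "customer": " Alice ", "amount": "19.90", "date": "2023-01-05"},
    {"order_id": "1002", "customer": "BOB", "amount": "n/a", "date": "2023-01-06"},
    {"order_id": "1003", "customer": "carol", "amount": "42.00", "date": "2023-01-06"},
]


def extract():
    """Extract: read rows from the operational source (here, an in-memory list)."""
    return list(raw_orders)


def transform(rows):
    """Transform: cast types, normalize names, drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject rows whose amount cannot be parsed
        customer = row["customer"].strip().title()
        clean.append((int(row["order_id"]), customer, amount, row["date"]))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into a table with a fixed schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, "
        "amount REAL NOT NULL, order_date TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
load(conn, transform(extract()))

# Once loaded, the data is ready for simple analytical queries.
daily_totals = conn.execute(
    "SELECT order_date, SUM(amount) FROM orders "
    "GROUP BY order_date ORDER BY order_date"
).fetchall()
print(daily_totals)
```

The point of the sketch is the ordering: the row with an unparseable amount is filtered out before loading, so analysts querying `orders` only ever see validated data.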
Data Lake
A Data Lake is a data repository suited to storing heterogeneous data, whether structured, semi-structured or unstructured, in multiple formats. Like the Data Warehouse, it ideally gathers all enterprise data, with the fundamental difference that a Data Lake can hold the original data, with no need for ETL procedures or other transformations beforehand.
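By contrast, landing data in a lake can be sketched as simply writing each payload as-is. The example below is a toy illustration under our own assumptions: a temporary local directory plays the role of the lake (real deployments typically use object storage), and the `land_raw` helper, the source names and the payloads are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Write the payload byte-for-byte under <lake_root>/<source>/<name>."""
    target_dir = lake_root / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no parsing, cleaning, or schema check
    return target


# A temporary directory stands in for the lake's storage layer.
lake = Path(tempfile.mkdtemp(prefix="lake_"))

# Hypothetical payloads of three different shapes: tabular, JSON, free-form log.
land_raw(lake, "crm", "customers.csv", b"id,name\n1,Alice\n2,Bob\n")
land_raw(
    lake,
    "social",
    "post_001.json",
    json.dumps({"user": "alice", "text": "Great product!"}).encode("utf-8"),
)
land_raw(lake, "iot", "sensor_17.log", b"2023-01-05T10:00:00Z temp=21.4\n")

stored = sorted(p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file())
print(stored)
```

Nothing is validated or reshaped on the way in, which is what makes ingestion fast and flexible, and also why the interpretation work is deferred to whoever reads the files later.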
Which one to choose?
To understand which architecture best fits our needs, we have to carefully evaluate the advantages and disadvantages of each infrastructure. It is important to mention that this is not an either-or choice, as a firm could implement both architectures in order to meet different needs.
Generally speaking, a Data Warehouse contains structured data, that is, data that can be represented in tabular format. This kind of data typically comes from operational sources and needs to be “cleaned” before entering the DWH.
Conversely, in a Data Lake we can put data coming from heterogeneous sources such as IoT devices, websites, social media, mobile apps… which do not necessarily have a defined structure (think of a tweet or a picture). In other words, a Data Warehouse holds only processed and cleaned data, which can be presented in tabular format and are ready to be analyzed by business analysts, while a Data Lake also incorporates a huge amount of “raw” and unstructured data, which need to be processed by data specialists. The major risk of a Data Lake is that it may become a Data Swamp, filled with a hoard of unstructured and raw data that are of little use without the expertise of highly trained data scientists and engineers.
Nevertheless, a Data Lake, if properly managed, can be much more useful for Machine Learning, Predictive Analytics and Data Mining activities, thanks to its flexibility and its ability to store any kind of data in its original format. Moreover, despite requiring more expertise, this architecture is generally less expensive and quicker to implement.
On the other hand, a Data Warehouse is better in terms of query performance, data quality and availability. The data stored in a DWH doesn’t need much manipulation and is ready to be used by analysts from every part of the organization, although building the architecture requires a careful ex-ante evaluation and is usually more costly.
Below is a comprehensive outline of the general differences between a Data Warehouse and a Data Lake.
Aspect | Data Warehouse | Data Lake |
Data | Structured, ordered, cleaned | Heterogeneous: structured and unstructured, “raw” |
Design | Ex-ante (“Schema-on-write”) | Ex-post (“Schema-on-read”) |
Final Users | Analysts, Executives | Data Scientists, Developers, Analysts |
Best for | Reporting, Business Intelligence, Visualization | Machine Learning, Predictive Analytics, Data Mining |
Data Quality | Very high | Not necessarily high |
Implementation and maintenance costs | Relatively high | Relatively low |
Expertise required | Relatively low | Relatively high |
Ability to engage more final users | Relatively high | Relatively low |
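The “Schema-on-write” versus “Schema-on-read” row of the table can be illustrated with a small Python sketch. It is only a toy model under our own assumptions: the `events` records, the `purchases` table and the `read_amounts` function are hypothetical, SQLite stands in for the warehouse, and a plain list of raw lines stands in for the lake.

```python
import json
import sqlite3

# Hypothetical events; the second one lacks the "amount" field.
events = [
    '{"customer": "alice", "amount": 19.9}',
    '{"customer": "bob"}',
    '{"customer": "carol", "amount": 42.0, "coupon": "SPRING"}',
]

# Schema-on-write: the structure is fixed up front and enforced on insert.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT NOT NULL, amount REAL NOT NULL)")
rejected = []
for line in events:
    record = json.loads(line)
    try:
        conn.execute(
            "INSERT INTO purchases VALUES (?, ?)",
            (record.get("customer"), record.get("amount")),
        )
    except sqlite3.IntegrityError:
        rejected.append(record)  # violates NOT NULL, so it never enters the table

warehouse_rows = conn.execute(
    "SELECT customer, amount FROM purchases ORDER BY customer"
).fetchall()

# Schema-on-read: every raw line is kept; structure is applied only when reading.
lake_lines = list(events)


def read_amounts(lines):
    """One possible reading of the raw data: missing amounts default to 0.0."""
    return [(r["customer"], r.get("amount", 0.0)) for r in map(json.loads, lines)]


print(warehouse_rows)
print(rejected)
print(read_amounts(lake_lines))
```

In the first half, the incomplete record is rejected at write time, which protects data quality but requires deciding the schema in advance. In the second half, nothing is lost, but each reader has to decide how to handle missing or extra fields, which is where the additional expertise comes in.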
In short, a Data Warehouse is preferable when we deal with structured data and our firm lacks a strong “data culture” with highly trained experts. A Data Lake, on the other hand, is preferable when our data changes quickly, comes in large volumes and is often unstructured.
Ultimately, we would suggest, resources permitting, building a Data Warehouse for Business Intelligence activities that use “traditional” data (such as sales, customers, items…) and relying on a Data Lake to store unstructured data (emails, tweets, images, documents…) needed for Machine Learning, Predictive Analytics and Data Mining activities.