
A decade ago, the first generation systems started to face several challenges. First, they typically coupled compute and storage into an on-premises appliance. This forced enterprises to provision and pay for the peak of user load and data under management, which became very costly as datasets grew. Most EDW platforms preferred selling appliances over software-only licenses: for some the main driver behind it was custom hardware (Netezza), for others the potential extra price margin.

#LAKEHOUSE ARCHITECTURE DRIVER#

However, “pay for the peak of user load” holds true for all on-premises software deployments, not only EDW ones.

#LAKEHOUSE ARCHITECTURE SOFTWARE#

It is obvious that cloud transformation also made an impact on the EDW market. The push for the “pay-as-you-go” approach was first introduced by the cloud vendors to incentivize cloud transformation of traditional businesses. Cloud vendors introduced their own EDW solutions, either by purchasing and adapting already available ones (ParAccel into AWS Redshift), developing a totally new one (Microsoft’s Parallel Data Warehouse), or porting an internal solution to an external use case (Google’s Dremel into BigQuery). All of these were designed to fit the cloud deployment model out of the box, allowing clients to take advantage of the same “pay-as-you-go” model. Traditional EDW vendors added a cloud deployment option to their solutions. However, cloud integration is not simple: solutions designed for on-premises hardware with directly attached storage do not map well onto a cloud world with independent storage and compute and different tiers in both. As a result, these ported solutions fell behind the cloud vendors’ native offerings for a long time. New cloud-native EDW offerings have also emerged: Snowflake, Databricks Data Lakehouse, Cloudera. It is not entirely fair to put them in the same line, as each has its own specific background, but they share one common point – they are designed on top of elastic-by-design platforms that adjust very well to cloud deployment.

So the push from “pay for the peak” to “pay as you go” in the EDW space was driven by the general cloud transformation trend, not by “pay for the peak” becoming very costly for data warehouses. The cloud is ultimately just someone else’s computer, and even in the cloud the “pay as you go” approach has its limits: it works well when you run at small scale, but the larger your EDW, the more likely you are to need a custom agreement with the cloud vendor to guarantee that you can scale up when you need to. And at very large scale you will be asked to provision for the peak anyway.

Second, not only were datasets growing rapidly, but more and more datasets were completely unstructured, e.g., video, audio, and text documents, which data warehouses could not store and query at all. I was always skeptical about the hype around BigData and “unstructured data”. You can read more details in my older post, but let me briefly reiterate how it happened: Google published a set of papers on how it solves data processing at scale. Yahoo and Facebook developed a set of open-source solutions based on these papers, as internally they faced the same problem of data processing at scale. Companies that packaged these open-source solutions and sold them to traditional businesses under the kool-aid of “be like Big Tech” emerged en masse. And to drive the sales, marketing specialists invented the mythical BigData and “unstructured data”. Only the lazy one didn’t try to sell Hadoop at least once, right?

Ultimately, these were just a placeholder for the lack of actual use cases for Hadoop technology in the traditional enterprise. Data warehouses do not need to query “video, audio, and text documents” kind of information, and other systems do not need to query it either, because it is raw information not suitable for direct querying. To process this kind of information, you need to develop and deploy data processing pipelines that extract properly structured data from it (image recognition, sentiment analysis, etc.). Hadoop is a good framework that simplifies deploying such a pipeline at large scale. The extracted data can then be placed into the EDW and used for analysis. But this data merely complements the EDW; it does not replace or extend it.
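To make the pipeline idea concrete, here is a minimal, purely illustrative sketch in Python. The naive word-list scoring stands in for a real sentiment model (or a Hadoop/Spark job at scale), and the CSV output stands in for whatever bulk-load format your warehouse ingests; the directory, file, and column names are hypothetical.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Naive word lists: a stand-in for a real sentiment model or a
# large-scale Hadoop/Spark job that would do the heavy lifting.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "angry"}


def sentiment_score(text: str) -> float:
    """Score text in [-1, 1] using a trivial word-count heuristic."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def extract_to_warehouse_format(raw_dir: Path, out_csv: Path) -> None:
    """Turn raw text documents into structured rows a warehouse can bulk-load."""
    with out_csv.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["document_id", "processed_at", "sentiment_score", "word_count"])
        for doc in sorted(raw_dir.glob("*.txt")):
            text = doc.read_text(errors="ignore")
            writer.writerow([
                doc.stem,
                datetime.now(timezone.utc).isoformat(),
                round(sentiment_score(text), 3),
                len(text.split()),
            ])


if __name__ == "__main__":
    # Hypothetical locations: raw documents in, warehouse-loadable CSV out.
    extract_to_warehouse_format(Path("raw_documents"), Path("document_sentiment.csv"))
```

The point is the shape of the flow, not the specifics: raw documents go in, relational rows come out, and only those structured rows ever land in the warehouse.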