
Lakehouse architecture software

With the fast-moving evolution of the data lake, Billy Bosworth and Ali Ghodsi share their mutual thoughts on the top 5 common questions they get asked about data warehouses, data lakes, and Lakehouses. Coming from different backgrounds, they each provide unique and valuable insights into this market. Ali has spent more than 10 years at the forefront of research into distributed data management systems, is an adjunct professor at UC Berkeley, and is the cofounder and now CEO of Databricks. Billy has spent 30 years in the world of data as a developer, database administrator, and author; has served as CEO and senior executive at software companies specializing in databases; has served on public company boards; and is currently the CEO of Dremio.

Let's start with one good thing before we get to the problems. Data lakes enabled enterprises to capture all their data – video, audio, logs – not just relational data, and they did so in a cheap and open way. Today, thanks to this, the vast majority of data, especially in the cloud, is in data lakes. Because they're based on open formats and standards (e.g. Parquet and ORC), there is also a vast ecosystem of tools, often open source (e.g. TensorFlow, PyTorch), that can operate directly on these data lakes. But at some point, just collecting data for the sake of collecting it is not useful; nobody cares how many petabytes you've collected, only what you have done for the business and what business value you provided.

It turned out it was hard to provide business value because the data lakes often became data swamps. First, it was hard to guarantee the quality of the data because data was just dumped into the lake. Second, it was hard to govern because a data lake is a file store, and reasoning about data security is hard if the only thing you see is files. Third, it was hard to get performance because the data layout might not be organized for it – e.g. millions of tiny comma-separated files (CSVs).

All technologies evolve, so rather than think about "what went wrong," I think it's more useful to understand what the first iterations were like. First, there was a high correlation between the words "data lake" and "Hadoop." This was an understandable association, but the capabilities now available in data lake architectures are much more advanced and easier to use than anything we saw in the on-prem Hadoop ecosystem. Second, data lakes became more like swamps, where data just sat and accumulated without delivering real insight to the business. I think this happened due to overly complex on-premises ecosystems without the right technology to let data consumers seamlessly and quickly get the insights they needed directly from the data in the lake. Finally, like any new technology, the data lake lacked some of the mature aspects of databases, such as robust governance and security. A lot has changed, especially in the past couple of years, but those seem to be some of the common early issues.

What do you see as the biggest changes in the last several years to overcome some of those challenges?

A de facto upstream architecture decision is what really got the ball rolling. In the past few years, application developers simply took the easiest path to storing their large datasets, which was to dump them in cloud storage. Cheap, infinitely scalable, and extremely easy to use, cloud storage became the default choice for landing the cloud-scale data coming out of web and IoT applications. That massive accumulation of data pushed the innovation that was necessary to access the data directly where it lived, rather than trying to keep up with copies in traditional databases. Today, we have a rich set of capabilities that deliver things previously only possible in relational data warehouses.

The big technological breakthrough came around 2017, when three projects simultaneously enabled building warehousing-like capabilities directly on the data lake: Delta Lake, Hudi, and Iceberg. They brought structure, reliability, and performance to the massive datasets sitting in data lakes. It started with enabling ACID transactions, but soon went beyond that to performance, indexing, security, and more. This breakthrough was so profound that it was published in top academic conferences (VLDB, CIDR, etc.).

Why use another new term, "Lakehouse," to describe data lakes?

Because they're so radically different from data lakes that it warrants a different term. Data lakes tend to become data swamps for the three reasons I mentioned earlier, and we don't want to encourage more of that – it's not good for enterprises.
