
The Data Table Format Wars

by Shawn Gordon, July 4th, 2022

Too Long; Didn't Read

If you are considering a data lake, you're going to want to know about the table formats that give you database-like functionality on top of it. I talk about them here.

As I write this on June 29, 2022, the “Data + AI Summit” is on its last day in San Francisco. I’d been thinking about writing on this topic for nearly a year, but what finally prompted me was the announcement from Databricks that it will open source all Delta Lake APIs as part of the Delta Lake 2.0 release. Databricks also announced that it will contribute all enhancements of Delta Lake to The Linux Foundation. That brings us to the “Table Format Wars” subject of this article, where we will look at Delta Lake, Hudi (created at Uber), and Iceberg (created at Netflix). For the sake of this article, I’m going to assume you know what the data lakehouse is and how we got here. If you are unfamiliar, it is worth reading up on that first.

Background

In the long long ago, in the before time, we had databases, where compute, storage, security, indexing, and all that good stuff lived in one place. As computing needs advanced at a crazy rate, software was developed to address those needs. Soon we were dumping structured, semi-structured, and unstructured data (audio, video, images) into storage like AWS S3, and we needed a way to understand the structure of what was there. This gave rise to catalogs such as the ____ or AWS Glue that describe the data. That was a start, but we also needed to work with the data in the files as we would in a database, inserting, deleting, and updating records, and that gave rise to the Hive table format. It was quite primitive compared to the options available today and, if I recall correctly, didn’t work with a metastore yet, so users had to know the layouts of the files.

Today

Databricks built and released Delta Lake as a table format in April 2019, but the open-source version was pretty limited, and Databricks was really the only company doing any serious work on it. That’s what made its announcement about open-sourcing all Delta Lake APIs so interesting. I read it as a bit of a panic move, as Iceberg is getting so much vendor adoption and so many vendor contributors. Iceberg is also the newer kid on the block, open-sourced in May 2020. Hudi started life at Uber in 2016, was open-sourced in 2017, and is used at some dang big companies other than Uber, like the Robinhood trading platform, Amazon, ByteDance (the creator of TikTok), and many others.

Features

These table formats generally handle upserts, deletes, etc., using one of two methods:

Copy on Write (CoW)
Merge on Read (MoR)
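
To make the difference between the two methods concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions: a dict stands in for a data file, and the class names `CopyOnWriteTable` and `MergeOnReadTable` are hypothetical, not taken from Delta Lake, Hudi, or Iceberg. The real formats work on Parquet/ORC/Avro files with far more machinery.

```python
from dataclasses import dataclass, field


@dataclass
class CopyOnWriteTable:
    """Toy CoW table: every upsert rewrites the whole base 'file'."""
    base: dict = field(default_factory=dict)  # stands in for a data file
    files_rewritten: int = 0

    def upsert(self, key, row):
        new_base = dict(self.base)  # copy the existing file...
        new_base[key] = row         # ...apply the change to the copy...
        self.base = new_base        # ...and swap in the new version
        self.files_rewritten += 1

    def read(self):
        return dict(self.base)      # plain scan, nothing to merge


@dataclass
class MergeOnReadTable:
    """Toy MoR table: upserts go to a change log, readers merge it."""
    base: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # stands in for delta/log files

    def upsert(self, key, row):
        self.log.append((key, row))  # cheap append, base file untouched

    def read(self):
        merged = dict(self.base)
        for key, row in self.log:    # replay changes in arrival order
            merged[key] = row
        return merged

    def compact(self):
        """Fold the log into the base so later reads have less to merge."""
        self.base = self.read()
        self.log = []


cow, mor = CopyOnWriteTable(), MergeOnReadTable()
for table in (cow, mor):
    table.upsert(1, {"name": "alice"})
    table.upsert(1, {"name": "alicia"})  # update to an existing key
    table.upsert(2, {"name": "bob"})

# Both tables answer queries identically; they differ in where the work happens.
assert cow.read() == mor.read()
```

In this sketch the CoW table pays the cost at write time (three full rewrites), while the MoR table pays it at read time until `compact()` is called. That trade-off is the reason the choice between the two depends on your workload.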

Keep in mind that these changes to the files come in as change logs, which means queries have to resolve the latest version of each record, and earlier versions can be used for time travel. Time travel is a cool feature of this configuration, but I’m not going to address it in this article. MoR tends to be faster than CoW, particularly for writes, but this is a pretty detailed topic that you should research in-depth. Let’s do a high-level comparison of the three table formats to get you started. For the below grid, I’m borrowing some research from a webinar.

Feature Overview	Delta Lake	Hudi	Iceberg
ACID Transactions	Yes	Yes	Yes
Partition Evolution	No	No	Yes
Schema Evolution	Partial	Partial	Yes
Time Travel	Yes	Yes	Yes
File Formats Supported	Parquet	ORC, Parquet	Avro, ORC, Parquet
Schema Evolution
Add Column	Yes	Yes	Yes
Drop Column	No	Yes w/Spark	Yes
Rename Column	Yes	Yes w/Spark	Yes
Update Column	Yes	Yes w/Spark	Yes
Reorder Column	Yes	Yes w/Spark	Yes
Change partitioning w/out rewriting table	No	No	Yes
Use transforms of columns to specify partitions	Partial	No	Yes
Require understanding of table partitioning	Yes	No	Yes
File Pruning	Yes	Yes	Yes
Read Support
	Yes	Yes	Yes
	Yes	No	No
	No	Yes	No
	Yes	No	No
	No	No	Yes
	Yes	Yes	Yes
	Yes	Yes	Yes
	No	Yes	Yes
	Yes	Yes	Yes
	Yes	Yes	No
	Yes	No	Yes
	Yes	No	Yes
	Yes	Yes	Yes
	Yes	Yes	Yes
Write Support
	No	No	Yes
	Yes	Yes	Yes
	No	Yes	Yes
	Yes	No	No
	No	No	Yes
	No	No	Yes
	Yes	Yes	Yes
	Yes	No	Yes
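
Two rows in the grid above, "Change partitioning w/out rewriting table" and "Use transforms of columns to specify partitions", are easier to picture with a small example. The following is a toy Python sketch of the general idea only, not any format's actual API: the `EvolvingTable` class, the `month` and `day` transforms, and the `event_date` column are all hypothetical names made up for illustration. The assumption is that each data file remembers which partition spec it was written under, so older files never need rewriting when the spec changes.

```python
from datetime import date


def month(d: date) -> str:
    """Partition transform: derive a month bucket from a date column."""
    return d.strftime("%Y-%m")


def day(d: date) -> str:
    """Partition transform: derive a day bucket from a date column."""
    return d.strftime("%Y-%m-%d")


class EvolvingTable:
    def __init__(self, transform):
        self.specs = [transform]  # history of partition specs
        self.files = []           # (spec_id, partition_value, rows)

    def evolve(self, transform):
        """Change partitioning going forward; existing files stay as they are."""
        self.specs.append(transform)

    def append(self, rows):
        spec_id = len(self.specs) - 1
        transform = self.specs[spec_id]
        groups = {}
        for row in rows:
            groups.setdefault(transform(row["event_date"]), []).append(row)
        for value, grouped in groups.items():
            self.files.append((spec_id, value, grouped))

    def rows_on(self, wanted: date):
        """Return (matching rows, number of files opened) for one date."""
        hits, files_opened = [], 0
        for spec_id, value, rows in self.files:
            # Apply the transform the file was written with to the predicate.
            if self.specs[spec_id](wanted) != value:
                continue  # file pruned without being opened
            files_opened += 1
            hits.extend(r for r in rows if r["event_date"] == wanted)
        return hits, files_opened


table = EvolvingTable(month)
table.append([{"id": 1, "event_date": date(2022, 5, 3)},
              {"id": 2, "event_date": date(2022, 5, 20)}])
table.evolve(day)  # new data is partitioned by day, old file is untouched
table.append([{"id": 3, "event_date": date(2022, 6, 1)},
              {"id": 4, "event_date": date(2022, 6, 2)}])

hits, files_opened = table.rows_on(date(2022, 6, 2))
print(hits, files_opened)
```

Note that the query only mentions `event_date`; the caller never has to know whether a file was partitioned by month or by day, which is the point of deriving partitions from column transforms.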

Observations

Part of the reason this ecosystem evolved was performance, but part of it had to do with how cloud providers charge for their platforms. Storage is much cheaper than compute, so if you can query your raw storage directly without putting the data into a conventional database, you’ll reduce your costs. By NOT putting it in a database of some sort, we’ve had to develop file formats like Parquet, catalog systems that describe those files so they present as a schema, query engines, security plug-ins, table formats to deal with transactions, indexing, and more. The appeal of systems like Redshift, Snowflake, Yugabyte, CockroachDB, and others is that all those things tend to be built in, just like the databases of old. Yes, there is a lot of flexibility in the data lake scenario, as you can use the bits and pieces that best suit your situation. But imagine if Amazon were to suddenly change its pricing policies with Graviton 5 (I’m just throwing out an idea here) because compute had become so cheap to provide that they decided to price compute below storage and really drop egress fees. You could see a massive collapse of a segment of the tech sector. Kind of a scary thought.

Summary

The day before I started this article, DataBeans published a comparison of Delta, Hudi, and Iceberg in which Delta comes out on top. The next day, Onehouse, a commercialization of Hudi by the Hudi developers as I understand it, published its own comparison between Delta and Hudi. Both provide details and access to the files so you can try them yourself. I think Onehouse makes some good arguments, and it is odd that the DataBeans article is not attributed to any writers and that it came out at the same time as the Delta announcement.

The tests performed were ones that would show Delta in a better light with default configurations. It all seems just a little too convenient; it reads more like a paid placement than an organic comparison, but I could be wrong (this article is not paid for; I did it on my own time and dime). Delta and Hudi both have broad adoption, the former because it is part of the Databricks product and the latter because it was first to market. Even so, I think Iceberg is likely to be the eventual winner in this space based on the commercial support that I’m seeing, though you never know how the market can change from some unexpected innovation.

As to what is best for your environment, that’s going to depend on your specific needs; I’m simply talking about tech adoption. The nice thing about open source is that it isn’t company-dependent, so you can keep using the tech regardless of whether the commercial company that sold you support continues to exist.
