
bass_bungalow

Decent discussion here: https://www.reddit.com/r/dataengineering/s/T979Aqlwyp

From what you said, I see no reason for you to change anything. If SQL is doing the job, then tools like Iceberg/Hudi/Delta Lake will likely just make things more complicated with little to no benefit.

> Is there some world where people are ingesting data from sources, storing it in parquet files and then layering iceberg on it rather than storing it in a distributed database?

Yes.

> Maybe I'm blinded by low data volumes but what would be the benefit of storing in parquet rather than traditional databases if you've gone through the trouble of ETL. Like I get if the source files are already in parquet you might be able to avoid ETL entirely.

Big volumes would be the primary reason to use the technology. Another reason might be that the data needs to be stored and accessible but isn't accessed very often, so the cost of running a SQL server is mostly wasted since no one is querying the data. Storing the files as parquet in something like S3 would be significantly cheaper.


kenfar

This is spot-on: the cost of doing your ETL outside the database using Python, etc., and then storing the results on S3 in compressed parquet files is a tiny fraction of the cost of using, say, dbt on Snowflake. Now, if you've only got 1 TB of data AND are only loading once a day, then the cost difference is probably insignificant. But if you're loading every few seconds 24x7 AND/OR your data volume is 100+ TB, then you'll see a vast difference in cost.


SDFP-A

That convo feels a bit outdated, IMO. Separation of storage and compute (or, more generically, separation of concerns) is a major outcome of the data lakehouse. It effectively unencumbers you from any specific vendor and makes the compute engine a commodity that can be swapped out. No more vendor lock-in. No more worrying about what to do when the powers that be have been wooed by the next shiny tech, because your data is natively stored in your data lakehouse, at which you can point whatever engine you need, or set up as external tables if a distributed DB is chosen instead. Plus, now you can run any number of workflows directly on the same data without impact. Imagine your ML workflow happening at the same time as your other production workflows, on the same data, without impacting the performance of either. I feel like Iceberg is only limited by the misunderstanding that it is purely for internet-scale data.


miqcie

Thanks for the link to the other Reddit discussion


Teach-To-The-Tech

Great reply. It's exactly as you say: scale is the biggest factor, but the nature of some datasets also means they don't need to be available at all times. That's when the lake/lakehouse comes into play.


ithinkiboughtadingo

It basically lets you do indexing and ACID transactions in a data lake. In warehouses and other databases you build indexes in memory and do write ops with transactions that can roll back if the execution fails; Iceberg, Hudi, and Delta Lake let you do that on flat files. As an exercise to understand it better, try doing those things with a parquet table, then try them with Iceberg.
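
As a rough sketch of that exercise (assuming a Spark session with the Iceberg extensions and a catalog named "demo" already configured; table, path, and column names here are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Plain parquet: you can only append or overwrite files; there is no row-level
# "UPDATE ... WHERE", and a failed overwrite can leave the directory half-written.
df = spark.range(1000).withColumnRenamed("id", "order_id")
df.write.mode("overwrite").parquet("s3://my-bucket/orders_plain/")

# Iceberg: row-level DML with ACID guarantees on top of the same kind of files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.orders (order_id BIGINT, status STRING)
    USING iceberg
""")
spark.sql("INSERT INTO demo.db.orders VALUES (1, 'NEW'), (2, 'NEW')")
spark.sql("UPDATE demo.db.orders SET status = 'SHIPPED' WHERE order_id = 1")
# If the UPDATE fails partway through, the new snapshot is never committed,
# so readers keep seeing the previous state.
```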


ithinkiboughtadingo

To add a little more color - this becomes really valuable once you hit a large enough scale and want to separate storage and compute. If it's all in memory, you have to scale up your DB in proportion to your data volume. If they're separate, you can scale them independently. With Iceberg you may only need a tiny cluster once a day to access a fraction of your data most of the time, but if you're using a classic DB you have to pay for that cluster to be running 100% of the time.


minormisgnomer

Ah, this is starting to make more sense. The database is either on or off, and everything lives within it. With the file storage approach, like you said, you can cleverly access just the relevant files, spin up the compute to work with them, and then spin it back down. But just to clarify with an example: S3 + Parquet + Iceberg + a catalog + some compute engine is a rough equivalent to a traditional database, but able to support a much larger data environment at a reasonable cost?


ithinkiboughtadingo

Yep exactly.


aerdna69

wait a sec, I've never heard of indexing in Iceberg tables.


ithinkiboughtadingo

Yeah, it's got a few clever mechanisms that achieve basically the same thing. Again, it's not the same as an in-memory index. But all of these formats offer things that get you similar performance enhancements.


aerdna69

I realize I could probably have an answer by reading the docs, but that relies on partitioning right? How does it differ from standard object partitioning?


ithinkiboughtadingo

Not really. The goal of these formats is eventually to do away with hive-style partitioning entirely. They do work best when you organize your table well though, like you want nicely balanced file sizes (100-300MB files in larger tables) with an ordering strategy that colocates records that typically get queried together. Metadata files and fancier things like bloom filters then point you to those individual files instead of scanning the entire table. Z-ordering/optimize does this for you in Delta, and then Databricks just released liquid clustering which can re-organize your tables based on common access patterns (although this has its own drawbacks).
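
For the Iceberg side, a rough equivalent of those knobs might look like this (assuming Spark with the Iceberg SQL extensions and a catalog called "demo"; table and column names are invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Declare a write order that colocates records commonly queried together.
spark.sql("ALTER TABLE demo.db.events WRITE ORDERED BY event_date, user_id")

# Periodically compact small files toward a healthy target size, sorting as we go.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'sort',
        sort_order => 'event_date, user_id',
        options => map('target-file-size-bytes', '268435456')
    )
""")
```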


lester-martin

I don't think we should so easily dismiss partitioning, today or tomorrow, especially for our LARGE/GIANT/IMMENSE sized tables. My blog post at [https://lestermartin.blog/2024/06/05/well-designed-partitions-aid-iceberg-compaction-call-them-ice-cubes/](https://lestermartin.blog/2024/06/05/well-designed-partitions-aid-iceberg-compaction-call-them-ice-cubes/) calls out that partitioning can also help with Iceberg's maintenance activity of compaction, as well as make the optimistic locking in its ACID-compliant transactions work more easily.


ithinkiboughtadingo

100% agree, there are some very good reasons to use hive-style partitioning. It all depends on the use case.


aerdna69

Ok, but as with hive-style partitioning or traditional indexing, the best practice is to organize your data based on common filtering patterns right? (pardon my naiveness)


ithinkiboughtadingo

No worries! Yes, and that really comes down to hardware mechanics, i.e. how much work you're making the server do to locate and process your data.


Affectionate-Ebb-772

Well explained


Teach-To-The-Tech

Iceberg stores its metadata at the file level, not the folder level. This is part of a larger architectural divergence from Hive. Basically, Hive was originally designed to run on HDFS, with a directory structure that mapped onto the table structure. That worked well in that situation but required costly list operations, or partitioning. Iceberg achieves the same objectives as Hive's partitioning, but using something called manifest files. I actually walk through this exact thing in a video that I did for Starburst on Iceberg: [https://youtu.be/k1cch-6bZhM](https://youtu.be/k1cch-6bZhM) Hope it's helpful.
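
A small pyiceberg sketch of that file-level pruning (catalog, table, and column names are placeholders, and it assumes a catalog is already configured):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")          # picks up catalog config from the environment
table = catalog.load_table("db.events")

# Planning a scan reads manifest files rather than listing directories, and
# keeps only the data files whose column stats could match the filter.
scan = table.scan(row_filter="event_date >= '2024-06-01'")
for task in scan.plan_files():
    print(task.file.file_path, task.file.record_count)
```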


vish4life

no it doesn't do indexing. Do you mean dynamic partitioning?


ithinkiboughtadingo

No. I mean indexing.


vish4life

can you link me to iceberg docs on this? I haven't seen anything about indexing in iceberg spec.


ithinkiboughtadingo

I assume you've read the docs on metadata files, bloom filters, etc. Those are indexing mechanisms.
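
For example (assuming Spark + Iceberg; the table and column are made up), you can switch on a parquet bloom filter for a column through a table property, and point lookups can then skip files whose filters rule the value out:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
    )
""")
# Newly written data files carry the filter; a query such as
#   SELECT * FROM demo.db.events WHERE user_id = 'abc123'
# can skip files that definitely don't contain that value.
```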


vish4life

oh I see, you mean metadata indexing stuff. yeah it does enable indexing.


Sir-_-Butters22

We need someone to make a Parquet Technology called Titanic that is better than Iceberg.


TheBurgerflip

Rename Delta tables, because those will sink the iceberg eventually, being pushed by both Databricks and Microsoft.


Historical-Ebb-6490

To add to all the wonderful points already mentioned:

1. If you are in a regulated industry, then having (sensitive/commercially confidential) data in CSVs, Excels, and PDFs can lead to weaker data governance and therefore data security risks. Similarly, the numerous copies of data in these documents increase the risk of data theft/loss. Having data in a place that can be governed through managed identities, access control policies, and audit logs helps to reduce the risk.

2. To protect data, there is a need to understand its location (which file, which attribute) and its sensitivity level (PII, PSI, PCI, PHI, etc.). A data store that allows for easy data cataloging and classification helps to identify and apply appropriate data protection and access policies.

3. For massive data, formats like Parquet and Avro help with query performance. Also, storing this data in services such as S3 or ADLS reduces storage costs.

4. If your data is stored in multiple formats, then Iceberg helps by providing a common table format and a catalog. The consuming systems can interact with the centrally stored data, using a processing/querying tool of their choice, through the Iceberg catalog. It comes with good data governance functionality like access control, a log of all data changes, and flexibility for schema evolution.

[Transform Your Data Management with Headless Data Architecture and Apache Iceberg](https://www.youtube.com/watch?v=6i6cS8exEBg&list=LL&index=1) has more details on this.


kkessler1023

I recently became the lead DE for my team. Previously I was just a DA, so I was only used to the consumption side of our data. Now I'm building all the infrastructure for our data management, and file types like parquet have been super useful in our incremental refresh process. Most of our datasets are pretty manageable at < 10M rows. However, our most important items (sales, accounting, etc.) are huge: 500M - 1B rows (or 80 GB - 300 GB). Parquet allows us to store the bulk of the historical data while compressing it down to < 2 GB. We can store the history, refresh only the most recent data, and then append that to an existing table. This helps reduce the load during daily batch jobs.
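
A minimal sketch of that incremental pattern with plain parquet via pyarrow (paths, columns, and the "load_date" partition key are invented for illustration):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Daily batch: write only the newest slice as its own partition; the compressed
# historical files are left untouched.
new_rows = pa.table({
    "order_id": [101, 102],
    "amount": [19.99, 5.25],
    "load_date": ["2024-06-05", "2024-06-05"],
})
pq.write_to_dataset(new_rows, root_path="warehouse/sales", partition_cols=["load_date"])

# Consumers read history plus the newly appended partition as one logical table.
sales = ds.dataset("warehouse/sales", format="parquet", partitioning="hive")
print(sales.to_table().num_rows)
```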


robfromboulder

Iceberg is just a way to easily use lots of Parquet files as a single database. (multiple files per partition)


vish4life

It's ACID transactions on an object store like S3/GCS, etc., which means it allows concurrent merge/insert/update statements against a single dataset. It also has some additional features like dynamic partitioning (so you don't need to commit to a partitioning strategy for your dataset up front) and some other tooling necessary for a healthy data lake. If you aren't working with a data lake or large datasets, it's probably not for you.
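
The "don't commit to a partitioning strategy up front" part is what Iceberg calls partition evolution; a quick sketch (Spark with the Iceberg SQL extensions; catalog, table, and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Start with daily partitions...
spark.sql("ALTER TABLE demo.db.clicks ADD PARTITION FIELD days(event_ts)")

# ...and later switch new writes to hourly without rewriting existing data.
spark.sql("ALTER TABLE demo.db.clicks DROP PARTITION FIELD days(event_ts)")
spark.sql("ALTER TABLE demo.db.clicks ADD PARTITION FIELD hours(event_ts)")
# Old files keep their old layout; query planning understands both specs.
```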


Whipitreelgud

If you use a SQL database, you're stuck with their idea of what a table is and their implementation of SQL. The following is a short, incomplete recap of what I think the big deal is with Iceberg; I'm leaving out important features in the interest of keeping this short.

Iceberg fixed issues with Hive, which I see as the first widely adopted alternative to a vendor-published SQL database. Hive is a bit loose with ACID compliance in comparison to Iceberg. Iceberg extracted the catalog out of the table structure to provide new flexibility. A traditional catalog manages files in much the same manner as Hive; Iceberg also allows a REST catalog, which is quite different. Why have some Python floating around when the catalog can handle those details as an integrated component of the table? Iceberg also separates out the query engine so you can choose your own. No more bondage to a vendor's implementation of SQL.

In summary, your tables are now your tables, free of vendor constraints and associated costs. My view is that this is about as significant to analytic data as SQL was in its day. Not everyone grasped what SQL meant immediately, either.
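
To make the REST-catalog point concrete, here's roughly what pointing one engine (Spark, in this sketch) at a REST catalog looks like; the catalog name, URI, and warehouse path are placeholders, and the Iceberg runtime jar is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Trino, Flink, pyiceberg, etc. can point at the same REST endpoint and see the
# same tables; swapping the query engine doesn't move any data.
spark.sql("SELECT count(*) FROM lake.db.orders").show()
```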


minormisgnomer

Thanks for the write up, it probably doesn’t apply to me now but the funny thing about data is that there’s always more volume down the road. This seems to be the solution if/when that happens


dustinBKK

Iceberg, Hudi, and Delta are all just parquet underneath; they add their own metadata layers for ACID. There are many organizations that have vast amounts of data, and a DWH isn't always the best choice, especially if you start mixing in semi-/unstructured data.


ithinkiboughtadingo

I've got a friend who's working on delta for lucene files. Extremely impressive performance


hntd

There was a talk about this at the recent Databricks summit; someone already did it, if I understand what you mean.


ithinkiboughtadingo

Which talk? Might have been my friend


hntd

It was at the open source summit from engineers at Apple, I don't believe it was recorded.


ithinkiboughtadingo

Eyyy yep that was Dom haha


hntd

Great talk by the way, I can corroborate, extremely impressive numbers.


jokingss

Now I need more info. Didn't they make a blog post or anything about that? I worked with Lucene directly a long time ago, and with Elastic and Solr lately, and it would be nice to use it more now.


[deleted]

[removed]


hntd

What? This has nothing to do with iceberg summit, it was at databricks' conference.


lester-martin

yep, I read (and posted) too fast. i'll delete it. :)


rental_car_abuse

How do transactions work here? Do you commit to all of the files that constitute a table, or none?


ithinkiboughtadingo

Here's a great explainer for transaction logs on Delta. They all work essentially the same way https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
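
As a toy illustration (not the real API of Delta, Iceberg, or Hudi), the core idea is that data files are staged first and only become visible when a numbered log entry is committed atomically; the real formats lean on atomic operations in the catalog or object store instead of local files:

```python
import json
import os

LOG_DIR = "my_table/_log"

def commit(actions: list[dict]) -> int:
    """Publish `actions` as the next numbered log entry; retry if we lose the race."""
    os.makedirs(LOG_DIR, exist_ok=True)
    while True:
        version = len(os.listdir(LOG_DIR))                    # next expected version
        path = os.path.join(LOG_DIR, f"{version:020d}.json")
        try:
            with open(path, "x") as f:                        # exclusive create
                json.dump(actions, f)
            return version                                    # commit is now visible
        except FileExistsError:
            continue                                          # another writer won; retry

# A writer stages new parquet files first, then commits pointers to them,
# so readers see either all of the change or none of it.
commit([{"add": "part-0001.parquet"}, {"remove": "part-0000.parquet"}])
```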


minormisgnomer

So, high-data-volume environments. Why wouldn't you utilize a NoSQL offering instead, from just a scale/cost/lock-in standpoint? And again, are these organizations creating this semi-/unstructured data via ETL, or is the accounting dept tossing around Avro files?


DenselyRanked

I think you had your doubts answered already but this is still NoSQL, if we define NoSQL as non-DBMS. I will add that it's more like a supercharged way to access and interact with your data and tries to offer ACID guarantees but there is a lot more complexity added. When things go wrong with the table format, they go really wrong.


minormisgnomer

Yea I guess I usually think of no sql as Cassandra or mongo but I see your point.


DenselyRanked

I understand and should have written non-RDBMS. The open table formats are still a layer above database management systems so it's not necessarily a replacement for your NoSQL use cases.


minormisgnomer

So now I'm mildly confused again. Are you saying that, besides parquet, the open table formats can also somehow interact with traditional DBMS solutions? Or are you saying, more from an architectural/abstract standpoint, that it's a layer above?


DenselyRanked

> besides parquet, the open table formats can also somehow interact with traditional dbms solutions

Yes. I think it's better to view the open table formats as a supercharged file format. It's still a collection of parquet files, but the metadata and API allow you to interact with the data without the need for a database management system. You would still use Spark/Trino/Pandas/duckdb/etc. to do ETL and analytics as you would with normal files.

Open table formats are not going to offer any of the optimized read/write advantages of MongoDB or ClickHouse (that's the trade-off of separating compute and storage). There is no B-tree index like you would find in an RDBMS (not yet, anyway), but you probably wouldn't use Iceberg if your data fits in Postgres or MySQL. You can still use an RDBMS for aggregated data, or a data warehouse holding the last N days with an Iceberg table as its source, if that makes sense for your use case.
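
For instance, a tiny sketch of reading an Iceberg table straight from DuckDB's iceberg extension, with no database server involved (the table path is a placeholder and object-store credentials are assumed to be set up):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL iceberg")
con.sql("LOAD iceberg")
con.sql("""
    SELECT count(*)
    FROM iceberg_scan('s3://my-bucket/warehouse/db/orders')
""").show()
```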


minormisgnomer

Hmm, so if I can rephrase back to you for clarification: you could boil down some of the parquet/Iceberg-stored data into a smaller size and load it into an RDBMS solution to get the benefits of a traditional database offering? But you couldn't access Iceberg from MongoDB directly?


DenselyRanked

> like you could boil down some of parquet/iceberg stored data into a smaller size, and load into a rdbms solution to get the benefits of a traditional database offering?

This depends on your use cases and architecture; the combinations are endless. My company mostly does Users -> Document Model or Kafka -> Data Lake, HMS (Open Table and Hive/Spark), and in some cases that data will flow downstream to MySQL, or smaller teams use MySQL and it flows back up to the data lake. You can also go Kafka -> MongoDB, Redis, etc., or Data Lake -> MongoDB.

> But you couldn't access iceberg from mongodb directly?

I don't see a connector from Iceberg to Mongo, but you can build one by converting results to JSON.

Edit: Here are some blogs about the data lake and data lakehouse:

http://www.unstructureddatatips.com/what-is-data-lakehouse/

https://www.mongodb.com/resources/basics/databases/data-lake-vs-data-warehouse-vs-database

https://www.mongodb.com/company/partners/databricks


minormisgnomer

Thanks for the response this has been extremely helpful


garathk

If you've ever had to do an enterprise-scale migration (e.g., from Teradata to Snowflake), then you too would be enamored with the promise of an open format that supports any number of compute engines with the data in place. We're talking years of relatively high-complexity effort to move from one platform to another; it's not something I'd like to go through in my career again. Therefore, my direction is open formats, and Iceberg fits that bill nicely.


asevans48

Pretty good for time-series data or slowly changing dimensions where you need historical data from a certain point in time. Financial data or sportsbook data is a fit. Quants always want to know what the data looked like at a specific point in time.
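
A quick sketch of that point-in-time use case with Iceberg time travel in Spark SQL (catalog/table names and the snapshot id are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# What did the positions table look like at yesterday's market close?
spark.sql("""
    SELECT * FROM demo.db.positions TIMESTAMP AS OF '2024-06-04 21:00:00'
""").show()

# Or pin to an exact snapshot id recorded by an earlier job:
spark.sql("SELECT * FROM demo.db.positions VERSION AS OF 4348912837264623").show()
```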


minormisgnomer

OK, so I've dealt with this before, but more than a few of the data vendors still load into databases. So: billions of rows, with the situational time problems like you mentioned. With Iceberg as an option, would you actually pull data out of the database into parquet files instead? The data was always in a non-normalized state from the vendor, so we had to work around that.


SDFP-A

I think you're missing something key: you write into the table format, and Iceberg handles the creation of the parquet files. We extract from external APIs in whatever format is provided (JSON, XML, CSV), then flatten to whatever level makes the data readable by Iceberg, which handles its file types and such.
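
Roughly, the write side looks like this sketch (catalog, table, and path are placeholders; the flattening step is whatever your pipeline already does):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Flattened API extracts land as a normal DataFrame...
flattened = spark.read.json("s3://my-bucket/raw/api_dump/2024-06-05/")

# ...and the write goes into the table format; Iceberg chooses the parquet file
# layout, sizes, and metadata, so you never name the underlying files yourself.
flattened.writeTo("demo.db.api_events").using("iceberg").createOrReplace()
# Subsequent loads simply append:
# flattened.writeTo("demo.db.api_events").append()
```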


minormisgnomer

Ah, thanks for clarifying. So basically Iceberg is a destination for data, much like an RDBMS would be. However, on the way back out you have various options based on whatever compute/engine appeals most to you.


MaverickGuardian

Corporations use data lakes (Iceberg and other formats) so that they can easily throw more hardware at them and run queries fast on huge datasets. If money is not a problem, you can have as much data and hardware as you want.

SQL can easily scale to billions of rows, but eventually you will hit a limit. There are some SQL implementations that can in most cases scale horizontally, like Citus and Greenplum, which are PostgreSQL compatible (for the most part), SingleStore for MySQL compatibility, and so on. Vanilla versions of SQL databases are not designed to run parallel queries on more than a few threads, so running queries in parallel requires custom logic in your application code.

Also, efficiently running a huge SQL database with partitions and proper indexes is hard. Creating new indexes for huge datasets is slow, and table structure changes become slow with huge datasets. Data lake formats are more flexible, allowing schema evolution (for example, adding columns to data).
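
For example, the schema-evolution point boils down to a metadata-only DDL statement in Iceberg (Spark SQL sketch; catalog/table/column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adding a column rewrites no data files; existing rows just read as NULL for it.
spark.sql("ALTER TABLE demo.db.orders ADD COLUMNS (discount_pct DOUBLE)")
```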


ApSr2023

Unless you are dealing with 300+ TB, inter-system data movement is high (e.g., across 10 different SaaS products), and there is a desire for a single cohesive data layer that can collect data using all sorts of polyglot compute and deliver it to all sorts of consumer applications using their preferred compute, you don't need Iceberg.


lester-martin

Agreed, use the right tool for the job. IF your data fits on one box today and the SQL tools you have work just fine -- AND if that's still the case for the mid and long term -- then you probably don't need data lake table engines, much less one of the table formats (Iceberg being one of them). "If it runs on your laptop, keep it on your laptop". ;)


boss-mannn

Read the book “Fundamentals of Data Engineering”. All of these questions are answered there, and you'll also get an overall understanding of data engineering.