Data pipelines are evolving from the "copy-paste" era to the "access-anywhere" era. Are you evolving with them? At its core, a data pipeline answers one thing: how does data move from "happened" to "decision made"? Here's what the evolution looks like:

𝗘𝗧𝗟 (The Old Way): Extract → Transform → Load
• Like preprocessing your groceries in the parking lot before bringing them inside
• Great for strict governance and legacy systems
• Slow. Rigid. Breaks on every schema change
• Tools: Glue, Talend, NiFi

𝗘𝗟𝗧 (The Cloud Era): Load raw → Transform in warehouse
• Like dumping groceries in the fridge and prepping later
• Scales with cloud compute
• Better, but still copying data everywhere
• Tools: dbt, Snowflake, BigQuery

𝗖𝗗𝗖 (The Real-Time Tax): Change Data Capture
• Streams every database change via transaction logs
• Captures inserts/updates/deletes from the source. Near real-time, low overhead
• Perfect for fraud detection. Expensive for everything else
• Tools: Debezium, AWS DMS

𝗦𝘁𝗿𝗲𝗮𝗺𝗶𝗻𝗴 (The Over-Engineering Champion)
• Real-time processing, no waiting. Like a food truck serving you as you arrive
• Necessary for IoT, trading, fraud, and live dashboards
• Overkill for your Monday-morning dashboard
• Tools: Kafka, Flink, Kinesis

Enter 𝗭𝗲𝗿𝗼 𝗘𝗧𝗟: The "Why Are We Doing This?" Moment
The modern lakehouse game-changer. One city library for everyone – no need to photocopy books. Data lives in one place (the lakehouse). Query it directly with any engine. No copying.

🤔 How Modern Lakehouses Make This Real

𝗢𝗽𝗲𝗻 𝗧𝗮𝗯𝗹𝗲 𝗙𝗼𝗿𝗺𝗮𝘁𝘀 (Iceberg, Hudi, Delta)
Think: Git for data. ACID transactions, time travel, schema evolution—on cheap object storage.

𝗦𝘁𝗼𝗿𝗮𝗴𝗲 ≠ 𝗖𝗼𝗺𝗽𝘂𝘁𝗲
Data sits in S3. Spark reads it. Snowflake queries it. Trino analyzes it. Same files. No copies. No vendor lock-in.

𝗡𝗮𝘁𝗶𝘃𝗲 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗼𝗿𝘀
AWS Aurora → S3? Managed replication. No custom Glue jobs. No 3 AM debugging.

What Changes for You?
You stop:
- Writing extraction code for the 47th time
- Maintaining Airflow DAGs that "mysteriously fail on Wednesdays"
- Explaining why reports are 6 hours delayed
- Paying for storage in four different systems

You start:
- Designing smart partitioning strategies
- Building dbt models where transformation actually matters
- Focusing on data quality, not data plumbing
- Solving business problems instead of pipeline problems

The Real Talk? Zero ETL isn't "no data engineering." It's engineering that matters. Traditional ETL won't vanish overnight—too much legacy infrastructure. But for new projects? Start lakehouse-first:
- Store once in open formats
- Let compute engines come to the data
- Create copies only when necessary

Think of it this way: you wouldn't photocopy a book every time someone wants to read a different chapter. So why copy your data every time a new team needs access?
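The ETL-vs-ELT split above can be shown in a few lines of code. A minimal, illustrative Python sketch (sqlite3 standing in for the warehouse; all table and field names are invented for the example):

```python
import sqlite3

# Incoming events, one of which is malformed.
raw_events = [
    {"rider": "a", "amount": "12.50"},
    {"rider": "b", "amount": "oops"},  # bad record
]

db = sqlite3.connect(":memory:")

# ETL: transform (validate/cast) *before* loading -- bad rows never land.
db.execute("CREATE TABLE etl_orders (rider TEXT, amount REAL)")
clean = [(e["rider"], float(e["amount"])) for e in raw_events
         if e["amount"].replace(".", "", 1).isdigit()]
db.executemany("INSERT INTO etl_orders VALUES (?, ?)", clean)

# ELT: load everything raw first, then transform inside the warehouse.
db.execute("CREATE TABLE raw_orders (rider TEXT, amount TEXT)")
db.executemany("INSERT INTO raw_orders VALUES (?, ?)",
               [(e["rider"], e["amount"]) for e in raw_events])
db.execute("""CREATE TABLE elt_orders AS
              SELECT rider, CAST(amount AS REAL) AS amount
              FROM raw_orders
              WHERE amount GLOB '[0-9]*'""")

print(db.execute("SELECT COUNT(*) FROM etl_orders").fetchone()[0])  # 1
print(db.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0])  # 2
```

The point of the contrast: in ELT the bad row still lands in `raw_orders`, so you keep the evidence and can re-transform later; in ETL it never arrives at all.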
Data Lakehouse Solutions
-
🔄 Perspective Shift: Data Lakehouse as an Unbundled Database

Ever thought about how a data lakehouse is essentially a deconstructed database? Let me break this down.

Traditional databases bundle various components tightly together:
🔷 Data Storage → Stores the actual data records on disk as a set of files in a database-specific format
🔷 Storage Engine → Manages how data is laid out on disk and how it is retrieved
🔷 Compute Engine → Parses, optimizes, and executes SQL queries against stored data
🔷 Metadata Catalog → Stores information about database structure, schemas, and statistics

The data lakehouse architecture takes these same components but distributes them across a modern data stack:
🟩 Data Storage → Cost-efficient object storage (S3, GCS, ADLS)
🟩 Compute Engine → Distributed compute engines (Spark, Presto) doing query planning, optimization, and distributed SQL execution
🟩 Metadata Catalog → External catalog services (Hive Metastore, Unity Catalog, AWS Glue, etc.)

So, what's the value addition? The reimagined storage engine consists of two layers: open file formats and table formats.
✴️ File Formats → Standardized, vendor-neutral storage formats like Parquet, ORC, and Avro that define how data is encoded and compressed at the binary level. These formats are optimized for analytics, with features like columnar storage, predicate pushdown, and efficient compression.
✴️ Table Formats → Delta Lake, Iceberg, and Hudi build on top of these file formats to add ACID transactions, time travel, schema evolution, and other database-like features.

In addition, the metadata catalogs in the lakehouse architecture serve as the central nervous system, managing crucial metadata and providing key functions like schema management, data discovery, and access control. The game changer for data lakehouses is that all these components communicate through standardized interfaces and open formats, enabling unprecedented interoperability.
This architecture allows you to mix and match tools while maintaining enterprise-grade reliability. The unbundled approach delivers flexibility without compromising functionality—ultimately enabling a truly composable data platform. #DataEngineering #DataLakehouse #BigData #DataArchitecture What's your take on this perspective? Share your thoughts below! 👇
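The "table format as a reimagined storage engine" idea can be sketched as a toy metadata layer over plain files. This is a loose illustration of the concept, not any real format's spec; all names and the JSON "file format" are invented stand-ins:

```python
import json
import os
import tempfile

# Toy "table format": data files are plain JSON files on "object storage"
# (a temp dir), and the table itself is just metadata -- an append-only
# list of snapshots, each naming the files that make up the table.
store = tempfile.mkdtemp()
snapshots = []  # the metadata layer; real formats persist this durably too


def write_file(name, rows):
    """Write an immutable data file to the object store."""
    path = os.path.join(store, name)
    with open(path, "w") as f:
        json.dump(rows, f)
    return path


def commit(files):
    """Advance the table by recording a new snapshot of file paths."""
    snapshots.append({"id": len(snapshots), "files": list(files)})


def scan(snapshot_id=-1):
    """Read the table as of a given snapshot (i.e. time travel)."""
    rows = []
    for path in snapshots[snapshot_id]["files"]:
        with open(path) as f:
            rows.extend(json.load(f))
    return rows


f1 = write_file("part-0.json", [{"id": 1}])
commit([f1])
f2 = write_file("part-1.json", [{"id": 2}])
commit([f1, f2])

print(len(scan()))   # 2 rows at the latest snapshot
print(len(scan(0)))  # 1 row when time-traveling back to snapshot 0
```

Because data files are immutable and only the snapshot list moves, old table states stay readable, which is the mechanism behind time travel and (with an atomic pointer update) ACID-style commits in real table formats.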
-
🚕 Ride Stream: Real-Time Data Lakehouse Data Engineering Project ⬇️

Not a toy ETL. Not a fake CSV pipeline. A proper production-style streaming + lakehouse system. This is the kind of architecture companies actually run. Here's what this project teaches you by building it end to end 👇

You start with real-time events (like a ride-booking app):
-> A Python producer sends events
-> Data flows into Kafka (MSK)
-> Then into Firehose
-> Lands in S3 Raw

From there:
-> Glue handles cataloging
-> EMR + Spark does the heavy transformations
-> Step Functions orchestrate the whole pipeline
-> Data moves from Raw → Refined → Business
-> dbt handles analytics transformations
-> Athena is used for querying

And the whole thing is deployed using:
-> CloudFormation (IaC)
-> CodeBuild + CodePipeline

So you're not just learning Spark or AWS. You're learning:
✅ How real streaming systems are designed
✅ How lakehouse layers actually work in practice
✅ How orchestration works in production
✅ How analytics engineering (dbt) fits into data platforms
✅ How to think like a data platform engineer, not just an ETL writer

This is exactly the kind of project:
-> That makes your resume stand out
-> That gives you real system-design confidence
-> That helps you crack "design a data platform" interviews

I'm adding this as a full guided project inside DataVidhya. If you've ever felt "I know tools… but I don't know how everything fits together in real life", this project is for you. 🫵🏻 You can find the project below 👇🏻 #dataengineer #dataengineering
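To get a feel for the first hop (producer → Kafka → Firehose → S3 Raw), here is a hedged local stand-in: no broker or delivery stream, just JSON-lines files appended into a date-partitioned "raw zone" directory, mimicking how Firehose typically lays out S3 keys. Every name here is hypothetical:

```python
import json
import pathlib
import random
import tempfile
import time
import uuid

# Local stand-in for the raw zone (in the real project this is S3 Raw).
raw_zone = pathlib.Path(tempfile.mkdtemp())


def ride_event():
    """Fabricate one ride-booking event, like the Python producer would."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": random.choice(["ride_requested", "ride_completed"]),
        "ts": time.time(),
    }


def produce(event):
    """Append the event as a JSON line under a date-partitioned key."""
    day = time.strftime("%Y-%m-%d", time.gmtime(event["ts"]))
    part = raw_zone / day / "events.jsonl"
    part.parent.mkdir(parents=True, exist_ok=True)
    with part.open("a") as f:
        f.write(json.dumps(event) + "\n")


for _ in range(5):
    produce(ride_event())

files = list(raw_zone.rglob("*.jsonl"))
rows = sum(1 for f in files for _ in f.open())
print(rows)  # 5
```

The downstream layers (Glue catalog, Spark jobs, dbt models) then treat this partition layout as their source, which is why getting the raw-zone key structure right matters early.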
-
Forget about schema evolution; even schema enforcement is a major, problematic issue in a data lake. As Data Engineers, we often discuss and use fancy methods to address schema evolution, but I rarely see DEs focusing on schema enforcement in their pipelines. Most of the time, we simply dump or load data into a plain data lake without much care for the schema. There is a huge focus on keeping a schema while reading from the source, but while loading data into the target or sink, we hardly care.

Without schema enforcement, multiple pipelines, or changing versions and code of a single pipeline, can keep adding different schemas to different Parquet or ORC files in a partitioned or non-partitioned folder on object storage. Nothing prevents this from happening, and readers (Spark, Presto, Trino, Hive) will infer a merged schema across the files, which can lead to messy surprises.

In a plain data lake setup, where you are just dumping data files (Parquet, ORC, CSV, JSON, etc.) into object storage (S3, GCS, ADLS) without a catalog, table format, or governance layer, there is no built-in schema enforcement. Lakehouse architectures and open table formats set out to resolve this, addressing a critical design aspect missing from simple data lake pipelines. Features such as catalogs, a governance layer, data versioning, improved ACID support, better support for UPDATE/MERGE/DELETE, schema enforcement, and schema evolution are increasingly important, driven by serious machine learning pipelines and the need for reliable results across the data engineering landscape.
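Schema enforcement on write doesn't need heavy machinery to illustrate. A minimal sketch of a write-side gate, assuming a hand-rolled schema dict (everything here is illustrative, not any real framework's API):

```python
# Hypothetical expected schema for files landing in the lake:
# field name -> required Python type.
SCHEMA = {"user_id": int, "amount": float, "country": str}


def enforce(record, schema=SCHEMA):
    """Reject any record that would add a divergent schema to the lake."""
    extra = set(record) - set(schema)
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    missing = set(schema) - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for field, typ in schema.items():
        if not isinstance(record[field], typ):
            raise TypeError(f"{field}: expected {typ.__name__}")
    return record


enforce({"user_id": 1, "amount": 9.99, "country": "DE"})  # passes the gate
try:
    # A drifted writer renamed "country" to "region" -- rejected on write,
    # instead of silently producing a file with a different schema.
    enforce({"user_id": 1, "amount": 9.99, "region": "DE"})
except ValueError as e:
    print("rejected:", e)
```

If every writer passes through a gate like this before a file lands in object storage, readers never have to merge conflicting schemas after the fact; this is essentially what table formats do for you, with the schema stored in table metadata instead of code.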
-
Some key insights from this week's chat with Justin Borgman, CEO of Starburst, on The MAD Podcast / Data Driven NYC:

1) Database 101:
* Transactional databases (Oracle, MongoDB) optimize for fast writes and consistency.
* Analytical databases (Snowflake, Databricks, Starburst) focus on fast reads and complex queries to make sense of data.

2) Data Lakes vs. Data Warehouses vs. Lakehouses:
* Data lakes (originally Hadoop) allow massive storage but are inefficient for analytics.
* Data warehouses (Snowflake, Teradata) optimize analytics but are proprietary.
* Lakehouses combine both, leveraging open-source formats like Iceberg.

3) The Open Format War:
* Iceberg vs. Delta vs. Hudi: the industry debated which open table format would dominate.
* 2024, the year Iceberg won: Snowflake embraced Iceberg, signaling broad adoption. Databricks acquired Tabular (the commercial Iceberg company) for $2B in a bidding war with Snowflake. Starburst had long supported Iceberg, making it a natural leader in the transition.

4) Starburst's Strategy and Evolution:
* Justin co-founded the company with key contributors to Presto from Facebook.
* When Facebook claimed Presto, they had to fork and rebuild a new open-source project (Trino).
* Building Starburst Galaxy (the cloud product) twice: the first version didn't deliver a seamless experience, leading to a costly but ultimately beneficial re-architecture.
* Betting on the lakehouse and Iceberg: Starburst has long championed the lakehouse model, believing open formats like Apache Iceberg would prevail.
* Hybrid solution: while Snowflake and Databricks are cloud-only, Starburst supports on-prem and hybrid deployments, making it attractive for enterprises with legacy infrastructure.

5) Starburst's Offerings:
* Starburst Enterprise: on-prem, self-managed solution.
* Starburst Galaxy: fully managed SaaS version, optimized for cloud.
* Dell partnership: OEM deal embedding Starburst into Dell's infrastructure, especially for AI workloads.
6) Core Capabilities:
* Federated query engine: runs analytics across multiple data sources.
* Governance & security: fine-grained access controls, auditing, and data sovereignty compliance.
* Streaming data & ingestion: optimized for real-time analytics with Kafka.
* Automatic table maintenance: performance optimizations for Iceberg.

7) Starburst's Role in the AI Stack:
* Training AI models: access to high-quality, structured data is critical for AI.
* Retrieval-Augmented Generation (RAG): providing AI with real-time structured data access for enterprise applications.

8) Go-to-Market Lessons:
* Starburst initially considered product-led growth (PLG) but realized its complexity required a direct enterprise sales motion.
* Heavy reliance on solution architects to support large deployments.
* Building a services ecosystem with boutique SI firms (e.g., Kubrick) to expand implementation support.
* Strategic partnerships (Dell) took years to develop but became high-value revenue drivers.
Trino, Iceberg and the Battle for the Lakehouse | Justin Borgman, CEO, Starburst
https://www.youtube.com/
-
I started working in data in the era of Hadoop. It was a great technical leap forward, but it led to a damaging change in mindset that still persists today.

Hadoop, and specifically HDFS, freed us from the limitation of databases constrained by the size of the on-prem machines that hosted them, allowed us to store data at significantly reduced cost, and provided new tools to process this data, now spread across many low-cost machines. We called this a data lake.

However, because storage became relatively cheap, we stopped applying discipline to the data we stored. We lowered the barriers to entry for data writers, harvesting as much data as we could. We stopped worrying about schemas when writing data, saying it's fine, we'll just apply the schema on read...

→ But by making writing data cheap, we made reading it 𝒆𝒙𝒑𝒆𝒏𝒔𝒊𝒗𝒆.

For a start, it was almost impossible to know what data was in there and how it was structured. It lacked any documentation, had no set expectations on its reliability and quality, and no governance over how it was managed. Then, once you did find some data you wanted to use, you needed to write MapReduce jobs using Hadoop or, later, Apache Spark. But this was very difficult to do – particularly at any scale – and only achievable by a large team of specialist data engineers. Even then, those jobs tended to be unreliable and have unpredictable performance.

🙅‍♂️ Using this data became prohibitively expensive. So much so, it was hardly worth the effort. That's why people started calling them data swamps, rather than data lakes.

Although some of us have moved away from data lakes, schema on read, etc., this mindset is still prevalent in our industry. We still feel we need writing data to be cheap. We still accept that a large portion of our time and money will be spent "cleaning" this data before it can be put to work. But if the data is as valuable as we say it is, why can't we argue for a bit more discipline to be applied when writing it?
Why can't we apply schemas to our data on publication? We use strongly-typed schemas every other time we create an interface between teams/owners, including APIs, infrastructure as code, and code libraries. How much would costs be reduced for the company if that was the case? We can total up the time spent cleaning the data, the time spent responding to incidents, the opportunity costs of being unable to provide reliable data to the rest of the company. Is the ROI positive? I'd bet it is.
-
At first, all the names blurred together: 𝐃𝐚𝐭𝐚 𝐖𝐚𝐫𝐞𝐡𝐨𝐮𝐬𝐞, 𝐃𝐚𝐭𝐚 𝐋𝐚𝐤𝐞, 𝐃𝐚𝐭𝐚 𝐋𝐚𝐤𝐞𝐡𝐨𝐮𝐬𝐞, 𝐃𝐚𝐭𝐚 𝐌𝐞𝐬𝐡. I thought: aren't they all just ways to store data? -> 𝐀𝐛𝐬𝐨𝐥𝐮𝐭𝐞𝐥𝐲 𝐧𝐨𝐭.

Here's a simple breakdown, for anyone who's ever felt the same:

🏢 𝐃𝐚𝐭𝐚 𝐖𝐚𝐫𝐞𝐡𝐨𝐮𝐬𝐞: 𝐒𝐜𝐡𝐞𝐦𝐚-𝐨𝐧-𝐰𝐫𝐢𝐭𝐞, 𝐝𝐞𝐟𝐢𝐧𝐞 𝐟𝐢𝐫𝐬𝐭 𝐭𝐡𝐞𝐧 𝐬𝐭𝐨𝐫𝐞
A centralized storage system optimized for structured data and business intelligence.
✅ Fast queries, strong governance—ideal for BI and compliance.
❌ Rigid schemas, not ideal for raw/unstructured data, expensive at scale.
Go-to tools: Snowflake, BigQuery, Redshift.

💧 𝐃𝐚𝐭𝐚 𝐋𝐚𝐤𝐞: 𝐒𝐜𝐡𝐞𝐦𝐚-𝐨𝐧-𝐫𝐞𝐚𝐝, 𝐬𝐭𝐨𝐫𝐞 𝐟𝐢𝐫𝐬𝐭 𝐭𝐡𝐞𝐧 𝐝𝐞𝐟𝐢𝐧𝐞
A centralized repository that stores massive volumes of raw structured and unstructured data in native formats.
✅ Cheap, flexible, great for ML and exploration.
❌ Lacks governance, slower queries without tuning, data swamp risk.
Go-to tools: AWS S3 + Glue, Azure Data Lake.

🏞️ 𝐃𝐚𝐭𝐚 𝐋𝐚𝐤𝐞𝐡𝐨𝐮𝐬𝐞: 𝐋𝐚𝐤𝐞 𝐜𝐨𝐬𝐭𝐬 + 𝐖𝐚𝐫𝐞𝐡𝐨𝐮𝐬𝐞 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞
A next-gen data platform that combines the flexibility of data lakes with the performance of data warehouses.
✅ Unified storage with strong analytics + ML performance.
❌ Complex to build and operate, tools still evolving.
Go-to tools: Databricks, Apache Iceberg.

🌐 𝐃𝐚𝐭𝐚 𝐌𝐞𝐬𝐡: 𝐃𝐚𝐭𝐚 𝐚𝐬 𝐚 𝐩𝐫𝐨𝐝𝐮𝐜𝐭, 𝐝𝐨𝐦𝐚𝐢𝐧 𝐚𝐮𝐭𝐨𝐧𝐨𝐦𝐲
A distributed architecture treating data as products, with each business domain owning and managing its own data.
✅ Scales with teams, empowers domain ownership.
❌ High governance overhead, needs strong org maturity.
Go-to tools: requires combining multiple tools to implement.

⚖️ So how do you choose? Here's a quick guide based on your org's stage and goals. 𝐈𝐟 𝐲𝐨𝐮'𝐫𝐞 𝐚...
🚀 Startup/Data-Heavy Org → Lakehouse (balances cost + capability)
🏢 Traditional Enterprise → Warehouse (reliable + proven)
🌊 Experiment-Heavy Team → Data Lake (max flexibility)
🏗️ Large Multi-Business Org → Data Mesh (long-term scalability)

💡 Bottom Line: Don't ask "which is best?"—ask "which fits us?"
The best architecture is the one your team can build, run, and grow with. ------ 👉Follow Lumina Wang for more AI & data insights. Let's grow together!
-
⁉️ What is a lakehouse? Why is it a big deal? And why is #ApacheIceberg 🧊 suddenly the hottest topic in data engineering?

They're not just buzzwords. They represent a fundamental shift in how we build data platforms, and understanding their relationship is key.

⚡ We all know that: Data 𝗹𝗮𝗸𝗲 + Data ware𝗵𝗼𝘂𝘀𝗲 = 𝗹𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲. For years, we've dealt with a false dichotomy:

𝟭. 𝗗𝗮𝘁𝗮 𝗪𝗮𝗿𝗲𝗵𝗼𝘂𝘀𝗲: Structured, governed, and performant. It supports ACID transactions, ensuring data integrity. Schema-on-write prevents bad data from corrupting your tables. But it's built for BI, not for unstructured data or ML workloads.

𝟮. 𝗗𝗮𝘁𝗮 𝗟𝗮𝗸𝗲: Cheap, scalable object storage for all data types. But it's a wild west of files (Parquet, ORC, etc.) with no transactional guarantees, no schema enforcement, and poor performance, leading to the classic "data swamp."

✅ The promise of the #Lakehouse is to 𝗺𝗲𝗿𝗴𝗲 𝘁𝗵𝗲 𝗿𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗮𝗻𝗱 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗼𝗳 𝘁𝗵𝗲 𝘄𝗮𝗿𝗲𝗵𝗼𝘂𝘀𝗲 𝘄𝗶𝘁𝗵 𝘁𝗵𝗲 𝗼𝗽𝗲𝗻𝗻𝗲𝘀𝘀 𝗮𝗻𝗱 𝗳𝗹𝗲𝘅𝗶𝗯𝗶𝗹𝗶𝘁𝘆 𝗼𝗳 𝘁𝗵𝗲 𝗹𝗮𝗸𝗲. The component that makes this possible is the table format.

🧊 Enter Apache Iceberg! Iceberg is not a file format or a query engine. It is an open-source, community-driven table format specification. It defines 𝗮 𝘀𝘁𝗮𝗻𝗱𝗮𝗿𝗱𝗶𝘇𝗲𝗱 𝗺𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗹𝗮𝘆𝗲𝗿 𝗼𝗻 𝘁𝗼𝗽 𝗼𝗳 𝘆𝗼𝘂𝗿 𝗳𝗶𝗹𝗲𝘀 𝗶𝗻 𝗼𝗯𝗷𝗲𝗰𝘁 𝘀𝘁𝗼𝗿𝗮𝗴𝗲 𝘁𝗵𝗮𝘁 𝗽𝗿𝗼𝘃𝗶𝗱𝗲𝘀 𝘁𝗵𝗲 𝗱𝗮𝘁𝗮𝗯𝗮𝘀𝗲 𝘀𝗲𝗺𝗮𝗻𝘁𝗶𝗰𝘀 𝘆𝗼𝘂'𝘃𝗲 𝗯𝗲𝗲𝗻 𝗺𝗶𝘀𝘀𝗶𝗻𝗴 𝗶𝗻 𝘁𝗵𝗲 𝗹𝗮𝗸𝗲.

Here's what it does at a technical level:
➡️ Defines the Table State
➡️ Enables ACID Transactions
➡️ Abstracts Physical Layout
➡️ Guarantees Schema Evolution

This is the key: 𝗜𝗰𝗲𝗯𝗲𝗿𝗴 𝗯𝗿𝗶𝗻𝗴𝘀 𝗱𝗮𝘁𝗮𝗯𝗮𝘀𝗲-𝗹𝗲𝘃𝗲𝗹 𝗿𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝘁𝗼 𝘆𝗼𝘂𝗿 𝗱𝗮𝘁𝗮 𝗹𝗮𝗸𝗲. It transforms a collection of files into a governable, high-performance asset.

✅ This is why it's fundamental to the lakehouse. With Iceberg as the open standard, you can have multiple engines (Snowflake, Spark, Trino, Flink) concurrently and transactionally operating on the same single copy of data in your own object store. No more siloing data for different workloads.

🚀 At Snowflake, we've embraced this open standard.
You can create Iceberg Tables managed by Snowflake or connect to your existing Iceberg tables. This allows you to leverage Snowflake's performance, security, and unified governance on your open data lake assets without locking them away. 🚀 It's about bringing the query engine to the data, not the other way around. ⁉️ 𝗪𝗮𝗻𝘁 𝘁𝗼 𝗿𝘂𝗻 𝗮𝗻𝗮𝗹𝘆𝘁𝗶𝗰𝘀 𝗼𝗻 𝘆𝗼𝘂𝗿 𝗹𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲 𝘂𝘀𝗶𝗻𝗴 𝗦𝗻𝗼𝘄𝗳𝗹𝗮𝗸𝗲? Check out quickstart guide in comments! Jeemin Sim Emma W. Danica Fine Chanin Nantasenamat #dataengineering
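The "standardized metadata layer" idea can be sketched in miniature. This is a hedged toy model of the general pattern Iceberg-style formats use (immutable metadata files plus one atomic pointer swap per commit), not the actual Iceberg spec; all file names are invented:

```python
import json
import os
import tempfile

# Stand-in for the table's location in object storage.
table_dir = tempfile.mkdtemp()


def commit_snapshot(data_files):
    """Write a new immutable metadata file, then atomically repoint
    'current' at it. Readers see the old or new state, never half."""
    version = len([f for f in os.listdir(table_dir) if f.startswith("v")])
    meta = os.path.join(table_dir, f"v{version}.metadata.json")
    with open(meta, "w") as f:
        json.dump({"snapshot": version, "data_files": data_files}, f)
    tmp = os.path.join(table_dir, "current.tmp")
    with open(tmp, "w") as f:
        f.write(meta)
    os.replace(tmp, os.path.join(table_dir, "current"))  # the atomic step


def current_state():
    """Resolve the 'current' pointer and load the table state."""
    with open(os.path.join(table_dir, "current")) as ptr:
        with open(ptr.read()) as meta:
            return json.load(meta)


commit_snapshot(["part-0.parquet"])
commit_snapshot(["part-0.parquet", "part-1.parquet"])
print(current_state()["snapshot"])  # 1
```

Because every engine resolves the same pointer through the same metadata layout, Spark, Trino, Flink, and Snowflake can all agree on what the table contains at any instant, which is what makes the single-copy, multi-engine story work.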
-
The concept of a data lakehouse is gaining traction, but what does it look like in practice? Let's break it down.

A data lakehouse is an architecture that:
- Stores data in *open formats* like Parquet or ORC.
- Adds *governance and reliability* features (ACID transactions, indexing, versioning).
- Supports both *SQL analytics and advanced ML/data science* in one platform.

In other words, it combines the trust and performance of a warehouse with the flexibility and scale of a data lake. This means data engineers, analysts, and data scientists can all access the same datasets directly, without extra ETL jobs or siloed copies.

So how do they differ?
- Data warehouse (Snowflake, BigQuery, Redshift) → structured and governed systems designed for BI and reporting.
- Data lake (AWS S3, Azure Data Lake, Hadoop HDFS) → designed for storing massive volumes of raw, semi-structured, or unstructured data, useful for ML and exploration.
- Data lakehouse (Databricks Delta Lake, Apache Iceberg, Snowflake UniStore, Google BigLake) → designed to combine warehouse-style governance and performance with lake-style flexibility, bridging analytics and ML on a single platform.

As teams demand real-time insights and AI/machine learning at scale, the lakehouse is emerging as the natural next step in data architecture for many. However, the real question isn't warehouse vs. lake vs. lakehouse; it's which setup will make your data trusted, usable, and impactful for the people who need it.
-
$150K and 3 months. That's what picking the wrong table format cost one team.

Delta was perfect for their Spark workloads. But when they added Trino, Snowflake, and Athena to the mix, they needed Iceberg's multi-engine support. The lesson? Match your table format to your actual workflows.

𝐂𝐡𝐨𝐨𝐬𝐞 𝐀𝐩𝐚𝐜𝐡𝐞 𝐈𝐜𝐞𝐛𝐞𝐫𝐠 𝐰𝐡𝐞𝐧:
→ Your stack includes multiple engines
→ Partitioning strategy might change
→ You want vendor neutrality with REST standards
→ Schema evolution is frequent

𝐂𝐡𝐨𝐨𝐬𝐞 𝐃𝐞𝐥𝐭𝐚 𝐋𝐚𝐤𝐞 𝐰𝐡𝐞𝐧:
→ Spark/Databricks is your primary compute
→ You want Liquid Clustering (no more guessing partition columns)
→ You need fast UPDATEs/DELETEs via deletion vectors
→ Auto-compaction and less manual maintenance appeal to you

𝐂𝐡𝐨𝐨𝐬𝐞 𝐀𝐩𝐚𝐜𝐡𝐞 𝐇𝐮𝐝𝐢 𝐰𝐡𝐞𝐧:
→ Streaming ingestion with frequent upserts is your main workload
→ You need sub-minute data freshness
→ CDC pipelines are core to your architecture
→ You have massive tables with record-level indexing

𝐂𝐡𝐨𝐨𝐬𝐞 𝐋𝐚𝐧𝐜𝐞 𝐰𝐡𝐞𝐧:
→ Multimodal AI is your primary workload
→ You need vector search, full-text search, and random access
→ Feature engineering is core to your architecture
→ You want lakehouse benefits (ACID, time travel) with AI-native capabilities

𝐓𝐡𝐞𝐬𝐞 𝐟𝐨𝐫𝐦𝐚𝐭𝐬 𝐚𝐫𝐞 𝐜𝐨𝐧𝐯𝐞𝐫𝐠𝐢𝐧𝐠. Databricks acquired Tabular (Iceberg's creators). Delta UniForm generates Apache Iceberg metadata. Apache XTable translates between formats. And Lance brings AI-native capabilities while still integrating with open engines like Spark, Trino, and DuckDB.

The format wars are cooling, but picking the wrong one today can still cost months of work. Match the format to your primary engine and workload pattern. Which camp are you in: multi-engine, Spark-native, streaming-heavy, or AI-first?

#DataEngineering #ApacheIceberg #DataLakehouse Apache Iceberg Delta Lake DuckDB LanceDB
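The checklist above can even be turned into a toy scorer. The traits and groupings below are illustrative only, distilled from this post rather than any official decision matrix:

```python
# Hypothetical trait sets per format, paraphrasing the checklist above.
CRITERIA = {
    "iceberg": {"multi_engine", "frequent_schema_evolution",
                "partition_changes", "vendor_neutral"},
    "delta":   {"spark_primary", "frequent_updates", "auto_compaction"},
    "hudi":    {"streaming_upserts", "cdc", "sub_minute_freshness"},
    "lance":   {"multimodal_ai", "vector_search", "feature_engineering"},
}


def suggest_format(traits):
    """Return the format whose trait set overlaps most with the workload."""
    scores = {fmt: len(needs & traits) for fmt, needs in CRITERIA.items()}
    return max(scores, key=scores.get)


print(suggest_format({"multi_engine", "vendor_neutral"}))  # iceberg
print(suggest_format({"streaming_upserts", "cdc"}))        # hudi
```

A real evaluation would weigh traits unevenly and account for team skills and existing contracts, but even this crude overlap score makes the post's point concrete: enumerate your actual workload traits before you commit to a format.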