Delta Lake vs. Data Lakes – what's the difference?

1 point by MrPowers 1 year ago | 2 comments
  • RoyTyrell 1 year ago
    This is just a marketing brochure...

    Not to be "old man yells at sky", but a lot of these new cloud-based/cloud-focused architectures seem to be geared toward highly specialized needs that 99.9% of businesses aren't going to have. However, they do one important thing: they over-use resources that line the pockets of MS, Amazon, Google, Databricks, etc. A Data Lakehouse is fine but what benefit does it give you over a much more simple solution of ETL/ELTing the data in batches (weekly, daily, hourly, etc) and letting it sit in some kind of DB.

    They say the Data Lakehouse needs all this metadata storage, API access layers, etc. That seems like an overly complex system for anything but large real-time systems that need to replicate a DB but, due to data volume and throughput, are unable to. Perhaps you also aren't just driving traditional reporting (dashboards, etc).

    I'm happy to use this new technology to make more money for myself as a specialist, and effectively be in on the scam, but from an optimal-solution POV they suck.

    • MrPowers 1 year ago
      > A Data Lakehouse is fine but what benefit does it give you over a much more simple solution of ETL/ELTing the data in batches (weekly, daily, hourly, etc) and letting it sit in some kind of DB.

      Lots of engines like Polars, PyTorch, Spark, and Ray can read structured data from databases, but Lakehouses are more efficient.

      Databases aren't as good for storing unstructured data.

      Databases can also be much more expensive than a Data Lakehouse.

      Databases are awesome and have lots of amazing use cases of course. Like you mentioned, data lakehouses are great for high data volume and throughput, but there are other use cases as well IMO.