Show HN: Denormalized – Embeddable Stream Processing in Rust and DataFusion

125 points by ambrood 10 months ago | 31 comments
tl;dr: we built an embeddable stream processing engine in Rust using Apache DataFusion. Check us out at https://github.com/probably-nothing-labs/denormalized

Hey HN,

We’d like to showcase a very early version of our embeddable stream processing engine, Denormalized. The rise of DuckDB has made it abundantly clear that even for many terabyte-scale workloads, a single-node system outshines the previous generation of distributed query engines such as Spark and Snowflake in both performance and cost.

Many of the workloads DuckDB is now used for were considered “big data” a generation ago, but no longer. This shift is even more acute in streaming. A streaming system incrementally processes large amounts of data over a period of time, and even at the upper end of the scale, production stream processing use cases rarely compute over more than tens of gigabytes of data at any given time.

Even so, standard stream processing solutions such as Flink require spinning up a distributed JVM cluster to compute against even the simplest of event streams. To that end, we’re building Denormalized: embeddable in your applications, scaling up to hundreds of thousands of events per second, with a Flink-like dataflow API. While we currently only support Rust, we plan to add Python and TypeScript bindings soon.

We’re built atop the DataFusion and Arrow ecosystems and currently support streaming joins as well as windowed aggregations on Kafka topics.
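
To give a flavor of the API, here’s a rough sketch of a windowed aggregation over a Kafka topic. The names and signatures below are illustrative only, not the exact API; see the examples in the repo for working code:

    // Illustrative sketch only -- not the exact API; see the repo examples.
    use std::time::Duration;
    use denormalized::prelude::*; // hypothetical prelude

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let ctx = Context::new()?;

        // Read a stream of JSON events from a Kafka topic.
        let stream = ctx
            .from_topic("sensor_readings", "localhost:9092")
            .await?;

        // Average reading per sensor over 1-minute tumbling windows.
        stream
            .window(
                vec![col("sensor_id")],    // group keys
                vec![avg(col("reading"))], // aggregate expressions
                Duration::from_secs(60),   // window length
                None,                      // no slide => tumbling window
            )?
            .print_stream() // emit results to stdout
            .await?;

        Ok(())
    }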

Please check out our repo at: https://github.com/probably-nothing-labs/denormalized

We’d love to hear your feedback.

  • dman 10 months ago
    This looks super interesting. I built https://github.com/finos/perspective in a past life but have been out of the streaming analytics game for some time. Nice to see single-machine efficiency be a focus; will give this a try and post feedback on GitHub.
    • ambrood 10 months ago
      this looks so clutch! curious if this was purpose-built for the finance industry?
      • dman 10 months ago
        Yes it was. People wanted a realtime version of pandas for hooking up their ticking charts and grids.
    • emgeee 10 months ago
      Other founder here -- we've been working on this for several months now and have had a lot of fun building on top of Arrow and DataFusion.
      • snthpy 10 months ago
        Hi,

        I previously built pq (https://github.com/PRQL/prql-query) as a side project using PRQL, Arrow, DataFusion, and DuckDB in Rust but unfortunately my life got too busy to maintain it.

        I've been looking for work in a related area to make it easier to pick up the torch on that again.

        I'd love to chat about the space and share experiences. My colleague on PRQL built https://github.com/aljazerzen/connector_arrow which may also be of interest.

        You can reach me at <HN handle> at Google email service.

      • theLiminator 10 months ago
        Are you going to support OLAP use cases as well? I haven't yet found a really nice hybrid batch/streaming query engine with dataframe support.

        Ideally, you'd support an api similar to Polars (which I have found to be the nicest thus far).

        It'd also be important/useful to support Python UDFs (think numpy/jax/etc.).

        It'd be very cool if you could collaborate with or even tap into the polars frontend. If you could execute polars logical plans but with a streaming source, that would be huge.

        • snthpy 10 months ago
          Have you looked at Databend? They support Flink CDC (https://docs.databend.com/guides/load-data/load-db/flink-cdc), so they should be able to handle hybrid use cases.

          I haven't looked at their Python API but they support PRQL which is a pretty nice and ergonomic interface in my (biased) opinion. See https://docs.databend.com/sql/sql-reference/ansi-sql#support...

          • ambrood 10 months ago
            DataFusion is primarily a batch OLAP system, so we should be able to support hybrid workloads as well. And definitely agree with you re: Polars dev exp. That is something we are aiming for with our forthcoming Python SDK.

            > It'd also be important/useful to support Python UDFs (think numpy/jax/etc.).

            Yep, that's our long-term game plan.

            > It'd be very cool if you could collaborate with or even tap into the polars frontend. If you could execute polars logical plans but with a streaming source, that would be huge.

            Are there examples of projects that do this? I'd be very much interested in looking into this.

            • theLiminator 10 months ago
              > Are there examples of projects that do this? I'd be very much interested in looking into this.

              Nope, I don't believe there are. Unfortunately they don't seem interested in exporting their logical plans to Substrait, so there's no obvious way forward.

              > DataFusion is primarily a batch OLAP system, so we should be able to support hybrid workloads as well. And definitely agree with you re: Polars dev exp. That is something we are aiming for with our forthcoming Python SDK.

              Ah, since that's the case, it might also make sense to tap into the DataFusion Python bindings, which recently got a massive overhaul to bring the dev ex closer to Polars (though the docs are still quite a bit behind).

              I'm looking forward to seeing what the result will be! I know Ibis is also an option, but from my little bit of playing around with it, I've found it's just the lowest common denominator and doesn't provide as nice an experience as directly using Polars (or whatever query engine API is provided).

          • j-pb 10 months ago
            I'd be curious to know what your thoughts on differential/timely dataflow are. Superficially it seems that it might be possible to integrate the existing Rust infrastructure from those libraries with DataFusion and Arrow, which could give you quite a few operators for free, and provide your users with the very nice incremental query/streaming-as-view-maintenance model.
            • ethegwo 10 months ago
              Neat! Founder of https://tonbo.io/ here. I am excited to see someone bring stream processing to DataFusion. We are working on an Arrow-native embedded DB and plan to support DataFusion in the next release; we're interested in building the streaming feature on Denormalized.
              • ambrood 10 months ago
                thanks for the encouraging words @ethegwo. Tonbo looks very cool and is potentially something we could use for our state backend (we're currently using RocksDB, which we aren't that happy with). Would love to chat about how we can work together. Feel free to reach out to me - amey@denormalized.io
              • shrisukhani 10 months ago
                Interesting. What use cases are you guys targeting with this?
                • stereosky 10 months ago
                  Congratulations on launching your project! We spoke back in March at a Kafka Summit London social meetup and talked all things Python and Kafka (I work on https://github.com/quixio/quix-streams). Always great to see a new stream processing project tackle a new segment.
                  • eXpl0it3r 10 months ago
                    For someone not deep in the topic, what is a "stream processing engine"?

                    All the descriptions of Denormalized use the term, so if you don't know it, it's kind of impossible to understand what Denormalized is / is trying to solve.

                    • emgeee 10 months ago
                      "streaming data" can generally be thought of as a sequence of events in time. Think measurements from sensors, mouse click events from uses on a webpage, or access logs from a server. These events are often fed into a message bus type system like Apache Kafka or AWS Kinesis.

                      Stream processing is a programming paradigm for working with these types of timestamped events and a "stream processing engine" like Denormalized seeks to actually execute stream processing compute jobs against streaming data.

                      The goal of the engine is to abstract away as much of the low level complexities needed to effectively work with this data at scale and provide an ergonomic way for developers to write and operate streaming applications.
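
                      To make that concrete, here's a toy, hand-rolled version of one of the most basic streaming operations, a tumbling-window count per key. An engine gives you this plus all the hard parts this sketch glosses over: out-of-order and late events, state that survives restarts, backpressure, and running continuously rather than over a fixed batch.

                          // Toy sketch: count events per key in fixed-size time windows.
                          use std::collections::HashMap;

                          struct Event {
                              ts_ms: u64,  // event timestamp in milliseconds
                              key: String, // e.g. a sensor id or user id
                          }

                          fn tumbling_counts(
                              events: &[Event],
                              window_ms: u64,
                          ) -> HashMap<(u64, String), u64> {
                              let mut counts = HashMap::new();
                              for e in events {
                                  // Integer division assigns each event to its window.
                                  let window = e.ts_ms / window_ms;
                                  *counts.entry((window, e.key.clone())).or_insert(0) += 1;
                              }
                              counts
                          }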

                    • nonlogical 10 months ago
                      This looks totally awesome! Easy to set up, memory-efficient, streaming, real-time data aggregation, compilable to a single self-contained binary: that's a dream come true.

                      Bookmarked for future projects!

                      • ztratar 10 months ago
                        Will be excited to see the TypeScript bindings once they're out. We may be able to use this to handle some of our workloads at Embra.

                        Will reach out! Congrats on the ship.

                        • ambrood 10 months ago
                          thanks @ztratar. we'd love to hear about your workloads at Embra; that would be very helpful vis-a-vis the direction of our TypeScript experience. feel free to drop us an email: hello@denormalized.io
                        • drawnwren 10 months ago
                          What differentiates you from e.g. Arroyo and Fluvio?
                          • necubi 10 months ago
                            I'm the creator of Arroyo (and have talked a lot with the Denormalized folks), so maybe I can answer from my perspective (Matt and Amey, please correct me on any inaccuracies).

                            First the similarities: both Arroyo and Denormalized use DataFusion and Arrow and are focused on high-scale, low-latency stateful stream processing.

                            Arroyo has been around a lot longer and is overall more mature. It's distributed (I believe Denormalized at this point is a single-node engine), supports consistent snapshotting of its state, event time and watermarks, and has a wide range of supported connectors (https://doc.arroyo.dev/connectors). It ships with a control plane, distributed schedulers, and a web UI.

                            But the use cases we're targeting are different. Arroyo is programmed via SQL and is used primarily for real-time data pipelines; we aim to replace Flink SQL and ksqlDB.

                            Denormalized (as I understand it) is focused more on data science use cases where it makes sense to have an embedded engine, rather than a distributed one. It's programmed with a Rust dataframe API (and soon Python).

                            • debadyutirc 10 months ago
                              I work with the creators of Fluvio at InfinyOn.

                              Fluvio is an edge-to-core, cloud-native streaming engine built from the ground up in Rust. It compiles to a single 37 MB binary and deploys on ARM64 devices.

                              We just released the first public beta of Stateful DataFlow, a WASM-based framework for building unbounded, distributed stream processing on top of Fluvio streams.

                              We are going for a lean alternative to Kafka + Flink with the user experience of Ruby on Rails.

                              BTW, Stateful DataFlow integrates with Arrow and Polars, supports SQL over dataframes, and lets you express business logic in other WASM-compatible programming languages. And Fluvio has Rust, Python, and JS clients.

                              • ambrood 10 months ago
                                while we haven't checked out Fluvio yet, we are fans of Arroyo. regarding the latter, my understanding is that the team is going for a SQL-first, complete replacement for Flink. Denormalized is meant to be an embeddable engine you can import within your project. Our plan is to focus on the developer experience for users building with Python and TypeScript in particular.
                              • franciscojarceo 10 months ago
                                Can't wait for the Python SDK!
                                • emgeee 10 months ago
                                  it'll be coming soon!
                                • lhnz 10 months ago
                                  Do you have plans to make the data sources pluggable instead of being Kafka specific?
                                  • ambrood 10 months ago
                                    we absolutely do; the library itself is designed to be extensible (rough sketch of the idea below). we are currently working on adding webhooks as one of our sources. are there any specific connectors/sources you'd be interested in?
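
                                    to sketch the rough idea (purely illustrative; this is not our actual trait): a custom source mainly needs to describe its schema and hand the engine batches of Arrow records, something like:

                                        // Hypothetical shape for a pluggable source -- illustrative
                                        // only, not the actual Denormalized trait.
                                        use arrow::datatypes::SchemaRef;
                                        use arrow::record_batch::RecordBatch;
                                        use async_trait::async_trait;

                                        #[async_trait]
                                        trait StreamSource: Send + Sync {
                                            /// Schema of the record batches this source produces.
                                            fn schema(&self) -> SchemaRef;

                                            /// Fetch the next batch of events;
                                            /// Ok(None) means the stream has ended.
                                            async fn next_batch(&mut self)
                                                -> anyhow::Result<Option<RecordBatch>>;
                                        }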
                                    • lhnz 10 months ago
                                      I have lots of HTTP endpoints that we poll with a cursor, but the underlying data is very large (we work with snapshots of it) and updates very frequently, and eventually we'll move to something else (e.g. interacting directly with the underlying services via capnproto), so really it would just be useful to be able to define these sources ourselves. I'm currently doing full-stack engineering at an HFT, and we were thinking of using DataFusion to let users join, query, and aggregate the data in realtime, but I haven't attempted this yet (doing so means integrating with what currently exists, as I don't have time to rewrite all of the services).
                                  • akshay2881 10 months ago
                                    Nice! How feature-complete is this compared with current industry standards like Flink?
                                    • rNULLED 10 months ago
                                      Looks cool! I’ll try it out for my ambitious project :)
                                      • emgeee 10 months ago
                                        would love any and all feedback!