Comparing Streaming Frameworks



(Note: This was written in early 2017, so some statements might be outdated. This is taken straight from the README that you can find at github.com/nambrot/biometric-stream-processing)

I recently decided that I wanted to learn more about the world of stream processing. I felt quite comfortable in the world of CRUD but knew that if I wanted to elevate as a software engineer, I couldn't ignore the tremendous progress and change that is happening when it comes to dealing with unbounded and plentiful data.

At my company, we have a challenge for the infrastructure engineer position that seemed like a suitable starting point. The problem is as follows:

Suppose you have a continuous stream of heart rate events `{"user_id":12345, "heart_rate":200}` and blood pressure events `{"user_id":12345,"systolic":120,"diastolic":80}`. If for any user the heart rate is above 100 and the systolic blood pressure is below 100, trigger an alert.

The prompt seemed simple enough, but as we will see, the world of streaming requires us to think very differently, and there are gotchas and edge cases to be had. I also noticed that a lot of the content out there mostly focuses on getting a basic word count going, so hopefully this will help you dive into a more complex (yet still totally unrealistic) problem with joining and windowing.

One of the very first things I ran into as a streaming novice is the need to window your streams. I.e., when and for how long are biometric events valid? When are two events valid together? How do I express that in my stream processing framework? The challenge actually refers to this problem:

What would change about your implementation to support time-windowing operations, e.g. low blood pressure and high heart rate within a 60 minute interval?

You might be tempted to think that you could just create fixed 60-minute intervals, consider the events in each interval and trigger an alert, but what happens when two matching events occur right around the window boundary? Thus, you'll need a different approach, and as we'll see, it depends on the features of the streaming framework.

Biometric Alert Streaming with Spark

The first framework I looked at was Apache Spark. While Spark gained Structured Streaming in 2.0, it is still evolving, so I decided to use the traditional Spark Streaming feature. Spark has been one of the greater successes in the big data landscape in recent years due to its great performance while handling large amounts of data, its high-level API in Scala, Java and Python, a similar programming model for streaming and batch, and widespread industry support. That means that there is a good amount of documentation out there.

To get started, we have the most basic Spark setup:
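
Something along these lines (a minimal local sketch; the master, app name and batch interval are arbitrary choices):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Run locally with two threads; in production this would point at a real cluster.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("biometric-alerting")

// The batch interval controls how often Spark Streaming produces micro-batches.
val ssc = new StreamingContext(conf, Seconds(1))
```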

In reality, you'd use a real input and a real cluster. For the purpose of this exercise, I just create a manual DStream. The next thing we'll want to do is to parse the data into case classes:
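
A rough sketch of that parsing step, assuming the raw JSON strings arrive on two DStreams called hrStrings and bpStrings, and using json4s (which ships with Spark) with case class fields that mirror the JSON keys:

```scala
import org.apache.spark.streaming.dstream.DStream
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Field names match the JSON keys so json4s can extract them directly.
case class HeartRateEvent(user_id: Int, heart_rate: Int)
case class BloodPressureEvent(user_id: Int, systolic: Int, diastolic: Int)

// Key both streams by user id so they can be joined later.
val heartRateEvents: DStream[(Int, HeartRateEvent)] = hrStrings.map { s =>
  implicit val formats: Formats = DefaultFormats
  val event = parse(s).extract[HeartRateEvent]
  (event.user_id, event)
}

val bloodPressureEvents: DStream[(Int, BloodPressureEvent)] = bpStrings.map { s =>
  implicit val formats: Formats = DefaultFormats
  val event = parse(s).extract[BloodPressureEvent]
  (event.user_id, event)
}
```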

Now to the interesting piece: the join. Spark Streaming supports quite nice ergonomics when it comes to joining (and consequently windowing the stream). Let's take a look:
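
A sketch of what this can look like with the keyed streams from above:

```scala
import org.apache.spark.streaming.Minutes

// Wrap each event in a List so the join below can simply concatenate both sides.
val hrLists = heartRateEvents.mapValues(e => List(e))
val bpLists = bloodPressureEvents.mapValues(e => List(e))

val combinedStream = hrLists
  .fullOuterJoin(bpLists)
  // "un-option" both sides by falling back to empty lists
  .mapValues { case (hr, bp) => (hr.getOrElse(Nil), bp.getOrElse(Nil)) }
  // collect, per user, all events that fall into a 60-minute window sliding by 1 minute
  .reduceByKeyAndWindow(
    (a: (List[HeartRateEvent], List[BloodPressureEvent]),
     b: (List[HeartRateEvent], List[BloodPressureEvent])) => (a._1 ++ b._1, a._2 ++ b._2),
    Minutes(60), // window length
    Minutes(1)   // slide interval
  )
```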

First, we transform both streams of events into streams of lists of events. That will be helpful for the fullOuterJoin, as it combines two streams of elements into one stream of Option pairs. In the following map I "un-option" the individual pairs. Finally, in the .reduceByKeyAndWindow I window the stream so that in each window of 60 minutes, I have all events that apply for a given user.

From here, we can simply filter that combined stream for the conditions of our alert:
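
With a small Alert case class (my own addition for readability), that filter can be a flatMapValues over all pairs in the window:

```scala
case class Alert(userId: Int, heartRate: Int, systolic: Int)

// Emit an alert for every (high heart rate, low systolic pressure) pair in the window.
val alertStream: DStream[(Int, Alert)] = combinedStream.flatMapValues {
  case (heartRates, bloodPressures) =>
    for {
      hr <- heartRates if hr.heart_rate > 100
      bp <- bloodPressures if bp.systolic < 100
    } yield Alert(hr.user_id, hr.heart_rate, bp.systolic)
}
```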

If you print out this stream, you will note that it prints the "same" alert every minute. That is because the combinedStream window slides every minute, so for the minutes following a detected alert, the window will still contain event combinations that should trigger. If we didn't have sliding windows but hopping ones, we'd potentially miss valid event combinations across the window boundary. Thus the only way to ensure proper alerting is to have minimally sliding windows. Take the following "visualization":
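
Roughly, with made-up times, two events straddling a boundary look like this:

```
events:           HR 200 @ 11:59           BP 90/60 @ 12:01
hopping windows:  [11:00 -------- 12:00)   [12:00 -------- 13:00)
                  -> each window sees only one of the two events: no alert
sliding windows:  [11:02 -- 12:02) [11:03 -- 12:03) ... [11:59 -- 12:59)
                  -> every window covering both events fires, once per one-minute slide
```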

If you print alertStream, you'll get an output like

If we only want to "trigger" once, we'll need to maintain state, track alerts and have a cooldown on them. We can achieve that by using updateStateByKey:
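
A sketch of that idea, keeping the last alert and its timestamp as per-user state (the 60-minute cooldown and the state shape are my choices, and updateStateByKey requires a checkpoint directory to be set):

```scala
val cooldownMs = 60 * 60 * 1000L

// State per user: the most recent alert and when it fired. New alerts are ignored
// while an unexpired alert is still in the state; expired state is dropped.
// Requires ssc.checkpoint(...) to be configured.
val activeAlerts: DStream[(Int, (Alert, Long))] =
  alertStream.updateStateByKey[(Alert, Long)] {
    (newAlerts: Seq[Alert], state: Option[(Alert, Long)]) =>
      val now = System.currentTimeMillis()
      state match {
        case Some((_, firedAt)) if now - firedAt < cooldownMs => state // still cooling down
        case _ => newAlerts.headOption.map(alert => (alert, now))      // fire, or stay empty
      }
  }
```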

This way, we'll only ever get one alert that "lasts" for 60 minutes.

Biometric Alert Streaming with Akka Streams

The next framework we'll be looking at is Akka Streams. Akka enjoys high regard in the JVM ecosystem due to its solid foundation for concurrency, especially the actor model it borrowed from Erlang. A lot of other projects have been using Akka under the hood; in fact, Spark was built upon Akka in the past. Akka has expanded beyond simple concurrency primitives with higher-level abstractions such as Akka Streams. I have also been meaning to check out their event sourcing library "Persistence Query", but that shall be the topic of another post.

You should soon notice that Akka Streams is very different from Spark Streaming. It does not support clustering and has no native support for windowing and joins, so we'll have to do it ourselves.

To get started:
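
A minimal sketch of the setup, reusing the case classes from above, with Source.queue as a stand-in for a real input and the peekMatValue workaround explained in the next paragraph:

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.Source
import scala.concurrent.{Future, Promise}

implicit val system = ActorSystem("biometric-alerting")
implicit val materializer = ActorMaterializer()

// Expose a source's materialized value (here: its queue) as a Future, so we can
// offer elements to it once the graph is running (workaround for akka/akka#17769).
def peekMatValue[T, M](src: Source[T, M]): (Source[T, M], Future[M]) = {
  val p = Promise[M]()
  (src.mapMaterializedValue { m => p.trySuccess(m); m }, p.future)
}

val (heartRateSource, hrQueue) =
  peekMatValue(Source.queue[HeartRateEvent](100, OverflowStrategy.backpressure))
val (bloodPressureSource, bpQueue) =
  peekMatValue(Source.queue[BloodPressureEvent](100, OverflowStrategy.backpressure))
```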

We have to use peekMatValue per akka/akka#17769 to be able to add elements dynamically to the sources via the futures. Next, we'll want to join the two sources. To do that, we'll be creating lists of events again, and then merge them.
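
A sketch of that merge, bringing both event types into a common per-user shape first:

```scala
// Give both streams the same element type: (userId, (heart rate events, blood pressure events))
val hrAsLists = heartRateSource.map(e => (e.user_id, (List(e), List.empty[BloodPressureEvent])))
val bpAsLists = bloodPressureSource.map(e => (e.user_id, (List.empty[HeartRateEvent], List(e))))

// merge interleaves both sources into a single stream that we can window per user
val mergedEvents = hrAsLists.merge(bpAsLists)
```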

As with joining, windowing isn't natively supported. I'll be mostly following the excellent guide from SoftwareMill:
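
The full window-command machinery from that guide is too long to reproduce here, so the following is a condensed and simplified sketch of the idea: tag each element with every 60-minute sliding window it belongs to, accumulate the per-user window contents in a statefulMapConcat stage, and emit alerts from it. It uses processing time and never evicts old windows, both of which real code would have to handle:

```scala
import scala.concurrent.duration._

case class WindowKey(userId: Int, windowStart: Long)

val windowSize  = 60.minutes.toMillis
val windowSlide = 1.minute.toMillis

// All sliding windows (by start time) that a given timestamp falls into.
def windowsFor(timestampMs: Long): List[Long] = {
  val lastStart  = timestampMs - (timestampMs % windowSlide)
  val firstStart = math.max(0L, lastStart - windowSize + windowSlide)
  (firstStart to lastStart by windowSlide).toList
}

val windowedAlerts = mergedEvents
  // assign each element to all windows it belongs to, keyed by (user, window)
  .mapConcat { case (userId, events) =>
    val ts = System.currentTimeMillis() // a real implementation would use event time
    windowsFor(ts).map(start => (WindowKey(userId, start), events))
  }
  // accumulate the events seen per (user, window); a real implementation would
  // also close and evict windows once they can no longer receive events
  .statefulMapConcat { () =>
    val windows =
      scala.collection.mutable.Map.empty[WindowKey, (List[HeartRateEvent], List[BloodPressureEvent])]
    keyed => {
      val (key, (hrs, bps)) = keyed
      val (accHr, accBp)    = windows.getOrElse(key, (Nil, Nil))
      val updated           = (accHr ++ hrs, accBp ++ bps)
      windows(key) = updated
      List((key, updated))
    }
  }
  // same alert condition as in the Spark example, reusing the Alert case class
  .mapConcat { case (key, (hrs, bps)) =>
    for {
      hr <- hrs if hr.heart_rate > 100
      bp <- bps if bp.systolic < 100
    } yield Alert(key.userId, hr.heart_rate, bp.systolic)
  }
```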

Compared to the Spark example, this is obviously a lot more code to window the stream and group by user id. But it does give a nice insight into how one might accomplish windowing, and at the end we get alerts for the right windows and users. The last thing we'll need to do is to "rate-limit" the stream to only trigger once per user for a certain time.
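
A sketch of such a rate limit with statefulMapConcat, remembering per user when the last alert was let through (60-minute cooldown, processing time):

```scala
val cooldownMs = 60.minutes.toMillis

val throttledAlerts = windowedAlerts.statefulMapConcat { () =>
  val lastAlertAt = scala.collection.mutable.Map.empty[Int, Long]
  alert => {
    val now = System.currentTimeMillis()
    if (lastAlertAt.get(alert.userId).forall(now - _ > cooldownMs)) {
      lastAlertAt(alert.userId) = now
      List(alert) // first alert for this user within the cooldown: pass it through
    } else Nil    // still cooling down: drop it
  }
}
```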

Once again, very similar to Spark's updateStateByKey, but a little more verbose, to the point that I'm wondering what I'm missing, because this seems like very easily generalizable functionality.

Overall, Akka has very nice documentation and a lot of other functionality, which makes it a great choice as the backbone of your system. However, if your focus is on streaming, it certainly seems that other competitors have a little more functionality, although I really like their Graph DSL and stages abstraction.

Biometric Alert Streaming with Kafka Streams

Kafka Streams has been getting a lot of attention since its announcement in May 2016. It has the promise of simplification by avoiding the need for a streaming-specific cluster a la Spark or Flink. It is built right into your Kafka cluster by using Kafka topics as its distribution model, making Kafka Streams a "mere library". Let's take a look at how we would implement the biometric alerting example. I omitted some boilerplate/setup code in the following snippets. Also note that there are some difficulties around using Kafka Streams as it does not have a native Scala API yet.
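
For illustration, the re-keying and Option-wrapping can look roughly like this. Here rawHeartRates is the KStream[String, String] read from the input topic, and parseHeartRate stands in for whatever JSON parsing you use; without a Scala API, the SAM interfaces have to be spelled out as anonymous classes:

```scala
import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.kstream.{KStream, KeyValueMapper}

// Re-key by user id and wrap the parsed event in an Option so that the outer
// join below can represent "no event on this side" uniformly.
val heartRates: KStream[Integer, Option[HeartRateEvent]] =
  rawHeartRates.map[Integer, Option[HeartRateEvent]](
    new KeyValueMapper[String, String, KeyValue[Integer, Option[HeartRateEvent]]] {
      override def apply(key: String, value: String): KeyValue[Integer, Option[HeartRateEvent]] = {
        val event = parseHeartRate(value) // hypothetical JSON parsing helper
        KeyValue.pair(Int.box(event.user_id), Option(event))
      }
    })

// bloodPressures: KStream[Integer, Option[BloodPressureEvent]] is built the same way.
```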

Here we just parse the raw strings into case classes and wrap them in Options. That will be important for the following join, since Kafka's outer join will possibly leave null as the values (see Kafka Join Semantics for more info). Other than the null, the ergonomics of the joins in Kafka Streams are quite nice.
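
A sketch of that join against the 0.10.x API this post was written against, where heartRateSerde and bloodPressureSerde are stand-ins for the value serdes discussed in the next paragraph:

```scala
import java.util.concurrent.TimeUnit
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.{JoinWindows, ValueJoiner}

// Windowed outer join per user: either side may be null when only one stream has
// seen an event within the 60-minute join window, hence the Option-flattening.
val combined: KStream[Integer, (Option[HeartRateEvent], Option[BloodPressureEvent])] =
  heartRates.outerJoin(
    bloodPressures,
    new ValueJoiner[Option[HeartRateEvent], Option[BloodPressureEvent],
                    (Option[HeartRateEvent], Option[BloodPressureEvent])] {
      override def apply(hr: Option[HeartRateEvent], bp: Option[BloodPressureEvent]) =
        (Option(hr).flatten, Option(bp).flatten) // null-safe: a null side becomes None
    },
    JoinWindows.of(TimeUnit.MINUTES.toMillis(60)),
    Serdes.Integer(),
    heartRateSerde,     // stand-in Serde for the heart rate side
    bloodPressureSerde  // stand-in Serde for the blood pressure side
  )
```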

Another added complexity is having to think about Serdes. Since Kafka Streams uses Kafka as the distribution model, you have to think about how to de-/serialize state to Kafka. I don't have much experience with serialization in the JVM/Apache ecosystem, so I just created a generic JsonSerde, but Avro seems preferable.
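
A naive version of such a JsonSerde, sketched here with Jackson and its Scala module (any JSON library would do; error handling and Option-wrapped values are glossed over):

```scala
import java.util
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.kafka.common.serialization.{Deserializer, Serde, Serializer}
import scala.reflect.ClassTag

// Serializes any value of type T as JSON bytes; Avro plus a schema registry would
// be the more robust choice, as mentioned above.
class JsonSerde[T >: Null](implicit tag: ClassTag[T]) extends Serde[T] {
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()
  override def close(): Unit = ()

  override def serializer(): Serializer[T] = new Serializer[T] {
    override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()
    override def serialize(topic: String, data: T): Array[Byte] =
      if (data == null) null else mapper.writeValueAsBytes(data)
    override def close(): Unit = ()
  }

  override def deserializer(): Deserializer[T] = new Deserializer[T] {
    override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()
    override def deserialize(topic: String, data: Array[Byte]): T =
      if (data == null) null
      else mapper.readValue(data, tag.runtimeClass.asInstanceOf[Class[T]])
    override def close(): Unit = ()
  }
}

// e.g. val heartRateSerde = new JsonSerde[HeartRateEvent]
```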

After the join, we can operate on the streams like any other stream processing framework:
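
For instance, filtering for the alert condition (again with an explicit Predicate instance for lack of a Scala API):

```scala
import org.apache.kafka.streams.kstream.Predicate

// Keep only pairs where both readings are present and match the alert condition.
val alerts = combined.filter(
  new Predicate[Integer, (Option[HeartRateEvent], Option[BloodPressureEvent])] {
    override def test(userId: Integer,
                      pair: (Option[HeartRateEvent], Option[BloodPressureEvent])): Boolean =
      pair match {
        case (Some(hr), Some(bp)) => hr.heart_rate > 100 && bp.systolic < 100
        case _                    => false
      }
  })
```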

While Spark has updateStateByKey and Akka has statefulMapConcat, Kafka Streams' stateful aggregation is somewhat more limited, so if we want to have the "rate-limiting" functionality, we'll have to drop down to the Processor API:
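
A simplified sketch of such a processor. For brevity it keeps its per-user state in an in-memory map; a real implementation would register a fault-tolerant state store with the topology and read it via the ProcessorContext:

```scala
import org.apache.kafka.streams.processor.{AbstractProcessor, ProcessorSupplier}

type AlertPair = (Option[HeartRateEvent], Option[BloodPressureEvent])

// Forwards at most one alert per user per cooldown period.
class AlertThrottleProcessor(cooldownMs: Long) extends AbstractProcessor[Integer, AlertPair] {
  private val lastAlertAt = scala.collection.mutable.Map.empty[Integer, Long]

  override def process(userId: Integer, pair: AlertPair): Unit = {
    val now = System.currentTimeMillis()
    if (lastAlertAt.get(userId).forall(now - _ > cooldownMs)) {
      lastAlertAt(userId) = now
      context().forward(userId, pair) // pass the alert downstream
    }
  }
}

alerts.process(new ProcessorSupplier[Integer, AlertPair] {
  override def get() = new AlertThrottleProcessor(60 * 60 * 1000L)
})
```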

While it is clear that Kafka Streams is quite young, I think it has a great foundation, and with recent additions like queryable state stores and Kafka's general versatility, I am quite excited to see where Kafka Streams will go. Articles like this especially resonate with me when it comes to streaming being a potential game changer for general application development, and Kafka seems well-poised to be a major player.

Biometric Alert Streaming with Apache Beam

Apache Beam is a project coming from Google and their work on their unified batch/stream processing service Dataflow. While it is clear that support for Dataflow is a priority for Google, Beam is meant to be a framework for describing your processing work, which you can then run on different backends, or what they call runners. Depending on the runner and its semantics, not all features of Beam might be supported. Check out their capability matrix.

After the basic setup, we can create our sources:
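
A sketch with a bounded stand-in source; a real pipeline would read from an unbounded source such as Kafka instead:

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Create
import org.apache.beam.sdk.values.{PCollection, TimestampedValue}
import org.joda.time.Instant

val pipeline = Pipeline.create(PipelineOptionsFactory.create())

// Timestamped JSON strings stand in for a real unbounded source here.
val rawHeartRates: PCollection[String] = pipeline.apply(
  "heartRateSource",
  Create.timestamped(
    TimestampedValue.of("""{"user_id":12345, "heart_rate":200}""", Instant.now())))

// rawBloodPressures is created the same way from the blood pressure events.
```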

PCollection is Beam's unifying abstraction that represents a collection of events. If you are in batch mode, you have a bounded collection and can use a single global window, but for streaming use cases, you will want to apply a window. Beam has a very interesting concept of windows that I don't get into here, but which would allow us to account for "late" data (when a user's phone is offline, etc.). Joining seems quite awkward with the use of CoGroupByKey and TupleTag, which we can then filter as follows:
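
Sketching this out, assuming heartRates and bloodPressures are PCollection[KV[Integer, HeartRateEvent]] and PCollection[KV[Integer, BloodPressureEvent]] parsed from the raw strings (coder setup for the case classes is elided), and reusing the Alert case class from the Spark section:

```scala
import org.apache.beam.sdk.coders.SerializableCoder
import org.apache.beam.sdk.transforms.join.{CoGbkResult, CoGroupByKey, KeyedPCollectionTuple}
import org.apache.beam.sdk.transforms.windowing.{SlidingWindows, Window}
import org.apache.beam.sdk.transforms.{DoFn, ParDo}
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.values.{KV, PCollection, TupleTag}
import org.joda.time.Duration
import scala.collection.JavaConverters._

// 60-minute windows sliding by one minute, applied to both inputs.
val slidingWindow = SlidingWindows.of(Duration.standardMinutes(60)).every(Duration.standardMinutes(1))
val windowedHeartRates =
  heartRates.apply(Window.into[KV[Integer, HeartRateEvent]](slidingWindow))
val windowedBloodPressures =
  bloodPressures.apply(Window.into[KV[Integer, BloodPressureEvent]](slidingWindow))

// The tags identify each input inside the CoGbkResult of the join.
val hrTag = new TupleTag[HeartRateEvent]()
val bpTag = new TupleTag[BloodPressureEvent]()

val joined: PCollection[KV[Integer, CoGbkResult]] = KeyedPCollectionTuple
  .of(hrTag, windowedHeartRates)
  .and(bpTag, windowedBloodPressures)
  .apply(CoGroupByKey.create[Integer]())

// Filter the grouped result down to the alert condition.
class AlertFn(hrTag: TupleTag[HeartRateEvent], bpTag: TupleTag[BloodPressureEvent])
    extends DoFn[KV[Integer, CoGbkResult], Alert] {
  @ProcessElement
  def processElement(c: ProcessContext): Unit = {
    val userId = c.element().getKey
    val result = c.element().getValue
    for {
      hr <- result.getAll(hrTag).asScala if hr.heart_rate > 100
      bp <- result.getAll(bpTag).asScala if bp.systolic < 100
    } c.output(Alert(userId, hr.heart_rate, bp.systolic))
  }
}

val alerts: PCollection[Alert] = joined
  .apply(ParDo.of(new AlertFn(hrTag, bpTag)))
  .setCoder(SerializableCoder.of(classOf[Alert]))
```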

Like in the above cases, we'll need a way to effectively throttle the alerts since we trigger them based upon sliding windows. Since I had to resort to stateful stream processing in the other frameworks, I thought I had to do the same with Beam. I waited until the Feb 2017 announcement of stateful stream processing in the Beam model to do this example, but then realized that Beam's stateful processing is currently restricted to processing within a window and key, similar to side inputs. The data-driven trigger proposal might fix that in the future.

However, what I effectively want is to group alerts that occur close together, as they "belong" together, and session windows accomplish just that.
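
A sketch of that re-windowing: key the alerts by user, group them per session, and keep only one alert per session as the "leading debounce":

```scala
import org.apache.beam.sdk.transforms.windowing.Sessions
import org.apache.beam.sdk.transforms.{GroupByKey, SerializableFunction, WithKeys}
import org.apache.beam.sdk.values.TypeDescriptors

// Alerts for the same user that occur within 60 minutes of each other
// end up in the same session window.
val sessionedAlerts = alerts
  .apply(WithKeys.of(new SerializableFunction[Alert, Integer] {
    override def apply(a: Alert): Integer = Int.box(a.userId)
  }).withKeyType(TypeDescriptors.integers()))
  .apply(Window.into[KV[Integer, Alert]](Sessions.withGapDuration(Duration.standardMinutes(60))))
  .apply(GroupByKey.create[Integer, Alert]())

// Leading debounce: emit a single alert per user session. Note that the order of
// the grouped iterable isn't guaranteed; a real version might pick the earliest by timestamp.
class FirstAlertFn extends DoFn[KV[Integer, java.lang.Iterable[Alert]], Alert] {
  @ProcessElement
  def processElement(c: ProcessContext): Unit = {
    val alertsInSession = c.element().getValue.iterator()
    if (alertsInSession.hasNext) c.output(alertsInSession.next())
  }
}

val throttledAlerts = sessionedAlerts
  .apply(ParDo.of(new FirstAlertFn))
  .setCoder(SerializableCoder.of(classOf[Alert]))
```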

We effectively reassign the alerts to a session window and then do a leading debounce manually.

Overall, Beam was a pleasure to work with. The abstractions all make a lot of sense to me. Beam is quite young however, so documentation and general content/examples are very hard to find.
