Efficient Stream Processing with Pulsar Functions
Introduction
The topology of streaming systems can range from a simple system with one node for collecting messages and one for processing them, to a complex web of distributed systems connected via RPC calls. The advantages and disadvantages of adopting these technologies are frequent topics of conference talks and entire books. This post explores the reasoning and process behind migrating streaming workflows from a highly distributed, complex stream processing architecture to a simplified one based on Apache Pulsar and Pulsar Functions. Readers should come away with an understanding of Pulsar Functions and where they fit in the stream processing landscape.
Background
For many years, Spark Streaming and AWS Kinesis had been adequate for my needs. I settled on this architecture for stream processing after experimenting with Kafka, Flink, Storm, Heron, and other systems. It handled both simple use cases and more complex event processing problems. While Kinesis didn't support long-term storage of data, the three-day retention period was more than adequate for the system. Over time, some warts began to show in the platform that needed to be addressed, among them the overhead of running a distributed streaming engine for simple streaming tasks and the need to build a custom system to replay messages on Kinesis. Faced with these realizations, it was time to experiment with other systems again. I gathered a list of active projects and put together a table of streaming systems that could meet the requirements.
Search & Experimental Design
While gathering the requirements for this new system, it became evident that not all stream processing is created equal. Some streaming jobs were simple transformations that put messages back onto a stream; others were complex, memory- and CPU-intensive processes designed to handle the complex event processing use case, where data may arrive out of order within a time window or where certain events may trigger new stream processing pipelines to spawn. For the latter case, tools like Spark, Heron, and Flink seemed like a no-brainer, but for the simple case, it was questionable whether to adopt a complex topology with distributed state just to do small computations on streams of data where ordering didn't matter. I decided to narrow down my list and research tools that would enable a simple stream processing topology for these cases.
Outside of some managed offerings, Apache Pulsar (the distributed pub/sub and queuing system) with Pulsar Functions offered the simplest topology. Additional benefits of using Pulsar were its ease of operability on Kubernetes and its flexibility in storing data long term via the tiered storage API. Pulsar Functions are lightweight functions that run on the Pulsar nodes themselves. They consume Pulsar topics and execute predefined logic on each message or batch of pub/sub messages. They are conceptually similar to AWS Lambda + Kinesis; however, the functions and the Pulsar nodes share a resource pool. An additional benefit of this setup is reduced network latency, since the data is streamed and processed on the same hardware. My only hesitation at the time concerned the scalability of Pulsar Functions. In my tests, I proved Pulsar could handle the message volume required, but Pulsar Functions was in beta at the time, and it was unclear how processing data on Pulsar nodes would affect the system as a whole and whether I would run into backpressure or CPU constraints.
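To make the programming model concrete, here is a minimal sketch of a Pulsar Function written against the Java SDK interface (the class name and behavior are placeholder examples of mine, not one of the pipelines from the experiment). A function implements Function<I, O>, and whatever process returns is published to the function's configured output topic.

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Minimal Pulsar Function: receives each message from the input topic,
// appends a marker, and returns the result, which Pulsar publishes to the
// function's configured output topic.
public class ExclamationFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        context.getLogger().debug("Processing message: {}", input);
        return input + "!";
    }
}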
My experiment had the following parameters. With a five-node Pulsar cluster consuming ~1000 messages per second, I set up three pipelines (illustrated below). One pipeline simply stripped fields from pub/sub messages atomically, another changed the value of a message based on the current time, and the third filtered out messages based on a message field. Each function pushed data into a new Pulsar topic. Over the course of a week, with a steady stream of work for this cluster, I observed system metrics and cluster behavior to note any anomalies and to test the overall resilience of the system.
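As an illustration of how small these functions are, the filtering pipeline amounts to little more than the following sketch (the "status" field and the naive string check are hypothetical stand-ins for real message fields and proper JSON parsing). Returning null tells Pulsar not to publish anything downstream for that message.

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Sketch of a filtering function: drop any message whose (hypothetical)
// "status" field is "internal". Returning null means no message is
// published to the output topic; returning the input passes it through.
public class FilterFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        // Naive string check stands in for proper JSON parsing here.
        if (input.contains("\"status\":\"internal\"")) {
            return null; // filtered out
        }
        return input;
    }
}

Functions like these can be deployed to the cluster with the pulsar-admin functions create command, wiring each function's input to its source topic and its output to the new downstream topic.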
Results
During the week, the Pulsar Functions performed brilliantly. Each of the three pipelines kept up with the message workload, and even when there were some partial failures, the system recovered with no data loss and no manual intervention needed (my Pulsar nodes ran on Kubernetes). While there were some CPU and memory spikes, they happened when message volume exceeded the 1000 messages per second threshold (sometimes there were as many as 5000 messages per second) and settled down after a period. After this experiment, Pulsar Functions permanently took over these workloads and other similarly styled workloads.
Pulsar + Pulsar Functions achieved a much-simplified stream topology compared to Spark and Kinesis for this style of streaming job. On top of a simplified topology, it also provided a reasonable programming model and the lower cost of running one system instead of two.
The Future
The three areas I'm interested in exploring next with Pulsar are load balancing with Kubernetes, using external sinks with Pulsar Functions, and using external Java libraries with Pulsar Functions. Since the Pulsar nodes are on Kubernetes, we could (in theory) use the load balancer to spin Pulsar nodes up or down in response to demand. In the case where my pipelines were producing five times the normal load, adding new nodes to the Pulsar cluster and deploying more functions would help. I'm actively experimenting with this.
There are many exciting sinks I could use with Pulsar Functions, and while Pulsar I/O handles most of them, writing to a sink directly from a Pulsar Function could be advantageous for some of my pipelines. Finally, the Java ecosystem is incredibly rich, and there are many libraries I can use for machine learning, cryptography, and remote procedure calls, among other things. I look forward to exploring this functionality with Pulsar Functions soon.
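As a rough sketch of what writing directly to an external sink from a function might look like, here is a hypothetical function that inserts each message into a database over plain JDBC. The connection URL, credentials, and table are placeholders, and a production function would reuse a pooled connection across invocations rather than opening one per message.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Hypothetical sketch: write each message straight to an external JDBC sink
// from inside a Pulsar Function instead of routing through Pulsar I/O.
// A real function would hold a pooled connection open across invocations.
public class JdbcSinkFunction implements Function<String, Void> {
    @Override
    public Void process(String input, Context context) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db-host:5432/events", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO processed_events (payload) VALUES (?)")) {
            stmt.setString(1, input);
            stmt.executeUpdate();
        }
        return null; // no output topic needed; the database is the destination
    }
}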
Resources
Pulsar has incredible documentation; check it out.