Jowanza Joseph

Beginning Apache Flink

I’ve been committed to the Apache Spark ecosystem for as long as I’ve been doing data engineering. I’ve seen its adoption grow, been a fan of its APIs, and even decided to write a book about Spark. Through all of this, there has been a nagging sense that I should try a “true” streaming framework like Twitter Heron, Apache Flink, or Apache Apex. I’ve read articles and watched videos about the difference between a true streaming and a micro-batch framework and was mostly unconvinced it would make a difference for me. Then I started a project where Flink ended up being the go-to streaming framework, and I learned a great deal about it. This post is a collection of notes and things I noticed while using Flink; hopefully anyone starting out with Flink will find it helpful.

Getting Started

Like Spark, Flink is fairly overwhelming to get started with, mostly because of installation and run-time configuration. After a bunch of searching around, I was able to put together a decent starter SBT config for Flink. I used IntelliJ to work with Flink; given the complexity of the API, the type hinting and other niceties come in pretty handy.
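Here is roughly the shape of that starter config, as a minimal sketch. The project name and version numbers are assumptions; pin them to whatever is current for your setup:

```scala
// build.sbt: a minimal starter, not an exact copy of my project's config.
// The Flink and Scala versions below are assumptions; use current ones.
name := "flink-starter"
version := "0.1.0"
scalaVersion := "2.11.12"

val flinkVersion = "1.4.2"

libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-scala"           % flinkVersion,
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion
)
```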

For other ways of running Flink (Docker, a cluster, Mesos), I found the docs pretty helpful. This blog is fairly elementary, so I won’t cover those setups in this post.

Ergonomics / API

What I noticed early on using Flink is that it really was designed with streaming in mind first. This is evident in the distinct APIs for a KeyedStream versus a DataStream versus a WindowedStream. Once you work with streams for a little while, you’ll find yourself wanting an API that is robust and handles streams in a sensible way. In Spark Streaming, I found myself writing a bunch of boilerplate code to accomplish what Flink is designed to do out of the box. One of the places this is most evident is the DataStream and its derivatives; I’ll walk through each.

DataStream

A DataStream is a typed stream of data processed one element at a time. It’s what all of Flink’s streaming architecture is built on. You can use this stream to do whatever you want. For example, I could read in a stream and add two to each value:
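Something like this minimal sketch; the socket source on port 9999 and the integer-per-line input format are assumptions for illustration:

```scala
import org.apache.flink.streaming.api.scala._

object AddTwo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Assumes something like `nc -lk 9999` is feeding one integer per line
    val numbers: DataStream[Int] = env
      .socketTextStream("localhost", 9999)
      .map(_.trim.toInt)

    // Add two to each value as it arrives
    numbers.map(_ + 2).print()

    env.execute("Add two to each value")
  }
}
```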

Or, I could read in a stream and write it back out to the file system.
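Again a sketch, with made-up input and output paths:

```scala
import org.apache.flink.streaming.api.scala._

object CopyToFileSystem {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Read a file as a stream and write it straight back out
    val lines: DataStream[String] = env.readTextFile("/tmp/input.txt")
    lines.writeAsText("/tmp/output")

    env.execute("Copy a stream to the file system")
  }
}
```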

KeyedStream

A KeyedStream is a stream grouped and evaluated by a key. Typically, this is the first step in a longer process of creating a window or another type of stream. The API is pretty straightforward: you just add a .keyBy("Name your key") call to a DataStream. In and of itself it’s not all that useful; it comes in handy with a window.
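For example, keying a stream of (name, count) pairs; the tuples here are assumptions for illustration:

```scala
import org.apache.flink.streaming.api.scala._

object KeyedExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val events: DataStream[(String, Int)] =
      env.fromElements(("alice", 1), ("bob", 2), ("alice", 3))

    // keyBy turns the DataStream into a KeyedStream, partitioned by name
    val byName: KeyedStream[(String, Int), String] = events.keyBy(_._1)

    // A running sum per key, just to show the keyed stream doing something
    byName.sum(1).print()

    env.execute("Key a stream by name")
  }
}
```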

WindowedStream

A WindowedStream is what you’d expect: a DataStream evaluated over some window. This is how Spark Streaming works out of the box and how most introductory streaming examples are presented. Flink’s API is very robust, letting you define windows based on a number of occurrences as well as windows of time. That kind of control works out of the box, rather than requiring custom code.
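Here’s a sketch of both styles over a keyed stream; the socket source and the word-count shape are assumptions:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val words = env
      .socketTextStream("localhost", 9999)
      .flatMap(_.toLowerCase.split("\\s+"))
      .map((_, 1))

    // A window of time: word counts every ten seconds
    words.keyBy(_._1).timeWindow(Time.seconds(10)).sum(1).print()

    // A window by number of occurrences: a count every 100 elements per key
    words.keyBy(_._1).countWindow(100).sum(1).print()

    env.execute("Windowed word counts")
  }
}
```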

You can also define what are known as Triggers and Evictors, where you control exactly when a window fires and which elements are kept or dropped. These require a bit more orchestration but can be effective for a number of streaming situations.
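A sketch of wiring a trigger and an evictor onto a window; the global-window-plus-count-trigger pairing and the element counts are assumptions for illustration:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.evictors.CountEvictor
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

object TriggerEvictorSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val events = env.fromElements(("a", 1), ("a", 2), ("b", 3), ("a", 4))

    events
      .keyBy(_._1)
      .window(GlobalWindows.create())             // never fires on its own
      .trigger(CountTrigger.of[GlobalWindow](3))  // fire every three elements per key
      .evictor(CountEvictor.of[GlobalWindow](3))  // keep only the last three elements
      .sum(1)
      .print()

    env.execute("Trigger and evictor sketch")
  }
}
```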

Technically, Spark supports all three of these, but not with the same ease of use. With the new Structured Streaming API most of these mechanics are there now, but it’s still not as intuitive as Flink in my opinion.

Another thing that is extremely nice about Flink is its dashboard. Spark’s dashboard is notoriously bad, and it’s nowhere near as nice as Flink’s.

It’s quite easy to follow the flow of data in Flink and consume and create streaming sources without a ton of effort.

Batch Model

The real strength of Spark is its batch model, but Flink is actually pretty nice to use in comparison. Most batch-related transformations are just as pleasant to write as they are in Spark. Here are a few examples:
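A sketch using Flink’s DataSet API; the input path and word-count shape are assumptions for illustration:

```scala
import org.apache.flink.api.scala._

object BatchWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val text: DataSet[String] = env.readTextFile("/tmp/words.txt")

    // The familiar transformations: flatMap, filter, map, groupBy, aggregate
    val counts = text
      .flatMap(_.toLowerCase.split("\\s+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(0)
      .sum(1)

    // For the DataSet API, print() also triggers execution
    counts.print()
  }
}
```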

Ecosystem

When your bar for an ecosystem is Spark, you’ll be hard pressed to meet it. Flink supports all of the major streaming technologies like Apache Kafka, AWS Kinesis, and Debezium. Being the newer kid on the block, its ecosystem is just not as rich as what Spark has to offer. That said, with full support of the Scala and Java ecosystems, I have yet to find a situation Flink couldn’t handle.

Summary

It’s clear that a good deal of smart people and companies are investing a lot in Apache Flink, but the question remains: should you? A lot is written and said about streaming, but in my experience the majority of workloads are still batch. Spark is a much more mature ecosystem for batch workflows and a pretty mature one for streaming as well.

I’m a pretty big believer in streams and their potential. I’m not sure that streams are the wave of the future for dashboards, reporting, and other applications that are frozen in time, but I think it’s already clear that some of the most impressive and useful apps are built on real-time data. These are the kinds of applications I want to build my career on, so I’ll be using Flink more often. I hope to write some useful, beginner-friendly tutorials on Flink in the coming months as well!
