Creating A Spark Server For Every Job With Livy
One of the frustrations most people who are new to Spark have is figuring out how exactly to run Spark. Before running your first Spark job you’re likely to hear about YARN or Mesos, and it can seem like running a Spark job is a world unto itself. This barrier to entry makes it harder for beginners to imagine what’s possible with Spark. Livy provides a RESTful interface to Apache Spark: it hides some of the details of Spark’s execution mechanics and lets developers submit programs to a Spark cluster and get results back. This post is a summary of my notes on using Livy to send jobs, queued from webhooks, to a Spark cluster. My aim is to talk about the benefits and drawbacks of this setup, along with a small tutorial on Livy. If you’re interested in using Livy, the documentation is excellent.
The Basics
When running a Spark job, you typically submit it with spark-submit or run it interactively in a Spark shell. This can be in Python or Scala, but submitting a Spark job looks something like this (the class name, JAR path, and master below are placeholders):
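```sh
# A typical spark-submit invocation; the class name, JAR path,
# and master are illustrative, not from a real project.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  target/my-app-assembly-1.0.jar
```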
There are some exceptions, notably if you’re working in a notebook context like Jupyter, Zeppelin, or Beaker. In these cases, the notebook is bound to a Spark shell, so you can run jobs dynamically instead of submitting JAR files or Python files.
In either context, you need a SparkContext (either created in the notebook or within the file submitted to the shell), and your code is isolated to your environment. This is fine for most workloads and for development, but it limits the kinds of programs you can write in Spark and the number of services that can communicate with Spark.
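For reference, here is what that context setup typically looks like in a submitted Scala file (the app name is a placeholder; in modern Spark the SparkContext usually comes from a SparkSession):

```scala
import org.apache.spark.sql.SparkSession

// Create (or reuse) a session for this application.
val spark = SparkSession.builder()
  .appName("my-job") // placeholder name
  .getOrCreate()

// The lower-level SparkContext, if you need it directly.
val sc = spark.sparkContext
```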
For example, if we built a regression model in Spark and wanted to run live data through it, it’s not immediately obvious how we’d do that, or over what protocol. It all seems too boxed in and tightly coupled to the machine it’s running on. That’s where Livy is helpful. Livy exposes a REST endpoint for Spark, allowing you to submit jobs from anywhere. How it accomplishes this is a bit tricky, so I’ll walk through the mechanics of it.
Mechanics
Spark doesn’t expose a RESTful protocol to its engine; however, with a little work you can create a REST API server that translates Python, Scala, or R code into Spark jobs and returns the results. This is essentially what Livy does (forgive the oversimplification). It allows us, for example, to write a DSL that submits Spark jobs over REST and gets data back. (There are other ways to go about this, like MLeap, that I’ll cover in a future post.)
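To make the mechanics concrete, the raw HTTP conversation with Livy looks roughly like this (a sketch against a local Livy server on its default port 8998, assuming the session and statement ids both come back as 0):

```sh
# Create an interactive Scala session.
curl -X POST -H 'Content-Type: application/json' \
     -d '{"kind": "spark"}' \
     http://localhost:8998/sessions

# Once the session reports state "idle", submit code to it.
curl -X POST -H 'Content-Type: application/json' \
     -d '{"code": "1 + 1"}' \
     http://localhost:8998/sessions/0/statements

# Poll until the statement's output shows up.
curl http://localhost:8998/sessions/0/statements/0
```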
The power of doing this should be immediately obvious, but so should the drawbacks. I worked through two examples: one to explore the API behind Livy, and one to try to actually use REST to do something interesting.
A RESTful Endpoint Example
My first example is just an endpoint that squares the integer it receives in a POST request. For example, POST /2 would reply with 2^2 = 4. I chose something this simple strategically, because there’s enough complexity in just putting one of these endpoints together. My example is in Scala, but you could do the same thing in PySpark or SparkR. Here is the endpoint. I commented in the code about each part and what it’s doing; I find that much easier than posting the code and explaining it after the fact:
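(What follows is a minimal sketch of the idea rather than battle-tested code: it assumes a Livy server on localhost:8998, serves HTTP with the JDK’s built-in com.sun.net.httpserver, talks to Livy with the scalaj-http client, and pulls fields out of Livy’s JSON with regexes to stay dependency-free. A real service would use a proper JSON library and handle errors.)

```scala
import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import scalaj.http.Http

object SquareServer {
  val livy = "http://localhost:8998"

  // Quick-and-dirty JSON field extraction; fine for a sketch only.
  def intField(json: String, field: String): Int =
    ("\"" + field + "\"\\s*:\\s*(\\d+)").r.findFirstMatchIn(json).get.group(1).toInt

  def strField(json: String, field: String): String =
    ("\"" + field + "\"\\s*:\\s*\"([^\"]*)\"").r.findFirstMatchIn(json).get.group(1)

  def post(path: String, body: String): String =
    Http(livy + path).postData(body)
      .header("Content-Type", "application/json").asString.body

  def get(path: String): String = Http(livy + path).asString.body

  // Create an interactive Scala session and wait until it is idle.
  def createSession(): Int = {
    val id = intField(post("/sessions", """{"kind": "spark"}"""), "id")
    while (strField(get(s"/sessions/$id"), "state") != "idle") Thread.sleep(1000)
    id
  }

  // Run one statement and block until its output is available.
  def run(sessionId: Int, code: String): String = {
    val stmt = intField(
      post(s"/sessions/$sessionId/statements", s"""{"code": "$code"}"""), "id")
    var body = get(s"/sessions/$sessionId/statements/$stmt")
    while (strField(body, "state") != "available") {
      Thread.sleep(500)
      body = get(s"/sessions/$sessionId/statements/$stmt")
    }
    // Livy returns the REPL output under data -> "text/plain".
    strField(body, "text/plain")
  }

  def main(args: Array[String]): Unit = {
    val session = createSession()
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/", new HttpHandler {
      def handle(ex: HttpExchange): Unit = {
        // POST /2 -> ask Spark, via Livy, to compute 2 * 2.
        val n = ex.getRequestURI.getPath.stripPrefix("/").toInt
        val answer = run(session, s"val x = $n; x * x")
        val bytes = answer.getBytes("UTF-8")
        ex.sendResponseHeaders(200, bytes.length)
        ex.getResponseBody.write(bytes)
        ex.close()
      }
    })
    server.start()
  }
}
```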
Predict My Weight
To do something a bit more applicable to an actual workload, I created a silly model. The model predicts what my weight will be one week from today, based on how many calories I ate and how many calories I burned today. It’s wildly inaccurate, but good enough for the purposes of this blog. I enter my weight and calories burned into a Google Sheet, and Microsoft Flow triggers an HTTP event that fires at my Livy server and calculates my weight.
Here is a rough sketch of what will happen.
This works a little differently from the example above. Instead of writing a Scala HTTP client, I can just make a POST request from the Microsoft Flow HTTP client. I won’t walk through how to do that, since the example above already illustrates the idea and the UI is intuitive. Essentially, I add my weight and the calories I burned today into a spreadsheet; that triggers an event which predicts my weight and adds it as a new column in a separate spreadsheet. Here is the function:
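(A sketch of the shape of it: the coefficients below are invented for illustration and are not the ones that produced the numbers further down. The point is only what kind of code gets shipped to Livy as a statement.)

```scala
// Predict next week's weight from today's weight and calories burned.
// Rough logic: every ~3500-calorie deficit costs about a pound, spread
// over the week. Wildly unscientific, which is the point.
def predictWeight(currentWeight: Double, caloriesBurned: Double): Double =
  currentWeight - (caloriesBurned * 7 / 3500.0)

// This is the kind of statement Livy executes when the Flow webhook fires.
val predicted = predictWeight(199, 1500)
```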
I entered 199 as my current weight and 1,500 as calories burned today (both fake numbers), and it predicted my weight would be 188.99 a week from now.
Summary
Livy provides an interesting way to use Spark as a RESTful service. In my opinion, though, this is not an ideal way to interact with Spark. There is just a tad too much language-interoperability overhead to make it worth it. For starters, sending strings of Scala code over the wire doesn’t inspire a lot of confidence. It’s also not immediately clear what benefit executing pre-defined JAR files over REST brings. On the positive side, Livy was much faster than I expected. For a use case as contrived as the one I made up for this blog it’s pretty solid, but the model in general might be hard to scale and reason about.
Notes
- Microsoft Flow is very cool. I know it seems like an IFTTT clone, but with the ability to send HTTP requests and webhooks it’s much more customizable. Also, the free tier is much more generous than something like Zapier’s.
- This stuff takes forever to configure and use the first time.
- There are some alternative projects that aim to accomplish the same task as Livy, most notably spark-jobserver, which I think is a little easier to use but which I didn’t find out about until long after I started experimenting with Livy. If anyone would be interested in a tutorial on that, feel free to let me know.