Blog — Jowanza Joseph

Creating A Spark Server For Every Job With Livy

Livy provides an interesting way to use Spark as a RESTful service. In my opinion, however, this is not an ideal way to interact with Spark: the language-interoperability overhead is a bit too high to make it worthwhile. For starters, sending strings of Scala code over the wire doesn’t inspire a lot of confidence.

Apache Spark, Data Engineering | Jowanza Joseph | April 19, 2017 | Scala, Apache Spark
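To make the "strings of Scala code over the wire" point concrete, here is a minimal sketch of what a client hands to Livy when it submits a statement to an existing session. It only builds the request and does not send it. The host, session id, and Scala snippet are assumptions for illustration; the `/sessions/{id}/statements` path and the `code` field follow Livy's REST API as I understand it.

```python
import json
from typing import Tuple


def statement_request(base_url: str, session_id: int, scala_code: str) -> Tuple[str, str]:
    """Build the (url, JSON body) pair for submitting one statement to a Livy session."""
    url = f"{base_url.rstrip('/')}/sessions/{session_id}/statements"
    # The Scala program travels as an opaque string inside a JSON document,
    # so nothing checks it until Livy evaluates it on the Spark side.
    body = json.dumps({"code": scala_code})
    return url, body


# Hypothetical host and session id; 8998 is Livy's default port.
url, body = statement_request(
    "http://localhost:8998", 0, "sc.parallelize(1 to 10).sum()"
)
print(url)
print(body)
```

A typo in the Scala string would only surface after the round trip, which is roughly the lack of confidence the excerpt describes.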

A Gentle Intro To Graph Analytics With GraphFrames

GraphFrames allows us to do exactly this. It’s an API for doing graph analytics on Spark DataFrames. This way, we can try to recreate SQL queries in graphs and get a better grasp of the graph concepts. Not having to load the data and create the relationships makes a big difference in a pedagogical context (at least, that’s what I’ve found).

Apache Spark, Data Engineering | Jowanza Joseph | April 2, 2017 | Scala, Apache Spark, Graph Analytics, GraphX

Which Hadoop File Format Should I Use?

For the past few weeks, I’ve been testing Amazon Athena as an alternative to standing up Hadoop and Spark for ad hoc analytical queries. During that research, I’ve been looking closely at file formats for the style of data stored in S3 for Athena. I have typically been happy with Apache Parquet as my go-to because of its popularity and guarantees, but some research pointed me to Apache ORC and its advantages in this context.

Apache Spark, Data Engineering | Jowanza Joseph | March 23, 2017 | Apache Spark, Scala, Amazon Athena, Amazon S3

Compact and Quick In Memory Text Search With Succinct

In a recent project, I wanted to do text searches over a large unstructured dataset (100 GB) in memory and I was able to do it in Spark once I provisioned a machine with enough memory. I was able to do it quickly and efficiently, but I was bugged that I couldn't compress the data and had to spin up a master with that much memory.

Apache Spark, Data Engineering | Jowanza Joseph | March 9, 2017 | Apache Spark, Scala

Time-Series Missing Data Imputation In Apache Spark

In a recent project, I needed to do some time-based imputation across a large set of data. I tried to implement my own solution with moderate success before scouring the internet for a solution. After an hour or so, I came across this article about the Spark-TS package.

Apache Spark, Data Engineering | Jowanza Joseph | December 5, 2016 | Scala, Apache Spark