Blog — Jowanza Joseph

Why h2o Sparkling Water?

Despite having an SEO-hostile name, h2o.ai is a pretty cool company. They have developed a great open source plug-and-play data science platform in h2o. They have some other noteworthy projects and, of course, Sparkling Water, the subject of this post. Sparkling Water is essentially the h2o APIs on top of Spark, allowing h2o to take advantage of Spark's distributed computing model. That being said, is it worth loading another dependency when Spark's MLlib is adequate for most machine learning needs? I went through this exercise a few weeks ago, and this post is mostly my notes with some added illustration and some code.

Apache Spark | Jowanza Joseph | January 17, 2017 | Apache Spark

What is Alluxio and will it help my Spark Jobs?

Alluxio is an open source project aimed at solving caching for analytical applications. If that doesn’t mean anything to you, then this may be the wrong post for you. Alluxio provides a way to reduce the cost of data querying (I’ll explain this later) without the added complexity of extra databases or long-term storage solutions.

Apache Spark | Jowanza Joseph | January 8, 2017 | Apache Spark, Alluxio

edX Introduction to OpenStack Course Review

I’ve been trying to get more involved in DevOps-related work as I’ve moved further down the tech stack in data engineering. I try to learn new tech during my free time and look for the best courses that fit my requirements of being concise, informative, and as hands-on as possible.

Apache Spark | Jowanza Joseph | January 4, 2017 | Openstack, Reviews

The How and Why of Spark and Couchbase

I could spend a lot of time gushing about Couchbase and the details of its architecture and implementation. I've grown to really love Couchbase as a NoSQL store, but my love for it isn't really a good reason to write a blog post.

Apache Spark | Jowanza Joseph | January 1, 2017 | Scala, Apache Spark

A Gentle Intro to UDAFs In Apache Spark

In Spark SQL, there are many APIs that allow us to aggregate data, but not all of the built-in methods are adequate for our needs. Fortunately, in these cases we can define our own aggregation functions, called user-defined aggregate functions (UDAFs).

Apache Spark | Jowanza Joseph | December 19, 2016 | Scala, Spark