Despite having an SEO-hostile name, h2o.ai is a pretty cool company. They have developed a great open source plug-and-play data science platform in h2o. They have some other noteworthy projects and, of course, Sparkling Water, the subject of this post. Sparkling Water is essentially the h2o APIs on top of Spark, allowing h2o to take advantage of Spark's distributed computing model. That being said, is it worth loading another dependency when Spark's MLlib is adequate for most machine learning needs? I went through this exercise a few weeks ago, and this post is mostly my notes with some added illustration and some code.
Alluxio is an open source project aimed at solving caching for analytical applications. If that doesn’t mean anything to you, then this may be the wrong post for you. Alluxio provides a way to reduce the cost of data querying (I’ll explain this later) without adding the complexity of extra databases or long-term storage solutions.
I’ve been trying to get more involved in DevOps-related work as I’ve moved further down the tech stack in data engineering. I try to learn new tech during my free time, and I look for the courses that best fit my requirements: concise, informative, and as hands-on as possible.
I can spend a lot of time gushing about Couchbase and the details of its architecture and implementation. I've grown to really love Couchbase as a NoSQL store, but my love for it isn't really a good reason to write a blog post.
In Spark SQL, there are many APIs that allow us to aggregate data, but not all of the built-in methods are adequate for our needs. Fortunately, in these cases we can define our own aggregation functions, called user-defined aggregate functions (UDAFs).