Redshift to MySQL and Back Again with Apache Spark
Moving data out of Redshift can be a tricky task. Natively, Redshift only supports unloading data in batch to S3 and RDS. If your target database isn't hosted on Amazon, getting data transferred quickly, safely, and easily is non-trivial. Fortunately, Apache Spark can act as the glue between Redshift and other databases. I recently used Apache Spark to move data from Redshift to MySQL, and in this post I'll share how.
Spark-Redshift
The kind folks at Databricks provided a driver for communicating with Redshift from Spark. There are a handful of peculiarities about using it that I'll go over. You can read about all the details in the repo; here I'll give a high-level view.
Under the hood, spark-redshift uses Amazon S3 for temporary storage. On reads it performs an UNLOAD into S3 and then loads those files into Spark; on writes it stages files in S3 and issues a COPY into Redshift. To use it, you need an S3 bucket with enough space for a short-term dump of the compressed data. Finally, that bucket needs to be in the same region as your Redshift cluster.
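Concretely, the temporary location and the S3 credentials are wired through Spark's Hadoop configuration. Here's a minimal sketch, assuming `sc` is the job's SparkContext and the s3a filesystem is available; the bucket name and keys are placeholders (older spark-redshift examples use s3n:// URLs with the matching fs.s3n.* keys instead):

```scala
// Placeholders -- substitute your own keys and bucket. The bucket must live in
// the same region as the Redshift cluster.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_AWS_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_AWS_SECRET_KEY")

// Everything spark-redshift UNLOADs or stages for COPY lands under this prefix,
// so an S3 lifecycle rule that expires old objects here keeps the bucket tidy.
val tempDir = "s3a://my-temp-bucket/spark-redshift/"
```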
Putting it all together
I used SBT to manage dependencies for this project; here are my configurations:
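(The build.sbt below is a sketch rather than my exact file: the library versions are assumptions, so check the spark-redshift README for the ones that match your Spark build.)

```scala
name := "redshift-to-mysql"
version := "0.1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Spark itself is provided by the cluster when running via spark-submit.
  "org.apache.spark"  %% "spark-core"           % "2.0.2" % "provided",
  "org.apache.spark"  %% "spark-sql"            % "2.0.2" % "provided",
  // Databricks' Redshift data source.
  "com.databricks"    %% "spark-redshift"       % "2.0.1",
  // JDBC drivers: Redshift speaks the Postgres wire protocol, MySQL needs its own connector.
  "org.postgresql"     % "postgresql"           % "9.4.1212",
  "mysql"              % "mysql-connector-java" % "5.1.40",
  // S3 filesystem support for the temporary UNLOAD location.
  "org.apache.hadoop"  % "hadoop-aws"           % "2.7.3"
)
```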
Here is the code:
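(The listing below is a sketch rather than a drop-in script: it assumes Spark 2.x with the dependencies above, and the hostnames, table names, and credentials are placeholders.)

```scala
import java.util.Properties

import org.apache.spark.sql.{SaveMode, SparkSession}

object RedshiftToMySql {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("redshift-to-mysql")
      .getOrCreate()

    // Credentials for the temporary S3 location (placeholders; an instance
    // profile or environment variables work here as well).
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", "YOUR_AWS_ACCESS_KEY")
    hadoopConf.set("fs.s3a.secret.key", "YOUR_AWS_SECRET_KEY")

    // Read a table out of Redshift. spark-redshift UNLOADs the result to the
    // tempdir and loads the files into a DataFrame. The Postgres JDBC driver
    // works against Redshift; with Amazon's own driver you'd use a
    // jdbc:redshift:// URL instead. Use .option("query", ...) in place of
    // "dbtable" to pull an arbitrary SELECT.
    val redshiftDF = spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:postgresql://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/warehouse?user=rs_user&password=rs_pass")
      .option("dbtable", "events")
      .option("tempdir", "s3a://my-temp-bucket/spark-redshift/")
      .load()

    // Write the DataFrame into MySQL through Spark's generic JDBC writer.
    val mysqlProps = new Properties()
    mysqlProps.setProperty("user", "mysql_user")
    mysqlProps.setProperty("password", "mysql_pass")
    mysqlProps.setProperty("driver", "com.mysql.jdbc.Driver")

    redshiftDF.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:mysql://mysql-host:3306/analytics", "events", mysqlProps)

    spark.stop()
  }
}
```

Package it with sbt and launch it with spark-submit; the MySQL write goes through Spark's generic JDBC path, so the only MySQL-specific piece is having mysql-connector-java on the classpath.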