Building a Real-Time Bike-Share Data Pipeline with StreamSets, Kafka and MapD
In this post, we will use the Ford GoBike Real-Time System, StreamSets Data Collector, Apache Kafka and MapD to create a real-time data pipeline of bike availability in the Ford GoBike bikeshare ecosystem. We’ll walk through the architecture and configuration that enables this data pipeline and share a simple auto-updating dashboard within MapD Immerse.
High-Level Architecture
The high-level architecture consists of two data pipelines: one polls the GoBike system and publishes that data to Kafka; the other consumes from Kafka using StreamSets, transforms the data, and writes it to MapD:
In the remainder of this post, I’ll outline each of the components of the pipelines and their configurations.
GBFS
GBFS (the General Bikeshare Feed Specification) is a data feed specification for bike share programs. It defines a set of JSON schemas so that every bike share system can publish the same data in the same form, open for app developers to incorporate into their systems and for researchers to analyze. The Ford GoBike system publishes GBFS feeds, so for simplicity I'll refer to data from Ford as GBFS for the remainder of this article.
GBFS defines several feeds, and the details about each can be found here. For our use case, we're interested in each station’s status, as well as each station’s location for later visualization. For a sample of the feeds, you can check out the documentation.
For this project, we’ll poll GBFS for the station status. Station locations change infrequently, so a daily or weekly cron pull of the station locations is sufficient to keep our tables up to date.
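To make the rest of the pipeline concrete, here is a trimmed, illustrative example of what a station status payload looks like under the GBFS station_status schema (the values are made up; the field names follow the spec):

```python
# A trimmed station_status payload, shaped per the GBFS spec (values are illustrative).
station_status_example = {
    "last_updated": 1519862850,   # Unix epoch of this snapshot
    "ttl": 10,                    # seconds until the feed should be refreshed
    "data": {
        "stations": [
            {
                "station_id": "3",
                "num_bikes_available": 7,
                "num_docks_available": 8,
                "is_installed": 1,
                "is_renting": 1,
                "is_returning": 1,
                "last_reported": 1519862840,
            },
            # ...one object per station in the system...
        ]
    },
}
```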
StreamSets
The majority of the heavy lifting for this system is handled by StreamSets. If you're unfamiliar with StreamSets, their website and documentation are top notch. At a high level, StreamSets is a plug-and-play stream-processing framework; I like to think of it as Spark Streaming with a UI on top. It provides a drag-and-drop interface for the source-processor-sink streaming model.
HTTP Client
To poll GBFS, I created an HTTP client in StreamSets with the following configurations:
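In plain Python terms, the origin's job boils down to a GET request against the feed URL on a fixed interval. The sketch below uses the Ford GoBike GBFS URL as it existed at the time of writing and the 10-second interval mentioned later in the post; treat both as stand-ins for whatever you set in the configuration:

```python
import time
import requests

# Endpoint and interval are stand-ins for the values in the HTTP Client configuration.
STATION_STATUS_URL = "https://gbfs.fordgobike.com/gbfs/en/station_status.json"
POLL_INTERVAL_SECONDS = 10

def poll_forever(handle_payload):
    """Repeatedly fetch the station status feed and hand each payload downstream."""
    while True:
        resp = requests.get(STATION_STATUS_URL, timeout=10)
        resp.raise_for_status()
        handle_payload(resp.json())
        time.sleep(POLL_INTERVAL_SECONDS)
```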
Each GET request returns the status of every Ford GoBike station, as outlined in the GBFS specification. The more desirable state would be for each station_id JSON object to be its own message. Fortunately, StreamSets has a pivot processor which will take a JSON array like this:
and flatten it so that each object is its own message:
The configuration looks like this:
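Conceptually, the pivot is doing nothing more than this (a plain-Python sketch of the transformation, not the StreamSets processor itself):

```python
def pivot_stations(gbfs_payload):
    """Flatten one polled GBFS payload into one record per station.

    Mirrors pivoting on /data/stations: each element of the array becomes
    its own message for the rest of the pipeline.
    """
    return list(gbfs_payload["data"]["stations"])
```

Each record that comes out the other side is a single station's status, which is exactly the granularity we want in Kafka and, ultimately, in MapD.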
Kafka Producer
With the data polled from the GBFS API and flattened, it’s time to write it to Kafka. Connecting to Kafka with StreamSets is quite simple: a default installation of StreamSets comes with the Kafka package, which has all the facilities for producing and consuming Kafka messages.
If you have a distributed deployment of Kafka (which you should for production), make sure all of the Kafka node URIs are entered:
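For readers who want to see the moving parts outside of StreamSets, a minimal equivalent producer written with the kafka-python client might look like this; the topic name and broker addresses are placeholders:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# List every broker in the cluster so the client can survive individual node failures.
producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092", "kafka3:9092"],
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def publish_stations(stations, topic="station_status"):
    """Publish one message per flattened station record."""
    for station in stations:
        producer.send(topic, station)
    producer.flush()
```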
Kafka Consumer
With the producer pipeline working, we’re publishing about 500 messages every 2 seconds to Kafka. We’d like to consume that data and write it to MapD. The StreamSets Kafka Consumer is easy to configure:
For the consumer, you need both the broker URIs and the ZooKeeper URIs. In initial testing, I ran into failures downstream at the JDBC connector: some fields in the JSON payload are strings, but I had defined them as integers in my MapD table schema.
Using the StreamSets Field Converter, I isolated these fields and converted them to the appropriate data types.
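Outside of StreamSets, the consumer plus the Field Converter step amounts to something like the sketch below. Exactly which fields need converting depends on your payload and table schema, so treat the list here as illustrative:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "station_status",
    bootstrap_servers=["kafka1:9092", "kafka2:9092", "kafka3:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Fields that arrive as strings in the payload but are integers in the MapD table
# (illustrative -- check your own payload and schema).
INT_FIELDS = ("num_bikes_available", "num_docks_available", "last_reported")

def convert_types(record):
    """Cast string-valued fields to integers before they reach the database sink."""
    for field in INT_FIELDS:
        if field in record:
            record[field] = int(record[field])
    return record

for message in consumer:
    row = convert_types(message.value)
    # ...hand `row` to the MapD writer...
```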
MapD via JDBC
Now that the rest of the pipeline is ready, we need to sink the data to our final destination, the MapD database.
Step 1 is getting the JDBC driver for MapD installed in StreamSets. You can download the latest driver here and install it from the packages page in StreamSets:
Step 2 is to configure the JDBC connection in StreamSets. Connecting to MapD via JDBC is outlined in their documentation. I used the following JSON configuration to set up each field in the MapD table and how they map to the JSON payload coming from Kafka:
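To give a feel for the destination side, here is a sketch of a matching table and write path using the pymapd Python client instead of JDBC. The table name, column list, and connection details are illustrative rather than copied from my actual configuration:

```python
from pymapd import connect  # pip install pymapd

# Connection details are placeholders -- adjust for your MapD installation.
con = connect(user="mapd", password="HyperInteractive",
              host="localhost", dbname="mapd")

# Columns mirror the GBFS station_status fields kept from each Kafka message.
con.execute("""
    CREATE TABLE IF NOT EXISTS station_status (
        station_id TEXT ENCODING DICT,
        num_bikes_available INTEGER,
        num_docks_available INTEGER,
        last_reported BIGINT
    )
""")

def write_rows(rows):
    """Bulk-load a batch of converted records; batching beats row-by-row inserts."""
    data = [(r["station_id"],
             r["num_bikes_available"],
             r["num_docks_available"],
             r["last_reported"]) for r in rows]
    con.load_table("station_status", data)
```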
With these configurations, our data pipeline is complete.
Conclusion & Dashboard
With the 10-second polling interval on GBFS and each GBFS record split into individual messages, we write roughly 1,000 messages per second to MapD. We could poll more often to increase the throughput, but this is adequate for our needs. Using MapD Immerse, we can build a dashboard that monitors our record counts in real time:
In a future post, we’ll use this same system to write more feeds that follow GBFS, testing how well streaming larger volumes of data into MapD scales.
Notes
Thanks to Randy Zwitch for editing previous drafts of this post and for his contributions to building the pipeline and provisioning the VMs.