Sunday 7 May 2017

Zeppelin + Scala: Consuming an HTTP endpoint

Summary

In our last post, we discussed how to execute Spark jobs in Zeppelin and then build nice SQL queries and graphs using the embedded SQLContext (provided along with the usual Spark context).

Today, we will see a way to populate your Spark RDDs/DataFrames with data retrieved from an HTTP endpoint/REST service. We will focus on parsing the resulting JSON response. Along the way, we will also learn how to import additional libraries into Zeppelin.


Import additional libraries to Zeppelin: ScalaJ-HTTP

One of the main objectives while writing software is to avoid reinventing the wheel: many problems have already been solved, and there is no point in coding them again.
In today's example, we will query an HTTP endpoint using Scala and, instead of writing our own implementation, we will use ScalaJ-HTTP.

This GitHub project greatly simplifies performing HTTP requests from Scala, covering the most common verbs (GET/PUT), header payloads, content types and security features. Take a look at this GET request with parameters:

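Here is a minimal sketch along the lines of the library's README; the URL and parameter below are just placeholders:

import scalaj.http._

// Hypothetical endpoint and query parameter, for illustration only
val response: HttpResponse[String] =
  Http("http://example.com/search")
    .param("q", "zeppelin")                 // sent as ?q=zeppelin
    .header("Accept", "application/json")
    .asString                               // executes the request synchronously

println(response.code)   // HTTP status, e.g. 200
println(response.body)   // response body as a plain String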

Now, I did not find any compiled binary for this library, so I had to clone the repository and run sbt package (you will need Git and SBT, of course). Moreover, I needed to adjust the Scala version to 2.11.8, which you can easily do by changing this line in the build.sbt file:

scalaVersion := "2.11.8"
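
With that change in place, the build boils down to something like the following sketch (assuming the usual GitHub location of the project and the standard SBT output path):

git clone https://github.com/scalaj/scalaj-http.git
cd scalaj-http
sbt package
# the jar should now be under target/scala-2.11/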

Then, go back to Zeppelin and add a first paragraph to your notebook that imports the new library:

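One way to do it is through Zeppelin's dependency-loading (%dep) interpreter; in this sketch the jar path is a placeholder for wherever your build left the artifact:

%dep
// must run before the Spark interpreter starts in this notebook
z.load("/path/to/scalaj-http/target/scala-2.11/scalaj-http_2.11.jar")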

Execute the paragraph and you are ready to use the library. Note that if any previous paragraph has already used an interpreter from the Spark group, you will need to restart that interpreter manually.

Executing the HTTP request and parsing the response

For this example we are going to use a test service called RestTestTest that returns this JSON content:

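For illustration purposes only, imagine a payload with roughly this shape (a made-up sample; the real file is linked below):

{
  "cities": [
    { "name": "Madrid",    "year": 2015, "population": 3100000 },
    { "name": "Madrid",    "year": 2016, "population": 3165000 },
    { "name": "Barcelona", "year": 2016, "population": 1610000 }
  ]
}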

Note: The service actually just echoes back the JSON we send to it in the "json" placeholder; this echoed payload is the data we will extract in the example.
Note (2): For the sake of brevity, the snippet above only sketches the shape of the JSON file. I have uploaded the whole file to this GitHub gist.

As you saw in the code above, executing the request is totally straightforward; parsing the response is a bit trickier. However, with the help of this great StackOverflow answer and Scala's for-comprehensions, it is relatively easy to navigate the JSON structure and generate a table-like structure that will feed our RDD later on.

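As a sketch, assuming the hypothetical "cities" shape shown earlier and the JSON.parseFull parser from the Scala standard library (sc and sqlContext are the ones Zeppelin provides), the whole round trip could look like this:

import scalaj.http._
import scala.util.parsing.json.JSON

// Hypothetical call: the service echoes back whatever we put in the "json" parameter
val response = Http("https://resttesttest.example/echo")
  .param("json", """{"cities":[{"name":"Madrid","year":2016,"population":3165000}]}""")
  .asString

// JSON.parseFull returns Option[Any]; the for-comprehension narrows it step by step
val rows = for {
  root   <- JSON.parseFull(response.body).toList
  cities <- root.asInstanceOf[Map[String, Any]].get("cities").toList
  city   <- cities.asInstanceOf[List[Map[String, Any]]]
} yield (
  city("name").toString,
  city("year").asInstanceOf[Double].toInt,        // parseFull reads every number as a Double
  city("population").asInstanceOf[Double].toLong
)

// Table-like structure -> DataFrame, registered so the %sql interpreter can see it
val df = sqlContext.createDataFrame(sc.parallelize(rows))
  .toDF("name", "year", "population")
df.registerTempTable("cities")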

As Zeppelin shares the sqlContext across all paragraphs, it is very easy to reuse it from the Spark SQL interpreter. First, we perform the query, labeling the DataFrame columns with some human-friendly aliases:

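A hypothetical %sql paragraph over the temp table registered above could look like this:

%sql
SELECT name       AS City,
       year       AS Year,
       population AS Population
FROM   cities
ORDER  BY Year, City
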
Raw view of the query results. Only the first records are displayed

Then, we switch to the graphical mode, selecting the X and Y axes along with the grouping values (the series):

Graphical view of the same dataset, grouped by series

And we are done! I personally find this very useful for combining data coming from different sources (for instance a REST service, a Hive database, an HDFS file, a regular PostgreSQL database, a local text file, and so on). Once you have parsed and loaded them into RDDs or DataFrames, you can join them and extract the information you are looking for.

