This is a guest blog from our friends at Metacog.
Luis Caro is the Lead Cloud and DevOps Architect at Metacog, where he is responsible for the security and scalability of the entire platform.
Doug Stein is the CTO of Metacog, where he is responsible for product strategy and development; he doubles as the product owner and voice of the market.
At Metacog, we have been using Databricks as our development and production environment for over one year. During this time we built a robust continuous integration (CI) system with Databricks, which allows us to release product improvements significantly faster. In this blog, we will describe how we’ve built the CI system with Databricks, GitHub, Jenkins, and AWS.
Metacog allows companies to replace simplistic assessment (e.g. multiple-choice) with authentic performance tasks (scored by machine learning algorithms informed by customer-supplied scoring rubrics and training sets). We do this by offering a learning analytics API-as-a-service, which allows us to translate a person’s interactions (i.e., real-time events) with an online manipulative into an accurate assessment of their understanding. The product gives our customers an API and JSON wire format for large-scale ingestion, analysis, and reporting on “Metacognitive activity streams” (how the learner tackles an open-ended performance task - not merely the final answer). The Metacog platform is applicable to learning (instruction, training, or assessment) in K-12, postsecondary, corporate, military, etc.
Metacog supports tens of millions of concurrent learners, each of whom may generate activity at rates from tens to a few hundred KB/sec. This is Big Data, and the platform needs to ingest the data without loss and apply various machine learning algorithms with optimal performance, reliability, and accuracy. To this end, Metacog adopted Apache Spark on Databricks as the primary compute environment for developing and running its analysis and scoring pipelines.
The Metacog development team consists of backend developers, DevOps engineers, and data scientists who constantly introduce improvements to the platform code, infrastructure, and machine learning functionality. To make this "research-to-development-to-production" pipeline truly streamlined and agile, Metacog deployed a continuous integration system that carries all Spark code through to production.
The Metacog development pipeline ensures that both hardcore developers and data scientists are able to:
The development pipeline is illustrated in the following figure:
The pipeline works as follows: a developer syncs their local development environment (which can be either a notebook or an IDE) with GitHub. Whenever a developer commits to a specific branch, the Metacog Jenkins server automatically tests the new code; if the tests pass, the code is built and deployed to the live testing infrastructure. This infrastructure is an exact replica of the production resources, permitting load testing and final checks before changes are deployed to production. For Spark code, live testing uses multiple Spark clusters created with the Databricks Jobs functionality.
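As an illustration only, the per-commit logic that Jenkins drives boils down to something like the following sketch (the "stage" branch name, the sbt build commands, and the deploy_to_live_testing.py script are hypothetical placeholders, not Metacog's actual configuration):

import subprocess

# Hypothetical per-commit driver: run the tests, build the JAR, and hand the
# artifact to the deployment scripts described in the following sections.
# The branch name, build tool, and script name are placeholders.
def on_commit(branch):
    if branch != 'stage':
        return                                          # only deploy from the watched branch
    subprocess.check_call(['sbt', 'test'])              # abort the build if any test fails
    subprocess.check_call(['sbt', 'assembly'])          # package the Spark code as a JAR
    subprocess.check_call(['python', 'deploy_to_live_testing.py'])  # push to live testing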
The main components in the pipeline are:
In the sections below, we describe in detail how we used the Databricks APIs to automate two key capabilities in the pipeline: deploying built JARs as libraries (components #1 and #3) and updating stage and production resources with the latest builds (component #5).
Both developers and data scientists need to be able to use any methods or classes from the production code inside their notebooks. Having these libraries available gives them access to production data and lets them test or evaluate the performance of different versions of the production code. Using the Library API, the Jenkins server can deploy built JAR files as libraries into a special folder in the Databricks workspace. When Databricks loads such libraries it recognizes versioning, so developers can use a compatible version of the library for the code they're developing. This lets developers control when they adopt new library versions and gives them a stable environment for benchmarking and regression testing. Every time someone commits code to the stage repo, Jenkins builds the new JAR and registers it as a Databricks library; the registration call looks like this:
import json
import requests

# DBAUTH holds the Databricks API token header value and DBBASEENDPOINT the
# base REST endpoint URL; both are defined elsewhere in the Jenkins job.

def createlibrary(path, jarurl):
    # Register the newly built JAR (hosted at jarurl) as a library at the
    # given workspace path.
    head = {'Authorization': DBAUTH}
    endpoint = DBBASEENDPOINT + 'libraries/create'
    newpath = "/" + path
    data = {"path": newpath, "jar_specification": {"uri": jarurl}}
    r = requests.Session().post(endpoint, headers=head, data=json.dumps(data))
    if r.status_code == 200:
        print "library uploaded for notebooks"
    else:
        print r.content
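For example, a build step might register the newly built JAR like this (the workspace path and S3 URL are illustrative placeholders, not Metacog's actual naming scheme):

# Hypothetical invocation; folder name, version suffix, and S3 URL are
# placeholders for illustration only.
createlibrary("metacog-libraries/metacog-core-V376.jar",
              "https://s3.amazonaws.com/example-bucket/builds/metacog-core-V376.jar")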
After the process is complete, developers can attach these libraries to any notebook, cluster, or job from the Databricks web console. Here is an example of the Jenkins build output:
Here is an example of the Databricks workspace after the job is updated (note the newly built V376 JAR at the end of the listing):
Metacog uses the Jobs API to deploy and manage the production and stage Spark clusters. In addition to the steps described above, whenever a new version of the library is built, Jenkins must update all jobs so that the clusters use the newly built JAR and the correct Spark version for that JAR. This is achieved with the build file stored on S3 together with a Python script that updates the jobs through the Databricks Jobs API.
This is done in three steps (illustrated in the figure above):
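As a rough sketch of what such an update script can look like (assuming the Jobs API's jobs/list and jobs/reset endpoints, that DBBASEENDPOINT from the script above also serves those endpoints, and that the jobs run on new clusters; this is illustrative, not Metacog's exact script):

import json
import requests

# Hypothetical sketch: point every job at the newly built JAR by rewriting
# its settings through the Databricks Jobs API (jobs/list + jobs/reset).
# jar_s3_url and spark_version would come from the build file stored on S3.
def update_jobs(jar_s3_url, spark_version):
    head = {'Authorization': DBAUTH}
    session = requests.Session()

    # 1. List all jobs defined in the workspace.
    jobs = session.get(DBBASEENDPOINT + 'jobs/list', headers=head).json().get('jobs', [])

    for job in jobs:
        settings = job['settings']
        # 2. Swap in the new JAR and the Spark version it was built against.
        settings['libraries'] = [{'jar': jar_s3_url}]
        if 'new_cluster' in settings:
            settings['new_cluster']['spark_version'] = spark_version
        # 3. Reset the job so its next run picks up the new settings.
        payload = {'job_id': job['job_id'], 'new_settings': settings}
        r = session.post(DBBASEENDPOINT + 'jobs/reset', headers=head, data=json.dumps(payload))
        if r.status_code != 200:
            print(r.content)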
Thanks to the Databricks environment and APIs, we succeeded in implementing a continuous delivery pipeline for all of our Spark code. Some of the main benefits that the Metacog team now enjoys are: