Apache Spark Improves the Economics of Video Distribution at NBC Universal

Published: September 24, 2014

This is a guest blog post from our friends at NBC Universal outlining their Apache Spark use case.

Business Challenge

NBC Universal is one of the world’s largest media and entertainment companies, with revenues of US$26 billion. It operates television networks, cable channels, and motion picture and television production companies, as well as branded theme parks worldwide. Popular brands include NBC, Universal Pictures, Universal Parks & Resorts, Telemundo, E!, Bravo and MSNBC.

Digital video media clips for NBC Universal’s cable TV programs and commercials are produced and broadcast from its Los Angeles office to cable TV channels in Asia Pacific, Europe, Latin America and the United States. Moreover, viewers increasingly consume NBC Universal’s vast content library online and on-demand.

NBC Universal’s IT Infrastructure team therefore has to decide how best to serve that content, which involves a trade-off between storage and bandwidth cost on one side and consumer convenience on the other. NBC Universal could keep all content available online and cached at the edge of the network to minimize latency. That way, all of the content could be delivered instantly to consumers in every country where NBC Universal has a presence. But this would also be the most costly option.

The business challenge, then, is to determine the optimal mix among three options: storing the most popular content locally, close to its viewers; serving less popular content only on demand, which incurs higher network costs; and taking content offline altogether.

Solution

NBC Universal turned to Apache Spark to analyze the content metadata for its international content distribution. Metadata associated with the media clips is stored in an Oracle database and in broadcast automation playlists. Spark queries the Oracle database and distributes the metadata from the broadcast automation playlists into multiple large in-memory resilient distributed datasets (RDDs). One RDD stores Scala objects containing media IDs, time codes, schedule dates and times, airing channels and similar fields. The job then creates multiple RDDs containing broadcast frequency counts by week, month, and year, using Spark’s map and reduceByKey operations to generate the counts. The resulting data is bulk loaded into HBase, where a Java/Spring web application queries it. The application converts the query results into graphs illustrating media broadcast frequency counts by week, month, and year on an aggregate and a per-channel basis.
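The counting step can be illustrated with a minimal sketch. It is written in plain Python rather than Scala on Spark, so it only mirrors the map/reduceByKey pattern and is not the production code. The `Airing` record, its field names, the helper functions and the sample data are all assumptions made for illustration.

```python
from datetime import date
from typing import NamedTuple


class Airing(NamedTuple):
    """Hypothetical playlist record; the field names are assumptions."""
    media_id: str
    channel: str
    aired_on: date


def reduce_by_key(pairs, combine):
    """Merge the values of all pairs sharing a key (mirrors reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def frequency_counts(airings, period_of):
    """Count airings per (media_id, period)."""
    # "map" step: emit ((media_id, period), 1) for every airing
    pairs = (((a.media_id, period_of(a.aired_on)), 1) for a in airings)
    # "reduceByKey" step: sum the ones that share a key
    return reduce_by_key(pairs, lambda x, y: x + y)


def by_week(d):
    iso = d.isocalendar()
    return (iso[0], iso[1])  # (ISO year, ISO week number)


def by_month(d):
    return (d.year, d.month)


def by_year(d):
    return d.year


# Made-up sample data, for illustration only
airings = [
    Airing("clip-1", "E!", date(2014, 1, 6)),
    Airing("clip-1", "Bravo", date(2014, 1, 8)),
    Airing("clip-1", "E!", date(2014, 2, 3)),
    Airing("clip-2", "E!", date(2014, 1, 7)),
]

weekly = frequency_counts(airings, by_week)
monthly = frequency_counts(airings, by_month)
yearly = frequency_counts(airings, by_year)
```

Adding the channel to the key, for example `(a.media_id, a.channel, period)`, would give the per-channel variant of the same counts.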

A secondary procedure then queries file path information from Oracle and builds another RDD that contains that information along with the date each file was written. It then joins that RDD with the frequency count RDD described above to produce a new RDD containing usage frequency by file age, sorted in ascending order. The resulting data is used to generate histograms that help determine whether the offlining mechanism is working optimally.
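The join-and-histogram step might look roughly like the following. This is again a plain-Python sketch of the pattern, not the Spark job itself. It assumes the two datasets are keyed by media ID, measures file age in days against a chosen reference date, and uses 30-day buckets; the post specifies none of these, and the function names and sample values are hypothetical.

```python
from collections import Counter
from datetime import date


def join_by_key(left, right):
    """Inner join of two dicts on their keys (mirrors a pair-RDD join)."""
    return {k: (left[k], right[k]) for k in left.keys() & right.keys()}


def usage_by_file_age(written_on, broadcast_counts, today):
    """Return (age_in_days, broadcast_count) rows, sorted ascending by age."""
    joined = join_by_key(written_on, broadcast_counts)
    rows = [((today - written).days, count) for written, count in joined.values()]
    return sorted(rows)


def age_histogram(rows, bucket_days=30):
    """Sum broadcast counts per age bucket; the bucket width is an assumption."""
    hist = Counter()
    for age_days, count in rows:
        bucket_start = age_days // bucket_days * bucket_days
        hist[bucket_start] += count
    return dict(sorted(hist.items()))


# Made-up sample data, for illustration only
written_on = {
    "clip-1": date(2014, 1, 1),
    "clip-2": date(2014, 3, 1),
    "clip-3": date(2013, 12, 1),  # no broadcast counts, dropped by the join
}
broadcast_counts = {"clip-1": 5, "clip-2": 12, "clip-4": 7}

rows = usage_by_file_age(written_on, broadcast_counts, today=date(2014, 3, 31))
histogram = age_histogram(rows)
```

In a histogram like this, old files that still account for many broadcasts, or young files that account for almost none, would be the cases worth checking against the offlining policy.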

NBC Universal runs Apache Spark in production in conjunction with Mesos, HBase and HDFS and uses Scala as the programming language. The rollout in production happened in Q1 2014 and was smooth.

Spark provides both extremely fast processing, thanks to its distributed in-memory approach, and much better IT staff productivity. In my Spark Summit 2014 talk, I highlighted two aspects of our Spark deployment:

Developer productivity: The combination of Spark and Scala provides an “ideal programming environment.”
Operational stability: Mesos for cluster management.

Alternative approaches not based on Spark would have required a much more complicated data processing pipeline and workflow. For example, because the data set needs more main memory than a single server provides, a non-Spark approach would have had to process the data in chunks, write intermediate results to HDFS, and then load those chunks back into main memory for further processing, much as traditional Hadoop MapReduce jobs do. Examples on the Apache Spark website show that approach to be much slower than Spark.

Value Realized

Spark provides a fast and easy way to assemble a data pipeline and conduct analyses that drive decisions on which content to keep online and which to take offline. Moreover, infrastructure administrators gain valuable insights into network utilization. They can detect patterns that help them understand where bandwidth is wasted in the multi-system operator (MSO) network. The initial results are promising and have prompted NBC Universal to expand its use of Spark to machine learning.