NTT DATA: Operating Apache Spark clusters at thousands-core scale and use cases for Telco and IoT
This is a guest blog from one of our partners: NTT DATA Corporation
About NTT DATA Corporation
NTT DATA Corporation is a Japanese IT solution provider and the global IT services arm of NTT (Nippon Telegraph and Telephone Corporation), which ranks among the top 10 telecommunication companies in the world by revenue.
At NTT DATA, the OSS (Open Source Software) Professional Services Team is responsible for providing our customers with consulting, design, integration, and support services for various OSS products, including Apache Hadoop. Over the past 7+ years we have integrated dozens of Hadoop systems, including a production cluster of 1000+ nodes for one of our customers.
Recently, Apache Spark has become a core component of our development platform and has been included in the support services we provide.
Why Spark
There are several reasons we use Spark. The first is that Spark integrates effectively with our existing Hadoop ecosystem, such as HDFS and YARN. Companies that are already experienced with Hadoop can run Spark on the same Hadoop cluster. The second is that Spark has features useful for data analysts that traditional Hadoop lacks, such as an interactive shell for on-demand analysis and high-level APIs for complex data analysis.
To validate that Spark offers a good balance of high throughput and low latency, we ran tests on a large cluster and found that Spark could scale to processing tens of TBs of data without unpredictable performance degradation or stoppages. The detailed results of this validation are available in the presentation listed under Additional Resources.
Our Spark use cases
The following are some examples of our Spark use cases in production:
Use case 1: The first case is the analysis of system infrastructure at a telecom company.
Five years ago, we implemented an on-premises Hadoop cluster consisting of 1000 nodes with NTT DOCOMO, the leading mobile carrier in Japan. Our emphasis was on achieving high fault tolerance and scalability while processing a vast amount of the mobile carrier's system operation data. We achieved these goals without any data loss for 4+ years, even though operating Hadoop at this scale was difficult at that time.
However, we still needed more speed and flexibility to meet present-day requirements. In other words, demand was steadily increasing for a newer parallel distributed processing framework based on a computational model other than MapReduce. To meet this demand, we launched a Spark on YARN cluster with 4000+ cores, evaluated Spark's features, and successfully migrated some data processing models to the Spark environment.
Use case 2: The second case is the numerical analysis of massive IoT data gathered from machinery and public infrastructure.
We are helping this customer establish a platform for data analysis using a statistical approach. One important requirement of this project is to iterate rapidly through trial-and-error workflows in order to find algorithms optimized for dynamically changing situations. In general, Spark excels in this type of use case.
In this project, Spark applications are combined with HDFS, which provides stable data storage, and YARN, which provides resource management for distributed computing. One important advantage of using YARN is that we can run multiple versions of Spark on the same cluster at the same time. For example, we can try an application built for the latest version of Spark, which has many useful new features, while another application built for an older version of Spark is already running on the cluster.
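As a rough illustration of running two Spark versions side by side on YARN, the sketch below builds one spark-submit command line per Spark installation. The install directories, jar names, and class names are hypothetical placeholders, not details of the project described above; the only assumption about Spark itself is that each distribution ships its own bin/spark-submit and accepts the standard --master, --deploy-mode, and --class options.

```python
import posixpath
import shlex


def spark_submit_command(spark_home, app_jar, main_class):
    """Build a spark-submit argv that uses the Spark distribution under spark_home."""
    return [
        posixpath.join(spark_home, "bin", "spark-submit"),
        "--master", "yarn",           # let YARN schedule the application
        "--deploy-mode", "cluster",   # run the driver inside the cluster
        "--class", main_class,
        app_jar,
    ]


# Hypothetical install locations and applications (assumptions for illustration).
stable = spark_submit_command("/opt/spark-old", "batch-report.jar", "example.BatchReport")
trial = spark_submit_command("/opt/spark-new", "new-model.jar", "example.NewModel")

# Print the two command lines; each one picks up a different Spark version.
for argv in (stable, trial):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Because each command points at a different Spark distribution, the two applications can be submitted to the same YARN cluster independently of each other.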
Our future with Spark
NTT DATA has become the No. 1 Spark contributing company in Japan. Based on the production cases described above and our long experience as a Hadoop integrator, we regularly provide input and feedback to the Spark development community. The main focus of our contributions is improving usability. For example, we are developing a web-based debugging tool, "Timeline Viewer", which we have contributed to the community. With Timeline Viewer, we can easily see when and where tasks were assigned, in chronological order, and how long each one took.
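To make the idea of a task timeline concrete, here is a minimal sketch of the underlying data arrangement: tasks are grouped by the executor they ran on and ordered by start time. This is not the Timeline Viewer implementation; the Task record, its field names, and the sample values are all assumptions made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Task(NamedTuple):
    # Hypothetical task record; field names are assumptions.
    task_id: int
    executor: str
    start_ms: int
    duration_ms: int


def build_timeline(tasks):
    """Group tasks by executor, each group ordered by start time."""
    lanes = defaultdict(list)
    for task in sorted(tasks, key=lambda t: t.start_ms):
        lanes[task.executor].append(task)
    return dict(lanes)


def render(lanes):
    """Return one text line per task: where it ran, when, and for how long."""
    lines = []
    for executor in sorted(lanes):
        for t in lanes[executor]:
            end_ms = t.start_ms + t.duration_ms
            lines.append(
                f"{executor}  task {t.task_id}: {t.start_ms}-{end_ms} ms ({t.duration_ms} ms)"
            )
    return lines


# Made-up sample data, deliberately out of order.
tasks = [
    Task(2, "executor-1", 120, 300),
    Task(0, "executor-1", 0, 100),
    Task(1, "executor-2", 5, 400),
]
lanes = build_timeline(tasks)
for line in render(lanes):
    print(line)
```

Laying tasks out per executor in this way makes gaps and long-running tasks easy to spot, which is the kind of question a timeline view is meant to answer.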
Conclusion
From our deployments, we found that Spark has flexible features and good performance for production use. When technical issues arise, NTT DATA has been able to resolve them quickly and contribute the solutions back to the Spark community. Based on our experience in distributed computing and open source software, NTT DATA has begun to actively deploy Spark for our customers.
By taking advantage of open source software, we provide the capability to build a variety of data processing systems, covering strict data management, batch processing, data analysis, stream processing, and visualization. We believe that actively using open source software triggers “open innovation” and that Spark is a core component of this concept.
Additional Resources
For more information on our validation of Spark, please refer to Masaru Dobashi's recent presentation video and slides from Spark Summit 2014.
Please also refer to the slides (in Japanese) by Satoshi Tanaka of NTT DOCOMO from Hadoop Conference Japan 2014.
Timeline Viewer is proposed in this ticket.