Akshay Rai is an engineer at LinkedIn working on the Grid team. He is the lead engineer for the popular Dr. Elephant project, open sourced by LinkedIn. He has been working on operational intelligence solutions for Hadoop and Spark, improving developer productivity by building systems that enable real-time monitoring, visualization, and debugging of Big Data applications, Hadoop clusters, and related tools.
At LinkedIn, we have thousands of Hadoop and Spark users, ranging from amateurs to experts, who run a variety of jobs on our huge 2000-plus node clusters. In just a few years, the number of Hadoop and Spark jobs has grown from hundreds to thousands. With this ever-increasing number of users and jobs, it is crucial to have an efficient way to answer frequently asked questions like: 1) Why is my job running slow? 2) Why did my job get killed? 3) Can you send me an alert when my job is about to fail or miss its SLA? 4) Do we have enough resources on the Hadoop cluster? Having this information available helps us debug faster, alert on anomalies, perform root cause analysis (RCA), identify workload patterns, and plan capacity. To address this problem, we at LinkedIn have built a Unified Grid Metrics Platform that captures and stores current and historical job metrics. From our experience debugging and tuning jobs and interacting with our users, we have learned many lessons and have been integrating those ideas and solutions into this system. For example, we have learned that capturing and storing the complete set of metrics and its history, though fascinating, is rarely useful, much like Spark's verbose logs. Instead, we have developed derived metrics and a curated list of metrics that we track very closely at LinkedIn. In this talk, we will discuss the architecture of this platform for both Hadoop and Spark, along with the significant challenges of collecting all the standard, derived, and custom user metrics in real time. We will see how this system allows users to build reporting dashboards, perform trend analysis and dimension analysis, and view correlated metrics together. Session hashtag: #Exp2SAIS
Is your job running slower than usual? Do you want to make sense of the thousands of Hadoop and Spark metrics? Do you want to monitor the performance of your flow, get alerts, and auto-tune it? These are common questions every Hadoop user asks, yet no single solution addresses them. We at LinkedIn faced many such issues and built a simple self-serve tool for Hadoop users called Dr. Elephant. Dr. Elephant, which is already open sourced, is a performance monitoring and tuning tool for Hadoop and Spark. It improves developer productivity and cluster efficiency by making it easier to tune jobs. Since being open sourced, it has been adopted by multiple organizations and has attracted a lot of interest in the Hadoop and Spark community. In this talk, we will discuss Dr. Elephant and outline our efforts to expand its scope into a comprehensive monitoring, debugging, and tuning tool for Hadoop and Spark applications. We will cover how Dr. Elephant performs exception analysis, gives clear and specific tuning suggestions, tracks metrics, and monitors their historical trends. Open Source: https://github.com/linkedin/dr-elephant Session hashtag: #EUdev9