This is a guest blog from one of our partners: Lynx Analytics
About Lynx Analytics
Lynx Analytics is a data analytics consultancy with a focus on graph analytics and the development of proprietary big graph analytics software. We augment classical data mining methods with our expertise in graph analytics, and apply these methods to large datasets such as call data records, bank transactions, and cell tower usage.
Applying graph analytics can reveal unexpected patterns in human behavior, emergent properties of customer interactions, and untapped market opportunities, to name just a few examples. Graph analysis is also often the only way to establish relationships between diverse datasets, leading to complex insights that are unattainable when each dataset is analyzed on its own.
Our clients are large multinational telecommunications and financial corporations. Because of the enormous volume of their data, we needed a scalable solution for exploratory, interactive graph data analysis. The existing solutions did not meet the needs of our analysts and clients, so we leveraged the power of Apache Spark to develop the LynxKite graph analytics platform.
Why choose Spark?
The LynxKite graph analytics platform is a web application with a rich, clean UI for exploring and manipulating graphs. One critical requirement is that users can work with extremely large datasets interactively, in real time. After evaluating several distributed computation frameworks, we found that Apache Spark best fulfilled our needs for low latency, ease of use, and production readiness.
Leveraging the power of Spark, LynxKite takes just a few clicks to bucket the customers by age and gender, and visualize the number of calls within and between the buckets. Within a minute, the overlapping communities can be identified in the graph, and for each customer we can find the average age of the most homogeneous community they belong to.
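To make the bucketing step concrete, here is a minimal single-machine sketch in plain Python. It is not LynxKite code and does not use Spark. The customer table, the call list, and the function names `bucket_of` and `calls_between_buckets` are all hypothetical, chosen only to illustrate counting calls within and between age and gender buckets.

```python
from collections import Counter

# Hypothetical customer attributes: id -> (age, gender).
customers = {
    "a": (23, "F"),
    "b": (37, "M"),
    "c": (41, "F"),
    "d": (29, "M"),
    "e": (25, "F"),
}

# Hypothetical call records as (caller, callee) pairs.
calls = [("a", "b"), ("a", "d"), ("b", "c"), ("d", "a"), ("a", "d"), ("a", "e")]


def bucket_of(customer_id, customers, width=10):
    """Map a customer to an (age range, gender) bucket, e.g. ("20-29", "F")."""
    age, gender = customers[customer_id]
    low = (age // width) * width
    return (f"{low}-{low + width - 1}", gender)


def calls_between_buckets(calls, customers, width=10):
    """Count calls for every (caller bucket, callee bucket) pair."""
    counts = Counter()
    for caller, callee in calls:
        pair = (bucket_of(caller, customers, width),
                bucket_of(callee, customers, width))
        counts[pair] += 1
    return counts


bucket_counts = calls_between_buckets(calls, customers)
for (source, target), n in sorted(bucket_counts.items()):
    print(source, "->", target, ":", n)
```

A pair whose two buckets are equal counts calls within a bucket; all other pairs count calls between buckets. On a cluster, the same grouping would be expressed as a distributed aggregation rather than a local loop.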
The Architecture & Benefits to Users
The LynxKite frontend is an AngularJS web application running in the browser. It is served by a Play Framework web server, which also receives the AJAX requests from the frontend. The web server process is also the Apache Spark driver application and is connected to our Apache Spark cluster. When the frontend requests new data, such as an aggregate view of the graph, the computations run on the Apache Spark cluster.
The highlights of this technical solution are:
- Latency is low. Many computations complete in less than a second, which is a dream come true for our users. The more computationally intensive operations can be sped up by increasing the cluster size. When the cluster is hosted by a cloud provider, its size can be easily adjusted to fit the needs of the moment.
- Simple is easy. The clean and flexible Apache Spark Scala API lets us implement graph analytics methods in a simple and natural way. To quantify this, we compared our solution to open source implementations of common graph algorithms on other frameworks, and found a tenfold advantage for the Spark solution in code complexity.
- Complex is possible. The ease of developing with the Apache Spark Scala API has enabled us to create a great variety of more complex analytic methods. Viral modeling can estimate unobserved properties based on a small proportion of nodes with observed properties, using the link structure between them. Time and space can be visualized to explore the diffusion of product usage or understand geographic data on a map.
- Deployments are smooth. We are deploying LynxKite into the private Hadoop clusters of a number of clients. This is a surprisingly straightforward process thanks to the integration of Apache Spark into the Hadoop ecosystem.
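The idea behind viral modeling, estimating unobserved properties from a few observed nodes through the link structure, can be sketched with a simple neighbor-averaging scheme. This is a generic label propagation sketch in plain Python, not the algorithm LynxKite uses. The function name `propagate`, the iteration count, and the sample graph are assumptions made for illustration.

```python
def propagate(edges, observed, iterations=20):
    """Estimate a numeric property for unobserved nodes.

    Observed values stay fixed. In each round, every unobserved node takes
    the average of the current estimates of its neighbors.
    """
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    estimates = dict(observed)
    for _ in range(iterations):
        updated = dict(estimates)
        for node, adjacent in neighbors.items():
            if node in observed:
                continue  # never overwrite an observed value
            known = [estimates[n] for n in adjacent if n in estimates]
            if known:
                updated[node] = sum(known) / len(known)
        estimates = updated
    return estimates


# Hypothetical chain a - b - c - d where only the end nodes are observed.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
observed = {"a": 0.0, "d": 30.0}
estimates = propagate(edges, observed)
print(estimates)
```

In this chain the estimates for `b` and `c` approach 10 and 20, the values at which each unobserved node equals the average of its two neighbors. A production method would also need to handle categorical properties, convergence checks, and graphs too large for one machine.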
What’s Next
Having bet big on a pioneering technology such as Apache Spark, we relied heavily on its developer community. We found the Spark community to be extremely smart, professional, and responsive to our questions, tickets, and pull requests. We are very thankful and hope to contribute more.
We look forward to our presence at Spark Summit West 2015 (San Francisco, June 15–17) where we will talk about the technical challenges of running Apache Spark in an interactive setting.
Find out more at www.lynxanalytics.com