Lynx Analytics develops a large-scale graph analysis engine on top of Apache Spark. One of our recent developments is a recurrent neural network library that learns from the structure of a graph in order to predict missing features of its vertices.
A real-life use case is demographic estimation: predicting the ages of a telco's customers by exploring their connections to other people, the ages of those people, and classical features such as internet and phone usage patterns.
One of the main challenges we faced was designing a suitable training process. The usual way of training a supervised learning algorithm treats each vertex as an independent prediction problem, but because our algorithm uses the connections between vertices, we cannot treat them independently. On the other hand, if we consider the whole graph as a single problem, we have no separate training data at all. In this talk we will show some tricks we used to perform prediction and training on the same graph.
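One way this idea can be made concrete is a hidden-label split: hide the labels of a random subset of vertices, run the prediction over the full graph, and score the model only on the hidden vertices, so the same graph serves both prediction and evaluation. The sketch below is a minimal illustration in plain Python; the function names are hypothetical and this is not the actual Lynx implementation (which runs on Spark in Scala).

```python
import random

def hide_labels(labels, hidden_fraction, rng):
    """Randomly split vertex labels into (visible, hidden) dicts.

    Visible labels are fed to the model as inputs; hidden labels are
    withheld and used only to score the model's predictions.
    (Hypothetical helper, for illustration only.)
    """
    visible, hidden = {}, {}
    for vertex, label in labels.items():
        if rng.random() < hidden_fraction:
            hidden[vertex] = label
        else:
            visible[vertex] = label
    return visible, hidden

def mae_on_hidden(predictions, hidden):
    """Mean absolute error, computed on the hidden vertices only."""
    errors = [abs(predictions[v] - truth)
              for v, truth in hidden.items() if v in predictions]
    return sum(errors) / len(errors) if errors else 0.0
```

Because the hidden vertices remain part of the graph, their connections still contribute structure to the prediction; only their labels are withheld from the model.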
The other main challenge is handling graphs so big that they do not fit into the memory of a single machine, while performing very resource-intensive computations on them. This requires storing the graph and computing on it in a distributed fashion. The difficulty is that we cannot simply cut the graph into smaller independent pieces, since the training process needs to propagate data along the edges.
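To see why cutting the graph is not enough, consider a single propagation round in which each vertex aggregates its neighbours' values along the edges. The toy sketch below (plain Python, with neighbour averaging standing in for the actual recurrent update; not the Lynx implementation) shows the pattern: any edge whose endpoints end up on different machines forces a data exchange between them.

```python
def propagate_once(values, edges):
    """One propagation round over directed edges (src, dst): each vertex
    takes the mean of its in-neighbours' values; a vertex with no
    incoming edge keeps its old value.

    In a distributed setting, every edge whose endpoints live on
    different machines turns this local aggregation into a network
    data exchange. (Illustrative stand-in for the real update rule.)
    """
    incoming = {}
    for src, dst in edges:
        incoming.setdefault(dst, []).append(values[src])
    result = {}
    for vertex, old in values.items():
        neighbours = incoming.get(vertex)
        result[vertex] = sum(neighbours) / len(neighbours) if neighbours else old
    return result
```

Repeating such rounds lets information flow several hops across the graph, which is exactly the data movement that a naive partitioning would sever.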
In the talk we will present the core algorithmic ideas for tackling these problems, along with some experimental results.
Hanna Gábor is a Software Engineer at Lynx Analytics R&D. She is also pursuing a Master's degree in Mathematics at Eötvös Loránd University in Budapest, Hungary.