Yana Ponomarova

Data Scientist, Capgemini

Yana Ponomarova is a Data Scientist at Capgemini. She’s passionate about data products and data-driven business models. She has a background in business research, where she developed event models in finance and behavioral choice models in marketing. Over the last two years, she has worked on Natural Language Processing for relationship extraction and on various Machine Learning applications. She has also been working with Big Data technologies (Hadoop, Spark) for batch and real-time processing to develop scalable analytical pipelines. Such scalable implementations are at the core of the new product development and business transformation with predictive analytics that make up Yana’s everyday work.


Relationship Extraction from Unstructured Text Based on Stanford NLP with Spark

About 80% of the information created and used by an enterprise is unstructured data held in documents and other content, and its volume is growing at twice the rate of structured data. Mastering the knowledge scattered across an organization's unstructured documents can therefore bring a lot of value. For our client, a global Oil & Gas company, the valuable information was spread across large volumes of engineering reports. Those reports were written by engineers in a free, unconstrained format, often by non-native English speakers, and focus on the technical characteristics of Oil & Gas operations. The client's primary challenge was to extract the supply chain relationships (supplier, receiver, object of delivery and transport) from those reports in order to evaluate the interdependency between its sites around the globe and better manage operational risks. Given the sheer volume and complexity of these documents, the company's analysts could not have tackled the problem manually. Hence, we developed an automated solution based on a Spark integration of Stanford NLP. It processes the semantic structure of the sentences, retrieves pieces of supply chain information, matches them with pieces of the supply chain coming from other sentences in other reports and, finally, presents the result to the end user in the form of a graph. Implementing the solution on Spark allowed us to process the entire collection of reports in memory and to integrate the external Stanford NLP libraries easily.
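
The matching and graph-building step described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the project: the fragment layout, the helper names `merge_fragments` and `build_graph`, and the matching rule (two fragments are merged when they agree on every field they both fill in) are all assumptions made for illustration, and the Stanford NLP parsing and Spark distribution steps are left out.

```python
from collections import defaultdict

# Hypothetical fragment layout: one dict per sentence, with any subset of
# these fields filled in by the upstream NLP step (not shown here).
FIELDS = ("supplier", "receiver", "obj", "transport")


def merge_fragments(fragments):
    """Merge partial fragments into fuller supply chain records.

    Assumed matching rule (illustrative only): a fragment joins the first
    existing record that shares at least one filled field with it and does
    not contradict it on any shared field.
    """
    merged = []
    for frag in fragments:
        for rec in merged:
            shared = [f for f in FIELDS if frag.get(f) and rec.get(f)]
            if shared and all(frag[f] == rec[f] for f in shared):
                # Fill the gaps in the record with values from the fragment.
                for f in FIELDS:
                    rec[f] = rec.get(f) or frag.get(f)
                break
        else:
            # No compatible record found: start a new one.
            merged.append({f: frag.get(f) for f in FIELDS})
    return merged


def build_graph(records):
    """Build an adjacency list: supplier -> [(receiver, obj, transport)]."""
    graph = defaultdict(list)
    for rec in records:
        # Only records with both endpoints can become an edge.
        if rec["supplier"] and rec["receiver"]:
            graph[rec["supplier"]].append(
                (rec["receiver"], rec["obj"], rec["transport"])
            )
    return dict(graph)


# Made-up fragments, as if taken from sentences in different reports.
fragments = [
    {"supplier": "Site A", "obj": "crude oil"},
    {"receiver": "Site B", "obj": "crude oil", "transport": "pipeline"},
    {"supplier": "Site B", "receiver": "Site C", "obj": "diesel"},
]

graph = build_graph(merge_fragments(fragments))
for supplier, edges in graph.items():
    print(supplier, "->", edges)
```

In this toy input, the first two fragments share the same object of delivery and are merged into a single Site A to Site B edge, while the third fragment already names both endpoints and forms its own edge. A real matching rule would need to be far more robust, for example by normalizing site names and resolving conflicting candidates.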