Elsevier has long been a data-centric company, and leveraging the corpus of data comprising 30% of the world’s scientific research is essential to its continuing success. Disruptive technologies such as machine learning and natural language processing promise to have a radical impact on science, informing research and driving decisions in fields as disparate as medicine and engineering.
Big data processing tools have created opportunities to unlock much of the data organisations have, making it accessible and useful. This process has become easier in recent years, and platforms like Databricks further lower the barrier to entry. Exploratory work on vast data sets is now approachable, and there is a clear route through to full production status. Emlyn will explain how Elsevier’s big data team is providing the tools to facilitate this workflow, and where Spark and Databricks fit into the picture.
Emlyn is a principal developer for Elsevier, the world's biggest scientific, technical and medical publisher. He is the technical lead for Elsevier's big data platform project, which is a critical component of the business's longer-term strategy. This platform democratizes tools for building new products using synthesized data, and allows the application of advanced analytics and machine learning techniques. Significant data sources include scientific papers, the relationships between them, and event-based metrics such as readership information. The core technology in use is Scala on Spark, and the platform is hosted entirely on AWS. Prior to working for Elsevier, Emlyn held senior technical positions with Bank of America and IBM.