Daniel Jeavons - Databricks

Daniel Jeavons

General Manager – Advanced Analytics CoE, Shell Research Ltd.

Dan is passionate about innovation from data and analytics, a recurring theme throughout his career, and also has extensive experience in business process design and improvement, business transformation and large system (SAP) implementation. He began his working life as an Accenture consultant in their Upstream practice before joining Shell in 2008, where he performed a variety of roles in SAP implementation programmes, the Group CIO office and architecture. He has led the Advanced Analytics CoE within TaCIT innovation since its formation in 2013, growing the team from nothing to around 80 people. The Advanced Analytics CoE now has active projects in most parts of the Shell group and has shown significant value from projects which are now publicly referenced, in particular spare part inventory optimization, carbon capture and storage (CCS) monitoring and subsurface analogue identification.



Parallelizing Large Simulations with Apache SparkR
Summit Europe 2017

Across all assets globally, Shell carries a huge stock of spare part inventory which ties up large quantities of working capital. Over the past two years an interdisciplinary project team has produced a tool, the Inventory Optimization Analytics solution (IOTA), based on advanced analytical methods, that helps assets optimise stock levels and purchase strategies. To calculate the recommended stock level for each material, the Data Science team has written a Markov Chain Monte Carlo (MCMC) bootstrapping statistical model in R. Cumulatively, the computational task is large but, fortunately, it is embarrassingly parallel, because the model can be applied independently to each material. The original solution, which used the R "parallel" package, was deployed on a single 48-core PC and took 48 hours to run. In this presentation, we describe how we moved the original solution to a distributed, cloud-based Apache Spark framework. Using the new R User Defined Functions API in Apache Spark, and with only minimal code changes, the run time was reduced to 4 hours. Restructuring the architecture to "pipeline" the problem brought the run time to less than 1 hour. This use case is important because it demonstrates the scalability and performance of SparkR.

Session hashtag: #EUds8
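As a rough illustration of why the task is embarrassingly parallel, here is a minimal sketch in Python rather than the R/SparkR used in the talk. It is not the team's MCMC model: it swaps in a plain bootstrap resampling of demand. The function names, parameters (`lead_time`, `n_sims`, `service_level`) and demand histories are all hypothetical. The point is only that one function runs per material with no shared state, so any mapper, serial or parallel, can apply it.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def recommend_stock_level(item, lead_time=3, n_sims=2000, service_level=0.95):
    """Bootstrap lead-time demand for ONE material; return a quantile of it.

    Assumption: a simple resampling stand-in, not the MCMC model from the talk.
    """
    material_id, demand_history = item
    rng = random.Random(material_id)  # per-material seed keeps runs reproducible
    # Each simulation draws `lead_time` periods of demand with replacement.
    totals = sorted(
        sum(rng.choices(demand_history, k=lead_time)) for _ in range(n_sims)
    )
    index = min(n_sims - 1, int(service_level * n_sims))
    return material_id, totals[index]


def recommend_all(demand_by_material, mapper=map):
    """Apply the per-material model independently to every material."""
    return dict(mapper(recommend_stock_level, demand_by_material.items()))


# Hypothetical per-period demand histories, invented for illustration only.
demand_by_material = {
    "valve-001": [0, 0, 1, 0, 2, 0, 0, 1],
    "gasket-017": [3, 5, 4, 6, 2, 4, 5, 3],
    "pump-204": [0, 0, 0, 1, 0, 0, 0, 0],
}

# Serial run, then the same work handed to a pool of workers.
serial = recommend_all(demand_by_material)
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = recommend_all(demand_by_material, mapper=pool.map)
```

Because each material carries its own seed and shares nothing with the others, the serial and pooled runs give the same recommendations. The thread pool here only shows the structure. On a cluster, the same per-material function would be handed to a distributed mapper instead.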