Wayne Jones is a senior Data Scientist in the Shell Advanced Analytics Centre of Excellence. He joined Shell in 2007 and, in his ten years there, has worked on a wide variety of data science and statistical projects across many areas of the business, e.g. Upstream Materials Management, Treasury Cash Forecasting, Downstream Aviation, Gas and Power Trading. Wayne is a chartered statistician and holds a BSc honours degree in mathematics from Bangor University of Wales, an MSc in ‘Mathematical Modelling for Industry’ from the University of Loughborough and a PhD in Ecological Modelling from the University of Strathclyde.
Across all of its assets globally, Shell carries a huge stock of spare-part inventory, which ties up large quantities of working capital. Over the past two years an interdisciplinary project team has produced a tool, the Inventory Optimization Analytics solution (IOTA), which uses advanced analytical methods to help assets optimise stock levels and purchase strategies. To calculate the recommended stocking level for a material, the Data Science team has written a Markov Chain Monte Carlo (MCMC) bootstrapping statistical model in R. Cumulatively the computational task is large but, fortunately, it is embarrassingly parallel, because the model can be applied independently to each material. The original solution, which used the R "parallel" package, was deployed on a single 48-core PC and took 48 hours to run. In this presentation, we describe how we moved the original solution to a distributed, cloud-based Apache Spark framework. Using the new R User Defined Functions API in Apache Spark, and with only minimal code changes, the computational run time was reduced to 4 hours. Restructuring the architecture to "pipeline" the problem brought the run time to less than 1 hour. This use case is important because it verifies the scalability and performance of SparkR. Session hashtag: #EUds8
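The "embarrassingly parallel" structure described in the abstract can be illustrated with a minimal sketch. It is written in Python rather than the R and SparkR used in the talk, and it swaps in a plain bootstrap of lead-time demand for the team's MCMC bootstrapping model, which the abstract does not detail. The function names, material IDs, demand histories, lead time, and service level are all hypothetical.

```python
import random


def recommend_stock_level(task):
    """Bootstrap a stock level for ONE material (illustrative stand-in model)."""
    material_id, demand_history, lead_time, service_level, seed = task
    rng = random.Random(seed)  # per-material seed keeps each result reproducible
    n_boot = 2000
    # Resample historical per-period demand to simulate total demand over the lead time.
    totals = sorted(
        sum(rng.choice(demand_history) for _ in range(lead_time))
        for _ in range(n_boot)
    )
    # Recommended level: empirical quantile of the simulated lead-time demand.
    index = min(n_boot - 1, int(service_level * n_boot))
    return material_id, totals[index]


def run_all(tasks, map_fn=map):
    """Apply the model to every material; map_fn may be a serial or parallel map."""
    return dict(map_fn(recommend_stock_level, tasks))


# Hypothetical tasks: (material id, units issued per period, lead time, service level, seed).
TASKS = [
    ("valve-001", [0, 0, 1, 0, 2, 0, 0, 1], 3, 0.95, 1),
    ("seal-002", [5, 3, 4, 6, 2, 5, 4, 3], 3, 0.95, 2),
    ("pump-003", [0, 0, 0, 0, 1, 0, 0, 0], 3, 0.95, 3),
]

results = run_all(TASKS)  # serial run; no material depends on any other
print(results)
```

Because each call touches only its own material's data and seed, `map_fn` can be replaced by a parallel map (for example the `map` method of a `concurrent.futures.ProcessPoolExecutor`) without changing the results. The same property is what allows a per-material function to be handed to a cluster framework, as the talk describes doing with SparkR.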