Sandy May

Co-organiser of Data Science London and Lead Data Engineer, Elastacloud

Sandy is Lead Data Engineer and CTO at Elastacloud, where he has worked for 4 years on myriad projects ranging from SME to FTSE 100 customers. He is a strong advocate of Databricks on Azure and of using Spark to solve Big Data problems, and he has recently become a Databricks Champion. Having worked on one of the original Databricks on Azure projects, he continues to expand his Big Data knowledge using new open-source technologies such as Delta Lake and MLflow.

Sandy co-organises the Data Science London meet-up and continues to push what he picks up back to the community so others can learn from his mistakes. He has spoken at Spark Summit, Future Decoded, Red Shirt tours and more; his knowledge covers most of the Azure Data Stack, with a keen interest in Big Data, Machine Learning and Data Visualisation.

Past sessions

Technical Leads and Databricks Champions Darren Fuller and Sandy May will give a fast-paced view of how they have productionised Data Quality pipelines across multiple enterprise customers. Their vision of empowering business decisions on data remediation actions and self-healing of data pipelines led them to build a library of Data Quality rule templates, together with an accompanying reporting data model and Power BI reports.

With the drive for more and more intelligence to come from the Lake and less from the Warehouse (also known as the Lakehouse pattern), Data Quality at the Lake layer becomes pivotal. Tools like Delta Lake provide building blocks for Data Quality, such as schema protection and simple column checking; for larger customers, however, they often do not go far enough. Quick-fire notebook demos will show how Spark can be leveraged at the point of Staging or Curation to apply rules over data.

Expect to see simple rules, such as Net sales = Gross sales + Tax or values existing within a list, as well as complex rules, such as validation of statistical distributions and complex pattern matching. The session ends with a quick view of future work in the realm of Data Compliance for PII data, with generation of rules using regex patterns and Machine Learning rules based on transfer learning.
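The simple rules above can be sketched in plain Python; a production version would apply the same logic over Spark DataFrames, and all names here (check_net_sales, ALLOWED_REGIONS, the tolerance value) are illustrative assumptions rather than the speakers' actual library.

```python
# Sketch of two simple Data Quality rules: an arithmetic check and a
# membership check. Names and thresholds are illustrative assumptions.

ALLOWED_REGIONS = {"UK", "DE", "FR"}  # hypothetical reference list

def check_net_sales(net, gross, tax, tolerance=0.01):
    """Arithmetic rule: Net sales = Gross sales + Tax (within tolerance)."""
    return abs(net - (gross + tax)) <= tolerance

def check_in_list(value, allowed=ALLOWED_REGIONS):
    """Membership rule: the value must exist within an allowed list."""
    return value in allowed

def apply_rules(row):
    """Run every rule over one record and collect failures for reporting."""
    failures = []
    if not check_net_sales(row["net"], row["gross"], row["tax"]):
        failures.append("net_sales_mismatch")
    if not check_in_list(row["region"]):
        failures.append("region_not_in_list")
    return failures

good = {"net": 120.0, "gross": 100.0, "tax": 20.0, "region": "UK"}
print(apply_rules(good))  # → []
```

Collecting rule failures per record, rather than failing fast, is what lets a reporting data model and dashboards summarise data quality across a whole batch.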

In this session watch:
Darren Fuller, Developer, Elastacloud Ltd
Sandy May, Co-organiser of Data Science London and Lead Data Engineer, Elastacloud


Summit Europe 2020: Building a Cross Cloud Data Protection Engine

November 18, 2020 04:00 PM PT

Data Protection is still at the forefront of many companies' minds, with potential GDPR fines of up to 4% of global annual turnover (a current theoretical maximum fine of $20bn). GDPR affects countries across the world, not just those in Europe, leaving many companies still playing catch-up. Additional acts and legislation, such as the CCPA, are coming into force, meaning Data Protection is a constantly evolving landscape, with fines that can cripple some businesses. In this session we will go through how we have worked with our customers to create an Azure and AWS implementation of a Data Protection Engine covering Protection, Detection, Re-Identification and Erasure of PII data.

The solution is built with security and auditability at the centre of the architecture, with special consideration for managing a single application across two public clouds; this led us to Databricks, Delta Lake, Kubernetes and Power BI. We will deep dive into using Spark to implement multiple Data Protection techniques and show how AI can become a game changer in detecting PII that has been missed in data. We will also explore how Delta Lake empowers us to share PII tokens between cloud providers with ACID transactions, auditing and versioning of data.
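The Protection, Re-Identification and Erasure steps can be illustrated with a minimal plain-Python sketch of keyed-hash tokenisation; the engine in the talk runs this kind of logic on Spark with tokens shared via Delta Lake, and every name below (tokenise, vault, SECRET_KEY) is an assumption for illustration.

```python
# Hedged sketch of hash-based PII tokenisation with a re-identification map.
# All names are illustrative; a real system would keep the key and the vault
# in a managed secret store and an audited, access-controlled table.
import hashlib
import hmac

SECRET_KEY = b"example-secret"  # illustrative only, never hard-code a key

def tokenise(pii_value: str) -> str:
    """Protect: replace a PII value with a deterministic keyed-hash token."""
    return hmac.new(SECRET_KEY, pii_value.encode(), hashlib.sha256).hexdigest()

vault = {}  # token -> original value, the re-identification mapping

def protect(record, field):
    """Swap a PII field for its token; the record is then safe to share."""
    token = tokenise(record[field])
    vault[token] = record[field]
    return {**record, field: token}

def reidentify(record, field):
    """Reverse the protection for authorised consumers."""
    return {**record, field: vault[record[field]]}

def erase(token):
    """Erasure: dropping the mapping makes the token irreversible."""
    vault.pop(token, None)
```

Because the token is deterministic for a given key, the same person yields the same token on both clouds, which is what makes sharing tokens (rather than raw PII) between providers workable.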

With a final look at how Deep Neural Networks can be used to detect PII within data, this will be a demo-packed session. We hope this session shows you that Data Protection doesn't have to be an off-the-shelf black box: you can own the risk and the solution within your own platform, whilst remaining secure and compliant.
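As a point of contrast with the Deep Learning detection mentioned above, a regex-based detector is the simpler baseline; the patterns and function names below are illustrative assumptions, and a real pipeline would run such rules over Spark columns alongside the ML-based detection.

```python
# Baseline regex-based PII detector. Patterns here are deliberately simple
# illustrations, not production-grade PII rules.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "uk_phone": re.compile(r"\b(?:\+44|0)\d{10}\b"),
}

def detect_pii(text: str) -> dict:
    """Return each PII category found in free text, with its matches."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits
```

Regexes catch well-structured identifiers cheaply but miss names, addresses and misspelled values, which is exactly the gap the talk proposes filling with learned models.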

Speakers: Sandy May and Richard Conway

Renewables AI is at the forefront of innovation in the solar energy market. As the name suggests, we use AI to make predictions on energy output from large portfolios of solar farms. This talk lays out the fundamental architecture, technology and approaches that make the platform work, beginning with key features of the Azure Databricks platform and how it works seamlessly with Azure Data Lake and Azure Event Hubs. There will be good coverage of ML and DL pipelines and how they are used with image recognition and machine learning through Structured Streaming to make real-time decisions.

Key takeaways:
- Prediction of next-day irradiance and power ratios with real-time accuracies of 95%
- Structured streaming of IoT data from hundreds of thousands of inverters at 5-minute intervals
- Real-time joining of weather data and several other external datasets
- Use of Deep Learning Pipelines and advanced time series methods to predict 48 hours of future energy production
- Near-real-time processing of image data at frequent intervals to predict cloud cover from onsite cameras and drones
- Analysis of data and preventative maintenance of fan failures in solar inverters
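The real-time joining of telemetry with weather data can be sketched as an as-of join: each 5-minute inverter reading picks up the latest weather record at or before its timestamp. In production this is a Spark Structured Streaming join; the plain-Python version below is a sketch under that assumption, and all names in it are illustrative.

```python
# Illustrative as-of join: attach the most recent weather reading to each
# inverter telemetry record. Names and data shapes are assumptions.
import bisect

def join_latest_weather(readings, weather):
    """For each (ts, power) reading, attach the latest (ts, irradiance)
    weather row with weather ts <= reading ts."""
    weather = sorted(weather)
    ts_index = [ts for ts, _ in weather]
    joined = []
    for ts, power in readings:
        i = bisect.bisect_right(ts_index, ts) - 1  # latest weather <= ts
        irradiance = weather[i][1] if i >= 0 else None
        joined.append((ts, power, irradiance))
    return joined
```

Keeping the join keyed on event time rather than arrival time is what makes the enrichment robust when telemetry from hundreds of thousands of inverters arrives late or out of order.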

Session hashtag: #SAISDD11