Elliott Cordo is the Chief Architect at Caserta Concepts. He is a well-known big data, data warehouse, and Spark advocate with a passion for transforming data into powerful information. He has more than a decade of experience overseeing large-scale technology projects, including those involving business intelligence and data analytics. Elliott is recognized for his innovative approaches to machine learning and, his personal favorite, recommendation engines. His passion is helping people understand the potential in their data, working closely with clients and partners to develop cutting-edge platforms that truly enable their organizations.
Having a complete and decisive view of the customer is the number one operational and analytic goal of nearly every organization today. To enable this "customer 360" perspective, some level of Customer Data Integration (CDI) must be implemented to cleanse and match customer identities within and across various sources of data. CDI has been a long-standing data engineering challenge, not just one of logic and complexity but also of performance and scalability. In the age of the Big Data, Social, and Internet of Things (IoT) revolutions, mastering customer data has become even more difficult: data volumes have increased, and the new, sparse data points being collected must be integrated into the overall customer story. Thanks to Apache Spark, we now have a flexible, high-performance platform well suited to building modern CDI applications in this new era.

In this talk, Elliott presents a real-world CDI solution built on Apache Spark. He will cover the following topics:

· Building an end-to-end CDI pipeline in Apache Spark
· Customer Data Integration - a discussion of traditional techniques, tools, and challenges for mastering your customer data, including what works, what doesn't, and how the practice is evolving
· New customer data - how new sources such as Social and IoT require innovation, including methods for matching customers on statistical patterns, geolocation, and behavior
· PySpark - a look at leveraging Python's rich module ecosystem for data cleansing, standardization, and matching
· GraphX - a peek at hacking Spark's GraphX library for matching and scalable clustering

Elliott concludes with a discussion of data governance and rapidly onboarding new data: how to balance agility and time to market with critical decision support and customer interaction.
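To give a flavor of the matching-and-clustering idea at the heart of CDI, here is a minimal sketch in plain Python. It is illustrative only, not the speaker's implementation: the exact-match key is a stand-in for real fuzzy or statistical matching, and in a Spark pipeline the standardization step would typically run as PySpark functions while the clustering step maps onto GraphX's connected components.

```python
# Hedged sketch of a CDI matching pipeline: standardize names, pair up
# likely duplicates, then cluster the pairs into "golden" customer groups.
# All record data below is made up for illustration.
import re
from collections import defaultdict

def standardize(name: str) -> str:
    """Normalize a customer name: lowercase, strip punctuation, collapse spaces."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", name).strip()

def match_pairs(records):
    """Link records whose standardized names collide.
    A real system would use fuzzy, statistical, or behavioral matching here."""
    by_key = defaultdict(list)
    for rec_id, name in records:
        by_key[standardize(name)].append(rec_id)
    pairs = []
    for ids in by_key.values():
        pairs += [(ids[i], ids[i + 1]) for i in range(len(ids) - 1)]
    return pairs

def cluster(ids, pairs):
    """Union-find connected components -- the role GraphX's
    connectedComponents plays at scale."""
    parent = {i: i for i in ids}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for i in ids:
        groups[find(i)].add(i)
    return list(groups.values())

records = [(1, "John Q. Smith"), (2, "john q smith"), (3, "Jane Doe")]
pairs = match_pairs(records)
groups = cluster([r[0] for r in records], pairs)
# records 1 and 2 collapse into one customer cluster; record 3 stands alone
```

The key design point the sketch mirrors is the separation of concerns: cheap per-record standardization parallelizes trivially, pairwise matching produces graph edges, and clustering over those edges resolves transitive matches (A matches B, B matches C) into a single customer identity.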