Rodrigo Ney is a Data Engineer Manager at Nubank, the biggest fintech outside Asia. His team is responsible for deploying and scaling out Apache Spark, aiming to democratize data usage inside the company. He is fascinated by the development of high-quality distributed systems, with a special passion for functional programming and machine learning. Over the last 8 years, Rodrigo has also worked with tools besides Apache Spark, including Apache Kafka, Apache Airflow, Presto, and the data warehouses of several cloud providers.
Nubank is the leading fintech in Latin America. Using bleeding-edge technology, design, and data, the company aims to fight complexity and empower people to take control of their finances. We are disrupting an outdated and bureaucratic system by building a simple, safe and 100% digital environment.
To succeed, we need to constantly make better decisions at the speed of insight, and that is what we aim for when building Nubank's Data Platform. In this talk we want to share our guiding principles and how we created an automated, scalable, declarative, and self-service platform that lets more than 200 contributors, mostly non-technical, build 8 thousand distinct datasets, ingesting data from 800 databases and leveraging Apache Spark's expressiveness and scalability.
The topics we want to explore are:
- Making data ingestion a no-brainer when creating new services
- Reducing the cycle time to deploy new datasets and machine learning models to production
- Closing the loop and leveraging knowledge processed in the analytical environment to make decisions in production
- Providing the perfect level of abstraction to users
What you will get from this talk:
- Our love for 'The Log' and how we use it to decouple databases from their schemas and distribute the work of keeping schemas up to date across the entire team.
- How we made data ingestion so simple using Kafka Streams that teams stopped using databases for analytical data.
- The huge benefits of relying on the DataFrame API to create datasets, which made it possible to have end-to-end tests verifying that the 8000 datasets work without even running a Spark job, and much more.
- The importance of creating the right amount of abstractions and restrictions to have the power to optimize.
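
To give a feel for the point about testing datasets without running a Spark job: Spark DataFrame transformations are lazy, so a dataset's schema is known before any job executes. The sketch below illustrates that idea in plain Python, swapping Spark for dictionaries so it runs anywhere. The `Dataset` class, `check_catalog` function, and the example dataset names are all hypothetical and are not Nubank's actual platform API.

```python
from dataclasses import dataclass, field

# Hypothetical types for illustration only; not Nubank's platform API.
Schema = dict[str, str]  # column name -> type name


@dataclass
class Dataset:
    name: str
    output: Schema  # columns this dataset declares it produces
    # upstream dataset name -> the columns (and types) this dataset reads from it
    requires: dict[str, Schema] = field(default_factory=dict)


def check_catalog(catalog: list[Dataset]) -> list[str]:
    """Return schema errors found by inspecting declarations only; no job runs."""
    by_name = {d.name: d for d in catalog}
    errors = []
    for d in catalog:
        for upstream, needed in d.requires.items():
            if upstream not in by_name:
                errors.append(f"{d.name}: unknown input {upstream}")
                continue
            available = by_name[upstream].output
            for col, typ in needed.items():
                if col not in available:
                    errors.append(f"{d.name}: {upstream}.{col} is missing")
                elif available[col] != typ:
                    errors.append(
                        f"{d.name}: {upstream}.{col} is {available[col]}, expected {typ}"
                    )
    return errors


# Assumed example catalog: one source, one valid consumer, one broken consumer.
catalog = [
    Dataset("customers", {"id": "string", "created_at": "timestamp"}),
    Dataset(
        "signups_per_day",
        {"day": "date", "count": "long"},
        requires={"customers": {"id": "string", "created_at": "timestamp"}},
    ),
    Dataset("broken", {"x": "long"}, requires={"customers": {"email": "string"}}),
]

errors = check_catalog(catalog)
```

Because the check reads only declared schemas, it scales with the number of datasets rather than the volume of data, which is one plausible reason a declarative platform can validate thousands of datasets cheaply.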