Andre Jasiskis has spent the last 3 years co-designing, implementing from scratch, and maintaining the data infrastructure for the biggest fintech outside Asia. He has been thinking a lot about data ingestion and about batch and streaming data processing, in particular how to handle the exponential growth in the data to be processed. Apart from being addicted to data processing, he is addicted to the functional programming paradigm and tries to apply its principles everywhere (even in the data platform).
Nubank is the leading fintech in Latin America. Using bleeding-edge technology, design, and data, the company aims to fight complexity and empower people to take control of their finances. We are disrupting an outdated and bureaucratic system by building a simple, safe and 100% digital environment.
In order to succeed, we need to constantly make better decisions at the speed of insight, and that is what we aim for when building Nubank's Data Platform. In this talk we want to explore and share our guiding principles and how we created an automated, scalable, declarative and self-service platform that lets more than 200 contributors, mostly non-technical, build 8 thousand distinct datasets, ingesting data from 800 databases and leveraging Apache Spark's expressiveness and scalability.
The topics we want to explore are:
- Making data ingestion a no-brainer when creating new services
- Reducing the cycle time to deploy new Datasets and Machine Learning models to production
- Closing the loop and leveraging knowledge processed in the analytical environment to make decisions in production
- Providing the perfect level of abstraction to users
You will get from this talk:
- Our love for 'The Log' and how we use it to decouple databases from their schemas and distribute the work of keeping schemas up to date across the entire team.
- How we made data ingestion so simple using Kafka Streams that teams stopped using databases for analytical data.
- The huge benefits of relying on the DataFrame API to create datasets, which made it possible to have end-to-end tests verifying that the 8000 datasets work without even running a Spark job, and much more.
- The importance of creating the right amount of abstractions and restrictions to have the power to optimize.