Traditional data architectures are not enough to handle the huge amounts of data generated by millions of users. In addition, the diversity of data sources is increasing every day: distributed file systems and relational, column-oriented, document-oriented or graph databases.
Letgo has been growing quickly over the last few years. Because of this, we needed to improve the scalability of our data platform and give it further capabilities, like “dynamic infrastructure elasticity”, real-time processing or real-time complex event processing. In this talk, we are going to dive deeper into our journey. We started from a traditional data architecture with ETL and Redshift, and today we have successfully built an event-oriented and horizontally scalable data architecture.
We will explain the platform in detail, from event ingestion with Kafka / Kafka Connect to stream and batch processing with Spark. On top of that, we will discuss how we have used Spark Thrift Server / Hive Metastore as glue to exploit all our data sources (HDFS, S3, Cassandra, Redshift, MariaDB …) in a unified way from any point of our ecosystem, using technologies like Jupyter, Zeppelin, Superset … We will also describe how to build ETL using only pure Spark SQL, with Airflow for orchestration.
Along the way, we will highlight the challenges that we found and how we solved them. We will share a lot of useful tips for those who also want to start this journey in their own companies.
Session hashtag: #SAISExp2
Ricardo Fanjul is a Data Engineer at Letgo, where he is designing the new data architecture. He specializes in highly scalable technologies like Spark, Flink, Hadoop, Kafka, Cassandra or Akka. Previously, he developed highly scalable distributed systems for companies like ING BANK, where he worked in the core team of the bank's new architecture. He has a Bachelor's degree in Computer Science and a Master's in Web Engineering.