Highly skewed and temporally inhomogeneous key distributions can lead to slower-than-expected execution on common workloads. Understanding the characteristics of the data, in addition to how it flows during execution, is highly beneficial for both batch and streaming use cases. We show that making Spark data-aware yields more stable and faster execution. We extend this data-awareness with record-level tracing capabilities to co-locate services and data-processing workloads effectively.
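To make the skew problem concrete, here is a minimal sketch in plain Python (not Spark code, and not the approach presented in the talk) that simulates hash partitioning over a skewed and a uniform key set. The workload, the key names, the partition count, and the helpers `partition_sizes` and `imbalance` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import zlib


def partition_sizes(keys, num_partitions):
    """Count records per partition under plain hash partitioning."""
    sizes = Counter()
    for key in keys:
        # crc32 is deterministic across runs; Python's built-in hash() of
        # strings is salted per process, so it is avoided here.
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [sizes.get(p, 0) for p in range(num_partitions)]


def imbalance(sizes):
    """Largest partition divided by the mean size (1.0 means balanced)."""
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical workload: one hot key carries 60% of 10,000 records.
skewed_keys = ["hot"] * 6000 + [f"user-{i}" for i in range(4000)]
uniform_keys = [f"user-{i}" for i in range(10000)]

skewed = partition_sizes(skewed_keys, 8)
uniform = partition_sizes(uniform_keys, 8)

print("skewed imbalance: ", round(imbalance(skewed), 2))
print("uniform imbalance:", round(imbalance(uniform), 2))
```

Because every record with the same key hashes to the same partition, the partition holding the hot key receives at least 6,000 of the 10,000 records, while the mean partition size is 1,250. In a framework where a stage finishes only when its slowest task does, that one oversized partition is likely to dominate the stage's running time.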
Zoltán is a researcher and project lead at the Hungarian Academy of Sciences. His main expertise and interest lie in data partitioning and scheduling for distributed data processing frameworks. His current work includes research and development on distributed tracing in Spark and QoS scheduling on Hadoop YARN. Zoltán speaks at various big data conferences and meetups, including Hadoop & Spark Summit.