Zoltán is a researcher and project lead at the Hungarian Academy of Sciences. His main areas of expertise and interest are data partitioning and scheduling in distributed data processing frameworks. His current work includes research and development on distributed tracing in Spark and QoS scheduling on Hadoop YARN. Zoltán speaks at various Big Data conferences and meetups, including Hadoop & Spark Summit.
We propose a lightweight, on-the-fly Dynamic Repartitioning module for Spark that adaptively repartitions data during execution, with negligible overhead, to provide close-to-uniform partitioning. In our experiments with distributions common in practice (for example, power law), the time needed to complete a stage was reduced by 38% to 59% on average. The approach also improves utilization. Using our full-fledged, real-time visualization tool, we demonstrate:
- that dynamic repartitioning works under various popular use cases,
- that significant speedup can be achieved for common workloads,
- how to fine-tune the partitioning mechanism.
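To make the idea of skew-aware partitioning concrete, here is a minimal Python sketch. It is not the module's actual implementation and does not use Spark. It assumes a key histogram is already available (in a real system this would come from sampling during execution) and uses a simple greedy rule, which is our own illustrative choice: keys above a frequency threshold are pinned to the currently least-loaded partition, while all other keys fall back to hash partitioning. The names `build_skew_aware_assignment`, `partition_loads`, and `heavy_fraction` are hypothetical.

```python
import heapq
from zlib import crc32


def hash_partition(key, num_partitions):
    """Plain hash partitioning with a hash that is stable across runs."""
    return crc32(key.encode("utf-8")) % num_partitions


def build_skew_aware_assignment(key_counts, num_partitions, heavy_fraction=0.01):
    """Return {heavy_key: partition}; keys not listed stay hash-partitioned.

    Assumption: a key is "heavy" if it holds at least `heavy_fraction`
    of all records in the (sampled) histogram.
    """
    threshold = heavy_fraction * sum(key_counts.values())

    # Load that light keys contribute under ordinary hash partitioning.
    loads = [0] * num_partitions
    for key, count in key_counts.items():
        if count < threshold:
            loads[hash_partition(key, num_partitions)] += count

    # Min-heap of (load, partition) so the least-loaded partition pops first.
    heap = [(load, p) for p, load in enumerate(loads)]
    heapq.heapify(heap)

    # Place heavy keys from largest to smallest on the least-loaded partition.
    heavy = sorted(
        ((count, key) for key, count in key_counts.items() if count >= threshold),
        reverse=True,
    )
    assignment = {}
    for count, key in heavy:
        load, p = heapq.heappop(heap)
        assignment[key] = p
        heapq.heappush(heap, (load + count, p))
    return assignment


def partition_loads(key_counts, num_partitions, assignment=None):
    """Records per partition; pinned keys override the hash placement."""
    assignment = assignment or {}
    loads = [0] * num_partitions
    for key, count in key_counts.items():
        p = assignment.get(key, hash_partition(key, num_partitions))
        loads[p] += count
    return loads


# Synthetic power-law-like histogram: the i-th key occurs about 10000 / i times.
key_counts = {f"key{i}": 10000 // i for i in range(1, 1001)}
num_partitions = 8

hashed = partition_loads(key_counts, num_partitions)
assignment = build_skew_aware_assignment(key_counts, num_partitions)
balanced = partition_loads(key_counts, num_partitions, assignment)

print("largest partition, hash only: ", max(hashed))
print("largest partition, skew-aware:", max(balanced))
```

Because a stage finishes only when its slowest task does, the size of the largest partition is the quantity to watch. Note that this sketch never splits a single key across partitions, so one extremely hot key still bounds the achievable balance.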
Highly skewed and temporally inhomogeneous key distributions can lead to slower-than-expected execution on common workloads. Understanding data characteristics, in addition to data flow, during execution is extremely beneficial for both batch and streaming use cases. We show that making Spark data-aware yields more stable and faster execution. We extend this data-awareness with record-level tracing capabilities to co-locate services and data-processing workloads effectively.