At Thomson Reuters Reseach and Development, we have found several business tasks with the same computational dependencies as distributed matrix multiplication, such as computing user-product recommendations and interpreting LDA topics. While matrix multiplication is an elementary distributed algorithm, Spark is inefficient on it. This talk proposes a formalism to express these kinds of tasks and demonstrates gains from optimizing it. In naive dense matrix multiplication, a cartesian or join generates n**3 data, which are shuffled and reduced to n**2 data. The shuffle is expensive; moreover, the partitioning of the reduction is known ahead of time because of the key assignment pattern, and this can be used to optimize the procedure. We propose a method called “flyby,” which allows the reduction to be localized and avoids the shuffle. I well introduce flyby and provide timing comparisons on business tasks and synthetic problems.