In her Tech Support role at Alpine Data, Anya helps System Administrators configure their enterprise [Spark | Hadoop | Database] clusters and run [SQL | R | Pig | MR | Spark] jobs using Alpine's web application. She helps Data Scientists tune ML algorithms written in [R | Java | Scala], and helps Business Analysts operationalize workflows for actionable business insights. Anya was enticed into the big data world while publishing her NIH-funded work on signal transduction and the protein interactome during graduate and postdoctoral fellowships at The Mayo Clinic. Anya graduated from Johns Hopkins University with a major in Biomedical Engineering.
Spark offers the promise of speed, but many enterprises are reluctant to make the leap from Hadoop to Spark, and System Administrators face real challenges when tuning Spark performance. This talk is a gentle introduction to Spark tuning for the enterprise System Administrator, based on experience assisting two enterprise companies running Spark in yarn-cluster mode. The initial challenges fall into two FAQs. First, with so many Spark tuning parameters, how do I know which parameters are important for which jobs? Second, once I know which Spark tuning parameters I need, how do I enforce them for the various users submitting various jobs to my cluster? This introduction to Spark tuning will enable enterprise system administrators to overcome common issues quickly and focus on more advanced Spark tuning challenges. The audience will come away understanding the "cheat sheet" posted here: http://techsuppdiva.github.io/

Key takeaways:

FAQ 1: With so many Spark tuning parameters, how do I know which parameters are important for which jobs?
Solution 1: The Spark tuning cheat sheet! A visualization that guides the System Administrator quickly past the most common hurdles to algorithm deployment: http://techsuppdiva.github.io/

FAQ 2: Once I know which Spark tuning parameters I need, how do I enforce them at the user level? Job level? Algorithm level? Project level? Cluster level?
Solution 2: We'll approach these challenges using job and cluster configuration, the Spark context, and third-party tools, of which Alpine will be one example. We'll operationalize Spark parameters at the user, job, algorithm, workflow-pipeline, or cluster level.
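To make the enforcement levels in Solution 2 concrete, here is a minimal sketch of two of them. The property names are standard Spark configuration settings; the specific values, file contents, and job name (my_job.py) are illustrative assumptions, not recommendations from the talk.

```shell
# Cluster level: administrator-set defaults that every job inherits
# unless a user overrides them. These lines live in
# conf/spark-defaults.conf on the host that launches jobs:
#
#   spark.executor.memory            4g
#   spark.executor.cores             2
#   spark.dynamicAllocation.enabled  true

# Job level: a user overrides the cluster defaults for a single
# submission with --conf flags (illustrative values and job name).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=8g \
  --conf spark.sql.shuffle.partitions=400 \
  my_job.py
```

Per Spark's documented precedence, properties set programmatically on the SparkConf inside the application win over spark-submit flags, which in turn win over spark-defaults.conf, so each lower level can be selectively overridden by the one above it.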