Spark is well documented and easy to get up and running for both experimentation and data analysis, especially for people coming from the Hadoop world. For a group just starting out with "big data" technologies, however, there are some important items to address before going live with the new toolkit.
In this talk, we highlight some of the challenges a group adopting a Spark-centered stack may encounter, along with ways to deal with them. These include:
– Choosing, maintaining, and possibly compiling the right combination of packages to work with Spark (Hadoop/Cassandra, Mesos/YARN)
– Data serialization/deserialization, especially when working with binary protocols
– Performance pitfalls with small data sets, where framework overhead can dominate
– Deployment and configuration automation
– Preparing for non-developer usage (plugging in the right libraries and third-party packages)
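As a small taste of the serialization topic, a common early tweak is switching Spark from its default Java serialization to Kryo. A minimal sketch of the relevant configuration keys (these are standard Spark settings; the registered class name is a hypothetical application class):

```
# spark-defaults.conf — enable Kryo serialization
spark.serializer                 org.apache.spark.serializer.KryoSerializer

# Optionally pre-register application classes for more compact output.
# com.example.MyRecord is a hypothetical class standing in for your own types.
spark.kryo.classesToRegister     com.example.MyRecord
```

The same keys can be passed per job via `spark-submit --conf`, which is often more convenient while a team is still experimenting with settings.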