Helena has worked exclusively with Scala in production since 2010, building large-scale distributed systems in the cloud. As a Senior Cloud Engineer she was on the first Scala team at VMware, building multi-tenant cloud automation systems, then moved into big data, architecting, building, and deploying streaming and batch analytics pipelines for real-time cyber-security threat analysis. Most recently she has worked on streaming analytics and machine learning at scale with Apache Spark, Cassandra, Kafka, Akka, and Scala. Helena is a committer to the Spark Cassandra Connector and a contributor to Akka, where she added new Akka Cluster features such as the initial version of the cluster metrics API and the AdaptiveLoadBalancingRouter. While at SpringSource she contributed to several open source projects, including Spring Integration and Spring AMQP. Helena speaks at international big data and Scala conferences such as Spark Summit, QCon, Scala Days, and Philly Emerging Technology. She is currently VP of Product Engineering at Tuplejump.
This talk will start by addressing what we need from a big data pipeline system in terms of performance, scalability, distribution, concurrency, and self-healing. From there I will explain why this particular set of technologies addresses these requirements and which features of each support them. I will show how they actually work together, from the application layer to deployment across multiple data centers, within the framework of the Lambda Architecture. Finally, I will show how to easily leverage and integrate everything from an application standpoint (in Scala) for fast, streaming computations in asynchronous, event-driven environments.
This talk will address how a new architecture is emerging for analytics, based on Spark, Mesos, Akka, Cassandra, and Kafka (SMACK). Popular architectures like Lambda separate layers of computation and delivery, and require many technologies with overlapping functionality. This can result in duplicated code, untyped processes, and high operational overhead, not to mention cost (e.g., ETL). I will discuss the problem domain and what is needed in terms of strategies, architecture, application design, and code to begin leveraging simpler data flows. We will cover how this particular set of technologies addresses common requirements and how, working together, the components enrich and reinforce one another.