Analyzing Log Data With Apache Spark

Contemporary applications and infrastructure software leave behind a tremendous volume of metric and log data. This aggregated “digital exhaust” is inscrutable to humans and difficult for computers to analyze, since it is vast, complex, and not explicitly structured.

This session will introduce the log processing domain and provide practical advice for analyzing log data with Apache Spark, including:
– how to impose a uniform structure on disparate log sources;
– machine-learning techniques to detect infrastructure failures automatically and characterize the text of log messages;
– best practices for tuning Spark, training models against structured data, and ingesting data from external sources like Elasticsearch; and
– a few relatively painless ways to visualize your results.
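
As a rough illustration of the first item, imposing a uniform structure on disparate log sources, here is a minimal sketch in plain Python. It is not the speaker's implementation: the two regular expressions, the `normalize` function, the output field names, the default year, and the choice to treat syslog timestamps as UTC are all assumptions made for the example.

```python
import re
from datetime import datetime, timezone

# Hypothetical patterns for two sample formats; real log sources will differ.
SYSLOG = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<source>[^\[:\s]+)(?:\[\d+\])?: (?P<message>.*)$"
)
ACCESS = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<message>[^"]*)" \d{3} \S+'
)


def normalize(line, year=2016):
    """Map one raw log line to a uniform record, or None if nothing matches."""
    m = SYSLOG.match(line)
    if m:
        # Syslog timestamps carry no year or zone; both are assumed here.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        ts = ts.replace(tzinfo=timezone.utc)
        return {
            "timestamp": ts.isoformat(),
            "host": m["host"],
            "source": m["source"],
            "message": m["message"],
        }
    m = ACCESS.match(line)
    if m:
        # Access-log timestamps include an explicit UTC offset.
        ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        return {
            "timestamp": ts.astimezone(timezone.utc).isoformat(),
            "host": m["host"],
            "source": "access",  # hypothetical label for web access logs
            "message": m["message"],
        }
    return None  # unrecognized format; the caller decides how to handle it
```

Because `normalize` is an ordinary function of one line, it could in principle be mapped over a Spark RDD of raw lines (for example, one created with `sc.textFile`) and the `None` results filtered out, so that every downstream step sees records with the same fields.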

You’ll have a better understanding of the unique challenges posed by infrastructure log data after this session. You’ll also learn the most important lessons from our efforts both to develop analytic capabilities for an open-source log aggregation service and to evaluate these at enterprise scale.

Learn more:

  • Analyzing Apache Access Logs with Databricks

  • About William Benton

    William Benton leads a team of data scientists and engineers at Red Hat, where he has applied analytic techniques to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is investigating the best ways to build and deploy intelligent applications in cloud-native environments, but he has also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology.