Modern datacenters and IoT networks generate a wide variety of telemetry that makes excellent fodder for machine learning algorithms. Combined with feature extraction and expansion techniques such as word2vec or polynomial expansion, these data yield an embarrassment of riches for learning models and the data scientists who train them. However, these extremely rich feature sets come at a cost. High-dimensional feature spaces almost always include many redundant or noisy dimensions. These low-information features waste space and computation, and reduce the quality of learning models by diluting useful features.
In this talk, Erlandson will describe how Random Forest Clustering identifies the useful features in data that contain many low-quality ones, and will demonstrate a feature reduction application that uses Apache Spark to analyze compute infrastructure telemetry data.
Learn how Random Forest Clustering solves feature reduction problems, and how you can apply Random Forest tools in Apache Spark to improve the scalability of your model training, the quality of your models, and your understanding of application domains.
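The abstract does not spell out the mechanics, so the following is only a minimal sketch of one common formulation of unsupervised random forest learning, which may differ from what the talk presents. In that formulation, synthetic rows are drawn by sampling each feature independently from its observed values, which preserves the per-feature distributions but breaks any relationship between features. A random forest is then trained to tell real rows from synthetic ones, and the features it relies on are the ones that carry structure. The sketch covers only the data-preparation step in plain Python; the function names and the toy telemetry values are hypothetical, and a real application would train the forest with a library such as Spark MLlib.

```python
import random

def make_synthetic(rows, n_synthetic, seed=0):
    """Draw synthetic rows by sampling every feature independently
    (with replacement) from that feature's observed values."""
    rng = random.Random(seed)
    columns = list(zip(*rows))  # one tuple of observed values per feature
    return [tuple(rng.choice(col) for col in columns)
            for _ in range(n_synthetic)]

def labeled_training_set(rows, seed=0):
    """Label real rows 1 and an equal number of synthetic rows 0."""
    synthetic = make_synthetic(rows, len(rows), seed)
    return [(tuple(r), 1) for r in rows] + [(s, 0) for s in synthetic]

# Hypothetical toy telemetry: the first two features move together,
# the third is unrelated noise.
real = [(0.1, 0.2, 7), (0.5, 0.6, 3), (0.9, 1.0, 5), (0.3, 0.4, 1)]
training = labeled_training_set(real, seed=42)
```

A classifier trained on `training` can only separate the two classes by exploiting dependencies between features, so features that never help (such as the noise column here) are candidates for removal.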
Session hashtag: #SFds8
Erik Erlandson is a Software Engineer at Red Hat, where he investigates analytics use cases and scalable deployments for Apache Spark in the cloud. He also consults on internal data science and analytics projects. Erik is a contributor to Apache Spark and other open source projects in the Spark ecosystem, including the Spark on Kubernetes community project, Algebird and Scala.