Understanding the dynamics of GPU utilization and workloads in containerized systems is critical to creating efficient software systems. We create a set of dashboards to monitor and evaluate GPU performance in the context of TensorFlow. We monitor performance in real time to gain insight into GPU load, GPU memory and temperature metrics in a Kubernetes GPU enabled system. Visualizing TensorFlow training job metrics in real time using Prometheus allows us to tune and optimize GPU usage. Also, because Tensor flow jobs can have both GPU and CPU implementations it is useful to view detailed real time performance data from each implementation and choose the best implementation. To illustrate our system, we will show a live demo gathering and visualizing GPU metrics on a GPU enabled Kubernetes cluster with Prometheus and Grafana.
Diane Feddema is a principal software engineer at Red Hat Inc Canada, Emerging Technologies Group. Diane is currently focused on developing and applying big data techniques for performance analysis, automating these analyses and displaying data in novel ways. Previously Diane was a performance engineer at the National Center for Atmospheric Research, NCAR, working on optimizations and tuning in parallel global climate models. She has a MS in Computer Science from the University of Colorado.
Currently focused on developing analytics platform on OpenShift and leveraging Open Source ML Frameworks: Apache Spark, Tensorflow and more. Designing high performance and scalable ML platform that exposes metrics through cloud-native technology: Prometheus and Kubernetes.