by Steven Chen, Feng Wang, Bhavik Soni, Chengguang Yang, Albert Zhong, Naren Loganathan, Harsh Panchal and Jianwei Xie
Distributed GPU training has become routine across the industry. Teams now train foundation models, fine-tune frontier-scale models, build large vision systems, and run deep recommender networks at scales that were once the domain of frontier labs alone.
Building GPU infrastructure that can meet today's scale requires getting a lot of things right: detecting the failures that take down a run, surfacing the slow degradations that never announce themselves, validating fabric health across thousands of links, scheduling around hardware that will eventually fail, and recovering cleanly when it does. Many of these are foundational, and the harder problems higher up the stack depend on them.
At Databricks AI, we run training workloads at massive scale every week, where failures show up continuously across hardware, fabric, and software. This series covers what it takes to keep GPUs reliable at this scale, starting with the foundation in this first post: the failure modes you encounter running GPUs, the diverse workloads that surface them, and the multi-stage health check system that catches them. Training is the most demanding workload class and the focus here, though the same engineering serves inference and other GPU workloads at Databricks.
Most GPU failures at scale fall into three categories: crashed jobs, silent slowdowns, and numerical corruption. Crashed jobs are the easy case, in the sense that you know immediately when one happens. The harder failures are workloads that complete with wrong numbers in the model, or run at degraded performance for hours without anyone noticing.
Crashed jobs. Distributed training jobs crash for many reasons: a GPU degrading or falling off the bus, RDMA fabric issues, an I/O system hang, a CPU-side rank diverging from the others. From the workload's perspective, almost all of these surface as the same thing: the job crashing with the dreaded NCCL watchdog timeout message in the logs. Every rank blocks on the same collective, the watchdog eventually kills the job, and you restart from the last checkpoint. But the timeout itself tells you almost nothing about the root cause. Diagnosing what actually went wrong often means tracing across hardware, fabric, filesystem, and software layers from a stack trace that only shows the symptom.
Silent slowdowns. A silently degraded GPU can continue to make training progress, with logs looking fine and loss still trending down. However, the throughput is bottlenecked on the slowest GPU, wasting compute and money. These slowdowns come from hardware running in a degraded state, where thermal sensors trip under sustained load, interconnect links downgrade after persistent errors, or memory bandwidth drops as faults accumulate. Each shows up in different hardware-level signals, e.g. DCGM throttle reasons like HW_SLOWDOWN or HW_THERMAL_SLOWDOWN for thermal, or link health for interconnects.
Numerical corruption. Modern GPUs use Error Correction Code (ECC) to detect and automatically correct many transient memory faults without interrupting training. However, not all faults can be recovered. Corruption may originate in memory, interconnects, kernels, or software layers and can propagate before it is detected or contained. In those cases, training may stop immediately or continue with incorrect values. Failures can appear as NaN losses, unstable convergence, or model quality regressions that are only discovered later.
GPU hardware failure event rates can be an order of magnitude higher than those of CPUs. As a conservative back-of-the-envelope assumption, take each GPU as having a 1% annualized failure event rate. For a job using N GPUs over T days, the probability of at least one event is approximately:

$$P(\text{at least one event}) \approx 1 - (1 - 0.01)^{N \cdot T / 365}$$
A 256-GPU job running for 30 days has about a 19% chance of seeing a failure. At 1,024 GPUs, that climbs to 57%. At this scale, failures during a run are expected, not exceptional. As a foundation, two engineering investments keep training reliable despite them: stress testing with diverse, cutting-edge workloads that surface failures early, and a multi-stage health check system that catches them across the fleet.
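The figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes failures are independent across GPUs and occur at a constant rate over the year; the function name `p_at_least_one_failure` is ours, not part of any tooling described in this post.

```python
def p_at_least_one_failure(num_gpus: int, days: float, annual_rate: float = 0.01) -> float:
    """Probability that at least one GPU sees a failure event during the job.

    Assumes independent failures and a constant per-GPU annualized event rate.
    """
    gpu_years = num_gpus * days / 365.0
    # Probability that every GPU-year survives, subtracted from 1.
    return 1.0 - (1.0 - annual_rate) ** gpu_years


for n in (256, 1024):
    print(f"{n:>5} GPUs, 30 days: {p_at_least_one_failure(n, 30):.0%}")
```

Under these assumptions the loop prints 19% for 256 GPUs and 57% for 1,024 GPUs, matching the numbers quoted above.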
Databricks AI runs a range of demanding training workloads on the same platform customers use: reinforcement learning training for models like KARL, agentic coding models, document intelligence systems like the one behind PDFs in production, and more. These aren't typical training jobs. RL workloads combine training, inference, and reward computation in tight loops across many GPUs. Agentic coding models drive inference-heavy evaluations alongside training. Document intelligence pipelines combine model training with heavy image-based data loading.
Each one stresses the platform in distinct ways, which makes them effective at surfacing operational issues like fabric flakiness, thermal hotspots, and edge cases in collective communication before they reach broader production workloads.

Here's what one recent issue looked like. A training run failed with an NCCL timeout seven hours into training. Investigation showed that a single InfiniBand port used for RDMA NCCL collectives had gone down once and recovered. It never flapped again. Our continuous health checks monitor IB port flapping, but a single isolated flap doesn't normally indicate an unhealthy port, so it wouldn't trip the threshold on its own.
The crash came down to which of two NCCL timeouts fires first. Most discussions of NCCL configuration focus on the PyTorch NCCL watchdog timeout, configurable via init_process_group(timeout=...), which kills a hung collective after some configurable duration (typically 10 minutes). A second timeout sits lower in the stack and fires long before it: NCCL_IB_TIMEOUT, at the InfiniBand transport layer, controls how long a connection waits for a downed port to recover before tearing the connection down. Its effective default works out to roughly seven seconds with retries factored in, much shorter than most teams realize. Once a single port-down window exceeds that, the connection is gone and the collective is already dead, regardless of how the PyTorch watchdog timeout is set. By the time the watchdog notices the hang, the run is already committed to crash.
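To make the gap between the two timeouts concrete, here is a rough sketch of the arithmetic. It assumes the transport timeout follows the InfiniBand convention of 4.096 µs × 2^timeout per attempt, multiplied by the retry count. The values 18 and 7 are assumptions chosen because they reproduce the roughly seven-second figure above; actual defaults depend on the NCCL version, and the exact number of attempts the hardware makes may differ by one.

```python
IB_TIMEOUT_UNIT_SECONDS = 4.096e-6  # base unit of the InfiniBand ACK timeout


def ib_teardown_window_seconds(ib_timeout: int, retry_cnt: int) -> float:
    """Approximate time a connection waits before it is torn down.

    ib_timeout plays the role of NCCL_IB_TIMEOUT (an exponent), and
    retry_cnt the role of NCCL_IB_RETRY_CNT. Both are assumed values here.
    """
    per_attempt = IB_TIMEOUT_UNIT_SECONDS * (2 ** ib_timeout)
    return per_attempt * retry_cnt


transport_window = ib_teardown_window_seconds(ib_timeout=18, retry_cnt=7)
watchdog_timeout = 10 * 60  # the typical PyTorch watchdog duration cited above

print(f"transport window ~ {transport_window:.1f} s")
print(f"watchdog timeout = {watchdog_timeout} s")
```

Because the timeout is an exponent, each increment doubles the window, so a small change in the setting buys a large change in how long a port can stay down before the connection is lost.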
The signal that matters for training impact is cumulative downtime, not flap count. A single sufficiently long flap can crash a multi-day training run, just like repeated flaps over hours. We tuned our NCCL_IB_TIMEOUT defaults to be more resilient, and the same port-down signal lets us crash and restart the job from a checkpoint without leaving GPUs idle until the watchdog fires. This investigation is one of many feeding into the health check system the rest of this post describes.
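The difference between counting flaps and measuring downtime can be sketched as follows. This is an illustrative example only: `PortEvent`, `summarize_downtime`, and the thresholds are hypothetical names and values, not the gpu-monitor implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortEvent:
    timestamp: float  # seconds
    state: str        # "down" or "up"


def summarize_downtime(events, now):
    """Return (flap_count, longest_window_s, cumulative_s) for one port."""
    windows = []
    down_since = None
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.state == "down" and down_since is None:
            down_since = ev.timestamp
        elif ev.state == "up" and down_since is not None:
            windows.append(ev.timestamp - down_since)
            down_since = None
    if down_since is not None:
        windows.append(now - down_since)  # port is still down
    return len(windows), max(windows, default=0.0), sum(windows)


# One isolated flap that lasts 12 seconds.
events = [PortEvent(100.0, "down"), PortEvent(112.0, "up")]
flaps, longest, total = summarize_downtime(events, now=200.0)

FLAP_COUNT_THRESHOLD = 3      # hypothetical flap-count rule
TRANSPORT_WINDOW_S = 7.5      # hypothetical transport teardown window

trips_flap_rule = flaps >= FLAP_COUNT_THRESHOLD
trips_downtime_rule = longest > TRANSPORT_WINDOW_S
print(flaps, longest, total, trips_flap_rule, trips_downtime_rule)
```

In this example the flap-count rule stays quiet while the downtime rule fires, which mirrors the incident described above: one flap, long enough to kill the connection.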
We built gpu-monitor as a multi-stage health check and observability service that runs on every GPU node, covering the entire node lifecycle. Different categories of check run at different stages, because different failure modes are catchable in different conditions.

Active bootstrap checks run when a node is first provisioned and again every time it's cleaned between customer workloads. Every workload starts on a node that just passed the full check suite. These catch deterministic failures, things that can be reliably surfaced by a targeted test up front. A representative sample of what gpu-monitor runs:
A node failing any active check is immediately removed from the fleet before any workload runs on it. Bad nodes are quarantined, then put through resets and thorough re-testing before either returning to the fleet or being permanently removed.
Passive continuous checks watch for the non-deterministic failure modes from the previous sections, failures that only emerge under sustained workload pressure. For these, gpu-monitor runs a second layer of checks on every active node, such as:
Throttle reasons (HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, HW_POWER_BRAKE)

Nodes showing continuous-check failures are cordoned, drained, and go through the same quarantine process as active bootstrap check failures.
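Throttle reasons such as these are typically reported as a bitmask, so a continuous check has to decode the mask and separate hardware-driven throttling from benign reasons. The sketch below uses the bit values from NVML's clock throttle reason definitions as we understand them and assumes DCGM reports the same bits; treat the constants as assumptions to verify against your driver headers. The function name is hypothetical.

```python
# Assumed bit values, following NVML's clock throttle reason constants.
HARDWARE_THROTTLE_BITS = {
    "HW_SLOWDOWN": 0x08,
    "HW_THERMAL_SLOWDOWN": 0x40,
    "HW_POWER_BRAKE": 0x80,
}


def hardware_throttle_reasons(mask: int) -> list[str]:
    """Return the names of hardware throttle reasons set in the bitmask."""
    return [name for name, bit in HARDWARE_THROTTLE_BITS.items() if mask & bit]


# 0x48 has both the HW_SLOWDOWN and HW_THERMAL_SLOWDOWN bits set.
print(hardware_throttle_reasons(0x48))
# 0x04 (a software power cap under the same assumption) is not a hardware reason.
print(hardware_throttle_reasons(0x04))
```

A check built this way would flag a node only when the returned list is non-empty for a sustained period, since brief throttling under load is not necessarily a fault.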
Periodic multi-node active checks validate inter-node fabric behavior that no single node can surface on its own. They run periodically on idle nodes between customer workloads to isolate inter-node fabric issues from single-node degradation the bootstrap layer already catches. Because these run on idle nodes and can be preempted when customer workloads need the nodes, they can be more expensive than what fits inside an active check at provisioning time.
The tests themselves include NCCL collective bandwidth probes across node groups, sweeping payload sizes from 8 bytes to 2 GiB. Different payload sizes matter because NCCL triggers different code paths. Small messages in the KB range run through low-latency protocols like LL and LL128 and are latency-dominated, making p95 latency the useful pass criterion. Medium messages in the MB range cross thresholds where NCCL switches algorithms from tree to ring. Large messages exercise chunking and pipelining as bandwidth limits are reached, making BusBW (bus bandwidth) the useful pass criterion. Hardware issues often surface in only one of those code paths. A representative output of the conditions we check for all-reduce bandwidth in our health check:
| Payload size | p50 latency | p95 latency | AlgBW | BusBW | Pass criterion |
|---|---|---|---|---|---|
| 1 KB | 118 µs | 120 µs | 0.009 GB/s | 0.016 GB/s | Pass if p95 latency ≤ 250 µs. |
| 1 MB | 288 µs | 319 µs | 3.64 GB/s | 6.82 GB/s | Pass if p95 latency ≤ 500 µs. |
| 16 MB | 398 µs | 408 µs | 42.2 GB/s | 79.1 GB/s | Pass if BusBW ≥ 50 GB/s and p95 latency ≤ 750 µs. |
| 128 MB | 1.18 ms | 1.20 ms | 114 GB/s | 213 GB/s | Pass if BusBW ≥ 150 GB/s. |
| 256 MB | 1.68 ms | 1.70 ms | 160 GB/s | 299 GB/s | Pass if BusBW ≥ 225 GB/s. |
| 1 GB | 6.39 ms | 6.50 ms | 168 GB/s | 315 GB/s | Pass if BusBW ≥ 250 GB/s. |
| 2 GB | 9.05 ms | 9.07 ms | 237 GB/s | 445 GB/s | Pass if BusBW ≥ 350 GB/s. |
AlgBW (algorithm bandwidth) measures throughput as the workload sees it. BusBW (bus bandwidth) accounts for the fact that a collective like all-reduce moves each byte across the fabric multiple times, so it better reflects real link utilization and hardware health.
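The relationship between the two columns can be checked directly. For all-reduce, the convention used by the NCCL performance tests is BusBW = AlgBW × 2(n − 1)/n for n ranks. The sketch below assumes 16 ranks, which is our inference from the BusBW/AlgBW ratio of about 1.875 in the table rather than a stated fact, and it assumes payload sizes are binary (1 MB = 2^20 bytes) and AlgBW is derived from p50 latency. The function names are hypothetical.

```python
def allreduce_bandwidth(payload_bytes: int, seconds: float, num_ranks: int):
    """Return (alg_bw, bus_bw) in GB/s for an all-reduce."""
    alg_bw = payload_bytes / seconds / 1e9
    # Each byte crosses the fabric 2(n-1)/n times in an all-reduce.
    bus_bw = alg_bw * 2 * (num_ranks - 1) / num_ranks
    return alg_bw, bus_bw


def passes_16mb_check(bus_bw_gbps: float, p95_seconds: float) -> bool:
    """Pass criterion from the 16 MB row: BusBW >= 50 GB/s and p95 <= 750 us."""
    return bus_bw_gbps >= 50.0 and p95_seconds <= 750e-6


# 16 MB row: p50 latency 398 us, p95 latency 408 us, 16 ranks assumed.
alg_bw, bus_bw = allreduce_bandwidth(16 * 2**20, 398e-6, num_ranks=16)
print(f"AlgBW ~ {alg_bw:.1f} GB/s, BusBW ~ {bus_bw:.1f} GB/s")
print("pass:", passes_16mb_check(bus_bw, 408e-6))
```

Under these assumptions the computed values land within rounding of the 42.2 GB/s and 79.1 GB/s shown in the table, and the row passes its criterion.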
Together, the three layers verify hardware before workloads start, watch it while they run, and validate the broader fabric in between. As new failure modes emerge, we incorporate new health checks and ship gpu-monitor out to the whole fleet.
GPU reliability is a compounding system. New hardware generations and workload patterns keep surfacing failure modes that need to be folded back into the checks, and each one makes the system stronger. This post covered the foundation everything else rests on. Future posts in this series build up from it, into the work that keeps training reliable as runs get larger, architectures change, and RL workloads combine training and inference in the same loop.
Reliable GPU infrastructure at this scale is what makes the next generation of AI products possible. If GPU reliability at scale is the kind of problem you want to work on, we're hiring!