Peter Norvig’s “Latency Numbers Every Programmer Should Know” are a classic of software engineering training. The original sixteen numbers capture, for programmers, the hard constraints our hardware imposes. In the early 2000s, if you cared about writing fast code, you knew that a disk seek cost about 10 milliseconds.
> “You don’t have to be an engineer to be a racing driver, but you do have to have Mechanical Sympathy.” — Jackie Stewart
Latency numbers are for programmers who want their systems to be fast.
Failure numbers are for programmers who want their systems to be reliable.
| Thing | Type | MTTF (years) | AFR | Notes |
|---|---|---|---|---|
| CPU failure | Hardware | ~1,700 | ~0.06% | Server CPUs very rarely fail outright. Intel IT measured a 0.06% CPU AFR across 223,050 CPUs in 207,956 HPC servers, which converts to an MTTF of roughly 1,700 years by the simple reciprocal math used here.[^1] |
| Motherboard failure | Hardware | ~260 | ~0.38% | Motherboards are still rare failures, but less rare than CPUs. In the same Intel IT dataset, motherboards had a 0.38% AFR, or roughly 260 years MTTF by the same conversion.[^1] |
| SSD failure | Hardware | ~100 | ~1% | Enterprise SSD field data is usually around or below 1% AFR at the headline level, with model, age, and write workload hiding underneath. Backblaze’s SSD boot-drive data is in this ballpark, though it is a much smaller SSD sample than its HDD fleet.[^2] |
| HDD failure | Hardware | ~60 | ~1.5% | Backblaze’s 2025 fleet snapshot reports a 1.36% annual AFR and a 1.30% lifetime AFR across hundreds of thousands of drives.[^3] Use 1-2% unless you know the specific drive model and age. |
| RAM uncorrectable error | Hardware | ~75 | ~1-4% | In Google’s DRAM study, 1.29% of machines per year had at least one uncorrectable error, with individual platforms reaching 4.15%.[^4] One uncorrectable error typically means a machine shutdown and DIMM replacement. |
| AWS regional outage, non-us-east-1 | Service | ~4 | ~25% | Here a failure means a region-scale incident big enough to require application-level mitigation, not every status-page blip. |
| AWS regional outage, us-east-1 | Service | ~2 | ~50% | us-east-1 deserves its own row because it is old, huge, and entangled with many AWS control planes. See the October 2025 AWS outage for the shape of one such event. |
| ElastiCache node failure | Service | ~0.3 | ~300% | AWS documents node replacement and failover as normal ElastiCache operating behavior.[^5] The rate here is based on internal Modal operational data: roughly three node failures or replacements per node-year in the fleet. |
| NVIDIA A100 critical error[^6] | Hardware | ~0.18 (65 days) | ~560% | Internal Modal fleet measurements. At this rate, a fleet of 1,000 A100s should expect about 15 critical GPU errors per day. |
| NVIDIA H100 critical error | Hardware | ~0.14 (50 days) | ~730% | Internal Modal fleet measurements. |
| Cloud VM unavailability | Service | ~20-100 | ~1-5% | Cloud providers publish availability SLAs, not clean per-VM failure rates.[^7] For a single cloud VM, I use 1-5% as a rough annual chance that the VM needs recovery or replacement because the underlying host, network, or power failed underneath it. |
| Cloud VM disk loss | Service | ~500-1,000 | ~0.1-0.2% | AWS EBS gp2, gp3, io1, st1, and sc1 volumes are documented at 99.8-99.9% durability, which AWS also states as a 0.1-0.2% annual failure rate.[^8] io2 Block Express is a different class at 99.999% durability, or 0.001% AFR. |
| Production bug or defect | Software | ~0.001-0.005 (12h-2d) | ~20k-100k% | The most frequent failure mode is us. For active services deploying many times per day, DORA’s change fail rate and deployment rework rate turn into a daily rhythm of defects, hotfixes, and regressions.[^9] |
- MTTF is mean time to failure: the average elapsed time between failures of a component, such as a disk. For repairable systems this is often discussed as MTBF.
- AFR is annualized failure rate. For low-rate component failures, read it as the approximate fraction of a population expected to fail in a year. For repeat-failing rows above 100%, read it as an annualized event rate: 300% means about three failures per component-year.
- I use the simple conversion `MTTF ≈ 1 / AFR` when AFR is expressed as failures per component-year. This is a napkin-math table, not a claim that failures are independent, exponentially distributed, or evenly spread over time. The hardware papers are very clear that failures are correlated and messy.
- Obviously these estimates hinge on the definition of a failure. A fault is usually one component deviating from its specification; a failure is when the system as a whole stops servicing the client or user. An example of a fault is a memory cell dying in an NVIDIA GPU, which does not necessarily fail the device.
- This table does not attempt to rank severity. Severity is dependent on the relationship of the failed component to the rest of the system. A dead disk in a healthy replicated storage system is routine; a dead disk under a single-node Postgres instance can be the whole show.
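The reciprocal conversion above is easy to check mechanically. A minimal sketch (helper names are mine; the rates are taken from the table) that turns an AFR into an MTTF and a fleet-level daily event rate:

```python
def mttf_years(afr: float) -> float:
    # Simple reciprocal conversion: MTTF ~= 1 / AFR,
    # with AFR expressed as failures per component-year.
    return 1.0 / afr

def fleet_failures_per_day(afr: float, fleet_size: int) -> float:
    # Annualized event rate spread evenly over the year -- napkin math,
    # not a claim that failures arrive independently or uniformly.
    return afr * fleet_size / 365.0

# A100s at ~560% AFR: MTTF of about 65 days, and a fleet of 1,000
# should see about 15 critical GPU errors per day.
print(round(mttf_years(5.6) * 365))              # days of MTTF
print(round(fleet_failures_per_day(5.6, 1000)))  # events per day
```

Running the same arithmetic against the HDD row (1.5% AFR gives an MTTF near 67 years) is a quick sanity check that the table’s columns agree with each other.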
[^1]: Intel IT’s *Green Computing at Scale* reports component annualized failure rates from 207,956 HPC servers observed from May 2019 through June 2020, including 223,050 CPUs and 207,956 motherboards.

[^6]: “Critical error” here means an NVIDIA Xid or SXid error that is not recoverable without an application and GPU reset. NVIDIA’s Xid documentation classifies some errors with immediate actions such as `RESET_GPU` or `RESTART_APP`. See NVIDIA Xid Errors.