Note · 2025-12-28

How do we prove self-driving cars are safe? The verification challenge

RAND estimates 11 billion road miles to prove AV safety statistically. No single fleet can do that. Multi-formalism verification is the only path that fits the timeline.

Statistical safety claims for autonomous vehicles need impossible mileage. RAND Corporation estimated 11 billion miles of testing to demonstrate, with 95 percent confidence, that an AV is 20 percent safer than a human driver. A 100-vehicle fleet running 24/7 at an average 25 mph would need roughly 500 years to cover that. The way out is multi-formalism verification: combine formal models of the controllers, simulation of the perception stack, and road testing for the long tail, then compose the evidence into a single safety case (ISO 21448 SOTIF). My TechRxiv preprint (2025) walks through the composition rules.
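The 500-year figure is straightforward arithmetic, shown here as a quick sanity check (the fleet size, speed, and duty cycle are the assumptions stated above):

```python
# Back-of-envelope check of the 500-year figure.
# Assumptions: 100 vehicles, 24/7 operation, average 25 mph, 11 billion miles.
FLEET_SIZE = 100
AVG_SPEED_MPH = 25
HOURS_PER_YEAR = 24 * 365
REQUIRED_MILES = 11e9

miles_per_year = FLEET_SIZE * AVG_SPEED_MPH * HOURS_PER_YEAR  # ~21.9M miles/year
years_needed = REQUIRED_MILES / miles_per_year                # ~502 years
print(f"{miles_per_year:.3g} miles/year -> {years_needed:.0f} years")
```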

Why the obvious approach does not work

The intuitive way to prove an AV is safe is to drive it. Compare crash rates against human drivers. Run for long enough and the comparison is statistically meaningful.

The problem is the human-driver baseline. Fatal crashes happen roughly once per 100 million vehicle miles in the US. To detect a 20 percent safety improvement with 95 percent confidence requires, by RAND's calculation, 11 billion miles of AV operation. Even Waymo's roughly 50 million autonomous miles as of 2025 falls short by a factor of more than 200.
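A back-of-envelope power calculation in the same spirit as the RAND estimate reproduces the order of magnitude. This is not RAND's exact method; it assumes a baseline rate of 1 fatal crash per 100 million miles, a one-sided test at 95 percent confidence with 80 percent power, and a normal approximation to the Poisson crash count:

```python
# Rough power calculation: miles needed to detect a 20% reduction in a
# rare-event rate. NOT RAND's exact method; assumptions as stated above.
from math import sqrt

base_rate = 1e-8           # human fatal crashes per mile (~1 per 100M miles)
av_rate = 0.8 * base_rate  # alternative hypothesis: AV is 20% safer
z_alpha, z_beta = 1.645, 0.842  # one-sided 95% confidence, 80% power

# Solve for miles M in: z_a*sqrt(base*M) + z_b*sqrt(av*M) = (base - av)*M
miles = ((z_alpha * sqrt(base_rate) + z_beta * sqrt(av_rate))
         / (base_rate - av_rate)) ** 2
print(f"required miles: {miles:.2g}")  # on the order of 10 billion
```

The exact answer shifts with the assumed power and baseline rate, which is why published figures vary around the 10-billion-mile mark.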

The pieces of the puzzle

The puzzle is that the AV is not one system. It is at least four:

  • Perception. Cameras, lidar, radar, fused into a world model. Mostly deep learning. Not formally verifiable today.
  • Prediction. What will the pedestrian, the cyclist, or the car in the next lane do next? Mostly statistical models.
  • Planning. Given the world model, choose a trajectory. Often a hybrid of optimisation and rule-based logic.
  • Control. Execute the trajectory. Classical control theory, formally tractable.

Each layer needs a different verification approach. No single formalism covers them all.
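The layered split can be sketched as a set of interfaces, one per subsystem, with the verification strategy noted against each. All names here are illustrative, not any production stack:

```python
# Schematic of the four-layer AV stack (names illustrative).
# The comment on each layer notes its verification strategy.
from dataclasses import dataclass
from typing import List, Protocol, Tuple

@dataclass
class WorldModel:              # output of perception: fused objects + ego state
    objects: List[str]

@dataclass
class Trajectory:              # output of planning: a time-stamped path
    waypoints: List[Tuple[float, float]]

class Perception(Protocol):    # deep learning -> stress simulation
    def perceive(self, sensors: dict) -> WorldModel: ...

class Prediction(Protocol):    # statistical models -> scenario testing
    def predict(self, world: WorldModel) -> WorldModel: ...

class Planner(Protocol):       # rules + optimisation -> model checking
    def plan(self, world: WorldModel) -> Trajectory: ...

class Controller(Protocol):    # classical control -> formal proofs
    def track(self, trajectory: Trajectory) -> dict: ...
```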

The composition rules

Multi-formalism verification means: pick the right formalism for each layer, then compose the safety arguments into a single case.

  • Control layer. Use Lyapunov-style proofs of stability and reachability. Tools: dReal, Flow*, CORA. Coverage: complete, for the linearised dynamics within the operating envelope.
  • Planning layer. Use temporal logic (LTL or STL) to specify rules ("always yield to pedestrians at marked crossings"). Verify with model checking. Coverage: complete for the encoded rule set.
  • Prediction and perception layers. These are statistical. Use stress simulation (millions of synthetic scenarios with edge cases injected) plus targeted real-world testing for the long tail.
  • Composition. A goal structuring notation (GSN) safety case ties the pieces together, with explicit assumptions about what each formalism does and does not cover.
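A planning-layer rule like "always yield to pedestrians at marked crossings" has the temporal-logic shape G(pedestrian_at_crossing → F_[0,2] stopped). A minimal hand-rolled discrete-time monitor for that shape, well short of a real model checker, looks like this:

```python
# Minimal monitor for G(pedestrian_at_crossing -> F_[0,2] stopped):
# whenever a pedestrian is at the crossing, the car stops within 2 steps.
# A hand-rolled sketch, not a full STL tool or model checker.

def always_implies_eventually(trace, horizon=2):
    """trace: list of (pedestrian_at_crossing, stopped) booleans per step."""
    for i, (pedestrian, _) in enumerate(trace):
        window = trace[i:i + horizon + 1]
        if pedestrian and not any(stopped for _, stopped in window):
            return False  # obligation violated: no stop within the horizon
    return True

ok = [(False, False), (True, False), (True, True), (False, False)]
bad = [(True, False), (True, False), (True, False), (True, False)]
print(always_implies_eventually(ok), always_implies_eventually(bad))  # True False
```

A model checker verifies the same property over every behaviour the planner model can produce, not just one recorded trace; the monitor is the runtime counterpart.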

What ISO 21448 (SOTIF) demands

The relevant standard is ISO 21448, Safety of the Intended Functionality. It requires that the safety case address not only random hardware faults (covered by ISO 26262) but also functional insufficiencies: edge cases the perception stack misclassifies, scenarios the planner did not anticipate. These are not bugs; they are gaps in the specification of "intended functionality".

SOTIF demands an explicit accounting of known-safe scenarios, known-unsafe scenarios, and the unknown set. The verification job is to shrink the unknown set as fast as feasible, and to argue that what remains is below an acceptable risk threshold.
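The known/unknown partition can be kept as explicit, auditable sets, with each round of verification moving scenarios out of the unknown set. A toy sketch with invented scenario names:

```python
# Sketch of SOTIF-style scenario bookkeeping (scenario names invented).
# Verification work moves scenarios out of `unknown` into a known set.
known_safe = {"highway_cruise", "marked_crossing_yield"}
known_unsafe = {"sensor_blinded_by_low_sun"}    # mitigations required
unknown = {"debris_on_offramp", "animal_on_rural_road"}

def triage(scenario, passed_tests):
    """After targeted testing, reclassify a scenario from the unknown set."""
    unknown.discard(scenario)
    (known_safe if passed_tests else known_unsafe).add(scenario)

triage("debris_on_offramp", passed_tests=False)
print(sorted(unknown))  # the residual set the safety case must bound
```

The residual `unknown` set is what the risk-threshold argument has to cover.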

Where the formal proofs can be definitive

For specific subsystems with clean formal properties, the proofs can be definitive.

  • Emergency braking activation logic. Verifiable as a state machine with reactive temporal-logic properties.
  • Lane-keep controller within a defined operating design domain. Reachability analysis bounds the trajectory error.
  • Right-of-way rules at intersections. Encodable as STL, model-checked against scenario models.
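For logic this small, "definitive" means exhaustive: enumerate every reachable state and input combination and check the property on all of them. A toy version for the emergency-braking activation logic, with hypothetical transition rules:

```python
# Exhaustive check of a toy emergency-braking state machine (transition
# rules hypothetical). Property: whenever the obstacle is critical, the
# next state is BRAKING -- checked over every (state, input) combination.
from itertools import product

STATES = {"CRUISING", "WARNING", "BRAKING"}

def step(state, obstacle_critical, obstacle_near):
    if obstacle_critical:
        return "BRAKING"
    if obstacle_near:
        return "WARNING"
    return "CRUISING"

violations = [
    (s, crit, near)
    for s, crit, near in product(STATES, [True, False], [True, False])
    if crit and step(s, crit, near) != "BRAKING"
]
print("property holds:", not violations)
```

Real activation logic has far more states and timed behaviour, but the principle is the same: the state space is finite and small enough to check completely.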

These are pieces, not the whole. But they are pieces where the safety case can be airtight, freeing the simulation and road-testing budget for the genuinely uncertain parts of the stack.

What I am working on

My 2025 TechRxiv preprint, Multi-Formalism Verification for Cyber-Physical Systems, lays out the composition rules in detail. The headline contribution is a method for combining differently-typed safety claims (deterministic from formal verification, probabilistic from simulation, statistical from field data) into a single SOTIF-compliant safety case. The mechanism is a Bayesian network over the underlying assumptions, with sensitivity analysis to identify which assumption changes would invalidate the case.
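The flavour of the mechanism can be illustrated with a toy: each claim holds only conditional on its assumptions, the case confidence is the composition, and sensitivity analysis weakens one assumption at a time to see which most degrades the case. All numbers here are invented; this is not the preprint's actual model:

```python
# Toy composition of assumption-conditioned safety claims with a
# one-at-a-time sensitivity sweep. Numbers invented; NOT the preprint's model.
assumptions = {"odd_respected": 0.99, "sim_fidelity": 0.95, "sensor_spec": 0.98}
claims = {   # claim -> (assumptions it depends on, confidence given them)
    "control_reachability": (["odd_respected"], 1.0),           # formal proof
    "planner_rules": (["odd_respected"], 1.0),                  # model checked
    "perception_miss_rate": (["sim_fidelity", "sensor_spec"], 0.97),  # field data
}

def case_confidence(a):
    p = 1.0
    for deps, conf in claims.values():
        dep_p = 1.0
        for d in deps:
            dep_p *= a[d]
        p *= dep_p * conf
    return p

base = case_confidence(assumptions)
# Sensitivity: how much does the case degrade if one assumption weakens?
for name in assumptions:
    weakened = dict(assumptions, **{name: assumptions[name] - 0.05})
    print(name, round(base - case_confidence(weakened), 4))
```

In this toy, weakening `odd_respected` hurts most because two claims depend on it, which is exactly the kind of shared-assumption exposure the sensitivity analysis is meant to surface.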

The takeaway

The 11 billion miles number is not a target to hit, it is a sign that the strategy is wrong. The path to provably safe AVs is not more miles, it is better composition of the evidence we already collect. The verification science exists. The work is to standardise its use.

This article was originally published on Medium. The canonical version lives here.