
Note · 2026-01-24

How AI makes decisions when it cannot see everything

Almost every real-world decision an AI makes happens under partial observability. The clean MDP assumption is the exception, not the rule.

A POMDP is a Markov decision process where the agent observes only part of the true state. Almost every real-world AI runs in this setting (autonomous driving, network control, cognitive radio, medical diagnosis, fraud detection). Two ingredients turn POMDPs from intractable to workable: a state abstraction that compresses observation history without exponential blowup, and an exploration policy that respects the noise in the agent's own value estimates. My IEEE ICEACE 2024 paper shipped both.

The problem in one paragraph

Standard reinforcement learning assumes the agent knows the state of the world at every time step. Partial observability says: it does not. The agent observes a measurement, the measurement is consistent with many true states, and the right action depends on which true state you are in. The classic example is poker. The cards in opponents' hands matter for your decision. You cannot see them. You can infer.
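The "you can infer" step is a Bayes filter over hidden states. A minimal sketch, assuming a discrete state space with a known transition model `T` and observation model `O` (the function name and dictionary layout here are illustrative, not from the paper):

```python
# Minimal discrete Bayes filter: carry a belief (distribution over
# hidden states) forward through an action, then condition on the
# observation. All model numbers in any usage are made up.

def belief_update(belief, action, observation, T, O):
    """belief: dict state -> prob; T[(s, a)]: dict s' -> prob;
    O[(s', a)]: dict observation -> prob."""
    # Prediction step: push the belief through the transition model.
    predicted = {}
    for s, p in belief.items():
        for s2, pt in T[(s, action)].items():
            predicted[s2] = predicted.get(s2, 0.0) + p * pt
    # Correction step: reweight by observation likelihood, normalise.
    posterior = {s2: p * O[(s2, action)].get(observation, 0.0)
                 for s2, p in predicted.items()}
    z = sum(posterior.values())
    return {s2: p / z for s2, p in posterior.items()}
```

Starting from a 50/50 belief over two states, one observation that is more likely in state A than state B shifts the belief toward A without ever revealing the true state, which is exactly the poker situation above.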

Why this is the common case

Examples where the state is partially observable:

  • Autonomous driving: occluded vehicles and pedestrians, sensor noise, weather effects.
  • Cognitive radio: the spectrum slice you sense is a tiny subset of total spectrum.
  • Medical diagnosis: lab results are noisy proxies for the underlying condition.
  • ICS anomaly detection: the SCADA telemetry you observe is downsampled and filtered.
  • Fraud detection: you see transactions, not the actor's true intent.
  • Network control: you see packet flows, not endpoint state.

The clean MDP assumption (full observability) is closer to the exception than the rule. Most production AI quietly papers over this with "treat the observation as the state" tricks that work in some regimes and fail in others.

The two hard problems

An agent in a POMDP faces two specific hard problems.

First, state representation. You need to compress the history of observations into something the value function can read, without that something blowing up exponentially. Naive approaches keep the entire history (intractable). Recurrent neural networks compress it but need a lot of training data. Suffix memory compresses it non-parametrically, with instance-based methods.

Second, exploration. The Q-value estimates are themselves noisy because they are aggregating over latent states. Epsilon-greedy ignores that noise and picks confidently from a confused estimator. Boltzmann (softmax) sampling respects the noise: when the estimator is uncertain, the agent explores more.
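The contrast is easy to see in code. A minimal Boltzmann (softmax) action sampler, written from the standard definition rather than taken from the paper:

```python
import math
import random

def boltzmann_action(q_values, temperature):
    """Sample an action index with probability proportional to
    exp(Q / T). High T -> near-uniform exploration; low T -> greedy."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    r, acc = random.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if r < acc:
            return a
    return len(q_values) - 1
```

When the Q-values are close together (a confused estimator), the softmax probabilities are close to uniform and the agent explores; epsilon-greedy would instead commit to the tiny, noise-driven argmax with probability 1 − ε.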

What suffix memory does

Andrew McCallum's Utile Suffix Memory (1995) was an early instance-based POMDP method. The idea is elegant: the relevant abstraction of history is the shortest suffix that is predictive of value. Build a tree of suffixes, split a node when a finer suffix yields a statistically significant difference in value estimates, and prune nodes that do not.
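The split decision is the heart of it. A toy sketch of the idea, under loud assumptions: the crude mean-gap test below stands in for the proper statistical test USM uses, and every name and threshold is illustrative:

```python
# Toy sketch of suffix splitting: a leaf covering suffix s splits when
# conditioning on one more observation step gives significantly
# different value estimates. The mean-gap check is a placeholder for
# a real significance test; min_samples and gap are made-up knobs.

from statistics import mean

def should_split(returns_by_extension, min_samples=5, gap=0.5):
    """returns_by_extension: dict mapping each one-step-longer suffix
    to the list of discounted returns recorded under it."""
    groups = [r for r in returns_by_extension.values()
              if len(r) >= min_samples]
    if len(groups) < 2:
        return False  # not enough evidence to distinguish suffixes
    means = [mean(g) for g in groups]
    return max(means) - min(means) > gap
```

If extending the suffix by one observation separates the recorded returns, the longer suffix carries value-relevant information and the leaf splits; otherwise the shorter suffix is already a sufficient abstraction.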

The problem with USM as published: it can grow unbounded when the environment has long-horizon dependencies, and it pairs poorly with epsilon-greedy. My 2024 IEEE ICEACE paper introduced Compressed Suffix Memory (CSM), which solves both.

What CSM adds

Two changes:

  • Heuristic depth bound: instead of letting the suffix tree grow until splits stop being significant, cap depth based on a problem-specific heuristic (in benchmark mazes, the diameter of the state graph). Stops the exponential blowup.
  • Boltzmann exploration: replace epsilon-greedy with softmax over Q-values, with temperature annealing. Respects the uncertainty in the value estimates.
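Both changes are small mechanically. A sketch under assumed names (the truncation rule, schedule shape, and all constants here are illustrative placeholders, not the paper's exact values):

```python
# Sketch of the two CSM modifications: bound the suffix depth at a
# problem-specific cap, and anneal the Boltzmann temperature over
# episodes. Constants are illustrative, not from the paper.

def suffix_for(history, max_depth):
    """Truncate the observation-action history to the bounded suffix
    used as the agent's state abstraction."""
    return tuple(history[-max_depth:])

def annealed_temperature(episode, t0=1.0, t_min=0.05, decay=0.995):
    """Geometric annealing: explore broadly early, act near-greedily
    late, never dropping below a floor temperature."""
    return max(t_min, t0 * decay ** episode)
```

The depth cap is what keeps the tree from growing with every long-horizon dependency; the temperature floor keeps a sliver of exploration alive so the agent can recover from an early bad estimate.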

Result on benchmark POMDP mazes: CSM converges to the optimal policy faster than vanilla USM and stays robust to longer-horizon dependencies. Sole-author IEEE ICEACE 2024 paper; see the research deep-dive.

Where this matters beyond mazes

The same instinct shows up across the rest of the work.

  • Cognitive radio: the agent senses a spectrum slice (observation), not the full spectrum (state). See 6G cognitive radio security and the Green Cognitive Radio research.
  • ICS anomaly detection: telemetry is observation, plant state is hidden, attacker behaviour is doubly hidden.
  • Federated learning under non-IID data: each client sees a partial slice of the joint distribution.

Once you start naming the partial-observability shape of a problem out loud, you stop pretending it is a clean MDP and start picking algorithms that respect the noise.

The wider point

The mathematical name for "the agent does not know what state it is in" is POMDP. Most production ML systems get away with ignoring it. The ones that need to be reliable cannot. As we deploy AI into more safety-critical settings (driving, medicine, infrastructure), the POMDP toolkit moves from research curiosity to operational requirement.

Related

This article was originally published on Medium. The canonical version lives here.