Reinforcement learning under fog, without the bloat.
Compressed Suffix Memory: a heuristic-bounded, Boltzmann-sampled successor to USM that learns faster on POMDPs without the exponential state explosion.
At a glance
01Why partial observability is hard
POMDPs are reinforcement learning's hardest legitimate setting. The agent never sees the true state, just a noisy partial slice, and has to keep enough history to act sensibly. Robotic exploration, autonomous driving, network control: almost every interesting RL problem in the wild is really a POMDP wearing a costume.
Instance-based methods like Utile Suffix Memory build a suffix tree over observation-action histories, treating tree nodes as the state space. They work, but the tree grows exponentially as trajectories accumulate, fills up with redundant states, and ε-greedy exploration on top tends to overfit.
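To make the suffix-tree idea concrete, here is a minimal sketch of a USM-style node, assuming a plain dictionary-keyed tree walked backwards over the history. The class and method names (`SuffixNode`, `lookup`) are illustrative, not from the paper.

```python
from collections import defaultdict

class SuffixNode:
    """One node in a USM-style suffix tree (illustrative sketch).

    Each node corresponds to a history suffix; the deepest node that
    matches the recent history acts as the agent's 'state'.
    """
    def __init__(self, depth=0):
        self.depth = depth
        self.children = {}            # (action, observation) -> SuffixNode
        self.instances = []           # experience instances stored here
        self.q = defaultdict(float)   # per-action value estimates

    def lookup(self, history):
        """Match history backwards (most recent pair first) and return
        the deepest existing node; used for action selection and updates.

        `history` is a list of (action, observation) pairs, oldest first.
        """
        node = self
        for key in reversed(history):
            if key not in node.children:
                break
            node = node.children[key]
        return node
```

Because every distinct suffix can spawn a node, an unbounded tree of this shape is exactly where the exponential blow-up comes from.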
02What CSM changes
Three modifications, each compounding:
- Heuristic-bounded depth. Run a short blind exploration to estimate the effective path length from start to goal, then cap suffix-tree depth around that bound. The cap removes most of the redundancy without losing the predictive subtrees.
- Adaptive instance density thresholds. Instead of a fixed instance count for splitting nodes, scale it with locally observed reward variance. Sparse, low-variance regions stay coarse; dense, ambiguous regions get refined.
- Boltzmann sampling. Replace ε-greedy with softmax over Q-values. The same exploration budget gets spent more wisely, which matters disproportionately under partial observability.
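The three modifications above can be sketched as small, separable functions. This is a hedged illustration, not the paper's implementation: the function names and the particular scaling constants (`slack`, `base`, `scale`) are assumptions chosen to show the shape of each rule.

```python
import math
import random
import statistics

def depth_bound(blind_path_lengths, slack=1.2):
    """Heuristic depth cap: after a short blind exploration phase,
    cap suffix-tree depth near the shortest observed start-to-goal
    path length (slack factor is a hypothetical choice)."""
    return max(1, int(slack * min(blind_path_lengths)))

def split_threshold(rewards, base=8, scale=20.0):
    """Adaptive instance-density threshold for splitting a node.

    Low local reward variance -> keep the high base threshold (stay
    coarse); high variance -> lower the threshold so ambiguous regions
    get refined sooner. The linear scaling here is illustrative.
    """
    if len(rewards) < 2:
        return base
    var = statistics.pvariance(rewards)
    return max(2, int(base - min(base - 2, scale * var)))

def boltzmann_action(q_values, temperature=0.5):
    """Softmax (Boltzmann) action selection over a dict of Q-values.

    Every action keeps nonzero probability, but exploration mass is
    concentrated on promising actions; temperature controls greediness.
    """
    actions = list(q_values)
    m = max(q_values[a] / temperature for a in actions)  # stabilize exp
    weights = [math.exp(q_values[a] / temperature - m) for a in actions]
    r = random.random() * sum(weights)
    for a, w in zip(actions, weights):
        r -= w
        if r <= 0:
            return a
    return actions[-1]
```

Note how the pieces compound: the depth cap shrinks the tree, the variance-scaled threshold decides where the remaining capacity goes, and Boltzmann sampling spends the exploration budget in proportion to what the refined nodes already know.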
03What the benchmarks show
"USM was right about the shape of the answer. CSM is right about how big to make it."
On benchmark mazes the comparison is clean: CSM converges faster, lands at higher final reward, and produces a much smaller suffix tree at convergence. The advantage scales with maze complexity, exactly the regime where USM was struggling.
04Where this connects
This work is the RL ancestor of the cognitive-radio paper, where deep reinforcement learning under partial spectrum observability is the same problem with a different reward signal. The same instinct (exploit cheap structural information before unleashing expensive learning) runs through both.
FAQWhat people ask me about this paper
Q1Why not just use a deep RNN?
Q2Is CSM compatible with deep function approximation?
Q3What is the worst case for CSM?
Q4How does Boltzmann beat ε-greedy here specifically?
Q5How does this connect to the green-cognitive-radio paper?
CITEHow to cite this paper
@inproceedings{badami2024rlcsm,
author = {Shujaatali Badami},
title = {Optimizing Reinforcement Learning in Partially Observable Environments Using Compressed Suffix Memory Algorithm},
booktitle = {IEEE ICEACE 2024},
year = {2024},
publisher = {IEEE},
doi = {10.1109/ICEACE63551.2024.10899025}
}
S. Badami, "Optimizing Reinforcement Learning in Partially Observable Environments Using Compressed Suffix Memory Algorithm," in IEEE ICEACE 2024, 2024, doi: 10.1109/ICEACE63551.2024.10899025.
Badami, S. (2024). Optimizing Reinforcement Learning in Partially Observable Environments Using Compressed Suffix Memory Algorithm. In IEEE ICEACE 2024. https://doi.org/10.1109/ICEACE63551.2024.10899025
TY  - CONF
AU  - Badami, Shujaatali
TI  - Optimizing Reinforcement Learning in Partially Observable Environments Using Compressed Suffix Memory Algorithm
T2  - IEEE ICEACE 2024
PB  - IEEE
PY  - 2024
DO  - 10.1109/ICEACE63551.2024.10899025
ER  -