Reinforcement learning under fog, without the bloat.
Compressed Suffix Memory: a heuristic-bounded, Boltzmann-sampled successor to USM that learns faster on POMDPs without the exponential state explosion.
At a glance
01Why partial observability is hard
POMDPs are reinforcement learning's hardest legitimate setting. The agent never sees the true state, just a noisy partial slice, and has to keep enough history to act sensibly. Robotic exploration, autonomous driving, network control: almost every interesting RL problem in the wild is really a POMDP wearing a costume.
Instance-based methods like Utile Suffix Memory build a suffix tree over observation-action histories, treating tree nodes as the state space. They work, but the tree grows exponentially as trajectories accumulate, fills up with redundant states, and ε-greedy exploration on top tends to overfit.
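To make the suffix-tree idea concrete, here is a minimal sketch of a USM-style node, assuming a plain dictionary-keyed tree walked backwards over the history. The class and method names (`SuffixNode`, `lookup`) are illustrative, not from the paper.

```python
from collections import defaultdict

class SuffixNode:
    """One node in a USM-style suffix tree (illustrative sketch).

    Each node corresponds to a history suffix; the deepest node that
    matches the recent history acts as the agent's 'state'.
    """
    def __init__(self, depth=0):
        self.depth = depth
        self.children = {}            # (action, observation) -> SuffixNode
        self.instances = []           # experience instances stored here
        self.q = defaultdict(float)   # per-action value estimates

    def lookup(self, history):
        """Match history backwards (most recent pair first) and return
        the deepest existing node; used for action selection and updates.

        `history` is a list of (action, observation) pairs, oldest first.
        """
        node = self
        for key in reversed(history):
            if key not in node.children:
                break
            node = node.children[key]
        return node
```

Because every distinct suffix can spawn a node, an unbounded tree of this shape is exactly where the exponential blow-up comes from.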
02What CSM changes
Three modifications, each compounding:
- Heuristic-bounded depth. Run a short blind exploration to estimate the effective path length from start to goal, then cap suffix-tree depth around that bound. The cap removes most of the redundancy without losing the predictive subtrees.
- Adaptive instance density thresholds. Instead of a fixed instance count for splitting nodes, scale it with locally observed reward variance. Sparse, low-variance regions stay coarse; dense, ambiguous regions get refined.
- Boltzmann sampling. Replace ε-greedy with softmax over Q-values. The same exploration budget gets spent more wisely, which matters disproportionately under partial observability.
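The three modifications above can be sketched as small, separable functions. This is a hedged illustration, not the paper's implementation: the function names and the particular scaling constants (`slack`, `base`, `scale`) are assumptions chosen to show the shape of each rule.

```python
import math
import random
import statistics

def depth_bound(blind_path_lengths, slack=1.2):
    """Heuristic depth cap: after a short blind exploration phase,
    cap suffix-tree depth near the shortest observed start-to-goal
    path length (slack factor is a hypothetical choice)."""
    return max(1, int(slack * min(blind_path_lengths)))

def split_threshold(rewards, base=8, scale=20.0):
    """Adaptive instance-density threshold for splitting a node.

    Low local reward variance -> keep the high base threshold (stay
    coarse); high variance -> lower the threshold so ambiguous regions
    get refined sooner. The linear scaling here is illustrative.
    """
    if len(rewards) < 2:
        return base
    var = statistics.pvariance(rewards)
    return max(2, int(base - min(base - 2, scale * var)))

def boltzmann_action(q_values, temperature=0.5):
    """Softmax (Boltzmann) action selection over a dict of Q-values.

    Every action keeps nonzero probability, but exploration mass is
    concentrated on promising actions; temperature controls greediness.
    """
    actions = list(q_values)
    m = max(q_values[a] / temperature for a in actions)  # stabilize exp
    weights = [math.exp(q_values[a] / temperature - m) for a in actions]
    r = random.random() * sum(weights)
    for a, w in zip(actions, weights):
        r -= w
        if r <= 0:
            return a
    return actions[-1]
```

Note how the pieces compound: the depth cap shrinks the tree, the variance-scaled threshold decides where the remaining capacity goes, and Boltzmann sampling spends the exploration budget in proportion to what the refined nodes already know.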
03What the benchmarks show
"USM was right about the shape of the answer. CSM is right about how big to make it."
On benchmark mazes the comparison is clean: CSM converges faster, lands at higher final reward, and produces a much smaller suffix tree at convergence. The advantage scales with maze complexity, exactly the regime where USM was struggling.
04Where this connects
This work is the RL ancestor of the cognitive-radio paper, where deep reinforcement learning under partial spectrum observability is the same problem with a different reward signal. The same instinct (exploit cheap structural information before unleashing expensive learning) runs through both.
FAQWhat people ask me about this paper
Q1Why not just use a deep RNN?
Q2Is CSM compatible with deep function approximation?
Q3What is the worst case for CSM?
Q4How does Boltzmann beat ε-greedy here specifically?
Q5How does this connect to the green-cognitive-radio paper?
CITEHow to cite this paper
@inproceedings{badami2024rlcsm,
author = {Shujaatali Badami},
title = {Optimizing Reinforcement Learning in Partially Observable Environments Using Compressed Suffix Memory Algorithm},
booktitle = {IEEE ICEACE 2024},
year = {2024},
publisher = {IEEE},
doi = {10.1109/ICEACE63551.2024.10899025}
}
S. Badami, "Optimizing Reinforcement Learning in Partially Observable Environments Using Compressed Suffix Memory Algorithm," in IEEE ICEACE 2024, 2024, doi: 10.1109/ICEACE63551.2024.10899025.
Badami, S. (2024). Optimizing Reinforcement Learning in Partially Observable Environments Using Compressed Suffix Memory Algorithm. In IEEE ICEACE 2024. https://doi.org/10.1109/ICEACE63551.2024.10899025
TY  - CONF
AU  - Badami, Shujaatali
TI  - Optimizing Reinforcement Learning in Partially Observable Environments Using Compressed Suffix Memory Algorithm
T2  - IEEE ICEACE 2024
PB  - IEEE
PY  - 2024
DO  - 10.1109/ICEACE63551.2024.10899025
ER  -