Why "cognitive" radio at all
5G already uses massive MIMO and beamforming. 6G goes further: dense heterogeneous deployments, terahertz bands, ultra-dynamic spectrum sharing, integrated sensing and communication, and orders of magnitude more devices per square kilometre. A statically configured radio cannot keep up. The next-generation radio learns: it senses the spectrum, predicts primary-user activity, and allocates dynamically.
That makes it a POMDP: a partially observable Markov decision process. The radio observes spectrum slices, not the underlying occupancy state. Decisions are sequential. Rewards are delayed. Deep reinforcement learning is the natural tool.
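To make the framing concrete, here is a minimal toy loop, assuming a two-state Markov chain for primary-user occupancy. The parameters (`N_CHANNELS`, `P_ON`, `P_OFF`, `NOISE_STD`) are illustrative choices, not drawn from any paper.

```python
# Toy POMDP sketch: primary-user occupancy is the hidden state;
# the radio only ever sees a noisy per-channel energy reading.
import numpy as np

rng = np.random.default_rng(0)
N_CHANNELS = 8
P_ON, P_OFF = 0.2, 0.3      # hypothetical PU on/off transition probabilities
NOISE_STD = 0.4             # sensing noise on the observed spectrum slice

occupancy = rng.random(N_CHANNELS) < 0.5   # latent state, never observed

def step(action):
    """Transmit on channel `action`; return (observation, reward)."""
    global occupancy
    # The latent state evolves regardless of what the radio believes.
    turn_on  = (~occupancy) & (rng.random(N_CHANNELS) < P_ON)
    turn_off = occupancy & (rng.random(N_CHANNELS) < P_OFF)
    occupancy = (occupancy | turn_on) & ~turn_off
    # Reward: throughput on a free channel, penalty for a PU collision.
    reward = -1.0 if occupancy[action] else 1.0
    # Observation: noisy energy per channel, not the true occupancy.
    obs = occupancy.astype(float) + rng.normal(0.0, NOISE_STD, N_CHANNELS)
    return obs, reward

obs, r = step(action=3)   # the agent must infer occupancy from the obs history
```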
Where security gets weird
Securing a static radio means securing the protocol stack: authentication, key exchange, ciphers, integrity. That is standard cryptographic work: PQ-EDHOC, AES-GCM, the rest.
Securing a learning radio means all of that plus securing the policy. Adversarial inputs to the spectrum sensor can poison the policy gradient. Reward shaping can be gamed. The agent's exploration policy can leak information about secondary-user identity. The attack surface is no longer just the protocol stack: it is the controller's induction.
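To see what an attack on the sensing channel looks like, here is an FGSM-style evasion sketch against a toy linear policy. Everything here is hypothetical (the weights `W`, the perturbation budget `eps`); training-time poisoning abuses the same input channel, just accumulated over many gradient steps.

```python
# Illustrative evasion on the sensing input against a toy linear policy.
import numpy as np

rng = np.random.default_rng(1)
N = 8
W = rng.normal(size=(N, N))          # policy logits = W @ obs (toy model)

def act(obs):
    return int(np.argmax(W @ obs))   # greedy channel choice

obs = rng.normal(size=N)             # a clean spectrum observation
clean_action = act(obs)

# For a linear policy, the gradient of (logit_target - logit_clean)
# w.r.t. the observation is just W[target] - W[clean_action], so the
# FGSM step toward a chosen (occupied) channel is its sign times eps.
target = (clean_action + 1) % N
eps = 0.3                            # small perturbation budget
delta = eps * np.sign(W[target] - W[clean_action])
adv_action = act(obs + delta)
print(clean_action, adv_action)   # a small input shift can flip the decision
```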
The energy-aware reward insight
The default reward in cognitive radio is spectrum efficiency: bits per Hz, throughput, fairness. Energy comes in as a constraint, not an objective. That works for plug-in deployments but not for battery- or solar-powered 6G IoT, which is a large fraction of the projected device population.
The 2025 Bentham CYBPRO paper treats energy as a first-class component of the reward, with an explicit Pareto coefficient the controller can adjust at runtime to slide along the efficiency-energy trade-off. Result: 25 to 30 percent energy savings at 0.93 sensing AUC, where the equivalent fixed-allocation baseline either burns more power or misses primary-user activity.
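A minimal sketch of what such a reward could look like, assuming the simplest scalarisation: a weighted sum with a runtime-tunable coefficient. The names (`lam`, `throughput_bps_hz`, `energy_j`) are my own; the paper's actual formulation may be richer.

```python
def reward(throughput_bps_hz, energy_j, collided,
           lam=0.5, collision_penalty=1.0):
    """Scalarised two-objective reward for an energy-aware agent.

    lam = 1.0 recovers the classic spectrum-efficiency-only objective;
    lam = 0.0 optimises energy alone. Adjusting lam at runtime moves the
    operating point along the efficiency-energy Pareto front without
    retraining from scratch.
    """
    if collided:                     # hitting a primary user is always bad
        return -collision_penalty
    return lam * throughput_bps_hz - (1.0 - lam) * energy_j
```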
The POMDP through-line to the rest of my work
The exploration-vs-exploitation tradeoff under partial observability is the same problem that shows up in my IEEE ICEACE 2024 sole-author paper on Compressed Suffix Memory. Suffix-tree-based instance methods give you a state abstraction that scales to non-trivial POMDPs without the exponential state explosion of vanilla USM (Utile Suffix Memory), and Boltzmann sampling beats epsilon-greedy when Q-values are themselves noisy estimates over latent states. See the CSM research deep-dive.
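A quick sketch of the action-selection contrast, with an illustrative temperature `tau`: epsilon-greedy commits hard to whichever arm a noisy estimate happens to rank first, while Boltzmann sampling spreads trials in proportion to estimated value, so a single bad sample costs less.

```python
import numpy as np

rng = np.random.default_rng(2)

def boltzmann(q, tau=0.5):
    z = (q - q.max()) / tau          # subtract max for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(q), p=p))

def eps_greedy(q, eps=0.1):
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

# Two near-tied channels under noisy Q-estimates: the greedy rule locks
# onto whichever one the noise favours; Boltzmann keeps sampling both.
q_noisy = np.array([1.00, 0.98, 0.10, 0.05]) + rng.normal(0, 0.05, 4)
print([boltzmann(q_noisy) for _ in range(10)])
print([eps_greedy(q_noisy) for _ in range(10)])
```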
The same instinct shows up in the energy-aware cognitive radio paper. Same POMDP playbook, different payoff structure.
What is next
- Adversarial-robustness benchmark for the policy itself, not just the protocol stack.
- Hybrid PQC-protected telemetry feeding the controller, so the input channel is at least cryptographically protected even when the policy is not.
- Federated training across edge nodes, the IoT angle that ties this to IoT PQ-EDHOC and the smart-meter PQC stack.