Why the B-tree wins on the cloud
PostgreSQL, MySQL's InnoDB, and even embedded SQLite all default to B-tree (or B+-tree) storage. The reason is balance: reads are O(log n) with a low constant, writes are O(log n) with the same constant, and range scans run sequentially along leaf pages. RAM caches the hot pages, the disk handles the cold ones, and SSDs make random IO cheap enough that the read-write balance does not hurt. On a server with 64 GB of RAM and an NVMe SSD, B-trees are excellent.
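As a rough sketch of why the read constant is low: the number of pages a point lookup touches is the tree height, which grows with the log of the row count over the page fanout. The fanout of 200 below is illustrative, not from any particular engine:

```python
import math

# Pages touched by one B-tree point lookup: the tree height,
# ceil(log_f(n)) for fanout f. Figures are illustrative.
def btree_height(n_rows, fanout=200):
    return math.ceil(math.log(n_rows) / math.log(fanout))

print(btree_height(10**8))  # a 100M-row table is ~4 levels deep
```

With the root and inner levels cached in RAM, most of those page touches never reach the disk at all.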
Why the B-tree loses on the edge
Take the same B-tree and run it on an ESP32 with 320 KB of SRAM, 4 MB of flash, and a sensor that produces 100 writes a second. Three things break.
- Write amplification. Each insert touches a leaf page, possibly a parent, possibly the root. A 100-byte sensor reading triggers a 4 KB page write, roughly 40x amplification. Multiply by 100 writes per second, divide into the flash's roughly 10,000 program/erase cycles, and the device wears out in months.
- Power loss. A B-tree split interrupted mid-write, with the pages half-flushed, leaves the index inconsistent. WAL recovery helps, but replay needs spare RAM that the ESP32 does not have.
- Range scans of recent data. The most common edge query is "give me the last 10 minutes of telemetry". B-trees order by key, not by insertion time, so recent data is scattered across whichever leaves its keys sort into.
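The back-of-envelope arithmetic behind the first bullet can be checked in a few lines, using the figures from above:

```python
# Write amplification for a page-based B-tree on small flash.
# All figures are illustrative, taken from the text above.
record_bytes = 100        # one sensor reading
page_bytes = 4096         # flash page rewritten per insert
writes_per_second = 100   # sensor rate

amplification = page_bytes / record_bytes                  # ~41x
bytes_per_day = page_bytes * writes_per_second * 86_400    # ~33 GiB/day

print(f"write amplification: {amplification:.0f}x")
print(f"flash traffic: {bytes_per_day / 2**30:.1f} GiB/day")
```

Tens of gigabytes of page traffic per day against 4 MB of flash is the core of the endurance problem; exact lifetime depends on wear levelling and the FTL.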
What an LSM-tree does differently
An LSM-tree splits storage into two regions. The memtable is a small in-memory sorted structure (a skiplist or red-black tree). Writes go to the memtable plus a sequential, append-only WAL on disk. When the memtable fills, it is flushed to disk as a sorted run (an SSTable). Subsequent flushes produce more SSTables, and a background compaction process merges and shrinks them.
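A minimal sketch of that write path, with in-memory lists standing in for the on-disk WAL and SSTable files. The MiniLSM name and all parameters are mine, not from any engine:

```python
class MiniLSM:
    """Toy LSM: dict memtable + append-only WAL + sorted runs (SSTables)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}              # in-memory, sorted on flush
        self.wal = []                   # stands in for the append-only log file
        self.sstables = []              # newest-first list of sorted runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))   # sequential append, durable first
        self.memtable[key] = value      # then the in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())  # one sequential file write
        self.sstables.insert(0, run)
        self.memtable.clear()
        self.wal.clear()                # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:        # newest data first
            return self.memtable[key]
        for run in self.sstables:       # then newest SSTable to oldest
            for k, v in run:            # real SSTables binary-search here
                if k == key:
                    return v
        return None

db = MiniLSM()
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"), len(db.sstables))   # 3 2
```

Compaction is omitted here; in a real engine a background task would merge the runs in `sstables` back into fewer, larger ones.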
The four properties that fall out:
- Writes are sequential. The WAL is append-only and the memtable lives in RAM; even the background flush writes its SSTable sequentially. There is no random write on the hot path.
- Power loss is cheap. The WAL is the source of truth. On reboot, replay the WAL, rebuild the memtable, done.
- Recent data is colocated. The newest SSTable holds the newest data. Range scans of "last 10 minutes" hit one or two files, not a tree-wide scan.
- Flash endurance is preserved. Sequential writes are friendly to flash translation layers (FTL). Garbage collection on the SSD does less work.
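The second property, cheap recovery, can be sketched with a hypothetical JSON-lines WAL; real engines add per-record checksums and fsync:

```python
import json, os, tempfile

# Append each write to a WAL file before acknowledging it.
wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
with open(wal_path, "a") as wal:
    for key, value in [("t1", 20.5), ("t2", 21.0), ("t1", 20.7)]:
        wal.write(json.dumps([key, value]) + "\n")
        wal.flush()                    # real code would also fsync here

# --- power loss happens here; RAM contents are gone ---

# On reboot, replay the log in order to rebuild the memtable.
memtable = {}
with open(wal_path) as wal:
    for line in wal:
        key, value = json.loads(line)
        memtable[key] = value          # later entries win

print(memtable)                        # {'t1': 20.7, 't2': 21.0}
```

Replay is a single sequential read, and the RAM needed is bounded by the memtable size, not by the size of the database.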
The cost: read amplification
The trade-off is reads. A point lookup may need to check the memtable plus every SSTable on disk; with 10 SSTables, that is up to 11 lookups. Two mitigations make this practical at the edge: a Bloom filter per SSTable (skip the file if the key is definitely not in it) and a small block cache for hot index and data blocks. With both, an average read costs roughly 1 to 2 IOs.
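A minimal Bloom filter sketch, with illustrative parameters and blake2b standing in for the purpose-built hashes real engines use:

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: m bits, k hash positions derived from blake2b."""

    def __init__(self, m_bits=1024, k=4):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits // 8)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.blake2b(key.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means definitely absent; True means "check the SSTable".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = Bloom()
for k in ("sensor:001", "sensor:002"):
    bf.add(k)
print(bf.might_contain("sensor:001"), bf.might_contain("sensor:999"))
```

One such filter per SSTable lets a point lookup skip most files without touching the flash at all.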
Tuning for the edge
The default tunings of LSM engines such as RocksDB and LevelDB target servers with abundant RAM. For the edge, these are the levers worth pulling:
- Smaller memtable. 64 KB instead of 64 MB. More frequent flushes, but RAM stays free for the application and WAL replay after power loss stays short.
- Tiered compaction over levelled compaction. Levelled compaction has lower read amplification but rewrites far more data; tiered compaction trades some read cost for much less write work, which is what flash endurance needs.
- Time-to-live expiry. Telemetry has a known retention window; let compaction drop expired entries rather than keeping them forever.
- Bloom filter bits per key. 8 to 10 is the standard server choice. On the edge, 4 to 6 is enough and saves RAM.
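The RAM cost of that last lever follows from the textbook Bloom filter approximation: with an optimal hash count, the false-positive rate is roughly 0.6185 raised to the bits-per-key. A quick check of the server and edge settings (the formula is the standard approximation, not a measurement):

```python
def bloom_fp_rate(bits_per_key):
    # Optimal-k approximation: fp ≈ (1/2)^(ln 2 · bits/key) ≈ 0.6185^(bits/key)
    return 0.6185 ** bits_per_key

for b in (4, 6, 8, 10):
    print(f"{b} bits/key -> {bloom_fp_rate(b):.1%} false positives")
```

Ten bits per key gives well under 1% false positives; four to six bits gives roughly 5 to 15%, which for a write-heavy telemetry workload usually just means the occasional wasted SSTable probe.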
What I built
This pattern is the basis of my IEEE Xplore paper on tuning LSM-trees for IoT and edge environments (Performance Tuning of LSM-Tree Database, 2024). The headline finding is that the right combination of memtable size, compaction strategy, and Bloom filter sizing yields a 3 to 5 times improvement in 95th-percentile write latency on a 4 MB device, with no measurable read regression.
The takeaway
Storage engines reflect the assumptions of the workload they were designed for. B-trees fit cloud workloads. LSM-trees fit edge workloads. The job, when designing for the edge, is to recognise which assumptions actually hold and pick the primitive that matches.