阅读视图

发现新文章,点击刷新页面。

Recent LLVM hash table improvements

LLVM has several hash tables. They used quadratic probing within-band sentinel keys (empty, tombstone); recent work has been replacingthat with linear probing with tombstone key removed.

  • DenseMap (replacement forstd::unordered_map):DenseMapInfo::getEmptyKey() /getTombstoneKey().
    • DenseSet: implemented using DenseMap
    • compiler-rt/lib/sanitizer_common/sanitizer_dense_map.hports the implementation for sanitizers.
  • SmallPtrSet (replacement forstd::unordered_set<T *>): hard-coded -1(empty) and -2 (tombstone).
  • StringMap (replacement forstd::unordered_map<std::string, V>)
    • StringSet: implemented usingStringMap

For the open-addressed DenseMap andSmallPtrSet, pointers, references, and iterators areinvalidated by insert. StringMap is different: each entrylives in a heap-allocated StringMapEntry<V> node, soentry pointers survive grow. std::unordered_map, beingnode-based, keeps surviving-element pointers valid across both insertand erase and only invalidates the erased element's own iterator. LLVMcode rarely needs that stronger contract — callers do not holdlong-lived references into the container across mutation — and that gapis what gives pass to relocating erase and bit-array occupancy.

Recently,

  • Tombstones have been removed from DenseMap and SmallPtrSet.erase() also invalidates pointers.
  • DenseMap has also retired its empty-key sentinel, leading tosignificant performance improvements. DenseMap with integer keys(int/unsigned/size_t) had-1/-2 reserved — a footgun, now fixed.
  • StringMap got Algorithm R deletion too. Its entries are separatelyheap-allocated, so erase keeps entry pointers valid but invalidatesiterators; erase-while-iterating moved to remove_if.

SmallPtrSet

SmallPtrSet has a small mode, used when the number of elements isbelow a threshold, and a large mode. The tombstone state was removed by#96762 in2024. The patch changed erase() operation to invalidateiterators, requiring treewide fixes.

The large mode used a quadratic probing with two sentinel keys(empty, tombstone).

Knuth TAOCP Vol. 3 §6.4 Algorithm R describes an algorithm thatavoids lazy deletion. #197637 hasimplemented it.

I've investigated Robin Hood Hashing and Abseil Swiss Table familyimplementations - both would lead to inferior performance. Robin HoodHashing primarily improves find-miss on high-load factors but imposes asmall cost on find-hit and inserts.

DenseMap

Two major changes have been merged:

  • Replace the tombstone state with Algorithm R deletion. (#200595)
  • Replace the empty state with a bitarray. (#201281)

What the workload looks like

An instrumented clang counts every (KeyT, ValueT)operation while compilingllvm/lib/Analysis/ScalarEvolution.cpp. 597distinct DenseMap/DenseSet types,~186M user ops:

op count probes mean
find-hit 65.2M 1.55
find-miss 65.7M 1.28
insert 47.8M 1.71
erase 7.0M
grow 1.5M

Lookups are ~70% of the load; insert ~26%, erase ~4%. Probe means sitbetween 1.3 and 1.7 — well inside what linear probing handles.operator[] is classified at the bucket it lands on: apresent key is counted as find-hit, an empty slot as insert.

A few instantiations carry most of the traffic:

type find-hit find-miss (mean) insert erase grow peak
DenseMap<const void *, Pass *> 680K 24.8M (1.03) 202K 202K 9.3K 1 KiB
DenseMap<Value *, ValueHandleBase *> 2.8M (2.02) 0 2.2M 2.2M 11 1 MiB
DenseMap<const Value *, StringMapEntry<Value *> *> 3.0M 1.4M (2.26) 540K 439K 13 4 MiB
DenseMap<DeclarationName, StoredDeclsList> 772K 1.2M (1.94) 143K 0 9.0K 64 KiB
DenseMap<LazyCallGraph::Node *, LazyCallGraph::SCC *> 1.3M 98K (2.95) 175K 173K 15 258 KiB
DenseMap<AnalysisKey *, bool> 2.8M 2.0M (1.52) 2.0M 0 124K 1 KiB

DenseMap<const void *, Pass *> runs 25.5M lookupsat 1.03 mean — volume is not clustering.DenseMap<Value *, ValueHandleBase *> is the singlelargest erase consumer at 2.2M; this is the workload where the AlgorithmR relocating-erase callback earns its keep. Its find-miss column is zerobecause every access in llvm/lib/IR/Value.cpp goes throughoperator[] or erase — there is nofind/lookup call site, so an empty-slot probeis bucketed as insert rather than find-miss. The bucket's value-slotaddress is captured byValueHandleBase *&Entry = Handles[V] and stored as alinked-list back-pointer (PrevPtr), which is exactly whatthe new OnMoved callback refreshes when Algorithm R shiftsneighbors. StringMapEntry peaks at 4 MiB, the regime wherethe used-bit array's byte overhead matters most. Structural andgraph-pointer keys (DeclarationName,LazyCallGraph::Node *) still cost a couple of extra probesper miss.

Code size matters. SIMD tables may be a poor fit.

Linear probing +Algorithm R deletion (#200595)

Linear probing needs a strong pointer hash. The oldDenseMapInfo<T*>::getHashValue((p>>4)^(p>>9)) never mixes the high bits ofthe address. Pointers from one allocator grown across multiple slabs mapto the same narrow bucket range, bad under linear probing. Quadraticprobing had masked this issue. #197390switched to a stronger mixer, unblocking both SmallPtrSet and DenseMap.#199369tightened DenseMap's erase() to invalidate iterators underLLVM_ENABLE_ABI_BREAKING_CHECKS, surfacing stale-iteratorcall sites before Algorithm R could crash on them.

#200595 replaces quadratic probing plus tombstones with linearprobing plus Algorithm R. Two bucket states, no lazy markers. AlgorithmR requires each entry to sit on the contiguous probe chain from its homebucket; that's linear probing, not quadratic — the probe scheme had tochange. Erase walks forward from the hole and pulls back any entry whosehome bucket lies behind the hole, so entry pointers may beinvalidated:

1
2
3
4
5
6
7
8
9
10
11
12
13
// Erase buckets[i]; then close the gap.
unsigned j = i;
while (true) {
j = (j + 1) & mask;
if (!used(j)) break;
unsigned home = hash(buckets[j].key) & mask;
// The hole at i lies on j's probe chain from home → shift j into i.
if (((i - home) & mask) < ((j - home) & mask)) {
buckets[i] = std::move(buckets[j]);
i = j;
}
}
unsetUsed(i);

The new erase(Key, OnMoved) anderase(iterator, OnMoved) overloads fire a callback once pershifted bucket; ValueHandleBase::RemoveFromUseList usesthis to refresh PrevPtr.

The first attempt had a war story. The original landed as #199615,then got reverted by #200421after a SCEV crash: PoisoningVH cached a bare bucketpointer across an op that, post-Algorithm R, relocates. The bug waslatent under tombstones, which don't relocate, and only surfaced oncethe algorithm flipped. Fixed in #200540,relanded as #200595.

On stage1-O3 the change is -1.34% instructions andall ten CTMark benchmarks improve between -0.85% and -1.61% (compare).Clang wall is -1.54%, stage1 binary-1.40% — fewer bucket states means less generated code.The commit message remarks that "the in-band sentinel value approach… is the best, or at least very difficult to beat." The used-bitarray below earns the right to violate this.

Packed used-bit array (#201281)

The empty-slot test has switched from getEmptyKey() to abit test. There is a 1-bit-per-bucket uint32 array sharingone allocation with the buckets.

This is a trade-off

  • Find-miss, iteration, and large-bucket inserts win: the bit arrayterminates the probe walk and skips the empty buckets without everloading the bucket key.
  • Find-hit and small-bucket inserts lose: a hit loads the matchedbucket anyway, and a small-bucket insert pays one extra word write.

The aggregate numbers are interesting (compare).Stage1-O3 instructions are -0.99%, but stage2-O3 is+0.13% and stage2-O0-g is +0.34% —instructions actually go up. Clang wall time is-1.37% while instructions are only-0.49%, and size-file is+0.87% because the bit array costs bytes.instructions:u is the wrong cost model when the savings arecache loads: the counter sees the new bit-test sequence; it does not seethe bucket load you didn't issue.

After this patch, DenseMap<int/unsigned/size_t, X>accepts every value of the key domain.

Tree-wide simplification

  • getTombstoneKey() went out as ADT #200959,IR+Analysis#200958, CodeGen+Transforms#200956, llvm-rest#200957, Target#200955, and clang#200634.
  • getEmptyKey() removal (post-#201281) followed in BOLT#201986, clang#201987, flang#201988, lld #201989,lldb#201990, mlir#201991, Polly#201992, Target#201993, CodeGen+Transforms#201994, llvm#201996, IR+Analysis#201997, and ADT#201998.

Several of these commits showed small but consistent wins on thetracker. The IR+Analysis cleanups in particular — #200958for getTombstoneKey() and #201997for getEmptyKey() — each measured around -0.05% to -0.10%stage1-O3 instructions and -0.23% clang wall time. But the headline winis in the trait itself, not the timings: integer keys are now safe, theMLIR #54908 crash class is gone, and the AssertingVH downcast UB becomesmoot. The design discussion lives in #200183.

StringMap

StringMap keys are arbitrary byte strings, so each entryis a heap-allocated StringMapEntry<V> node; a bucketis just a pointer, paired with a 32-bit hash in a parallel array thatrejects mismatches before the byte-wise key compare.

#202103drops the tombstone and switches to linear probing with Algorithm R;xxh3 is strong, so there was no weak-mixer hazard to fix first. Theblast radius is milder than DenseMap: erase relocates onlythe bucket pointers, not the nodes, so entry pointers and referencessurvive — only iterators are invalidated.

That invalidation broke the erase-while-iterating idiom tombstoneshad supported by accident. #202237 and#202520add a DebugEpochBase so a stale iterator asserts underLLVM_ENABLE_ABI_BREAKING_CHECKS, and #202272 addsremove_if (one O(N) pass, like SmallPtrSet)for the migrated call sites.

No used-bit array: StringMap lacks the in-bucket keyload that justified it on DenseMap — occupancy is apointer-null test and the hash array already serves as the rejectfilter. A measured port helped find-miss but left find-hit flat,regressed iteration, and added a third array with no aggregate win.

Rejected alternatives

Prototyped, measured, rejected on SmallPtrSet andDenseMap.

  • Verstable (metadata + home-rooted chains). Winsnegative-probe and iteration; loses clang self-build —metadata + chain logic inlines into every call site for +4–10%binary. Combined allocation, NOINLINE cold paths, fragmentremoval all tried; ~+3% instructions / ~+5% size at best.
  • Robin Hood. The find-miss early-out depends onknowing each resident's displacement; without stored metadata itrequires re-hashing every resident on the probe walk, which costs morethan the saved probes. Storing the displacement recovers the early-outbut adds a metadata cache line, and the swap-carry on insert isintrinsic — +10–20% insert vs Algorithm R regardless oflayout.
  • boost::unordered_flat_map. Bad atpointer keys and code size.

DenseMap variantoptimized for pointer key?

I considered whether DenseMap for pointer keys should keep an in-bandsentinel variant (#200183) —it would avoid the used-bit array's max-rss overhead. Measurements onAMD Zen 4 and Apple M4 ruled it out: the used-bit version is fasterdespite the extra memory.

Takeaways

  • Linear probing + a good mixer beats quadratic + tombstones on apointer-heavy workload. #197390fixed the weak pointer hash.
  • Swiss Table family implementations are poor at small keys. Ifdeletion performance does not matter, you don't need its heavyimplementation.
  • SmallPtrSet was the dry run that de-riskedDenseMap: same algorithm, smaller blast radius, sameconclusion on rejected alternatives.
  • Tombstones aren't free at 4% erase — they slow down insertion.

Aggregate compile-timeimpact

All numbers below compare one squashed commit against its parent, thepre-series baseline 5d5220c5 (the commit before[SmallPtrSet] Drop tombstones in large mode, #197637). Thesquash bundles the whole series so the tracker sees the cumulativeeffect as a single from→to:

  • SmallPtrSet/DenseMap: #197637, #198982, #199369, #200540, #200595,#201281, #201742
  • IR/Analysis getTombstoneKey/getEmptyKeycleanups: #200958, #201997
  • StringMap: #202103, #202237, #202272, #202520

Aggregate measurements from llvm-compile-time-tracker.com:

Significant (≥3σ vs. measured noise): 🟢 improvement · 🔴 regression.Unmarked = within noise.

Configuration instructions:u max-rss
stage1-O3 🟢 -0.38% +0.06%
stage1-ReleaseThinLTO 🟢 -0.40% -0.19%
stage1-ReleaseLTO-g 🟢 -0.43% -0.01%
stage1-O0-g 🟢 -0.19% -0.16%
stage1-aarch64-O3 🟢 -0.35% +0.01%
stage1-aarch64-O0-g 🟢 -0.18% -0.20%
stage2-O3 🟢 -0.42% -0.13%
stage2-O0-g 🟢 -0.18% -0.03%

Clang build:

Metric Old New Δ
instructions:u 20882372M 20784665M 🟢 -0.47%
wall-time 350.61s 352.30s +0.48%
size-file 137834KiB 137726KiB 🟢 -0.08%
size-file (stage1) 154243KiB 154177KiB 🟢 -0.04%

Build wall-time is a single-shot measurement and noisier than thetracker's per-config medians; instructions:u and size-file are thecleaner signals here.

To localize the win, compile the preprocessed sqlite3 amalgamationunder callgrind and sum self-instructions (Ir) per sourcefile; a clang built with -g1 lets the line tables mapinlined header code back to its header. Profile the cc1invocation directly so the driver's fork/exec doesn't hide the work:

1
2
3
4
5
# clang built with -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=-g1 -DLLVM_ENABLE_ASSERTIONS=off
cc1=$(clang -### -O1 -w -c sqlite3.i -o /dev/null 2>&1 | grep -- -cc1)
eval valgrind --tool=callgrind --callgrind-out-file=cg.out --dump-line=yes $cc1
# sum the Ir column grouped by `fl=` source file
# (positions: instr line ⇒ Ir is the 3rd field; skip the cost line after each calls=)

Baseline 5d5220c5 vs the squashed series(whole compile matches callgrind's ownsummary: line, 22.59B → 22.25B):

File Before After Δ
DenseMap.h 1537M 1462M -4.8%
DenseMapInfo.h 320M 238M -25.7%
SmallPtrSet.{h,cpp} 611M 554M -9.3%
StringMap.{h,cpp} 19.9M 17.5M -12.1%
four containers 2488M 2272M -8.7%
whole compile 22590M 22247M -1.5%
❌