Fighting Hyrum's Law in LLVM

By MaskRay
May 10, 2026, 15:00

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody. (Hyrum's Law)

In a compiler, the most common form of Hyrum's Law is dependence on unspecified behavior: hash bucket order, the order of equal elements after std::sort, padding offsets. The same framing covers a few cases that are technically undefined behavior (use of an invalidated iterator) or plain incidental properties (ABI struct layout, ELF section offsets).

When the compiler itself harbors such a dependency, the symptom is usually output that varies build-to-build: an unstable sort that lands differently after the standard library changes, a hash map whose iteration order shifts when the hash function does. Occasionally the variation is run-to-run within a single build: DenseMap<void *, X> keys with an ASLR-derived seed reorder buckets each invocation. Either way, reproducible builds, bisection, and bug reports all assume same input → same output, and a stealth Hyrum dependency breaks that.

This post surveys some mechanisms that perturb the contract's blind spots so dependencies cannot quietly form.

Hash seed perturbation

The first line of defense is the hash function itself. llvm/include/llvm/ADT/Hashing.h:

    inline uint64_t get_execution_seed() {
    #if LLVM_ENABLE_ABI_BREAKING_CHECKS
      return static_cast<uint64_t>(
          reinterpret_cast<uintptr_t>(&install_fatal_error_handler));
    #else
      return 0xff51afd7ed558ccdULL;
    #endif
    }

With LLVM_ENABLE_ABI_BREAKING_CHECKS on, the seed XORed into every llvm::hash_value is the runtime address of install_fatal_error_handler, which under ASLR differs in every process. The header comment is explicit:

the seed is non-deterministic per process (address of a function in LLVMSupport) to prevent having users depend on the particular hash values.

Every hash_combine / hash_integer_value call picks up the seed, and every DenseMap<K, V> keyed by a hash_value-using type then reorders its buckets per run. MD5, BLAKE3, SHA1, SHA256 stay byte-stable; those are the right tools when you actually want a digest.

My commit ce80c80dca45 introduced the seed in 2024.
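To make the effect of the seed concrete, here is a minimal sketch of a seeded hash feeding a power-of-two bucket table. The names toyExecutionSeed, toyHash, toyBucket and seedAnchor are hypothetical, and the mixer is a toy, not LLVM's actual hashing code; the only idea borrowed from the text is "XOR a per-process value into the hash".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical anchor object; under ASLR its address varies per process.
static const char seedAnchor = 0;

// Mirrors the two modes described above: perturbed seed vs. fixed constant.
inline uint64_t toyExecutionSeed(bool perturb) {
  if (perturb)
    return static_cast<uint64_t>(reinterpret_cast<uintptr_t>(&seedAnchor));
  return 0xff51afd7ed558ccdULL;
}

// Toy mixer, NOT LLVM's hash: XOR the seed in, then scramble the bits.
inline uint64_t toyHash(uint64_t value, uint64_t seed) {
  uint64_t h = value ^ seed;
  h *= 0x9e3779b97f4a7c15ULL; // odd multiplier, so this step is invertible
  return h ^ (h >> 32);
}

// Bucket choice in a power-of-two table follows the seeded hash, so a
// different seed generally means a different bucket layout.
inline unsigned toyBucket(uint64_t value, uint64_t seed, unsigned numBuckets) {
  return static_cast<unsigned>(toyHash(value, seed) & (numBuckets - 1));
}
```

Because every step of the toy mixer is invertible, two different seeds always give different hash values for the same key, which is what makes any accidental dependence on bucket order show up quickly.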

Container iteration order

Code can grow dependencies on the iteration order. LLVM_ENABLE_REVERSE_ITERATION walks hash containers backwards to flag violations. llvm/include/llvm/Support/ReverseIteration.h:

    template <class T = void *> constexpr bool shouldReverseIterate() {
    #if LLVM_ENABLE_REVERSE_ITERATION
      return detail::IsPointerLike<T>::value;
    #else
      return false;
    #endif
    }

DenseMap flips its BucketItTy to std::reverse_iterator<pointer>; SmallPtrSet swaps begin() and end(); StringMap bitwise-NOTs the hash before bucket selection, which is the only thing that perturbs StringMap, since its hash bypasses get_execution_seed.

Unlike the hash seed, reverse iteration isn't auto-on with assertions; -DLLVM_REVERSE_ITERATION=ON opts in explicitly. Fixes triggered by it have already been merged in 2026: 7f703cabf728 (MLIR SSA-value completion order), 0b3afd35c41d (MLIR SROA alloca order), and f5e2c5ddcec7 (a clang test).
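The kind of bug reverse iteration flushes out can be sketched without any LLVM types. In this illustration a plain vector stands in for a hash table's bucket array, and forEachBucket, emitInBucketOrder and emitSorted are hypothetical helpers: the first consumer leaks bucket order into its output, the second sorts before emitting and is therefore immune.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Walk the "buckets" forward or backward, mimicking a reverse-iteration build.
template <class Fn>
void forEachBucket(const std::vector<std::string> &buckets, bool reverse,
                   Fn fn) {
  if (reverse)
    for (auto it = buckets.rbegin(); it != buckets.rend(); ++it)
      fn(*it);
  else
    for (const auto &b : buckets)
      fn(b);
}

// Order-dependent consumer: the output follows bucket order.
inline std::string emitInBucketOrder(const std::vector<std::string> &buckets,
                                     bool reverse) {
  std::string out;
  forEachBucket(buckets, reverse, [&](const std::string &s) {
    out += s;
    out += ';';
  });
  return out;
}

// Order-independent consumer: collect, sort by name, then emit.
inline std::string emitSorted(const std::vector<std::string> &buckets,
                              bool reverse) {
  std::vector<std::string> names;
  forEachBucket(buckets, reverse,
                [&](const std::string &s) { names.push_back(s); });
  std::sort(names.begin(), names.end());
  std::string out;
  for (const auto &n : names) {
    out += n;
    out += ';';
  }
  return out;
}
```

A test that compares emitInBucketOrder output against a golden file passes in one direction and fails in the other; emitSorted gives the same text either way.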

Iterator invalidation

Orthogonal to iteration order: what happens to an existing iterator after a mutation. llvm/include/llvm/ADT/EpochTracker.h:

    class DebugEpochBase {
      uint64_t Epoch = 0;
    public:
      void incrementEpoch() { ++Epoch; }
      ~DebugEpochBase() { incrementEpoch(); } // catches use-after-free

      class HandleBase {
        bool isHandleInSync() const {
          return *EpochAddress == EpochAtCreation;
        }
      };
    };

DenseMap and friends inherit from DebugEpochBase. Mutations bump the epoch; iterators capture it at construction and assert on mismatch. The destructor bumps too, so stale iterators into destroyed containers assert rather than read freed memory.

Without it, mutate-during-iteration "happens to work" depending on bucket layout, and bucket layout is what the hash seed and reverse iteration above perturb. The epoch check turns the latent bug into a clean assert regardless of which "lucky" layout the run lands on. It collapses to a no-op under NDEBUG.
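The epoch idea fits in a few lines. The sketch below uses a hypothetical EpochContainer with a nested Handle; it mirrors the excerpt's isHandleInSync comparison but is not the LLVM class, and it returns a bool where the real code would assert.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical container that bumps an epoch counter on every mutation.
class EpochContainer {
  uint64_t epoch = 0;
  std::vector<int> data;

public:
  // A handle remembers the epoch it was created under.
  class Handle {
    const uint64_t *epochAddress;
    uint64_t epochAtCreation;

  public:
    explicit Handle(const EpochContainer &c)
        : epochAddress(&c.epoch), epochAtCreation(c.epoch) {}
    // True while no mutation has happened since the handle was created.
    bool isInSync() const { return *epochAddress == epochAtCreation; }
  };

  void push(int v) {
    data.push_back(v);
    ++epoch; // any mutation invalidates outstanding handles
  }
  Handle handle() const { return Handle(*this); }
};
```

The check costs one load and one compare per access, which is why it can stay enabled in every assertion build rather than being reserved for special configurations.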

Pre-shuffling unstable sorts

The same defensive pattern shows up twice in the monorepo, in different sub-projects, years apart.

llvm::sort under EXPENSIVE_CHECKS

llvm/include/llvm/ADT/STLExtras.h:

    #ifdef EXPENSIVE_CHECKS
    namespace detail {
    inline unsigned presortShuffleEntropy() {
      static unsigned Result(std::random_device{}());
      return Result;
    }

    template <class IteratorTy>
    inline void presortShuffle(IteratorTy Start, IteratorTy End) {
      std::mt19937 Generator(presortShuffleEntropy());
      llvm::shuffle(Start, End, Generator);
    }
    } // end namespace detail
    #endif

    template <typename IteratorTy, typename Compare>
    inline void sort(IteratorTy Start, IteratorTy End, Compare Comp) {
    #ifdef EXPENSIVE_CHECKS
      detail::presortShuffle<IteratorTy>(Start, End);
    #endif
      std::sort(Start, End, Comp);
    }

std::sort and qsort are unstable; code observing the order of equal elements is depending on undocumented behavior. Pre-shuffling makes that observation different every run. Commit 5a3d47fabcb6 added the wrapper in 2018, motivated by PR35135.

LLVM also ships its own llvm::shuffle rather than calling std::shuffle, "so that LLVM behaves the same when using different standard libraries." A reproducibility tool whose reproducibility depends on the host stdlib is worse than no tool, and the linker section below relies on this.

llvm::stable_sort deliberately does not pre-shuffle; it is the explicit opt-in for code that legitimately needs ordering of equal elements.
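A standalone version of the shuffle-then-sort idea, using only the standard library. shuffledSort, Entry and byKey are hypothetical names, and std::shuffle stands in for llvm::shuffle, so unlike the LLVM wrapper the resulting permutation may differ between standard libraries.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <utility>
#include <vector>

using Entry = std::pair<int, char>; // (sort key, payload)

// Comparator that looks at the key only, so equal keys are "equal elements".
inline bool byKey(const Entry &a, const Entry &b) { return a.first < b.first; }

// Shuffle first so callers cannot rely on the input order of equal elements
// surviving the unstable sort.
template <class It, class Cmp>
void shuffledSort(It first, It last, Cmp cmp, unsigned seed) {
  std::mt19937 gen(seed);
  std::shuffle(first, last, gen);
  std::sort(first, last, cmp);
}
```

After the call the keys are sorted, which is all the comparator promises; the relative order of the payloads attached to equal keys is deliberately scrambled, so code that reads them in "input order" breaks immediately instead of after a standard library upgrade.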

libc++ _LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY

libc++ has a near-perfect parallel mechanism, designed for downstream users rather than the project's own internals. libcxx/include/__debug_utils/randomize_range.h:

    template <class _AlgPolicy, class _Iterator, class _Sentinel>
    _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14
    void __debug_randomize_range(_Iterator __first, _Sentinel __last) {
    #ifdef _LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY
      if (!__libcpp_is_constant_evaluated())
        std::__shuffle<_AlgPolicy>(__first, __last, __libcpp_debug_randomizer());
    #else
      (void)__first;
      (void)__last;
    #endif
    }

Three callsites:

  • std::sort: pre-shuffles the input.
  • std::partial_sort: pre-shuffles the input and re-shuffles the unsorted tail afterward.
  • std::nth_element: pre-shuffles, then re-shuffles each side of the partition.
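For std::nth_element, code that survives this randomization is code that reads only what the standard guarantees. A small sketch with hypothetical helpers medianOf and isPartitionedAround (medianOf assumes a non-empty input):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Relies only on the guarantee that position `mid` holds the element a full
// sort would place there. Takes a copy so the caller's order is untouched.
inline int medianOf(std::vector<int> values) {
  std::size_t mid = values.size() / 2;
  std::nth_element(values.begin(), values.begin() + mid, values.end());
  return values[mid];
}

// Checks the partition property without assuming any order inside each side,
// which is exactly the part the randomization is free to scramble.
inline bool isPartitionedAround(const std::vector<int> &v, std::size_t mid) {
  for (std::size_t i = 0; i < mid; ++i)
    if (v[mid] < v[i])
      return false;
  for (std::size_t i = mid + 1; i < v.size(); ++i)
    if (v[i] < v[mid])
      return false;
  return true;
}
```

Anything stronger, such as expecting the left side to come out sorted or in input order, is the kind of assumption the re-shuffle of each side is meant to break.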

Seed handling rhymes with get_execution_seed: ASLR or static std::random_device for per-process variation, with _LIBCPP_RANDOMIZE_UNSPECIFIED_STABILITY_SEED=<n> as a fixed-seed escape hatch. Off by default; C++11 and later only.

libcxx/docs/DesignDocs/UnspecifiedBehaviorRandomization.rst explains the motivation:

Google has measured couple of thousands of tests to be dependent on the stability of sorting and selection algorithms. As we also plan on updating (or least, providing under flag more) sorting algorithms, this effort helps doing it gradually and sustainably.

It cites PR20837 (a worst-case O(n²) std::sort) as the upgrade libc++ specifically wanted to ship. The shuffle is the gating tool: if downstream tests pass with it enabled, they will pass after the algorithm change too.

Comparing the two is more interesting than either alone:

  • llvm::sort's wrapper is internal hygiene: LLVM is its own primary user, so the shuffle lives in STLExtras.h behind a build flag with no docs.
  • libc++'s wrapper is user-facing: a DesignDocs/ page, public macro, public seed override, explicit "Patches welcome." invitation. It has to be: libc++'s users are not libc++, and the contract being defended is the C++ standard itself.
  • libc++ generalizes the primitive: __debug_randomize_range applies at three callsites, each declaring which sub-range the algorithm leaves unspecified. LLVM's wrapper only covers the simpler equal-element case.
  • Hashed containers (std::unordered_* iteration order) are unspecified in both, but libc++ does not randomize them. LLVM-the-library does; on this one surface LLVM is ahead of its own stdlib.

Linker output: --shuffle-sections and --randomize-section-padding

Two ELF-only lld knobs perturb layout details that no contract covers.

--shuffle-sections=<glob>=<seed>

lld/ELF/Writer.cpp:

    for (const auto &patAndSeed : ctx.arg.shuffleSections) {
      ...
      const uint32_t seed = patAndSeed.second;
      if (seed == UINT32_MAX) {
        // If --shuffle-sections <section-glob>=-1, reverse the section order.
        // The section order is stable even if the number of sections changes.
        // This is useful to catch issues like static initialization order
        // fiasco reliably.
        std::reverse(matched.begin(), matched.end());
      } else {
        std::mt19937 g(seed ? seed : std::random_device()());
        llvm::shuffle(matched.begin(), matched.end(), g);
      }
    }

Three regimes in one option:

  • seed = -1: deterministic reverse, stable even as new sections appear. Glob .init_array* to -1, rebuild, run the test suite: anything that breaks is a real static-init-order bug. One flag, no Frankenstein link script.
  • seed > 0: deterministic random shuffle, reproducible across runs and hosts (because llvm::shuffle is host-independent). Useful in CI without breaking bisection.
  • seed = 0: std::random_device()-seeded. Fresh nondeterminism every link.
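The three regimes can be tried outside the linker with a small stand-in. permuteSections is a hypothetical helper operating on section names, and it uses std::shuffle instead of llvm::shuffle, so the exact permutation for a given seed is only reproducible within one standard library implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// seed == UINT32_MAX (written as -1 on the command line): reverse.
// seed == 0: draw a fresh seed from std::random_device.
// anything else: deterministic shuffle from that seed.
inline void permuteSections(std::vector<std::string> &matched, uint32_t seed) {
  if (seed == UINT32_MAX) {
    std::reverse(matched.begin(), matched.end());
  } else {
    std::mt19937 gen(seed ? seed : std::random_device()());
    std::shuffle(matched.begin(), matched.end(), gen);
  }
}
```

Calling it twice with the same nonzero seed on equal inputs yields equal outputs, which is the property that keeps a CI failure reproducible on a developer machine.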

History: 423cb321dfae introduced the =-1 reverse mode; 16c30c3c23ef generalized to per-glob seeds, which is what makes the .init_array*=-1 recipe possible; c135a68d426f fixed a bug where the feature itself produced an invalid dynamic relocation order. Even Hyrum mitigations have correctness traps.

--randomize-section-padding=<seed>

The sister option perturbs section offsets by inserting padding between input sections and at segment starts (lld/ELF/Writer.cpp):

    static void randomizeSectionPadding(Ctx &ctx) {
      std::mt19937 g(*ctx.arg.randomizeSectionPadding);
      // Insert padding between input sections and at segment starts.
    }

Callers grow dependencies on padding-induced offsets the linker never promised: profile-guided pipelines, side-channel research, exploit toolchains pinning to specific addresses. A seeded perturbation makes those dependencies visible.

Both options are ELF-only; the MachO and COFF ports have nothing equivalent.

ABI break detection

llvm/include/llvm/Config/abi-breaking.h.cmake:

    #if LLVM_ENABLE_ABI_BREAKING_CHECKS
    ABI_BREAKING_EXPORT_ABI extern int EnableABIBreakingChecks;
    LLVM_HIDDEN_VISIBILITY
    __attribute__((weak)) int *VerifyEnableABIBreakingChecks =
        &EnableABIBreakingChecks;
    #else
    ABI_BREAKING_EXPORT_ABI extern int DisableABIBreakingChecks;
    ...
    #endif

Every TU including the header takes a weak reference to EnableABIBreakingChecks or DisableABIBreakingChecks depending on its own build flag. Mixing the two against the same libLLVM produces an unresolved symbol at link time. MSVC gets the same guarantee via #pragma detect_mismatch.

Out-of-tree users routinely compile against headers from one tree and link against a different libLLVM. Without this gate, whichever struct layout the link happens to pick silently miscompiles; with it, the link fails.

What LLVM is not doing

The mechanisms above all target surfaces no stable consumer should care about: bucket order, equal-element sort order, init-array order. Debuggers, profilers, sanitizers, and reproducible-build infrastructure consume the toolchain's actual outputs and need them stable, so those are left unperturbed.

In some cases, a stronger guarantee is only provided with explicit options. For example, bitcode and textual IR preserve use-list order only under -preserve-bc-uselistorder / -preserve-ll-uselistorder.

A near-cousin: clang's -frandomize-layout-seed / __attribute__((randomize_layout)). It is mechanically the same (a seeded std::shuffle on struct fields), and it does coincidentally invalidate offsetof dependencies. But the intent is exploit mitigation, cribbed from GrSecurity's Randstruct GCC plugin: per-build kernel hardening, not a developer tool.

Recent lld/ELF performance improvements

By MaskRay
April 12, 2026, 15:00

Updated in 2026-05.

Since the LLVM 22 branch was cut, I've landed patches that parallelize more link phases and cut task-runtime overhead. This post compares current main against lld 22.1, mold, and wild.

Headline: a Release+Asserts clang --gc-sections link is 1.34x as fast as lld 22.1; Chromium debug with --gdb-index is 1.09x as fast. mold and wild are still ahead; the last section explains why.

Benchmark

lld-0201 is main at 2026-02-01 (6a1803929817); lld-HEAD is main at 2026-05-16 (20b0089ea340), which includes [ELF] Parallelize input file loading (commit 83f8eee57d5a) plus the later task-runtime and Symbol cleanups below. mold and wild run with --no-fork so the wall-clock numbers include the linker process itself.

Three reproduce tarballs, --threads=8, hyperfine -w 2 -r 10 (-w 3 -r 30 for the sub-second clang-relassert link), pinned to CPU cores with numactl -C 20-28.

| Workload | lld-0201 | lld-HEAD | mold | wild |
| --- | --- | --- | --- | --- |
| clang-23 Release+Asserts, --gc-sections | 1.262 s | 940.7 ms | 599.4 ms | 375.8 ms |
| clang-23 Debug (no --gdb-index) | 4.409 s | 4.038 s | 2.745 s | 1.472 s |
| clang-23 Debug (--gdb-index) | 6.038 s | 5.627 s | 4.418 s | N/A |
| Chromium Debug (no --gdb-index) | 6.094 s | 5.654 s | 2.864 s | 2.033 s |
| Chromium Debug (--gdb-index) | 7.708 s | 7.070 s | 4.196 s | N/A |

Note that the design of llvm/lib/Support/Parallel.cpp keeps the main thread idle during parallelFor, so --threads=N really utilizes N+1 threads.

wild does not yet implement --gdb-index: it warns and skips the option, producing an output about 477 MB smaller on Chromium. For fair 4-way comparisons I also strip --gdb-index from the response file; the no --gdb-index rows above use that setup.

A few observations before diving in:

  • The --gdb-index surcharge on the Chromium link is +1.42 s for lld (5.65 s → 7.07 s) versus +1.33 s for mold (2.86 s → 4.20 s). This is currently one of the biggest remaining gaps.
  • Excluding --gdb-index, mold is 1.47x–1.97x as fast and wild 2.50x–2.78x as fast on this machine. There is plenty of room left.
  • clang-23 Release+Asserts --gc-sections (workload 1) has collapsed from 1.262 s to 941 ms, a 1.34x speedup over ~15 weeks. Most of that came from the parallel --gc-sections mark, parallel input loading, and the task-runtime cleanup below, each contributing a multiplicative factor.

macOS (Apple M4) notes

The same clang-23 Release+Asserts --gc-sections workload, same lld-0201 (6a1803929817) and lld-HEAD (20b0089ea340) commits, on an Apple M4 (macOS 26.2, system allocator for all four linkers), --threads=8, hyperfine -w 3 -r 30. wild has no --chroot, so it gets --sysroot= pointed at the reproduce directory to resolve the absolute paths in the libc.so/libm.so GROUP scripts; the work performed is identical.

| Linker | Wall | User | Sys | (User+Sys)/Wall |
| --- | --- | --- | --- | --- |
| lld-0201 | 365.4 ± 4.3 ms | 515.7 ms | 265.8 ms | 2.14x |
| lld-HEAD | 261.9 ± 4.2 ms | 493.0 ms | 471.9 ms | 3.68x |
| mold | 256.6 ± 3.0 ms | 909.9 ms | 345.5 ms | 4.89x |
| wild | 131.4 ± 1.0 ms | 468.0 ms | 319.6 ms | 5.99x |

Parallelize --gc-sections mark

Garbage collection had been a single-threaded BFS over the InputSection graph. On a Release+Asserts clang link, markLive was ~315 ms of the 1562 ms wall time (20%).

Commit 6f9646a598f2 adds markParallel, a level-synchronized BFS. Each BFS level is processed with parallelFor; newly discovered sections land in per-thread queues, which are merged before the next level. The parallel path activates when !TrackWhyLive && partitions.size() == 1. Implementation details that turned out to matter:

  • Depth-limited inline recursion (depth < 3) before pushing to the next-level queue. Shallow reference chains stay hot in cache and avoid queue overhead.
  • Optimistic "load then compare-exchange" section-flag dedup instead of atomic fetch-or. The vast majority of sections are visited once, so the load almost always wins.
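The shape of a level-synchronized mark with "load then compare-exchange" dedup can be sketched as follows. tryMark and markLevels are hypothetical names, the graph is a plain adjacency list rather than lld's InputSection graph, the depth-limited inline recursion is omitted, and the per-chunk loop runs serially here as a stand-in for parallelFor to keep the sketch dependency-free.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true only for the caller that flips the flag from 0 to 1.
inline bool tryMark(std::atomic<uint8_t> &flag) {
  if (flag.load(std::memory_order_relaxed)) // cheap read first
    return false;
  uint8_t expected = 0;
  return flag.compare_exchange_strong(expected, 1, std::memory_order_relaxed);
}

// graph[i] lists the nodes that node i references.
inline std::vector<bool>
markLevels(const std::vector<std::vector<std::size_t>> &graph,
           const std::vector<std::size_t> &roots, std::size_t numChunks) {
  if (numChunks == 0)
    numChunks = 1;
  std::vector<std::atomic<uint8_t>> live(graph.size());
  for (auto &f : live)
    f.store(0, std::memory_order_relaxed);

  std::vector<std::size_t> level;
  for (std::size_t r : roots)
    if (tryMark(live[r]))
      level.push_back(r);

  while (!level.empty()) {
    // One output queue per chunk, so chunks never contend on a shared queue.
    std::vector<std::vector<std::size_t>> next(numChunks);
    for (std::size_t c = 0; c < numChunks; ++c) // serial stand-in for parallelFor
      for (std::size_t i = c; i < level.size(); i += numChunks)
        for (std::size_t succ : graph[level[i]])
          if (tryMark(live[succ]))
            next[c].push_back(succ);
    // Merge the per-chunk queues before starting the next level.
    level.clear();
    for (auto &q : next)
      level.insert(level.end(), q.begin(), q.end());
  }

  std::vector<bool> result(graph.size());
  for (std::size_t i = 0; i < graph.size(); ++i)
    result[i] = live[i].load(std::memory_order_relaxed) != 0;
  return result;
}
```

Since tryMark lets exactly one caller win per node, each node is enqueued at most once even when several chunks discover it in the same level.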

On the Release+Asserts clang link, markLive dropped from 315 ms to 82 ms at --threads=8 (from 199 ms to 50 ms at --threads=16); the total link is 1.16x–1.18x as fast.

Two prerequisite cleanups were needed for correctness:

  • Commit 6a874161621e moved Symbol::used into the existing std::atomic<uint16_t> flags. The bitfield was previously racing with other mark threads.
  • Commit 2118499a898b decoupled SharedFile::isNeeded from the mark walk. --as-needed used to flip isNeeded inside resolveReloc, which would have required coordinated writes across threads; it is now a post-GC scan of global symbols.

Parallelize input file loading

Historically, LinkerDriver::createFiles walked the command line and called addFile serially. addFile maps the file (MemoryBuffer::getFile), sniffs the magic, and constructs an ObjFile, SharedFile, BitcodeFile, or ArchiveFile. For thin archives it also materializes each member. On workloads with hundreds of archives and thousands of objects, this serial walk dominates the early part of the link.

Commit 83f8eee57d5a rewrites addFile to record a LoadJob for each non-script input together with a snapshot of the driver's state machine (inWholeArchive, inLib, asNeeded, withLOption, groupId). After createFiles finishes, loadFiles fans the jobs out to worker threads. Linker scripts stay on the main thread because INPUT() and GROUP() recursively call back into addFile.

A few subtleties made this harder than it sounds:

  • BitcodeFile and fatLTO construction call ctx.saver / ctx.uniqueSaver, both of which are non-thread-safe StringSaver / UniqueStringSaver. I serialized those constructors behind a mutex; pure-ELF links hit it zero times.
  • Thin-archive member buffers used to be appended to ctx.memoryBuffers directly. To keep the output deterministic across --threads values, each job now accumulates into a per-job SmallVector which is merged into ctx.memoryBuffers in command-line order.
  • InputFile::groupId used to be assigned inside the InputFile constructor from a global counter. With parallel construction the assignment race would have been unobservable but still ugly; b6c8cba516daabced0105114a7bcc745bc52faae hoists ++nextGroupId into the serial driver loop and stores the value into each file after construction.
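The per-job buffer trick is independent of lld's data structures. In this sketch ToyLoadJob and loadAndMerge are hypothetical, std::vector replaces SmallVector, and out-of-order completion is simulated by visiting the jobs back to front rather than by real worker threads.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One job per command-line input; each job owns its own output buffer.
struct ToyLoadJob {
  std::string path;
  std::vector<std::string> buffers;
};

inline std::vector<std::string> loadAndMerge(std::vector<ToyLoadJob> &jobs) {
  // Simulate workers finishing in an arbitrary order: back to front.
  for (std::size_t i = jobs.size(); i-- > 0;) {
    jobs[i].buffers.push_back(jobs[i].path + ":header");
    jobs[i].buffers.push_back(jobs[i].path + ":member");
  }
  // Merge strictly in job (command-line) order, so the result does not
  // depend on which job finished first.
  std::vector<std::string> merged;
  for (const ToyLoadJob &job : jobs)
    merged.insert(merged.end(), job.buffers.begin(), job.buffers.end());
  return merged;
}
```

Because no job ever appends to shared state, the merged list is identical for any completion order and any thread count.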

The output is byte-identical to the old lld and deterministic across --threads values, which I verified with diff across --threads={1,2,4,8} on Chromium.

A --time-trace breakdown is useful to set expectations. On Chromium, the serial portion of createFiles accounts for only ~81 ms of the 5.9 s wall, and loadFiles (after this patch) runs in ~103 ms in parallel. Serial readFile/mmap is not the bottleneck. What moves the needle is overlapping the per-file constructor work (magic sniffing, archive member materialization, bitcode initialization) with everything else that now kicks off on the main thread while workers chew through the job list.

Extending parallel relocation scanning

Relocation scanning has been parallel since LLVM 17, but three cases had opted out via bool serial:

  1. -z nocombreloc, because .rela.dyn merged relative and non-relative relocations and needed deterministic ordering.
  2. MIPS, because MipsGotSection is mutated during scanning.
  3. PPC64, because ctx.ppc64noTocRelax (a DenseSet of (Symbol*, offset) pairs) was written without a lock.

Commits 076226f378df and dc4df5da886e separate relative and non-relative dynamic relocations unconditionally and always build .rela.dyn with combreloc=true; the only remaining effect of -z nocombreloc is suppressing DT_RELACOUNT. Commit 2f7bd4fa9723 then protects ctx.ppc64noTocRelax with the already-existing ctx.relocMutex, which is only taken on rare slow paths. After these changes, only MIPS still runs scanning serially.

Target-specific relocation scanning

Relocation scanning used to go through a generic loop in Relocations.cpp that called Target->getRelExpr through a virtual call for every relocation: once to classify the expression kind (PC-relative, PLT, TLS, etc.) and again from the TLS-optimization dispatch. On any realistic link that is a hot inner loop running over tens of millions of relocations, and the virtual call plus its post-dispatch switch are a real fraction of the cost.

The fix is to move the whole per-section scan loop into target-specific code, so each Target::scanSection / scanSectionImpl pair can inline its own getRelExpr, handle TLS optimization in-place, and specialize for the two or three relocation kinds that dominate on that architecture. Rolled out across most backends in early 2026:

  • 4b887533389c: x86 (i386 / x86-64). On lld's own object files, R_X86_64_PC32 and R_X86_64_PLT32 make up ~95% of relocations and now hit an inlined hot path.
  • 371e0e2082e9 (AArch64), 4ea72c1e8cbd (RISC-V), cd01e6526af6 (LoongArch), c04b00de7508 (ARM), 6d9169553029 (Hexagon), aec1c984266c (SystemZ), 5e87f8147d68 (PPC32), aecc4997bf12 (PPC64).

Besides devirtualization, inlining TLS relocation handling into scanSectionImpl let the TLS-optimization-specific expression kinds be replaced with general ones: R_RELAX_TLS_GD_TO_LE / R_RELAX_TLS_LD_TO_LE / R_RELAX_TLS_IE_TO_LE fold into R_TPREL, R_RELAX_TLS_GD_TO_IE folds into R_GOT_PC, and getTlsGdRelaxSkip goes away. What remains in the shared dispatch path (getRelExpr called from relocateNonAlloc and relocateEH) is a much smaller set.

Average Scan relocations wall time on a clang-14 link (--threads=8, x86-64, 50 runs, measured via --time-trace) drops from 110 ms to 102 ms, ~8% from the x86 commit alone.
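The devirtualization pattern in miniature: make the scan loop a template over the target, so the classifier is a static call the compiler can inline into the loop. Everything here (ToyRelExpr, ToyX86_64, scanSectionSketch, the counters) is a hypothetical illustration, not lld's scanSectionImpl; only the three relocation type numbers are the real ELF x86-64 values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class ToyRelExpr { Abs, PC, PltPC };

// ELF x86-64 relocation type numbers used by the toy classifier.
constexpr uint32_t kR64 = 1, kPC32 = 2, kPLT32 = 4;

struct ToyX86_64 {
  // Static and defined inline: no vtable lookup, visible to the inliner.
  static ToyRelExpr getRelExpr(uint32_t type) {
    switch (type) {
    case kPC32:
      return ToyRelExpr::PC;
    case kPLT32:
      return ToyRelExpr::PltPC;
    default:
      return ToyRelExpr::Abs;
    }
  }
};

struct ScanCounts {
  std::size_t pc = 0, plt = 0, other = 0;
};

// The whole per-section loop is instantiated once per target type.
template <class Target>
ScanCounts scanSectionSketch(const std::vector<uint32_t> &relocTypes) {
  ScanCounts counts;
  for (uint32_t type : relocTypes) {
    switch (Target::getRelExpr(type)) {
    case ToyRelExpr::PC:
      ++counts.pc;
      break;
    case ToyRelExpr::PltPC:
      ++counts.plt;
      break;
    case ToyRelExpr::Abs:
      ++counts.other;
      break;
    }
  }
  return counts;
}
```

With the classifier inlined, the two switches can collapse into one, and a target is free to test for its dominant relocation kinds first.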

Faster getSectionPiece

Merge sections (SHF_MERGE) split their input into "pieces". Every reference into a merge section needs to map an offset to a piece. The old implementation was always a binary search in MergeInputSection::pieces, called from MarkLive, includeInSymtab, and getRelocTargetVA.

Commit 42cc45477727 changes this in two ways:

  1. For non-string fixed-size merge sections, getSectionPiece uses offset / entsize directly.
  2. For non-section Defined symbols pointing into merge sections, the piece index is pre-resolved during splitSections and packed into Defined::value as ((pieceIdx + 1) << 32) | intraPieceOffset.
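Both lookups reduce to a few integer operations. The helper names below are hypothetical; the packing expression is the one quoted above. Reading the +1 as a way to tell a pre-resolved value (nonzero upper half) from a plain offset is my assumption, not something the commit description states.

```cpp
#include <cassert>
#include <cstdint>

// Pack a piece index and an offset within that piece into one 64-bit value.
inline uint64_t packPiece(uint32_t pieceIdx, uint32_t intraPieceOffset) {
  return ((static_cast<uint64_t>(pieceIdx) + 1) << 32) | intraPieceOffset;
}

// Assumption: a nonzero upper half marks a pre-resolved value.
inline bool isPreResolved(uint64_t value) { return (value >> 32) != 0; }

inline uint32_t unpackPieceIndex(uint64_t value) {
  return static_cast<uint32_t>(value >> 32) - 1;
}

inline uint32_t unpackIntraOffset(uint64_t value) {
  return static_cast<uint32_t>(value);
}

// Fixed-size entries: the piece index is a single division, no search needed.
inline uint64_t fixedSizePieceIndex(uint64_t offset, uint64_t entsize) {
  return offset / entsize;
}
```

For example, packPiece(0, 5) is 0x100000005: piece 0, five bytes in, with an upper half that is nonzero even though the piece index is zero.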

The binary search is now limited to references via section symbols (addend-based), which is common on AArch64 but rare on x86-64 where the assembler emits local labels for .L references into mergeable strings. The clang-relassert link with --gc-sections is 1.05x as fast.

Optimizing the underlying llvm/lib/Support/Parallel.cpp

All of the wins above rely on llvm/lib/Support/Parallel.cpp, the tiny work-stealing-ish task runtime shared by lld, dsymutil, and a handful of debug-info tools. Four changes in that file mattered:

  • Commit c7b5f7c635e2: parallelFor used to pre-split work into up to MaxTasksPerGroup (1024) tasks and spawn each through the executor's mutex + condvar. It now spawns only ThreadCount workers; each grabs the next chunk via an atomic fetch_add. On a clang-14 link (--threads=8), futex calls dropped from ~31K to ~1.4K (glibc release+asserts); wall time 927 ms → 879 ms. This is the reason the parallel mark and parallel scan numbers are worth quoting at all: on the old runtime, spawn overhead was a real fraction of the work being parallelized.
  • Commit 9085f74018a4: TaskGroup::spawn() replaced the mutex-based Latch::inc() with an atomic fetch_add and passes the Latch& through Executor::add() so the worker calls dec() directly. Eliminates one std::function construction per spawn.
  • Commit 5b1be759295c: removed the Executor abstract base class. ThreadPoolExecutor was always the only implementation; add() and getThreadCount() are now direct calls instead of virtual dispatches.
  • Commit 8daaa26efdda: enables nested parallel TaskGroup via work-stealing. Historically, nested groups ran serially to avoid deadlock (the thread that was supposed to run a nested task might be blocked in the outer group's sync()). Worker threads now actively execute tasks from the queue while waiting, instead of just blocking. Root-level groups on the main thread keep the efficient blocking Latch::sync(), so the common non-nested case pays nothing. In lld this lets SyntheticSection::writeTo calls with internal parallelism (GdbIndexSection, MergeNoTailSection) parallelize automatically when called from inside OutputSection::writeTo, instead of degenerating to serial execution on a worker thread, which was the exact situation D131247 had worked around by threading a root TaskGroup all the way down.
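The chunk-grabbing scheme from the first item can be approximated with std::thread and one atomic counter. parallelForSketch is a hypothetical stand-in, not the Parallel.cpp implementation: it creates fresh threads on every call instead of reusing a pool, and it assumes chunkSize is nonzero.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

inline void parallelForSketch(std::size_t begin, std::size_t end,
                              std::size_t numWorkers, std::size_t chunkSize,
                              const std::function<void(std::size_t)> &fn) {
  std::atomic<std::size_t> next{begin};
  auto worker = [&] {
    for (;;) {
      // Claim the next chunk; no mutex or condition variable involved.
      std::size_t lo = next.fetch_add(chunkSize, std::memory_order_relaxed);
      if (lo >= end)
        return;
      std::size_t hi = std::min(lo + chunkSize, end);
      for (std::size_t i = lo; i < hi; ++i)
        fn(i);
    }
  };
  std::vector<std::thread> threads;
  for (std::size_t w = 0; w < numWorkers; ++w)
    threads.emplace_back(worker);
  for (std::thread &t : threads)
    t.join();
}
```

Each index is claimed by exactly one fetch_add, so fn runs once per index no matter how the workers interleave, and a fast worker simply claims more chunks than a slow one.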

Small wins worth mentioning

  • 036b755daedb parallelizes demoteAndCopyLocalSymbols. Each file collects local Symbol* pointers in a per-file vector via parallelFor, which are merged into the symbol table serially. Linking clang-14 (--no-gc-sections) with its 208K .symtab entries is 1.04x as fast.

The xxh3 hash_combine swap (71d78b2220e4) and the Symbol constructor-init plus redundant-memset removal (905a88b92343, 20b0089ea340) yield minor improvements.

| lld build | clang-relassert link |
| --- | --- |
| pre-xxh3 (525fab579da1) | 939.0 ± 29.0 ms |
| xxh3 (71d78b2220e4) | 932.2 ± 23.3 ms |
| pre Symbol-init (2e4c820c05fd) | 928.4 ± 31.4 ms |
| HEAD (20b0089ea340) | 926.3 ± 30.0 ms |

Where lld still loses time

To locate the gap I ran lld --time-trace, mold --perf, and wild --time on the clang-relassert link (clang-23 Release+Asserts, --gc-sections, --threads=8; per-phase numbers are 5-run averages). Grouped into comparable phases:

| Work scope | lld-0201 | lld-HEAD | mold | wild |
| --- | --- | --- | --- | --- |
| mmap + parse sections + merge strings + symbol resolve | 391 ms | 320 ms | 218 ms | 120 ms |
| --gc-sections mark | 285 ms | 76 ms | 36 ms | n/a * |
| Scan relocations | 116 ms | 91 ms | 61 ms | n/a * |
| Assign / finalize / symtab | 77 ms | 86 ms | 27 ms | 86 ms |
| Write sections | 87 ms | 87 ms | 77 ms | 103 ms |
| Wall (hyperfine) | 1262 ms | 941 ms | 599 ms | 376 ms |

* wild fuses --gc-sections marking and relocation-driven live-section propagation into one Find required sections pass (62 ms), so these two rows are effectively merged.

A subtlety on wild's parse number: wild's Load inputs into symbol DB phase by itself is only 24 ms, but it does only mmap + .symtab scan + global-name hash bucketing. Section-header parsing, mergeable-string splitting, COMDAT handling, and symbol resolution are deferred to later wild phases. The 120 ms row above sums those (Load inputs into symbol DB 24 + Resolve symbols 13 + Resolve alternative symbol definitions 4 + Section resolution 21 + Merge strings 58) so it covers the same work lld calls Parse input files.

Meaningful gaps, in order of absolute impact:

Parse: lld-HEAD 320 ms vs wild 120 ms ≈ 2.7x. The biggest remaining cross-linker gap on this workload, and the same pattern holds on the larger workloads below. The phase is already parallel; the gap is constant factor in the per-object parse path (reading section headers, interning strings, splitting CIEs/FDEs, merging globals into the symbol table). On clang-relassert the 200 ms parse gap alone accounts for ~35% of the 565 ms wall-clock gap between lld-HEAD and wild.

Assign / finalize / symtab: 86 ms vs mold 27 ms ≈ 3.2x. finalizeAddressDependentContent, assignAddresses, finalizeSynthetic, Add symbols to symtabs, and Finalize .eh_frame together cost ~86 ms on this workload; mold's equivalents (compute_section_sizes, compute_symtab_size, create_output_sections, set_osec_offsets) total 27 ms. This gap grows linearly with the number of .symtab entries: on clang-debug it's 127 ms lld vs 27 ms mold, on Chromium 570 ms vs ~80 ms. I have a local branch that turns SymbolTableBaseSection::finalizeContents into a prefix-sum-driven parallel fill and replaces the stable_partition + MapVector shuffle with per-file lateLocals buffers. 1640 ELF tests pass; not posted yet.

markLive: 76 ms, 3.7x faster than the Feb 1 baseline (285 ms). This is an apples-to-oranges comparison: lld supports __start_/__stop_ edges, SHF_LINK_ORDER dependencies, linker script KEEP, and other features. lld correctly handles --gc-sections --as-needed with Symbol::used (tests gc-sections-shared.s, weak-shared-gc.s, as-needed-not-in-regular.s):

  • mold over-approximates DT_NEEDED on two axes: it emits DT_NEEDED for DSOs referenced only via weak relocs, and for DSOs referenced only from GC'd sections. It also retains undefined symbols that are only reachable from dead sections in .dynsym.
  • wild handles weak refs correctly but not dead-section refs: weak-only references do not force DT_NEEDED (matching lld), but DSOs referenced only from GC'd sections still get DT_NEEDED entries. wild does drop the corresponding undefined symbols from .dynsym, so its DT_NEEDED decision and its symtab-inclusion decision diverge slightly.
  • lld is strictest on all three axes.

Scan relocations: 91 ms vs mold's 61 ms. Clean 1.5x ratio, small absolute. Target-specific scanning (the Add target-specific relocation scanning for … commits) removed some dispatch overhead; what remains is InputSectionBase::relocations overhead. wild folds relocation-driven liveness into Find required sections, which is why there's no separate wild row.

Interestingly, writing section content is not a gap (77–103 ms across all four). The earlier assumption that .debug_* section writes were a lld weakness didn't survive measurement.

One cost that only shows up on debug-info-heavy workloads is --gdb-index construction, which lld does in ~1.3 s vs mold's ~0.9 s on Chromium. The work is embarrassingly parallel per input, but lld funnels string interning through a sharded DenseMap; mold uses a lock-free ConcurrentMap sized by HyperLogLog. wild does not yet implement --gdb-index.

wild is worth calling out separately: its user time is comparable to lld's but its system time is roughly half, and its parse phase is 4-8x faster than either of the C++ linkers across all three workloads. mold is at the other extreme: the highest user time on every workload, bought back by aggressive parallelism.
