Recent lld/ELF performance improvements
Since the LLVM 22 branch was cut, I've landed patches that parallelize more link phases and cut task-runtime overhead. This post compares current main against lld 22.1.
Headline: a Release+Asserts clang `--gc-sections` link is 1.37x as fast as lld 22.1; Chromium debug with `--gdb-index` is 1.07x as fast. mold and wild are still ahead; the last section explains why.
Benchmark
lld-0201 is main at 2026-02-01 (6a1803929817); lld-load is main plus the new [ELF] Parallelize input file loading. mold and wild run with `--no-fork` so the wall-clock numbers include the linker process itself.
Three reproduce tarballs, `--threads=8`, `hyperfine -w 1 -r 10`, pinned to CPU cores with `numactl -C`.
| Workload | lld-0201 | lld-load | mold | wild |
|---|---|---|---|---|
| clang-23 Release+Asserts, `--gc-sections` | 1.255 s | 917.8 ms | 552.6 ms | 367.2 ms |
| clang-23 Debug (no `--gdb-index`) | 4.582 s | 4.306 s | 2.464 s | 1.565 s |
| clang-23 Debug (`--gdb-index`) | 6.291 s | 5.915 s | 4.001 s | N/A |
| Chromium Debug (no `--gdb-index`) | 6.140 s | 5.904 s | 2.665 s | 2.010 s |
| Chromium Debug (`--gdb-index`) | 7.857 s | 7.322 s | 3.786 s | N/A |
Note that the llvm/lib/Support/Parallel.cpp design keeps the main thread idle during `parallelFor`, so `--threads=N` actually involves N+1 threads.
wild does not yet implement `--gdb-index`: it warns and skips the option, producing an output about 477 MB smaller on Chromium. For fair 4-way comparisons I also strip `--gdb-index` from the response file; the no `--gdb-index` rows above use that setup.
A few observations before diving in:
- The `--gdb-index` surcharge on the Chromium link is +1.42 s for lld (5.90 s → 7.32 s) versus +1.12 s for mold (2.67 s → 3.79 s). This is currently one of the biggest remaining gaps.
- Excluding `--gdb-index`, mold is 1.66x–2.22x as fast and wild 2.5x–2.94x as fast on this machine. There is plenty of room left.
- clang-23 Release+Asserts `--gc-sections` (workload 1) has collapsed from 1.255 s to 918 ms, a 1.37x speedup over 10 weeks. Most of that came from the parallel `--gc-sections` mark, parallel input loading, and the task-runtime cleanup below, each contributing a multiplicative factor.
macOS (Apple M4) notes
The same clang-23 Release+Asserts link, `--threads=8`, on an Apple M4 (macOS 15, system allocator for all four linkers):
| Linker | Wall | User | Sys | (User+Sys)/Wall |
|---|---|---|---|---|
| lld-0201 | 324.4 ± 1.5 ms | 502.1 ms | 171.7 ms | 2.08x |
| lld-load | 221.5 ± 1.8 ms | 476.5 ms | 368.8 ms | 3.82x |
| mold | 201.2 ± 1.7 ms | 875.1 ms | 220.5 ms | 5.44x |
| wild | 107.1 ± 0.5 ms | 456.8 ms | 284.6 ms | 6.92x |
Parallelize `--gc-sections` mark
Garbage collection had been a single-threaded BFS over the `InputSection` graph. On a Release+Asserts clang link, `markLive` was ~315 ms of the 1562 ms wall time (20%).
The replacement is `markParallel`, a level-synchronized BFS. Each BFS level is processed with `parallelFor`; newly discovered sections land in per-thread queues, which are merged before the next level. The parallel path activates when `!TrackWhyLive && partitions.size() == 1`. Implementation details that turned out to matter:
- Depth-limited inline recursion (`depth < 3`) before pushing to the next-level queue. Shallow reference chains stay hot in cache and avoid queue overhead.
- Optimistic "load then compare-exchange" section-flag dedup instead of an atomic fetch-or. The vast majority of sections are visited once, so the load almost always wins.
On the Release+Asserts clang link, `markLive` dropped from 315 ms to 82 ms at `--threads=8` (from 199 ms to 50 ms at `--threads=16`); total wall time improved 1.16x–1.18x.
Two prerequisite cleanups were needed for correctness:
- Commit 6a874161621e moved `Symbol::used` into the existing `std::atomic<uint16_t> flags`. The bitfield was previously racing with other mark threads.
- Commit 2118499a898b decoupled `SharedFile::isNeeded` from the mark walk. `--as-needed` used to flip `isNeeded` inside `resolveReloc`, which would have required coordinated writes across threads; it is now a post-GC scan of global symbols.
Parallelize input file loading
Historically, `LinkerDriver::createFiles` walked the command line and called `addFile` serially. `addFile` maps the file (`MemoryBuffer::getFile`), sniffs the magic, and constructs an `ObjFile`, `SharedFile`, `BitcodeFile`, or `ArchiveFile`. For thin archives it also materializes each member. On workloads with hundreds of archives and thousands of objects, this serial walk dominates the early part of the link.
The pending patch will rewrite `addFile` to record a `LoadJob` for each non-script input together with a snapshot of the driver's state machine (`inWholeArchive`, `inLib`, `asNeeded`, `withLOption`, `groupId`). After `createFiles` finishes, `loadFiles` fans the jobs out to worker threads. Linker scripts stay on the main thread because `INPUT()` and `GROUP()` recursively call back into `addFile`.
A few subtleties made this harder than it sounds:
- `BitcodeFile` and fatLTO construction call `ctx.saver` / `ctx.uniqueSaver`, both of which are non-thread-safe `StringSaver` / `UniqueStringSaver`. I serialized those constructors behind a mutex; pure-ELF links hit it zero times.
- Thin-archive member buffers used to be appended to `ctx.memoryBuffers` directly. To keep the output deterministic across `--threads` values, each job now accumulates into a per-job `SmallVector`, which is merged into `ctx.memoryBuffers` in command-line order.
- `InputFile::groupId` used to be assigned inside the `InputFile` constructor from a global counter. With parallel construction the assignment race would have been unobservable but still ugly; b6c8cba516daabced0105114a7bcc745bc52faae hoists `++nextGroupId` into the serial driver loop and stores the value into each file after construction.
The output is byte-identical to the old lld and deterministic across `--threads` values, which I verified with `diff` across `--threads={1,2,4,8}` on Chromium.
A `--time-trace` breakdown is useful to set expectations. On Chromium, the serial portion of `createFiles` accounts for only ~81 ms of the 5.9 s wall, and `loadFiles` (after this patch) runs in ~103 ms in parallel. Serial `readFile`/mmap is not the bottleneck. What moves the needle is overlapping the per-file constructor work (magic sniffing, archive member materialization, bitcode initialization) with everything else that now kicks off on the main thread while workers chew through the job list.
Extending parallel relocation scanning
Relocation scanning has been parallel since LLVM 17, but three cases had opted out via `bool serial`:
- `-z nocombreloc`, because `.rela.dyn` merged relative and non-relative relocations and needed deterministic ordering.
- MIPS, because `MipsGotSection` is mutated during scanning.
- PPC64, because `ctx.ppc64noTocRelax` (a `DenseSet` of `(Symbol*, offset)` pairs) was written without a lock.
The first case is gone: `.rela.dyn` is now always sorted as with `combreloc=true`; the only remaining effect of `-z nocombreloc` is suppressing `DT_RELACOUNT`. The PPC64 case now guards `ctx.ppc64noTocRelax` with the already-existing `ctx.relocMutex`, which is only taken on rare slow paths. After these changes, only MIPS still runs scanning serially.
Target-specific relocation scanning
Relocation scanning used to go through a generic loop in Relocations.cpp that called `Target->getRelExpr` through a virtual call for every relocation: once to classify the expression kind (PC-relative, PLT, TLS, etc.) and again from the TLS-optimization dispatch. On any realistic link that is a hot inner loop running over tens of millions of relocations, and the virtual call plus its post-dispatch switch are a real fraction of the cost.
The fix is to move the whole per-section scan loop into target-specific code, so each `Target::scanSection` / `scanSectionImpl` pair can inline its own `getRelExpr`, handle TLS optimization in place, and specialize for the two or three relocation kinds that dominate on that architecture. Rolled out across most backends in early 2026:
- 4b887533389c x86 (i386 / x86-64). On lld's own object files, `R_X86_64_PC32` and `R_X86_64_PLT32` make up ~95% of relocations and now hit an inlined hot path.
- 371e0e2082e9 AArch64, 4ea72c1e8cbd RISC-V, cd01e6526af6 LoongArch, c04b00de7508 ARM, 6d9169553029 Hexagon, aec1c984266c SystemZ, 5e87f8147d68 PPC32, aecc4997bf12 PPC64.
Besides devirtualization, inlining TLS relocation handling into `scanSectionImpl` let the TLS-optimization-specific expression kinds be replaced with general ones: `R_RELAX_TLS_GD_TO_LE` / `R_RELAX_TLS_LD_TO_LE` / `R_RELAX_TLS_IE_TO_LE` fold into `R_TPREL`, `R_RELAX_TLS_GD_TO_IE` folds into `R_GOT_PC`, and `getTlsGdRelaxSkip` goes away. What remains in the shared dispatch path (`getRelExpr` called from `relocateNonAlloc` and `relocateEH`) is a much smaller set.
Average "Scan relocations" wall time on a clang-14 link (`--threads=8`, x86-64, 50 runs, measured via `--time-trace`) drops from 110 ms to 102 ms, ~8% from the x86 commit alone.
Faster `getSectionPiece`
Merge sections (`SHF_MERGE`) split their input into "pieces". Every reference into a merge section needs to map an offset to a piece. The old implementation was always a binary search in `MergeInputSection::pieces`, called from `markLive`, `includeInSymtab`, and `getRelocTargetVA`.
The new scheme avoids the search in two common cases:

- For non-string fixed-size merge sections, `getSectionPiece` uses `offset / entsize` directly.
- For non-section `Defined` symbols pointing into merge sections, the piece index is pre-resolved during `splitSections` and packed into `Defined::value` as `((pieceIdx + 1) << 32) | intraPieceOffset`.
The binary search is now limited to references via section symbols (addend-based), which is common on AArch64 but rare on x86-64, where the assembler emits local labels for `.L` references into mergeable strings. The clang-relassert link with `--gc-sections` is 1.05x as fast.
Optimizing the underlying llvm/lib/Support/Parallel.cpp
All of the wins above rely on llvm/lib/Support/Parallel.cpp, the tiny work-stealing-ish task runtime shared by lld, dsymutil, and a handful of debug-info tools. Four changes in that file mattered:
- Commit c7b5f7c635e2: `parallelFor` used to pre-split work into up to `MaxTasksPerGroup` (1024) tasks and spawn each through the executor's mutex + condvar. It now spawns only `ThreadCount` workers; each grabs the next chunk via an atomic `fetch_add`. On a clang-14 link (`--threads=8`), futex calls dropped from ~31K to ~1.4K (glibc release+asserts); wall time 927 ms → 879 ms. This is the reason the parallel mark and parallel scan numbers are worth quoting at all; on the old runtime, spawn overhead was a real fraction of the work being parallelized.
- Commit 9085f74018a4: `TaskGroup::spawn()` replaced the mutex-based `Latch::inc()` with an atomic `fetch_add` and passes the `Latch&` through `Executor::add()` so the worker calls `dec()` directly. Eliminates one `std::function` construction per spawn.
- Commit 5b1be759295c: removed the `Executor` abstract base class. `ThreadPoolExecutor` was always the only implementation; `add()` and `getThreadCount()` are now direct calls instead of virtual dispatches.
- Commit 8daaa26efdda: enables nested parallel `TaskGroup` via work-stealing. Historically, nested groups ran serially to avoid deadlock (the thread that was supposed to run a nested task might be blocked in the outer group's `sync()`). Worker threads now actively execute tasks from the queue while waiting, instead of just blocking. Root-level groups on the main thread keep the efficient blocking `Latch::sync()`, so the common non-nested case pays nothing. In lld this lets `SyntheticSection::writeTo` calls with internal parallelism (`GdbIndexSection`, `MergeNoTailSection`) parallelize automatically when called from inside `OutputSection::writeTo`, instead of degenerating to serial execution on a worker thread, which was the exact situation D131247 had worked around by threading a root `TaskGroup` all the way down.
Small wins worth mentioning
Commit 036b755daedb parallelizes `demoteAndCopyLocalSymbols`. Each file collects local `Symbol *` pointers in a per-file vector via `parallelFor`, which are merged into the symbol table serially. Linking clang-14 (`--no-gc-sections`) with its 208K `.symtab` entries is 1.04x as fast.
Where lld still loses time
To locate the gap I ran `lld --time-trace`, `mold --perf`, and `wild --time` on the Chromium `--gdb-index` link (`--threads=8`). Grouped into comparable phases:
| Work scope | lld-0201 | lld-load | mold | wild |
|---|---|---|---|---|
| mmap + parse sections + merge strings + symbol resolve | 376 ms | 292 ms | 230 ms | 113 ms |
| `--gc-sections` mark | 268 ms | 79 ms | 30 ms | — * |
| Scan relocations | 106 ms | 97 ms | 60 ms | — * |
| Assign / finalize / symtab | 76 ms | 100 ms | 27 ms | 84 ms |
| Write sections | 87 ms | 87 ms | 90 ms | 110 ms |
| Wall (hyperfine) | 1255 ms | 918 ms | 553 ms | 367 ms |
\* wild fuses `--gc-sections` marking and relocation-driven live-section propagation into one "Find required sections" pass (60 ms), so these two rows are effectively merged.
A subtlety on wild's parse number: wild's "Load inputs into symbol DB" phase by itself is only 23 ms, but it does only mmap + `.symtab` scan + global-name hash bucketing. Section-header parsing, mergeable-string splitting, COMDAT handling, and symbol resolution are deferred to later wild phases. The 113 ms row above sums those (Load inputs into symbol DB 23 + Resolve symbols 12 + Section resolution 21 + Merge strings 57) so it covers the same work lld calls "Parse input files".
Meaningful gaps, in order of absolute impact:
Parse: lld-load 292 ms vs wild 113 ms ≈ 2.6x. The biggest remaining cross-linker gap on this workload, and the same pattern holds on the larger workloads. The phase is already parallel; the gap is a constant factor in the per-object parse path (reading section headers, interning strings, splitting CIEs/FDEs, merging globals into the symbol table). On clang-relassert the 179 ms parse gap alone accounts for ~33% of the 551 ms wall-clock gap between lld-load and wild.
Assign / finalize / symtab: 100 ms vs mold 27 ms ≈ 3.7x. `finalizeAddressDependentContent`, `assignAddresses`, `finalizeSynthetic`, "Add symbols to symtabs", and "Finalize .eh_frame" together cost ~100 ms on this workload; mold's equivalents (`compute_section_sizes`, `compute_symtab_size`, `create_output_sections`, `set_osec_offsets`) total 27 ms. This gap grows linearly with the number of `.symtab` entries: on clang-debug it's 127 ms lld vs 27 ms mold, on Chromium 570 ms vs ~80 ms. I have a local branch that turns `SymbolTableBaseSection::finalizeContents` into a prefix-sum-driven parallel fill and replaces the `stable_partition` + `MapVector` shuffle with per-file `lateLocals` buffers. 1640 ELF tests pass; not posted yet.
markLive: 79 ms, 3.4x faster than the Feb 1 baseline (268 ms). This is an apples-to-oranges comparison: lld supports `__start_`/`__stop_` edges, `SHF_LINK_ORDER` dependencies, linker script `KEEP`, and other features. lld correctly handles `--gc-sections --as-needed` with `Symbol::used` (tests gc-sections-shared.s, weak-shared-gc.s, as-needed-not-in-regular.s):
- mold over-approximates `DT_NEEDED` on two axes: it emits `DT_NEEDED` for DSOs referenced only via weak relocs, and for DSOs referenced only from GC'd sections. It also retains undefined symbols that are only reachable from dead sections in `.dynsym`.
- wild handles weak refs correctly but not dead-section refs: weak-only references do not force `DT_NEEDED` (matching lld), but DSOs referenced only from GC'd sections still get `DT_NEEDED` entries. wild does drop the corresponding undefined symbols from `.dynsym`, so its `DT_NEEDED` decision and its symtab-inclusion decision diverge slightly.
- lld is strictest on all three axes.
Scan relocations: 97 ms vs 60 ms. A clean 1.6x ratio, small in absolute terms. Target-specific scanning (the "Add target-specific relocation scanning for …" commits) removed some dispatch overhead; what remains is `InputSectionBase::relocations` overhead. wild folds relocation-driven liveness into "Find required sections", which is why there's no separate wild row.
Interestingly, writing section content is not a gap (87–110 ms across all four). The earlier assumption that `.debug_*` section writes were an lld weakness didn't survive measurement.
One cost that only shows up on debug-info-heavy workloads is `--gdb-index` construction, which lld does in ~1.3 s vs mold's ~0.9 s on Chromium. The work is embarrassingly parallel per input, but lld funnels string interning through a sharded `DenseMap`; mold uses a lock-free `ConcurrentMap` sized by HyperLogLog. wild does not yet implement `--gdb-index`.
wild is worth calling out separately: its user time is comparable tolld's but its system time is roughly half, and its parse phase is 4-8xfaster than either of the C++ linkers across all three workloads. moldis at the other extreme — the highest user time on every workload,bought back by aggressive parallelism.