Since the LLVM 22 branch was cut, I've landed patches that parallelize more link phases and cut task-runtime overhead. This post compares current main against lld 22.1, mold, and wild.

Headline: a Release+Asserts clang --gc-sections link is 1.37x as fast as lld 22.1; Chromium debug with --gdb-index is 1.07x as fast. mold and wild are still ahead — the last section explains why.
## Benchmark
lld-0201 is main at 2026-02-01 (6a1803929817); lld-load is main plus the new [ELF] Parallelize input file loading. mold and wild run with --no-fork so the wall-clock numbers include the linker process itself.

Three reproduce tarballs, --threads=8, hyperfine -w 1 -r 10, pinned to CPU cores with numactl -C.
| Workload | lld-0201 | lld-load | mold | wild |
|----------|----------|----------|------|------|
| clang-23 Release+Asserts, --gc-sections | 1.255 s | 917.8 ms | 552.6 ms | 367.2 ms |
| clang-23 Debug (no --gdb-index) | 4.582 s | 4.306 s | 2.464 s | 1.565 s |
| clang-23 Debug (--gdb-index) | 6.291 s | 5.915 s | 4.001 s | N/A |
| Chromium Debug (no --gdb-index) | 6.140 s | 5.904 s | 2.665 s | 2.010 s |
| Chromium Debug (--gdb-index) | 7.857 s | 7.322 s | 3.786 s | N/A |
Note that the llvm/lib/Support/Parallel.cpp design keeps the main thread idle during parallelFor, so --threads=N really utilizes N+1 threads.
wild does not yet implement --gdb-index — it warns and skips, producing an output about 477 MB smaller on Chromium. For fair 4-way comparisons I also strip --gdb-index from the response file; the no --gdb-index rows above use that setup.
A few observations before diving in:
- The --gdb-index surcharge on the Chromium link is +1.42 s for lld (5.90 s → 7.32 s) versus +1.12 s for mold (2.67 s → 3.79 s). This is currently one of the biggest remaining gaps.
- Excluding --gdb-index, mold is 1.66x–2.22x as fast and wild 2.5x–2.94x as fast on this machine. There is plenty of room left.
- clang-23 Release+Asserts --gc-sections (workload 1) has collapsed from 1.255 s to 918 ms, a 1.37x speedup over 10 weeks. Most of that came from the parallel --gc-sections mark, parallel input loading, and the task-runtime cleanup below — each contributing a multiplicative factor.
## macOS (Apple M4) notes
The same clang-23 Release+Asserts link, --threads=8, on an Apple M4 (macOS 15, system allocator for all four linkers):
| Linker | Wall | User | Sys | (User+Sys)/Wall |
|--------|------|------|-----|-----------------|
| lld-0201 | 324.4 ± 1.5 ms | 502.1 ms | 171.7 ms | 2.08x |
| lld-load | 221.5 ± 1.8 ms | 476.5 ms | 368.8 ms | 3.82x |
| mold | 201.2 ± 1.7 ms | 875.1 ms | 220.5 ms | 5.44x |
| wild | 107.1 ± 0.5 ms | 456.8 ms | 284.6 ms | 6.92x |
## Parallelize --gc-sections mark
Garbage collection had been a single-threaded BFS over the InputSection graph. On a Release+Asserts clang link, markLive was ~315 ms of the 1562 ms wall time (20%).
6f9646a598f2 adds markParallel, a level-synchronized BFS. Each BFS level is processed with parallelFor; newly discovered sections land in per-thread queues, which are merged before the next level. The parallel path activates when !TrackWhyLive && partitions.size() == 1. Implementation details that turned out to matter:
- Depth-limited inline recursion (depth < 3) before pushing to the next-level queue. Shallow reference chains stay hot in cache and avoid queue overhead.
- Optimistic "load then compare-exchange" section-flag dedup instead of atomic fetch-or. The vast majority of sections are visited once, so the load almost always wins.
On the Release+Asserts clang link, markLive dropped from 315 ms to 82 ms at --threads=8 (from 199 ms to 50 ms at --threads=16); total wall time improved by 1.16x–1.18x.
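As a rough sketch of the shape (not lld's actual code; the types and names here are illustrative stand-ins for InputSection and lld's parallelFor), a level-synchronized mark with per-thread next-level queues, depth-limited recursion, and load-then-CAS dedup might look like:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for InputSection: an adjacency list plus an
// atomic "live" flag, mirroring the optimistic load-then-CAS dedup.
struct Node {
  std::vector<uint32_t> succs;
  std::atomic<uint8_t> live{0};
};

// Try to claim a node. Most nodes are visited exactly once, so the
// relaxed load usually answers without a contended RMW.
static bool tryMark(Node &n) {
  if (n.live.load(std::memory_order_relaxed))
    return false;
  uint8_t expected = 0;
  return n.live.compare_exchange_strong(expected, 1,
                                        std::memory_order_relaxed);
}

// Depth-limited recursion: shallow reference chains stay on this
// thread's stack instead of round-tripping through the next-level queue.
static void visit(std::vector<Node> &g, uint32_t idx, int depth,
                  std::vector<uint32_t> &next) {
  for (uint32_t s : g[idx].succs) {
    if (!tryMark(g[s]))
      continue;
    if (depth < 3)
      visit(g, s, depth + 1, next);
    else
      next.push_back(s);
  }
}

// Level-synchronized BFS: each level is split across threads; the
// per-thread "next" queues are concatenated before the next level.
void markLiveParallel(std::vector<Node> &g, std::vector<uint32_t> roots,
                      unsigned nthreads) {
  for (uint32_t r : roots)
    g[r].live.store(1, std::memory_order_relaxed);
  std::vector<uint32_t> cur = std::move(roots);
  while (!cur.empty()) {
    std::vector<std::vector<uint32_t>> next(nthreads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
      workers.emplace_back([&, t] {
        for (size_t i = t; i < cur.size(); i += nthreads)
          visit(g, cur[i], 0, next[t]);
      });
    for (auto &w : workers)
      w.join();
    cur.clear();
    for (auto &q : next)
      cur.insert(cur.end(), q.begin(), q.end());
  }
}
```

The per-thread queues are the key determinism-free part: within a level, which thread discovers a section does not matter, because the flag CAS guarantees each section is enqueued at most once.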
Two prerequisite cleanups were needed for correctness:
- 6a874161621e moved Symbol::used into the existing std::atomic<uint16_t> flags. The bitfield was previously racing with other mark threads.
- 2118499a898b decoupled SharedFile::isNeeded from the mark walk. --as-needed used to flip isNeeded inside resolveReloc, which would have required coordinated writes across threads; it is now a post-GC scan of global symbols.
## Parallelize input file loading
Historically, LinkerDriver::createFiles walked the command line and called addFile serially. addFile maps the file (MemoryBuffer::getFile), sniffs the magic, and constructs an ObjFile, SharedFile, BitcodeFile, or ArchiveFile. For thin archives it also materializes each member. On workloads with hundreds of archives and thousands of objects, this serial walk dominates the early part of the link.

The pending patch will rewrite addFile to record a LoadJob for each non-script input together with a snapshot of the driver's state machine (inWholeArchive, inLib, asNeeded, withLOption, groupId). After createFiles finishes, loadFiles fans the jobs out to worker threads. Linker scripts stay on the main thread because INPUT() and GROUP() recursively call back into addFile.
A few subtleties made this harder than it sounds:
- BitcodeFile and fatLTO construction call ctx.saver / ctx.uniqueSaver, both of which are non-thread-safe StringSaver / UniqueStringSaver. I serialized those constructors behind a mutex; pure-ELF links hit it zero times.
- Thin-archive member buffers used to be appended to ctx.memoryBuffers directly. To keep the output deterministic across --threads values, each job now accumulates into a per-job SmallVector which is merged into ctx.memoryBuffers in command-line order.
- InputFile::groupId used to be assigned inside the InputFile constructor from a global counter. With parallel construction the assignment race would have been unobservable but still ugly; b6c8cba516da hoists ++nextGroupId into the serial driver loop and stores the value into each file after construction.
The output is byte-identical to the old lld and deterministic across --threads values, which I verified with diff across --threads={1,2,4,8} on Chromium.
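The fan-out-and-merge shape can be sketched as follows. This is a hypothetical miniature, not the patch itself: the struct fields mirror the snapshot list above, but the "loading" work is a stand-in string and the merge plays the role of ctx.memoryBuffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Hypothetical snapshot of the driver's state machine at the point the
// input appeared on the command line (field names follow the patch
// description; the real struct lives inside lld's driver).
struct LoadJob {
  std::string path;
  bool inWholeArchive = false;
  bool inLib = false;
  bool asNeeded = false;
  uint32_t groupId = 0;
  // Buffers materialized by this job (e.g. thin-archive members);
  // kept per-job so the final order is independent of scheduling.
  std::vector<std::string> buffers;
};

// Fan jobs out to workers, then merge per-job results in command-line
// order so the output is deterministic across --threads values.
std::vector<std::string> loadFiles(std::vector<LoadJob> &jobs,
                                   unsigned nthreads) {
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < nthreads; ++t)
    workers.emplace_back([&, t] {
      for (size_t i = t; i < jobs.size(); i += nthreads)
        jobs[i].buffers.push_back("loaded:" + jobs[i].path); // stand-in work
    });
  for (auto &w : workers)
    w.join();
  std::vector<std::string> merged; // plays the role of ctx.memoryBuffers
  for (LoadJob &j : jobs)          // command-line order, not finish order
    merged.insert(merged.end(), j.buffers.begin(), j.buffers.end());
  return merged;
}
```

The design choice worth noting is that determinism comes from the serial merge at the end, not from constraining the workers; the parallel phase is free to finish jobs in any order.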
A --time-trace breakdown is useful to set expectations. On Chromium, the serial portion of createFiles accounts for only ~81 ms of the 5.9 s wall, and loadFiles (after this patch) runs in ~103 ms in parallel. Serial readFile/mmap is not the bottleneck. What moves the needle is overlapping the per-file constructor work — magic sniffing, archive member materialization, bitcode initialization — with everything else that now kicks off on the main thread while workers chew through the job list.
## Extending parallel relocation scanning

Relocation scanning has been parallel since LLVM 17, but three cases had opted out via bool serial:
- -z nocombreloc, because .rela.dyn merged relative and non-relative relocations and needed deterministic ordering.
- MIPS, because MipsGotSection is mutated during scanning.
- PPC64, because ctx.ppc64noTocRelax (a DenseSet of (Symbol*, offset) pairs) was written without a lock.
076226f378df and dc4df5da886e separate relative and non-relative dynamic relocations unconditionally and always build .rela.dyn with combreloc=true; the only remaining effect of -z nocombreloc is suppressing DT_RELACOUNT. 2f7bd4fa9723 then protects ctx.ppc64noTocRelax with the already-existing ctx.relocMutex, which is only taken on rare slow paths. After these changes, only MIPS still runs scanning serially.
## Faster getSectionPiece
Merge sections (SHF_MERGE) split their input into "pieces". Every reference into a merge section needs to map an offset to a piece. The old implementation was always a binary search in MergeInputSection::pieces, called from markLive, includeInSymtab, and getRelocTargetVA.

42cc45477727 changes this in two ways:
- For non-string fixed-size merge sections, getSectionPiece uses offset / entsize directly.
- For non-section Defined symbols pointing into merge sections, the piece index is pre-resolved during splitSections and packed into Defined::value as ((pieceIdx + 1) << 32) | intraPieceOffset.
The binary search is now limited to references via section symbols (addend-based), which is common on AArch64 but rare on x86-64, where the assembler emits local labels for .L references into mergeable strings. The clang-relassert link with --gc-sections is 1.05x as fast.
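The bit layout can be illustrated with a few helpers. The helper names are mine, not lld's; only the ((pieceIdx + 1) << 32) | intraPieceOffset encoding comes from the text above. The +1 bias lets a zero value mean "not pre-resolved".

```cpp
#include <cassert>
#include <cstdint>

// Pack a pre-resolved piece index and intra-piece offset into a
// 64-bit symbol value. pieceIdx + 1 so that 0 stays available as
// the "not packed" sentinel.
constexpr uint64_t packPiece(uint32_t pieceIdx, uint32_t intraPieceOffset) {
  return (uint64_t(pieceIdx + 1) << 32) | intraPieceOffset;
}

constexpr bool isPacked(uint64_t value) { return (value >> 32) != 0; }

constexpr uint32_t pieceIdx(uint64_t value) {
  return uint32_t(value >> 32) - 1; // undo the +1 bias
}

constexpr uint32_t intraOffset(uint64_t value) { return uint32_t(value); }

// For fixed-size (non-string) merge sections the table lookup
// disappears entirely: the piece index is just offset / entsize.
constexpr uint32_t fixedSizePiece(uint64_t offset, uint64_t entsize) {
  return uint32_t(offset / entsize);
}
```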
## Optimizing the underlying llvm/lib/Support/Parallel.cpp

All of the wins above rely on llvm/lib/Support/Parallel.cpp, the tiny work-stealing-ish task runtime shared by lld, dsymutil, and a handful of debug-info tools. Four changes in that file mattered:
- c7b5f7c635e2 — parallelFor used to pre-split work into up to MaxTasksPerGroup (1024) tasks and spawn each through the executor's mutex + condvar. It now spawns only ThreadCount workers; each grabs the next chunk via an atomic fetch_add. On a clang-14 link (--threads=8), futex calls dropped from ~31K to ~1.4K (glibc release+asserts); wall time 927 ms → 879 ms. This is the reason the parallel mark and parallel scan numbers are worth quoting at all — on the old runtime, spawn overhead was a real fraction of the work being parallelized.
- 9085f74018a4 — TaskGroup::spawn() replaced the mutex-based Latch::inc() with an atomic fetch_add and passes the Latch& through Executor::add() so the worker calls dec() directly. Eliminates one std::function construction per spawn.
- 5b1be759295c — removed the Executor abstract base class. ThreadPoolExecutor was always the only implementation; add() and getThreadCount() are now direct calls instead of virtual dispatches.
- 8daaa26efdda — enables nested parallel TaskGroup via work-stealing. Historically, nested groups ran serially to avoid deadlock (the thread that was supposed to run a nested task might be blocked in the outer group's sync()). Worker threads now actively execute tasks from the queue while waiting, instead of just blocking. Root-level groups on the main thread keep the efficient blocking Latch::sync(), so the common non-nested case pays nothing. In lld this lets SyntheticSection::writeTo calls with internal parallelism (GdbIndexSection, MergeNoTailSection) parallelize automatically when called from inside OutputSection::writeTo, instead of degenerating to serial execution on a worker thread — which was the exact situation D131247 had worked around by threading a root TaskGroup all the way down.
## Small wins worth mentioning
- 036b755daedb parallelizes demoteAndCopyLocalSymbols. Each file collects local Symbol* pointers in a per-file vector via parallelFor, which are merged into the symbol table serially. Linking clang-14 (--no-gc-sections) with its 208K .symtab entries is 1.04x as fast.
## Where lld still loses time

To locate the gap I ran lld --time-trace, mold --perf, and wild --time on the Chromium --gdb-index link (--threads=8). Grouped into comparable phases:
| Phase | lld | mold |
|-------|-----|------|
| Parse input files | 2778 ms | 1034 ms |
| Scan relocations | 233 ms | 103 ms |
| Assign / finalize layout | 750 ms | ~150 ms |
| Symtab + synthetic finalize | 570 ms | ~80 ms |
| Write sections (copy chunks) | 533 ms | 558 ms |
| Create gdb index | 1317 ms | 911 ms |
| Wall | 6742 ms | 3428 ms |
That leaves four meaningful gaps, in order of absolute impact:
Parse input files: 2.78 s vs 1.03 s, ~52% of the total gap. Same ratio on clang-debug (2.49 s vs 1.09 s). The phase is already parallel; the gap is pure constant factor in the per-object parse path (reading section headers, interning strings, splitting CIEs/FDEs, resolving symbols into the global table). wild is even more extreme here — its whole "Load inputs into symbol DB" is ~255 ms on Chromium, which is where most of its overall advantage comes from.
Assign / finalize / symtab finalize: ~1.3 s vs ~0.23 s. finalizeAddressDependentContent, assignAddresses, finalizeSynthetic, Add symbols to symtabs, and Finalize .eh_frame together cost ~1.3 s on Chromium. mold's equivalents (compute_section_sizes, compute_symtab_size, create_output_sections, set_osec_offsets) total ~230 ms. .symtab alone is ~127 ms lld vs ~27 ms mold on clang-debug; I have a local branch that turns SymbolTableBaseSection::finalizeContents into a prefix-sum-driven parallel fill and replaces the stable_partition + MapVector shuffle with per-file lateLocals buffers. 1640 ELF tests pass; not posted yet.
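Since that branch isn't posted, here is only the general shape of a prefix-sum-driven parallel fill, with illustrative names (strings stand in for symbol-table entries). An exclusive prefix sum over per-file counts gives each file a disjoint slot range in the output, so every file can write its entries concurrently without locks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// perFileLocals[f] holds file f's local-symbol entries (here: strings).
// Step 1: exclusive prefix sum of the per-file counts -> slot offsets.
// Step 2: each file fills out[offsets[f] .. offsets[f+1]) in parallel.
std::vector<std::string>
fillSymtab(const std::vector<std::vector<std::string>> &perFileLocals) {
  std::vector<size_t> offsets(perFileLocals.size() + 1, 0);
  for (size_t i = 0; i < perFileLocals.size(); ++i)
    offsets[i + 1] = offsets[i] + perFileLocals[i].size(); // exclusive scan
  std::vector<std::string> out(offsets.back());
  std::vector<std::thread> workers; // one thread per file, for brevity
  for (size_t f = 0; f < perFileLocals.size(); ++f)
    workers.emplace_back([&, f] {
      size_t base = offsets[f]; // ranges are disjoint: no locking needed
      for (size_t i = 0; i < perFileLocals[f].size(); ++i)
        out[base + i] = perFileLocals[f][i];
    });
  for (auto &w : workers)
    w.join();
  return out;
}
```

The output order is per-file, per-index, independent of scheduling, which is the property a stable_partition + MapVector shuffle would otherwise be providing serially.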
Create gdb index: +1.32 s lld vs +0.91 s mold on Chromium. Varies by workload — on clang-debug the two are within 200 ms (1.73 s vs 1.54 s). The work is embarrassingly parallel per input, but lld funnels a lot of string interning through a single DenseMap (sharded, but still); mold uses a lock-free ConcurrentMap sized by HyperLogLog.

Scan relocations: 233 ms vs 103 ms. Small absolute but a clean 2.3x ratio. Target-specific scanning (the Add target-specific relocation scanning for … series from last year) already removed much of the dispatch overhead; what remains is per-relocation work in the x86-64 path.
Interestingly, writing section content is not a gap. lld spends 533 ms in Write sections vs mold's 558 ms in copy_chunks vs wild's 574 ms in Write data to file — all within noise of each other. The earlier assumption that .debug_* section writes were an lld weakness didn't survive measurement; the --gdb-index surcharge really lives in index construction, not the write.
wild is worth calling out separately: its user time is comparable to lld's but its system time is roughly half, and its parse phase is 4-8x faster than either of the C++ linkers. mold is at the other extreme — the highest user time on every workload, bought back by aggressive parallelism.