Recent lld/ELF performance improvements
Since the LLVM 22 branch was cut, I've landed patches that parallelize more link phases, cut task-runtime overhead, and remove per-relocation hotspots. This post compares current main against lld 22.1, mold, and wild.
Headline: a Release+Asserts clang --gc-sections link is 1.37x as fast as lld 22.1; Chromium debug with --gdb-index is 1.07x as fast. mold and wild are still ahead — the last section explains why.
Benchmark
Three reproduce tarballs, --threads=8, hyperfine -w 1 -r 10, pinned to CPU cores with numactl -C. lld-0201 is main at 2026-02-01 (6a1803929817); lld-load is main plus the new [ELF] Parallelize input file loading. mold and wild run with --no-fork so the wall-clock numbers include the linker process itself.
| Workload | lld-0201 | lld-load | mold | wild |
|---|---|---|---|---|
| clang-23 Release+Asserts, --gc-sections | 1.255 s | 917.8 ms | 552.6 ms | 367.2 ms |
| clang-23 Debug (no --gdb-index) | 4.582 s | 4.306 s | 2.464 s | 1.565 s |
| clang-23 Debug (--gdb-index) | 6.291 s | 5.915 s | 4.001 s | N/A |
| Chromium Debug (no --gdb-index) | 6.140 s | 5.904 s | 2.665 s | 2.010 s |
| Chromium Debug (--gdb-index) | 7.857 s | 7.322 s | 3.786 s | N/A |
Note that the llvm/lib/Support/Parallel.cpp design keeps the main thread idle during parallelFor, so --threads=N really utilizes N+1 threads.
wild does not yet implement --gdb-index — it warns and skips the option, producing an output about 477 MB smaller on Chromium. For fair 4-way comparisons I also strip --gdb-index from the response file; the no --gdb-index rows above use that setup.
A few observations before diving in:
- The --gdb-index surcharge on the Chromium link is +1.42 s for lld (5.90 s → 7.32 s) versus +1.12 s for mold (2.67 s → 3.79 s). This is currently one of the biggest remaining gaps.
- Excluding --gdb-index, mold is 1.66x–2.22x as fast and wild 2.5x–2.94x as fast on this machine. There is plenty of room left.
- clang-23 Release+Asserts --gc-sections (workload 1) has collapsed from 1.255 s to 918 ms, a 1.37x speedup over 10 weeks. Most of that came from the parallel --gc-sections mark, parallel input loading, and the task-runtime cleanup below — each contributing a multiplicative factor.
macOS (Apple M4) notes
The same clang-23 Release+Asserts link, --threads=8, on an Apple M4 (macOS 15, system allocator for all four linkers):
| Linker | Wall | User | Sys | (User+Sys)/Wall |
|---|---|---|---|---|
| lld-0201 | 324.4 ± 1.5 ms | 502.1 ms | 171.7 ms | 2.08x |
| lld-load | 221.5 ± 1.8 ms | 476.5 ms | 368.8 ms | 3.82x |
| mold | 201.2 ± 1.7 ms | 875.1 ms | 220.5 ms | 5.44x |
| wild | 107.1 ± 0.5 ms | 456.8 ms | 284.6 ms | 6.92x |
Parallelize --gc-sections mark
Garbage collection had been a single-threaded BFS over the InputSection graph. On a Release+Asserts clang link, markLive was ~315 ms of the 1562 ms wall time (20%).
The replacement is markParallel, a level-synchronized BFS. Each BFS level is processed with parallelFor; newly discovered sections land in per-thread queues, which are merged before the next level. The parallel path activates when !TrackWhyLive && partitions.size() == 1. Implementation details that turned out to matter:
- Depth-limited inline recursion (depth < 3) before pushing to the next-level queue. Shallow reference chains stay hot in cache and avoid queue overhead.
- Optimistic "load then compare-exchange" section-flag dedup instead of atomic fetch-or. The vast majority of sections are visited once, so the load almost always wins.
On the Release+Asserts clang link, markLive dropped from 315 ms to 82 ms at --threads=8 (from 199 ms to 50 ms at --threads=16); total wall time improved 1.16x–1.18x.
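A minimal sketch of the level-synchronized mark, assuming a bare Section graph (the names here are illustrative, not lld's actual types; the depth-limited inline recursion is omitted for brevity). It shows the two details above: per-thread next-level queues merged between levels, and the optimistic load-then-CAS visited flag.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for lld's InputSectionBase.
struct Section {
  std::atomic<uint8_t> live{0};   // mark flag, shared across workers
  std::vector<Section *> succ;    // outgoing references
};

// Optimistic "load then compare-exchange": most sections are reached
// exactly once, so the relaxed load usually answers without a RMW.
static bool tryMark(Section *s) {
  if (s->live.load(std::memory_order_relaxed))
    return false;
  uint8_t expected = 0;
  return s->live.compare_exchange_strong(expected, 1,
                                         std::memory_order_relaxed);
}

// Level-synchronized BFS: each level is split across workers via an
// atomic index; discoveries go into per-thread queues merged serially
// before the next level starts.
static void markParallel(std::vector<Section *> roots, unsigned nThreads) {
  std::vector<Section *> cur;
  for (Section *r : roots)
    if (tryMark(r))
      cur.push_back(r);
  while (!cur.empty()) {
    std::vector<std::vector<Section *>> next(nThreads);
    std::atomic<size_t> idx{0};
    auto worker = [&](unsigned tid) {
      for (size_t i; (i = idx.fetch_add(1)) < cur.size();)
        for (Section *s : cur[i]->succ)
          if (tryMark(s))
            next[tid].push_back(s);
    };
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nThreads; ++t)
      threads.emplace_back(worker, t);
    for (auto &t : threads)
      t.join();
    cur.clear();
    for (auto &q : next)            // serial merge between levels
      cur.insert(cur.end(), q.begin(), q.end());
  }
}
```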
Two prerequisite cleanups were needed for correctness:
- 6a874161621e moved Symbol::used into the existing std::atomic<uint16_t> flags. The bitfield was previously racing with other mark threads.
- 2118499a898b decoupled SharedFile::isNeeded from the mark walk. --as-needed used to flip isNeeded inside resolveReloc, which would have required coordinated writes across threads; it is now a post-GC scan of global symbols.
Parallelize input file loading
Historically, LinkerDriver::createFiles walked the command line and called addFile serially. addFile maps the file (MemoryBuffer::getFile), sniffs the magic, and constructs an ObjFile, SharedFile, BitcodeFile, or ArchiveFile. For thin archives it also materializes each member. On workloads with hundreds of archives and thousands of objects, this serial walk dominates the early part of the link.
The pending patch rewrites addFile to record a LoadJob for each non-script input together with a snapshot of the driver's state machine (inWholeArchive, inLib, asNeeded, withLOption, groupId). After createFiles finishes, loadFiles fans the jobs out to worker threads. Linker scripts stay on the main thread because INPUT() and GROUP() recursively call back into addFile.
A few subtleties made this harder than it sounds:
-
BitcodeFileand fatLTO construction callctx.saver/ctx.uniqueSaver, both of which arenon-thread-safeStringSaver/UniqueStringSaver. I serialized those constructors behind amutex; pure-ELF links hit it zero times. - Thin-archive member buffers used to be appended to
ctx.memoryBuffersdirectly. To keep the outputdeterministic across--threadsvalues, each job nowaccumulates into a per-jobSmallVectorwhich is merged intoctx.memoryBuffersin command-line order. -
InputFile::groupIdused to be assigned inside theInputFileconstructor from a global counter. With parallelconstruction the assignment race would have been unobservable but stillugly;b6c8cba516dahoists ++nextGroupIdinto the serial driver loop and storesthe value into each file after construction.
The output is byte-identical to the old lld and deterministic across --threads values, which I verified with diff across --threads={1,2,4,8} on Chromium.
A --time-trace breakdown is useful to set expectations. On Chromium, the serial portion of createFiles accounts for only ~81 ms of the 5.9 s wall, and loadFiles (after this patch) runs in ~103 ms in parallel. Serial readFile/mmap is not the bottleneck. What moves the needle is overlapping the per-file constructor work — magic sniffing, archive member materialization, bitcode initialization — with everything else that now kicks off on the main thread while workers chew through the job list.
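The record/replay split above can be sketched as follows. All names here (LoadJob, DriverState, loadOne) are illustrative simplifications, not lld's exact types; the point is that the serial driver loop snapshots option state per input, workers claim jobs with an atomic index, and results are written back by job position so the merged order matches the command line for any thread count.

```cpp
#include <atomic>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

struct DriverState {          // snapshot of the driver's state machine
  bool inWholeArchive = false;
  bool asNeeded = false;
  uint32_t groupId = 0;
};

struct LoadJob {
  std::string path;
  DriverState state;          // captured at this input's command-line position
};

// Stand-in for the expensive per-file work (mmap, magic sniffing, parsing).
static std::string loadOne(const LoadJob &job) {
  return job.path + (job.state.asNeeded ? " [as-needed]" : "");
}

// Fan jobs out to workers; results are indexed by job position, so the
// output order is deterministic regardless of --threads.
static std::vector<std::string> loadFiles(const std::vector<LoadJob> &jobs,
                                          unsigned nThreads) {
  std::vector<std::string> results(jobs.size());
  std::atomic<size_t> next{0};
  auto worker = [&] {
    for (size_t i; (i = next.fetch_add(1)) < jobs.size();)
      results[i] = loadOne(jobs[i]);
  };
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < nThreads; ++t)
    threads.emplace_back(worker);
  for (auto &t : threads)
    t.join();
  return results;
}
```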
Extending parallel relocation scanning
Relocation scanning has been parallel since LLVM 17, but three cases had opted out via bool serial:
- -z nocombreloc, because .rela.dyn merged relative and non-relative relocations and needed deterministic ordering.
- MIPS, because MipsGotSection is mutated during scanning.
- PPC64, because ctx.ppc64noTocRelax (a DenseSet of (Symbol*, offset) pairs) was written without a lock.
The first case now shares the sorting code path for .rela.dyn with combreloc=true; the only remaining effect of -z nocombreloc is suppressing DT_RELACOUNT. The PPC64 case now guards ctx.ppc64noTocRelax with the already-existing ctx.relocMutex, which is only taken on rare slow paths. After these changes, only MIPS still runs scanning serially.
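The PPC64 fix follows a common pattern worth spelling out: when writes happen only on a rare slow path, guarding them with an ordinary (and here, pre-existing) mutex adds essentially nothing to the hot path. A sketch, with Ctx and the set's element type simplified from lld's actual DenseSet:

```cpp
#include <cstdint>
#include <mutex>
#include <set>
#include <utility>

struct Ctx {
  std::mutex relocMutex;  // already exists for other slow paths
  // Simplified stand-in for DenseSet<std::pair<Symbol*, uint64_t>>.
  std::set<std::pair<const void *, uint64_t>> ppc64noTocRelax;
};

// Called only on the rare path where TOC relaxation must be inhibited;
// the common scanning path never touches the mutex.
void recordNoTocRelax(Ctx &ctx, const void *sym, uint64_t off) {
  std::lock_guard<std::mutex> lock(ctx.relocMutex);
  ctx.ppc64noTocRelax.insert({sym, off});
}
```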
Faster getSectionPiece
Merge sections (SHF_MERGE) split their input into "pieces". Every reference into a merge section needs to map an offset to a piece. The old implementation was always a binary search in MergeInputSection::pieces, called from markLive, includeInSymtab, and getRelocTargetVA.
- For non-string fixed-size merge sections, getSectionPiece uses offset / entsize directly.
- For non-section Defined symbols pointing into merge sections, the piece index is pre-resolved during splitSections and packed into Defined::value as ((pieceIdx + 1) << 32) | intraPieceOffset.
The binary search is now limited to references via section symbols (addend-based), which are common on AArch64 but rare on x86-64, where the assembler emits local labels for .L references into mergeable strings. The clang Release+Asserts link with --gc-sections is 1.05x as fast.
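The packing scheme is compact enough to sketch directly. The helper names here are illustrative (lld stores the value in Defined::value without dedicated accessors); the +1 bias lets an all-zero value still mean "no pre-resolved piece":

```cpp
#include <cstdint>

// Pack a pre-resolved piece index and intra-piece offset into one
// 64-bit value: ((pieceIdx + 1) << 32) | intraPieceOffset.
constexpr uint64_t packPiece(uint32_t pieceIdx, uint32_t intraPieceOffset) {
  return (uint64_t(pieceIdx + 1) << 32) | intraPieceOffset;
}
// Zero upper bits mean the piece was never resolved (the bias keeps
// index 0 distinguishable from "unresolved").
constexpr bool hasPiece(uint64_t v) { return (v >> 32) != 0; }
constexpr uint32_t getPieceIdx(uint64_t v) { return uint32_t(v >> 32) - 1; }
constexpr uint32_t getPieceOff(uint64_t v) { return uint32_t(v); }
```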
Optimizing the underlying llvm/lib/Support/Parallel.cpp
All of the wins above rely on llvm/lib/Support/Parallel.cpp, the tiny work-stealing-ish task runtime shared by lld, dsymutil, and a handful of debug-info tools. Four changes in that file mattered:
- c7b5f7c635e2 — parallelFor used to pre-split work into up to MaxTasksPerGroup (1024) tasks and spawn each through the executor's mutex + condvar. It now spawns only ThreadCount workers; each grabs the next chunk via an atomic fetch_add. On a clang-14 link (--threads=8), futex calls dropped from ~31K to ~1.4K (glibc release+asserts); wall time 927 ms → 879 ms. This is the reason the parallel mark and parallel scan numbers are worth quoting at all — on the old runtime, spawn overhead was a real fraction of the work being parallelized.
- 9085f74018a4 — TaskGroup::spawn() replaced the mutex-based Latch::inc() with an atomic fetch_add and passes the Latch& through Executor::add() so the worker calls dec() directly. Eliminates one std::function construction per spawn.
- 5b1be759295c — removed the Executor abstract base class. ThreadPoolExecutor was always the only implementation; add() and getThreadCount() are now direct calls instead of virtual dispatches.
- 8daaa26efdda — enables nested parallel TaskGroup via work-stealing. Historically, nested groups ran serially to avoid deadlock (the thread that was supposed to run a nested task might be blocked in the outer group's sync()). Worker threads now actively execute tasks from the queue while waiting, instead of just blocking. Root-level groups on the main thread keep the efficient blocking Latch::sync(), so the common non-nested case pays nothing. In lld this lets SyntheticSection::writeTo calls with internal parallelism (GdbIndexSection, MergeNoTailSection) parallelize automatically when called from inside OutputSection::writeTo, instead of degenerating to serial execution on a worker thread — which was the exact situation D131247 had worked around by threading a root TaskGroup all the way down.
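The fetch_add chunking idea from the first change can be sketched in a few lines (parallelForSketch and the chunk size are illustrative, not the real Parallel.cpp code): instead of pre-splitting the range into up to 1024 mutex-protected tasks, spawn one worker per thread and let each claim chunks with a single atomic increment.

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

static void parallelForSketch(size_t begin, size_t end, unsigned nThreads,
                              const std::function<void(size_t)> &fn) {
  const size_t chunk = 64;            // claim granularity; tune as needed
  std::atomic<size_t> next{begin};
  auto worker = [&] {
    // Each worker claims the next chunk with one fetch_add — no mutex,
    // no condvar, no per-task std::function.
    for (size_t lo; (lo = next.fetch_add(chunk)) < end;)
      for (size_t i = lo, hi = std::min(lo + chunk, end); i < hi; ++i)
        fn(i);
  };
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < nThreads; ++t)
    threads.emplace_back(worker);
  for (auto &t : threads)
    t.join();
}
```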
Small wins worth mentioning
036b755daedb parallelizes demoteAndCopyLocalSymbols. Each file collects local Symbol* pointers in a per-file vector via parallelFor, which are merged into the symbol table serially. Linking clang-14 (--no-gc-sections) with its 208K .symtab entries is 1.04x as fast.
Where lld still loses time
The benchmark makes several bottlenecks obvious; in rough order ofimpact on the Chromium debug link:
Input-file parsing (parseFiles). Reading section headers, building local symbol tables, splitting CIEs/FDEs out of .eh_frame, etc. On Chromium this is ~2.6 s in lld versus ~1.1 s in mold — roughly 80% of the remaining gap. It's already mostly parallel, so the difference is constant factors in the per-object parse path.
Symbol-table construction (.symtab / .dynsym). On clang-debug with --gdb-index, lld spends ~127 ms here versus mold's ~27 ms. I have a local branch that turns finalizeContents into a prefix-sum-driven parallel fill and replaces the old stable_partition + MapVector shuffle with per-file lateLocals buffers; 1640 ELF tests pass but I haven't posted it yet.
--gdb-index. +1.42 s on Chromium versus +1.12 s in mold. The work is embarrassingly parallel per input, but the current implementation funnels a lot of string interning through a single hash table. mold uses a lock-free ConcurrentMap sized by HyperLogLog; lld's sharded DenseMap is already competitive but not yet ahead.
.debug_* section writes. mold and wild parallelize section writes more aggressively; lld still writes several .debug_* sections on a single thread. This dominates the "Write sections" scope (lld 570 ms vs mold 334 ms on clang-debug).
Layout and section assignment. assignAddresses / finalizeAddressDependentContent is 176 ms in lld vs 58 ms in mold on clang-debug. Not a huge absolute number, but a 3x ratio on code that's conceptually simple.
wild is worth calling out separately: its user time is comparable to lld's but its system time is roughly half. mold is at the other extreme — the highest user time on every workload, bought back by aggressive parallelism.