
Recent lld/ELF performance improvements

By MaskRay
2026年4月12日 15:00

Since the LLVM 22 branch was cut, I've landed patches that parallelize more link phases, cut task-runtime overhead, and remove per-relocation hotspots. This post compares current main against lld 22.1, mold, and wild.

Headline: a Release+Asserts clang --gc-sections link is 1.37x as fast as lld 22.1; Chromium debug with --gdb-index is 1.07x as fast. mold and wild are still ahead — the last section explains why.

Benchmark

Three reproduce tarballs, --threads=8, hyperfine -w 1 -r 10, pinned to CPU cores with numactl -C. lld-0201 is main at 2026-02-01 (6a1803929817); lld-load is main plus the new [ELF] Parallelize input file loading. mold and wild run with --no-fork so the wall-clock numbers include the linker process itself.

Workload                                    lld-0201   lld-load   mold       wild
clang-23 Release+Asserts, --gc-sections     1.255 s    917.8 ms   552.6 ms   367.2 ms
clang-23 Debug (no --gdb-index)             4.582 s    4.306 s    2.464 s    1.565 s
clang-23 Debug (--gdb-index)                6.291 s    5.915 s    4.001 s    N/A
Chromium Debug (no --gdb-index)             6.140 s    5.904 s    2.665 s    2.010 s
Chromium Debug (--gdb-index)                7.857 s    7.322 s    3.786 s    N/A

Note that the llvm/lib/Support/Parallel.cpp design keeps the main thread idle during parallelFor, so --threads=N really occupies N+1 threads (N workers plus the blocked main thread).

wild does not yet implement --gdb-index — it warns and skips the index, producing an output about 477 MB smaller on Chromium. For fair 4-way comparisons I also strip --gdb-index from the response file; the no --gdb-index rows above use that setup.

A few observations before diving in:

  • The --gdb-index surcharge on the Chromium link is +1.42 s for lld (5.90 s → 7.32 s) versus +1.12 s for mold (2.67 s → 3.79 s). This is currently one of the biggest remaining gaps.
  • Excluding --gdb-index, mold is 1.66x–2.22x as fast and wild 2.5x–2.94x as fast on this machine. There is plenty of room left.
  • clang-23 Release+Asserts --gc-sections (workload 1) has dropped from 1.255 s to 918 ms, a 1.37x speedup over 10 weeks. Most of that came from the parallel --gc-sections mark, parallel input loading, and the task-runtime cleanup below — each contributing a multiplicative factor.

macOS (Apple M4) notes

The same clang-23 Release+Asserts link, --threads=8, on an Apple M4 (macOS 15, system allocator for all four linkers):

Linker     Wall             User       Sys        (User+Sys)/Wall
lld-0201   324.4 ± 1.5 ms   502.1 ms   171.7 ms   2.08x
lld-load   221.5 ± 1.8 ms   476.5 ms   368.8 ms   3.82x
mold       201.2 ± 1.7 ms   875.1 ms   220.5 ms   5.44x
wild       107.1 ± 0.5 ms   456.8 ms   284.6 ms   6.92x

Parallelize the --gc-sections mark

Garbage collection had been a single-threaded BFS over the InputSection graph. On a Release+Asserts clang link, markLive was ~315 ms of the 1562 ms wall time (20%).

6f9646a598f2 adds markParallel, a level-synchronized BFS. Each BFS level is processed with parallelFor; newly discovered sections land in per-thread queues, which are merged before the next level. The parallel path activates when !TrackWhyLive && partitions.size() == 1. Implementation details that turned out to matter:

  • Depth-limited inline recursion (depth < 3) before pushing to the next-level queue. Shallow reference chains stay hot in cache and avoid queue overhead.
  • Optimistic "load then compare-exchange" section-flag dedup instead of atomic fetch-or. The vast majority of sections are visited once, so the load almost always wins.

On the Release+Asserts clang link, markLive dropped from 315 ms to 82 ms at --threads=8 (from 199 ms to 50 ms at --threads=16); total wall time improved 1.16x–1.18x.

Two prerequisite cleanups were needed for correctness:

  • 6a874161621e moved Symbol::used into the existing std::atomic<uint16_t> flags. As a plain bitfield it would have raced between mark threads.
  • 2118499a898b decoupled SharedFile::isNeeded from the mark walk. --as-needed used to flip isNeeded inside resolveReloc, which would have required coordinated writes across threads; it is now a post-GC scan of global symbols.

Parallelize input file loading

Historically, LinkerDriver::createFiles walked the command line and called addFile serially. addFile maps the file (MemoryBuffer::getFile), sniffs the magic, and constructs an ObjFile, SharedFile, BitcodeFile, or ArchiveFile. For thin archives it also materializes each member. On workloads with hundreds of archives and thousands of objects, this serial walk dominates the early part of the link.

The pending patch will rewrite addFile to record a LoadJob for each non-script input together with a snapshot of the driver's state machine (inWholeArchive, inLib, asNeeded, withLOption, groupId). After createFiles finishes, loadFiles fans the jobs out to worker threads. Linker scripts stay on the main thread because INPUT() and GROUP() recursively call back into addFile.

A few subtleties made this harder than it sounds:

  • BitcodeFile and fatLTO construction call ctx.saver / ctx.uniqueSaver, both of which are non-thread-safe (StringSaver / UniqueStringSaver). I serialized those constructors behind a mutex; pure-ELF links hit it zero times.
  • Thin-archive member buffers used to be appended to ctx.memoryBuffers directly. To keep the output deterministic across --threads values, each job now accumulates into a per-job SmallVector, which is merged into ctx.memoryBuffers in command-line order.
  • InputFile::groupId used to be assigned inside the InputFile constructor from a global counter. With parallel construction the assignment race would have been unobservable but still ugly; b6c8cba516da hoists ++nextGroupId into the serial driver loop and stores the value into each file after construction.

The output is byte-identical to the old lld's and deterministic across --threads values, which I verified with diff across --threads={1,2,4,8} on Chromium.

A --time-trace breakdown is useful to set expectations. On Chromium, the serial portion of createFiles accounts for only ~81 ms of the 5.9 s wall, and loadFiles (after this patch) runs in ~103 ms in parallel. Serial readFile/mmap is not the bottleneck. What moves the needle is overlapping the per-file constructor work — magic sniffing, archive member materialization, bitcode initialization — with everything else that now kicks off on the main thread while workers chew through the job list.

Extending parallel relocation scanning

Relocation scanning has been parallel since LLVM 17, but three cases had opted out via bool serial:

  1. -z nocombreloc, because .rela.dyn merged relative and non-relative relocations and needed deterministic ordering.
  2. MIPS, because MipsGotSection is mutated during scanning.
  3. PPC64, because ctx.ppc64noTocRelax (a DenseSet of (Symbol*, offset) pairs) was written without a lock.

076226f378df and dc4df5da886e separate relative and non-relative dynamic relocations unconditionally and always build .rela.dyn with combreloc=true; the only remaining effect of -z nocombreloc is suppressing DT_RELACOUNT. 2f7bd4fa9723 then protects ctx.ppc64noTocRelax with the already-existing ctx.relocMutex, which is only taken on rare slow paths. After these changes, only MIPS still runs scanning serially.

Faster getSectionPiece

Merge sections (SHF_MERGE) split their input into "pieces". Every reference into a merge section needs to map an offset to a piece. The old implementation was always a binary search in MergeInputSection::pieces, called from MarkLive, includeInSymtab, and getRelocTargetVA.

42cc45477727 changes this in two ways:

  1. For non-string fixed-size merge sections, getSectionPiece uses offset / entsize directly.
  2. For non-section Defined symbols pointing into merge sections, the piece index is pre-resolved during splitSections and packed into Defined::value as ((pieceIdx + 1) << 32) | intraPieceOffset.

The binary search is now limited to references via section symbols (addend-based), which are common on AArch64 but rare on x86-64, where the assembler emits local labels for .L references into mergeable strings. The clang-relassert link with --gc-sections is 1.05x as fast.

Optimizing the underlying llvm/lib/Support/Parallel.cpp

All of the wins above rely on llvm/lib/Support/Parallel.cpp, the tiny work-stealing-ish task runtime shared by lld, dsymutil, and a handful of debug-info tools. Four changes in that file mattered:

  • c7b5f7c635e2 — parallelFor used to pre-split work into up to MaxTasksPerGroup (1024) tasks and spawn each through the executor's mutex + condvar. It now spawns only ThreadCount workers; each grabs the next chunk via an atomic fetch_add. On a clang-14 link (--threads=8), futex calls dropped from ~31K to ~1.4K (glibc release+asserts); wall time 927 ms → 879 ms. This is the reason the parallel mark and parallel scan numbers are worth quoting at all — on the old runtime, spawn overhead was a real fraction of the work being parallelized.
  • 9085f74018a4 — TaskGroup::spawn() replaced the mutex-based Latch::inc() with an atomic fetch_add and passes the Latch& through Executor::add() so the worker calls dec() directly. This eliminates one std::function construction per spawn.
  • 5b1be759295c — removed the Executor abstract base class. ThreadPoolExecutor was always the only implementation; add() and getThreadCount() are now direct calls instead of virtual dispatches.
  • 8daaa26efdda — enables nested parallel TaskGroup via work stealing. Historically, nested groups ran serially to avoid deadlock (the thread that was supposed to run a nested task might be blocked in the outer group's sync()). Worker threads now actively execute tasks from the queue while waiting, instead of just blocking. Root-level groups on the main thread keep the efficient blocking Latch::sync(), so the common non-nested case pays nothing. In lld this lets SyntheticSection::writeTo calls with internal parallelism (GdbIndexSection, MergeNoTailSection) parallelize automatically when called from inside OutputSection::writeTo, instead of degenerating to serial execution on a worker thread — which was the exact situation D131247 had worked around by threading a root TaskGroup all the way down.

Small wins worth mentioning

  • 036b755daedb parallelizes demoteAndCopyLocalSymbols. Each file collects local Symbol* pointers in a per-file vector via parallelFor, which are merged into the symbol table serially. Linking clang-14 (--no-gc-sections) with its 208K .symtab entries is 1.04x as fast.

Where lld still loses time

The benchmark makes several bottlenecks obvious; in rough order of impact on the Chromium debug link:

Input-file parsing (parseFiles). Reading section headers, building local symbol tables, splitting CIEs/FDEs out of .eh_frame, etc. On Chromium this is ~2.6 s in lld versus ~1.1 s in mold — roughly 80% of the remaining gap. It's already mostly parallel, so the difference is constant factors in the per-object parse path.

Symbol-table construction (.symtab / .dynsym). On clang-debug with --gdb-index, lld spends ~127 ms here versus mold's ~27 ms. I have a local branch that turns finalizeContents into a prefix-sum-driven parallel fill and replaces the old stable_partition + MapVector shuffle with per-file lateLocals buffers; 1640 ELF tests pass but I haven't posted it yet.

--gdb-index. +1.42 s on Chromium versus +1.12 s in mold. The work is embarrassingly parallel per input, but the current implementation funnels a lot of string interning through a single hash table. mold uses a lock-free ConcurrentMap sized by HyperLogLog; lld's sharded DenseMap is already competitive but not yet ahead.

.debug_* section writes. mold and wild parallelize section writes more aggressively; lld still writes several .debug_* sections on a single thread. This dominates the "Write sections" scope (lld 570 ms vs mold 334 ms on clang-debug).

Layout and section assignment. assignAddresses / finalizeAddressDependentContent is 176 ms in lld vs 58 ms in mold on clang-debug. Not a huge absolute number, but a 3x ratio on code that's conceptually simple.

wild is worth calling out separately: its user time is comparable to lld's but its system time is roughly half. mold is at the other extreme — the highest user time on every workload, bought back by aggressive parallelism.
