Recent lld/ELF performance improvements
Since the LLVM 22 branch was cut, I've landed patches that parallelize more link phases and cut task-runtime overhead. This post compares current main against lld 22.1.
Headline: a Release+Asserts clang `--gc-sections` link is 1.37x as fast as lld 22.1; Chromium debug with `--gdb-index` is 1.07x as fast. mold and wild are still ahead; the last section explains why.
Benchmark
lld-0201 is main at 2026-02-01 (6a1803929817); lld-load is main plus the new [ELF] Parallelize input file loading. mold and wild run with `--no-fork` so the wall-clock numbers include the linker process itself.
Three reproduce tarballs, `--threads=8`, `hyperfine -w 1 -r 10`, pinned to CPU cores with `numactl -C`.
| Workload | lld-0201 | lld-load | mold | wild |
|---|---|---|---|---|
| clang-23 Release+Asserts, `--gc-sections` | 1.255 s | 917.8 ms | 552.6 ms | 367.2 ms |
| clang-23 Debug (no `--gdb-index`) | 4.582 s | 4.306 s | 2.464 s | 1.565 s |
| clang-23 Debug (`--gdb-index`) | 6.291 s | 5.915 s | 4.001 s | N/A |
| Chromium Debug (no `--gdb-index`) | 6.140 s | 5.904 s | 2.665 s | 2.010 s |
| Chromium Debug (`--gdb-index`) | 7.857 s | 7.322 s | 3.786 s | N/A |
Note that the llvm/lib/Support/Parallel.cpp design keeps the main thread idle during `parallelFor`, so `--threads=N` actually involves N+1 threads.
wild does not yet implement `--gdb-index`: it warns and skips the option, producing an output about 477 MB smaller on Chromium. For fair 4-way comparisons I also strip `--gdb-index` from the response file; the no `--gdb-index` rows above use that setup.
A few observations before diving in:
- The `--gdb-index` surcharge on the Chromium link is +1.42 s for lld (5.90 s → 7.32 s) versus +1.12 s for mold (2.67 s → 3.79 s). This is currently one of the biggest remaining gaps.
- Excluding `--gdb-index`, mold is 1.66x–2.22x as fast and wild 2.5x–2.94x as fast on this machine. There is plenty of room left.
- clang-23 Release+Asserts `--gc-sections` (workload 1) has collapsed from 1.255 s to 918 ms, a 1.37x speedup over 10 weeks. Most of that came from the parallel `--gc-sections` mark, parallel input loading, and the task-runtime cleanup below, each contributing a multiplicative factor.
macOS (Apple M4) notes
The same clang-23 Release+Asserts link, `--threads=8`, on an Apple M4 (macOS 15, system allocator for all four linkers):
| Linker | Wall | User | Sys | (User+Sys)/Wall |
|---|---|---|---|---|
| lld-0201 | 324.4 ± 1.5 ms | 502.1 ms | 171.7 ms | 2.08x |
| lld-load | 221.5 ± 1.8 ms | 476.5 ms | 368.8 ms | 3.82x |
| mold | 201.2 ± 1.7 ms | 875.1 ms | 220.5 ms | 5.44x |
| wild | 107.1 ± 0.5 ms | 456.8 ms | 284.6 ms | 6.92x |
Parallelize `--gc-sections` mark
Garbage collection had been a single-threaded BFS over the `InputSection` graph. On a Release+Asserts clang link, `markLive` was ~315 ms of the 1562 ms wall time (20%).
The replacement is `markParallel`, a level-synchronized BFS. Each BFS level is processed with `parallelFor`; newly discovered sections land in per-thread queues, which are merged before the next level. The parallel path activates when `!TrackWhyLive && partitions.size() == 1`. Implementation details that turned out to matter:
- Depth-limited inline recursion (`depth < 3`) before pushing to the next-level queue. Shallow reference chains stay hot in cache and avoid queue overhead.
- Optimistic "load then compare-exchange" section-flag dedup instead of an atomic fetch-or. The vast majority of sections are visited once, so the load almost always wins.
On the Release+Asserts clang link, `markLive` dropped from 315 ms to 82 ms at `--threads=8` (from 199 ms to 50 ms at `--threads=16`); total wall time improved 1.16x–1.18x.
Two prerequisite cleanups were needed for correctness:
- Commit 6a874161621e moved `Symbol::used` into the existing `std::atomic<uint16_t> flags`. The bitfield was previously racing with other mark threads.
- Commit 2118499a898b decoupled `SharedFile::isNeeded` from the mark walk. `--as-needed` used to flip `isNeeded` inside `resolveReloc`, which would have required coordinated writes across threads; it is now a post-GC scan of global symbols.
Parallelize input file loading
Historically, `LinkerDriver::createFiles` walked the command line and called `addFile` serially. `addFile` maps the file (`MemoryBuffer::getFile`), sniffs the magic, and constructs an `ObjFile`, `SharedFile`, `BitcodeFile`, or `ArchiveFile`. For thin archives it also materializes each member. On workloads with hundreds of archives and thousands of objects, this serial walk dominates the early part of the link.
The pending patch will rewrite `addFile` to record a `LoadJob` for each non-script input together with a snapshot of the driver's state machine (`inWholeArchive`, `inLib`, `asNeeded`, `withLOption`, `groupId`). After `createFiles` finishes, `loadFiles` fans the jobs out to worker threads. Linker scripts stay on the main thread because `INPUT()` and `GROUP()` recursively call back into `addFile`.
A few subtleties made this harder than it sounds:
- `BitcodeFile` and fatLTO construction call `ctx.saver` / `ctx.uniqueSaver`, both of which are non-thread-safe `StringSaver` / `UniqueStringSaver`. I serialized those constructors behind a mutex; pure-ELF links hit it zero times.
- Thin-archive member buffers used to be appended to `ctx.memoryBuffers` directly. To keep the output deterministic across `--threads` values, each job now accumulates into a per-job `SmallVector`, which is merged into `ctx.memoryBuffers` in command-line order.
- `InputFile::groupId` used to be assigned inside the `InputFile` constructor from a global counter. With parallel construction the assignment race would have been unobservable but still ugly; b6c8cba516daabced0105114a7bcc745bc52faae hoists `++nextGroupId` into the serial driver loop and stores the value into each file after construction.
The output is byte-identical to the old lld and deterministic across `--threads` values, which I verified with `diff` across `--threads={1,2,4,8}` on Chromium.
A `--time-trace` breakdown is useful to set expectations. On Chromium, the serial portion of `createFiles` accounts for only ~81 ms of the 5.9 s wall, and `loadFiles` (after this patch) runs in ~103 ms in parallel. Serial `readFile`/mmap is not the bottleneck. What moves the needle is overlapping the per-file constructor work (magic sniffing, archive member materialization, bitcode initialization) with everything else that now kicks off on the main thread while workers chew through the job list.
Extending parallel relocation scanning
Relocation scanning has been parallel since LLVM 17, but three cases had opted out via `bool serial`:
- `-z nocombreloc`, because `.rela.dyn` merged relative and non-relative relocations and needed deterministic ordering.
- MIPS, because `MipsGotSection` is mutated during scanning.
- PPC64, because `ctx.ppc64noTocRelax` (a `DenseSet` of `(Symbol*, offset)` pairs) was written without a lock.
The first case is gone: `.rela.dyn` is now always sorted as with `combreloc=true`; the only remaining effect of `-z nocombreloc` is suppressing `DT_RELACOUNT`. The PPC64 case now guards `ctx.ppc64noTocRelax` with the already-existing `ctx.relocMutex`, which is only taken on rare slow paths. After these changes, only MIPS still runs scanning serially.
Target-specific relocation scanning
Relocation scanning used to go through a generic loop in Relocations.cpp that called `Target->getRelExpr` through a virtual call for every relocation: once to classify the expression kind (PC-relative, PLT, TLS, etc.) and again from the TLS-optimization dispatch. On any realistic link that is a hot inner loop running over tens of millions of relocations, and the virtual call plus its post-dispatch switch are a real fraction of the cost.
The fix is to move the whole per-section scan loop into target-specific code, so each `Target::scanSection` / `scanSectionImpl` pair can inline its own `getRelExpr`, handle TLS optimization in place, and specialize for the two or three relocation kinds that dominate on that architecture. Rolled out across most backends in early 2026:
- 4b887533389c x86 (i386 / x86-64). On lld's own object files, `R_X86_64_PC32` and `R_X86_64_PLT32` make up ~95% of relocations and now hit an inlined hot path.
- 371e0e2082e9 AArch64, 4ea72c1e8cbd RISC-V, cd01e6526af6 LoongArch, c04b00de7508 ARM, 6d9169553029 Hexagon, aec1c984266c SystemZ, 5e87f8147d68 PPC32, aecc4997bf12 PPC64.
Besides devirtualization, inlining TLS relocation handling into `scanSectionImpl` let the TLS-optimization-specific expression kinds be replaced with general ones: `R_RELAX_TLS_GD_TO_LE` / `R_RELAX_TLS_LD_TO_LE` / `R_RELAX_TLS_IE_TO_LE` fold into `R_TPREL`, `R_RELAX_TLS_GD_TO_IE` folds into `R_GOT_PC`, and `getTlsGdRelaxSkip` goes away. What remains in the shared dispatch path (`getRelExpr` called from `relocateNonAlloc` and `relocateEH`) is a much smaller set.
Average "Scan relocations" wall time on a clang-14 link (`--threads=8`, x86-64, 50 runs, measured via `--time-trace`) drops from 110 ms to 102 ms, ~8% from the x86 commit alone.
Faster `getSectionPiece`
Merge sections (`SHF_MERGE`) split their input into "pieces". Every reference into a merge section needs to map an offset to a piece. The old implementation was always a binary search in `MergeInputSection::pieces`, called from `markLive`, `includeInSymtab`, and `getRelocTargetVA`.
The new scheme avoids the search in two common cases:

- For non-string fixed-size merge sections, `getSectionPiece` uses `offset / entsize` directly.
- For non-section `Defined` symbols pointing into merge sections, the piece index is pre-resolved during `splitSections` and packed into `Defined::value` as `((pieceIdx + 1) << 32) | intraPieceOffset`.
The binary search is now limited to references via section symbols (addend-based), which is common on AArch64 but rare on x86-64, where the assembler emits local labels for `.L` references into mergeable strings. The clang-relassert link with `--gc-sections` is 1.05x as fast.
Optimizing the underlying llvm/lib/Support/Parallel.cpp
All of the wins above rely on llvm/lib/Support/Parallel.cpp, the tiny work-stealing-ish task runtime shared by lld, dsymutil, and a handful of debug-info tools. Four changes in that file mattered:
- Commit c7b5f7c635e2: `parallelFor` used to pre-split work into up to `MaxTasksPerGroup` (1024) tasks and spawn each through the executor's mutex + condvar. It now spawns only `ThreadCount` workers; each grabs the next chunk via an atomic `fetch_add`. On a clang-14 link (`--threads=8`), futex calls dropped from ~31K to ~1.4K (glibc release+asserts); wall time 927 ms → 879 ms. This is the reason the parallel mark and parallel scan numbers are worth quoting at all; on the old runtime, spawn overhead was a real fraction of the work being parallelized.
- Commit 9085f74018a4: `TaskGroup::spawn()` replaced the mutex-based `Latch::inc()` with an atomic `fetch_add` and passes the `Latch&` through `Executor::add()` so the worker calls `dec()` directly. Eliminates one `std::function` construction per spawn.
- Commit 5b1be759295c: removed the `Executor` abstract base class. `ThreadPoolExecutor` was always the only implementation; `add()` and `getThreadCount()` are now direct calls instead of virtual dispatches.
- Commit 8daaa26efdda: enables nested parallel `TaskGroup` via work-stealing. Historically, nested groups ran serially to avoid deadlock (the thread that was supposed to run a nested task might be blocked in the outer group's `sync()`). Worker threads now actively execute tasks from the queue while waiting, instead of just blocking. Root-level groups on the main thread keep the efficient blocking `Latch::sync()`, so the common non-nested case pays nothing. In lld this lets `SyntheticSection::writeTo` calls with internal parallelism (`GdbIndexSection`, `MergeNoTailSection`) parallelize automatically when called from inside `OutputSection::writeTo`, instead of degenerating to serial execution on a worker thread, which was the exact situation D131247 had worked around by threading a root `TaskGroup` all the way down.
Small wins worth mentioning
Commit 036b755daedb parallelizes `demoteAndCopyLocalSymbols`. Each file collects local `Symbol *` pointers in a per-file vector via `parallelFor`, which are merged into the symbol table serially. Linking clang-14 (`--no-gc-sections`) with its 208K `.symtab` entries is 1.04x as fast.
Where lld still loses time
To locate the gap I ran `lld --time-trace`, `mold --perf`, and `wild --time` on the Chromium `--gdb-index` link (`--threads=8`). Grouped into comparable phases:
| Work scope | lld-0201 | lld-load | mold | wild |
|---|---|---|---|---|
| mmap + parse sections + merge strings + symbol resolve | 376 ms | 292 ms | 230 ms | 113 ms |
| `--gc-sections` mark | 268 ms | 79 ms | 30 ms | — * |
| Scan relocations | 106 ms | 97 ms | 60 ms | — * |
| Assign / finalize / symtab | 76 ms | 100 ms | 27 ms | 84 ms |
| Write sections | 87 ms | 87 ms | 90 ms | 110 ms |
| Wall (hyperfine) | 1255 ms | 918 ms | 553 ms | 367 ms |
\* wild fuses `--gc-sections` marking and relocation-driven live-section propagation into one "Find required sections" pass (60 ms), so these two rows are effectively merged.
A subtlety on wild's parse number: wild's "Load inputs into symbol DB" phase by itself is only 23 ms, but it does only mmap + `.symtab` scan + global-name hash bucketing. Section-header parsing, mergeable-string splitting, COMDAT handling, and symbol resolution are deferred to later wild phases. The 113 ms row above sums those (Load inputs into symbol DB 23 + Resolve symbols 12 + Section resolution 21 + Merge strings 57) so it covers the same work lld calls "Parse input files".
Meaningful gaps, in order of absolute impact:
Parse: lld-load 292 ms vs wild 113 ms ≈ 2.6x. The biggest remaining cross-linker gap on this workload, and the same pattern holds on the larger workloads. The phase is already parallel; the gap is a constant factor in the per-object parse path (reading section headers, interning strings, splitting CIEs/FDEs, merging globals into the symbol table). On clang-relassert the 179 ms parse gap alone accounts for ~33% of the 551 ms wall-clock gap between lld-load and wild.
Assign / finalize / symtab: 100 ms vs mold 27 ms ≈ 3.7x. `finalizeAddressDependentContent`, `assignAddresses`, `finalizeSynthetic`, "Add symbols to symtabs", and "Finalize .eh_frame" together cost ~100 ms on this workload; mold's equivalents (`compute_section_sizes`, `compute_symtab_size`, `create_output_sections`, `set_osec_offsets`) total 27 ms. This gap grows linearly with the number of `.symtab` entries: on clang-debug it's 127 ms lld vs 27 ms mold, on Chromium 570 ms vs ~80 ms. I have a local branch that turns `SymbolTableBaseSection::finalizeContents` into a prefix-sum-driven parallel fill and replaces the `stable_partition` + `MapVector` shuffle with per-file `lateLocals` buffers. 1640 ELF tests pass; not posted yet.
markLive: 79 ms, 3.4x faster than the Feb 1 baseline (268 ms). This is an apples-to-oranges comparison: lld supports `__start_`/`__stop_` edges, `SHF_LINK_ORDER` dependencies, linker script `KEEP`, and other features. lld correctly handles `--gc-sections --as-needed` with `Symbol::used` (tests gc-sections-shared.s, weak-shared-gc.s, as-needed-not-in-regular.s):
- mold over-approximates `DT_NEEDED` on two axes: it emits `DT_NEEDED` for DSOs referenced only via weak relocs, and for DSOs referenced only from GC'd sections. It also retains undefined symbols that are only reachable from dead sections in `.dynsym`.
- wild handles weak refs correctly but not dead-section refs: weak-only references do not force `DT_NEEDED` (matching lld), but DSOs referenced only from GC'd sections still get `DT_NEEDED` entries. wild does drop the corresponding undefined symbols from `.dynsym`, so its `DT_NEEDED` decision and its symtab-inclusion decision diverge slightly.
- lld is strictest on all three axes.
Scan relocations: 97 ms vs 60 ms. A clean 1.6x ratio, small in absolute terms. Target-specific scanning (the "Add target-specific relocation scanning for …" commits) removed some dispatch overhead; what remains is `InputSectionBase::relocations` overhead. wild folds relocation-driven liveness into "Find required sections", which is why there's no separate wild row.
Interestingly, writing section content is not a gap (87–110 ms across all four). The earlier assumption that `.debug_*` section writes were an lld weakness didn't survive measurement.
One cost that only shows up on debug-info-heavy workloads is `--gdb-index` construction, which lld does in ~1.3 s vs mold's ~0.9 s on Chromium. The work is embarrassingly parallel per input, but lld funnels string interning through a sharded `DenseMap`; mold uses a lock-free `ConcurrentMap` sized by HyperLogLog. wild does not yet implement `--gdb-index`.
wild is worth calling out separately: its user time is comparable tolld's but its system time is roughly half, and its parse phase is 4-8xfaster than either of the C++ linkers across all three workloads. moldis at the other extreme — the highest user time on every workload,bought back by aggressive parallelism.