For those unfamiliar, lld is the LLVM linker, supporting PE/COFF, ELF, Mach-O, and WebAssembly ports. These object file formats differ significantly, and each port must follow the conventions of the platform's system linker. As a result, the ports share limited code (diagnostics, memory allocation, etc.) and have largely separate reviewer groups.
With LLVM 22.1 releasing soon, I've added some notes to the https://github.com/llvm/llvm-project/blob/release/22.x/lld/docs/ReleaseNotes.rst as an lld/ELF maintainer. As usual, I've reviewed almost all the patches not authored by me.
For the first time, I used an LLM agent (Claude Code) to help look through commits (git log release/21.x..release/22.x -- lld/ELF) and draft the release notes. Despite my request to only read lld/ELF changes, Claude Code also crafted notes for other ports, which I retained since their release notes had been quite sparse for several releases. Changes backported to the 21.x release are removed (git log --oneline llvmorg-22-init..llvmorg-21.1.8 -- lld).
I'll delve into some of the key changes.
--print-gc-sections=<file> has been added to redirect the garbage collection section listing to a file, avoiding contamination of stdout with other linker output. (#159706)

A VersionNode lexer state has been added for better version script parsing. This brings the lexer behavior closer to GNU ld. (#174530)

Unversioned undefined symbols now use version index 0, aligning with GNU ld 2.46 behavior. (#168189)

.data.rel.ro.hot and .data.rel.ro.unlikely are now recognized as RELRO sections, allowing profile-guided static data partitioning. (#148920)

DTLTO now supports archive members and bitcode members of thin archives. (#157043)

For DTLTO, --thinlto-remote-compiler-prepend-arg=<arg> has been added to prepend an argument to the remote compiler's command line. (#162456)

Balanced Partitioning (BP) section ordering now skips input sections with null data and filters out section symbols. (#149265) (#151685)

For AArch64, fixed a crash when using --fix-cortex-a53-843419 with synthetic sections and improved handling when patched code is far from the short jump. (#170495)

For AArch64, added support for the R_AARCH64_FUNCINIT64 dynamic relocation type for relocating word-sized data using the return value of a function. (#156564)

For AArch64, added support for the R_AARCH64_PATCHINST relocation type to support deactivation symbols. (#133534)

For AArch64, added support for reading AArch64 Build Attributes and converting them into GNU Properties. (#147970)

For ARM, fixed incorrect veneer generation for wraparound branches at the high end of the 32-bit address space branching to the low end. (#165263)

For LoongArch, -r now synthesizes R_LARCH_ALIGN at input section start to preserve alignment information. (#153935)

For LoongArch, added relocation types for LA32R/LA32S. (#172618) (#176312)

For RISC-V, added infrastructure for handling vendor-specific relocations. (#159987)

For RISC-V, added support for statically resolved vendor-specific relocations. (#169273)

For RISC-V, -r now synthesizes R_RISCV_ALIGN at input section start to preserve alignment information during two-stage linking. (#151639) This is an interesting relocatable linking challenge for linker relaxation.
Besides me, Peter Smith (smithp35) and Jessica Clarke (jrtc27) have done a lot of reviews.

jrtc27 has done great work simplifying the dynamic relocation system, which is highly appreciated.
I should call out https://github.com/llvm/llvm-project/pull/172618: for this relatively large addition, the author and approver are from the same company and are contributing to their own architecture, and neither the author nor the approver is a regular lld contributor/reviewer. The author did not request review from regular reviewers and landed the patch just 3 minutes after their colleague's approval. I left a comment asking to keep the PR open for other maintainers to review.
Distributed ThinLTO
Distributed ThinLTO (DTLTO) enables distributing ThinLTO backend compilations to external systems (e.g., Incredibuild, distcc-like tools) during the link step. This feature was contributed by PlayStation, which had offered it as a proprietary technology before upstreaming.

The traditional distributed ThinLTO is implemented in Bazel and Buck2. Bazel-style distribution (build system orchestrated) uses a multi-step workflow:
```sh
# Compile to bitcode (made parallel by build system)
clang -c -O2 -flto=thin a.c b.c
# Thin link
clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp,--thinlto-emit-imports-files -Wl,--thinlto-prefix-replace=';lto/' a.o b.o
# Backend compilation (distributed by build system) with dynamic dependencies
clang -c -O2 -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o
clang -c -O2 -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o
# Final native link
clang -fuse-ld=lld @a.rsp  # a.rsp contains lto/a.o and lto/b.o
```
The build system sees the index files from step 2 as outputs and schedules step 3 jobs across the build cluster. This requires a build system that handles dynamic dependencies: outputs of step 2 determine inputs to step 3.

DTLTO (linker orchestrated) integrates steps 2-4 into a single link invocation:

LLD performs the thin link internally, generates a JSON job description for each backend compilation, invokes the distributor process, waits for the native objects, and links them. The distributor is responsible for farming out the compilations to remote machines.

DTLTO works with any build system but requires a separate distributor process that speaks its JSON protocol. DTLTO is essentially "ThinLTO distribution for projects that don't use Bazel".
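The distributor contract can be pictured with a toy stand-in that runs every job locally instead of on a cluster. This is a sketch only: the JSON field names ("common", "jobs", "args") are my assumptions for illustration, not DTLTO's actual schema.

```python
import json
import subprocess

def run_jobs(job_file: str, dry_run: bool = False):
    """Read a (hypothetical) job description and run each backend compilation.

    A real distributor would farm these commands out to remote machines;
    this sketch just executes them locally, or merely collects them when
    dry_run is set.
    """
    with open(job_file) as f:
        desc = json.load(f)
    cmds = []
    for job in desc["jobs"]:
        # Assumed layout: shared arguments plus per-job arguments.
        cmd = desc["common"]["args"] + job["args"]
        cmds.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return cmds
```

From the linker's perspective the distributor is a black box: it is handed a job file, and the link proceeds once the native objects exist.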
Pointer Field Protection
R_AARCH64_PATCHINST is a static relocation type used with Pointer Field Protection (PFP), which leverages Armv8.3-A Pointer Authentication (PAC) to protect pointer fields in structs.
Consider the following C++ code:
```cpp
struct cls {
  ~cls();
  long *ptr;
private:
  long *ptr2;
};

long *load(cls *c) { return c->ptr; }
void store(cls *c, long *ptr) { c->ptr = ptr; }
```
With Pointer Field Protection enabled, the compiler generates PAC instructions to sign and authenticate pointers:
```
load:
  ldr   x8, [x0]   // Load the PAC-signed pointer from c->ptr
  autda x8, x0     // Authenticate and strip the PAC, R_AARCH64_PATCHINST __pfp_ds__ZTS3cls.ptr
  mov   x0, x8
  ret

store:
  pacda x1, x0     // Sign ptr using c as a discriminator, R_AARCH64_PATCHINST __pfp_ds__ZTS3cls.ptr
  str   x1, [x0]
  ret
```
Each PAC instruction is associated with an R_AARCH64_PATCHINST relocation referencing a deactivation symbol (the __pfp_ds_ prefix stands for "pointer field protection deactivation symbol"). By default, __pfp_ds__ZTS3cls.ptr is an undefined weak symbol in every relocatable file.

However, if the field's address escapes in any translation unit (e.g., someone takes &c->ptr), the compiler defines the deactivation symbol as an absolute symbol (ELF SHN_ABS). When the linker sees a defined deactivation symbol, it patches the PAC instruction to a NOP (R_AARCH64_PATCHINST acts as R_AARCH64_ABS64 when the referenced symbol is defined), disabling the protection for that field. This is necessary because external code could modify the pointer without signing it, which would cause authentication failures.

The linker allows duplicate definitions of absolute symbols if the values are identical.

R_AARCH64_FUNCINIT64 is a related static relocation type that produces an R_AARCH64_IRELATIVE dynamic relocation (GNU indirect function). It initializes function pointers in static data at load time by calling a resolver function that returns the PAC-signed address.

PFP is AArch64-specific because it relies on Pointer Authentication (PAC), a hardware feature introduced in Armv8.3-A. PAC provides dedicated instructions (pacda, autda, etc.) that cryptographically sign pointers using keys stored in system registers. x86-64 lacks an equivalent mechanism: Intel CET provides shadow stacks and indirect branch tracking for control-flow integrity, but cannot sign arbitrary data pointers stored in memory.
Takeaways:
Security features need linker support. This is because many features require aggregated information across all translation units. In this case, if any TU exposes a field's address, the linker disables protection for this field everywhere. The implementation is usually lightweight.
Relocations can do more than fill in addresses: R_AARCH64_PATCHINST conditionally patches instructions to NOPs based on symbol resolution. This is a different paradigm from traditional "compute address, write it" relocations.
RISC-V vendor relocations
RISC-V's openness encourages vendors to add custom instructions. Qualcomm has the uC extensions for their microcontrollers; CHERIoT adds capability-based security.
The RISC-V psABI adopted the vendor relocation system:
The R_RISCV_VENDOR marker identifies the vendor namespace via its symbol reference. The subsequent relocation uses a vendor-specific type number that only makes sense within that namespace. Different vendors can reuse the same type numbers without conflict.
In lld 22:
Infrastructure for vendor relocations was added (#159987). The implementation folds vendor namespace information into the upper bits of RelType, allowing existing relocation processing code to work with minimal changes.
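The upper-bits folding idea can be illustrated with a tiny sketch. The 32-bit shift and helper names here are my assumptions for illustration, not lld's internal encoding.

```python
# Illustrative only: pack a vendor namespace identifier into the upper bits
# of a relocation type value, so a single integer still identifies the
# relocation and existing dispatch code keeps working.
VENDOR_SHIFT = 32  # assumed width; lld's actual layout may differ

def make_vendor_reltype(vendor_id: int, rel_type: int) -> int:
    # Vendor-specific type numbers only make sense within their namespace,
    # so two vendors may reuse the same rel_type without colliding here.
    assert 0 <= rel_type < (1 << VENDOR_SHIFT)
    return (vendor_id << VENDOR_SHIFT) | rel_type

def split_vendor_reltype(r: int) -> tuple[int, int]:
    return r >> VENDOR_SHIFT, r & ((1 << VENDOR_SHIFT) - 1)
```

With this scheme, two different vendors' relocation number 0xC0 map to distinct combined values, which is the property the marker-relocation design needs.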
Support for statically resolved vendor relocations was added (#169273), including Qualcomm and Andes relocation types. The patch landed without involving the regular lld/ELF reviewer pool. For changes that set architectural precedents, broader consensus should be sought before merging. I've commented on this.

The RISC-V toolchain conventions document the vendor relocation scheme.

There's a maintainability concern: accepting vendor-specific relocations into the core linker sets a precedent. RISC-V is uniquely fragmented compared to other LLVM backends; x86, AArch64, PowerPC, and others don't have nearly as many vendors adding custom instructions and relocations. This fragmentation is a direct consequence of RISC-V's open nature and extensibility, but it creates new challenges for upstream toolchain maintainers. Accumulated vendor-specific code could become a significant maintenance burden.
GNU ld compatibility
Large corporate users of lld/ELF don't care about GNU ld compatibility. They add features for their own use cases and move on. I diligently coordinate with binutils maintainers and file feature requests when appropriate. When lld implements a new option or behavior, I often file corresponding GNU ld feature requests to keep the tools aligned.

This coordination work is largely invisible but essential for the broader toolchain ecosystem. Users benefit when they can switch between linkers without surprises.
Branch instructions on most architectures use PC-relative addressing with a limited range. When the target is too far away, the branch becomes "out of range" and requires special handling.

Consider a large binary where main() at address 0x10000 calls foo() at address 0x8010000, over 128MiB away. On AArch64, the bl instruction can only reach ±128MiB, so this call cannot be encoded directly. Without proper handling, the linker would fail with an error like "relocation out of range". The toolchain must handle this transparently to produce correct executables.
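As a sanity check, the reachability rule can be written down directly. This is a sketch with a helper name of my own; it models AArch64's bl as a signed 26-bit, word-scaled displacement.

```python
def in_branch_range(pc: int, target: int, imm_bits: int, scale: int = 4) -> bool:
    """Can a PC-relative branch with a signed `imm_bits`-bit immediate,
    scaled by `scale` bytes, reach `target` from `pc`?"""
    disp = target - pc
    lo = -(1 << (imm_bits - 1)) * scale          # most negative reach
    hi = ((1 << (imm_bits - 1)) - 1) * scale     # most positive reach
    return lo <= disp <= hi and disp % scale == 0

# bl encodes a 26-bit immediate scaled by 4: reach is -128MiB .. +128MiB-4,
# so the +128MiB displacement in the example above is just out of range.
print(in_branch_range(0x10000, 0x8010000, 26))  # False
```

The asymmetry (the positive limit is one instruction short of 128MiB) is exactly why the example call cannot be encoded.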
This article explores how compilers, assemblers, and linkers work together to solve the long branch problem.

Compiler (IR to assembly): Handles branches within a function that exceed the range of conditional branch instructions

Assembler (assembly to relocatable file): Handles branches within a section where the distance is known at assembly time

Linker: Handles cross-section and cross-object branches discovered during final layout
Branch range limitations
Different architectures have different branch range limitations. Here's a quick comparison of conditional / unconditional branch ranges:
| Architecture | Cond | Uncond | Call | Notes |
|---|---|---|---|---|
| AArch64 | ±1MiB | ±128MiB | ±128MiB | Thunks |
| AArch32 (A32) | ±32MiB | ±32MiB | ±32MiB | Thunks, interworking |
| AArch32 (T32) | ±1MiB | ±16MiB | ±16MiB | Thunks, interworking |
| LoongArch | ±128KiB | ±128MiB | ±128MiB | Linker relaxation |
| M68k (68020+) | ±2GiB | ±2GiB | ±2GiB | Assembler picks size |
| MIPS (pre-R6) | ±128KiB | ±128KiB (b offset) | ±128KiB (bal offset) | In -fno-pic code, pseudo-absolute j/jal can be used for a 256MiB region. |
| MIPS R6 | ±128KiB | ±128MiB | ±128MiB | |
| PowerPC64 | ±32KiB | ±32MiB | ±32MiB | Thunks |
| RISC-V | ±4KiB | ±1MiB | ±1MiB | Linker relaxation |
| SPARC | ±1MiB | ±8MiB | ±2GiB | No thunks needed |
| SuperH | ±256B | ±4KiB | ±4KiB | Use register-indirect if needed |
| x86-64 | ±2GiB | ±2GiB | ±2GiB | Large code model changes call sequence |
| Xtensa | ±2KiB | ±128KiB | ±512KiB | Linker relaxation |
| z/Architecture | ±64KiB | ±4GiB | ±4GiB | No thunks needed |
The following subsections provide detailed per-architecture information, including relocation types relevant for linker implementation.
AArch32
In A32 state:
Branch (b/b<cond>), conditional branch and link (bl<cond>) (R_ARM_JUMP24): ±32MiB

Unconditional branch and link (bl/blx, R_ARM_CALL): ±32MiB

Note: R_ARM_CALL is for unconditional bl/blx, which can be relaxed to BLX inline; R_ARM_JUMP24 is for branches which require a veneer for interworking.
AArch64

The compiler's BranchRelaxation pass handles out-of-range conditional branches by inverting the condition and inserting an unconditional branch. The AArch64 assembler does not perform branch relaxation; out-of-range branches produce linker errors if not handled by the compiler.
LoongArch

Medium range call (pcaddu12i+jirl, R_LARCH_CALL30): ±2GiB

Long range call (pcaddu18i+jirl, R_LARCH_CALL36): ±128GiB
M68k
Short branch (Bcc.B/BRA.B/BSR.B): ±128 bytes (8-bit displacement)

Word branch (Bcc.W/BRA.W/BSR.W): ±32KiB (16-bit displacement)

Long branch (Bcc.L/BRA.L/BSR.L, 68020+): ±2GiB (32-bit displacement)

GNU Assembler provides pseudo-opcodes (jbsr, jra, jXX) that "automatically expand to the shortest instruction capable of reaching the target". For example, jeq .L0 emits one of beq.b, beq.w, and beq.l depending on the displacement.

With the long forms available on 68020 and later, M68k doesn't need linker range extension thunks.
MIPS

Pseudo-absolute jump/call (j/jal, R_MIPS_26): branch within the current 256MiB region, only suitable for -fno-pic code. Deprecated in R6 in favor of bc/balc
Compiler long branch handling: Both GCC (mips_output_conditional_branch) and LLVM (MipsBranchExpansion) handle out-of-range conditional branches by inverting the condition and inserting an unconditional jump.
lld implements LA25 thunks for MIPS PIC/non-PIC interoperability, but not range extension thunks. GNU ld also does not implement range extension thunks for MIPS.
In -mno-abicalls mode, GCC's -mlong-calls option (added in March 1993) generates indirect call sequences that can reach any address.
RISC-V

The Go compiler emits a single jal for calls and relies on its linker to generate trampolines when the target is out of range.
In contrast, GCC and Clang emit auipc+jalr and rely on linker relaxation to shrink the sequence when possible.

The jal range (±1MiB) is notably smaller than other RISC architectures (AArch64 ±128MiB, PowerPC64 ±32MiB, LoongArch ±128MiB). This limits the effectiveness of linker relaxation ("start large and shrink"), and leads to frequent trampolines when the compiler optimistically emits jal ("start small and grow").
SPARC
Compare and branch (cxbe, R_SPARC_5): ±64 bytes

Conditional branch (bcc, R_SPARC_WDISP19): ±1MiB

Unconditional branch (b, R_SPARC_WDISP22): ±8MiB

call (R_SPARC_WDISP30/R_SPARC_WPLT30): ±2GiB

With a ±2GiB range for call, SPARC doesn't need range extension thunks in practice.
SuperH
SuperH uses fixed-width 16-bit instructions, which limits branch ranges.

Branch to subroutine (bsr): ±4KiB (12-bit displacement)

For longer distances, register-indirect branches (braf/bsrf) are used. The compiler inverts conditions and emits these when targets exceed the short ranges.
SuperH is supported by GCC and binutils, but not by LLVM.
Xtensa
Xtensa uses variable-length instructions: 16-bit (narrow, .n suffix) and 24-bit (standard).

Narrow conditional branch (beqz.n/bnez.n, 16-bit): -28 to +35 bytes (6-bit signed + 4)

Conditional branch (compare two registers) (beq/bne/blt/bge/etc, 24-bit): ±256 bytes

Conditional branch (compare with zero) (beqz/bnez/bltz/bgez, 24-bit): ±2KiB

Unconditional jump (j, 24-bit): ±128KiB

Call (call0/call4/call8/call12, 24-bit): ±512KiB

The assembler performs branch relaxation: when a conditional branch target is too far, it inverts the condition and inserts a j instruction.

Per https://www.sourceware.org/binutils/docs/as/Xtensa-Call-Relaxation.html, for calls, GNU Assembler pessimistically generates indirect sequences (l32r+callx8) when the target distance is unknown. GNU ld then performs linker relaxation.
x86-64
Short conditional jump (Jcc rel8): -128 to +127 bytes

Short unconditional jump (JMP rel8): -128 to +127 bytes
Near conditional jump (Jcc rel32): ±2GiB
Near unconditional jump (JMP rel32): ±2GiB
With a ±2GiB range for near jumps, x86-64 rarely encounters out-of-range branches in practice. That said, Google and Meta Platforms deploy mostly statically linked executables on x86-64 production servers and have run into the huge executable problem for certain configurations.
z/Architecture
Short conditional branch (BRC, R_390_PC16DBL): ±64KiB (16-bit halfword displacement)

Long conditional branch (BRCL, R_390_PC32DBL): ±4GiB (32-bit halfword displacement)

Short call (BRAS, R_390_PC16DBL): ±64KiB

Long call (BRASL, R_390_PC32DBL): ±4GiB

With a ±4GiB range for the long forms, z/Architecture doesn't need linker range extension thunks. LLVM's SystemZLongBranch pass relaxes short branches (BRC/BRAS) to long forms (BRCL/BRASL) when targets are out of range.
Compiler: branch range handling

Conditional branch instructions usually have shorter ranges than unconditional ones, making them less suitable for linker thunks (as we will explore later). Compilers typically keep conditional branch targets within the same section, allowing the compiler to handle out-of-range cases via branch relaxation.

Within a function, conditional branches may still go out of range. The compiler measures branch distances and relaxes out-of-range branches by inverting the condition and inserting an unconditional branch:
```
# Before relaxation (out of range)
beq .Lfar_target    # ±4KiB range on RISC-V

# After relaxation
bne .Lskip          # Inverted condition, short range
j   .Lfar_target    # Unconditional jump, ±1MiB range
.Lskip:
```
Some architectures have conditional branch instructions that compare with an immediate, with even shorter ranges due to encoding additional immediates. For example, AArch64's cbz/cbnz (compare and branch if zero/non-zero) have a ±1MiB range, and tbz/tbnz (test bit and branch) have only a ±32KiB range. RISC-V Zibi beqi/bnei have a ±4KiB range. The compiler handles these in a similar way:
```
// Before relaxation (cbz target out of range)
cbz w0, far

// After relaxation
cbnz w0, .Lskip   // Inverted condition
b    far          // Unconditional branch, ±128MiB range
.Lskip:
```
An Intel employee contributed https://reviews.llvm.org/D41634 (in 2017) to handle cases where inverting a branch condition is impossible. This is for an out-of-tree backend; as of January 2026 there is no in-tree test for this code path.
In LLVM, this is handled by the BranchRelaxation pass, which runs just before AsmPrinter. Different backends have their own implementations:
BranchRelaxation: AArch64, AMDGPU, AVR, RISC-V
HexagonBranchRelaxation: Hexagon
PPCBranchSelector: PowerPC
SystemZLongBranch: SystemZ
MipsBranchExpansion: MIPS
MSP430BSel: MSP430
The generic BranchRelaxation pass computes block sizes and offsets, then iterates until all branches are in range. For conditional branches, it tries to invert the condition and insert an unconditional branch. For unconditional branches that are still out of range, it calls TargetInstrInfo::insertIndirectBranch to emit an indirect jump sequence (e.g., adrp+add+br on AArch64) or a long jump sequence (e.g., a pseudo jump on RISC-V).

Note: The size estimates may be inaccurate due to inline assembly. LLVM uses heuristics to estimate inline assembly sizes, but for certain assembly constructs the size is not precisely known at compile time.
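The fixed-point nature of the pass can be sketched in a few lines. This is a toy model under simplified assumptions (a single conditional-branch range, relaxation always costs 4 bytes), not LLVM's implementation.

```python
COND_RANGE = 1 << 20  # pretend conditional-branch reach: ±1MiB

def branch_relaxation(block_sizes, branches):
    """Toy BranchRelaxation fixed point.

    block_sizes: bytes per basic block, in layout order.
    branches: dicts with 'block' (source index), 'target' (destination
    index), 'relaxed'. Relaxing one branch (invert + insert an
    unconditional branch, +4 bytes) shifts later offsets, which can push
    other branches out of range, hence the outer loop.
    """
    changed = True
    while changed:
        changed = False
        offsets = [0]
        for size in block_sizes:          # recompute block offsets
            offsets.append(offsets[-1] + size)
        for br in branches:
            if br["relaxed"]:
                continue
            disp = offsets[br["target"]] - offsets[br["block"]]
            if abs(disp) > COND_RANGE:
                block_sizes[br["block"]] += 4  # inserted unconditional branch
                br["relaxed"] = True
                changed = True
    return block_sizes
```

The loop terminates because each branch is relaxed at most once; LLVM's pass has the same monotonicity.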
Unconditional branches and calls can target different sections since they have larger ranges. If the target is out of reach, the linker can insert thunks to extend the range.

For x86-64, the large code model uses multiple instructions for calls and jumps to support text sections larger than 2GiB (see Relocation overflow and code models: x86-64 large code model). This is a pessimization if the callee ends up being within reach. Google and Meta Platforms have interest in allowing range extension thunks as a replacement for the multiple-instruction sequences.
Assembler: instruction relaxation

The assembler converts assembly to machine code. When the target of a branch is within the same section and the distance is known at assembly time, the assembler can select the appropriate encoding. This is distinct from linker thunks, which handle cross-section or cross-object references where distances aren't known until link time.

Assembler instruction relaxation handles two cases (see Clang -O0 output: branch displacement and size increase for examples):

Span-dependent instructions: Select an appropriate encoding based on displacement.

On x86, a short jump (jmp rel8) can be relaxed to a near jump (jmp rel32) when the target is far.

On RISC-V, beqz may be assembled to the 2-byte c.beqz when the displacement fits within ±256 bytes.

Conditional branch transform: Invert the condition and insert an unconditional branch. On RISC-V, a blt might be relaxed to bge plus an unconditional branch.

The assembler uses an iterative layout algorithm that alternates between fragment offset assignment and relaxation until all fragments become legalized. See Integrated assembler improvements in LLVM 19 for implementation details.
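That alternation can be sketched as a toy: each branch fragment starts in its short encoding and grows to the long one when the displacement doesn't fit. The fragment representation here is my own simplification, not LLVM's MCFragment model.

```python
def layout(fragments):
    """Toy assembler layout: alternate offset assignment and relaxation
    until no fragment changes size.

    A plain fragment is {"size": n}. A branch fragment additionally has
    "branch_to" (target fragment index), "short" = (size, reach in bytes),
    and "long" = size of the relaxed encoding.
    """
    while True:
        offsets, off = [], 0
        for f in fragments:               # offset assignment pass
            offsets.append(off)
            off += f["size"]
        changed = False
        for i, f in enumerate(fragments):  # relaxation pass
            if "branch_to" not in f:
                continue
            short_size, reach = f["short"]
            if f["size"] == short_size:    # still using the short encoding
                disp = offsets[f["branch_to"]] - offsets[i]
                if abs(disp) > reach:
                    f["size"] = f["long"]  # relax to the long encoding
                    changed = True
        if not changed:
            return offsets
```

Because sizes only grow, the iteration converges; this monotonicity is the same reason the real integrated assembler's loop terminates.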
Linker: range extension thunks

When the linker resolves relocations, it may discover that a branch target is out of range. At this point, the instruction encoding is fixed, so the linker cannot simply change the instruction. Instead, it generates range extension thunks (also called veneers, branch stubs, or trampolines).

A thunk is a small piece of linker-generated code that can reach the actual target using a longer sequence of instructions. The original branch is redirected to the thunk, which then jumps to the real destination.
Range extension thunks are one type of linker-generated thunk. Other types include:

ARM interworking veneers: Switch between ARM and Thumb instruction sets (see Linker notes on AArch32)

MIPS LA25 thunks: Enable PIC and non-PIC code interoperability (see Toolchain notes on MIPS)

PowerPC64 TOC/NOTOC thunks: Handle calls between functions using different TOC pointer conventions (see Linker notes on PowerISA)
Short range vs long range thunks

A short range thunk (see lld/ELF's AArch64 implementation) contains just a single branch instruction. Since it uses a branch, its reach is also limited by the branch range: it can only extend coverage by one branch distance. For targets further away, multiple short range thunks can be chained, or a long range thunk with address computation must be used.
Long range thunks use indirection and can jump to (practically)arbitrary locations.
```
// Short range thunk: single branch, 4 bytes
__AArch64AbsLongThunk_dst:
  b dst                         // ±128MiB range

__ARMV7PILongThunk_dst:
  movw ip, :lower16:(dst - .)   ; ip = intra-procedure-call scratch register
  movt ip, :upper16:(dst - .)
  add  ip, ip, pc
  bx   ip
```
PowerPC64 ELFv2 (see Linker notes on PowerISA):
```
__long_branch_dst:
  addis 12, 2, .branch_lt@ha   # Load high bits from branch lookup table
  ld    12, .branch_lt@l(12)   # Load target address
  mtctr 12                     # Move to count register
  bctr                         # Branch to count register
```
Thunk impact on debugging and profiling

Thunks are transparent at the source level but visible in low-level tools:

Stack traces: May show thunk symbols (e.g., __AArch64ADRPThunk_foo) between caller and callee

Profilers: Samples may attribute time to thunk code; some profilers aggregate thunk time with the target function

Disassembly: objdump or llvm-objdump will show thunk sections interspersed with regular code

Code size: Each thunk adds bytes; large binaries may have thousands of thunks
lld/ELF's thunk creation algorithm

lld/ELF uses a multi-pass algorithm in finalizeAddressDependentContent:
```
assignAddresses();
for (pass = 0; pass < 30; ++pass) {
  if (pass == 0)
    createInitialThunkSections(); // pre-create empty ThunkSections
  bool changed = false;
  for (relocation : all_relocations) {
    if (pass > 0 && normalizeExistingThunk(rel))
      continue; // existing thunk still in range
    if (!needsThunk(rel))
      continue;
    Thunk *t = getOrCreateThunk(rel);
    ts = findOrCreateThunkSection(rel, src);
    ts->addThunk(t);
    rel.sym = t->getThunkTargetSym(); // redirect
    changed = true;
  }
  mergeThunks(); // insert ThunkSections into output
  if (!changed)
    break;
  assignAddresses(); // recalculate with new thunks
}
```
Key details:
Multi-pass: Iterates until convergence (max 30 passes). Adding thunks changes addresses, potentially putting previously-in-range calls out of range.

Pre-allocated ThunkSections: On pass 0, createInitialThunkSections places empty ThunkSections at regular intervals (thunkSectionSpacing). For AArch64: 128 MiB - 0x30000 ≈ 127.8 MiB.

Thunk reuse: getThunk returns an existing thunk if one exists for the same target; normalizeExistingThunk checks if a previously-created thunk is still in range.

ThunkSection placement: getISDThunkSec finds a ThunkSection within branch range of the call site, or creates one adjacent to the calling InputSection.
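A stripped-down model of this loop shows why multiple passes are needed: thunks are only discovered once addresses exist, and adding them changes addresses. All sizes, ranges, and placement rules below are toy assumptions; real lld also verifies that each thunk is itself reachable from the caller.

```python
BRANCH_RANGE = 100  # pretend direct-branch reach, in bytes
THUNK_SIZE = 8

def link(sections, calls, max_pass=30):
    """sections: ordered dict name -> size; calls: list of (src, dst) names.
    Creates one long-range thunk per unreachable target, placed after all
    sections, and re-runs layout until no new thunk appears."""
    thunks = []  # targets that received a thunk, in creation order
    for _ in range(max_pass):
        addr, layout = 0, {}
        for name, size in sections.items():   # assignAddresses()
            layout[name] = addr
            addr += size
        for i, dst in enumerate(thunks):      # thunks live past the sections
            layout["thunk:" + dst] = addr + i * THUNK_SIZE
        changed = False
        for src, dst in calls:
            # Calls to thunked targets are redirected to the thunk.
            target = layout.get("thunk:" + dst, layout[dst])
            if abs(target - layout[src]) > BRANCH_RANGE and dst not in thunks:
                thunks.append(dst)            # getOrCreateThunk + redirect
                changed = True
        if not changed:
            return layout, thunks
    raise RuntimeError("thunk placement did not converge")
```

Like lld, the toy caps the number of passes rather than trusting convergence unconditionally.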
lld/MachO's thunk creation algorithm

lld/MachO uses a single-pass algorithm in TextOutputSection::finalize:
```
for (callIdx = 0; callIdx < inputs.size(); ++callIdx) {
  // Finalize sections within forward branch range (minus slop)
  while (finalIdx < endIdx && fits_in_range(inputs[finalIdx]))
    finalizeOne(inputs[finalIdx++]);

  // Process branch relocations in this section
  for (Relocation &r : reverse(isec->relocs)) {
    if (!isBranchReloc(r)) continue;
    if (targetInRange(r)) continue;
    if (existingThunkInRange(r)) { reuse it; continue; }
    // Create new thunk and finalize it
    createThunk(r);
  }
}
```
Key differences from lld/ELF:
Single pass: Addresses are assigned monotonically and never revisited

Slop reservation: Reserves slopScale * thunkSize bytes (default: 256 × 12 = 3072 bytes on ARM64) to leave room for future thunks

Thunk naming: <function>.thunk.<sequence> where sequence increments per target

Thunk starvation problem: If many consecutive branches need thunks, each thunk (12 bytes) consumes slop faster than call sites (4 bytes apart) advance. The test lld/test/MachO/arm64-thunk-starvation.s demonstrates this edge case. The mitigation is increasing --slop-scale, but pathological cases with hundreds of consecutive out-of-range callees can still fail.
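The starvation arithmetic can be made concrete with a back-of-envelope model: each consecutive out-of-range call advances the call sites by 4 bytes while adding a 12-byte thunk, a net drain of 8 bytes of reserved slop per call. The constants follow the ARM64 defaults quoted above; the real failure mode depends on actual layout, so treat this as a crude estimate only.

```python
THUNK_SIZE = 12        # ARM64 thunk size in bytes
CALL_SIZE = 4          # one branch instruction per call site
DEFAULT_SLOP = 256 * 12  # slopScale * thunkSize = 3072 bytes

def calls_until_slop_exhausted(slop: int = DEFAULT_SLOP) -> int:
    """Crude model: consecutive out-of-range calls before slop runs out."""
    return slop // (THUNK_SIZE - CALL_SIZE)
```

In this model the default reservation survives roughly a few hundred consecutive out-of-range calls, which is why raising --slop-scale helps only up to a point.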
mold's thunk creation algorithm
mold uses a two-pass approach:
Pessimistically over-allocate thunks. Out-of-section relocations and relocations referencing a section not yet assigned an address pessimistically need thunks (requires_thunk(ctx, isec, rel, first_pass) when first_pass=true).

Then remove unnecessary ones.
Linker pass ordering:
compute_section_sizes() calls create_range_extension_thunks(); final section addresses are NOT yet known

set_osec_offsets() assigns section addresses

remove_redundant_thunks() is called AFTER addresses are known; it checks for thunks made unneeded by out-of-section relocations

Rerun set_osec_offsets()

Pass 1 (create_range_extension_thunks): Process sections in batches using a sliding window. The window tracks four positions:
```
Sections: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] ...
           ^       ^       ^               ^
           A       B       C               D
                   |_______|
                   | batch |
           |                               |
       earliest                        thunk
       reachable                       placement
       from C
```

[B, C) = current batch of sections to process (size ≤ branch_distance/5)

A = earliest section still reachable from C (for thunk expiration)

D = where to place the thunk (furthest point reachable from B)
```
// Simplified from OutputSection<E>::create_range_extension_thunks
while (b < sections.size()) {
  // Advance D: find furthest point where thunk is reachable from B
  while (d < size && thunk_at_d_reachable_from_b)
    assign_address(sections[d++]);

  // Compute batch [B, C)
  c = b + 1;
  while (c < d && sections[c] < sections[b] + batch_size)
    c++;

  // Advance A: expire thunks no longer reachable
  while (a < b && sections[a] + branch_distance < sections[c])
    a++;
  // Expire thunk groups before A: clear symbol flags.
  for (; t < thunks.size() && thunks[t].offset < sections[a]; t++)
    for (sym in thunks[t].symbols)
      sym->flags = 0;

  // Scan [B,C) relocations. If a symbol is not assigned to a thunk group yet,
  // assign it to the new thunk group at D.
  auto &thunk = thunks.emplace_back(newThunk(offset));
  parallel_for(b, c, [&](i64 i) {
    for (rel in sections[i].relocs) {
      if (requires_thunk(rel)) {
        Symbol &sym = rel.symbol;
        if (!sym.flags.test_and_set()) { // atomic: skip if already set
          lock_guard lock(mu);
          thunk.symbols.push_back(&sym);
        }
      }
    }
  });
  offset += thunk.size();
  b = c; // Move to next batch
}
```
Pass 2 (remove_redundant_thunks): After final addresses are known, remove thunk entries for symbols actually in range.
Key characteristics:
Pessimistic over-allocation: Assumes all out-of-section calls need thunks; safe to shrink later

Batch size: branch_distance/5 (25.6 MiB for AArch64, 3.2 MiB for AArch32)

Parallelism: Uses TBB for parallel relocation scanning within each batch

Single branch range: Uses one conservative branch_distance per architecture. For AArch32, uses ±16 MiB (Thumb limit) for all branches, whereas lld/ELF uses ±32 MiB for A32 branches.

Thunk size not accounted in D-advancement: The actual thunk group size is unknown when advancing D, so the end of a large thunk group may be unreachable from the beginning of the batch.

No convergence loop: Single forward pass for address assignment, no risk of non-convergence
GNU ld's thunk creation algorithm

Each port implements the algorithm on its own. There is no code sharing.

GNU ld's AArch64 port (bfd/elfnn-aarch64.c) uses an iterative algorithm but with a single stub type and no lookup table.
for (;;) {
  stub_changed = false;
  _bfd_aarch64_add_call_stub_entries(&stub_changed, ...);
  if (!stub_changed)
    return true;
  _bfd_aarch64_resize_stubs(htab);
  layout_sections_again();
}
GNU ld's ppc64 port (bfd/elf64-ppc.c) uses an iterative multi-pass algorithm with a branch lookup table (.branch_lt) for long-range stubs.

Section grouping: Sections are grouped by stub_group_size (~28-30 MiB default); each group gets one stub section. For 14-bit conditional branches (R_PPC64_REL14, ±32KiB range), the group size is reduced by 1024x.
while (1) {
  // Scan all relocations in all input sections
  for (input_bfd; section; irela) {
    // Only process branch relocations (R_PPC64_REL24, R_PPC64_REL14, etc.)
    stub_type = ppc_type_of_stub(section, irela, ...);
    if (stub_type == ppc_stub_none)
      continue;
    // Create or merge stub entry
    stub_entry = ppc_add_stub(...);
  }

  // Size all stubs, potentially upgrading long_branch to plt_branch
  bfd_hash_traverse(&stub_hash_table, ppc_size_one_stub, ...);

  // Check for convergence
  if (!stub_changed && all_sizes_stable)
    break;

  // Re-layout sections
  layout_sections_again();
}
Convergence control:
STUB_SHRINK_ITER = 20 (PR28827): After 20 iterations, stub sections only grow (prevents oscillation)

Convergence when: !stub_changed && all section sizes stable

Stub type upgrade: ppc_type_of_stub() initially returns ppc_stub_long_branch for out-of-range branches. Later, ppc_size_one_stub() checks if the stub's branch can reach; if not, it upgrades to ppc_stub_plt_branch and allocates an 8-byte entry in .branch_lt.
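The oscillation guard can be illustrated with a toy model (my own sketch, not GNU ld code): a stub-size requirement that flips between two values still converges once sizes are only allowed to grow after STUB_SHRINK_ITER iterations.

```python
# Toy model of the convergence control; `required_size` is a hypothetical
# feedback function (layout -> stub bytes needed), not a real bfd API.
STUB_SHRINK_ITER = 20

def converge(required_size, max_iter=100):
    size = 0
    for i in range(1, max_iter + 1):
        new = required_size(size)
        if i > STUB_SHRINK_ITER:
            new = max(new, size)  # after 20 iterations, stub sections only grow
        if new == size:
            return size, i
        size = new
    raise RuntimeError("no convergence")

# An adversarial requirement that oscillates: 16 bytes are needed iff the
# current stubs are small, 8 otherwise. The only-grow rule breaks the cycle.
final, iters = converge(lambda size: 16 if size < 16 else 8)
```

Without the only-grow rule, this requirement would flip between 8 and 16 forever; with it, the loop settles at 16 shortly after iteration 20.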
Comparing linker thunk algorithms

| Aspect | lld/ELF | lld/MachO | mold | GNU ld ppc64 |
|---|---|---|---|---|
| Passes | Multi (max 30) | Single | Two | Multi (shrink after 20) |
| Strategy | Iterative refinement | Sliding window | Sliding window | Iterative refinement |
| Thunk placement | Pre-allocated intervals | Inline with slop | Batch intervals | Per stub-group |
Linker relaxation
Some architectures take a different approach: instead of only expanding branches, the linker can also shrink instruction sequences when the target is close enough. RISC-V and LoongArch both use this technique. See The dark side of RISC-V linker relaxation for a deeper dive into the complexities and tradeoffs.

Consider a function call using the call pseudo-instruction, which expands to auipc + jalr:
# Before linking (8 bytes)
call ext
# Expands to:
#   auipc ra, %pcrel_hi(ext)
#   jalr  ra, ra, %pcrel_lo(ext)
If ext is within ±1MiB, the linker can relax this to:
# After relaxation (4 bytes)
jal ext
This is enabled by R_RISCV_RELAX relocations that accompany R_RISCV_CALL relocations. The R_RISCV_RELAX relocation signals to the linker that this instruction sequence is a candidate for shrinking.
When the linker deletes instructions, it must also adjust:
Subsequent instruction offsets within the section
Symbol addresses
Other relocations that reference affected locations
Alignment directives (R_RISCV_ALIGN)
This makes RISC-V linker relaxation more complex than thunk insertion, but it provides code size benefits that other architectures cannot achieve at link time.

LoongArch uses a similar approach. A pcaddu12i+jirl sequence (R_LARCH_CALL36, ±128GiB range) can be relaxed to a single bl instruction (R_LARCH_B26, ±128MiB range) when the target is close enough.
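As a sketch of the relaxation predicate (my own arithmetic from the ranges quoted above, not linker code): bl encodes a 26-bit instruction-word offset, i.e. a 4-byte-aligned byte offset within ±128MiB.

```python
B26_RANGE = 128 * 1024 * 1024  # ±128MiB reach of LoongArch bl (R_LARCH_B26)

def can_relax_call36_to_bl(pc, target):
    # CALL36 (pcaddu12i+jirl) reaches ±128GiB; it can shrink to a single bl
    # only when the 4-byte-aligned offset fits in bl's ±128MiB range.
    off = target - pc
    return off % 4 == 0 and -B26_RANGE <= off < B26_RANGE
```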
Diagnosing out-of-range errors

When you encounter a "relocation out of range" error, check the linker diagnostic and locate the relocatable file and function. Determine how the function call is lowered in assembly.
Summary
Handling long branches requires coordination across the toolchain:

| Stage | Technique | Example |
|---|---|---|
| Compiler | Branch relaxation pass | Invert condition + add unconditional jump |
| Assembler | Instruction relaxation | Short jump to near jump |
| Linker | Range extension thunks | Generate trampolines |
| Linker | Linker relaxation | Shrink auipc+jalr to jal (RISC-V) |
The linker's thunk generation is particularly important for large programs where function calls may exceed branch ranges. Different linkers use different algorithms with various tradeoffs between complexity, optimality, and robustness.

The linker relaxation approach adopted by RISC-V and LoongArch is an alternative that avoids range extension thunks but introduces other complexities.
Branch instructions on most architectures use PC-relative addressing with a limited range. When the target is too far away, the branch becomes "out of range" and requires special handling.

Consider a large binary where main() at address 0x10000 calls foo() at address 0x8010000—over 128MiB away. On AArch64, the bl instruction can only reach ±128MiB, so this call cannot be encoded directly. Without proper handling, the linker would fail with an error like "relocation out of range." The toolchain must handle this transparently to produce correct executables.
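The failing call can be checked with a quick sketch of the range test (my own illustration; R_AARCH64_CALL26 is the real relocation, the helper is hypothetical): bl stores a 26-bit signed word offset, so the reachable window is [-2^27, 2^27) bytes.

```python
def bl_in_range(pc, target):
    # AArch64 bl: 26-bit signed immediate, scaled by 4 -> ±128MiB.
    off = target - pc
    return off % 4 == 0 and -(1 << 27) <= off < (1 << 27)

# The call described above: 0x8010000 - 0x10000 = 0x8000000 = 2**27 bytes,
# just outside the encodable forward range.
```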
This article explores how compilers, assemblers, and linkers work together to solve the long branch problem.
Compiler (IR to assembly): Handles branches within a function that exceed the range of conditional branch instructions

Assembler (assembly to relocatable file): Handles branches within a section where the distance is known at assembly time

Linker: Handles cross-section and cross-object branches discovered during final layout
Branch range limitations
Different architectures have different branch range limitations. Here's a quick comparison of unconditional branch/call ranges:
| Architecture | Unconditional Branch | Conditional Branch | Notes |
|---|---|---|---|
| AArch64 | ±128MiB | ±1MiB | Range extension thunks |
| AArch32 (A32) | ±32MiB | ±32MiB | Range extension and interworking veneers |
| AArch32 (T32) | ±16MiB | ±1MiB | Thumb has shorter ranges |
| PowerPC64 | ±32MiB | ±32KiB | Range extension and TOC/NOTOC interworking thunks |
| RISC-V | ±1MiB (jal) | ±4KiB | Linker relaxation |
| x86-64 | ±2GiB | ±2GiB | Code models or thunk extension |
The following subsections provide detailed per-architecture information, including relocation types relevant for linker implementation.
AArch32
In A32 state:
Branch (b/b&lt;cond&gt;), conditional branch and link (bl&lt;cond&gt;) (R_ARM_JUMP24): ±32MiB

Unconditional branch and link (bl/blx, R_ARM_CALL): ±32MiB

Note: R_ARM_CALL is for unconditional bl/blx, which can be relaxed to BLX inline; R_ARM_JUMP24 is for branches which require a veneer for interworking.
Note: lld does not implement range extension thunks for SPARC.
x86-64
Short conditional jump (Jcc rel8): -128 to +127 bytes

Short unconditional jump (JMP rel8): -128 to +127 bytes
Near conditional jump (Jcc rel32): ±2GiB
Near unconditional jump (JMP rel32): ±2GiB
With a ±2GiB range for near jumps, x86-64 rarely encounters out-of-range branches in practice. A single text section would need to exceed 2GiB before thunks become necessary. For this reason, most linkers (including lld) do not implement range extension thunks for x86-64.
Compiler: branch relaxation
The compiler typically generates branches using a form with a large range. However, certain conditional branches may still go out of range within a function.

The compiler measures branch distances and relaxes out-of-range branches. In LLVM, this is handled by the BranchRelaxation pass, which runs just before AsmPrinter.
Different backends have their own implementations:
BranchRelaxation: AArch64, AMDGPU, AVR, RISC-V
HexagonBranchRelaxation: Hexagon
PPCBranchSelector: PowerPC
SystemZLongBranch: SystemZ
MipsBranchExpansion: MIPS
MSP430BSel: MSP430
For a conditional branch that is out of range, the pass typically inverts the condition and inserts an unconditional branch:
# Before relaxation (out of range)
beq .Lfar_target   # ±4KiB range on RISC-V

# After relaxation
bne .Lskip         # Inverted condition, short range
j .Lfar_target     # Unconditional jump, ±1MiB range
.Lskip:
Assembler: instruction relaxation

The assembler converts assembly to machine code. When the target of a branch is within the same section and the distance is known at assembly time, the assembler can select the appropriate encoding. This is distinct from linker thunks, which handle cross-section or cross-object references where distances aren't known until link time.

Assembler instruction relaxation handles two cases (see Clang -O0 output: branch displacement and size increase for examples):

Span-dependent instructions: Select a larger encoding when the displacement exceeds the range of the smaller encoding. For x86, a short jump (jmp rel8) can be relaxed to a near jump (jmp rel32).

Conditional branch transform: Invert the condition and insert an unconditional branch. On RISC-V, a blt might be relaxed to bge plus an unconditional branch.

The assembler uses an iterative layout algorithm that alternates between fragment offset assignment and relaxation until all fragments become legalized. See Integrated assembler improvements in LLVM 19 for implementation details.
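The fixed-point iteration can be sketched as follows (a deliberately simplified x86-style model of my own, not the actual MC implementation): every jump starts in its 2-byte short form and is widened to the 5-byte near form whenever its displacement no longer fits in rel8.

```python
def layout(frags):
    """frags: list of ('code', nbytes) or ('jmp', target_fragment_index).
    Returns the final size of each fragment after relaxation."""
    size = {i: 2 for i, f in enumerate(frags) if f[0] == 'jmp'}  # start short
    while True:
        # Assign offsets given the current branch sizes.
        offs, off = [], 0
        for i, f in enumerate(frags):
            offs.append(off)
            off += f[1] if f[0] == 'code' else size[i]
        changed = False
        for i, f in enumerate(frags):
            if f[0] == 'jmp' and size[i] == 2:
                disp = offs[f[1]] - (offs[i] + 2)  # rel8 is from the jump's end
                if not -128 <= disp <= 127:
                    size[i] = 5  # widen jmp rel8 -> jmp rel32
                    changed = True
        if not changed:  # fixed point: all fragments legalized
            return [f[1] if f[0] == 'code' else size[i]
                    for i, f in enumerate(frags)]
```

A jump over 200 bytes of code is widened, while a jump over 10 bytes keeps its short form.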
Linker: range extension thunks

When the linker resolves relocations, it may discover that a branch target is out of range. At this point, the instruction encoding is fixed, so the linker cannot simply change the instruction. Instead, it generates range extension thunks (also called veneers, branch stubs, or trampolines).

A thunk is a small piece of linker-generated code that can reach the actual target using a longer sequence of instructions. The original branch is redirected to the thunk, which then jumps to the real destination.

Range extension thunks are one type of linker-generated thunk. Other types include:

ARM interworking veneers: Switch between ARM and Thumb instruction sets (see Linker notes on AArch32)

MIPS LA25 thunks: Enable PIC and non-PIC code interoperability (see Toolchain notes on MIPS)

PowerPC64 TOC/NOTOC thunks: Handle calls between functions using different TOC pointer conventions (see Linker notes on PowerISA)

Short range vs long range thunks

A short range thunk (see lld/ELF's AArch64 implementation) contains just a single branch instruction. Since it uses a branch, its reach is also limited by the branch range—it can only extend coverage by one branch distance. For targets further away, multiple short range thunks can be chained, or a long range thunk with address computation must be used.
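Back-of-envelope arithmetic for the chaining case (my own sketch): since each short thunk only buys one more branch range r, reaching a target at distance d needs roughly ceil(d/r) − 1 intermediate thunks.

```python
import math

def chained_short_thunks(distance, branch_range):
    # Each hop (the original branch plus each chained thunk) covers at most
    # one branch range, so the hop count is ceil(distance / range).
    return max(0, math.ceil(distance / branch_range) - 1)
```

For example, with a 128 MiB range, a 300 MiB distance needs two chained thunks, while anything within one range needs none.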
Long range thunks use indirection and can jump to (practically) arbitrary locations.
// Short range thunk: single branch, 4 bytes
__AArch64AbsLongThunk_dst:
  b dst   // ±128MiB range

// Long range thunk (AArch32 position-independent)
__ARMV7PILongThunk_dst:
  movw ip, :lower16:(dst - .)  ; ip = intra-procedure-call scratch register
  movt ip, :upper16:(dst - .)
  add ip, ip, pc
  bx ip
PowerPC64 ELFv2 (see Linker notes on PowerISA):
__long_branch_dst:
  addis 12, 2, .branch_lt@ha  # Load high bits from branch lookup table
  ld 12, .branch_lt@l(12)     # Load target address
  mtctr 12                    # Move to count register
  bctr                        # Branch to count register
Thunk impact on debugging and profiling

Thunks are transparent at the source level but visible in low-level tools:

Stack traces: May show thunk symbols (e.g., __AArch64ADRPThunk_foo) between caller and callee

Profilers: Samples may attribute time to thunk code; some profilers aggregate thunk time with the target function

Disassembly: objdump or llvm-objdump will show thunk sections interspersed with regular code

Code size: Each thunk adds bytes; large binaries may have thousands of thunks
lld/ELF's thunk creation algorithm

lld/ELF uses a multi-pass algorithm in finalizeAddressDependentContent:
assignAddresses();
for (pass = 0; pass < 30; ++pass) {
  if (pass == 0)
    createInitialThunkSections(); // pre-create empty ThunkSections
  bool changed = false;
  for (relocation : all_relocations) {
    if (pass > 0 && normalizeExistingThunk(rel))
      continue; // existing thunk still in range
    if (!needsThunk(rel))
      continue;
    Thunk *t = getOrCreateThunk(rel);
    ts = findOrCreateThunkSection(rel, src);
    ts->addThunk(t);
    rel.sym = t->getThunkTargetSym(); // redirect
    changed = true;
  }
  mergeThunks(); // insert ThunkSections into output
  if (!changed)
    break;
  assignAddresses(); // recalculate with new thunks
}
Key details:

Multi-pass: Iterates until convergence (max 30 passes). Adding thunks changes addresses, potentially putting previously-in-range calls out of range.

Pre-allocated ThunkSections: On pass 0, createInitialThunkSections places empty ThunkSections at regular intervals (thunkSectionSpacing). For AArch64: 128 MiB - 0x30000 ≈ 127.8 MiB.

Thunk reuse: getThunk returns an existing thunk if one exists for the same target; normalizeExistingThunk checks if a previously-created thunk is still in range.

ThunkSection placement: getISDThunkSec finds a ThunkSection within branch range of the call site, or creates one adjacent to the calling InputSection.
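To make the convergence idea concrete, here is a deliberately tiny toy model (my own, far simpler than lld): out-of-range calls get a thunk, and passes repeat until a pass adds nothing.

```python
RANGE = 100  # toy branch range in bytes

def link(funcs, calls, max_pass=30):
    """funcs: {name: size}, laid out in dict order; calls: [(caller, callee)].
    Returns the set of callees that needed a thunk."""
    thunks = set()
    for _ in range(max_pass):  # bounded passes, echoing lld's limit of 30
        # Assign addresses under the current layout.
        addr, layout = 0, {}
        for name, size in funcs.items():
            layout[name] = addr
            addr += size
        changed = False
        for caller, callee in calls:
            if callee in thunks:
                continue  # call is already redirected to a thunk
            if abs(layout[callee] - layout[caller]) > RANGE:
                thunks.add(callee)
                changed = True
        if not changed:  # converged: no new thunks this pass
            return thunks
    raise RuntimeError("did not converge")
```

With a 200-byte function between caller and callee, the call is out of the toy range and gets a thunk; with a 20-byte function, it does not.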
lld/MachO's thunk creation algorithm

lld/MachO uses a single-pass algorithm in TextOutputSection::finalize:
for (callIdx = 0; callIdx < inputs.size(); ++callIdx) {
  // Finalize sections within forward branch range (minus slop)
  while (finalIdx < endIdx && fits_in_range(inputs[finalIdx]))
    finalizeOne(inputs[finalIdx++]);

  // Process branch relocations in this section
  for (Relocation &r : reverse(isec->relocs)) {
    if (!isBranchReloc(r))
      continue;
    if (targetInRange(r))
      continue;
    if (existingThunkInRange(r)) {
      // reuse it
      continue;
    }
    // Create new thunk and finalize it
    createThunk(r);
  }
}
Key differences from lld/ELF:

Single pass: Addresses are assigned monotonically and never revisited

Slop reservation: Reserves slopScale * thunkSize bytes (default: 256 × 12 = 3072 bytes on ARM64) to leave room for future thunks

Thunk naming: &lt;function&gt;.thunk.&lt;sequence&gt; where sequence increments per target

Thunk starvation problem: If many consecutive branches need thunks, each thunk (12 bytes) consumes slop faster than call sites (4 bytes apart) advance. The test lld/test/MachO/arm64-thunk-starvation.s demonstrates this edge case. The mitigation is increasing --slop-scale, but pathological cases with hundreds of consecutive out-of-range callees can still fail.
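The arithmetic behind those defaults (taken directly from the numbers above):

```python
slop_scale, thunk_size, call_size = 256, 12, 4  # ARM64 defaults quoted above
slop = slop_scale * thunk_size        # bytes reserved ahead of the window
thunks_that_fit = slop // thunk_size  # distinct thunks the slop can absorb
# A rough way to see starvation: each starving 4-byte call site adds a
# 12-byte thunk, a net drain of 8 bytes of slop per call.
drain_per_call = thunk_size - call_size
```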
mold's thunk creation algorithm

mold uses a two-pass approach: first pessimistically over-allocating thunks, then removing unnecessary ones.
Intuition: It's safe to allocate thunk space and later shrink it, but unsafe to add thunks after addresses are assigned (that would create gaps, breaking existing references).

Pass 1 (create_range_extension_thunks): Process sections in batches using a sliding window. The window tracks four positions:
Sections: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] ...
           ^       ^       ^               ^
           A       B       C               D
           |       |_______|               |
        earliest     batch               thunk
        reachable                      placement
        from C
[B, C) = current batch of sections to process (size ≤ branch_distance/5)
A = earliest section still reachable from C (for thunk expiration)
D = where to place the thunk (furthest point reachable from B)
// Simplified from OutputSection&lt;E&gt;::create_range_extension_thunks
while (b < sections.size()) {
  // Advance D: find furthest point where a thunk is reachable from B
  while (d < size && thunk_at_d_reachable_from_b)
    assign_address(sections[d++]);

  // Compute batch [B, C)
  c = b + 1;
  while (c < d && sections[c] < sections[b] + batch_size)
    c++;

  // Advance A: expire thunks no longer reachable
  while (a < b && sections[a] + branch_distance < sections[c])
    a++;
  // Expire thunk groups before A: clear symbol flags.
  for (; t < thunks.size() && thunks[t].offset < sections[a]; t++)
    for (sym in thunks[t].symbols)
      sym->flags = 0;

  // Scan [B, C) relocations. If a symbol is not assigned to a thunk group yet,
  // assign it to the new thunk group at D.
  auto &thunk = thunks.emplace_back(newThunk(offset));
  parallel_for(b, c, [&](i64 i) {
    for (rel in sections[i].relocs) {
      if (requires_thunk(rel)) {
        Symbol &sym = rel.symbol;
        if (!sym.flags.test_and_set()) { // atomic: skip if already set
          lock_guard lock(mu);
          thunk.symbols.push_back(&sym);
        }
      }
    }
  });
  offset += thunk.size();
  b = c; // Move to next batch
}
Pass 2 (remove_redundant_thunks): After final addresses are known, remove thunk entries for symbols actually in range.
Key characteristics:
Pessimistic over-allocation: Assumes all out-of-section calls need thunks; safe to shrink later

Batch size: branch_distance/5 (25.6 MiB for AArch64, 3.2 MiB for AArch32)

Parallelism: Uses TBB for parallel relocation scanning within each batch

Single branch range: Uses one conservative branch_distance per architecture. For AArch32, uses ±16 MiB (the Thumb limit) for all branches, whereas lld/ELF uses ±32 MiB for A32 branches.

Thunk size not accounted for in D-advancement: The actual thunk group size is unknown when advancing D, so the end of a large thunk group may be unreachable from the beginning of the batch.

No convergence loop: Single forward pass for address assignment, no risk of non-convergence
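The batch sizes quoted above are just branch_distance/5, using the conservative per-architecture ranges:

```python
MiB = 1 << 20
aarch64_batch = 128 * MiB // 5  # ±128 MiB branch range -> ~25.6 MiB batches
aarch32_batch = 16 * MiB // 5   # ±16 MiB Thumb limit   -> ~3.2 MiB batches
```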
Comparing thunk algorithms
| Aspect | lld/ELF | lld/MachO | mold |
|---|---|---|---|
| Passes | Multi-pass (max 30) | Single-pass | Two-pass |
| Strategy | Iterative refinement | Greedy | Greedy |
| Thunk placement | Pre-allocated at intervals | Inline with slop reservation | Batch-based at intervals |
| Convergence | Always (bounded iterations) | Almost always | Almost always |
| Range handling | Per-relocation type | Single conservative range | Single conservative range |
| Parallelism | Sequential | Sequential | Parallel (TBB) |
Linker relaxation (RISC-V)
RISC-V takes a different approach: instead of only expanding branches, it can also shrink instruction sequences when the target is close enough.

Consider a function call using the call pseudo-instruction, which expands to auipc + jalr:
# Before linking (8 bytes)
call ext
# Expands to:
#   auipc ra, %pcrel_hi(ext)
#   jalr  ra, ra, %pcrel_lo(ext)
If ext is within ±1MiB, the linker can relax this to:
# After relaxation (4 bytes)
jal ext
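A sketch of the in-range test the linker performs (my own arithmetic from the encoding: jal has a 21-bit signed, 2-byte-scaled immediate, i.e. ±1MiB):

```python
def can_relax_call_to_jal(pc, target):
    # auipc+jalr (8 bytes) can shrink to jal (4 bytes) when the offset fits
    # jal's 21-bit signed, 2-byte-aligned range of ±1 MiB.
    off = target - pc
    return off % 2 == 0 and -(1 << 20) <= off < (1 << 20)
```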
This is enabled by R_RISCV_RELAX relocations that accompany R_RISCV_CALL relocations. The R_RISCV_RELAX relocation signals to the linker that this instruction sequence is a candidate for shrinking.
When the linker deletes instructions, it must also adjust:
Subsequent instruction offsets within the section
Symbol addresses
Other relocations that reference affected locations
Alignment directives (R_RISCV_ALIGN)
This makes RISC-V linker relaxation more complex than thunk insertion, but it provides code size benefits that other architectures cannot achieve at link time.
Diagnosing out-of-range errors

When you encounter a "relocation out of range" error, here are some diagnostic steps:

Check the error message: lld reports the source location, relocation type, and the distance. For example:
ld.lld: error: a.o:(.text+0x1000): relocation R_AARCH64_CALL26 out of range: 150000000 is not in [-134217728, 134217727]
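The numbers in the diagnostic are directly actionable. A hypothetical helper (my own, not part of lld) that pulls the distance and permitted range out of such a message:

```python
import re

def parse_out_of_range(msg):
    # Extract "<dist> is not in [<lo>, <hi>]" from an lld diagnostic.
    m = re.search(r"(-?\d+) is not in \[(-?\d+), (-?\d+)\]", msg)
    dist, lo, hi = map(int, m.groups())
    return dist, lo, hi

msg = ("ld.lld: error: a.o:(.text+0x1000): relocation R_AARCH64_CALL26 "
       "out of range: 150000000 is not in [-134217728, 134217727]")
dist, lo, hi = parse_out_of_range(msg)
overshoot = dist - hi  # how far past the limit the branch is
```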
Use --verbose or -Map: Generate a link map to see the section layout and identify which sections are far apart.

Consider -ffunction-sections: Splitting functions into separate sections gives the linker more flexibility in placement, potentially reducing distances.

Check for large data in .text: Embedded data (jump tables, constant pools) can push functions apart. Some compilers have options to place these elsewhere.

LTO considerations: Link-time optimization can dramatically change code layout. If thunk-related issues appear only with LTO, the optimizer may be creating larger functions or making different inlining decisions.
Summary
Handling long branches requires coordination across the toolchain:

| Stage | Technique | Example |
|---|---|---|
| Compiler | Branch relaxation pass | Invert condition + add unconditional jump |
| Assembler | Instruction relaxation | Short jump to near jump |
| Linker | Range extension thunks | Generate trampolines |
| Linker | Linker relaxation | Shrink auipc+jalr to jal (RISC-V) |
The linker's thunk generation is particularly important for large programs where cross-compilation-unit calls may exceed branch ranges. Different linkers use different algorithms with various tradeoffs between complexity, optimality, and robustness.

RISC-V's linker relaxation is unique in that it can both expand and shrink code, optimizing for both correctness and code size.
I've created pr-shadow with vibe coding, a tool that maintains a shadow branch for GitHub pull requests (PRs) that never requires force-pushing. This addresses pain points I described in Reflections on LLVM's switch to GitHub pull requests#Patch evolution.
The problem
GitHub structures pull requests around branches, enforcing a branch-centric workflow. There are multiple problems when you force-push a branch after a rebase:

The UI displays "force-pushed the BB branch from X to Y". Clicking "compare" shows git diff X..Y, which includes unrelated upstream commits—not the actual patch difference. For a project like LLVM with 100+ commits daily, this makes the comparison essentially useless.

Inline comments may become "outdated" or misplaced after force pushes.

If your commit message references an issue or another PR, each force push creates a new link on the referenced page, cluttering it with duplicate mentions. (Adding backticks around the link text works around this, but it's not ideal.)

These difficulties lead to recommendations favoring less flexible workflows that only append commits (including merge commits) and discourage rebases. However, this means working with an outdated base, and switching between the main branch and PR branches causes numerous rebuilds—especially painful for large repositories like llvm-project.
git switch main; git pull; ninja -C build

# Switching to a feature branch with an outdated base requires numerous rebuilds.
git switch feature0
git merge origin/main  # I prefer `git rebase main` to remove merge commits, which clutter the history
ninja -C out/release

# Switching to another feature branch with an outdated base requires numerous rebuilds.
git switch feature1
git merge origin/main
ninja -C out/release
In a large repository, avoiding rebases isn't realistic—other commits frequently modify nearby lines, and rebasing is often the only way to discover that your patch needs adjustments due to interactions with other landed changes.

In 2022, GitHub introduced "Pull request title and description" for squash merging. This means updating the final commit message requires editing via the web UI. I prefer editing the local commit message and syncing the PR description from it.
The solution
After updating my main branch, before switching to a feature branch, I always run
git rebase main feature
to minimize the number of modified files. To avoid the force-push problems, I use pr-shadow to maintain a shadow PR branch (e.g., pr/feature) that only receives fast-forward commits (including merge commits).

I work freely on my local branch (rebase, amend, squash), then sync to the PR branch using git commit-tree to create a commit with the same tree but parented to the previous PR HEAD.
Local branch (feature)    PR branch (pr/feature)
A                         A (init)
|                         |
B (amend)                 C1 "Fix bug"
|                         |
C (rebase)                C2 "Address review"
Reviewers see clean diffs between C1 and C2, even though the underlying commits were rewritten.

When a rebase is detected (git merge-base with main/master changed), the new PR commit is created as a merge commit with the new merge-base as the second parent. GitHub displays these as "condensed" merges, preserving the diff view for reviewers.
# Set `git merge-base origin/main feature` as the initial base. Push to pr/feature and open a GitHub PR.
prs init
# Same but create a draft PR. Repeated `init`s are rejected.
prs init --draft

# Work locally (rebase, amend, etc.)
git fetch origin main:main
git rebase main
git commit --amend

# Sync to PR
prs push "Rebase and fix bug"
# Force push if remote diverged due to messing with pr/feature directly.
prs push --force "Rewrite"

# Update PR title/body from local commit message.
prs desc

# Run gh commands on the PR.
prs gh view
prs gh checks
The tool supports both fork-based workflows (pushing to your fork) and same-repo workflows (for branches like user/&lt;name&gt;/feature). It also works with GitHub Enterprise, auto-detecting the host from the repository URL.
Related work
The name "prs" is a tribute to spr, which implements a similar shadow branch concept. However, spr pushes user branches to the main repository rather than a personal fork. While necessary for stacked pull requests, this approach is discouraged for single PRs as it clutters the upstream repository. pr-shadow avoids this by pushing to your fork by default.

I owe an apology to folks who receive users/MaskRay/feature branches (if they use the default fetch = +refs/heads/*:refs/remotes/origin/* to receive user branches). I had been abusing spr for a long time after LLVM's GitHub transition to avoid unnecessary rebuilds when switching between the main branch and PR branches.

Additionally, spr embeds a PR URL in commit messages (e.g., Pull Request: https://github.com/llvm/llvm-project/pull/150816), which can cause downstream forks to add unwanted backlinks to the original PR.

If I need stacked pull requests, I will probably use pr-shadow with the base patch and just rebase stacked ones—it's unclear how spr handles stacked PRs.
tl;dr: Weak AVL trees are replacements for AVL trees and red-blacktrees.
The 2014 paper Rank-Balanced Trees (Haeupler, Sen, Tarjan) presents a framework using ranks and rank differences to define binary search trees.

Each node has a non-negative integer rank r(x). Null nodes have rank -1.

The rank difference of a node x with parent p(x) is r(p(x)) − r(x).

A node is i,j if its children have rank differences i and j (unordered), e.g., a 1,2 node has children with rank differences 1 and 2.
A node is called a 1-child if its rank difference is 1.
Several balanced trees fit this framework:
AVL tree: Ranks are defined as heights. Every node is 1,1 or 1,2 (rank differences of children)

Red-Black tree: All rank differences are 0 or 1, and no parent of a 0-child is a 0-child. (red: 0-child; black: 1-child; null nodes are black)

Weak AVL tree (the new tree described by this paper): All rank differences are 1 or 2, and every leaf has rank 0.

A weak AVL tree without 2,2 nodes is an AVL tree.
AVL trees ⫋ weak AVL trees ⫋ red-black trees
Weak AVL Tree
Weak AVL trees are replacements for AVL trees and red-black trees. A single insertion or deletion operation requires at most two rotations (forming a double rotation when two are needed). In contrast, AVL deletion requires O(log n) rotations, and red-black deletion requires up to three.

Without deletions, a weak AVL tree is exactly an AVL tree. With deletions, its height remains at most that of an AVL tree with the same number of insertions but no deletions.
The rank rules imply:
Null nodes have rank -1, leaves have rank 0, unary nodes have rank 1.
Insertion
The new node x has a rank of 0, changed from the null node of rank -1. There are three cases.

If the tree was previously empty, the new node becomes the root.

If the parent of the new node was previously a unary node (1,2 node), it is now a 1,1 binary node.

If the parent of the new node was previously a leaf (1,1 node), it is now a 0,1 binary node, leading to a rank violation.

When the tree was previously non-empty, x has a parent node. We call the following subroutine with x indicating the new node to handle the second and third cases.

The following subroutine handles the rank increase of x. We call break if there is no more rank violation, i.e., we are done.
The 2014 paper isn't very clear about the conditions.
// Assume that x's rank has just increased by 1 and rank_diff(x) has been updated.
p = x->parent;
if (rank_diff(x) == 1) {
  // x was previously a 2-child before increasing x->rank. Done.
} else {
  for (;;) {
    // Otherwise, p is a 0,1 node (previously a 1,1 node before increasing x->rank).
    // x being a 0-child is a violation.
    Promote p.
    // Since we have promoted both x and p, it's as if rank_diff(x's sibling) is flipped.
    // p is now a 1,2 node.
    x = p;
    p = p->parent;
    // x is a 1,2 node.
    if (!p)
      break;
    d = p->ch[1] == x;

    if (rank_diff(x) == 1)
      break;
    // Otherwise, x is a 0-child, leading to a new rank violation.

    auto sib = p->ch[!d];
    if (rank_diff(sib) == 2) { // p is a 0,2 node
      auto y = x->ch[d^1];
      if (y && rank_diff(y) == 1) {
        // y is a 1-child. y must be the previous `x` in the last iteration.
        // Perform a double rotation involving `p`, `x`, and `y`.
      } else {
        // Otherwise, y is null or a 2-child. Perform a single rotation involving `p` and `x`.
        // x is now a 1,1 node and there is no more violation.
      }
      break;
    }

    // Otherwise, p is a 0,1 node. Go to the next iteration.
  }
}
Insertion never introduces a 2,2 node, so insertion-only sequences produce AVL trees.
Deletion
TODO: Describe deletion
if (!was_2 && !x && !p->ch[0] && !p->ch[1] && p->rp()) {
  // p was unary and becomes 2,2. Demote it.
}
Implementation
Since valid rank differences can only be 1 or 2, ranks can be encoded efficiently using bit flags. There are three approaches:
Store two bits representing the rank differences to each child. Bit 0: rank difference to the left child (1 = diff is 2, 0 = diff is 1). Bit 1: rank difference to the right child.

Store a single bit representing the parity (even/odd) of the node's absolute rank. The rank difference to a child is computed by comparing parities. Same parity → rank difference of 2. Different parity → rank difference of 1.

Store a 1-bit rank difference parity in each node.
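The second (parity) approach can be sketched directly (my own illustration): since valid rank differences are only 1 or 2, comparing parities recovers the difference.

```python
def rank_diff_from_parity(parent_parity, child_parity):
    # Same parity -> the ranks differ by an even amount, which must be 2;
    # different parity -> the difference is odd, which must be 1.
    return 2 if parent_parity == child_parity else 1
```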
FreeBSD's sys/tree.h (https://reviews.freebsd.org/D25480, 2020) uses the first approach. The rb_ prefix remains, as it can also indicate Rank-Balanced :) Note: its insertion operation can be further optimized, as the following code demonstrates.
The third approach is less efficient because a null node can be either a 1-child (parent is binary) or a 2-child (parent is unary), requiring the sibling node to be probed to determine the rank difference:

int rank_diff(Node *p, int d) {
  return p->ch[d] ? p->ch[d]->par_and_flg & 1 : p->ch[!d] ? 2 : 1;
}
select: find the k-th smallest element (0-indexed)
prev: find the largest element less than a key
next: find the smallest element greater than a key
Node structure:
ch[2]: left and right child pointers.

par_and_flg: packs the parent pointer with 2 flag bits in the low bits. Bit 0 indicates whether the left child has rank difference 2; bit 1 indicates whether the right child has rank difference 2. A cleared bit means rank difference 1.

i: the key value.

sum, size: augmented data maintained by mconcat for order statistics operations.
Helper methods:
rd2(d): returns true if child d has rank difference 2.

flip(d): toggles the rank difference of child d between 1 and 2.

clr_flags(): sets both children to rank difference 1 (used after rotations to reset a node to 1,1).
Invariants:
Leaves always have flags() == 0, meaning both null children are 1-children (null nodes have rank -1, a leaf has rank 0).

After each insertion or deletion, mconcat is called along the path to the root to update augmented data.
Rotations:
The rotate(x, d) function rotates node x in direction d. It lifts x->ch[d] to replace x, and updates the augmented data for x. The caller is responsible for updating rank differences.
Dai Loy Gambling House (大来赌场). The only surviving building of the eight gambling houses that once operated here. The gambling house closed in 1951.

Boarding House Museum

Jan Ying Chinese Association Museum (俊英工商会)

Joe Shoong Chinese School
Memorial Park
The closing times on Google Maps are inaccurate: all four buildings close at 16:00. According to the video https://www.youtube.com/watch?v=wtzcOgaMYcQ, the staff member locking up is actually Clarence Chu, the owner of several of these buildings! He purchased the town of Locke from the George Locke family in 1976.
On most Linux platforms (except AArch32, which uses .ARM.exidx), DWARF .eh_frame is required for C++ exception handling and stack unwinding to restore callee-saved registers. While .eh_frame can be used for call trace recording, it is often criticized for its runtime overhead. As an alternative, developers can enable frame pointers, or adopt SFrame, a newer format designed specifically for profiling. This article examines the size overhead of enabling non-DWARF stack walking mechanisms when building several LLVM executables.
Runtime performance analysis will be added in a future update.
Stack walking mechanisms
Here is a survey of mechanisms available for x86-64:
Frame pointers: Fast and simple, but costs a register.
DWARF .eh_frame: Comprehensive but slower; supports additional features like C++ exception handling

SFrame: A new experimental format that only supports profiling. .eh_frame remains necessary for debugging and C++ exception handling. Check out Remarks on SFrame for details.

LLVM's Compact Unwinding Format: A highly space-efficient format, implemented by Apple for Mach-O binaries. It has llvm, lld/MachO, and libunwind implementations. Supports x86-64 and AArch64. It can mostly replace DWARF CFI, though some entries need a DWARF escape (the __eh_frame section would be tiny). OpenVMS modified it for their x86-64 port.
x86 Last Branch Record (LBR): A hardware feature that captures alimited history of most recent branches (up to 32 on Skylake+). Whenconfigured to track branches for SamplePGO, the limited depth means itwon't reliably capture deep stack traces. Traditionally Intel only, butAMD Zen 4 has since implemented LastBranch Record Extension Version 2 (LbrExtV2)
Control-flow Enforcement Technology (CET) Shadow Stack: Thishardware security hardening feature can be used to get stack traces.While it introduces some overhead, it offers the flexibility ofprocess-specific enablement.
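The frame-pointer mechanism boils down to walking a linked list of saved (FP, return address) pairs. A sketch (the struct layout matches the x86-64 push rbp; mov rbp, rsp prologue; in a real profiler the starting fp would come from __builtin_frame_address(0) or a sampled RBP):

```cpp
#include <cstdio>

// On x86-64 with frame pointers, each frame starts with the saved caller FP
// followed by the return address.
struct Frame {
  const Frame *next; // saved caller RBP
  const void *ret;   // return address
};

// Walk the chain, printing return addresses; the ABI encourages terminating
// the chain by zeroing the outermost frame pointer.
int walkFpChain(const Frame *fp) {
  int depth = 0;
  for (; fp && depth < 64; fp = fp->next, ++depth)
    std::printf("#%d %p\n", depth, fp->ret);
  return depth;
}
```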
Space overhead analysis
Frame pointer size impact
For most architectures, GCC defaults to -fomit-frame-pointer in -O compilation to free up a register for general use. To enable frame pointers, specify -fno-omit-frame-pointer, which reserves the frame pointer register (e.g., rbp on x86-64) and emits push/pop instructions in function prologues/epilogues.
For leaf functions (those that don't call other functions), while the frame pointer register should still be reserved for consistency, the push/pop operations are often unnecessary. Compilers provide -momit-leaf-frame-pointer (with target-specific defaults) to reduce code size.
The viability of this optimization depends on the target architecture:
On AArch64, the return address is available in the link register (X30). The immediate caller can be retrieved by inspecting X30, so -momit-leaf-frame-pointer does not compromise unwinding.
On x86-64, after the prologue instructions execute, the return address is stored at RSP plus an offset. An unwinder needs to know the stack frame size to retrieve the return address, or it must utilize DWARF information for the leaf frame and then switch to the FP chain for parent frames.
Beyond this architectural consideration, there are additional practical reasons to use -momit-leaf-frame-pointer on x86-64:
Many hand-written assembly implementations (including numerous glibc functions) don't establish frame pointers, creating gaps in the frame pointer chain anyway.
In the prologue sequence push rbp; mov rbp, rsp, after the first instruction executes, RBP does not yet reference the current stack frame. When shrink-wrapping optimizations are enabled, the instruction region where RBP still holds the old value becomes larger, increasing the window where the frame pointer is unreliable.
Given these trade-offs, three common configurations have emerged: omit frame pointers everywhere, keep them only for non-leaf functions, or keep them for all functions.
For instance, llvm-mc is dominated by read-only data, making the relative .text percentage quite small, so frame pointer impact on the VM size is minimal. ("VM size" is a metric used by bloaty, representing the total p_memsz size of PT_LOAD segments, excluding alignment padding.) As expected, llvm-mc grows larger as more functions set up the frame pointer chain. However, opt actually becomes smaller when -fno-omit-frame-pointer is enabled—a counterintuitive result that warrants explanation.
Without frame pointers, the compiler uses RSP-relative addressing to access stack objects. When using the register-indirect + disp8/disp32 addressing mode, RSP needs an extra SIB byte while RBP doesn't. For larger functions accessing many local variables, the savings from shorter RBP-relative encodings can outweigh the additional push rbp; mov rbp, rsp; pop rbp instructions in the prologues/epilogues.
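The SIB-byte cost can be seen in the machine code itself. A small check (the byte sequences are the standard x86-64 encodings of mov eax, dword ptr [rbp-8] and mov eax, dword ptr [rsp+8]):

```cpp
// mov eax, dword ptr [rbp-8]: opcode + ModRM + disp8, no SIB byte.
const unsigned char rbp_form[] = {0x8b, 0x45, 0xf8};
// mov eax, dword ptr [rsp+8]: RSP as a base register forces a SIB byte.
const unsigned char rsp_form[] = {0x8b, 0x44, 0x24, 0x08};

static_assert(sizeof(rsp_form) == sizeof(rbp_form) + 1,
              "the RSP-relative form pays one extra SIB byte");
```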
Oracle is advocating for SFrame adoption in Linux distributions. The SFrame implementation is handled by the assembler and linker rather than the compiler. Let's build the latest binutils-gdb to test it.
Building test program
We'll use the clang compiler from https://github.com/llvm/llvm-project/tree/release/21.x as our test program.
There are still issues related to garbage collection (an object file format design issue), so I'll just disable -Wl,--gc-sections.
```diff
--- i/llvm/cmake/modules/AddLLVM.cmake
+++ w/llvm/cmake/modules/AddLLVM.cmake
@@ -331,4 +331,4 @@ function(add_link_opts target_name)
       # TODO Revisit this later on z/OS.
-      set_property(TARGET ${target_name} APPEND_STRING PROPERTY
-                   LINK_FLAGS " -Wl,--gc-sections")
+      #set_property(TARGET ${target_name} APPEND_STRING PROPERTY
+      #             LINK_FLAGS " -Wl,--gc-sections")
     endif()
```
The results show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (7.07 + 0.99 = 8.06 MiB). While SFrame is designed for efficiency during stack walking, it carries a non-trivial space overhead compared to traditional DWARF unwind information.
SFrame vs FP
Having examined SFrame's overhead compared to .eh_frame, let's now compare the two primary approaches for non-hardware-assisted stack walking.
Frame pointer approach: reserve FP but omit push/pop for leaf functions: g++ -fno-omit-frame-pointer -momit-leaf-frame-pointer
SFrame approach: omit FP and use SFrame metadata: g++ -fomit-frame-pointer -momit-leaf-frame-pointer -Wa,--gsframe
To conduct a fair comparison, we build LLVM executables using both approaches with both Clang and GCC compilers. The following script configures and builds test binaries with each combination:
GCC-built binaries are significantly larger than their Clang counterparts, probably due to more aggressive inlining or vectorization strategies.
/tmp/out/custom-compact has a significantly smaller EH size. See details below.
With Clang-built binaries, the frame pointer configuration produces a smaller opt executable (55.6 MiB) compared to the SFrame configuration (62.5 MiB). This reinforces our earlier observation that RBP addressing can be more compact than RSP-relative addressing for large functions with frequent local variable accesses.
Assembly comparison reveals that functions using RBP and RSP addressing produce quite similar code.
In contrast, GCC-built binaries show the opposite trend: the frame pointer version of opt (76.2 MiB) is larger than the SFrame version (70.0 MiB).
The generated assembly differs significantly between omit-FP and non-omit-FP builds, so I compared symbol sizes between the two GCC builds.
Many functions, such as _ZN4llvm15ELFObjectWriter24executePostLayoutBindingEv, have significantly more instructions in the keep-FP build. This suggests that GCC's frame pointer code generation may not be as optimized as its default omit-FP path.
The /tmp/out/custom-compact build uses my llvm-project branch (http://github.com/MaskRay/llvm-project/tree/demo-unwind) that ports Mach-O compact unwind to ELF, allowing the majority of .eh_frame FDEs to replace CFI instructions with unwind descriptors. Linker behavior:
Split FDEs into two groups: descriptor-based (augmentation 'C') and instruction-based.
Generate .eh_frame_hdr version 2 with 12-byte table entries when compact FDEs are present: (pc_ptr, unwind_descriptor_or_fde_ptr). Compact FDEs described by .eh_frame_hdr inline are removed from the output .eh_frame section.
Note: .ARM.exidx and MIPS compact exception tables also describe unwind descriptors inline in a binary search index table.
FDEs not representable by compact unwind (e.g. the shrink wrapping optimization) use the traditional CFI instructions (called DWARF escape in the Mach-O compact unwind information).
This implementation involves several components:
-mllvm -elf-compact-unwind: Emits .eh_frame CIEs with augmentation character 'C' and FDEs using unwind descriptors.
-mllvm -x86-epilog-cfi=0: Disables epilogue CFI for x86 (primarily implemented by D42848 in 2018, notably disabled for Darwin and Windows). Without this option most frames will not utilize unwind descriptors because the current Mach-O compact unwind implementation does not support popq %rbp; .cfi_def_cfa %rsp, 8; ret. I believe this is still fair as we expect to use an 8-byte descriptor, sufficient to describe epilogue CFI.
lld/ELF changes: FDEs are split into descriptor-based (augmentation 'C') and CFI-instruction-based groups. When compact FDEs are present, .eh_frame_hdr version 2 is generated with 12-byte table entries containing (pc_ptr, unwind_descriptor_or_fde_ptr). The PC pointer remains 4 bytes, while the 8-byte entry indicates either an unwind descriptor (odd value) or an FDE pointer (even value).
With the current implementation, 4937 out of 77648 FDEs (6.36%) require a DWARF escape, while the remaining FDEs can be replaced with unwind descriptors.
.eh_frame_hdr will become even smaller if we implement the two-level page table structure of Mach-O __unwind_info.
Runtime performance analysis
TODO
perf record overhead with EH
perf record overhead with FP
Here is a benchmark run from llvm-compile-time-tracker.com.
The stable2-O3 benchmark is relevant. When enabling FP for non-leaf functions, the instructions:u metric increases by +2.44% while wall-time (a noisy metric) increases by just 0.56%.
Summary
This article examines the space overhead of different stack walking mechanisms when building LLVM executables.
Frame pointer configurations: Enabling frame pointers (-fno-omit-frame-pointer) can paradoxically reduce x86-64 binary size when stack object accesses are frequent. This occurs because RBP-relative addressing produces more compact encodings than RSP-relative addressing, which requires an extra SIB byte. The savings from shorter instructions can outweigh the prologue/epilogue overhead.
SFrame vs .eh_frame: For the x86-64 clang executable, SFrame metadata is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr. Given the significant VM size overhead and the lack of clear advantages over established alternatives, I am skeptical about SFrame's viability as the future of stack walking for userspace programs. While SFrame will receive a major revision V3 in the upcoming months, it needs to achieve substantial size reductions comparable to existing compact unwinding schemes to justify its adoption over frame pointers. I hope interested folks can implement something similar to macOS's compact unwind descriptors (with x86-64 support) and OpenVMS's.
ELF compact unwind: My prototype porting Mach-O compact unwind to ELF demonstrates significant promise. The approach reduces VM size by 4.5-6.1% compared to frame pointers, achieving the smallest binaries in my benchmarks. By replacing verbose CFI instructions with 8-byte unwind descriptors (with DWARF escape for complex cases like shrink-wrapped functions), .eh_frame shrinks dramatically—only 6.36% of FDEs require the traditional CFI format. This approach, once completed, offers a compelling alternative to SFrame: better compression, compatibility with existing .eh_frame infrastructure, and a clear path to implementation.
LLVM community: I need your support. I've raised technical objections to the SFrame RFC as a maintainer. Some engineers dismissed them. Now they're escalating to the Project Council to override technical review. This looks OKR-driven, not merit-driven.
GCC's frame pointer code generation appears less optimized than its default omit-frame-pointer path, as evidenced by substantial differences in generated assembly.
Runtime performance analysis remains to be conducted to complete the trade-off evaluation.
Appendix: configure-llvm
This script specifies common options when configuring llvm-project: https://github.com/MaskRay/Config/blob/master/home/bin/configure-llvm
-DCMAKE_CXX_ARCHIVE_CREATE="$HOME/Stable/bin/llvm-ar qc --thin <TARGET> <OBJECTS>" -DCMAKE_CXX_ARCHIVE_FINISH=:: Use thin archives to reduce disk usage
-DLLVM_TARGETS_TO_BUILD=host: Build a single target
-DCLANG_ENABLE_OBJC_REWRITER=off -DCLANG_ENABLE_STATIC_ANALYZER=off: Disable less popular components
-DLLVM_ENABLE_PLUGINS=off -DCLANG_PLUGIN_SUPPORT=off: Disable -Wl,--export-dynamic, preventing large .dynsym and .dynstr sections
Appendix: My SFrame build
```sh
mkdir -p out/release && cd out/release
../../configure --prefix=$HOME/opt/binutils --disable-multilib
make -j $(nproc) all-ld all-binutils all-gas
make -j $(nproc) install-ld install-binutils install-gas
```
gcc -B$HOME/opt/binutils/bin and clang -B$HOME/opt/binutils/bin -fno-integrated-as will use as and ld from the install directory.
Appendix: Scripts
Ruby scripts used by this post are available at https://github.com/MaskRay/object-file-size-analyzer/
SFrame is a new stack walking format for userspace profiling, inspired by Linux's in-kernel ORC unwind format. While SFrame eliminates some .eh_frame CIE/FDE overhead, it sacrifices functionality (e.g., personality, LSDA, callee-saved registers) and flexibility, and its stack offsets are less compact than .eh_frame's bytecode-style CFI instructions. In llvm-project executables I've tested on x86-64, the .sframe section is 20% larger than .eh_frame. It also remains significantly larger than highly compact schemes like Windows ARM64 unwind codes.
SFrame describes three elements for each function:
Canonical Frame Address (CFA): The base address for stack frame calculations
Return address
Frame pointer
An .sframe section follows a straightforward layout:
Header: Contains metadata and offset information
Auxiliary header (optional): Reserved for future extensions
Function Descriptor Entries (FDEs): Array describing each function
Frame Row Entries (FREs): Arrays of unwinding information per function
```cpp
struct [[gnu::packed]] sframe_header {
  struct {
    uint16_t sfp_magic;
    uint8_t sfp_version;
    uint8_t sfp_flags;
  } sfh_preamble;
  uint8_t sfh_abi_arch;
  int8_t sfh_cfa_fixed_fp_offset;
  // Used by x86-64 to define the return address slot relative to CFA
  int8_t sfh_cfa_fixed_ra_offset;
  // Size in bytes of the auxiliary header, allowing extensibility
  uint8_t sfh_auxhdr_len;
  // Numbers of FDEs and FREs
  uint32_t sfh_num_fdes;
  uint32_t sfh_num_fres;
  // Size in bytes of FREs
  uint32_t sfh_fre_len;
  // Offsets in bytes of FDEs and FREs
  uint32_t sfh_fdeoff;
  uint32_t sfh_freoff;
};
```
While magic numbers are a popular choice for file formats, they deviate from established ELF conventions, which simply utilize the section type for distinction.
The version field resembles similar usage within DWARF section headers. SFrame will likely evolve over time, unlike ELF's more stable control structures. This means we'll probably need to keep producers and consumers evolving in lockstep, which creates a stronger case for internal versioning. An internal version field would allow linkers to upgrade or ignore unsupported low-version input pieces, providing more flexibility in handling version mismatches.
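A quick size check of the packed layout above (a sketch reproducing the fields as shown; the 28-byte total assumes the GCC/Clang packed attribute):

```cpp
#include <cstdint>

struct [[gnu::packed]] sframe_header_sketch {
  struct [[gnu::packed]] {
    uint16_t sfp_magic;
    uint8_t sfp_version;
    uint8_t sfp_flags;
  } sfh_preamble;
  uint8_t sfh_abi_arch;
  int8_t sfh_cfa_fixed_fp_offset;
  int8_t sfh_cfa_fixed_ra_offset;
  uint8_t sfh_auxhdr_len;
  uint32_t sfh_num_fdes;
  uint32_t sfh_num_fres;
  uint32_t sfh_fre_len;
  uint32_t sfh_fdeoff;
  uint32_t sfh_freoff;
};

// 4 (preamble) + 4 (one-byte fields) + 5 * 4 = 28 bytes precede the
// auxiliary header, FDE array, and FRE data.
static_assert(sizeof(sframe_header_sketch) == 28);
```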
Data structures
Function Descriptor Entries (FDEs)
Function Descriptor Entries serve as the bridge between functions and their unwinding information. Each FDE describes a function's location and provides a direct link to its corresponding Frame Row Entries (FREs), which contain the actual unwinding data.
```cpp
struct [[gnu::packed]] sframe_func_desc_entry {
  int32_t sfde_func_start_address;
  uint32_t sfde_func_size;
  uint32_t sfde_func_start_fre_off;
  uint32_t sfde_func_num_fres;
  // bits 0-3 fretype: sfre_start_address type
  // bit 4   fdetype: SFRAME_FDE_TYPE_PCINC or SFRAME_FDE_TYPE_PCMASK
  // bit 5   pauth_key: (AArch64 only) the signing key for the return address
  uint8_t sfde_func_info;
  // The size of the repetitive code block for SFRAME_FDE_TYPE_PCMASK; used by .plt
  uint8_t sfde_func_rep_size;
  uint16_t sfde_func_padding2;
};
```
The current design has room for optimization. The sfde_func_num_fres field uses a full 32 bits, which is wasteful for most functions. We could use uint16_t instead, requiring exceptionally large functions to be split across multiple FDEs.
It's important to note that SFrame's function concept represents code ranges rather than logical program functions. This distinction becomes particularly relevant with compiler optimizations like hot-cold splitting, where a single logical function may span multiple non-contiguous code ranges, each requiring its own FDE.
The padding field sfde_func_padding2 represents unnecessary overhead on modern architectures where unaligned memory access performs efficiently, making the alignment benefits negligible.
To enable binary search on sfde_func_start_address, FDEs must maintain a fixed size, which precludes the use of variable-length integer encodings like PrefixVarInt.
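Fixed-size FDEs make the lookup a plain binary search over sfde_func_start_address. A sketch (simplified: absolute addresses, a sorted array as in the SFRAME_F_FDE_SORTED case, and PCINC-style ranges):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct FdeSketch {
  int32_t start;  // stands in for sfde_func_start_address
  uint32_t size;  // stands in for sfde_func_size
};

// Find the FDE whose [start, start+size) range covers pc, or nullptr.
const FdeSketch *findFde(const std::vector<FdeSketch> &fdes, int32_t pc) {
  // First FDE starting strictly after pc; the candidate is the one before it.
  auto it = std::upper_bound(
      fdes.begin(), fdes.end(), pc,
      [](int32_t p, const FdeSketch &f) { return p < f.start; });
  if (it == fdes.begin())
    return nullptr;
  --it;
  return uint32_t(pc - it->start) < it->size ? &*it : nullptr;
}
```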
Frame Row Entries (FREs)
Frame Row Entries contain the actual unwinding information for specific program counter ranges within a function. The template design allows for different address sizes based on the function's characteristics.
```cpp
template <class AddrType> struct [[gnu::packed]] sframe_frame_row_entry {
  // If the fdetype is SFRAME_FDE_TYPE_PCINC, this is an offset relative to
  // sfde_func_start_address
  AddrType sfre_start_address;
  // bit 0    fre_cfa_base_reg_id: define BASE_REG as either FP or SP
  // bits 1-4 fre_offset_count: typically 1 to 3, describing CFA, FP, and RA
  // bits 5-6 fre_offset_size: byte size of offset entries (1, 2, or 4 bytes)
  sframe_fre_info sfre_info;
};
```
Each FRE contains variable-length stack offsets stored as trailing data. The fre_offset_size field determines whether offsets use 1, 2, or 4 bytes (uint8_t, uint16_t, or uint32_t), allowing optimal space usage based on stack frame sizes.
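The producer's width choice can be sketched as follows (a hypothetical helper, not the binutils implementation; offsets are treated as signed):

```cpp
#include <cstdint>

// Pick the smallest signed width (1, 2, or 4 bytes) that holds a stack
// offset, matching the three fre_offset_size encodings.
uint8_t freOffsetWidth(int32_t offset) {
  if (offset >= INT8_MIN && offset <= INT8_MAX)
    return 1;
  if (offset >= INT16_MIN && offset <= INT16_MAX)
    return 2;
  return 4;
}
```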
Architecture-specific stack offsets
SFrame adapts to different processor architectures by varying its offset encoding to match their respective calling conventions and architectural constraints.
x86-64
The x86-64 implementation takes advantage of the architecture's predictable stack layout:
First offset: Encodes CFA as BASE_REG + offset
Second offset (if present): Encodes FP as CFA + offset
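One x86-64 unwind step with these offsets might look like the following (a sketch: sfh_cfa_fixed_ra_offset is -8 on x86-64, and remote memory reads are simplified to direct dereferences):

```cpp
#include <cstdint>

struct Regs { uint64_t rip, rsp, rbp; };

// Advance one frame using an FRE's offsets (x86-64 layout described above).
// cfa_off: first offset (CFA = BASE_REG + cfa_off); fp_off: optional second
// offset (FP = *(CFA + fp_off)). The RA slot is at CFA - 8 on x86-64.
Regs unwindStep(const Regs &r, bool base_is_fp, int32_t cfa_off,
                bool have_fp, int32_t fp_off) {
  Regs out = r;
  uint64_t cfa = (base_is_fp ? r.rbp : r.rsp) + cfa_off;
  out.rip = *(const uint64_t *)(cfa - 8); // fixed return-address slot
  if (have_fp)
    out.rbp = *(const uint64_t *)(cfa + fp_off);
  out.rsp = cfa; // after the return, RSP equals the caller-side CFA
  return out;
}
```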
AArch64
AArch64's more flexible calling conventions require explicit return address tracking:
First offset: Encodes CFA as BASE_REG + offset
Second offset: Encodes the return address as CFA + offset
Third offset (if present): Encodes FP as CFA + offset
The explicit return address encoding accommodates AArch64's variable stack layouts and link register usage patterns.
s390x
FP and the return address may not be saved at the same time. In leaf functions GCC might save the return address and FP to floating-point registers.
First offset: Encodes CFA as BASE_REG + offset
Second offset (if present): Encodes the return address as one of
stack slot: CFA + offset2, if (offset2 & 1) == 0
register number: offset2 >> 1, if (offset2 & 1) == 1
not saved: if offset2 == SFRAME_FRE_RA_OFFSET_INVALID
Third offset (if present)
FP stack slot = CFA + offset3, if (offset3 & 1) == 0
FP register number = offset3 >> 1, if (offset3 & 1) == 1
The format uses 0 (an invalid RA-offset-from-CFA value) to indicate that the return address is not saved while FP is saved.
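The tag-bit decoding for the second offset can be sketched as (a hypothetical helper; 0 serves as the invalid value per the note above):

```cpp
#include <cstdint>

enum RaKind { RaNotSaved, RaStackSlot, RaRegister };

// Decode the s390x second offset: even values are CFA-relative stack slots,
// odd values encode a register number, and 0 means the RA is not saved.
RaKind decodeRa(int32_t offset2, int64_t cfa, int64_t *where) {
  if (offset2 == 0) // SFRAME_FRE_RA_OFFSET_INVALID
    return RaNotSaved;
  if (offset2 & 1) {
    *where = offset2 >> 1; // register number
    return RaRegister;
  }
  *where = cfa + offset2; // stack slot address
  return RaStackSlot;
}
```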
Toolchain implementation
In the GNU toolchain, the assembler in GNU Binutils reinterprets CFI directives and generates the .sframe section, while GCC itself has no knowledge of SFrame.
Some scenarios that cannot be described by .eh_frame in the absence of the frame pointer are equally inexpressible in SFrame. Additionally, SFrame has extra limitations, as certain CFI directives cannot be re-encoded into the SFrame format. You can look at the as_warn calls in binutils-gdb's gas/gen-sframe.c to learn about some cases.
On the other hand, the assembler approach allows SFrame to work with hand-written assembly files with CFI directives.
ORC and .sframe
TODO
.eh_frame and .sframe
SFrame reduces header size compared to .eh_frame plus .eh_frame_hdr by:
Eliminating .eh_frame_hdr through sorted sfde_func_start_address fields
Replacing CIE pointers with direct FDE-to-FRE references
Using variable-width sfre_start_address fields (1 or 2 bytes) for small functions
Storing start addresses instead of address ranges: start addresses in a small function use 1- or 2-byte fields, more efficient than .eh_frame's initial_location, which needs at least 4 bytes (DW_EH_PE_sdata4)
Hard-coding stack offsets rather than using flexible register specifications
However, the bytecode design of .eh_frame can sometimes be more efficient than .sframe, as demonstrated on x86-64.
SFrame serves as a specialized complement to .eh_frame rather than a complete replacement. The current version does not include personality routines, Language Specific Data Area (LSDA) information, or the ability to encode extra callee-saved registers. While these constraints make SFrame ideal for profilers, they prevent it from supporting C++ exception handling, where libstdc++/libc++abi requires the full .eh_frame feature set.
In practice, executables and shared objects will likely contain all three sections:
.eh_frame: Complete unwinding information for exception handling
.eh_frame_hdr (encompassed by the PT_GNU_EH_FRAME program header): Fast lookup table for .eh_frame
.sframe (encompassed by the PT_GNU_SFRAME program header)
The auxiliary header, currently unused, provides a pathway for future enhancements. It could potentially accommodate .eh_frame augmentation data such as personality routines, language-specific data areas (LSDAs), and signal frame handling, bridging some of the current functionality gaps.
Large text section support
The sfde_func_start_address field uses a signed 32-bit offset to reference functions, providing a ±2GB addressing range from the field's location. This signed encoding offers flexibility in section ordering: .sframe can be placed either before or after text sections.
However, this approach faces limitations with large binaries, particularly when LLVM generates .ltext sections for x86-64. The typical section layout creates significant gaps between .sframe and .ltext:
```
.ltext    // Large text section
.lrodata  // Large read-only data
.rodata   // Regular read-only data
          // .eh_frame and .sframe position
.text     // Regular text section
.data
.bss
.ldata    // Large data
.lbss     // Large BSS
```
Object file format design issues
Mandatory index building problems
Currently, Binutils enforces a single-element structure within each .sframe section, regardless of whether it resides in a relocatable object or final executable. While the SFRAME_F_FDE_SORTED flag can be cleared to permit unsorted FDEs, proposed unwinder implementations for the Linux kernel do not seem to support multiple elements in a single section. The design choice makes linker merging mandatory rather than optional.
This design choice stems from Linux kernel requirements, where kernel modules are relocatable files created with ld -r. The pending SFrame support for linux-perf expects each module to contain a single indexed format for efficient runtime processing. Consequently, GNU ld merges all input .sframe sections into a single indexed element, even when producing relocatable files. This behavior deviates from standard relocatable linking conventions that suppress synthetic section finalization.
This approach differs from almost every other metadata section, which support multiple concatenated elements, each with its own header and body. LLVM supports numerous well-behaved metadata sections (__asan_globals, .stack_sizes, __patchable_function_entries, __llvm_prf_cnts, __sancov_bools, __llvm_covmap, __llvm_gcov_ctr_section, .llvmcmd, and llvm_offload_entries) that concatenate without issues. SFrame stands apart as the only metadata section demanding version-specific merging as default linker behavior, creating an unprecedented maintenance burden. For optimal portability, unwinders should support multiple-element structures within a .sframe section.
For optimal portability, we must support object files from diverse origins—not just those built from a single toolchain. In environments where almost everything is built from source with a single toolchain offering strong SFrame support, forcing default-on index building may be acceptable. However, we must also accommodate environments with prebuilt object files using older SFrame versions, or toolchains that don't support old formats. I believe unwinders should support multiple-element structures within a .sframe section. When a linker builds an index for .sframe, it should be viewed as an optimization that relieves the unwinder from constructing its own index at runtime. This index construction should remain optional rather than required.
Section group compliance and garbage collection issues
GNU Assembler generates a single .sframe section containing relocations to STB_LOCAL symbols from multiple text sections, including those in different section groups.
This creates ELF specification violations when a referenced text section is discarded by the COMDAT section group rule. The ELF specification states:
A symbol table entry with STB_LOCAL binding that is defined relative to one of a group's sections, and that is contained in a symbol table section that is not part of the group, must be discarded if the group members are discarded. References to this symbol table entry from outside the group are not allowed.
The problem manifests when inline functions are deduplicated:
```sh
cat > a.cc <<'eof'
[[gnu::noinline]] inline int inl() { return 0; }
auto *fa = inl;
eof
cat > b.cc <<'eof'
[[gnu::noinline]] inline int inl() { return 0; }
auto *fb = inl;
eof
~/opt/gcc-15/bin/g++ -Wa,--gsframe -c a.cc b.cc
```
Linkers correctly reject this violation:
```
% ld.lld a.o b.o
ld.lld: error: relocation refers to a discarded section: .text._Z3inlv
>>> defined in b.o
>>> referenced by b.cc
>>>               b.o:(.sframe+0x1c)

% gold a.o b.o
b.o(.sframe+0x1c): error: relocation refers to local symbol ".text._Z3inlv" [2], which is defined in a discarded section
  section group signature: "inl()"
  prevailing definition is from a.o
```
(In 2020, I reported a similar issue for GCC -fpatchable-function-entry=.)
Some linkers don't implement this error check. A separate issue arises with garbage collection: by default, an unreferenced .sframe section will be discarded. If the linker implements a workaround to force-retain .sframe, it might inadvertently retain all text sections referenced by .sframe, even those that would otherwise be garbage collected.
The solution requires restructuring the assembler's output strategy. Instead of creating a monolithic .sframe section, the assembler should generate individual SFrame sections corresponding to each text section. When a text section belongs to a COMDAT group, its associated SFrame section must join the same group. For standalone text sections, the SHF_LINK_ORDER flag should establish the proper association.
This approach would create multiple SFrame sections within relocatable files, making the size optimization benefits of a simplified linking-view format even more compelling. While this comes with the overhead of additional section headers (where each Elf64_Shdr consumes 64 bytes), it's a cost we should pay to be a good ELF citizen. This reinforces the value of my section header reduction proposal.
Version compatibility challenges
The current design creates significant version compatibility problems. When a linker only supports v3 but encounters object files with v2 .sframe sections, it faces impossible choices:
Report errors: Breaking builds with mixed-version object files
Concatenate sections: Currently unsupported by unwinders
Upgrade v2 to v3: Requires maintaining version-specific merge logicfor every version
This differs fundamentally from merely reading a format—each version needs version-specific merging logic in every linker. Consider the scenario where v2 uses layout A, v3 uses layout B, and v4 uses layout C. A linker receiving objects with all three versions must produce coherent output with proper indexing while maintaining version-specific merge logic for each.
Real-world mixing scenarios include:
Third-party vendor libraries built with older toolchains
Users linking against prebuilt libraries from different sources
Users who don't need SFrame but must handle prebuilt libraries witholder versions
Users updating their linker to a newer version that drops legacySFrame support
Most users will not need stack tracing features—this may change eventually, but that will take many years. In the meantime, they must accept unneeded information while handling the resulting compatibility issues.
Requiring version-specific merging as default behavior would create a maintenance burden unmatched by any other loadable metadata section.
Proposed format separation
A future version should distinguish between linking and execution views to resolve the compatibility and maintenance challenges outlined above. This separation has precedent in existing debug formats: .debug_pubnames/.gdb_index provides an excellent model for separate linking and execution views. DWARF v5's .debug_names takes a different approach, unifying both views at the cost of larger linking formats—a reasonable tradeoff since relocatable files contain only a single .debug_names section, and debuggers can efficiently load sections with concatenated name tables.
For SFrame, the separation would work as follows:
Separate linking format. Assemblers produce a simpler format, omitting index-specific metadata fields such as sfh_num_fdes, sfh_num_fres, sfh_fdeoff, and sfh_freoff.
Default concatenation behavior. Linkers concatenate .sframe input sections by default, consistent with DWARF and other metadata sections. Linkers can handle mixed-version scenarios gracefully without requiring version-specific merge logic, eliminating the impossible maintenance burden of keeping version-specific merge logic for every SFrame version in every linker implementation. Distributions can roll out SFrame support incrementally without requiring all linkers to support index building immediately.
The unwinder implementation cost is manageable. Stack unwinders already need to support .sframe sections across the main executable and all shared objects. Supporting multiple concatenated elements within a single .sframe section presents no fundamental technical barrier—this is a one-time implementation cost that provides forward and backward compatibility.
Optional index construction. When the opt-in option --sframe-index is requested, the linker builds an index from recognized versions while reporting warnings for unrecognized ones. This is analogous to --gdb-index and --debug-names.
With this approach, the linker builds .sframe_idx from input .sframe sections. To support the Linux kernel workflow (ld -r for kernel modules), ld -r --sframe-index must also generate the indexed format.
The index construction happens before section matching in linker scripts. The output section description .sframe_idx : { *(.sframe_idx) } places the synthesized .sframe_idx into the .sframe_idx output section. .sframe input sections have been replaced by the linker-synthesized .sframe_idx, so we don't write *(.sframe).
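Under this proposal, a linker script fragment would reference only the synthesized section (a sketch of the hypothetical .sframe_idx scheme described above):

```
SECTIONS {
  /* The linker-synthesized index replaces the .sframe input sections. */
  .sframe_idx : { *(.sframe_idx) }
  /* No *(.sframe) pattern: inputs were consumed to build the index. */
}
```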
Alternative: Deriving SFrame from .eh_frame
An alternative approach could eliminate the need for assemblers to generate .sframe sections directly. Instead, the linker would merge and optimize .eh_frame as usual (which requires CIE and FDE boundary information), then derive .sframe (or .sframe_idx) from the optimized .eh_frame.
This approach offers a significant advantage: since the linker only reads the stable .eh_frame format and produces .sframe or .sframe_idx as output, version compatibility concerns disappear entirely.
While CFI instruction decoding introduces additional complexity (a step previously unneeded), this is balanced by the architectural advantage of centralizing the conversion logic. Rather than scattering format-specific processing code throughout the linker (similar to how SHF_MERGE and .eh_frame require special internal representations), the transformation logic remains localized.
The counterargument centers on maintenance burden. This fine-grained knowledge of the SFrame format may expose the linker to more frequent updates as the format evolves—a serious risk, given that the linker's foundational role in the build process demands exceptional stability and robustness.
Post-processing alternative
A more cautious intermediate strategy could leverage existing Linux distribution post-processing tools, modifying them to append .sframe sections to executable and shared object files after linking completes. While this introduces more friction than native linker support and requires integration into package build systems, it offers several compelling advantages:
Allows .sframe format experimentation without imposing linker complexity
Provides time for the format to mature and prove its value before committing to linker integration
Enables testing across diverse userspace packages in real-world scenarios
Post-link tools can optimize and even overwrite sections in-place without linker constraints
For cases where optimization significantly shrinks the section, .sframe can be placed at the end of the file (similar to BOLT moving .rodata)
However, this approach faces practical challenges. Post-processing adds build complexity, particularly with features like build-ids and read-only file systems. The success of .gdb_index, where linker support (--gdb-index) proved more popular than post-link tools, suggests that native linker support eventually becomes necessary for widespread adoption.
The key question is timing: should linker integration be the startingpoint or the outcome of proven stability?
SHF_ALLOC considerations
The .sframe section carries the SHF_ALLOC flag, meaning it's loaded as part of the program's read-only data segment. This design choice creates tradeoffs:
With SHF_ALLOC:
.sframe contributes to initial read-only data segment consumption
It can be accessed directly as part of the memory-mapped area, relying on the kernel's on-demand page fault mechanism
Without SHF_ALLOC:
No upfront memory cost
Tracers must open the file and initiate IO to mmap the section on demand
Runtime cost may not amortize well for frequent tracing
Analysis of 337 files in /usr/bin and /usr/lib/x86_64-linux-gnu/ shows that .eh_frame typically consumes 5.2% of file size (median: 5.1%).
If .sframe size is comparable to .eh_frame, this represents significant overhead for applications that never use stack tracing—likely the majority of users. Most users will not need stack trace features, raising the question of whether having .sframe always loaded is an acceptable overhead for distributions shipping it by default.
perf supports .debug_frame (tools/perf/util/unwind-libunwind-local.c), which does not have SHF_ALLOC. While there's a difference between the status quo and what's optimal, the non-SHF_ALLOC approach deserves consideration for scenarios where runtime tracing overhead can be amortized or where memory footprint matters more than immediate access.
Kernel challenges
The .sframe section may not be resident in physical memory. SFrame proposers are attempting to defer user stack traces until syscall boundaries.
Ian Rogers points out that BPF programs can no longer simply stack trace user code. This change breaks stack trace deduplication, a commonly used BPF primitive.
Miscellaneous minor considerations
Linker relaxation considerations:
Since .sframe carries the SHF_ALLOC flag, it affects text section addresses and consequently influences linker relaxation on architectures like RISC-V and LoongArch.
If variable-length encoding is introduced to the format, .sframe would behave as an address-dependent section similar to .relr.dyn. However, this dependency should not pose significant implementation challenges.
Endianness considerations:
The SFrame format currently supports endianness variants, which complicates toolchain implementation. While runtime consumers typically target a single endianness, development tools must handle both variants to support cross-compilation workflows.
The endianness discussion in The future of 32-bit support in the kernel reinforces my belief in preferring universal little-endian for new formats. A universal little-endian approach would reduce implementation complexity by eliminating the need for:
Endianness-aware function calls like read32le(config, p) where config->endian specifies the object file's byte order
Template-based abstractions such as template <class Endian> that must wrap every data access function
Instead, toolchain code could use straightforward calls like read32le(p), streamlining both implementation and maintenance.
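To illustrate, a fixed little-endian reader needs no endianness parameter at all: byte-wise assembly keeps it portable, and compilers fold it into a single load on little-endian hosts. A minimal sketch (LLVM's actual helpers live in llvm/Support/Endian.h; this is not that code):

```cpp
#include <cstdint>

// Read a 32-bit little-endian value from p, regardless of host byte order.
// No config/endian parameter is needed: the format fixes the byte order.
inline uint32_t read32le(const uint8_t *p) {
  return uint32_t(p[0]) | uint32_t(p[1]) << 8 | uint32_t(p[2]) << 16 |
         uint32_t(p[3]) << 24;
}
```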
This approach remains efficient even on big-endian architectures like IBM z/Architecture and POWER. z/Architecture's LOAD REVERSED instructions, for instance, handle byte swapping with minimal overhead, often requiring no additional instructions beyond normal loads. While slight performance differences may exist compared to native endian operations, the toolchain simplification benefits generally outweigh these concerns.
However, I understand that my opinion is probably not popular within the object file format community and faces resistance from stakeholders with significant big-endian investments.
Questioned benefits
SFrame's primary benefit centers on enabling frame pointer omission while preserving unwinding capabilities. In scenarios where users already omit leaf frame pointers, SFrame could theoretically allow switching from -fno-omit-frame-pointer -momit-leaf-frame-pointer to -fomit-frame-pointer -momit-leaf-frame-pointer. This benefit appears most significant on x86-64, which has limited general-purpose registers (without APX). Performance analyses show mixed results: some studies claim frame pointers degrade performance by less than 1%, while others suggest 1-2%. However, this argument overlooks a critical tradeoff—SFrame unwinding itself performs worse than frame pointer unwinding, potentially negating any performance gains from register availability.
Another claimed advantage is SFrame's ability to provide coverage in function prologues and epilogues, where frame-pointer-based unwinding may miss frames. Yet this overlooks a straightforward alternative: frame pointer unwinding can be enhanced to detect prologue and epilogue patterns by disassembling instructions at the program counter.
SFrame also faces a practical consideration: the .sframe section likely requires kernel page-in during unwinding, while the process stack is more likely already resident in physical memory. As Ian Rogers noted in LWN, system-wide profiling encounters limitations when system calls haven't transitioned to user code, BPF helpers may return placeholder values, and JIT compilers require additional SFrame support.
Looking ahead, hardware-assisted unwinding through features like x86 Shadow Stack and AArch64 Guarded Control Stack may reshape the entire landscape, potentially reducing the relevance of metadata-based unwinding formats. Meanwhile, compact unwinding schemes like Windows ARM64's demonstrate that significantly smaller metadata formats remain viable alternatives to both SFrame and .eh_frame. Proposals like Asynchronous Compact Unwind Descriptors have demonstrated that compact unwind formats can work with shrink-wrapping optimizations. There is a feature request for compact unwind information for AArch64: https://github.com/ARM-software/abi-aa/issues/344
Summary
Beyond these fundamental questions about SFrame's value proposition, the format offers a size improvement over the Linux kernel's ORC unwinder. Its design presents several implementation challenges that merit consideration for future versions:
Object file format design issues (mandatory index building, section group compliance, version compatibility)
Limited large text section support restricts deployment in modern binaries
Size issue
These technical concerns, combined with the fundamental value questions raised above, suggest that careful consideration is warranted before widespread adoption.
If we proceed, here is how to do it right
According to this comment on llvm-project #64449, "v3 is the version that will be submitted upstream when the time is right." Please share feedback on the format before it's finalized, even if you may not be impressed with the design.
To ensure rapid SFrame evolution without compatibility concerns, a better approach is to build a library that parses .eh_frame and generates SFrame. The Linux kernel can then use this library (in objtool?) to generate SFrame for vmlinux and modules. Relying on assembler/linker output for this critical metadata format requires a level of stability that is currently concerning.
The ongoing maintenance implications warrant particular attention. Observing the binutils mailing list reveals a significant volume of SFrame commits. Most linker features stabilize quickly after initial implementation, but SFrame appears to require continued evolution. Given the linker's foundational role in the build process, which demands exceptional stability and robustness, the long-term maintenance burden deserves careful consideration.
Early integration into the GNU toolchain has provided valuable feedback for format evolution, but this comes at the cost of coupling the format's maturity to linker stability. The SFrame GNU toolchain developers exhibit a concerning tendency to disregard ELF and linker conventions—a serious problem for all linker maintainers.
LLVM has had a battle-tested compact unwind format in production use since 2009 with OS X 10.6. The efficiency gains are dramatic even if it might only cover synchronous unwinding needs. OpenVMS's x86-64 port, which is ELF-based, also adopted this format as documented in their "VSI OpenVMS Calling Standard" and their 2018 post on LLVM Discourse. This isn't to suggest we should simply adopt the existing compact unwind format wholesale. The x86-64 design dates back to 2009 or earlier, and there are likely improvements we can make. However, we should aim for similar or better efficiency gains.
On AArch64, there are at least two formats the ELF one can learn from: LLVM's compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.
LLVM 21.1 has been released. As usual, I maintain lld/ELF and have added some notes to https://github.com/llvm/llvm-project/blob/release/21.x/lld/docs/ReleaseNotes.rst. I've meticulously reviewed nearly all the patches that are not authored by me. I'll delve into some of the key changes.
Added -z dynamic-undefined-weak to make undefined weak symbols dynamic when the dynamic symbol table is present. (#143831)
For -z undefs (the default for -shared), relocations referencing undefined strong symbols now behave like relocations referencing undefined weak symbols.
--why-live=<glob> prints, for each symbol matching <glob>, a chain of items that kept it live during garbage collection. This is inspired by the Mach-O LLD feature of the same name.
--thinlto-distributor= and --thinlto-remote-compiler= options are added to support Integrated Distributed ThinLTO. (#142757)
Linker script OVERLAY descriptions now support virtual memory regions (e.g. >region) and NOCROSSREFS.
When the last PT_LOAD segment is executable and includes BSS sections, its p_memsz member is now correct. (#139207)
Spurious ASSERT errors before the layout converges are now fixed.
For ARM and AArch64, --xosegment and --no-xosegment control whether to place executable-only and readable-executable sections in the same segment. The default option is --no-xosegment. (#132412)
For AArch64, added support for the SHF_AARCH64_PURECODE section flag, which indicates that the section contains only program code and no data. An output section will only have this flag set if all input sections also have it set. (#125689, #134798)
For AArch64 and ARM, added -zexecute-only-report, which checks for missing SHF_AARCH64_PURECODE and SHF_ARM_PURECODE section flags on executable sections. (#128883)
For AArch64, -z nopac-plt has been added.
For AArch64 and X86_64, added --branch-to-branch, which rewrites branches that point to another branch instruction to instead branch directly to the target of the second instruction. Enabled by default at -O2.
For AArch64, added support for -zgcs-report-dynamic, enabling checks for GNU GCS Attribute Flags in dynamic objects when GCS is enabled. It inherits its value from -zgcs-report (capped at the warning level) unless user-defined, ensuring compatibility with the GNU ld linker.
The default Hexagon architecture version in ELF object files produced by lld is changed to v68. This change is only effective when the version is not provided in the command line by the user and cannot be inferred from inputs.
For LoongArch, the initial-exec to local-exec TLS optimization has been implemented.
For LoongArch, several relaxation optimizations are supported, including relaxation for R_LARCH_PCALA_HI20/LO12 and R_LARCH_GOT_PC_HI20/LO12 relocations, instruction relaxation for R_LARCH_CALL36, TLS local-exec (LE)/global-dynamic (GD)/local-dynamic (LD) model relaxation, and TLSDESC code sequence relaxation.
For RISC-V, an oscillation bug due to call relaxation is now fixed. (#142899)
For x86-64, the .ltext section is now placed before .rodata.
tl;dr https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7 is a single-file Ruby program that downloads and compiles multiple compression utilities, benchmarks their compression and decompression performance on a specified input file, and finally generates an HTML file with scatter charts. Scroll to the end to view example HTML pages.
Compression algorithms can be broadly categorized into three groups based on their typical compression ratio and decompression speed:
Low ratio, high speed: lz4, snappy, Oodle Selkie.
Medium ratio, medium speed: zlib, zstd, brotli, Oodle Kraken.
High ratio, low speed: xz (LZMA), zpaq, Oodle Leviathan.
Low ratio Codecs in this category prioritize speed above all else. The compression and decompression speeds are comparable. They are designed to decompress so quickly that they don't introduce a noticeable delay when reading data from storage like solid-state drives. These codecs typically produce byte-aligned output and often skip the final step of entropy encoding, which, while crucial for high compression, is computationally intensive. They are excellent choices for applications where latency is critical, such as kernel features like zswap.
Medium ratio This is the sweet spot for many tasks. The codecs achieve a better compression ratio by employing entropy encoding, usually Huffman coding.
zstd has emerged as a clear leader, gaining popularity and effectively supplanting older codecs like the venerable DEFLATE (zlib).
High ratio They are designed to squeeze every last bit of redundancy out of the data, often at the cost of significantly longer compression and decompression times, and large memory usage. They are perfect for archival purposes or data distribution where the files are compressed once and decompressed infrequently. Codecs typically have 3 important components:
Transforms: Codecs typically implement strong transforms to increase redundancy, even very specific ones like branch/call/jump filters for machine code.
Prediction model: This model anticipates the next piece of data based on what has already been processed.
Entropy encoding: Traditional codecs use an arithmetic encoder, which is replaced by the more efficient range variant of Asymmetric Numeral Systems (rANS).
Some projects apply neural network models, such as Recurrent Neural Networks, Long Short-Term Memory, and Transformers, to the prediction model. They are usually very slow.
This categorization is loose. Many modern programs offer a wide range of compression levels that allow them to essentially span multiple categories. For example, a high-level zstd compression can achieve a ratio comparable to xz (a high-compression codec) by using more RAM and CPU. While zstd's compression speed or ratio is generally lower, its decompression speed is often much faster than that of xz.
Benchmarking
I want to benchmark the single-worker performance of a few compression programs:
lz4: Focuses on speed over compression ratio. Memory usage is extremely low. It seems Pareto superior to Google's Snappy.
zstd: Gained significant traction and obsoleted many existing codecs. Its LZ77 variant uses three recent match offsets like LZX. For entropy encoding, it employs Huffman coding for literals and 2-way interleaved Finite State Entropy for Huffman weights, literal lengths, match lengths, and offset codes. The large alphabet of literals makes Huffman a good choice, as compressing them with FSE provides little gain for a speed cost. However, other symbols have a small range, making them a sweet spot for FSE. zstd works on multiple streams at the same time to utilize instruction-level parallelism. zstd is supported by the Accept-Encoding: zstd HTTP header. Decompression memory usage is very low.
brotli: Uses a combination of LZ77, a 2nd-order context model, Huffman coding, and a static dictionary. The decompression speed is similar to gzip with a higher ratio. At lower levels, its performance is overshadowed by zstd. Compared with DEFLATE, it employs a larger sliding window (from 16KiB-16B to 16MiB-16B) and a smaller minimum match length (2 instead of 3). It has a predefined dictionary that works well for web content (but feels less elegant) and supports 120 transforms. brotli is supported by the Accept-Encoding: br HTTP header. Decompression memory usage is quite low.
bzip3: Combines BWT, RLE, and LZP and uses an arithmetic encoder. Memory usage is large.
xz: LZMA2 with a few filters. The filters must be enabled explicitly.
lzham: Provides a compression ratio similar to LZMA but with faster decompression. Compression is slightly slower while memory usage is larger. The build system is not well-polished for Linux. I have forked it, fixed stdint.h build errors, and installed lzhamtest. The command-line program lzhamtest should really be renamed to lzham.
zpaq: Functions as a command-line archiver supporting multiple files. It combines context mixing with an arithmetic encoder but operates very slowly.
kanzi: There is a wide variety of transforms and entropy encoders, unusual for a compression program. For the compression speed of enwik8, it's Pareto superior to xz, but decompression is slower. Levels 8 and 9 belong to the PAQ8 family and consume substantial memory.
I'd like to test lzham (not updated for a few years), but I'm having trouble getting it to compile due to a cstdio header issue.
Many modern compressors are parallel by default. I have to disable this behavior by using options like -T1. Still, zstd uses a worker thread for I/O overlap, but I don't bother with --single-thread.
To ensure fairness, each program is built with consistent compiler optimizations, such as -O3 -march=native.
Below is a Ruby program that downloads and compiles multiple compression utilities, then compresses and decompresses a specified input file. It collects performance metrics including execution time, memory usage, and compression ratio, and finally generates an HTML file with scatter charts visualizing the results. The program has several notable features:
Adding new compressors is easy: just modify COMPRESSORS.
Benchmark results are cached in files named cache_$basename_$digest.json, allowing reuse of previous runs for the same input file.
Adding a new compression level does not invalidate existing benchmark results for other levels.
The script generates an HTML file with interactive scatter charts. Each compressor is assigned a unique, deterministic color based on a hash of its name (using the hsl function in CSS).
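The deterministic-color trick is easy to sketch. The script itself does this in Ruby and emits CSS hsl(); the helper below (hueForName is a hypothetical name, not from the script) shows the idea:

```cpp
#include <functional>
#include <string>

// Map a compressor name to a stable hue in [0, 360). The same name always
// produces the same hue within a run, so chart colors are reproducible
// without maintaining a manual color table.
inline unsigned hueForName(const std::string &name) {
  return unsigned(std::hash<std::string>{}(name) % 360);
}
```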
The single-file Ruby program is available at https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7
Limitation
A single run might not be representative.
Running the executable incurs initialization overhead, which would be amortized in a library setup. However, a library setup would make updating libraries more difficult.
Demo
```
ruby bench.rb enwik8  # The first iframe below
ruby bench.rb clang   # The second iframe below
```
Many programs exhibit a stable decompression speed (uncompressed size / decompression time). There is typically a slightly higher decompression speed at higher compression levels. If you think of the compressed content as a form of "byte code", a more highly compressed file means there are fewer bytes for the decompression algorithm to process, resulting in faster decompression. Some programs, like zpaq and kanzi, use different algorithms that can result in significantly different decompression speeds.
xz -9 doesn't use parallelism on the two files under ~100 MiB because their uncompressed size is smaller than the default block size for level 9.
From install/include/lzma/container.h:
For each thread, about 3 * block_size bytes of memory will be allocated. This may change in later liblzma versions. If so, the memory usage will probably be reduced, not increased.
Alignment refers to the practice of placing data or code at memory addresses that are multiples of a specific value, typically a power of 2. This is typically done to meet the requirements of the programming language, ABI, or the underlying hardware. Misaligned memory accesses may be expensive or may cause traps on certain architectures.
This blog post explores how alignment is represented and managed as C++ code is transformed through the compilation pipeline: from source code to LLVM IR, assembly, and finally the object file. We'll focus on alignment for both variables and functions.
Object types have alignment requirements ([basic.fundamental], [basic.compound]) which place restrictions on the addresses at which an object of that type may be allocated. An alignment is an implementation-defined integer value representing the number of bytes between successive addresses at which a given object can be allocated. An object type imposes an alignment requirement on every object of that type; stricter alignment can be requested using the alignment specifier ([dcl.align]). Attempting to create an object ([intro.object]) in storage that does not meet the alignment requirements of the object's type is undefined behavior.
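These requirements can be inspected with alignof. A small illustration (the relations asserted below hold on any conforming implementation; the concrete values are implementation-defined):

```cpp
#include <cstddef>

// alignof yields a type's alignment requirement in bytes.
static_assert(alignof(char) == 1, "char has the weakest alignment requirement");

struct Pair {
  int a;
  double b;
};

// A class type is at least as strictly aligned as its most-aligned member.
static_assert(alignof(Pair) >= alignof(double),
              "aggregate alignment follows the strictest member");

// std::max_align_t is at least as strictly aligned as every scalar type.
static_assert(alignof(std::max_align_t) >= alignof(long double),
              "max_align_t bounds fundamental alignments");
```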
alignas can be used to request a stricter alignment. [decl.align]
An alignment-specifier may be applied to a variable or to a class data member, but it shall not be applied to a bit-field, a function parameter, or an exception-declaration ([except.handle]). An alignment-specifier may also be applied to the declaration of a class (in an elaborated-type-specifier ([dcl.type.elab]) or class-head ([class]), respectively). An alignment-specifier with an ellipsis is a pack expansion ([temp.variadic]).
Example:
```cpp
alignas(16) int i0;
struct alignas(8) S {};
```
If the strictest alignas on a declaration is weaker than the alignment it would have without any alignas specifiers, the program is ill-formed.
```
% echo 'alignas(2) int v;' | clang -fsyntax-only -xc++ -
<stdin>:1:1: error: requested alignment is less than minimum alignment of 4 for type 'int'
    1 | alignas(2) int v;
      | ^
1 error generated.
```
However, the GNU extension __attribute__((aligned(1))) can request a weaker alignment.
An explicit alignment may be specified for a global, which must be a power of 2. If not present, or if the alignment is set to zero, the alignment of the global is set by the target to whatever it feels convenient. If an explicit alignment is specified, the global is forced to have exactly that alignment. Targets and optimizers are not allowed to over-align the global if the global has an assigned section. In this case, the extra alignment could be observable: for example, code could assume that the globals are densely packed in their section and try to iterate over them as an array; alignment padding would break this iteration. For TLS variables, the module flag MaxTLSAlign, if present, limits the alignment to the given value. Optimizers are not allowed to impose a stronger alignment on these variables. The maximum alignment is 1 << 32.
Function alignment
An explicit alignment may be specified for a function. If not present, or if the alignment is set to zero, the alignment of the function is set by the target to whatever it feels convenient. If an explicit alignment is specified, the function is forced to have at least that much alignment. All alignments must be a power of 2.
A backend can override this with a preferred function alignment (STI->getTargetLowering()->getPrefFunctionAlignment()), if that is larger than the specified align value. (https://discourse.llvm.org/t/rfc-enhancing-function-alignment-attributes/88019/3)
In addition, align can be used in parameter attributes to decorate a pointer or vector of pointers.
LLVM back end representation
Global variables
AsmPrinter::emitGlobalVariable determines the alignment for global variables based on a set of nuanced rules:
With an explicit alignment (explicit):
If the variable has a section attribute, return explicit.
Otherwise, compute a preferred alignment for the data layout (getPrefTypeAlign, referred to as pref). Return pref < explicit ? explicit : max(explicit, getABITypeAlign).
Without an explicit alignment: return getPrefTypeAlign.
getPrefTypeAlign employs a heuristic for global variable definitions: if the variable's size exceeds 16 bytes and the preferred alignment is less than 16 bytes, it sets the alignment to 16 bytes. This heuristic balances performance and memory efficiency for common cases, though it may not be optimal for all scenarios. (See Preferred alignment of globals > 16 bytes in 2012)
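The heuristic amounts to roughly the following (a simplified sketch, not LLVM's actual code; preferredGlobalAlign is a hypothetical name):

```cpp
#include <cstdint>

// Sketch of the large-global heuristic: globals bigger than 16 bytes with a
// weaker preferred alignment are bumped to 16 so memcpy/SIMD code touching
// them can use aligned vector accesses.
inline uint64_t preferredGlobalAlign(uint64_t size, uint64_t abiAlign) {
  if (size > 16 && abiAlign < 16)
    return 16;
  return abiAlign;
}
```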
For assembly output, AsmPrinter emits .p2align (power-of-2 alignment) directives with a zero fill value (i.e. the padding bytes are zeros).
Functions
The function alignment is derived from:
The subtarget's minimum function alignment
If the function is not optimized for size (i.e. not compiled with -Os or -Oz), the maximum of the minimum alignment and the preferred alignment. For example, X86TargetLowering sets the preferred function alignment to 16.

```cpp
// FIXME: Shouldn't use pref alignment if explicit alignment is set on F.
if (!F.hasOptSize())
  Alignment = std::max(Alignment,
                       STI.getTargetLowering()->getPrefFunctionAlignment());
```
```
% echo 'void f(){} [[gnu::aligned(32)]] void g(){}' | clang --target=x86_64 -S -xc - -o -
        .file   "-"
        .text
        .globl  f                       # -- Begin function f
        .p2align        4
        .type   f,@function
f:                                      # @f
...
        .globl  g                       # -- Begin function g
        .p2align        5
        .type   g,@function
g:                                      # @g
```
The emitted .p2align directives omit the fill value argument: for code sections, this space is filled with no-op instructions.
Assembly representation
GNU Assembler supports multiple alignment directives:
.p2align 3: align to 2**3
.balign 8: align to 8
.align 8: this is identical to .balign on some targets and .p2align on others.
Clang supports "direct object emission" (clang -c typically bypasses a separate assembler): the LLVM AsmPrinter directly uses the MCObjectStreamer API. This allows Clang to emit the machine code directly into the object file, bypassing the need to parse and interpret alignment directives and instructions from a text-based assembly file.
These alignment directives have an optional third argument: the maximum number of bytes to skip. If doing the alignment would require skipping more bytes than the specified maximum, the alignment is not done at all. GCC's -falign-functions=m:n utilizes this feature.
Object file format
In an object file, the section alignment is determined by the strictest alignment directive present in that section. The assembler sets the section's overall alignment to the maximum of all these directives, as if an implicit directive were at the start.
This alignment is stored in the sh_addralign field within the ELF section header table. You can inspect this value using tools such as readelf -WS (llvm-readelf -S) or objdump -h (llvm-objdump -h).
Linker considerations
The linker combines multiple object files into a single executable. When it maps input sections from each object file into output sections in the final executable, it ensures that section alignments specified in the object files are preserved.
How the linker handles section alignment
Output section alignment: This is the maximum sh_addralign value among all its contributing input sections. This ensures the strictest alignment requirements are met.
Section placement: The linker also uses input sh_addralign information to position each input section within the output section. As illustrated in the following example, each input section (like a.o:.text.f or b.o:.text) is aligned according to its sh_addralign value before being placed sequentially.
```
output .text
  # align to sh_addralign(a.o:.text). No-op if this is the first section
  # without any preceding DOT assignment or data command.
  a.o:.text
  # align to sh_addralign(a.o:.text.f)
  a.o:.text.f
  # align to sh_addralign(b.o:.text)
  b.o:.text
  # align to sh_addralign(b.o:.text.g)
  b.o:.text.g
```
Linker script control A linker script can override the default alignment behavior. The ALIGN keyword enforces a stricter alignment. For example, .text : ALIGN(32) { ... } aligns the section to at least a 32-byte boundary. This is often done to optimize for specific hardware or for memory mapping requirements.
The SUBALIGN keyword on an output section overrides the input section alignments.
Padding: To achieve the required alignment, the linker may insert padding between sections or before the first input section (if there is a gap after the output section start). The fill value is determined by the following rules:
If specified, use the =fillexp output section attribute (within an output section description).
If a non-code section, use zero.
Otherwise, use a trap or no-op instruction.
Padding and section reordering
Linkers typically preserve the order of input sections from object files. To minimize the padding required between sections, linker scripts can use the SORT_BY_ALIGNMENT keyword to arrange input sections in descending order of their alignment requirements. Similarly, GNU ld supports --sort-common to sort COMMON symbols by decreasing alignment.
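A toy layout computation shows the saving (a sketch of the placement rule above, not lld's actual code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Section {
  uint64_t size, align; // align must be a power of 2
};

// Place sections sequentially, aligning each start address upward, and
// return the total size including padding. sortByAlign mimics
// SORT_BY_ALIGNMENT: descending alignment order.
inline uint64_t layoutSize(std::vector<Section> secs, bool sortByAlign) {
  if (sortByAlign)
    std::stable_sort(secs.begin(), secs.end(),
                     [](const Section &a, const Section &b) {
                       return a.align > b.align;
                     });
  uint64_t addr = 0;
  for (const Section &s : secs) {
    addr = (addr + s.align - 1) & ~(s.align - 1); // align up
    addr += s.size;
  }
  return addr;
}
```

For sections of (size, align) = (1,1), (8,8), (1,1), (16,16), input order occupies 48 bytes while descending-alignment order occupies 26.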
While this sorting can reduce wasted space, modern linking strategies often prioritize other factors, such as cache locality (for performance) and data similarity (for Lempel–Ziv compression ratio), which can conflict with sorting by alignment. (Search --bp-compression-sort= on Explain GNU style linker options.)
System page size
The alignment of a variable or function can be as large as the system page size. Some implementations allow a larger alignment. (Over-aligned segment)
ABI compliance
Some platforms have special rules. For example,
On SystemZ, the larl (load address relative long) instruction cannot generate odd addresses. To prevent GOT indirection, compilers ensure that symbols are at least aligned by 2. (Toolchain notes on z/Architecture)
On AIX, the default alignment mode is power: for double and long double, the first member of this data type is aligned according to its natural alignment value; subsequent members of the aggregate are aligned on 4-byte boundaries. (https://reviews.llvm.org/D79719)
The standard representation of the Itanium C++ ABI requires member function pointers to be even, to distinguish between virtual and non-virtual functions.
In the standard representation, a member function pointer for a virtual function is represented with ptr set to 1 plus the function's v-table entry offset (in bytes), converted to a function pointer as if by reinterpret_cast<fnptr_t>(uintfnptr_t(1 + offset)), where uintfnptr_t is an unsigned integer of the same size as fnptr_t.
Conceptually, a pointer to member function is a tuple:
A function pointer or virtual table index, discriminated by the least significant bit
A displacement to apply to the this pointer
Due to the least significant bit discriminator, member functions need a stricter alignment even if __attribute__((aligned(1))) is specified:
```cpp
virtual void bar1() __attribute__((aligned(1)));
```
Side note: check out MSVC C++ ABI Member Function Pointers for a comparison with the MSVC C++ ABI.
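One observable consequence of this tuple representation: on Itanium-ABI targets (e.g. x86-64 Linux), a pointer to member function occupies two pointer-sized words, unlike a plain function pointer. A quick check, assuming such a target:

```cpp
struct S {
  virtual void f();
  void g();
};

// Itanium C++ ABI: a pointer to member function is { ptr, adj } -- a function
// pointer or (vtable offset | 1), plus a this-pointer adjustment word.
using MemFn = void (S::*)();
static_assert(sizeof(MemFn) == 2 * sizeof(void (*)()),
              "member function pointers carry an adjustment word");
```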
Architecture considerations
Contemporary architectures generally support unaligned memory access, likely with very small performance penalties. However, some implementations might restrict or penalize unaligned accesses heavily, or require specific handling. Even on architectures supporting unaligned access, atomic operations might still require alignment.
On AArch64, a bit in the system control register sctlr_el1 enables alignment checking.
On x86, if the AM bit is set in the CR0 register and the AC bit is set in the EFLAGS register, alignment checking of user-mode data accesses is enabled.
Linux's RISC-V port supports prctl(PR_SET_UNALIGN, PR_UNALIGN_SIGBUS); to enable strict alignment.
clang -fsanitize=alignment can detect misaligned memory accesses. Check out my write-up.
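Portable code sidesteps these hazards by expressing unaligned loads through memcpy; compilers recognize the idiom and emit a single load where the hardware allows it (a minimal sketch):

```cpp
#include <cstdint>
#include <cstring>

// Load a 32-bit value from a possibly misaligned address without undefined
// behavior: a 4-byte memcpy into a local is folded into one mov/ldr on
// architectures that permit unaligned access.
inline uint32_t loadUnaligned32(const void *p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof v);
  return v;
}
```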
In 1989, US Patent 4814976, which covers "RISC computer with unaligned reference handling and method for the same" (4 instructions: lwl, lwr, swl, and swr), was granted to MIPS Computer Systems Inc. It created a barrier for other RISC processors; see The Lexra Story.
Almost every microprocessor in the world can emulate the functionality of unaligned loads and stores in software. MIPS Technologies did not invent that. By any reasonable interpretation of MIPS Technologies' patent, Lexra did not infringe. In mid-2001 Lexra received a ruling from the USPTO that all claims in the lawsuit were invalid because of prior art in an IBM CISC patent. However, MIPS Technologies appealed the USPTO ruling in Federal court, adding to Lexra's legal costs and hurting its sales. That forced Lexra into an unfavorable settlement. The patent expired on December 23, 2006, at which point it became legal for anybody to implement the complete MIPS-I instruction set, including unaligned loads and stores.
Aligning code for performance
GCC offers a family of performance-tuning options named -falign-*, which instruct the compiler to align certain code segments to specific memory boundaries. These options might improve performance by preventing certain instructions from crossing cache line boundaries (or instruction fetch boundaries), which can otherwise cause an extra cache miss.
-falign-functions=n: Align functions.
-falign-labels=n: Align branch targets.
-falign-jumps=n: Align branch targets, for branch targets where the targets can only be reached by jumping.
-falign-loops=n: Align the beginning of loops.
Important considerations
Inefficiency with small functions: Aligning small functions can be inefficient and may not be worth the overhead. To address this, GCC introduced -flimit-function-alignment in 2016. The option sets the .p2align directive's max-skip operand to the estimated function size minus one.
The max-skip operand, if present, is evaluated at parse time, so you cannot do:
```asm
.p2align 4, , b-a
a:
nop
b:
```
In LLVM, the x86 backend does not implement TargetInstrInfo::getInstSizeInBytes, making it challenging to implement -flimit-function-alignment.
Cold code: These options don't apply to cold functions. To ensure that cold functions are also aligned, use -fmin-function-alignment=n instead.
Benchmarking: Aligning functions can make benchmarks more reliable. For example, on x86-64, a hot function less than 32 bytes might be placed in a way that uses one or two cache lines (determined by function_addr % cache_line_size), making benchmark results noisy. Using -falign-functions=32 can ensure the function always occupies a single cache line, leading to more consistent performance measurements.
LLVM notes: In clang/lib/CodeGen/CodeGenModule.cpp, -falign-functions=N sets the alignment if a function does not have the gnu::aligned attribute.
A low-overhead loop (also called a zero-overhead loop) is a hardware-assisted looping mechanism found in many processor architectures, particularly digital signal processors (DSPs). The processor includes dedicated registers that store the loop start address, loop end address, and loop count. A hardware loop typically consists of three components:
Loop setup instruction: Sets the loop end address and iteration count
Loop body: Contains the actual instructions to be repeated
Loop end instruction: Jumps back to the loop body if further iterations are required
Here is an example from the Arm v8.1-M low-overhead branch extension.
```asm
1: dls lr, Rn    // Setup loop with count in Rn
   ...           // Loop body instructions
2: le lr, 1b     // Loop end - branch back to label 1 if needed
```
To minimize the number of cache lines used by the loop body, ideally the loop body (the instruction immediately following DLS) should be aligned to a 64-byte boundary. However, GNU Assembler lacks a directive to specify alignment like "align DLS to a multiple of 64 plus 60 bytes." Inserting an alignment after the DLS is counterproductive, as it would introduce unwanted NOP instructions at the beginning of the loop body, negating the performance benefits of the low-overhead loop mechanism.
It would be desirable to simulate the functionality with .org ((.+4+63) & -64) - 4 // ensure that .+4 is aligned to 64-byte boundary, but this complex expression involves bitwise AND and is not a relocatable expression. The LLVM integrated assembler would report expected absolute expression, while GNU Assembler has a similar error.
A potential solution would be to extend the alignment directives with an optional offset parameter:
```asm
# Align to 64-byte boundary with 60-byte offset, using NOP padding in code sections
.balign 64, , , 60

# Same alignment with offset, but skip at most 16 bytes of padding
.balign 64, , 16, 60
```
Xtensa's LOOP instruction has a similar alignment requirement, but I am not familiar with the details. GNU Assembler handles the special alignment as a special machine-dependent fragment. (https://sourceware.org/binutils/docs/as/Xtensa-Automatic-Alignment.html)
My previous post, LLVM integrated assembler: Improving expressions and relocations, delved into enhancements made to LLVM's expression resolving and relocation generation. This post covers recent refinements to MC, focusing on sections and symbols.
Sections
Sections are named, contiguous blocks of code or data within an object file. They allow you to logically group related parts of your program. The assembler places code and data into these sections as it processes the source file.
In LLVM 20, the MCSection class used an enum called SectionVariant to differentiate between various object file formats, such as ELF, Mach-O, and COFF. The format-specific subclasses are used in contexts where the section type is known at compile time, such as in MCStreamer and MCObjectTargetWriter. This change eliminates the need for runtime type information (RTTI) checks, simplifying the codebase and improving efficiency.
Additionally, the storage for fragments' fixups (adjustments to addresses and offsets) has been moved into the MCSection class.
Symbols
Symbols are names that represent memory addresses or values.
```cpp
class MCSymbol {
protected:
  /// The kind of the symbol. If it is any value other than unset then this
  /// class is actually one of the appropriate subclasses of MCSymbol.
  enum SymbolKind {
    SymbolKindUnset,
    SymbolKindCOFF,
    SymbolKindELF,
    SymbolKindGOFF,
    SymbolKindMachO,
    SymbolKindWasm,
    SymbolKindXCOFF,
  };

  /// A symbol can contain an Offset, or Value, or be Common, but never more
  /// than one of these.
  enum Contents : uint8_t {
    SymContentsUnset,
    SymContentsOffset,
    SymContentsVariable,
    SymContentsCommon,
    SymContentsTargetCommon, // Index stores the section index
  };
```
Similar to sections, the MCSymbol class also used a discriminator enum, SymbolKind, to distinguish between object file formats. This enum has also been removed.
Furthermore, the MCSymbol class had an enum Contents to specify the kind of symbol. This name was a bit confusing, so it has been renamed to enum Kind for clarity.
A special enumerator, SymContentsTargetCommon, which was used by AMDGPU for a specific type of common symbol, has also been removed. The functionality it provided is now handled by updating ELFObjectWriter to respect the symbol's section index (SHN_AMDGPU_LDS for this special AMDGPU symbol).
sizeof(MCSymbol) has been reduced to 24 bytes on 64-bit systems.
The previous blog post LLVM integrated assembler: Improving expressions and relocations describes other changes:
The MCSymbol::IsUsed flag was a workaround for detecting a subset of invalid reassignments and is removed.
The MCSymbol::IsResolving flag is added to detect cyclic dependencies of equated symbols.
In my previous assembler posts, I've discussed improvements on expression resolving and relocation generation. Now, let's turn our attention to recent refinements within section fragments. Understanding how an assembler utilizes these fragments is key to appreciating the improvements we've made. At a high level, the process unfolds in three main stages:
Parsing phase: The assembler constructs section fragments. These fragments represent sequences of regular instructions or data, span-dependent instructions, alignment directives, and other elements.
Section layout phase: Once fragments are built, the assembler assigns offsets to them and finalizes the span-dependent content.
Relocation decision phase: In the final stage, the assembler evaluates fixups and, if necessary, updates the content of the fragments.
When the LLVM integrated assembler was introduced in 2009, its section and fragment design was quite basic. Performance wasn't the concern at the time. As LLVM evolved, many assembler features added over the years came to rely heavily on this original design. This created a complex web that made optimizing the fragment representation increasingly challenging.
Here's a look at some of the features that added to this complexity over the years:
2010: Mach-O .subsection_via_symbols and atoms
2012: NativeClient's bundle alignment mode. I've created a dedicated chapter for this.
2015: Hexagon instruction bundle
2016: CodeView variable definition ranges
2018: RISC-V linker relaxation
2020: x86 -mbranches-within-32B-boundaries
2023: LoongArch linker relaxation. This is largely identical to RISC-V linker relaxation. Any refactoring or improvements to the RISC-V linker relaxation often necessitate corresponding changes to the LoongArch implementation.
I've included the start year for each feature to indicate when it was initially introduced, to the best of my knowledge. This doesn't imply that maintenance stopped after that year. On the contrary, many of these features, like RISC-V linker relaxation, require ongoing, active maintenance.
Despite the intricate history, I've managed to untangle these dependencies and implement the necessary fixes. And that, in a nutshell, is what this blog post is all about!
Reducing sizeof(MCFragment)
A significant aspect of optimizing fragment management involved directly reducing the memory footprint of the MCFragment object itself. Several targeted changes contributed to making sizeof(MCFragment) smaller, as mentioned in my previous blog post: Integrated assembler improvements in LLVM 19.
The fragment management system has also been streamlined by transitioning from a doubly-linked list (llvm::iplist) to a singly-linked list, eliminating unnecessary overhead. A few prerequisite commits removed backward iterator requirements. It's worth noting that the complexities introduced by features like NaCl's bundle alignment mode, x86's -mbranches-within-32B-boundaries option, and Hexagon's instruction bundles presented challenges.
The quest for trivially destructible fragments
Historically, MCFragment subclasses, specifically MCDataFragment and MCRelaxableFragment, relied on SmallVector member variables to store their content and fixups. This approach, while functional, presented two key inefficiencies:
Inefficient storage of small objects: The content and fixups for individual fragments are typically very small. Storing a multitude of these tiny objects individually within SmallVectors led to less-than-optimal memory utilization.
Non-trivial destructors: When deallocating sections, the ~MCSection destructor had to meticulously traverse the fragment list and explicitly destroy each fragment.
In 2024, @aengelke initiated a draft to store fragment content out-of-line. Building upon that foundation, I've extended this approach to also store fixups out-of-line, and ensured compatibility with the aforementioned features that cause complexity (especially RISC-V and LoongArch linker relaxation).
Furthermore, MCRelaxableFragment previously contained MCInst Inst;, which also necessitated a non-trivial destructor. To address this, I've redesigned its data structure. Operands are now stored within the parent MCSection, and the MCRelaxableFragment itself only holds references:
Unfortunately, we still need to encode MCInst::Flags to support the x86 EVEX prefix, e.g., {evex} xorw $foo, %ax. My hope is that the x86 maintainers might refactor X86MCCodeEmitter::encodeInstruction to make this flag storage unnecessary.
The new design of MCFragment and MCSection is as follows:
```cpp
class MCFragment {
  ...
  // Track content and fixups for the fixed-size part as fragments are
  // appended to the section. The content remains immutable, except when
  // modified by applyFixup.
  uint32_t ContentStart = 0;
  uint32_t ContentEnd = 0;
  uint32_t FixupStart = 0;
  uint32_t FixupEnd = 0;

  // Track content and fixups for the optional variable-size tail part,
  // typically modified during relaxation.
  uint32_t VarContentStart = 0;
  uint32_t VarContentEnd = 0;
  uint32_t VarFixupStart = 0;
  uint32_t VarFixupEnd = 0;
};

class MCSection {
  ...
  // Content and fixup storage for fragments
  SmallVector<char, 0> ContentStorage;
  SmallVector<MCFixup, 0> FixupStorage;
  SmallVector<MCOperand, 0> MCOperandStorage;
};
```
(As a side note, the LLVM CamelCase variables are odd. As the MC maintainer, I'd be delighted to see them refactored to camelBack or snake_case if people agree on the direction.)
Prior to LLVM 21.1, the assembler, operating with a fragment design dating back to 2009, placed every span-dependent instruction into its own distinct fragment. The x86 code sequence push rax; jmp foo; nop; jmp foo would be represented with numerous fragments: MCDataFragment(push rax); MCRelaxableFragment(jmp foo); MCDataFragment(nop); MCRelaxableFragment(jmp foo).
A more efficient approach emerged: storing both a fixed-size part and an optional variable-size tail within a single fragment.
The fixed-size part maintains a consistent size throughout the assembly process.
The variable-size tail, if present, encodes elements that can change in size or content, such as a span-dependent instruction, an alignment directive, a fill directive, or other similar span-dependent constructs.
The new design led to significantly fewer fragments:
Encoding individual instructions is the most performance-critical operation within MCObjectStreamer. Recognizing this, significant effort has been dedicated to reducing this overhead since May 2023.
"Enhanced relaxation": The feature allows x86 prefix padding for all instructions, effectively making every instruction span-dependent and requiring its own fragment. My D94542 disabled this by default due to concerns about -g vs -g0 differences.
My recent optimization efforts demanded careful attention to this particularly complex and performance-sensitive code.
Eager fragment creation
Encoding an instruction is a far more frequent operation than appending a variable-size tail to the current fragment. In the previous design, the instruction encoder was burdened with an extra check: it had to determine if the current fragment already had a variable-size tail.
1 2 3 4 5 6 7 8 9 10
encodeInstruction: if (current fragment has a variable-size tail) start a new fragment append data to the current fragment
emitValueToAlignment: Encode the alignment in the variable-size tail of the current fragment
emitDwarfLocDirective: Encode the .loc in the variable-size tail of the current fragment
Our new strategy optimizes this by maintaining a current fragment that is guaranteed not to have a variable-size tail. This means functions appending data to the fixed-size part no longer need to perform this check. Instead, any function that sets a variable-size tail will now immediately start a new fragment.
Here's how the workflow looks with this optimization:
```
encodeInstruction:
  assert(current fragment doesn't have a variable-size tail)
  append data to the current fragment

emitValueToAlignment:
  Encode the alignment in the variable-size tail of the current fragment
  start a new fragment

emitDwarfLocDirective:
  Encode the .loc in the variable-size tail of the current fragment
  start a new fragment
```
It's worth noting that the first patch was made possible thanks to the removal of the bundle alignment mode.
Fragment content in trailing data
Our MCFragment class manages four distinct sets of appendable data: fixed-size content, fixed-size fixups, variable-size tail content, and variable-size tail fixups. Of these, the fixed-size content is typically the largest. We can optimize its storage by utilizing it as trailing data, akin to a flexible array member.
This approach offers several compelling advantages:
Improved data locality: Storing the content after the MCFragment object enhances cache utilization.
Simplified metadata: We can replace the pair of uint32_t ContentStart = 0; uint32_t ContentEnd = 0; with a single uint32_t ContentSize;.
This optimization leverages a clever technique made possible by using a special-purpose bump allocator. After allocating sizeof(MCFragment) bytes for a new fragment, we know that any remaining space within the current bump allocator block immediately follows the fragment's end. This contiguous space can then be efficiently used for the fragment's trailing data.
However, this design introduces a few important considerations:
Tail fragment appends only: Data can only be appended to the tail fragment of a subsection. Fragments located in the middle of a subsection are immutable in their fixed-size content. Any post-assembler-layout adjustments must target the variable-size tail.
Dynamic allocation management: When new data needs to be appended, a function is invoked to ensure the current bump allocator block has sufficient space. If not, the current fragment is closed (its fixed-size content is finalized), and a new fragment is started. For instance, an 8-byte sequence could be stored as one single fragment, or, if space constraints dictate, as two fragments each encoding 4 bytes.
New block allocation: If the available space in the current block is insufficient, a new block large enough to accommodate both an MCFragment and the required bytes for its trailing data is allocated.
Section/subsection switching: The previously saved fragment list tail cannot be simply reused. This is because it's tied to the memory space of the previous bump allocator block. Instead, a new fragment must be allocated using the current bump allocator block and appended to the new subsection's tail.
I have thought about making the variable-size content immediately follow the fixed-size content, but leb128 and x86's potentially very long instructions (15 bytes) stopped me from doing it. There is certainly room for future improvements, though.
Google's now-discontinued Native Client (NaCl) project provided a sandboxing environment through a combination of Software Fault Isolation (SFI) and memory segmentation. A distinctive feature of its SFI implementation was the "bundle alignment mode", which adds NOP padding to ensure that no instruction crosses a 32-byte alignment boundary. The verifier's job is to check all instructions starting at 32-byte-multiple addresses.
While the core concept of aligned bundling is intriguing, its implementation within the LLVM assembler proved problematic. Introduced in 2012, this feature imposed noticeable performance penalties on users who had no need for NaCl and, perhaps more critically, significantly increased the complexity of MC's internal workings. I was particularly concerned by its pervasive modifications to MCObjectStreamer and MCAssembler.
In MCObjectStreamer, newly defined labels were put into a "pending label" list and initially assigned to an MCDummyFragment associated with the current section. The symbols would be reassigned to a new fragment when the next instruction or directive was parsed. This pending label system introduced complexity, and a missing flushPendingLabels could lead to subtle bugs related to incorrect symbol values. flushPendingLabels was called by many MCObjectStreamer functions, noticeably once for each new fragment, adding overhead. It also complicated label difference evaluation due to MCDummyFragment in MCExpr.cpp:AttemptToFoldSymbolOffsetDifference.
For the following code, aligned bundling requires that .Ltmp is defined at addl.
2025: Finally, MC: Remove bundle alignment mode, after Derek Schuff agreed to drop NaCl support from LLVM.
Should future features require a variant of bundle alignment, I firmly believe a much cleaner implementation is necessary. This could potentially be achieved through a backend hook within X86AsmBackend::finishLayout, applied after the primary assembler layout phase, similar to how the -mbranches-within-32B-boundaries option is handled, though even that implementation warrants an extensive revisit itself.
Lessons learned
The cost of missing early optimization
Early design choices can have a far-reaching impact on future code. The initial LLVM MC design, while admirably simple in its inception, inadvertently created a rigid foundation. As new features piled on, each relying more and more on the specific fragment internals, rectifying foundational inefficiencies became incredibly challenging. Hyrum's Law was evident: features built on this foundation inevitably depended on all its observable behaviors. Optimizing the underlying structure required not just a change to the core, but also a thorough fix for all its unsuspecting users. I encountered significant struggles with the deeply ingrained complexities stemming from NaCl's bundle alignment mode, x86's -mbranches-within-32B-boundaries option, and the intricacies of RISC-V linker relaxation.
Cargo cult programming and snowball effect
I observed instances of "cargo cult programming", where existing solutions were copied without a full understanding of their underlying rationale or applicability. For example:
The WebAssembly implementation heavily mirrored that of ELF. Consequently, many improvements made to the ELF component often necessitated corresponding, sometimes redundant, changes to the WebAssembly implementation. In addition, the WebAssembly implementation copied ELF-specific code that was irrelevant for WebAssembly's architecture, adding unnecessary bloat and complexity.
LoongArch's RISC-V replication: LoongArch's linker relaxation implementation directly copied the approach taken for RISC-V. Refactoring or improvements to RISC-V's linker relaxation frequently require mirrored changes in the LoongArch codebase, creating parallel maintenance burdens. I am particularly glad that I landed my foundational [RISCV] Make linker-relaxable instructions terminate MCDataFragment and [RISCV] Allow delayed decision for ADD/SUB relocations in 2023, before the LoongArch team replicated the RISC-V approach. This timing, I hope, mitigated some future headaches for their implementation.
These patterns illustrate how initial design choices, or the expedience of copying existing solutions, can lead to a "snowball effect" of accumulating complexity and redundant code that makes future optimization and maintenance significantly harder. On a positive note, I'm also pleased that the streamlining of the relocation generation framework was completed before Apple's upstreaming of their Mach-O support for 32-bit RISC-V. This critical work should provide a more robust and less complex base for their contributions, and reduce maintenance on my end.
The cost of features
Specific features, particularly those designed for niche or specialized use cases like NaCl's bundle alignment mode, introduced disproportionate complexity and performance overhead across the entire assembler. Even though NaCl itself was deprecated in 2020, it took until 2025 to finally excise its complex support from LLVM. This highlights a common challenge in large, open-source projects: while many developers are motivated to add new features, there's often far less incentive or dedicated effort to streamline or remove their underlying implementation complexities once they're no longer strictly necessary or have become a performance drain.
I want to acknowledge the work of individuals like Rafael Ávila de Espíndola, Saleem Abdulrasool, and Nirav Dave, whose improvements to LLVM MC were vital. Without their contributions, the MC layer would undoubtedly be in a far less optimized state today.
Epilogue
This extensive work on fragment optimization would not have been possible without the invaluable contributions of Alexis Engelke. My sincere thanks go to Alexis for his meticulous reviews of numerous patches, his insightful suggestions, and for contributing many significant improvements himself.
Appendix: How GNU Assembler mastered fragments decades ago
After dedicating several paragraphs to explaining the historical shortcomings of LLVM MC's fragment representation, a natural question arises: how does GNU Assembler (GAS), arguably the other most popular assembler on Linux systems, approach fragment handling?
Delving into its history reveals a fascinating answer. The earliest commit I could locate is a cvs2svn-generated record from April 1991. Given the 1987 copyright notice within the code, it's highly probable that this foundational work on fragments was laid down as early as 1987.
You can explore this initial structure in as.h here: https://github.com/bminor/binutils-gdb/commit/3a69b3aca678a3caf3ade7f9d42d18233b097ec6#diff-0771d3312685417eb5061a8f0856da4f0406ca8bd6c7d68b6a50a026a4e48c9dR212. Please check out as.h and frags.c.
Observing the frag struct, a few points stand out:
While the exact purpose of fr_offset isn't immediately clear to me, fr_fix and fr_var bear a striking resemblance to the concepts we've recently introduced in MCFragment. GAS does make the variable-size content immediately follow the fixed-size content, though.
The char fr_literal[1] demonstrates an early use of what we now call a flexible array member. Today, GCC and Clang's -fstrict-flex-arrays=2 would report a warning.
fr_symbol could be more appropriately placed within a union.
fr_pcrel_adjust and fr_bsr would ideally be architecture-specific data.
Fragments are allocated using obstacks, which appear to be a more sophisticated form of a bump allocator, with additional bookkeeping overhead.
But truly, I should stop the minor nit-picking. What astonishingly impresses me is the sheer foresight demonstrated in GAS's fragment allocator design. Conceived in 1987 or even earlier, it masterfully anticipated solutions that LLVM MC, first conceived in 2009, has only now achieved decades later. This design held the lead on fragment architecture for nearly four decades!
My greatest tribute goes to the original authors of GNU Assembler for this remarkable piece of engineering.
```c
/*
 * A code fragment (frag) is some known number of chars, followed by some
 * unknown number of chars. Typically the unknown number of chars is an
 * instruction address whose size is yet unknown. We always know the greatest
 * possible size the unknown number of chars may become, and reserve that
 * much room at the end of the frag.
 * Once created, frags do not change address during assembly.
 * We chain the frags in (a) forward-linked list(s). The object-file address
 * of the 1st char of a frag is generally not known until after relax().
 * Many things at assembly time describe an address by {object-file-address
 * of a particular frag}+offset.

 BUG: it may be smarter to have a single pointer off to various different
 notes for different frag kinds. See how code pans
 */
struct frag			/* a code fragment */
{
  unsigned long fr_address;	/* Object file address. */
  struct frag *fr_next;		/* Chain forward; ascending address order. */
				/* Rooted in frch_root. */

  long fr_fix;			/* (Fixed) number of chars we know we have. */
				/* May be 0. */
  long fr_var;			/* (Variable) number of chars after above. */
				/* May be 0. */
  struct symbol *fr_symbol;	/* For variable-length tail. */
  long fr_offset;		/* For variable-length tail. */
  char *fr_opcode;		/* ->opcode low addr byte, for relax()ation */
  relax_stateT fr_type;		/* What state is my tail in? */
  relax_substateT fr_subtype;
  /* These are needed only on the NS32K machines */
  char fr_pcrel_adjust;
  char fr_bsr;
  char fr_literal[1];		/* Chars begin here. */
				/* One day we will compile fr_literal[0]. */
};
```
For years, I've been involved in updating LLVM's MC layer. A recent journey led me to eliminate the FK_PCRel_ fixup kinds:

```
MCFixup: Remove FK_PCRel_

The generic FK_Data_ fixup kinds handle both absolute and PC-relative
fixups. ELFObjectWriter sets IsPCRel to true for `.long foo-.`, so the
backend has to handle PC-relative FK_Data_.

However, the existence of FK_PCRel_ encouraged backends to implement it
as a separate fixup type, leading to redundant and error-prone code.
Removing FK_PCRel_ simplifies the overall fixup mechanism.
```

As a prerequisite, I had to update several backends that relied on the now-deleted fixup kinds. It was during this process that something unexpected happened. Contributors reported that when built by GCC 13.3.0, the LLVM integrated assembler had test failures.
To investigate, I downloaded and built GCC 13.3.0 locally:
```sh
../../configure --prefix=$HOME/opt/gcc-13.3.0 --disable-bootstrap --enable-languages=c,c++ --disable-libsanitizer --disable-multilib
make -j 30 && make -j 30 install
```
I then built a Release build (-O3) of LLVM. Sure enough, the failure was reproducible:
```
% /tmp/out/custom-gcc-13/bin/llc llvm/test/CodeGen/X86/2008-08-06-RewriterBug.ll -mtriple=i686 -o s -filetype=obj
Unknown immediate size
UNREACHABLE executed at /home/ray/llvm/llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h:904!
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /tmp/out/custom-gcc-13/bin/llc llvm/test/CodeGen/X86/2008-08-06-RewriterBug.ll -mtriple=i686 -o s -filetype=obj
1.	Running pass 'Function Pass Manager' on module 'llvm/test/CodeGen/X86/2008-08-06-RewriterBug.ll'.
2.	Running pass 'X86 Assembly Printer' on function '@foo'
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  llc 0x0000000002f06bcb
fish: Job 1, '/tmp/out/custom-gcc-13/bin/llc …' terminated by signal SIGABRT (Abort)
```
Interestingly, a RelWithDebInfo build (-O2 -g) of LLVM did not reproduce the failure, suggesting either undefined behavior or an optimization-related issue within GCC 13.3.0.
The Bisection trail
I built GCC at the releases/gcc-13 branch, and the issue vanished. This strongly indicated that the problem lay somewhere between the releases/gcc-13.3.0 tag and the releases/gcc-13 branch.
The bisection led me to a specific commit, directing me to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109934#c6.
I developed a workaround at the code block with a typo "RemaningOps". Although I had observed it before, I was hesitant to introduce a commit solely for a typo fix. However, it became clear this was the perfect opportunity to address both the typo and implement a workaround for the GCC miscompilation. This led to the landing of this commit, resolving the miscompilation.
Sam James from Gentoo mentioned that the miscompilation was introduced by a commit cherry-picked into GCC 13.3.0. GCC 13.2.0 and GCC 13.4.0 are good.
My previous post, LLVM integrated assembler: Improving MCExpr and MCValue, delved into enhancements made to LLVM's internal MCExpr and MCValue representations. This post covers recent refinements to MC, focusing on expression resolving and relocation generation.
Preventing cyclic dependencies
Equated symbols may form a cycle, which is not allowed.
```asm
# CHECK: [[#@LINE+2]]:7: error: cyclic dependency detected for symbol 'a'
# CHECK: [[#@LINE+1]]:7: error: expression could not be evaluated
a = a + 1

# CHECK: [[#@LINE+3]]:6: error: cyclic dependency detected for symbol 'b1'
# CHECK: [[#@LINE+1]]:6: error: expression could not be evaluated
b0 = b1
b1 = b2
b2 = b0
```
Previously, LLVM's integrated assembler used an occurs check to detect these cycles when parsing symbol equating directives.
```cpp
bool parseAssignmentExpression(StringRef Name, bool allow_redef,
                               MCAsmParser &Parser, MCSymbol *&Sym,
                               const MCExpr *&Value) {
  ...
  // Validate that the LHS is allowed to be a variable (either it has not been
  // used as a symbol, or it is an absolute symbol).
  Sym = Parser.getContext().lookupSymbol(Name);
  if (Sym) {
    // Diagnose assignment to a label.
    //
    // FIXME: Diagnostics. Note the location of the definition as a label.
    // FIXME: Diagnose assignment to protected identifier (e.g., register name).
    if (Value->isSymbolUsedInExpression(Sym))
      return Parser.Error(EqualLoc, "Recursive use of '" + Name + "'");
    ...
  }
```
isSymbolUsedInExpression implemented the occurs check as a tree (or more accurately, a DAG) traversal.
While generally effective, this routine wasn't universally applied across all symbol equating scenarios, such as .weakref or some target-specific parsing code, leading to potentially undetected cycles and therefore infinite loops during assembler execution.
To address this, I adopted a 2-color depth-first search (DFS) algorithm. While a 3-color DFS is typical for DAGs, a 2-color approach suffices for our trees, although it might lead to more work when a symbol is visited multiple times. Shared subexpressions are very rare in LLVM.
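The idea can be illustrated with a small standalone sketch (the Symbol type and hasCycle helper below are hypothetical, not LLVM's actual classes): each symbol carries a single "resolving" bit that is set on entry and cleared on exit, so re-entering a symbol on the current DFS path signals a cycle.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal 2-color DFS sketch: one bit per symbol instead of the usual
// white/gray/black coloring. Shared subexpressions may be revisited
// (more work), which is acceptable because they are rare in practice.
struct Symbol {
  std::vector<std::string> Operands; // symbols referenced by its value
  bool Resolving = false;            // the single extra bit (2 colors)
};

bool hasCycle(std::unordered_map<std::string, Symbol> &Syms,
              const std::string &Name) {
  auto It = Syms.find(Name);
  if (It == Syms.end())
    return false; // undefined symbols terminate the walk
  Symbol &S = It->second;
  if (S.Resolving)
    return true; // re-entered on the current path: cyclic dependency
  S.Resolving = true;
  bool Cyclic = false;
  for (const std::string &Op : S.Operands)
    Cyclic |= hasCycle(Syms, Op);
  S.Resolving = false; // cleared on exit, like the scope_exit in the patch
  return Cyclic;
}
```

With this shape, the b0 → b1 → b2 → b0 chain from the test above is reported as cyclic, while acyclic chains resolve normally.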
Here is the relevant change to evaluateAsRelocatableImpl. I also need a new bit in MCSymbol.
```diff
   // Evaluate recursively if this is a variable.
+  if (Sym.isResolving()) {
+    if (Asm && Asm->hasFinalLayout()) {
+      Asm->getContext().reportError(
+          Sym.getVariableValue()->getLoc(),
+          "cyclic dependency detected for symbol '" + Sym.getName() + "'");
+      Sym.IsUsed = false;
+      Sym.setVariableValue(MCConstantExpr::create(0, Asm->getContext()));
+    }
+    return false;
+  }
   if (Sym.isVariable() && (Kind == MCSymbolRefExpr::VK_None || Layout) &&
       canExpand(Sym, InSet)) {
+    Sym.setIsResolving(true);
+    auto _ = make_scope_exit([&] { Sym.setIsResolving(false); });
     bool IsMachO =
         Asm && Asm->getContext().getAsmInfo()->hasSubsectionsViaSymbols();
     if (Sym.getVariableValue()->evaluateAsRelocatableImpl(Res, Asm,
```
Unfortunately, I cannot remove MCExpr::isSymbolUsedInExpression, as it is still used by AMDGPU ([AMDGPU] Avoid resource propagation for recursion through multiple functions).
Revisiting the .weakref directive
The .weakref directive had an intricate impact on the expression resolving framework.
.weakref enables the creation of weak aliases without directly modifying the target symbol's binding. This allows a header file in library A to optionally depend on symbols from library B. When the target symbol is otherwise not referenced, the object file affected by the weakref directive will include an undefined weak symbol. However, when the target symbol is defined or referenced (by the user), it can retain STB_GLOBAL binding to support archive member extraction. GCC's [[gnu::weakref]] attribute, as used in runtime library headers like libgcc/gthr-posix.h, utilizes this feature.
I've noticed a few issues:
An unreferenced .weakref alias, target created an undefined target.
Crash when alias was already defined.
VK_WEAKREF was mis-reused by the alias directive of llvm-ml (the MASM replacement).
I addressed them with:
[MC] Ignore VK_WEAKREF in MCValue::getAccessVariant (2019-12). Wow, it's interesting to realize I'd actually delved into this a few years ago!
MC: Rework .weakref (2025-05)
Expression resolving and reassignments
= and its equivalents (.set, .equ) allow a symbol to be equated multiple times. This means that when a symbol is referenced, its current value is captured at that moment; subsequent reassignments do not alter prior references.
```
.data
.set x, 0
.long x  // reference the first instance
x = .-.data
.long x  // reference the second instance
.set x, .-.data
.long x  // reference the third instance
```
The assembly code evaluates to .long 0; .long 4; .long 8.
Historically, the LLVM integrated assembler restricted reassigning symbols whose value wasn't a parse-time integer constant (MCConstantExpr). This was a safeguard against potentially unsafe reassignments, as an old value might still be referenced.
The safeguard was implemented with multiple conditions, aided by a mysterious IsUsed variable.
```cpp
  // Diagnose assignment to a label.
  //
  // FIXME: Diagnostics. Note the location of the definition as a label.
  // FIXME: Diagnose assignment to protected identifier (e.g., register name).
  if (Value->isSymbolUsedInExpression(Sym))
    return Parser.Error(EqualLoc, "Recursive use of '" + Name + "'");
  else if (Sym->isUndefined(/*SetUsed*/ false) && !Sym->isUsed() &&
           !Sym->isVariable())
    ; // Allow redefinitions of undefined symbols only used in directives.
  else if (Sym->isVariable() && !Sym->isUsed() && allow_redef)
    ; // Allow redefinitions of variables that haven't yet been used.
  else if (!Sym->isUndefined() && (!Sym->isVariable() || !allow_redef))
    return Parser.Error(EqualLoc, "redefinition of '" + Name + "'");
  else if (!Sym->isVariable())
    return Parser.Error(EqualLoc, "invalid assignment to '" + Name + "'");
  else if (!isa<MCConstantExpr>(Sym->getVariableValue()))
    return Parser.Error(EqualLoc,
                        "invalid reassignment of non-absolute variable '" +
                            Name + "'");
```
Over the past few years, during our work on making the Linux kernel build with Clang, we worked around this restriction by modifying the assembly code itself:
ARM: 8971/1: replace the sole use of a symbol with its definition in 2020-04
crypto: aesni - add compatibility with IAS in 2020-07
powerpc/64/asm: Do not reassign labels in 2021-12
This prior behavior wasn't ideal. I've since enabled proper reassignment by implementing a system where the symbol is cloned upon redefinition and the symbol table is updated accordingly. Crucially, any existing references to the original symbol remain unchanged, and the original symbol is no longer included in the final emitted symbol table.
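A minimal sketch of this clone-on-redefinition idea, using hypothetical Symbol and SymbolTable types rather than LLVM's MC classes: reassignment installs a fresh Symbol under the same name, while earlier references keep pointing at the retired instance, whose value is effectively frozen.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical miniature of clone-on-redefinition: the table maps a name
// to its *current* Symbol instance. Reassigning installs a clone under
// the same name; references taken earlier still see the old value, and
// retired instances would not be emitted into the final symbol table.
struct Symbol {
  std::string Name;
  long Value;
};

struct SymbolTable {
  std::unordered_map<std::string, std::shared_ptr<Symbol>> Map;
  std::vector<std::shared_ptr<Symbol>> Retired; // old instances, kept alive

  std::shared_ptr<Symbol> ref(const std::string &Name) { return Map.at(Name); }

  void assign(const std::string &Name, long Value) {
    auto It = Map.find(Name);
    if (It == Map.end()) {
      Map[Name] = std::make_shared<Symbol>(Symbol{Name, Value});
      return;
    }
    // Redefinition: retire the old instance and install a clone.
    Retired.push_back(It->second);
    It->second = std::make_shared<Symbol>(Symbol{Name, Value});
  }
};
```

This mirrors the .long 0; .long 4; .long 8 behavior above: each reference captures whichever instance is current at that point.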
Before rolling out this improvement, I discovered problematic uses in the AMDGPU and ARM64EC backends that required specific fixes or workarounds. This is a common challenge when making general improvements to LLVM's MC layer: you often need to untangle and resolve individual backend-specific "hacks" before a more generic interface enhancement can be applied.
MCParser: Error when .set reassigns a non-redefinable variable
MC: Allow .set to reassign non-MCConstantExpr expressions
For the following assembly, newer Clang emits relocations referencing foo, foo, bar, foo, like GNU Assembler.
```
b = a
a = foo
call a
call b
a = bar
call a
call b
```
Relocation generation
For a deeper dive into the concepts of relocation generation, you might find my previous post, Relocation generation in assemblers, helpful.
Driven by the need to support new RISC-V vendor relocations (e.g., Xqci extensions from Qualcomm) and my preference against introducing an extra MCAsmBackend hook, I've significantly refactored LLVM's relocation generation framework. This effort generalized the existing RISC-V/LoongArch ADD/SUB relocation logic and enabled its customization for other targets like AVR and PowerPC.
MC: Generalize RISCV/LoongArch handleAddSubRelocations and AVR shouldForceRelocation
The linker relaxation framework sometimes generated redundant relocations that could have been resolved. This occurred in several scenarios, including:
```
.option norelax
j label     // For assembly input, RISCVAsmParser::ParseInstruction sets ForceRelocs (https://reviews.llvm.org/D46423).
            // For direct object emission, RISCVELFStreamer sets ForceRelocs (#77436)
.option relax
call foo    // linker-relaxable

.option norelax
j label     // redundant relocation due to ForceRelocs
.option relax

label:
```
And also with label differences within a section without linker-relaxable instructions:
```
call foo

.section .text1,"ax"
# No linker-relaxable instruction. Label differences should be resolved.
w1:
  nop
w2:

.data
# Redundant R_RISCV_SET32 and R_RISCV_SUB32
.long w2-w1
```
These issues have now been resolved through a series of patches, significantly revamping the target-neutral relocation generation framework. Key contributions include:
[MC] Refactor fixup evaluation and relocation generation
MC: Remove redundant relocations for label differences
I've also streamlined relocation generation within the SPARC backend. Given its minimal number of relocations, the SPARC implementation could serve as a valuable reference for downstream targets seeking to customize their own relocation handling.
Simplification to assembly and machine code emission
For a dive into the core classes involved in LLVM's assembly and machine code emission, you might read my Notes on LLVM assembly and machine code emission.
The MCAssembler class orchestrates the emission process, managing MCAsmBackend, MCCodeEmitter, and MCObjectWriter. In turn, MCObjectWriter oversees MCObjectTargetWriter.
Historically, many member functions within the subclasses of MCAsmBackend, MCObjectWriter, and MCObjectTargetWriter accepted a MCAssembler * argument. This was often redundant, as it was typically only used to access the MCContext instance. To streamline this, I've added a MCAssembler * member variable directly to MCAsmBackend, MCObjectWriter, and MCObjectTargetWriter, along with convenient helper functions like getContext. This change cleans up the interfaces and improves code clarity.
MCAsmBackend: Add member variable MCAssembler * and define getContext
ELFObjectWriter: Remove the MCContext argument from getRelocType
MachObjectWriter: Remove the MCAssembler argument from getSymbolAddress
WinCOFFObjectWriter: Simplify code with member MCAssembler *
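The shape of this refactor can be sketched with stand-in types (the MCContext, MCAssembler, and MCAsmBackend below are hypothetical miniatures, not LLVM's real classes): the backend stores the assembler pointer once, and hooks reach the context through getContext() instead of each taking an MCAssembler * parameter.

```cpp
#include <cassert>

// Miniature of the pattern: a member set once at setup replaces a
// redundant parameter threaded through every hook.
struct MCContext { int Errors = 0; };
struct MCAssembler { MCContext Ctx; };

struct MCAsmBackend {
  MCAssembler *Asm = nullptr;                 // set once by the streamer
  void setAssembler(MCAssembler *A) { Asm = A; }
  MCContext &getContext() { return Asm->Ctx; } // convenient helper

  // Hooks no longer need an MCAssembler * argument just to reach MCContext.
  void reportBadFixup() { ++getContext().Errors; }
};
```

The design choice is simple: since nearly every hook only wanted MCContext, a single member plus a getContext() helper shrinks every hook signature at once.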
Previously, the ARM, Hexagon, and RISC-V backends had unique requirements that led to extra arguments being passed to MCAsmBackend hooks. These arguments were often unneeded by other targets. I've since refactored these interfaces, replacing those specialized arguments with more generalized and cleaner approaches.
ELFObjectWriter: Move Thumb-specific condition to ARMELFObjectWriter
MCAsmBackend: Remove MCSubtargetInfo argument
MCAsmBackend,X86: Pass MCValue to fixupNeedsRelaxationAdvanced. NFC
MCAsmBackend,Hexagon: Remove MCRelaxableFragment from fixupNeedsRelaxationAdvanced
MCAsmBackend: Simplify applyFixup
Future plan
The assembler's ARM port has a limitation: only relocations with implicit addends (REL) are handled. For CREL, we aim to use explicit addends across all targets to simplify linker/tooling implementations, but this is incompatible with ARMAsmBackend's current design. See the ARM CREL assembler issue https://github.com/llvm/llvm-project/issues/141678.
To address this issue, we should:
In MCAssembler::evaluateFixup, generalize MCFixupKindInfo::FKF_IsAlignedDownTo32Bits (an ARM hack, also used by other backends) to support more fixups, including ARM::fixup_arm_uncondbl (R_ARM_CALL). Create a new hook in MCAsmBackend.
In ARMAsmBackend, move the Value -= 8 code from adjustFixupValue to the new hook.
```cpp
unsigned ARMAsmBackend::adjustFixupValue(const MCAssembler &Asm,
  ...
  case ARM::fixup_arm_condbranch:
  case ARM::fixup_arm_uncondbranch:
  case ARM::fixup_arm_uncondbl:
  case ARM::fixup_arm_condbl:
  case ARM::fixup_arm_blx:
    // Check that the relocation value is legal.
    Value -= 8;
    if (!isInt<26>(Value)) {
      Ctx.reportError(Fixup.getLoc(), "Relocation out of range");
      return 0;
```
Enabling RELA/CREL support requires significant effort and exceeds my expertise (and willingness) for AArch32. However, I do want to add a new MCAsmBackend hook to minimize AArch32's invasive modifications to the generic relocation generation framework.
For reference, the arm-vxworks port in binutils introduced RELA support in 2006.
In my previous post, Relocation Generation in Assemblers, I explored some key concepts behind LLVM's integrated assemblers. This post dives into recent improvements I've made to refine that system.
The LLVM integrated assembler handles fixups and relocatable expressions as distinct entities. Relocatable expressions, in particular, are encoded using the MCValue class, which originally looked like this:
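Reconstructed from the field descriptions that follow, the original layout was roughly (a sketch for illustration, not LLVM's exact declaration):

```cpp
#include <cassert>
#include <cstdint>

class MCSymbolRefExpr; // the symbol refs could also carry a specifier

// Sketch of the original MCValue: two symbol references, a constant,
// and an optional target-specific relocation specifier.
struct MCValue {
  const MCSymbolRefExpr *SymA = nullptr; // optional addend symbol
  const MCSymbolRefExpr *SymB = nullptr; // optional subtrahend symbol
  int64_t Cst = 0;                       // constant term
  uint32_t RefKind = 0;                  // optional relocation specifier
};
```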
RefKind acts as an optional relocation specifier, though only a handful of targets actually use it.
SymA represents an optional symbol reference (the addend).
SymB represents another optional symbol reference (the subtrahend).
Cst holds a constant value.
While functional, this design had its flaws. For one, the way relocation specifiers were encoded varied across architectures:
Targets like COFF, Mach-O, and ELF's PowerPC, SystemZ, and X86 embed the relocation specifier within MCSymbolRefExpr *SymA as part of SubclassData.
Conversely, ELF targets such as AArch64, MIPS, and RISC-V store it as a target-specific subclass of MCTargetExpr, and convert it to MCValue::RefKind during MCValue::evaluateAsRelocatable.
Another issue was with SymB. Despite being typed as const MCSymbolRefExpr *, its MCSymbolRefExpr::VariantKind field went unused. This is because expressions like add - sub@got are not relocatable.
Over the weekend, I tackled these inconsistencies and reworked the representation into something cleaner:
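Reconstructed from the accessor and type changes described here, the reworked layout looks roughly like this (illustrative, not the exact declaration): SymA and SymB become plain MCSymbol pointers, and the specifier lives directly in MCValue.

```cpp
#include <cassert>
#include <cstdint>

class MCSymbol; // plain symbols; the specifier no longer lives in the ref

// Sketch of the reworked MCValue: the ambiguous RefKind becomes an
// explicit Specifier, and the symbol fields drop MCSymbolRefExpr.
struct MCValue {
  const MCSymbol *SymA = nullptr; // addend symbol
  const MCSymbol *SymB = nullptr; // subtrahend symbol
  int64_t Cst = 0;                // constant term
  uint32_t Specifier = 0;         // relocation specifier, now first-class
};
```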
This updated design not only aligns more closely with the concept of relocatable expressions but also shaves off some compile time in LLVM. The ambiguous RefKind has been renamed to Specifier for clarity. Additionally, targets that previously encoded the relocation specifier within MCSymbolRefExpr (rather than using MCTargetExpr) can now access it directly via MCValue::Specifier.
To support this change, I made a few adjustments:
Introduced getAddSym and getSubSym methods, returning const MCSymbol *, as replacements for getSymA and getSymB.
Eliminated dependencies on the old accessors, MCValue::getSymA and MCValue::getSymB.
Reworked the expression folding code that handles + and -.
Some targets relied on PC-relative fixups with explicit specifiers forcing relocations. I have defined MCAsmBackend::shouldForceRelocation for SystemZ and cleaned up ARM and PowerPC.
Changed the type of SymA and SymB to const MCSymbol *.
Mach-O assembler support in LLVM has accumulated significant technical debt, impacting both target-specific and generic code. One particularly nagging issue was the const SectionAddrMap *Addrs parameter in MCExpr::evaluateAs* functions. This parameter existed to handle cross-section label differences, primarily for generating (compact) unwind information in Mach-O, where an expression computes the difference between labels defined in different sections.
The SectionAddrMap *Addrs parameter always felt like a clunky workaround to me. It wasn't until I dug into the Mach-O AArch64 object writer that I realized this hack wasn't necessary for that writer. This discovery prompted a cleanup effort to remove the dependency on SectionAddrMap for ARM and X86 and eliminate the parameter:
[MC,MachO] Replace SectionAddrMap workaround with cleaner variable handling
MCExpr: Remove unused SectionAddrMap workaround
While I was at it, I also tidied up MCSymbolRefExpr by removing the clunky HasSubsectionsViaSymbolsBit, further simplifying the codebase.
Streamlining InstPrinter
The MCExpr code also determines how expression operands in assembly instructions are printed. I have made improvements in this area as well:
[MC] Don't print () around $ names
[MC] Simplify MCBinaryExpr/MCUnaryExpr printing by reducing parentheses