Long branches in compilers, assemblers, and linkers
Branch instructions on most architectures use PC-relative addressing with a limited range. When the target is too far away, the branch becomes "out of range" and requires special handling.
Consider a large binary where main() at address 0x10000 calls foo() at address 0x8010000, exactly 128MiB away. On AArch64, the bl instruction can only reach ±128MiB (the largest forward offset is 128MiB minus 4 bytes), so this call cannot be encoded directly. Without proper handling, the linker would fail with an error like "relocation out of range." The toolchain must handle this transparently to produce correct executables.
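The arithmetic behind this example can be checked with a short sketch. The helper name fits_pcrel and its parameters are hypothetical, chosen for illustration; the sketch models a signed immediate of imm_bits bits counted in instruction-sized units, which is how bl encodes its 26-bit word offset.

```python
def fits_pcrel(pc: int, target: int, imm_bits: int, scale: int = 4) -> bool:
    """Return True if target is reachable from pc with a signed
    imm_bits-bit immediate counted in units of `scale` bytes."""
    offset = target - pc
    if offset % scale != 0:
        return False  # misaligned targets cannot be encoded
    lo = -(1 << (imm_bits - 1)) * scale
    hi = ((1 << (imm_bits - 1)) - 1) * scale
    return lo <= offset <= hi


# AArch64 bl: 26-bit immediate scaled by 4, i.e. [-128MiB, +128MiB - 4].
print(fits_pcrel(0x10000, 0x8010000, imm_bits=26))      # False
print(fits_pcrel(0x10000, 0x8010000 - 4, imm_bits=26))  # True
```

The call from 0x10000 to 0x8010000 misses the encodable range by a single instruction slot, which is enough for the linker to need a thunk.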
This article explores how compilers, assemblers, and linkers work together to solve the long branch problem.
- Compiler (IR to assembly): Handles branches within a function that exceed the range of conditional branch instructions
- Assembler (assembly to relocatable file): Handles branches within a section where the distance is known at assembly time
- Linker: Handles cross-section and cross-object branches discovered during final layout
Branch range limitations
Different architectures have different branch range limitations. Here's a quick comparison of conditional branch, unconditional branch, and call ranges:
| Architecture | Cond | Uncond | Call | Notes |
|---|---|---|---|---|
| AArch64 | ±1MiB | ±128MiB | ±128MiB | Thunks |
| AArch32 (A32) | ±32MiB | ±32MiB | ±32MiB | Thunks, interworking |
| AArch32 (T32) | ±1MiB | ±16MiB | ±16MiB | Thunks, interworking |
| LoongArch | ±128KiB | ±128MiB | ±128MiB | Linker relaxation |
| M68k (68020+) | ±2GiB | ±2GiB | ±2GiB | Assembler picks size |
| MIPS (pre-R6) | ±128KiB | ±128KiB (b offset) | ±128KiB (bal offset) | In -fno-pic code, pseudo-absolute j/jal can be used for a 256MiB region. |
| MIPS R6 | ±128KiB | ±128MiB | ±128MiB | |
| PowerPC64 | ±32KiB | ±32MiB | ±32MiB | Thunks |
| RISC-V | ±4KiB | ±1MiB | ±1MiB | Linker relaxation |
| SPARC | ±1MiB | ±8MiB | ±2GiB | No thunks needed |
| SuperH | ±256B | ±4KiB | ±4KiB | |
| x86-64 | ±2GiB | ±2GiB | ±2GiB | Large code model changes call sequence |
| Xtensa | ±2KiB | ±128KiB | ±512KiB | Linker relaxation |
| z/Architecture | ±64KiB | ±4GiB | ±4GiB | No thunks needed |
The following subsections provide detailed per-architecture information, including relocation types relevant for linker implementation.
AArch32
In A32 state:
- Branch (b/b<cond>), conditional branch and link (bl<cond>) (R_ARM_JUMP24): ±32MiB
- Unconditional branch and link (bl/blx, R_ARM_CALL): ±32MiB

Note: R_ARM_CALL is for unconditional bl/blx which can be relaxed to BLX inline; R_ARM_JUMP24 is for branches which require a veneer for interworking.
In T32 state (Thumb state pre-ARMv8):
- Conditional branch (b<cond>, R_ARM_THM_JUMP8): ±256 bytes
- Short unconditional branch (b, R_ARM_THM_JUMP11): ±2KiB
- ARMv5T branch and link (bl/blx, R_ARM_THM_CALL): ±4MiB
- ARMv6T2 wide conditional branch (b<cond>.w, R_ARM_THM_JUMP19): ±1MiB
- ARMv6T2 wide branch (b.w, R_ARM_THM_JUMP24): ±16MiB
- ARMv6T2 wide branch and link (bl/blx, R_ARM_THM_CALL): ±16MiB. R_ARM_THM_CALL can be relaxed to BLX.
AArch64
- Test bit and branch (tbz/tbnz, R_AARCH64_TSTBR14): ±32KiB
- Compare and branch (cbz/cbnz, R_AARCH64_CONDBR19): ±1MiB
- Conditional branches (b.<cond>, R_AARCH64_CONDBR19): ±1MiB
- Unconditional branches (b/bl, R_AARCH64_JUMP26/R_AARCH64_CALL26): ±128MiB

The compiler's BranchRelaxation pass handles out-of-range conditional branches by inverting the condition and inserting an unconditional branch. The AArch64 assembler does not perform branch relaxation; out-of-range branches produce linker errors if not handled by the compiler.
LoongArch
- Conditional branches (beq/bne/blt/bge/bltu/bgeu, R_LARCH_B16): ±128KiB (18-bit signed)
- Compare-to-zero branches (beqz/bnez, R_LARCH_B21): ±4MiB (23-bit signed)
- Unconditional branch/call (b/bl, R_LARCH_B26): ±128MiB (28-bit signed)
- Medium range call (pcaddu12i+jirl, R_LARCH_CALL30): ±2GiB
- Long range call (pcaddu18i+jirl, R_LARCH_CALL36): ±128GiB
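The ranges quoted for the three single-instruction branch forms follow directly from the width of the signed byte offset. As a small sketch (the function name reach is hypothetical), half of the offset space lies on each side of the branch:

```python
def reach(offset_bits: int) -> str:
    """Format the one-sided reach of a signed byte offset of offset_bits bits."""
    half = 1 << (offset_bits - 1)
    for unit, size in (("GiB", 1 << 30), ("MiB", 1 << 20), ("KiB", 1 << 10)):
        if half >= size:
            return f"{half // size}{unit}"
    return f"{half} bytes"


# Signed byte-offset widths as listed above for the LoongArch relocations.
for name, bits in (("B16", 18), ("B21", 23), ("B26", 28)):
    print(f"R_LARCH_{name}: ±{reach(bits)}")
```

The same calculation applies to the other architectures: an N-bit signed byte offset reaches roughly 2^(N-1) bytes in each direction.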
M68k
- Short branch (Bcc.B/BRA.B/BSR.B): ±128 bytes (8-bit displacement)
- Word branch (Bcc.W/BRA.W/BSR.W): ±32KiB (16-bit displacement)
- Long branch (Bcc.L/BRA.L/BSR.L, 68020+): ±2GiB (32-bit displacement)

GNU Assembler provides pseudo opcodes (jbsr, jra, jXX) that "automatically expand to the shortest instruction capable of reaching the target". For example, jeq .L0 emits one of beq.b, beq.w, and beq.l depending on the displacement.
With the long forms available on 68020 and later, M68k doesn't need linker range extension thunks.
MIPS
- Conditional branches (beq/bne/bgez/bltz/etc, R_MIPS_PC16): ±128KiB
- PC-relative jump (b offset (bgez $zero, offset)): ±128KiB
- PC-relative call (bal offset (bgezal $zero, offset)): ±128KiB
- Pseudo-absolute jump/call (j/jal, R_MIPS_26): branch within the current 256MiB region, only suitable for -fno-pic code. Deprecated in R6 in favor of bc/balc
16-bit instructions removed in Release 6:
- Conditional branch (beqz16, R_MICROMIPS_PC7_S1): ±128 bytes
- Unconditional branch (b16, R_MICROMIPS_PC10_S1): ±1KiB
MIPS Release 6:
- Unconditional branch, compact (bc16, unclear toolchain implementation): ±1KiB
- Compare and branch, compact (beqc/bnec/bltc/bgec/etc, R_MIPS_PC16): ±128KiB
- Compare register to zero and branch, compact (beqzc/bnezc/etc, R_MIPS_PC21_S2): ±4MiB
- Branch (and link), compact (bc/balc, R_MIPS_PC26_S2): ±128MiB

LLVM's MipsBranchExpansion pass handles out-of-range branches.
lld implements LA25 thunks for MIPS PIC/non-PIC interoperability, but not range extension thunks.
PowerPC
- Conditional branch (bc/bcl, R_PPC64_REL14): ±32KiB
- Unconditional branch (b/bl, R_PPC64_REL24/R_PPC64_REL24_NOTOC): ±32MiB

GCC-generated code relies on linker thunks. However, the legacy -mlongcall can be used to generate long code sequences.
RISC-V
- Compressed c.beqz: ±256 bytes
- Compressed c.jal: ±2KiB
- jalr (I-type immediate): ±2KiB
- Conditional branches (beq/bne/blt/bge/bltu/bgeu, B-type immediate): ±4KiB
- jal (J-type immediate, PseudoBR): ±1MiB (notably smaller than other RISC architectures: AArch64 ±128MiB, PowerPC64 ±32MiB, LoongArch ±128MiB)
- PseudoJump (using auipc+jalr): ±2GiB
- beqi/bnei (Zibi extension, 5-bit compare immediate (1 to 31 and -1)): ±4KiB
Qualcomm uC Branch Immediate extension (Xqcibi):
- qc.beqi/qc.bnei/qc.blti/qc.bgei/qc.bltui/qc.bgeui (32-bit, 5-bit compare immediate): ±4KiB
- qc.e.beqi/qc.e.bnei/qc.e.blti/qc.e.bgei/qc.e.bltui/qc.e.bgeui (48-bit, 16-bit compare immediate): ±4KiB
Qualcomm uC Long Branch extension (Xqcilb):
- qc.e.j/qc.e.jal (48-bit, R_RISCV_VENDOR(QUALCOMM)+R_RISCV_QC_E_CALL_PLT): ±2GiB
For function calls:
- The Go compiler emits a single jal for calls and relies on its linker to generate trampolines when the target is out of range.
- In contrast, GCC and Clang emit auipc+jalr and rely on linker relaxation to shrink the sequence when possible.

The jal range (±1MiB) is notably smaller than other RISC architectures (AArch64 ±128MiB, PowerPC64 ±32MiB, LoongArch ±128MiB). This limits the effectiveness of linker relaxation ("start large and shrink"), and leads to frequent trampolines when the compiler optimistically emits jal ("start small and grow").
SPARC
- Compare and branch (cxbe, R_SPARC_5): ±64 bytes
- Conditional branch (bcc, R_SPARC_WDISP19): ±1MiB
- Unconditional branch (b, R_SPARC_WDISP22): ±8MiB
- call (R_SPARC_WDISP30/R_SPARC_WPLT30): ±2GiB

With ±2GiB range for call, SPARC doesn't need range extension thunks in practice.
SuperH
- Conditional branch (bf/bt): ±256 bytes
- Unconditional branch (bra): ±4KiB
- Branch to subroutine (bsr): ±4KiB

The very short range for conditional branches (±256 bytes) requires the compiler to invert the condition and generate register-indirect braf/bsrf for longer distances. SuperH is not supported by LLVM.
Xtensa
- Narrow conditional branch (beqz.n/bnez.n): -28 to +35 bytes (6-bit signed + 4)
- Conditional branch (compare two registers) (beq/bne/blt/bge/etc): ±256 bytes
- Conditional branch (compare with zero) (beqz/bnez/bltz/bgez): ±2KiB
- Unconditional jump (j): ±128KiB
- Call (call0/call4/call8/call12): ±512KiB

The assembler performs branch relaxation: when a conditional branch target is too far, it inverts the condition and inserts a j instruction.
The assembler can also convert a direct call into an indirect sequence (l32r+callx8) when the target distance is unknown. GNU ld then performs linker relaxation.
x86-64
- Short conditional jump (Jcc rel8): -128 to +127 bytes
- Short unconditional jump (JMP rel8): -128 to +127 bytes
- Near conditional jump (Jcc rel32): ±2GiB
- Near unconditional jump (JMP rel32): ±2GiB

With a ±2GiB range for near jumps, x86-64 rarely encounters out-of-range branches in practice. That said, Google and Meta Platforms deploy mostly statically linked executables on x86-64 production servers and have run into the huge executable problem for certain configurations.
z/Architecture
- Short conditional branch (BRC, R_390_PC16DBL): ±64KiB (16-bit halfword displacement)
- Long conditional branch (BRCL, R_390_PC32DBL): ±4GiB (32-bit halfword displacement)
- Short call (BRAS, R_390_PC16DBL): ±64KiB
- Long call (BRASL, R_390_PC32DBL): ±4GiB

With ±4GiB range for long forms, z/Architecture doesn't need linker range extension thunks. LLVM's SystemZLongBranch pass relaxes short branches (BRC/BRAS) to long forms (BRCL/BRASL) when targets are out of range.
Compiler: branch range handling
Conditional branch instructions usually have shorter ranges than unconditional ones, making them less suitable for linker thunks (as we will explore later). Compilers typically keep conditional branch targets within the same section, allowing the compiler to handle out-of-range cases via branch relaxation.
Within a function, conditional branches may still go out of range. The compiler measures branch distances and relaxes out-of-range branches by inverting the condition and inserting an unconditional branch:
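
    # Before relaxation (out of range)
    beq a0, a1, .Lfar_target   # ±4KiB range on RISC-V

    # After relaxation
    bne a0, a1, .Lskip         # Inverted condition, short range
    j .Lfar_target             # Unconditional jump, ±1MiB range
    .Lskip:
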
Some architectures have conditional branch instructions that encode an additional immediate, which leaves fewer bits for the displacement and gives an even shorter range. For example, AArch64's tbz/tbnz (test bit and branch) encode a bit number and have only ±32KiB range, compared with ±1MiB for cbz/cbnz (compare and branch if zero/non-zero). RISC-V Zibi beqi/bnei have ±4KiB range. The compiler handles these in a similar way:
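
    // Before relaxation (tbz has ±32KiB range)
    tbz w0, #0, far_target

    // After relaxation
    tbnz w0, #0, .Lskip    // Inverted condition, short range
    b far_target           // Unconditional branch, ±128MiB range
    .Lskip:
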
An Intel employee contributed
In LLVM, this is handled by the BranchRelaxation pass, which runs just before AsmPrinter. Different backends have their own implementations:
- BranchRelaxation: AArch64, AMDGPU, AVR, RISC-V
- HexagonBranchRelaxation: Hexagon
- PPCBranchSelector: PowerPC
- SystemZLongBranch: SystemZ
- MipsBranchExpansion: MIPS
- MSP430BSel: MSP430

The generic BranchRelaxation pass computes block sizes and offsets, then iterates until all branches are in range. For conditional branches, it tries to invert the condition and insert an unconditional branch. For unconditional branches that are still out of range, it calls TargetInstrInfo::insertIndirectBranch to emit an indirect jump sequence (e.g., adrp+add+br on AArch64) or a long jump sequence (e.g., pseudo jump on RISC-V).
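The "compute offsets, relax, repeat" loop can be illustrated with a toy model. This is a sketch under simplifying assumptions, not LLVM's implementation: the function relax_branches and its data layout are hypothetical, every conditional branch is assumed to be the last instruction of its block, and relaxing a branch simply grows the block by one instruction.

```python
def relax_branches(block_sizes, branches, cond_range, insn_size=4):
    """block_sizes: byte size of each basic block, in layout order.
    branches: (source_block, target_block) pairs.
    Returns (indices of relaxed branches, final block sizes)."""
    sizes = list(block_sizes)
    relaxed = set()
    changed = True
    while changed:
        changed = False
        # Recompute block start offsets from the current sizes.
        offsets = [0]
        for size in sizes:
            offsets.append(offsets[-1] + size)
        for i, (src, dst) in enumerate(branches):
            if i in relaxed:
                continue
            branch_addr = offsets[src + 1] - insn_size
            distance = offsets[dst] - branch_addr
            if not -cond_range <= distance < cond_range:
                # Invert the condition and append an unconditional branch:
                # the source block grows by one instruction.
                sizes[src] += insn_size
                relaxed.add(i)
                changed = True
    return relaxed, sizes


# Branch 0 is barely in range until relaxing branch 1 pushes its target away.
sizes = [8, 8, 4080, 8, 5000, 8]
branches = [(0, 3), (1, 5)]
print(relax_branches(sizes, branches, cond_range=4096))
```

The example shows why iteration is needed: relaxing the second branch grows a block that sits between the first branch and its target, which in turn pushes the first branch out of its ±4KiB range on the next pass.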
Unconditional branches and calls can target different sections since they have larger ranges. If the target is out of reach, the linker can insert thunks to extend the range.
For x86-64, the large code model uses multiple instructions for calls and jumps to support text sections larger than 2GiB (see Relocation overflow and code models).
Assembler: instruction relaxation
The assembler converts assembly to machine code. When the target of a branch is within the same section and the distance is known at assembly time, the assembler can select the appropriate encoding. This is distinct from linker thunks, which handle cross-section or cross-object references where distances aren't known until link time.
Assembler instruction relaxation handles two cases:
- Span-dependent instructions: Select an appropriate encoding based on displacement.
  - On x86, a short jump (jmp rel8) can be relaxed to a near jump (jmp rel32) when the target is far.
  - On RISC-V, beqz may be assembled to the 2-byte c.beqz when the displacement fits within ±256 bytes.
- Conditional branch transform: Invert the condition and insert an unconditional branch. On RISC-V, a blt might be relaxed to bge plus an unconditional branch.
The assembler uses an iterative layout algorithm that alternates between fragment offset assignment and relaxation until all fragments become legalized.
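A minimal sketch of such an iterative layout, using the x86 jmp rel8 (2 bytes) versus jmp rel32 (5 bytes) choice as the span-dependent instruction. The item format and the function name layout are hypothetical and far simpler than a real assembler's fragment list; every jump starts short and only ever grows.

```python
SHORT, NEAR = 2, 5  # encoded sizes of jmp rel8 and jmp rel32 in bytes


def layout(items):
    """items: ("bytes", n), ("label", name) or ("jmp", name) tuples.
    Returns the chosen size of each jmp, in program order."""
    jmp_size = {i: SHORT for i, it in enumerate(items) if it[0] == "jmp"}
    while True:
        # Assign offsets using the current size of every jump.
        labels, ends, pos = {}, {}, 0
        for i, (kind, arg) in enumerate(items):
            if kind == "label":
                labels[arg] = pos
            elif kind == "bytes":
                pos += arg
            else:
                pos += jmp_size[i]
                ends[i] = pos  # displacements are relative to the next insn
        # Grow any short jump whose displacement does not fit in rel8.
        grew = False
        for i, end in ends.items():
            disp = labels[items[i][1]] - end
            if jmp_size[i] == SHORT and not -128 <= disp <= 127:
                jmp_size[i] = NEAR
                grew = True
        if not grew:
            return [jmp_size[i] for i in sorted(jmp_size)]


program = [("jmp", "a"), ("jmp", "b"), ("bytes", 125), ("label", "a"),
           ("bytes", 200), ("label", "b")]
print(layout(program))  # [5, 5]
```

In this example the first jump fits in rel8 on the first pass (displacement 127), but growing the second jump by 3 bytes pushes label a out of reach, so a second relaxation round is required.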
Linker: range extension thunks
When the linker resolves relocations, it may discover that a branch target is out of range. At this point, the instruction encoding is fixed, so the linker cannot simply change the instruction. Instead, it generates range extension thunks (also called veneers, branch stubs, or trampolines).
A thunk is a small piece of linker-generated code that can reach the actual target using a longer sequence of instructions. The original branch is redirected to the thunk, which then jumps to the real destination.
Range extension thunks are one type of linker-generated thunk. Other types include:
- ARM interworking veneers: Switch between ARM and Thumb instruction sets (see Linker notes on AArch32)
- MIPS LA25 thunks: Enable PIC and non-PIC code interoperability (see Toolchain notes on MIPS)
- PowerPC64 TOC/NOTOC thunks: Handle calls between functions using different TOC pointer conventions (see Linker notes on Power ISA)
Short range vs long range thunks
A short range thunk is a single branch instruction (4 bytes), so it can only reach targets within the branch range of the thunk itself.
Long range thunks use indirection and can jump to (practically) arbitrary locations.
Thunk examples
AArch32 (PIC) (see Linker notes on AArch32):

    __ARMV7PILongThunk_dst:
      movw ip, :lower16:(dst - .) ; ip = intra-procedure-call scratch register
      movt ip, :upper16:(dst - .)
      add ip, ip, pc
      bx ip

PowerPC64 ELFv2 (see Linker notes on Power ISA):

    __long_branch_dst:
      addis 12, 2, .branch_lt@ha # Load high bits from branch lookup table
      ld 12, .branch_lt@l(12) # Load target address
      mtctr 12 # Move to count register
      bctr # Branch to count register

Thunk impact on debugging and profiling
Thunks are transparent at the source level but visible in low-level tools:
- Stack traces: May show thunk symbols (e.g., __AArch64ADRPThunk_foo) between caller and callee
- Profilers: Samples may attribute time to thunk code; some profilers aggregate thunk time with the target function
- Disassembly: objdump or llvm-objdump will show thunk sections interspersed with regular code
- Code size: Each thunk adds bytes; large binaries may have thousands of thunks
lld/ELF's thunk creation algorithm
lld/ELF uses a multi-pass algorithm in finalizeAddressDependentContent:

    assignAddresses();

Key details:
- Multi-pass: Iterates until convergence (max 30 passes). Adding thunks changes addresses, potentially putting previously-in-range calls out of range.
- Pre-allocated ThunkSections: On pass 0, createInitialThunkSections places empty ThunkSections at regular intervals (thunkSectionSpacing). For AArch64: 128 MiB - 0x30000 ≈ 127.8 MiB.
- Thunk reuse: getThunk returns existing thunk if one exists for the same target; normalizeExistingThunk checks if a previously-created thunk is still in range.
- ThunkSection placement: getISDThunkSec finds a ThunkSection within branch range of the call site, or creates one adjacent to the calling InputSection.
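The reuse and placement rules can be sketched in a toy model. This is not lld's code: the names in_range and place_thunks are hypothetical, addresses are plain integers, and the sketch deliberately ignores that inserting a thunk shifts later addresses (the reason the real algorithm needs multiple passes). It also raises an error where a real linker would create a new ThunkSection next to the caller.

```python
def in_range(src, dst, branch_range):
    return -branch_range <= dst - src < branch_range


def place_thunks(calls, targets, thunk_sections, branch_range):
    """calls: (call_site_addr, target_name) pairs.
    targets: target_name -> address.
    thunk_sections: addresses of pre-allocated thunk sections.
    Returns (per-call destination, target_name -> sections holding a thunk)."""
    thunks = {}
    result = []
    for site, name in calls:
        if in_range(site, targets[name], branch_range):
            result.append(("direct", targets[name]))
            continue
        # Reuse an existing thunk for this target if the call can reach it.
        reachable = [s for s in thunks.get(name, [])
                     if in_range(site, s, branch_range)]
        if reachable:
            sec = reachable[0]
        else:
            # Otherwise create one in a reachable pre-allocated section.
            candidates = [s for s in thunk_sections
                          if in_range(site, s, branch_range)]
            if not candidates:
                raise ValueError(f"no thunk section reachable from {site:#x}")
            sec = candidates[0]
            thunks.setdefault(name, []).append(sec)
        result.append(("thunk", sec))
    return result, thunks


calls = [(0, "foo"), (40, "foo"), (300, "foo"), (10, "bar")]
targets = {"foo": 1000, "bar": 50}
print(place_thunks(calls, targets, thunk_sections=[100, 400], branch_range=128))
```

With a toy branch range of 128 bytes, the first two calls to foo share one thunk, the third call is too far from it and gets a second thunk in the next section, and the call to bar needs no thunk at all.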
lld/MachO's thunk creation algorithm
lld/MachO uses a single-pass algorithm in TextOutputSection::finalize:

    for (callIdx = 0; callIdx < inputs.size(); ++callIdx) {

Key differences from lld/ELF:
- Single pass: Addresses are assigned monotonically and never revisited
- Slop reservation: Reserves slopScale * thunkSize bytes (default: 256 × 12 = 3072 bytes on ARM64) to leave room for future thunks
- Thunk naming: <function>.thunk.<sequence> where sequence increments per target

Thunk starvation problem: If many consecutive branches need thunks, each thunk (12 bytes) consumes slop faster than call sites (4 bytes apart) advance. The test lld/test/MachO/arm64-thunk-starvation.s demonstrates this edge case. Mitigation is increasing --slop-scale, but pathological cases with hundreds of consecutive out-of-range callees can still fail.
mold's thunk creation algorithm
mold uses a two-pass approach:
- Pessimistically over-allocate thunks. Out-of-section relocations and relocations referencing a section that has not been assigned an address yet are assumed to need thunks (requires_thunk(ctx, isec, rel, first_pass) when first_pass=true).
- Then remove unnecessary ones.
Linker pass ordering:
- compute_section_sizes() calls create_range_extension_thunks(); final section addresses are NOT yet known
- set_osec_offsets() assigns section addresses
- remove_redundant_thunks() is called AFTER addresses are known; it removes thunks that were created for out-of-section relocations but turned out to be unneeded
- Rerun set_osec_offsets()

Pass 1 (create_range_extension_thunks): Process sections in batches using a sliding window. The window tracks four positions:

    Sections: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] ...

- [B, C) = current batch of sections to process (size ≤ branch_distance/5)
- A = earliest section still reachable from C (for thunk expiration)
- D = where to place the thunk (furthest point reachable from B)

    // Simplified from OutputSection<E>::create_range_extension_thunks

Pass 2 (remove_redundant_thunks): After final addresses are known, remove thunk entries for symbols actually in range.
Key characteristics:
- Pessimistic over-allocation: Assumes all out-of-section calls need thunks; safe to shrink later
- Batch size: branch_distance/5 (25.6 MiB for AArch64, 3.2 MiB for AArch32)
- Parallelism: Uses TBB for parallel relocation scanning within each batch
- Single branch range: Uses one conservative branch_distance per architecture. For AArch32, uses ±16 MiB (Thumb limit) for all branches, whereas lld/ELF uses ±32 MiB for A32 branches.
- Thunk size not accounted in D-advancement: The actual thunk group size is unknown when advancing D, so the end of a large thunk group may be unreachable from the beginning of the batch.
- No convergence loop: Single forward pass for address assignment, no risk of non-convergence
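The "over-allocate, then prune" idea can be sketched as two small functions. This is a toy model rather than mold's code: the names pessimistic_thunks and prune_thunks are hypothetical, thunks are tracked per symbol instead of per thunk group, and batching and the sliding window are left out.

```python
def pessimistic_thunks(relocs, symbols):
    """Pass 1: section addresses are unknown, so every call that leaves
    its own section is assumed to need a thunk.
    relocs: (caller_section, caller_offset, symbol) triples.
    symbols: symbol -> (section, offset)."""
    return {sym for sec, _, sym in relocs if symbols[sym][0] != sec}


def prune_thunks(thunks, relocs, symbols, section_addr, branch_range):
    """Pass 2: with final addresses known, keep a thunk only if some
    call to that symbol is really out of range."""
    needed = set()
    for sec, off, sym in relocs:
        if sym not in thunks:
            continue
        tsec, toff = symbols[sym]
        distance = (section_addr[tsec] + toff) - (section_addr[sec] + off)
        if not -branch_range <= distance < branch_range:
            needed.add(sym)
    return needed


symbols = {"local": (".text.a", 64), "near": (".text.b", 0), "far": (".text.c", 0)}
relocs = [(".text.a", 0, "local"), (".text.a", 4, "near"), (".text.a", 8, "far")]
first = pessimistic_thunks(relocs, symbols)
addrs = {".text.a": 0, ".text.b": 0x1000, ".text.c": 0x10000000}
print(sorted(first), sorted(prune_thunks(first, relocs, symbols, addrs, 1 << 27)))
```

The first pass marks both near and far because they live in other sections; once addresses are assigned, only far (256 MiB away, beyond the ±128 MiB AArch64 range used here) keeps its thunk.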
GNU ld's thunk creation algorithm
Each port implements the algorithm on its own. There is no code sharing.
GNU ld's AArch64 port (bfd/elfnn-aarch64.c) uses an iterative algorithm but with a single stub type and no lookup table.
Main iteration loop (elfNN_aarch64_size_stubs()):

    group_sections(htab, stub_group_size, ...); // Default: 127 MiB

GNU ld's ppc64 port (bfd/elf64-ppc.c) uses an iterative multi-pass algorithm with a branch lookup table (.branch_lt) for long-range stubs.
Section grouping: Sections are grouped by stub_group_size (~28-30 MiB default); each group gets one stub section. For 14-bit conditional branches (R_PPC64_REL14, ±32KiB range), group size is reduced by 1024x.
Main iteration loop (ppc64_elf_size_stubs()):

    while (1) {

Convergence control:
- STUB_SHRINK_ITER = 20 (PR28827): After 20 iterations, stub sections only grow (prevents oscillation)
- Convergence when: !stub_changed && all section sizes stable

Stub type upgrade: ppc_type_of_stub() initially returns ppc_stub_long_branch for out-of-range branches. Later, ppc_size_one_stub() checks if the stub's branch can reach; if not, it upgrades to ppc_stub_plt_branch and allocates an 8-byte entry in .branch_lt.
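The upgrade decision can be sketched as follows. This is an illustration of the rule described above, not GNU ld's code: the function choose_stub and the list used to model .branch_lt are hypothetical, and the range check assumes the stub's own branch is a plain b with the ±32MiB reach of R_PPC64_REL24.

```python
REL24_RANGE = 1 << 25  # b/bl reach [-32MiB, +32MiB)


def choose_stub(stub_addr, target_addr, branch_lt):
    """Pick a stub type for one out-of-range call, given where the stub
    itself ended up. branch_lt collects targets that need a table entry."""
    distance = target_addr - stub_addr
    if -REL24_RANGE <= distance < REL24_RANGE:
        return "ppc_stub_long_branch"  # the stub's own b can reach
    branch_lt.append(target_addr)      # one 8-byte .branch_lt entry
    return "ppc_stub_plt_branch"


table = []
kinds = [choose_stub(0x2000000, t, table) for t in (0x3000000, 0x10000000)]
print(kinds, len(table) * 8)
```

A target 16MiB from the stub keeps the cheap long-branch stub, while a target 224MiB away is upgraded and costs one 8-byte table entry. Because an upgrade changes stub and table sizes, it can move other stubs, which is why the sizing loop has to iterate.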
Comparing linker thunk algorithms
| Aspect | lld/ELF | lld/MachO | mold | GNU ld ppc64 |
|---|---|---|---|---|
| Passes | Multi (max 30) | Single | Two | Multi (shrink after 20) |
| Strategy | Iterative refinement | Sliding window | Sliding window | Iterative refinement |
| Thunk placement | Pre-allocated intervals | Inline with slop | Batch intervals | Per stub-group |
Linker relaxation (RISC-V)
The RISC-V ports of GCC and Clang take a different approach: instead of only expanding branches, the linker can also shrink instruction sequences when the target is close enough.
Consider a function call using the call pseudo-instruction, which expands to auipc + jalr:

    # Before linking (8 bytes)
    call ext
    # Expands to:
    #   auipc ra, %pcrel_hi(ext)
    #   jalr ra, ra, %pcrel_lo(ext)

If ext is within ±1MiB, the linker can relax this to:

    # After relaxation (4 bytes)
    jal ext

This is enabled by R_RISCV_RELAX relocations that accompany R_RISCV_CALL relocations. The R_RISCV_RELAX relocation signals to the linker that this instruction sequence is a candidate for shrinking.
Example object code before linking:

    0000000000000006 <foo>:
           6: 97 00 00 00   auipc ra, 0
                    R_RISCV_CALL ext
                    R_RISCV_RELAX *ABS*
           a: e7 80 00 00   jalr ra
           e: 97 00 00 00   auipc ra, 0
                    R_RISCV_CALL ext
                    R_RISCV_RELAX *ABS*
          12: e7 80 00 00   jalr ra

After linking with relaxation enabled, the 8-byte auipc+jalr pairs become 4-byte jal instructions:

    0000000000000244 <foo>:
         244: 41 11         addi sp, sp, -16
         246: 06 e4         sd ra, 8(sp)
         248: ef 00 80 01   jal ext
         24c: ef 00 40 01   jal ext
         250: ef 00 00 01   jal ext

When the linker deletes instructions, it must also adjust:
- Subsequent instruction offsets within the section
- Symbol addresses
- Other relocations that reference affected locations
- Alignment directives (R_RISCV_ALIGN)
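The bookkeeping can be illustrated with a small sketch that shrinks in-range calls and shifts every address that follows a deleted instruction. It is a toy model under stated assumptions, not a linker implementation: the function relax_calls is hypothetical, everything lives in one section, R_RISCV_ALIGN padding is ignored, and the range check uses pre-relaxation addresses, which is conservative here because deleting bytes only brings a call and its target closer together.

```python
JAL_RANGE = 1 << 20  # jal reaches [-1MiB, +1MiB)


def relax_calls(call_sites, symbols):
    """call_sites: (address_of_auipc, target_symbol) pairs.
    symbols: name -> address before relaxation.
    Returns (symbol addresses, call-site addresses) after relaxation."""
    removed = []  # start addresses of the deleted 4-byte jalr slots
    for site, name in call_sites:
        distance = symbols[name] - site
        if -JAL_RANGE <= distance < JAL_RANGE:
            removed.append(site + 4)  # auipc becomes jal, jalr is deleted

    def shift(addr):
        # Every deletion strictly before addr moves it down by 4 bytes.
        return addr - 4 * sum(1 for r in removed if r < addr)

    new_symbols = {name: shift(addr) for name, addr in symbols.items()}
    new_sites = [shift(site) for site, _ in call_sites]
    return new_symbols, new_sites


symbols = {"foo": 0x6, "bar": 0x16, "ext": 0x100}
call_sites = [(0x6, "ext"), (0xE, "ext")]
print(relax_calls(call_sites, symbols))
```

Both calls shrink, so the second call site moves from 0xe to 0xa, bar moves from 0x16 to 0xe, and ext moves from 0x100 to 0xf8. A real linker must apply the same shift to relocation offsets and symbol sizes as well.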
This makes RISC-V linker relaxation more complex than thunk insertion, but it provides code size benefits that other architectures cannot achieve at link time.
Diagnosing out-of-range errors
When you encounter a "relocation out of range" error, check the linker diagnostic and locate the relocatable file and function. Determine how the function call is lowered in assembly.
Summary
Handling long branches requires coordination across the toolchain:
| Stage | Technique | Example |
|---|---|---|
| Compiler | Branch relaxation pass | Invert condition + add unconditional jump |
| Assembler | Instruction relaxation | Invert condition + add unconditional jump |
| Linker | Range extension thunks | Generate trampolines |
| Linker | Linker relaxation | Shrink auipc+jalr to jal (RISC-V) |
The linker's thunk generation is particularly important for large programs where function calls may exceed branch ranges. Different linkers use different algorithms with various tradeoffs between complexity, optimality, and robustness.
Linker relaxation, as adopted by RISC-V and LoongArch, is an alternative that avoids range extension thunks but introduces other complexities.
Related
- Relocation overflow and code models
- Linker notes on AArch32
- Linker notes on AArch64
- Linker notes on Power ISA
- Linker notes on x86
- Toolchain notes on MIPS
- Toolchain notes on z/Architecture