Long branches in compilers, assemblers, and linkers

Branch instructions on most architectures use PC-relative addressing with a limited range. When the target is too far away, the branch becomes "out of range" and requires special handling.

Consider a large binary where main() at address 0x10000 calls foo() at address 0x8010000, exactly 128MiB away. On AArch64, the bl instruction can only reach ±128MiB (the largest forward displacement is 128MiB minus 4 bytes), so this call cannot be encoded directly. Without proper handling, the linker would fail with an error like "relocation out of range." The toolchain must handle this transparently to produce correct executables.
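
To make the arithmetic concrete, here is a minimal Python sketch of the range check. The helper name `fits_bl` is made up for illustration and is not taken from any toolchain; it only assumes that `bl` encodes a signed 26-bit immediate scaled by 4.

```python
def fits_bl(source: int, target: int) -> bool:
    """Return True if an AArch64 `bl` at `source` can reach `target`.

    `bl` encodes a signed 26-bit immediate scaled by 4, so the reachable
    displacements are the multiples of 4 in [-2**27, 2**27 - 4].
    """
    disp = target - source
    return disp % 4 == 0 and -(1 << 27) <= disp <= (1 << 27) - 4


# main() at 0x10000 calling foo() at 0x8010000: displacement is exactly 2**27.
print(fits_bl(0x10000, 0x8010000))
# The same call with foo() 4 bytes closer.
print(fits_bl(0x10000, 0x800FFFC))
```

The first call is rejected and the second accepted, which is why the example in the text needs help from the toolchain.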

This article explores how compilers, assemblers, and linkers work together to solve the long branch problem.

  • Compiler (IR to assembly): Handles branches within a function that exceed the range of conditional branch instructions
  • Assembler (assembly to relocatable file): Handles branches within a section where the distance is known at assembly time
  • Linker: Handles cross-section and cross-object branches discovered during final layout

Branch range limitations

Different architectures have different branch range limitations. Here's a quick comparison of conditional branch, unconditional branch, and call ranges:

| Architecture | Cond | Uncond | Call | Notes |
|---|---|---|---|---|
| AArch64 | ±1MiB | ±128MiB | ±128MiB | Thunks |
| AArch32 (A32) | ±32MiB | ±32MiB | ±32MiB | Thunks, interworking |
| AArch32 (T32) | ±1MiB | ±16MiB | ±16MiB | Thunks, interworking |
| LoongArch | ±128KiB | ±128MiB | ±128MiB | Linker relaxation |
| M68k (68020+) | ±2GiB | ±2GiB | ±2GiB | Assembler picks size |
| MIPS (pre-R6) | ±128KiB | ±128KiB (b offset) | ±128KiB (bal offset) | In -fno-pic code, pseudo-absolute j/jal can be used for a 256MiB region. |
| MIPS R6 | ±128KiB | ±128MiB | ±128MiB | |
| PowerPC64 | ±32KiB | ±32MiB | ±32MiB | Thunks |
| RISC-V | ±4KiB | ±1MiB | ±1MiB | Linker relaxation |
| SPARC | ±1MiB | ±8MiB | ±2GiB | No thunks needed |
| SuperH | ±256B | ±4KiB | ±4KiB | Use register-indirect if needed |
| x86-64 | ±2GiB | ±2GiB | ±2GiB | Large code model changes call sequence |
| Xtensa | ±2KiB | ±128KiB | ±512KiB | Linker relaxation |
| z/Architecture | ±64KiB | ±4GiB | ±4GiB | No thunks needed |

The following subsections provide detailed per-architecture information, including relocation types relevant for linker implementation.
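
Most of the ranges listed in this article follow directly from the width of the immediate field and the instruction alignment. The sketch below shows the derivation; `branch_range` is a hypothetical helper, and the (bits, scale) pairs are the commonly documented field widths for a few of the instructions discussed here.

```python
def branch_range(imm_bits: int, scale: int) -> tuple:
    """Reachable displacement interval, in bytes, for a signed immediate of
    `imm_bits` bits that the hardware multiplies by `scale`."""
    lo = -(1 << (imm_bits - 1)) * scale
    hi = ((1 << (imm_bits - 1)) - 1) * scale
    return lo, hi


# (name, immediate bits, scale in bytes)
examples = [
    ("AArch64 b/bl", 26, 4),    # ±128MiB
    ("AArch64 b.cond", 19, 4),  # ±1MiB
    ("AArch64 tbz", 14, 4),     # ±32KiB
    ("RISC-V beq", 12, 2),      # ±4KiB
    ("RISC-V jal", 20, 2),      # ±1MiB
]
for name, bits, scale in examples:
    lo, hi = branch_range(bits, scale)
    print(f"{name}: [{lo}, {hi}]")
```

Note the asymmetry: the forward limit is always one instruction unit short of the nominal "±" figure.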

AArch32

In A32 state:

  • Branch (b/b<cond>), conditional branch and link (bl<cond>) (R_ARM_JUMP24): ±32MiB
  • Unconditional branch and link (bl/blx, R_ARM_CALL): ±32MiB

Note: R_ARM_CALL is for unconditional bl/blx, which can be relaxed to BLX inline; R_ARM_JUMP24 is for branches that require a veneer for interworking.

In T32 state (Thumb state pre-ARMv8):

  • Conditional branch (b<cond>, R_ARM_THM_JUMP8): ±256 bytes
  • Short unconditional branch (b, R_ARM_THM_JUMP11): ±2KiB
  • ARMv5T branch and link (bl/blx, R_ARM_THM_CALL): ±4MiB
  • ARMv6T2 wide conditional branch (b<cond>.w, R_ARM_THM_JUMP19): ±1MiB
  • ARMv6T2 wide branch (b.w, R_ARM_THM_JUMP24): ±16MiB
  • ARMv6T2 wide branch and link (bl/blx, R_ARM_THM_CALL): ±16MiB. R_ARM_THM_CALL can be relaxed to BLX.

AArch64

  • Test bit and branch (tbz/tbnz, R_AARCH64_TSTBR14): ±32KiB
  • Compare and branch (cbz/cbnz, R_AARCH64_CONDBR19): ±1MiB
  • Conditional branches (b.<cond>, R_AARCH64_CONDBR19): ±1MiB
  • Unconditional branches (b/bl, R_AARCH64_JUMP26/R_AARCH64_CALL26): ±128MiB

The compiler's BranchRelaxation pass handles out-of-range conditional branches by inverting the condition and inserting an unconditional branch. The AArch64 assembler does not perform branch relaxation; out-of-range branches produce linker errors if not handled by the compiler.

LoongArch

  • Conditional branches (beq/bne/blt/bge/bltu/bgeu, R_LARCH_B16): ±128KiB (18-bit signed)
  • Compare-to-zero branches (beqz/bnez, R_LARCH_B21): ±4MiB (23-bit signed)
  • Unconditional branch/call (b/bl, R_LARCH_B26): ±128MiB (28-bit signed)
  • Medium range call (pcaddu12i+jirl, R_LARCH_CALL30): ±2GiB
  • Long range call (pcaddu18i+jirl, R_LARCH_CALL36): ±128GiB

M68k

  • Short branch (Bcc.B/BRA.B/BSR.B): ±128 bytes (8-bit displacement)
  • Word branch (Bcc.W/BRA.W/BSR.W): ±32KiB (16-bit displacement)
  • Long branch (Bcc.L/BRA.L/BSR.L, 68020+): ±2GiB (32-bit displacement)

GNU Assembler provides pseudo opcodes (jbsr, jra, jXX) that "automatically expand to the shortest instruction capable of reaching the target". For example, jeq .L0 emits one of beq.b, beq.w, and beq.l depending on the displacement.
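
The size selection can be sketched as follows. This is an illustration rather than the GNU Assembler's actual logic: `pick_m68k_branch_size` is a made-up name, and `disp` stands for the displacement value as it would be encoded.

```python
def pick_m68k_branch_size(disp: int, has_68020: bool = True) -> str:
    """Pick the shortest Bcc/BRA/BSR suffix able to encode `disp`.

    In the byte form, the displacement bit patterns 0x00 and 0xFF are not
    usable as displacements: they select the word and long forms instead.
    """
    if -128 <= disp <= 127 and disp not in (0, -1):
        return ".b"
    if -(1 << 15) <= disp < (1 << 15):
        return ".w"
    if has_68020 and -(1 << 31) <= disp < (1 << 31):
        return ".l"
    raise ValueError("displacement not encodable as a PC-relative branch")


for d in (100, 1000, 1 << 20):
    print(d, pick_m68k_branch_size(d))
```

Because the displacement itself depends on the sizes chosen for other branches in between, a real assembler has to iterate this choice until the layout is stable.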

With the long forms available on 68020 and later, M68k doesn't need linker range extension thunks.

MIPS

  • Conditional branches (beq/bne/bgez/bltz/etc, R_MIPS_PC16): ±128KiB
  • PC-relative jump (b offset, i.e. bgez $zero, offset): ±128KiB
  • PC-relative call (bal offset, i.e. bgezal $zero, offset): ±128KiB
  • Pseudo-absolute jump/call (j/jal, R_MIPS_26): branch within the current 256MiB region, only suitable for -fno-pic code. Deprecated in R6 in favor of bc/balc

16-bit instructions removed in Release 6:

  • Conditional branch (beqz16, R_MICROMIPS_PC7_S1): ±128 bytes
  • Unconditional branch (b16, R_MICROMIPS_PC10_S1): ±1KiB

MIPS Release 6:

  • Unconditional branch, compact (bc16, unclear toolchain implementation): ±1KiB
  • Compare and branch, compact (beqc/bnec/bltc/bgec/etc, R_MIPS_PC16): ±128KiB
  • Compare register to zero and branch, compact (beqzc/bnezc/etc, R_MIPS_PC21_S2): ±4MiB
  • Branch (and link), compact (bc/balc, R_MIPS_PC26_S2): ±128MiB

LLVM's MipsBranchExpansion pass handles out-of-range branches.

lld implements LA25 thunks for MIPS PIC/non-PIC interoperability, but not range extension thunks.

GCC's MIPS port added -mlong-calls in 1993-03.

PowerPC

  • Conditional branch (bc/bcl, R_PPC64_REL14): ±32KiB
  • Unconditional branch (b/bl, R_PPC64_REL24/R_PPC64_REL24_NOTOC): ±32MiB

GCC-generated code relies on linker thunks. However, the legacy -mlongcall can be used to generate long code sequences.

RISC-V

  • Compressed c.beqz: ±256 bytes
  • Compressed c.jal: ±2KiB
  • jalr (I-type immediate): ±2KiB
  • Conditional branches (beq/bne/blt/bge/bltu/bgeu, B-type immediate): ±4KiB
  • jal (J-type immediate, PseudoBR): ±1MiB (notably smaller than other RISC architectures: AArch64 ±128MiB, PowerPC64 ±32MiB, LoongArch ±128MiB)
  • PseudoJump (using auipc + jalr): ±2GiB
  • beqi/bnei (Zibi extension, 5-bit compare immediate (1 to 31 and -1)): ±4KiB

Qualcomm uC Branch Immediate extension (Xqcibi):

  • qc.beqi/qc.bnei/qc.blti/qc.bgei/qc.bltui/qc.bgeui (32-bit, 5-bit compare immediate): ±4KiB
  • qc.e.beqi/qc.e.bnei/qc.e.blti/qc.e.bgei/qc.e.bltui/qc.e.bgeui (48-bit, 16-bit compare immediate): ±4KiB

Qualcomm uC Long Branch extension (Xqcilb):

  • qc.e.j/qc.e.jal (48-bit, R_RISCV_VENDOR(QUALCOMM)+R_RISCV_QC_E_CALL_PLT): ±2GiB

For function calls:

  • The Go compiler emits a single jal for calls and relies on its linker to generate trampolines when the target is out of range.
  • In contrast, GCC and Clang emit auipc+jalr and rely on linker relaxation to shrink the sequence when possible.

The jal range (±1MiB) is notably smaller than other RISC architectures (AArch64 ±128MiB, PowerPC64 ±32MiB, LoongArch ±128MiB). This limits the effectiveness of linker relaxation ("start large and shrink"), and leads to frequent trampolines when the compiler optimistically emits jal ("start small and grow").

SPARC

  • Compare and branch (cxbe, R_SPARC_5): ±64 bytes
  • Conditional branch (bcc, R_SPARC_WDISP19): ±1MiB
  • Unconditional branch (b, R_SPARC_WDISP22): ±8MiB
  • call (R_SPARC_WDISP30/R_SPARC_WPLT30): ±2GiB

With ±2GiB range for call, SPARC doesn't need range extension thunks in practice.

SuperH

SuperH uses fixed-width 16-bit instructions, which limits branchranges.

  • Conditional branch (bf/bt): ±256 bytes (8-bit displacement)
  • Unconditional branch (bra): ±4KiB (12-bit displacement)
  • Branch to subroutine (bsr): ±4KiB (12-bit displacement)

For longer distances, register-indirect branches (braf/bsrf) are used. The compiler inverts conditions and emits these when targets exceed the short ranges.

SuperH is supported by GCC and binutils, but not by LLVM.

Xtensa

Xtensa uses variable-length instructions: 16-bit (narrow, .n suffix) and 24-bit (standard).

  • Narrow conditional branch (beqz.n/bnez.n, 16-bit): -28 to +35 bytes (6-bit signed + 4)
  • Conditional branch (compare two registers) (beq/bne/blt/bge/etc, 24-bit): ±256 bytes
  • Conditional branch (compare with zero) (beqz/bnez/bltz/bgez, 24-bit): ±2KiB
  • Unconditional jump (j, 24-bit): ±128KiB
  • Call (call0/call4/call8/call12, 24-bit): ±512KiB

The assembler performs branch relaxation: when a conditional branch target is too far, it inverts the condition and inserts a j instruction.

Per https://www.sourceware.org/binutils/docs/as/Xtensa-Call-Relaxation.html, for calls, GNU Assembler pessimistically generates indirect sequences (l32r+callx8) when the target distance is unknown. GNU ld then performs linker relaxation.

x86-64

  • Short conditional jump (Jcc rel8): -128 to +127 bytes
  • Short unconditional jump (JMP rel8): -128 to +127 bytes
  • Near conditional jump (Jcc rel32): ±2GiB
  • Near unconditional jump (JMP rel32): ±2GiB

With a ±2GiB range for near jumps, x86-64 rarely encounters out-of-range branches in practice. That said, Google and Meta Platforms deploy mostly statically linked executables on x86-64 production servers and have run into the huge executable problem for certain configurations.

z/Architecture

  • Short conditional branch (BRC, R_390_PC16DBL): ±64KiB (16-bit halfword displacement)
  • Long conditional branch (BRCL, R_390_PC32DBL): ±4GiB (32-bit halfword displacement)
  • Short call (BRAS, R_390_PC16DBL): ±64KiB
  • Long call (BRASL, R_390_PC32DBL): ±4GiB

With ±4GiB range for long forms, z/Architecture doesn't need linker range extension thunks. LLVM's SystemZLongBranch pass relaxes short branches (BRC/BRAS) to long forms (BRCL/BRASL) when targets are out of range.

Compiler: branch range handling

Conditional branch instructions usually have shorter ranges than unconditional ones, making them less suitable for linker thunks (as we will explore later). Compilers typically keep conditional branch targets within the same section, allowing the compiler to handle out-of-range cases via branch relaxation.

Within a function, conditional branches may still go out of range. The compiler measures branch distances and relaxes out-of-range branches by inverting the condition and inserting an unconditional branch:

    # Before relaxation (out of range)
    beq a0, a1, .Lfar_target   # ±4KiB range on RISC-V

    # After relaxation
    bne a0, a1, .Lskip         # Inverted condition, short range
    j .Lfar_target             # Unconditional jump, ±1MiB range
    .Lskip:

Some architectures have conditional branch instructions that compare with an immediate or test a bit, with even shorter ranges due to encoding additional immediates. For example, AArch64's tbz/tbnz (test bit and branch) have only ±32KiB range, whereas cbz/cbnz (compare and branch if zero/non-zero) have ±1MiB like b.<cond>. RISC-V Zibi beqi/bnei have ±4KiB range. The compiler handles these in a similar way:

    // Before relaxation (tbz has ±32KiB range)
    tbz w0, #0, far

    // After relaxation
    tbnz w0, #0, .Lskip   // Inverted condition
    b far                 // Unconditional branch, ±128MiB range
    .Lskip:

An Intel employee contributed https://reviews.llvm.org/D41634 (in 2017) to handle the case where inverting a branch condition is impossible. This is for an out-of-tree backend. As of Jan 2026 there is no in-tree test for this code path.

In LLVM, this is handled by the BranchRelaxation pass, which runs just before AsmPrinter. Different backends have their own implementations:

  • BranchRelaxation: AArch64, AMDGPU, AVR, RISC-V
  • HexagonBranchRelaxation: Hexagon
  • PPCBranchSelector: PowerPC
  • SystemZLongBranch: SystemZ
  • MipsBranchExpansion: MIPS
  • MSP430BSel: MSP430

The generic BranchRelaxation pass computes block sizes and offsets, then iterates until all branches are in range. For conditional branches, it tries to invert the condition and insert an unconditional branch. For unconditional branches that are still out of range, it calls TargetInstrInfo::insertIndirectBranch to emit an indirect jump sequence (e.g., adrp+add+br on AArch64) or a long jump sequence (e.g., pseudo jump on RISC-V).
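
The iteration can be sketched in a few lines of Python. This is a toy model, not LLVM's implementation: the `Block` type, the constants, and the assumption that each block ends with at most one conditional branch are all simplifications made for the example, and the indirect-branch fallback for unconditional branches is left out.

```python
from dataclasses import dataclass
from typing import List, Optional

COND_RANGE = 4096  # assumed conditional-branch reach in bytes (RISC-V-like)
INSN = 4           # assumed fixed instruction size


@dataclass
class Block:
    size: int                     # bytes, including the terminating branch
    target: Optional[int] = None  # index of the block the branch targets
    relaxed: bool = False         # True once rewritten as inverted branch + jump


def relax(blocks: List[Block]) -> int:
    """Grow out-of-range conditional branches until the layout is stable.

    Returns the number of branches that were relaxed.
    """
    count = 0
    changed = True
    while changed:
        changed = False
        # Recompute block start offsets from the current sizes.
        offsets, pos = [], 0
        for b in blocks:
            offsets.append(pos)
            pos += b.size
        for i, b in enumerate(blocks):
            if b.target is None or b.relaxed:
                continue
            branch_addr = offsets[i] + b.size - INSN  # branch is the last insn
            disp = offsets[b.target] - branch_addr
            if not -COND_RANGE <= disp < COND_RANGE:
                b.relaxed = True
                b.size += INSN  # room for the extra unconditional jump
                count += 1
                changed = True
    return count


blocks = [Block(4, target=2), Block(4088, target=3), Block(5000), Block(4)]
print(relax(blocks), [b.size for b in blocks])
```

In this input the first branch is initially in range (displacement 4092), but relaxing the second branch pushes its target 4 bytes further away, so a second iteration relaxes it too. Since blocks only ever grow, the loop terminates.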

Note: The size estimates may be inaccurate due to inline assembly. LLVM uses heuristics to estimate inline assembly sizes, but for certain assembly constructs the size is not precisely known at compile time.

Unconditional branches and calls can target different sections since they have larger ranges. If the target is out of reach, the linker can insert thunks to extend the range.

For x86-64, the large code model uses multiple instructions for calls and jumps to support text sections larger than 2GiB (see Relocation overflow and code models: x86-64 large code model). This is a pessimization if the callee ends up being within reach. Google and Meta Platforms are interested in allowing range extension thunks as a replacement for the multiple instructions.

Assembler: instruction relaxation

The assembler converts assembly to machine code. When the target of a branch is within the same section and the distance is known at assembly time, the assembler can select the appropriate encoding. This is distinct from linker thunks, which handle cross-section or cross-object references where distances aren't known until link time.

Assembler instruction relaxation handles two cases (see Clang -O0 output: branch displacement and size increase for examples):

  • Span-dependent instructions: Select an appropriate encoding based on displacement.
    • On x86, a short jump (jmp rel8) can be relaxed to a near jump (jmp rel32) when the target is far.
    • On RISC-V, beqz may be assembled to the 2-byte c.beqz when the displacement fits within ±256 bytes.
  • Conditional branch transform: Invert the condition and insert an unconditional branch. On RISC-V, a blt might be relaxed to bge plus an unconditional branch.

The assembler uses an iterative layout algorithm that alternates between fragment offset assignment and relaxation until all fragments become legalized. See Integrated assembler improvements in LLVM 19 for implementation details.
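
A minimal sketch of that alternation, using the x86 jmp rel8 (2 bytes) versus jmp rel32 (5 bytes) case. The item representation and the `layout` function are invented for this example and are much simpler than LLVM's fragment machinery.

```python
def layout(items):
    """items: list of ("label", name) | ("bytes", n) | ("jmp", name).

    Returns (sizes, label_offsets); each jmp is 2 bytes (rel8) or 5 (rel32).
    """
    size = [2 if kind == "jmp" else (arg if kind == "bytes" else 0)
            for kind, arg in items]
    while True:
        # Step 1: assign offsets from the current sizes.
        offsets, labels, pos = [], {}, 0
        for (kind, arg), n in zip(items, size):
            offsets.append(pos)
            if kind == "label":
                labels[arg] = pos
            pos += n
        # Step 2: relax every short jump whose rel8 no longer fits.
        changed = False
        for i, (kind, arg) in enumerate(items):
            if kind == "jmp" and size[i] == 2:
                disp = labels[arg] - (offsets[i] + 2)  # relative to next insn
                if not -128 <= disp <= 127:
                    size[i] = 5
                    changed = True
        if not changed:
            return size, labels


items = [("jmp", "a"), ("jmp", "b"), ("bytes", 125),
         ("label", "a"), ("bytes", 3), ("label", "b")]
print(layout(items))
```

Here the second jump is relaxed first (displacement 128), which moves label `a` three bytes further and forces the first jump to be relaxed in the next round. Starting small and only growing guarantees convergence.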

Linker: range extension thunks

When the linker resolves relocations, it may discover that a branch target is out of range. At this point, the instruction encoding is fixed, so the linker cannot simply change the instruction. Instead, it generates range extension thunks (also called veneers, branch stubs, or trampolines).

A thunk is a small piece of linker-generated code that can reach the actual target using a longer sequence of instructions. The original branch is redirected to the thunk, which then jumps to the real destination.

Range extension thunks are one type of linker-generated thunk. Other types include:

  • ARM interworking veneers: Switch between ARM and Thumb instruction sets (see Linker notes on AArch32)
  • MIPS LA25 thunks: Enable PIC and non-PIC code interoperability (see Toolchain notes on MIPS)
  • PowerPC64 TOC/NOTOC thunks: Handle calls between functions using different TOC pointer conventions (see Linker notes on Power ISA)

Short range vs long range thunks

A short range thunk (see lld/ELF's AArch64 implementation) contains just a single branch instruction. Since it uses a branch, its reach is also limited by the branch range: it can only extend coverage by one branch distance. For targets further away, multiple short range thunks can be chained, or a long range thunk with address computation must be used.

Long range thunks use indirection and can jump to (practically) arbitrary locations.

    // Short range thunk: single branch, 4 bytes
    __AArch64AbsLongThunk_dst:
      b dst                      // ±128MiB range

    // Long range thunk: address computation, 12 bytes
    __AArch64ADRPThunk_dst:
      adrp x16, dst              // Load page address (±4GiB range)
      add x16, x16, :lo12:dst    // Add page offset
      br x16                     // Indirect branch
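
The choice between the two forms comes down to the distance from the thunk to the destination. A rough sketch, with `pick_thunk` as a hypothetical helper that is not lld's API and that assumes 4-byte-aligned addresses:

```python
B_RANGE = 1 << 27     # b reaches displacements in [-128MiB, 128MiB)
ADRP_RANGE = 1 << 32  # adrp reaches page displacements in [-4GiB, 4GiB)


def pick_thunk(thunk_addr: int, dst: int) -> str:
    """Classify which thunk body could reach `dst` from `thunk_addr`."""
    disp = dst - thunk_addr
    if -B_RANGE <= disp < B_RANGE:
        return "short"  # single `b dst`, 4 bytes
    # adrp works on 4KiB pages: compare page addresses, not byte addresses.
    page_disp = (dst & ~0xFFF) - (thunk_addr & ~0xFFF)
    if -ADRP_RANGE <= page_disp < ADRP_RANGE:
        return "adrp"   # adrp + add + br, 12 bytes
    return "none"       # neither form reaches; another sequence is needed


print(pick_thunk(0x10000, 0x10000 + (1 << 27) - 4))
print(pick_thunk(0x10000, 0x10000 + (1 << 27)))
```

The short form is attractive because it is a third of the size, but it only helps when the thunk sits roughly halfway between caller and callee.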

Thunk examples

AArch32 (PIC) (see Linker notes on AArch32):

    __ARMV7PILongThunk_dst:
      movw ip, :lower16:(dst - .)  @ ip = intra-procedure-call scratch register
      movt ip, :upper16:(dst - .)
      add ip, ip, pc
      bx ip

PowerPC64 ELFv2 (see Linker notes on PowerISA):

    __long_branch_dst:
      addis 12, 2, .branch_lt@ha   # Load high bits from branch lookup table
      ld 12, .branch_lt@l(12)      # Load target address
      mtctr 12                     # Move to count register
      bctr                         # Branch to count register

Thunk impact on debugging and profiling

Thunks are transparent at the source level but visible in low-level tools:

  • Stack traces: May show thunk symbols (e.g., __AArch64ADRPThunk_foo) between caller and callee
  • Profilers: Samples may attribute time to thunk code; some profilers aggregate thunk time with the target function
  • Disassembly: objdump or llvm-objdump will show thunk sections interspersed with regular code
  • Code size: Each thunk adds bytes; large binaries may have thousands of thunks

lld/ELF's thunk creation algorithm

lld/ELF uses a multi-pass algorithm in finalizeAddressDependentContent:

    assignAddresses();
    for (pass = 0; pass < 30; ++pass) {
      if (pass == 0)
        createInitialThunkSections();   // pre-create empty ThunkSections
      bool changed = false;
      for (rel : all_relocations) {
        if (pass > 0 && normalizeExistingThunk(rel))
          continue;                     // existing thunk still in range
        if (!needsThunk(rel)) continue;
        Thunk *t = getOrCreateThunk(rel);
        ts = findOrCreateThunkSection(rel, src);
        ts->addThunk(t);
        rel.sym = t->getThunkTargetSym(); // redirect
        changed = true;
      }
      mergeThunks();                    // insert ThunkSections into output
      if (!changed) break;
      assignAddresses();                // recalculate with new thunks
    }

Key details:

  • Multi-pass: Iterates until convergence (max 30 passes). Adding thunks changes addresses, potentially putting previously-in-range calls out of range.
  • Pre-allocated ThunkSections: On pass 0, createInitialThunkSections places empty ThunkSections at regular intervals (thunkSectionSpacing). For AArch64: 128 MiB - 0x30000 ≈ 127.8 MiB.
  • Thunk reuse: getThunk returns an existing thunk if one exists for the same target; normalizeExistingThunk checks if a previously-created thunk is still in range.
  • ThunkSection placement: getISDThunkSec finds a ThunkSection within branch range of the call site, or creates one adjacent to the calling InputSection.
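
The spacing figure is easy to check, and a simplified placement model shows what "regular intervals" means. The helper below is an illustration only: it ignores input section boundaries, which the real linker has to respect when choosing where a ThunkSection can go.

```python
MIB = 1 << 20
spacing = 128 * MIB - 0x30000  # AArch64 value quoted above


def initial_thunk_offsets(text_size: int, spacing: int):
    """Offsets at which a simplified linker would pre-create empty
    ThunkSections: one every `spacing` bytes of text."""
    return list(range(spacing, text_size, spacing))


print(spacing, spacing / MIB)
print(initial_thunk_offsets(300 * MIB, spacing))
```

Keeping the spacing slightly below the branch range leaves headroom so that a call site just before a ThunkSection boundary can still reach a thunk on either side.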

lld/MachO's thunk creation algorithm

lld/MachO uses a single-pass algorithm in TextOutputSection::finalize:

    for (callIdx = 0; callIdx < inputs.size(); ++callIdx) {
      // Finalize sections within forward branch range (minus slop)
      while (finalIdx < endIdx && fits_in_range(inputs[finalIdx]))
        finalizeOne(inputs[finalIdx++]);

      // Process branch relocations in this section
      isec = inputs[callIdx];
      for (Relocation &r : reverse(isec->relocs)) {
        if (!isBranchReloc(r)) continue;
        if (targetInRange(r)) continue;
        if (existingThunkInRange(r)) { reuse it; continue; }
        // Create new thunk and finalize it
        createThunk(r);
      }
    }

Key differences from lld/ELF:

  • Single pass: Addresses are assigned monotonically and never revisited
  • Slop reservation: Reserves slopScale * thunkSize bytes (default: 256 × 12 = 3072 bytes on ARM64) to leave room for future thunks
  • Thunk naming: <function>.thunk.<sequence> where sequence increments per target

Thunk starvation problem: If many consecutive branches need thunks, each thunk (12 bytes) consumes slop faster than call sites (4 bytes apart) advance. The test lld/test/MachO/arm64-thunk-starvation.s demonstrates this edge case. Mitigation is increasing --slop-scale, but pathological cases with hundreds of consecutive out-of-range callees can still fail.

mold's thunk creation algorithm

mold uses a two-pass approach:

  • Pessimistically over-allocate thunks. Out-of-section relocations and relocations referencing a section that has not been assigned an address yet are pessimistically assumed to need thunks (requires_thunk(ctx, isec, rel, first_pass) when first_pass=true).
  • Then remove unnecessary ones.

Linker pass ordering:

  • compute_section_sizes() calls create_range_extension_thunks(); final section addresses are NOT yet known
  • set_osec_offsets() assigns section addresses
  • remove_redundant_thunks() is called AFTER addresses are known; it removes thunks that turned out to be unneeded for out-of-section relocations
  • Rerun set_osec_offsets()

Pass 1 (create_range_extension_thunks): Process sections in batches using a sliding window. The window tracks four positions:

    Sections:   [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] ...
                 ^       ^       ^               ^
                 A       B       C               D
                 |       |_______|               |
                 |         batch                 |
                 |                               |
              earliest                         thunk
              reachable                      placement
              from C

  • [B, C) = current batch of sections to process (size ≤ branch_distance/5)
  • A = earliest section still reachable from C (for thunk expiration)
  • D = where to place the thunk (furthest point reachable from B)

    // Simplified from OutputSection<E>::create_range_extension_thunks
    while (b < sections.size()) {
      // Advance D: find furthest point where thunk is reachable from B
      while (d < size && thunk_at_d_reachable_from_b)
        assign_address(sections[d++]);

      // Compute batch [B, C)
      c = b + 1;
      while (c < d && sections[c] < sections[b] + batch_size) c++;

      // Advance A: expire thunks no longer reachable
      while (a < b && sections[a] + branch_distance < sections[c]) a++;
      // Expire thunk groups before A: clear symbol flags.
      for (; t < thunks.size() && thunks[t].offset < sections[a]; t++)
        for (sym in thunks[t].symbols) sym->flags = 0;

      // Scan [B,C) relocations. If a symbol is not assigned to a thunk group yet,
      // assign it to the new thunk group at D.
      auto &thunk = thunks.emplace_back(new Thunk(offset));
      parallel_for(b, c, [&](i64 i) {
        for (rel in sections[i].relocs) {
          if (requires_thunk(rel)) {
            Symbol &sym = rel.symbol;
            if (!sym.flags.test_and_set()) { // atomic: skip if already set
              lock_guard lock(mu);
              thunk.symbols.push_back(&sym);
            }
          }
        }
      });
      offset += thunk.size();
      b = c; // Move to next batch
    }

Pass 2 (remove_redundant_thunks): After final addresses are known, remove thunk entries for symbols actually in range.

Key characteristics:

  • Pessimistic over-allocation: Assumes all out-of-section calls need thunks; safe to shrink later
  • Batch size: branch_distance/5 (25.6 MiB for AArch64, 3.2 MiB for AArch32)
  • Parallelism: Uses TBB for parallel relocation scanning within each batch
  • Single branch range: Uses one conservative branch_distance per architecture. For AArch32, uses ±16 MiB (Thumb limit) for all branches, whereas lld/ELF uses ±32 MiB for A32 branches.
  • Thunk size not accounted in D-advancement: The actual thunk group size is unknown when advancing D, so the end of a large thunk group may be unreachable from the beginning of the batch.
  • No convergence loop: Single forward pass for address assignment, no risk of non-convergence
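
The batching step on its own can be sketched as below. This is a simplification of the pseudocode above, not mold's code: `batches` is a made-up helper that only looks at section start offsets and drops the bound on D.

```python
def batches(section_offsets, batch_size):
    """Split sorted section start offsets into consecutive batches whose
    starts lie within `batch_size` bytes of the batch's first section.

    Returns half-open index intervals [b, c).
    """
    out, b = [], 0
    while b < len(section_offsets):
        c = b + 1
        while (c < len(section_offsets)
               and section_offsets[c] < section_offsets[b] + batch_size):
            c += 1
        out.append((b, c))
        b = c
    return out


MIB = 1 << 20
print((128 * MIB) / 5 / MIB)  # AArch64 batch size in MiB
print(batches([0, 10, 20, 40, 45], 25))
```

Smaller batches place thunks closer to their callers at the cost of more thunk groups; one fifth of the branch distance is the tradeoff mold picks.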

GNU ld's thunk creation algorithm

Each port implements the algorithm on its own. There is no code sharing.

GNU ld's AArch64 port (bfd/elfnn-aarch64.c) uses an iterative algorithm but with a single stub type and no lookup table.

Main iteration loop (elfNN_aarch64_size_stubs()):

    group_sections(htab, stub_group_size, ...);  // Default: 127 MiB
    layout_sections_again();

    for (;;) {
      stub_changed = false;
      _bfd_aarch64_add_call_stub_entries(&stub_changed, ...);
      if (!stub_changed)
        return true;
      _bfd_aarch64_resize_stubs(htab);
      layout_sections_again();
    }

GNU ld's ppc64 port (bfd/elf64-ppc.c) uses an iterative multi-pass algorithm with a branch lookup table (.branch_lt) for long-range stubs.

Section grouping: Sections are grouped by stub_group_size (~28-30 MiB default); each group gets one stub section. For 14-bit conditional branches (R_PPC64_REL14, ±32KiB range), group size is reduced by 1024x.

Main iteration loop (ppc64_elf_size_stubs()):

    while (1) {
      // Scan all relocations in all input sections
      for (input_bfd; section; irela) {
        // Only process branch relocations (R_PPC64_REL24, R_PPC64_REL14, etc.)
        stub_type = ppc_type_of_stub(section, irela, ...);
        if (stub_type == ppc_stub_none)
          continue;
        // Create or merge stub entry
        stub_entry = ppc_add_stub(...);
      }

      // Size all stubs, potentially upgrading long_branch to plt_branch
      bfd_hash_traverse(&stub_hash_table, ppc_size_one_stub, ...);

      // Check for convergence
      if (!stub_changed && all_sizes_stable)
        break;

      // Re-layout sections
      layout_sections_again();
    }

Convergence control:

  • STUB_SHRINK_ITER = 20 (PR28827): After 20 iterations, stub sections only grow (prevents oscillation)
  • Convergence when: !stub_changed && all section sizes stable

Stub type upgrade: ppc_type_of_stub() initially returns ppc_stub_long_branch for out-of-range branches. Later, ppc_size_one_stub() checks if the stub's branch can reach; if not, it upgrades to ppc_stub_plt_branch and allocates an 8-byte entry in .branch_lt.

Comparing linker thunk algorithms

| Aspect | lld/ELF | lld/MachO | mold | GNU ld ppc64 |
|---|---|---|---|---|
| Passes | Multi (max 30) | Single | Two | Multi (shrink after 20) |
| Strategy | Iterative refinement | Sliding window | Sliding window | Iterative refinement |
| Thunk placement | Pre-allocated intervals | Inline with slop | Batch intervals | Per stub-group |

Linker relaxation

Some architectures take a different approach: instead of only expanding branches, the linker can also shrink instruction sequences when the target is close enough. RISC-V and LoongArch both use this technique. See The dark side of RISC-V linker relaxation for a deeper dive into the complexities and tradeoffs.

Consider a function call using the call pseudo-instruction, which expands to auipc + jalr:

    # Before linking (8 bytes)
    call ext
    # Expands to:
    #   auipc ra, %pcrel_hi(ext)
    #   jalr ra, ra, %pcrel_lo(ext)

If ext is within ±1MiB, the linker can relax this to:

    # After relaxation (4 bytes)
    jal ext

This is enabled by R_RISCV_RELAX relocations that accompany R_RISCV_CALL relocations. The R_RISCV_RELAX relocation signals to the linker that this instruction sequence is a candidate for shrinking.

Example object code before linking:

    0000000000000006 <foo>:
           6: 97 00 00 00   auipc   ra, 0
                    R_RISCV_CALL ext
                    R_RISCV_RELAX *ABS*
           a: e7 80 00 00   jalr    ra
           e: 97 00 00 00   auipc   ra, 0
                    R_RISCV_CALL ext
                    R_RISCV_RELAX *ABS*
          12: e7 80 00 00   jalr    ra

After linking with relaxation enabled, the 8-byte auipc+jalr pairs become 4-byte jal instructions:

    0000000000000244 <foo>:
         244: 41 11         addi    sp, sp, -16
         246: 06 e4         sd      ra, 8(sp)
         248: ef 00 80 01   jal     ext
         24c: ef 00 40 01   jal     ext
         250: ef 00 00 01   jal     ext
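
To see that these encodings really are short jumps to the same function, the J-type immediate can be decoded by hand. The following Python sketch (the function name `decode_jal` is ours) follows the standard RISC-V J-type bit layout:

```python
def decode_jal(code: bytes):
    """Decode a 4-byte little-endian RISC-V `jal`; return (rd, offset)."""
    insn = int.from_bytes(code, "little")
    assert insn & 0x7F == 0x6F, "not a jal opcode"
    rd = (insn >> 7) & 0x1F
    imm = (
        ((insn >> 31) & 0x1) << 20      # imm[20]
        | ((insn >> 12) & 0xFF) << 12   # imm[19:12]
        | ((insn >> 20) & 0x1) << 11    # imm[11]
        | ((insn >> 21) & 0x3FF) << 1   # imm[10:1]
    )
    if imm & (1 << 20):                 # sign-extend the 21-bit value
        imm -= 1 << 21
    return rd, imm


listing = [(0x248, "ef008001"), (0x24C, "ef004001"), (0x250, "ef000001")]
for addr, hexbytes in listing:
    rd, off = decode_jal(bytes.fromhex(hexbytes))
    print(f"{addr:#x}: jal x{rd}, {addr + off:#x}")
```

All three instructions write the return address to x1 (ra) and resolve to the same target address, with offsets of 24, 20, and 16 bytes respectively.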

When the linker deletes instructions, it must also adjust:

  • Subsequent instruction offsets within the section
  • Symbol addresses
  • Other relocations that reference affected locations
  • Alignment directives (R_RISCV_ALIGN)
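
The address bookkeeping can be sketched as follows. This is a minimal model with an invented helper, `adjust`; it assumes the deleted ranges are sorted, non-overlapping, and that no adjusted address falls inside a deleted range.

```python
import bisect


def adjust(addresses, deletions):
    """Shift `addresses` down after deleting byte ranges.

    `deletions` is a sorted list of (offset, length) pairs in pre-relaxation
    coordinates; each address moves down by the bytes deleted before it.
    """
    starts = [off for off, _ in deletions]
    removed_before = [0]
    for _, n in deletions:
        removed_before.append(removed_before[-1] + n)
    return [a - removed_before[bisect.bisect_left(starts, a)]
            for a in addresses]


# Two auipc+jalr pairs at 0x6 and 0xe each shrink to a 4-byte jal,
# deleting the bytes at [0xa, 0xe) and [0x12, 0x16).
print([hex(a) for a in adjust([0x6, 0xE, 0x16], [(0xA, 4), (0x12, 4)])])
```

The same mapping has to be applied to symbol values, relocation offsets, and anything else that records a position inside the section, which is where much of the implementation complexity comes from.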

This makes RISC-V linker relaxation more complex than thunk insertion, but it provides code size benefits that other architectures cannot achieve at link time.

LoongArch uses a similar approach. A pcaddu18i+jirl sequence (R_LARCH_CALL36, ±128GiB range) can be relaxed to a single bl instruction (R_LARCH_B26, ±128MiB range) when the target is close enough.

Diagnosing out-of-range errors

When you encounter a "relocation out of range" error, check the linker diagnostic and locate the relocatable file and function. Determine how the function call is lowered in assembly.

Summary

Handling long branches requires coordination across the toolchain:

| Stage | Technique | Example |
|---|---|---|
| Compiler | Branch relaxation pass | Invert condition + add unconditional jump |
| Assembler | Instruction relaxation | Invert condition + add unconditional jump |
| Linker | Range extension thunks | Generate trampolines |
| Linker | Linker relaxation | Shrink auipc+jalr to jal (RISC-V) |

The linker's thunk generation is particularly important for large programs where function calls may exceed branch ranges. Different linkers use different algorithms with various tradeoffs between complexity, optimality, and robustness.

The linker relaxation approach adopted by RISC-V and LoongArch is an alternative that avoids range extension thunks but introduces other complexities.

Related

Handling long branches

Branch instructions on most architectures use PC-relative addressingwith a limited range. When the target is too far away, the branchbecomes "out of range" and requires special handling.

Consider a large binary where main() at address 0x10000calls foo() at address 0x8010000-over 128MiB away. OnAArch64, the bl instruction can only reach ±128MiB, so thiscall cannot be encoded directly. Without proper handling, the linkerwould fail with an error like "relocation out of range." The toolchainmust handle this transparently to produce correct executables.

This article explores how compilers, assemblers, and linkers worktogether to solve the long branch problem.

  • Compiler (IR to assembly): Handles branches within a function thatexceed the range of conditional branch instructions
  • Assembler (assembly to relocatable file): Handles branches within asection where the distance is known at assembly time
  • Linker: Handles cross-section and cross-object branches discoveredduring final layout

Branch range limitations

Different architectures have different branch range limitations.Here's a quick comparison of unconditional branch/call ranges:

Architecture Unconditional Branch Conditional Branch Notes
AArch64 ±128MiB ±1MiB Range extension thunks
AArch32 (A32) ±32MiB ±32MiB Range extension and interworking veneers
AArch32 (T32) ±16MiB ±1MiB Thumb has shorter ranges
PowerPC64 ±32MiB ±32KiB Range extension and TOC/NOTOC interworking thunks
RISC-V ±1MiB (jal) ±4KiB Linker relaxation
x86-64 ±2GiB ±2GiB Code models or thunk extension

The following subsections provide detailed per-architectureinformation, including relocation types relevant for linkerimplementation.

AArch32

In A32 state:

  • Branch (b/b<cond>), conditionalbranch and link (bl<cond>)(R_ARM_JUMP24): ±32MiB
  • Unconditional branch and link (bl/blx,R_ARM_CALL): ±32MiB

Note: R_ARM_CALL is for unconditionalbl/blx which can be relaxed to BLX inline;R_ARM_JUMP24 is for branches which require a veneer forinterworking.

In T32 state:

  • Conditional branch (b<cond>,R_ARM_THM_JUMP8): ±256 bytes
  • Short unconditional branch (b,R_ARM_THM_JUMP11): ±2KiB
  • ARMv5T branch and link (bl/blx,R_ARM_THM_CALL): ±4MiB
  • ARMv6T2 wide conditional branch (b<cond>.w,R_ARM_THM_JUMP19): ±1MiB
  • ARMv6T2 wide branch (b.w,R_ARM_THM_JUMP24): ±16MiB
  • ARMv6T2 wide branch and link (bl/blx,R_ARM_THM_CALL): ±16MiB. R_ARM_THM_CALL can berelaxed to BLX.

AArch64

  • Test and compare branches(tbnz/tbz/cbnz/cbz):±32KiB
  • Conditional branches (b.<cond>): ±1MiB
  • Unconditional branches (b/bl):±128MiB

PowerPC

  • Conditional branch (bc/bcl,R_PPC64_REL14): ±32KiB
  • Unconditional branch (b/bl,R_PPC64_REL24/R_PPC64_REL24_NOTOC):±32MiB

RISC-V

  • Compressed c.beqz: ±256 bytes
  • Compressed c.jal: ±2KiB
  • jalr (I-type immediate): ±2KiB
  • Conditional branches(beq/bne/blt/bge/bltu/bgeu,B-type immediate): ±4KiB
  • jal (J-type immediate, PseudoBR):±1MiB
  • PseudoJump (using auipc +jalr): ±2GiB

Qualcomm uC Branch Immediate extension (Xqcibi):

  • qc.beqi/qc.bnei/qc.blti/qc.bgei/qc.bltui/qc.bgeui(32-bit, 5-bit compare immediate): ±4KiB
  • qc.e.beqi/qc.e.bnei/qc.e.blti/qc.e.bgei/qc.e.bltui/qc.e.bgeui(48-bit, 16-bit compare immediate): ±4KiB

Qualcomm uC Long Branch extension (Xqcilb):

  • qc.e.j/qc.e.jal (48-bit,R_RISCV_VENDOR(QUALCOMM)+R_RISCV_QC_E_CALL_PLT): ±2GiB

SPARC

  • Compare and branch (cxbe, R_SPARC_5): ±64bytes
  • Conditional branches (bcc,R_SPARC_WDISP19): ±1MiB
  • call (R_SPARC_WDISP30): ±2GiB

Note: lld does not implement range extension thunks for SPARC.

x86-64

  • Short conditional jump (Jcc rel8): -128 to +127bytes
  • Short unconditional jump (JMP rel8): -128 to +127bytes
  • Near conditional jump (Jcc rel32): ±2GiB
  • Near unconditional jump (JMP rel32): ±2GiB

With a ±2GiB range for near jumps, x86-64 rarely encountersout-of-range branches in practice. A single text section would need toexceed 2GiB before thunks become necessary. For this reason, mostlinkers (including lld) do not implement range extension thunks forx86-64.

Compiler: branch relaxation

The compiler typically generates branches using a form with a large range. However, certain conditional branches may still go out of range within a function.

The compiler measures branch distances and relaxes out-of-range branches. In LLVM, this is handled by the BranchRelaxation pass, which runs just before AsmPrinter.

Different backends have their own implementations:

  • BranchRelaxation: AArch64, AMDGPU, AVR, RISC-V
  • HexagonBranchRelaxation: Hexagon
  • PPCBranchSelector: PowerPC
  • SystemZLongBranch: SystemZ
  • MipsBranchExpansion: MIPS
  • MSP430BSel: MSP430

For a conditional branch that is out of range, the pass typically inverts the condition and inserts an unconditional branch:

# Before relaxation (out of range)
beq a0, a1, .Lfar_target   # ±4KiB range on RISC-V

# After relaxation
bne a0, a1, .Lskip         # Inverted condition, short range
j .Lfar_target             # Unconditional jump, ±1MiB range
.Lskip:
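The transformation above can be sketched as a small fixed-point loop. This is a toy model under stated assumptions, not LLVM's BranchRelaxation pass: every instruction is 4 bytes, labels take no space, and the names `relax` and `INVERT` are hypothetical.

```python
INVERT = {"beq": "bne", "bne": "beq", "blt": "bge", "bge": "blt"}

def relax(insns, cond_range=4096, insn_size=4):
    """insns is a list of (op, arg) pairs. ("label", name) defines a label and
    takes no space; every other entry is one fixed-size instruction."""
    insns = list(insns)
    fresh = 0
    while True:
        # Lay out the function: record each entry's offset and each label's address.
        offsets, labels, pc = [], {}, 0
        for op, arg in insns:
            offsets.append(pc)
            if op == "label":
                labels[arg] = pc
            else:
                pc += insn_size
        # Find the first conditional branch whose target is out of range.
        for i, (op, arg) in enumerate(insns):
            if op in INVERT and not -cond_range <= labels[arg] - offsets[i] < cond_range:
                skip = f".Lskip{fresh}"
                fresh += 1
                # Invert the condition so it hops over an unconditional jump.
                insns[i:i + 1] = [(INVERT[op], skip), ("j", arg), ("label", skip)]
                break  # offsets are stale now, so lay out again
        else:
            return insns  # every conditional branch is in range

# One far branch (8004 bytes away) and one near branch.
prog = [("beq", "far")] + [("nop", None)] * 2000 + [("label", "far"), ("beq", "far")]
print(relax(prog)[:3])
```

Re-running the layout after each rewrite matters because the inserted jump grows the function by 4 bytes, which can push another branch out of range.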

Assembler: instruction relaxation

The assembler converts assembly to machine code. When the target of a branch is within the same section and the distance is known at assembly time, the assembler can select the appropriate encoding. This is distinct from linker thunks, which handle cross-section or cross-object references where distances aren't known until link time.

Assembler instruction relaxation handles two cases (see Clang -O0 output: branch displacement and size increase for examples):

  • Span-dependent instructions: Select a larger encoding when the displacement exceeds the range of the smaller encoding. For x86, a short jump (jmp rel8) can be relaxed to a near jump (jmp rel32).
  • Conditional branch transform: Invert the condition and insert an unconditional branch. On RISC-V, a blt might be relaxed to bge plus an unconditional branch.

The assembler uses an iterative layout algorithm that alternates between fragment offset assignment and relaxation until all fragments become legalized. See Integrated assembler improvements in LLVM 19 for implementation details.
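The alternation between offset assignment and relaxation can be illustrated with a toy span-dependent layout for x86 `jmp`. This is a sketch under simplifying assumptions, not the LLVM integrated assembler: the fragment tuples and the function name `layout` are hypothetical, and only the 2-byte `jmp rel8` and 5-byte `jmp rel32` encodings are modeled.

```python
def layout(frags):
    """frags: list of ("data", nbytes), ("label", name) or ("jmp", target).
    Returns {fragment index: chosen jmp size in bytes (2 or 5)}."""
    size = {i: 2 for i, f in enumerate(frags) if f[0] == "jmp"}  # start optimistic
    while True:
        # Step 1: assign offsets using the current encoding sizes.
        offset, labels, pc = {}, {}, 0
        for i, (kind, arg) in enumerate(frags):
            offset[i] = pc
            if kind == "label":
                labels[arg] = pc
            elif kind == "data":
                pc += arg
            else:
                pc += size[i]
        # Step 2: grow every short jump whose displacement no longer fits.
        grew = False
        for i in size:
            if size[i] == 2:
                disp = labels[frags[i][1]] - (offset[i] + 2)  # relative to next insn
                if not -128 <= disp <= 127:
                    size[i] = 5
                    grew = True
        if not grew:
            return size

# The first jmp fits in rel8 only while the second jmp stays short.
frags = [("jmp", "a"), ("data", 125), ("jmp", "b"),
         ("label", "a"), ("data", 200), ("label", "b")]
print(layout(frags))
```

In this input the second jump grows on the first iteration, which shifts label `a` by 3 bytes and forces the first jump to grow on the next iteration. Because encodings only ever grow, the loop terminates.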

Linker: range extension thunks

When the linker resolves relocations, it may discover that a branch target is out of range. At this point, the instruction encoding is fixed, so the linker cannot simply change the instruction. Instead, it generates range extension thunks (also called veneers, branch stubs, or trampolines).

A thunk is a small piece of linker-generated code that can reach the actual target using a longer sequence of instructions. The original branch is redirected to the thunk, which then jumps to the real destination.

Range extension thunks are one type of linker-generated thunk. Other types include:

  • ARM interworking veneers: Switch between ARM and Thumb instruction sets (see Linker notes on AArch32)
  • MIPS LA25 thunks: Enable PIC and non-PIC code interoperability (see Toolchain notes on MIPS)
  • PowerPC64 TOC/NOTOC thunks: Handle calls between functions using different TOC pointer conventions (see Linker notes on Power ISA)

Short range vs long range thunks

A short range thunk (see lld/ELF's AArch64 implementation) contains just a single branch instruction. Since it uses a branch, its reach is also limited by the branch range: it can only extend coverage by one branch distance. For targets further away, multiple short range thunks can be chained, or a long range thunk with address computation must be used.

Long range thunks use indirection and can jump to (practically) arbitrary locations.

// Short range thunk: single branch, 4 bytes
__AArch64AbsLongThunk_dst:
b dst // ±128MiB range

// Long range thunk: address computation, 12 bytes
__AArch64ADRPThunk_dst:
adrp x16, dst // Load page address (±4GiB range)
add x16, x16, :lo12:dst // Add page offset
br x16 // Indirect branch
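To make the address computation in the long range thunk concrete, here is a small Python sketch of the values a linker would have to encode into the `adrp`/`add` pair. The function names are hypothetical; the arithmetic follows the architectural definition of `adrp` (a signed 21-bit count of 4KiB pages relative to the page of the instruction).

```python
def in_short_range(pc, dst, reach=128 << 20):
    """True if a single AArch64 `b` at `pc` can reach `dst`."""
    return -reach <= dst - pc < reach

def adrp_add_operands(pc, dst):
    """Return (page_delta, lo12) such that
    (pc & ~0xFFF) + (page_delta << 12) + lo12 == dst."""
    page_delta = (dst >> 12) - (pc >> 12)
    if not -(1 << 20) <= page_delta < (1 << 20):  # 21-bit signed page count
        raise ValueError("target is beyond the ±4GiB reach of adrp")
    return page_delta, dst & 0xFFF

# The example from the introduction: caller at 0x10000, callee at 0x8010000.
pc, dst = 0x10000, 0x8010000
print(in_short_range(pc, dst))            # exactly 128MiB away: one `b` is not enough
print(adrp_add_operands(pc, dst))
```

If the page delta itself overflows, neither thunk form in the listing applies and a different sequence would be needed.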

Thunk examples

AArch32 (PIC) (see Linker notes on AArch32):

__ARMV7PILongThunk_dst:
movw ip, :lower16:(dst - .) @ ip = intra-procedure-call scratch register
movt ip, :upper16:(dst - .)
add ip, ip, pc
bx ip

PowerPC64 ELFv2 (see Linker notes on Power ISA):

__long_branch_dst:
addis 12, 2, .branch_lt@ha # Load high bits from branch lookup table
ld 12, .branch_lt@l(12) # Load target address
mtctr 12 # Move to count register
bctr # Branch to count register

Thunk impact on debugging and profiling

Thunks are transparent at the source level but visible in low-level tools:

  • Stack traces: May show thunk symbols (e.g., __AArch64ADRPThunk_foo) between caller and callee
  • Profilers: Samples may attribute time to thunk code; some profilers aggregate thunk time with the target function
  • Disassembly: objdump or llvm-objdump will show thunk sections interspersed with regular code
  • Code size: Each thunk adds bytes; large binaries may have thousands of thunks

lld/ELF's thunk creation algorithm

lld/ELF uses a multi-pass algorithm in finalizeAddressDependentContent:

assignAddresses();
for (pass = 0; pass < 30; ++pass) {
  if (pass == 0)
    createInitialThunkSections();      // pre-create empty ThunkSections
  bool changed = false;
  for (rel : all_relocations) {
    if (pass > 0 && normalizeExistingThunk(rel))
      continue;                        // existing thunk still in range
    if (!needsThunk(rel))
      continue;
    Thunk *t = getOrCreateThunk(rel);
    ts = findOrCreateThunkSection(rel, src);
    ts->addThunk(t);
    rel.sym = t->getThunkTargetSym();  // redirect
    changed = true;
  }
  mergeThunks();                       // insert ThunkSections into output
  if (!changed)
    break;
  assignAddresses();                   // recalculate with new thunks
}

Key details:

  • Multi-pass: Iterates until convergence (max 30 passes). Adding thunks changes addresses, potentially putting previously-in-range calls out of range.
  • Pre-allocated ThunkSections: On pass 0, createInitialThunkSections places empty ThunkSections at regular intervals (thunkSectionSpacing). For AArch64: 128 MiB - 0x30000 ≈ 127.8 MiB.
  • Thunk reuse: getThunk returns existing thunk if one exists for the same target; normalizeExistingThunk checks if a previously-created thunk is still in range.
  • ThunkSection placement: getISDThunkSec finds a ThunkSection within branch range of the call site, or creates one adjacent to the calling InputSection.
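Why several passes are needed can be seen in a toy model. The sketch below is not lld's algorithm: it assumes calls go from the start of the caller to the start of the callee, that each new thunk is placed directly after its calling section (and is therefore reachable), and that a thunk can reach any target. The name `place_thunks` and the tuple layout are hypothetical.

```python
def place_thunks(sections, reach, thunk_size=12, max_passes=30):
    """sections: list of (name, size, callees).
    Returns ({caller: [callees reached via thunk]}, number of passes)."""
    thunks = {name: [] for name, _, _ in sections}
    for passes in range(1, max_passes + 1):
        # Assign addresses; each section is followed by the thunks it owns.
        addr, pc = {}, 0
        for name, size, _ in sections:
            addr[name] = pc
            pc += size + thunk_size * len(thunks[name])
        changed = False
        for name, _, callees in sections:
            for callee in callees:
                if callee not in thunks[name] and abs(addr[callee] - addr[name]) > reach:
                    thunks[name].append(callee)  # redirect this call via a new thunk
                    changed = True
        if not changed:
            return thunks, passes
    raise RuntimeError("thunk creation did not converge")

sections = [("a", 40, ["c", "e"]), ("b", 55, []), ("c", 10, []), ("e", 10, [])]
print(place_thunks(sections, reach=100))
```

In this input the call to `e` needs a thunk on the first pass. The 12 bytes that thunk occupies push `c` from offset 95 to 107, so the previously in-range call to `c` needs its own thunk on the second pass, and the third pass confirms that nothing else changed.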

lld/MachO's thunk creation algorithm

lld/MachO uses a single-pass algorithm in TextOutputSection::finalize:

for (callIdx = 0; callIdx < inputs.size(); ++callIdx) {
  // Finalize sections within forward branch range (minus slop)
  while (finalIdx < endIdx && fits_in_range(inputs[finalIdx]))
    finalizeOne(inputs[finalIdx++]);

  // Process branch relocations in this section
  isec = inputs[callIdx];
  for (Relocation &r : reverse(isec->relocs)) {
    if (!isBranchReloc(r)) continue;
    if (targetInRange(r)) continue;
    if (existingThunkInRange(r)) { reuse it; continue; }
    // Create new thunk and finalize it
    createThunk(r);
  }
}

Key differences from lld/ELF:

  • Single pass: Addresses are assigned monotonically and never revisited
  • Slop reservation: Reserves slopScale * thunkSize bytes (default: 256 × 12 = 3072 bytes on ARM64) to leave room for future thunks
  • Thunk naming: <function>.thunk.<sequence> where sequence increments per target

Thunk starvation problem: If many consecutive branches need thunks, each thunk (12 bytes) consumes slop faster than call sites (4 bytes apart) advance. The test lld/test/MachO/arm64-thunk-starvation.s demonstrates this edge case. Mitigation is increasing --slop-scale, but pathological cases with hundreds of consecutive out-of-range callees can still fail.

mold's thunk creation algorithm

mold uses a two-pass approach: first pessimistically over-allocate thunks, then remove unnecessary ones.

Intuition: It's safe to allocate thunk space and later shrink it, but unsafe to add thunks after addresses are assigned (would create gaps breaking existing references).

Pass 1 (create_range_extension_thunks): Process sections in batches using a sliding window. The window tracks four positions:

Sections:   [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] ...
                 ^       ^       ^           ^
                 A       B       C           D
                 |       |_______|           |
                 |         batch             |
                 |                           |
             earliest                      thunk
             reachable                   placement
              from C

  • [B, C) = current batch of sections to process (size ≤ branch_distance/5)
  • A = earliest section still reachable from C (for thunk expiration)
  • D = where to place the thunk (furthest point reachable from B)

// Simplified from OutputSection<E>::create_range_extension_thunks
while (b < sections.size()) {
  // Advance D: find furthest point where thunk is reachable from B
  while (d < size && thunk_at_d_reachable_from_b)
    assign_address(sections[d++]);

  // Compute batch [B, C)
  c = b + 1;
  while (c < d && sections[c] < sections[b] + batch_size) c++;

  // Advance A: expire thunks no longer reachable
  while (a < b && sections[a] + branch_distance < sections[c]) a++;
  // Expire thunk groups before A: clear symbol flags.
  for (; t < thunks.size() && thunks[t].offset < sections[a]; t++)
    for (sym in thunks[t].symbols) sym->flags = 0;

  // Scan [B,C) relocations. If a symbol is not assigned to a thunk group yet,
  // assign it to the new thunk group at D.
  auto &thunk = thunks.emplace_back(new Thunk(offset));
  parallel_for(b, c, [&](i64 i) {
    for (rel in sections[i].relocs) {
      if (requires_thunk(rel)) {
        Symbol &sym = rel.symbol;
        if (!sym.flags.test_and_set()) {  // atomic: skip if already set
          lock_guard lock(mu);
          thunk.symbols.push_back(&sym);
        }
      }
    }
  });
  offset += thunk.size();
  b = c;  // Move to next batch
}

Pass 2 (remove_redundant_thunks): After final addresses are known, remove thunk entries for symbols actually in range.

Key characteristics:

  • Pessimistic over-allocation: Assumes all out-of-section calls need thunks; safe to shrink later
  • Batch size: branch_distance/5 (25.6 MiB for AArch64, 3.2 MiB for AArch32)
  • Parallelism: Uses TBB for parallel relocation scanning within each batch
  • Single branch range: Uses one conservative branch_distance per architecture. For AArch32, uses ±16 MiB (Thumb limit) for all branches, whereas lld/ELF uses ±32 MiB for A32 branches.
  • Thunk size not accounted in D-advancement: The actual thunk group size is unknown when advancing D, so the end of a large thunk group may be unreachable from the beginning of the batch.
  • No convergence loop: Single forward pass for address assignment, no risk of non-convergence
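The batching step of pass 1 can be isolated in a few lines. The sketch below only models how the [B, C) windows are cut from a list of section start addresses; it ignores the A and D cursors, address assignment, and thunk expiration, and the function name `batches` is hypothetical rather than taken from mold.

```python
def batches(starts, batch_size):
    """Split sections (given by ascending start addresses) into [b, c) index
    ranges whose first and last section start less than batch_size apart."""
    out, b = [], 0
    while b < len(starts):
        c = b + 1  # a batch always contains at least one section
        while c < len(starts) and starts[c] < starts[b] + batch_size:
            c += 1
        out.append((b, c))
        b = c
    return out

MiB = 1 << 20
branch_distance = 128 * MiB          # AArch64
print(branch_distance / 5 / MiB)     # batch size in MiB
print(batches([0, 10, 20, 30, 40, 50], batch_size=25))
```

Keeping a batch much smaller than the branch distance leaves headroom so that every section in the batch can still reach the thunk group placed at D.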

Comparing thunk algorithms

Aspect | lld/ELF | lld/MachO | mold
Passes | Multi-pass (max 30) | Single-pass | Two-pass
Strategy | Iterative refinement | Greedy | Greedy
Thunk placement | Pre-allocated at intervals | Inline with slop reservation | Batch-based at intervals
Convergence | Always (bounded iterations) | Almost always | Almost always
Range handling | Per-relocation type | Single conservative range | Single conservative range
Parallelism | Sequential | Sequential | Parallel (TBB)

Linker relaxation (RISC-V)

RISC-V takes a different approach: instead of only expanding branches, it can also shrink instruction sequences when the target is close enough.

Consider a function call using the call pseudo-instruction, which expands to auipc + jalr:

# Before linking (8 bytes)
call ext
# Expands to:
# auipc ra, %pcrel_hi(ext)
# jalr ra, ra, %pcrel_lo(ext)

If ext is within ±1MiB, the linker can relax this to:

# After relaxation (4 bytes)
jal ext

This is enabled by R_RISCV_RELAX relocations that accompany R_RISCV_CALL relocations. The R_RISCV_RELAX relocation signals to the linker that this instruction sequence is a candidate for shrinking.
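The decision the linker makes for each relaxable call boils down to a range check on the PC-relative offset. The Python sketch below shows that check together with the usual hi20/lo12 split used when the `auipc` + `jalr` pair has to stay; the function names are hypothetical.

```python
def can_relax_to_jal(offset):
    """jal encodes a 21-bit signed, 2-byte-aligned offset: [-1MiB, 1MiB)."""
    return -(1 << 20) <= offset < (1 << 20) and offset % 2 == 0

def split_hi_lo(offset):
    """Split a PC-relative offset into the auipc (hi20) and jalr (lo12) parts.
    Adding 0x800 compensates for jalr sign-extending its 12-bit immediate."""
    hi20 = (offset + 0x800) >> 12
    lo12 = offset - (hi20 << 12)
    return hi20, lo12

for offset in (0x18, (1 << 20) - 2, 1 << 20, 0x12345FFE):
    if can_relax_to_jal(offset):
        print(hex(offset), "-> jal (4 bytes)")
    else:
        print(hex(offset), "-> auipc+jalr (8 bytes), hi20/lo12 =", split_hi_lo(offset))
```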

Example object code before linking:

0000000000000006 <foo>:
       6: 97 00 00 00   auipc ra, 0
                        R_RISCV_CALL ext
                        R_RISCV_RELAX *ABS*
       a: e7 80 00 00   jalr ra
       e: 97 00 00 00   auipc ra, 0
                        R_RISCV_CALL ext
                        R_RISCV_RELAX *ABS*
      12: e7 80 00 00   jalr ra

After linking with relaxation enabled, the 8-byte auipc+jalr pairs become 4-byte jal instructions:

0000000000000244 <foo>:
     244: 41 11         addi sp, sp, -16
     246: 06 e4         sd ra, 8(sp)
     248: ef 00 80 01   jal ext
     24c: ef 00 40 01   jal ext
     250: ef 00 00 01   jal ext

When the linker deletes instructions, it must also adjust:

  • Subsequent instruction offsets within the section
  • Symbol addresses
  • Other relocations that reference affected locations
  • Alignment directives (R_RISCV_ALIGN)
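The bookkeeping in the list above amounts to remapping every section offset past the deleted bytes. A minimal sketch, assuming deletions are given as (offset, nbytes) pairs in original section offsets; the function name `adjust` is hypothetical and alignment handling (R_RISCV_ALIGN) is deliberately left out.

```python
def adjust(offset, deletions):
    """Map an original section offset to its offset after the given byte
    ranges (start, nbytes) have been deleted from the section."""
    shift = 0
    for start, nbytes in deletions:
        if offset >= start + nbytes:
            shift += nbytes          # the whole deleted range lies before us
        elif offset > start:
            shift += offset - start  # inside a deleted range: clamp to its start
    return offset - shift

# Two auipc+jalr pairs at 0x6 and 0xe each shrink to a 4-byte jal,
# so the trailing 4 bytes of each pair (at 0xa and 0x12) are deleted.
deletions = [(0xA, 4), (0x12, 4)]
for original in (0x6, 0xE, 0x16):
    print(hex(original), "->", hex(adjust(original, deletions)))
```

The same mapping would apply to symbol values and to the offsets of other relocations in the section.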

This makes RISC-V linker relaxation more complex than thunk insertion, but it provides code size benefits that other architectures cannot achieve at link time.

Diagnosing out-of-range errors

When you encounter a "relocation out of range" error, here are some diagnostic steps:

  1. Check the error message: lld reports the source location, relocation type, and the distance. For example:

    ld.lld: error: a.o:(.text+0x1000): relocation R_AARCH64_CALL26 out of range: 150000000 is not in [-134217728, 134217727]

  2. Use --verbose or -Map: Generate a link map to see section layout and identify which sections are far apart.

  3. Consider -ffunction-sections: Splitting functions into separate sections gives the linker more flexibility in placement, potentially reducing distances.

  4. Check for large data in .text: Embedded data (jump tables, constant pools) can push functions apart. Some compilers have options to place these elsewhere.

  5. LTO considerations: Link-time optimization can dramatically change code layout. If thunk-related issues appear only with LTO, the optimizer may be creating larger functions or different inlining decisions.
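When triaging many such errors, it can help to extract the relocation type and how far each value overshoots the permitted interval. The sketch below parses only the message shape quoted in step 1; the regular expression and the function name `overshoot` are assumptions of this example, and other linkers or lld versions may word the message differently.

```python
import re

PATTERN = re.compile(
    r"relocation (?P<type>\S+) out of range: (?P<value>-?\d+) is not in "
    r"\[(?P<lo>-?\d+), (?P<hi>-?\d+)\]")

def overshoot(message):
    """Return (relocation type, bytes beyond the nearest bound), or None
    if the message does not look like an out-of-range diagnostic."""
    m = PATTERN.search(message)
    if m is None:
        return None
    value, lo, hi = (int(m[k]) for k in ("value", "lo", "hi"))
    excess = value - hi if value > hi else lo - value
    return m["type"], excess

msg = ("ld.lld: error: a.o:(.text+0x1000): relocation R_AARCH64_CALL26 "
       "out of range: 150000000 is not in [-134217728, 134217727]")
print(overshoot(msg))
```

A small overshoot suggests that reordering or splitting sections may be enough, while a large one points at a layout where thunks or a different code model are needed.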

Summary

Handling long branches requires coordination across the toolchain:

Stage | Technique | Example
Compiler | Branch relaxation pass | Invert condition + add unconditional jump
Assembler | Instruction relaxation | Short jump to near jump
Linker | Range extension thunks | Generate trampolines
Linker | Linker relaxation | Shrink auipc+jalr to jal (RISC-V)

The linker's thunk generation is particularly important for large programs where cross-compilation-unit calls may exceed branch ranges. Different linkers use different algorithms with various tradeoffs between complexity, optimality, and robustness.

RISC-V's linker relaxation is unique in that it can both expand and shrink code, optimizing for both correctness and code size.

Maintaining shadow branches for GitHub PRs

I've created pr-shadow with vibe coding. It is a tool that maintains a shadow branch for a GitHub pull request (PR) so that the PR never requires force-pushing. This addresses pain points I described in Reflections on LLVM's switch to GitHub pull requests#Patch evolution.

The problem

GitHub structures pull requests around branches, enforcing a branch-centric workflow. There are multiple problems when you force-push a branch after a rebase:

  • The UI displays "force-pushed the BB branch from X to Y". Clicking "compare" shows git diff X..Y, which includes unrelated upstream commits, not the actual patch difference. For a project like LLVM with 100+ commits daily, this makes the comparison essentially useless.
  • Inline comments may become "outdated" or misplaced after force pushes.
  • If your commit message references an issue or another PR, each force push creates a new link on the referenced page, cluttering it with duplicate mentions. (Adding backticks around the link text works around this, but it's not ideal.)

These difficulties lead to recommendations favoring less flexible workflows that only append commits (including merge commits) and discourage rebases. However, this means working with an outdated base, and switching between the main branch and PR branches causes numerous rebuilds, which is especially painful for large repositories like llvm-project.

git switch main; git pull; ninja -C build

# Switching to a feature branch with an outdated base requires numerous rebuilds.
git switch feature0
git merge origin/main # I prefer `git rebase main` to remove merge commits, which clutter the history
ninja -C out/release

# Switching to another feature branch with an outdated base requires numerous rebuilds.
git switch feature1
git merge origin/main
ninja -C out/release

# Listing fixup commits ignoring upstream merges requires the clumsy --first-parent.
git log --first-parent

In a large repository, avoiding rebases isn't realistic: other commits frequently modify nearby lines, and rebasing is often the only way to discover that your patch needs adjustments due to interactions with other landed changes.

In 2022, GitHub introduced "Pull request title and description" for squash merging. This means updating the final commit message requires editing via the web UI. I prefer editing the local commit message and syncing the PR description from it.

The solution

After updating my main branch, before switching to a feature branch, I always run

git rebase main feature

to minimize the number of modified files. To avoid the force-push problems, I use pr-shadow to maintain a shadow PR branch (e.g., pr/feature) that only receives fast-forward commits (including merge commits).

I work freely on my local branch (rebase, amend, squash), then sync to the PR branch using git commit-tree to create a commit with the same tree but parented to the previous PR HEAD.

Local branch (feature)     PR branch (pr/feature)
        A                         A (init)
        |                         |
        B (amend)                 C1 "Fix bug"
        |                         |
        C (rebase)                C2 "Address review"

Reviewers see clean diffs between C1 and C2, even though the underlying commits were rewritten.

When a rebase is detected (git merge-base with main/master changed), the new PR commit is created as a merge commit with the new merge-base as the second parent. GitHub displays these as "condensed" merges, preserving the diff view for reviewers.
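The two cases described above differ only in the parents passed to git commit-tree. The following sketch builds the command line without running it; it is an illustration of the idea, not pr-shadow's actual implementation, and the function name `commit_tree_argv` and the example ref names are assumptions.

```python
def commit_tree_argv(tree, parents, message):
    """Build a `git commit-tree` command that creates a commit with the
    given tree, one `-p` per parent, and the given message."""
    argv = ["git", "commit-tree", tree]
    for parent in parents:
        argv += ["-p", parent]
    return argv + ["-m", message]

# Plain update: same tree as the local branch, parented to the PR branch head.
print(commit_tree_argv("feature^{tree}", ["pr/feature"], "Fix bug"))

# After a rebase: additionally record the new merge-base as the second parent.
print(commit_tree_argv("feature^{tree}", ["pr/feature", "NEW_MERGE_BASE"],
                       "Rebase and fix bug"))
```

git commit-tree prints the ID of the new commit, which can then be pushed to the PR branch as a fast-forward update.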

Usage

# Initialize and create PR
git switch -c feature
edit && git commit -m feature

# Set `git merge-base origin/main feature` as the initial base. Push to pr/feature and open a GitHub PR.
prs init
# Same but create a draft PR. Repeated `init`s are rejected.
prs init --draft

# Work locally (rebase, amend, etc.)
git fetch origin main:main
git rebase main
git commit --amend

# Sync to PR
prs push "Rebase and fix bug"
# Force push if remote diverged due to messing with pr/feature directly.
prs push --force "Rewrite"

# Update PR title/body from local commit message.
prs desc

# Run gh commands on the PR.
prs gh view
prs gh checks

The tool supports both fork-based workflows (pushing to your fork) and same-repo workflows (for branches like user/<name>/feature). It also works with GitHub Enterprise, auto-detecting the host from the repository URL.

Related work

The name "prs" is a tribute to spr, which implements a similar shadow branch concept. However, spr pushes user branches to the main repository rather than a personal fork. While necessary for stacked pull requests, this approach is discouraged for single PRs as it clutters the upstream repository. pr-shadow avoids this by pushing to your fork by default.

I owe an apology to folks who receive users/MaskRay/feature branches (if they use the default fetch = +refs/heads/*:refs/remotes/origin/* to receive user branches). I had been abusing spr for a long time after LLVM's GitHub transition to avoid unnecessary rebuilds when switching between the main branch and PR branches.

Additionally, spr embeds a PR URL in commit messages (e.g., Pull Request: https://github.com/llvm/llvm-project/pull/150816), which can cause downstream forks to add unwanted backlinks to the original PR.

If I need stacked pull requests, I will probably use pr-shadow with the base patch and just rebase stacked ones - it's unclear how spr handles stacked PRs.
