Stack walking: space and time trade-offs

On most Linux platforms (except AArch32, which uses .ARM.exidx), DWARF .eh_frame is required for C++ exception handling and stack unwinding to restore callee-saved registers. While .eh_frame can be used for call trace recording, it is often criticized for its runtime overhead. As an alternative, developers can enable frame pointers, or adopt SFrame, a newer format designed specifically for profiling. This article examines the size overhead of enabling non-DWARF stack walking mechanisms when building several LLVM executables.

Runtime performance analysis will be added in a future update.

Stack walking mechanisms

Here is a survey of mechanisms available for x86-64:

  • Frame pointers: fast, but cost a register
  • DWARF .eh_frame: comprehensive but slower; supports additional features like C++ exception handling
  • SFrame: a new format under development, for profiling only. .eh_frame is still needed for debugging and C++ exception handling. Check out Remarks on SFrame for details.
  • x86 Last Branch Record (LBR): Skylake increased the LBR stack size to 32. AMD Zen 4 supports it as Last Branch Record Extension Version 2 (LbrExtV2).
  • Apple's Compact Unwinding Format: implemented in llvm, lld/MachO, and libunwind. Supports x86-64 and AArch64. It can mostly replace DWARF CFI, but some entries need a DWARF escape.
  • OpenVMS's Compact Unwinding Format: a modification of Apple's Compact Unwinding Format.

Space overhead analysis

Frame pointer size impact

For most architectures, GCC defaults to -fomit-frame-pointer in -O compilation to free up a register for general use. To enable frame pointers, specify -fno-omit-frame-pointer, which reserves the frame pointer register (e.g., rbp on x86-64) and emits push/pop instructions in function prologues/epilogues.

For leaf functions (those that don't call other functions), while the frame pointer register should still be reserved for consistency, the push/pop operations are often unnecessary. Compilers provide -momit-leaf-frame-pointer (with target-specific defaults) to reduce code size.

The viability of this optimization depends on the target architecture:

  • On AArch64, the return address is available in the link register (X30). The immediate caller can be retrieved by inspecting X30, so -momit-leaf-frame-pointer does not compromise unwinding.
  • On x86-64, after the prologue instructions execute, the return address is stored at RSP plus an offset. An unwinder needs to know the stack frame size to retrieve the return address, or it must utilize DWARF information for the leaf frame and then switch to the FP chain for parent frames.

Beyond this architectural consideration, there are additional practical reasons to use -momit-leaf-frame-pointer on x86-64:

  • Many hand-written assembly implementations (including numerous glibc functions) don't establish frame pointers, creating gaps in the frame pointer chain anyway.
  • In the prologue sequence push rbp; mov rbp, rsp, after the first instruction executes, RBP does not yet reference the current stack frame. When shrink-wrapping optimizations are enabled, the instruction region where RBP still holds the old value becomes larger, increasing the window where the frame pointer is unreliable.
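
To make the "frame pointer chain" concrete, here is a minimal sketch of an FP-based walker. It runs against a simulated stack rather than a live process: the slot-indexed memory model, the name walk_fp_chain, and the rule that a saved RBP of 0 ends the chain are all illustrative assumptions. Only the layout (caller's RBP at [rbp], return address at [rbp+8]) is the standard x86-64 convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: stack[i] is the 8-byte slot at address base + 8*i.
// With frame pointers, [rbp] holds the caller's RBP and [rbp+8] holds the
// return address, so the frames form a singly linked list.
std::vector<uint64_t> walk_fp_chain(const std::vector<uint64_t> &stack,
                                    uint64_t base, uint64_t rbp,
                                    size_t max_depth = 64) {
  assert(base % 8 == 0 && "the simulated stack must be 8-byte aligned");
  std::vector<uint64_t> return_addresses;
  while (rbp != 0 && return_addresses.size() < max_depth) {
    // Reject misaligned or out-of-range frame pointers instead of crashing.
    if (rbp < base || (rbp - base) % 8 != 0)
      break;
    size_t slot = (rbp - base) / 8;
    if (slot + 1 >= stack.size())
      break;
    return_addresses.push_back(stack[slot + 1]); // return address at [rbp+8]
    uint64_t next = stack[slot];                 // caller's RBP at [rbp]
    if (next <= rbp) // the stack grows down, so callers sit at higher addresses
      break;
    rbp = next;
  }
  return return_addresses;
}
```

A leaf function built with -momit-leaf-frame-pointer never links itself into this list, so a walker like this starts from its caller's frame and misses the leaf's immediate return address unless it recovers it some other way, which is the x86-64 caveat described above.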

Given these trade-offs, three common configurations have emerged:

  • omitting FP: -fomit-frame-pointer -momit-leaf-frame-pointer (smallest overhead)
  • reserving FP, but removing FP push/pop for leaf functions: -fno-omit-frame-pointer -momit-leaf-frame-pointer (frame pointer chain omitting the leaf frame)
  • reserving FP: -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer (complete frame pointer chain, largest overhead)

The size impact varies significantly by program. Here's a Ruby script section_size.rb that compares section sizes:

% ~/Dev/unwind-info-size-analyzer/section_size.rb /tmp/out/custom-{none,nonleaf,all}/bin/{llvm-mc,opt}
Filename | .text size | EH size | VM size | VM increase
------------------------------------+------------------+----------------+----------+------------
/tmp/out/custom-none/bin/llvm-mc | 2114687 (23.7%) | 367992 (4.1%) | 8914057 | -
/tmp/out/custom-nonleaf/bin/llvm-mc | 2124143 (24.0%) | 301688 (3.4%) | 8856713 | -0.6%
/tmp/out/custom-all/bin/llvm-mc | 2149535 (24.0%) | 362408 (4.1%) | 8942729 | +0.3%
/tmp/out/custom-none/bin/opt | 39018511 (70.2%) | 4561112 (8.2%) | 55583965 | -
/tmp/out/custom-nonleaf/bin/opt | 38879897 (71.4%) | 3542288 (6.5%) | 54424789 | -2.1%
/tmp/out/custom-all/bin/opt | 38980905 (71.0%) | 3888624 (7.1%) | 54871285 | -1.3%

For instance, llvm-mc is dominated by read-only data, making the relative .text percentage quite small, so frame pointer impact on the VM size is minimal. ("VM size" is a metric used by bloaty, representing the total p_memsz size of PT_LOAD segments, excluding alignment padding.) As expected, the .text section of llvm-mc grows larger as more functions set up the frame pointer chain. However, opt actually becomes smaller when -fno-omit-frame-pointer is enabled, a counterintuitive result that warrants explanation.

Without frame pointer, the compiler uses RSP-relative addressing to access stack objects. When using the register-indirect + disp8/disp32 addressing mode, RSP needs an extra SIB byte while RBP doesn't. For larger functions accessing many local variables, the savings from shorter RBP-relative encodings can outweigh the additional push rbp; mov rbp, rsp; pop rbp instructions in the prologues/epilogues.

% echo 'mov rax, [rsp+8]; mov rax, [rbp-8]' | /tmp/Rel/bin/llvm-mc -x86-asm-syntax=intel -output-asm-variant=1 -show-encoding
mov rax, qword ptr [rsp + 8] # encoding: [0x48,0x8b,0x44,0x24,0x08]
mov rax, qword ptr [rbp - 8] # encoding: [0x48,0x8b,0x45,0xf8]

# ModR/M byte 0x44: Mod=01 (register-indirect addressing + disp8), Reg=0 (dest reg RAX), R/M=100 (SIB byte follows)
# ModR/M byte 0x45: Mod=01 (register-indirect addressing + disp8), Reg=0 (dest reg RAX), R/M=101 (RBP)
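
The same byte counts can be reproduced with a tiny encoder. This is a sketch that handles only mov rax, qword ptr [base + disp] for the two stack-related base registers; the function name and the Base enum are mine, and it is not a general x86-64 assembler.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class Base : uint8_t { RSP = 4, RBP = 5 }; // register numbers used in ModR/M

// Encode `mov rax, qword ptr [base + disp]` (illustrative helper).
std::vector<uint8_t> encode_mov_rax_load(Base base, int32_t disp) {
  std::vector<uint8_t> out = {0x48, 0x8b}; // REX.W prefix, MOV r64, r/m64
  bool fits8 = disp >= -128 && disp <= 127;
  // Mod=00 (no displacement) is unavailable for RBP: with R/M=101 that
  // encoding means RIP-relative, so RBP always needs at least a disp8.
  bool no_disp = disp == 0 && base == Base::RSP;
  uint8_t mod = no_disp ? 0 : (fits8 ? 1 : 2);
  uint8_t rm = static_cast<uint8_t>(base);
  out.push_back(static_cast<uint8_t>((mod << 6) | rm)); // Reg=0 selects RAX
  if (base == Base::RSP)
    out.push_back(0x24); // SIB: scale=0, index=100 (none), base=100 (RSP)
  if (!no_disp) {
    int bytes = fits8 ? 1 : 4;
    for (int i = 0; i < bytes; ++i) // little-endian displacement
      out.push_back(
          static_cast<uint8_t>(static_cast<uint32_t>(disp) >> (8 * i)));
  }
  return out;
}
```

For any non-zero displacement the RSP form is exactly one byte longer than the RBP form, which is the per-access saving that can add up in large functions.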

SFrame vs .eh_frame

Oracle is advocating for SFrame adoption in Linux distributions. The SFrame implementation is handled by the assembler and linker rather than the compiler. Let's build the latest binutils-gdb to test it.

Building test program

We'll use the clang compiler from https://github.com/llvm/llvm-project/tree/release/21.x as our test program.

There are still issues related to garbage collection (object file format design issue), so I'll just disable -Wl,--gc-sections.

--- i/llvm/cmake/modules/AddLLVM.cmake
+++ w/llvm/cmake/modules/AddLLVM.cmake
@@ -331,4 +331,4 @@ function(add_link_opts target_name)
# TODO Revisit this later on z/OS.
- set_property(TARGET ${target_name} APPEND_STRING PROPERTY
- LINK_FLAGS " -Wl,--gc-sections")
+ #set_property(TARGET ${target_name} APPEND_STRING PROPERTY
+ # LINK_FLAGS " -Wl,--gc-sections")
endif()

configure-llvm custom-sframe -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang' -DLLVM_ENABLE_UNWIND_TABLES=on -DLLVM_ENABLE_LLD=off -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-fuse-ld=bfd -DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc -DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++ -DCMAKE_C_FLAGS="-B$HOME/opt/binutils/bin -Wa,--gsframe" -DCMAKE_CXX_FLAGS="-B$HOME/opt/binutils/bin -Wa,--gsframe"
ninja -C /tmp/out/custom-sframe clang

% ~/Dev/bloaty/out/release/bloaty /tmp/out/custom-sframe/bin/clang
FILE SIZE VM SIZE
-------------- --------------
63.9% 88.0Mi 73.9% 88.0Mi .text
11.1% 15.2Mi 0.0% 0 .strtab
7.2% 9.96Mi 8.4% 9.96Mi .rodata
6.4% 8.87Mi 7.5% 8.87Mi .sframe
5.1% 7.07Mi 5.9% 7.07Mi .eh_frame
2.9% 3.96Mi 0.0% 0 .symtab
1.4% 1.98Mi 1.7% 1.98Mi .data.rel.ro
0.9% 1.23Mi 1.0% 1.23Mi [LOAD #4 [R]]
0.7% 999Ki 0.8% 999Ki .eh_frame_hdr
0.0% 0 0.5% 614Ki .bss
0.2% 294Ki 0.2% 294Ki .data
0.0% 23.1Ki 0.0% 23.1Ki .rela.dyn
0.0% 8.99Ki 0.0% 8.99Ki .dynstr
0.0% 8.77Ki 0.0% 8.77Ki .dynsym
0.0% 7.24Ki 0.0% 7.24Ki .rela.plt
0.0% 6.73Ki 0.0% 0 [Unmapped]
0.0% 6.29Ki 0.0% 3.84Ki [21 Others]
0.0% 4.84Ki 0.0% 4.84Ki .plt
0.0% 3.36Ki 0.0% 3.30Ki .init_array
0.0% 2.50Ki 0.0% 2.50Ki .hash
0.0% 2.44Ki 0.0% 2.44Ki .got.plt
100.0% 137Mi 100.0% 119Mi TOTAL
% ~/Dev/unwind-info-size-analyzer/eh_size.rb /tmp/out/custom-sframe/bin/clang
clang: sframe=9303875 eh_frame=7408976 eh_frame_hdr=1023004 eh=8431980 sframe/eh_frame=1.2558 sframe/eh=1.1034

The results show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (7.07 MiB + 999 KiB ≈ 8.04 MiB). While SFrame is designed for efficiency during stack walking, it carries a non-trivial space overhead compared to traditional DWARF unwind information.
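
As a sanity check on those percentages, the ratios can be recomputed from the raw byte counts printed by eh_size.rb. The constants below are copied from that output; the helper names are just for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Byte counts copied from the eh_size.rb output above.
constexpr uint64_t kSFrame = 9303875;
constexpr uint64_t kEhFrame = 7408976;
constexpr uint64_t kEhFrameHdr = 1023004;

// Convert bytes to MiB, the unit bloaty prints.
constexpr double to_mib(uint64_t bytes) { return bytes / (1024.0 * 1024.0); }

// Size of `a` relative to `b`; 1.10 means "10% larger".
constexpr double ratio(uint64_t a, uint64_t b) {
  return static_cast<double>(a) / static_cast<double>(b);
}
```

ratio(kSFrame, kEhFrame + kEhFrameHdr) is the sframe/eh figure and ratio(kSFrame, kEhFrame) is the sframe/eh_frame figure, so .sframe is roughly 10% larger than both EH sections combined and roughly 26% larger than .eh_frame alone.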

SFrame vs FP

Having examined SFrame's overhead compared to .eh_frame, let's now compare the two primary approaches for non-hardware-assisted stack walking.

  • Frame pointer approach: Reserve FP but omit push/pop for leaf functions: g++ -fno-omit-frame-pointer -momit-leaf-frame-pointer
  • SFrame approach: Omit FP and use SFrame metadata: g++ -fomit-frame-pointer -momit-leaf-frame-pointer -Wa,--gsframe

To conduct a fair comparison, we build LLVM executables using both approaches with both Clang and GCC compilers. The following script configures and builds test binaries with each combination:

#!/bin/zsh
conf() {
  configure-llvm $@ -DCMAKE_EXE_LINKER_FLAGS='-pie -Wl,-z,pack-relative-relocs' -DLLVM_ENABLE_UNWIND_TABLES=on \
    -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-fuse-ld=bfd -DLLVM_ENABLE_LLD=off
}

clang=-fno-integrated-as
gcc=("-DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc" "-DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++")

fp="-fno-omit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe=no"
sframe="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe"

conf custom-fp -DCMAKE_{C,CXX}_FLAGS="$clang $fp"
conf custom-sframe -DCMAKE_{C,CXX}_FLAGS="$clang $sframe"
conf custom-fp-gcc -DCMAKE_{C,CXX}_FLAGS="$fp" ${gcc[@]}
conf custom-sframe-gcc -DCMAKE_{C,CXX}_FLAGS="$sframe" ${gcc[@]}

for i in fp sframe fp-gcc sframe-gcc; do ninja -C /tmp/out/custom-$i llvm-mc opt; done

The results reveal interesting differences between compiler implementations:

% ~/Dev/unwind-info-size-analyzer/section_size.rb /tmp/out/custom-{fp,sframe,fp-gcc,sframe-gcc}/bin/{llvm-mc,opt}
Filename | .text size | EH size | .sframe size | VM size | VM increase
---------------------------------------+------------------+----------------+----------------+----------+------------
/tmp/out/custom-fp/bin/llvm-mc | 2124031 (23.5%) | 301136 (3.3%) | 0 (0.0%) | 9050149 | -
/tmp/out/custom-sframe/bin/llvm-mc | 2114383 (22.3%) | 367452 (3.9%) | 348235 (3.7%) | 9483621 | +4.8%
/tmp/out/custom-fp-gcc/bin/llvm-mc | 2744214 (29.2%) | 301836 (3.2%) | 0 (0.0%) | 9389677 | +3.8%
/tmp/out/custom-sframe-gcc/bin/llvm-mc | 2705860 (27.7%) | 354292 (3.6%) | 356073 (3.6%) | 9780985 | +8.1%
/tmp/out/custom-fp/bin/opt | 38872825 (69.9%) | 3538408 (6.4%) | 0 (0.0%) | 55598265 | -
/tmp/out/custom-sframe/bin/opt | 39011167 (62.4%) | 4557012 (7.3%) | 4452908 (7.1%) | 62494509 | +12.4%
/tmp/out/custom-fp-gcc/bin/opt | 54654471 (78.1%) | 3631068 (5.2%) | 0 (0.0%) | 70001565 | +25.9%
/tmp/out/custom-sframe-gcc/bin/opt | 53644639 (70.4%) | 4857236 (6.4%) | 5263558 (6.9%) | 76205645 | +37.1%

  • SFrame incurs a significant VM size increase.
  • GCC-built binaries are significantly larger than their Clang counterparts, probably due to more aggressive inlining or vectorization strategies.

With Clang-built binaries, the frame pointer configuration produces a smaller opt executable (55.6 MB of VM size) compared to the SFrame configuration (62.5 MB). Most of that gap is .sframe itself plus the larger EH sections, but .text is also slightly smaller (38872825 vs 39011167 bytes), which is consistent with our earlier observation that RBP-relative addressing can be more compact than RSP-relative addressing for large functions with frequent local variable accesses.

Assembly comparison reveals that functions using RBP and RSP addressing produce quite similar code.

In contrast, GCC-built binaries show the opposite trend in .text: the frame pointer version of opt has a larger .text (54654471 bytes) than the SFrame version (53644639 bytes), even though its VM size (70.0 MB) remains smaller than that of the SFrame version (76.2 MB).

The generated assembly differs significantly between omit-FP and non-omit-FP builds, so I have compared symbol sizes between the two GCC builds.

nvim -d =(/tmp/Rel/bin/llvm-nm -U --size-sort /tmp/out/custom-fp-gcc/bin/llvm-mc) =(/tmp/Rel/bin/llvm-nm -U --size-sort /tmp/out/custom-sframe-gcc/bin/llvm-mc)

Many functions, such as _ZN4llvm15ELFObjectWriter24executePostLayoutBindingEv, have significantly more instructions in the keep-FP build. This suggests that GCC's frame pointer code generation may not be as optimized as its default omit-FP path.

Runtime performance analysis

TODO

perf record overhead with EH

perf record overhead with FP

Summary

This article examines the space overhead of different stack walking mechanisms when building LLVM executables.

Frame pointer configurations: Enabling frame pointers (-fno-omit-frame-pointer) can paradoxically reduce x86-64 binary size when stack object accesses are frequent. This occurs because RBP-relative addressing produces more compact encodings than RSP-relative addressing, which requires an extra SIB byte. The savings from shorter instructions can outweigh the prologue/epilogue overhead.

SFrame vs .eh_frame: For the x86-64 clang executable, SFrame metadata is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr. Given the significant VM size overhead and the lack of clear advantages over established alternatives, I am skeptical about SFrame's viability as the future of stack walking for userspace programs. While SFrame will receive a major revision V3 in the upcoming months, it needs to achieve substantial size reductions comparable to existing compact unwinding schemes to justify its adoption over frame pointers. I hope interested folks can implement something similar to macOS's compact unwind descriptors (with x86-64 support) and OpenVMS's.

GCC's frame pointer code generation appears less optimized than its default omit-frame-pointer path, as evidenced by substantial differences in generated assembly.

Runtime performance analysis remains to be conducted to complete the trade-off evaluation.

Appendix: configure-llvm

This script specifies common options when configuring llvm-project: https://github.com/MaskRay/Config/blob/master/home/bin/configure-llvm

  • -DCMAKE_CXX_ARCHIVE_CREATE="$HOME/Stable/bin/llvm-ar qc --thin <TARGET> <OBJECTS>" -DCMAKE_CXX_ARCHIVE_FINISH=:: Use thin archives to reduce disk usage
  • -DLLVM_TARGETS_TO_BUILD=host: Build a single target
  • -DCLANG_ENABLE_OBJC_REWRITER=off -DCLANG_ENABLE_STATIC_ANALYZER=off: Disable less popular components
  • -DLLVM_ENABLE_PLUGINS=off -DCLANG_PLUGIN_SUPPORT=off: Disable -Wl,--export-dynamic, preventing large .dynsym and .dynstr sections

Appendix: My SFrame build

mkdir -p out/release && cd out/release
../../configure --prefix=$HOME/opt/binutils --disable-multilib
make -j $(nproc) all-ld all-binutils all-gas
make -j $(nproc) install-ld install-binutils install-gas

gcc -B$HOME/opt/binutils/bin and clang -B$HOME/opt/binutils/bin -fno-integrated-as will use as and ld from the install directory.

Appendix: Scripts

Ruby scripts used by this post are available at https://github.com/MaskRay/unwind-info-size-analyzer/

Remarks on SFrame

SFrame is a new format for stack walking, suitable for profilers. It intends to replace Linux's in-kernel ORC unwind format and serve as an alternative to .eh_frame and .eh_frame_hdr for userspace programs. While SFrame eliminates some .eh_frame CIE/FDE overhead, it sacrifices functionality (e.g., personality, LSDA, callee-saved registers) and flexibility, and its stack offsets are less compact than .eh_frame's bytecode-style CFI instructions. In llvm-project executables I've tested on x86-64, the .sframe section is 20% larger than .eh_frame. It also remains significantly larger than highly compact schemes like Windows ARM64 unwind codes.

SFrame describes three elements for each function:

  • Canonical Frame Address (CFA): The base address for stack frame calculations
  • Return address
  • Frame pointer

An .sframe section follows a straightforward layout:

  • Header: Contains metadata and offset information
  • Auxiliary header (optional): Reserved for future extensions
  • Function Descriptor Entries (FDEs): Array describing each function
  • Frame Row Entries (FREs): Arrays of unwinding information per function

struct [[gnu::packed]] sframe_header {
  struct {
    uint16_t sfp_magic;
    uint8_t sfp_version;
    uint8_t sfp_flags;
  } sfh_preamble;
  uint8_t sfh_abi_arch;
  int8_t sfh_cfa_fixed_fp_offset;
  // Used by x86-64 to define the return address slot relative to CFA
  int8_t sfh_cfa_fixed_ra_offset;
  // Size in bytes of the auxiliary header, allowing extensibility
  uint8_t sfh_auxhdr_len;
  // Numbers of FDEs and FREs
  uint32_t sfh_num_fdes;
  uint32_t sfh_num_fres;
  // Size in bytes of FREs
  uint32_t sfh_fre_len;
  // Offsets in bytes of FDEs and FREs
  uint32_t sfh_fdeoff;
  uint32_t sfh_freoff;
};

While magic numbers are a popular choice for file formats, they deviate from established ELF conventions, which simply use the section type for distinction.

The version field resembles similar uses within DWARF section headers. SFrame will likely evolve over time, unlike ELF's more stable control structures. This means we'll probably need to keep producers and consumers evolving in lockstep, which creates a stronger case for internal versioning. An internal version field would allow linkers to upgrade or ignore unsupported low-version input pieces, providing more flexibility in handling version mismatches.
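
To show how the header fields fit together, the sketch below checks the header size and computes where the FDE and FRE sub-sections start. It assumes, as I read the SFrame specification, that sfh_fdeoff and sfh_freoff are relative to the end of the header including the auxiliary header; SubSections and locate_subsections are hypothetical names, and the struct repeats the layout shown above so the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct [[gnu::packed]] sframe_header {
  struct {
    uint16_t sfp_magic;
    uint8_t sfp_version;
    uint8_t sfp_flags;
  } sfh_preamble;
  uint8_t sfh_abi_arch;
  int8_t sfh_cfa_fixed_fp_offset;
  int8_t sfh_cfa_fixed_ra_offset;
  uint8_t sfh_auxhdr_len;
  uint32_t sfh_num_fdes;
  uint32_t sfh_num_fres;
  uint32_t sfh_fre_len;
  uint32_t sfh_fdeoff;
  uint32_t sfh_freoff;
};
static_assert(sizeof(sframe_header) == 28,
              "4-byte preamble + 4 one-byte fields + 5 32-bit fields");

// Byte offsets from the start of the .sframe element (hypothetical helper).
struct SubSections {
  size_t fde_begin, fre_begin, fre_end;
};

SubSections locate_subsections(const sframe_header &h) {
  // Assumption: both offsets count from the end of header + auxiliary header.
  size_t body = sizeof(sframe_header) + h.sfh_auxhdr_len;
  return {body + h.sfh_fdeoff, body + h.sfh_freoff,
          body + h.sfh_freoff + h.sfh_fre_len};
}
```

For a header with two 20-byte FDEs (sfh_fdeoff = 0, sfh_freoff = 40) and 12 bytes of FREs, this yields FDEs at byte 28, FREs at byte 68, and the element ending at byte 80.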

Data structures

Function Descriptor Entries (FDEs)

Function Descriptor Entries serve as the bridge between functions and their unwinding information. Each FDE describes a function's location and provides a direct link to its corresponding Frame Row Entries (FREs), which contain the actual unwinding data.

struct [[gnu::packed]] sframe_func_desc_entry {
  int32_t sfde_func_start_address;
  uint32_t sfde_func_size;
  uint32_t sfde_func_start_fre_off;
  uint32_t sfde_func_num_fres;
  // bits 0-3 fretype: sfre_start_address type
  // bit 4 fdetype: SFRAME_FDE_TYPE_PCINC or SFRAME_FDE_TYPE_PCMASK
  // bit 5 pauth_key: (AArch64 only) the signing key for the return address
  uint8_t sfde_func_info;
  // The size of the repetitive code block for SFRAME_FDE_TYPE_PCMASK; used by .plt
  uint8_t sfde_func_rep_size;
  uint16_t sfde_func_padding2;
};

The current design has room for optimization. The sfde_func_num_fres field uses a full 32 bits, which is wasteful for most functions. We could use uint16_t instead, requiring exceptionally large functions to be split across multiple FDEs.

It's important to note that SFrame's function concept represents code ranges rather than logical program functions. This distinction becomes particularly relevant with compiler optimizations like hot-cold splitting, where a single logical function may span multiple non-contiguous code ranges, each requiring its own FDE.

The padding field sfde_func_padding2 represents unnecessary overhead in modern architectures where unaligned memory access performs efficiently, making the alignment benefits negligible.

To enable binary search on sfde_func_start_address, FDEs must maintain a fixed size, which precludes the use of variable-length integer encodings like PrefixVarInt.
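
The lookup that motivates fixed-size FDEs can be sketched as follows. To keep the example short it works on a hypothetical ResolvedFde view whose start address has already been turned into an absolute address; the on-disk field is a signed 32-bit offset, and resolving it is outside this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified, hypothetical view of an FDE with an absolute start address.
struct ResolvedFde {
  uint64_t start;         // resolved sfde_func_start_address
  uint32_t size;          // sfde_func_size
  uint32_t first_fre_off; // sfde_func_start_fre_off
};

// Return the FDE whose [start, start + size) range contains pc, or nullptr.
// Requires fdes to be sorted by start address (the SFRAME_F_FDE_SORTED case).
const ResolvedFde *find_fde(const std::vector<ResolvedFde> &fdes, uint64_t pc) {
  // First entry that starts strictly after pc; the candidate is just before it.
  auto it = std::upper_bound(
      fdes.begin(), fdes.end(), pc,
      [](uint64_t value, const ResolvedFde &f) { return value < f.start; });
  if (it == fdes.begin())
    return nullptr; // pc is below every function
  --it;
  return pc - it->start < it->size ? &*it : nullptr;
}
```

Because every entry has the same size, the array can be indexed directly and the search is O(log n). A variable-length encoding would force a linear scan or a separate index, which is the role .eh_frame_hdr plays for .eh_frame.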

Frame Row Entries (FREs)

Frame Row Entries contain the actual unwinding information for specific program counter ranges within a function. The template design allows for different address sizes based on the function's characteristics.

template <class AddrType>
struct [[gnu::packed]] sframe_frame_row_entry {
  // If the fdetype is SFRAME_FDE_TYPE_PCINC, this is an offset relative to sfde_func_start_address
  AddrType sfre_start_address;
  // bit 0 fre_cfa_base_reg_id: define BASE_REG as either FP or SP
  // bits 1-4 fre_offset_count: typically 1 to 3, describing CFA, FP, and RA
  // bits 5-6 fre_offset_size: byte size of offset entries (1, 2, or 4 bytes)
  sframe_fre_info sfre_info;
};

Each FRE contains variable-length stack offsets stored as trailing data. The fre_offset_size field determines whether offsets use 1, 2, or 4 bytes (uint8_t, uint16_t, or uint32_t), allowing optimal space usage based on stack frame sizes.
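
The info byte can be unpacked with a few shifts. This sketch follows the bit positions in the comments above and assumes, per my reading of the SFrame specification, that base register id 1 means SP (0 means FP) and that the size codes 0, 1, and 2 select 1, 2, and 4 bytes. FreInfo, decode_fre_info, and fre_size are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct FreInfo {
  bool cfa_base_is_sp;   // bit 0: assumed 1 = SP, 0 = FP
  unsigned offset_count; // bits 1-4
  unsigned offset_bytes; // decoded from bits 5-6
};

FreInfo decode_fre_info(uint8_t info) {
  unsigned size_code = (info >> 5) & 0x3u;
  assert(size_code <= 2 && "size code 3 is not defined");
  return {(info & 1) != 0, (info >> 1) & 0xfu, 1u << size_code};
}

// Total bytes occupied by one FRE: the start address field, the info byte,
// and the trailing stack offsets.
unsigned fre_size(unsigned addr_bytes, uint8_t info) {
  FreInfo fi = decode_fre_info(info);
  return addr_bytes + 1 + fi.offset_count * fi.offset_bytes;
}
```

For example, an FRE with a 1-byte start address and a single 1-byte CFA offset takes 3 bytes in total, which is the common case for small x86-64 functions.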

Architecture-specific stack offsets

SFrame adapts to different processor architectures by varying its offset encoding to match their respective calling conventions and architectural constraints.

x86-64

The x86-64 implementation takes advantage of the architecture's predictable stack layout:

  • First offset: Encodes CFA as BASE_REG + offset
  • Second offset (if present): Encodes the location of the saved FP as CFA + offset
  • Return address: Its location is computed implicitly as CFA + sfh_cfa_fixed_ra_offset (using the header field)
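
The three rules above translate into a single unwind step like the sketch below. It uses a sparse map as stand-in memory and models offsets as signed values, since the saved FP and return address slots lie below the CFA. Regs, Memory, X86Fre, and unwind_step are hypothetical names, and the FRE is assumed to be already decoded.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

struct Regs {
  uint64_t pc, sp, fp;
};
using Memory = std::map<uint64_t, uint64_t>; // address -> 8-byte value

// Decoded contents of one x86-64 FRE (hypothetical representation).
struct X86Fre {
  bool cfa_base_is_sp;              // BASE_REG selector
  int32_t cfa_offset;               // first offset
  std::optional<int32_t> fp_offset; // second offset, if present
};

Regs unwind_step(const Regs &cur, const X86Fre &fre, int8_t fixed_ra_offset,
                 const Memory &mem) {
  // CFA = BASE_REG + offset
  uint64_t cfa = (fre.cfa_base_is_sp ? cur.sp : cur.fp) + fre.cfa_offset;
  Regs caller;
  // The return address lives at CFA + sfh_cfa_fixed_ra_offset.
  caller.pc = mem.at(cfa + fixed_ra_offset);
  // After returning, the caller's SP equals the CFA.
  caller.sp = cfa;
  // The saved FP lives at CFA + offset; without a second offset FP is unchanged.
  caller.fp = fre.fp_offset ? mem.at(cfa + *fre.fp_offset) : cur.fp;
  return caller;
}
```

After push rbp; mov rbp, rsp, the matching FRE would be {false, 16, -16} with a fixed return address offset of -8. At function entry it would be {true, 8, std::nullopt}.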

AArch64

AArch64's more flexible calling conventions require explicit return address tracking:

  • First offset: Encodes CFA as BASE_REG + offset
  • Second offset: Encodes the location of the return address as CFA + offset
  • Third offset (if present): Encodes the location of the saved FP as CFA + offset

The explicit return address encoding accommodates AArch64's variable stack layouts and link register usage patterns.

s390x

TODO

ORC and .sframe

TODO

.eh_frame and .sframe

SFrame reduces header size compared to .eh_frame plus .eh_frame_hdr by:

  • Eliminating .eh_frame_hdr through sorted sfde_func_start_address fields
  • Replacing CIE pointers with direct FDE-to-FRE references
  • Using variable-width sfre_start_address fields (1 or 2 bytes) for small functions
  • Storing start addresses instead of address ranges (.eh_frame FDEs record address ranges)
  • Start addresses in a small function use 1 or 2 byte fields, more efficient than .eh_frame initial_location, which needs at least 4 bytes (DW_EH_PE_sdata4).
  • Hard-coding stack offsets rather than using flexible register specifications

However, the bytecode design of .eh_frame can sometimes be more efficient than .sframe, as demonstrated on x86-64.


SFrame serves as a specialized complement to .eh_frame rather than a complete replacement. The current version does not include personality routines, Language Specific Data Area (LSDA) information, or the ability to encode extra callee-saved registers. While these constraints make SFrame ideal for profilers and debuggers, they prevent it from supporting C++ exception handling, where libstdc++/libc++abi requires the full .eh_frame feature set.

In practice, executables and shared objects will likely contain all three sections:

  • .eh_frame: Complete unwinding information for exception handling
  • .eh_frame_hdr (encompassed by the PT_GNU_EH_FRAME program header): Fast lookup table for .eh_frame
  • .sframe (encompassed by the PT_GNU_SFRAME program header)

The auxiliary header, currently unused, provides a pathway for future enhancements. It could potentially accommodate .eh_frame augmentation data such as personality routines, language-specific data areas (LSDAs), and signal frame handling, bridging some of the current functionality gaps.

Large text section support

The sfde_func_start_address field uses a signed 32-bit offset to reference functions, providing a ±2GB addressing range from the field's location. This signed encoding offers flexibility in section ordering: .sframe can be placed either before or after text sections.

However, this approach faces limitations with large binaries, particularly when LLVM generates .ltext sections for x86-64. The typical section layout creates significant gaps between .sframe and .ltext:

.ltext    // Large text section
.lrodata  // Large read-only data
.rodata   // Regular read-only data
          // .eh_frame and .sframe position
.text     // Regular text section
.data
.bss
.ldata    // Large data
.lbss     // Large BSS
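
The range limit is easy to state in code. The sketch below computes the field-relative distance and reports failure when it does not fit in 32 signed bits; encode_pcrel32 is a hypothetical helper, not a function from binutils or LLVM, and the addresses in the usage note are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Encode the distance from the field's own address to the function start as a
// signed 32-bit value, or return nullopt when it is out of the ±2GB range.
std::optional<int32_t> encode_pcrel32(uint64_t field_addr, uint64_t func_addr) {
  // Unsigned subtraction wraps, so the cast yields the signed distance.
  int64_t delta = static_cast<int64_t>(func_addr - field_addr);
  if (delta < std::numeric_limits<int32_t>::min() ||
      delta > std::numeric_limits<int32_t>::max())
    return std::nullopt;
  return static_cast<int32_t>(delta);
}
```

With .sframe at 0x90000000 and a function in .ltext at 0x1000, the distance is about -2.25 GiB and the encoding fails, which is the situation the layout above can produce once .lrodata and .rodata grow large.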

Object file format design issues

Mandatory index building problems

Currently, Binutils enforces a single-element structure within each .sframe section, regardless of whether it resides in a relocatable object or final executable. While the SFRAME_F_FDE_SORTED flag can be cleared to permit unsorted FDEs, proposed unwinder implementations for the Linux kernel do not seem to support multiple elements in a single section. The design choice makes linker merging mandatory rather than optional.

This design choice stems from Linux kernel requirements, where kernel modules are relocatable files created with ld -r. The pending SFrame support for linux-perf expects each module to contain a single indexed format for efficient runtime processing. Consequently, GNU ld merges all input .sframe sections into a single indexed element, even when producing relocatable files. This behavior deviates from standard relocatable linking conventions that suppress synthetic section finalization.

This approach differs from almost every other metadata section, each of which supports multiple concatenated elements with their own header and body. LLVM supports numerous well-behaved metadata sections (__asan_globals, .stack_sizes, __patchable_function_entries, __llvm_prf_cnts, __sancov_bools, __llvm_covmap, __llvm_gcov_ctr_section, .llvmcmd, and llvm_offload_entries) that concatenate without issues. SFrame stands apart as the only metadata section demanding version-specific merging as default linker behavior, creating unprecedented maintenance burden. For optimal portability, unwinders should support multiple-element structures within a .sframe section.

For optimal portability, we must support object files from diverse origins, not just those built from a single toolchain. In environments where almost everything is built from source with a single toolchain offering strong SFrame support, forcing default-on index building may be acceptable. However, we must also accommodate environments with prebuilt object files using older SFrame versions, or toolchains that don't support old formats. I believe unwinders should support multiple-element structures within a .sframe section. When a linker builds an index for .sframe, it should be viewed as an optimization that relieves the unwinder from constructing its own index at runtime. This index construction should remain optional rather than required.

Section group compliance and garbage collection issues

GNU Assembler generates a single .sframe section containing relocations to STB_LOCAL symbols from multiple text sections, including those in different section groups.

This creates ELF specification violations when a referenced text section is discarded by the COMDAT section group rule. The ELF specification states:

A symbol table entry with STB_LOCAL binding that is defined relative to one of a group's sections, and that is contained in a symbol table section that is not part of the group, must be discarded if the group members are discarded. References to this symbol table entry from outside the group are not allowed.

The problem manifests when inline functions are deduplicated:

cat > a.cc <<'eof'
[[gnu::noinline]] inline int inl() { return 0; }
auto *fa = inl;
eof
cat > b.cc <<'eof'
[[gnu::noinline]] inline int inl() { return 0; }
auto *fb = inl;
eof
~/opt/gcc-15/bin/g++ -Wa,--gsframe -c a.cc b.cc

Linkers correctly reject this violation:

% ld.lld a.o b.o
ld.lld: error: relocation refers to a discarded section: .text._Z3inlv
>>> defined in b.o
>>> referenced by b.cc
>>> b.o:(.sframe+0x1c)

% gold a.o b.o
b.o(.sframe+0x1c): error: relocation refers to local symbol ".text._Z3inlv" [2], which is defined in a discarded section
section group signature: "inl()"
prevailing definition is from a.o

(In 2020, I reported a similar issue for GCC -fpatchable-function-entry=.)

Some linkers don't implement this error check. A separate issue arises with garbage collection: by default, an unreferenced .sframe section will be discarded. If the linker implements a workaround to force-retain .sframe, it might inadvertently retain all text sections referenced by .sframe, even those that would otherwise be garbage collected.

The solution requires restructuring the assembler's output strategy. Instead of creating a monolithic .sframe section, the assembler should generate individual SFrame sections corresponding to each text section. When a text section belongs to a COMDAT group, its associated SFrame section must join the same group. For standalone text sections, the SHF_LINK_ORDER flag should establish the proper association.

This approach would create multiple SFrame sections within relocatable files, making the size optimization benefits of a simplified linking view format even more compelling. While this comes with the overhead of additional section headers (where each Elf64_Shdr consumes 64 bytes), it's a cost we should pay to be a good ELF citizen. This reinforces the value of my section header reduction proposal.

Version compatibility challenges

The current design creates significant version compatibility problems. When a linker only supports v3 but encounters object files with v2 .sframe sections, it faces impossible choices:

  • Discard v2 sections: Silently losing functionality
  • Report errors: Breaking builds with mixed-version object files
  • Concatenate sections: Currently unsupported by unwinders
  • Upgrade v2 to v3: Requires maintaining version-specific merge logic for every version

This differs fundamentally from reading a format: each version needs version-specific merging logic in every linker. Consider the scenario where v2 uses layout A, v3 uses layout B, and v4 uses layout C. A linker receiving objects with all three versions must produce coherent output with proper indexing while maintaining version-specific merge logic for each.

Real-world mixing scenarios include:

  • Third-party vendor libraries built with older toolchains
  • Users linking against prebuilt libraries from different sources
  • Users who don't need SFrame but must handle prebuilt libraries with older versions
  • Users updating their linker to a newer version that drops legacy SFrame support

Most users will not need stack tracing features. This may change eventually, but that will take many years. In the meantime, they must accept unneeded information while handling the resulting compatibility issues.

Requiring version-specific merging as default behavior would create maintenance burden unmatched by any other loadable metadata section.

Proposed format separation

A future version should distinguish between linking and execution views to resolve the compatibility and maintenance challenges outlined above. This separation has precedent in existing debug formats: .debug_pubnames/.gdb_index provides an excellent model for separate linking and execution views. DWARF v5's .debug_names takes a different approach, unifying both views at the cost of larger linking formats. That is a reasonable tradeoff, since relocatable files contain only a single .debug_names section, and debuggers can efficiently load sections with concatenated name tables.

For SFrame, the separation would work as follows:

Separate linking format. Assemblers produce a simpler format, omitting index-specific metadata fields such as sfh_num_fdes, sfh_num_fres, sfh_fdeoff, and sfh_freoff.

Default concatenation behavior. Linkers concatenate .sframe input sections by default, consistent with DWARF and other metadata sections. Linkers can handle mixed-version scenarios gracefully without requiring version-specific merge logic, eliminating the impossible maintenance burden of keeping version-specific merge logic for every SFrame version in every linker implementation. Distributions can roll out SFrame support incrementally without requiring all linkers to support index building immediately.

The unwinder implementation cost is manageable. Stack unwinders already need to support .sframe sections across the main executable and all shared objects. Supporting multiple concatenated elements within a single .sframe section presents no fundamental technical barrier: it is a one-time implementation cost that provides forward and backward compatibility.
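
To illustrate that one-time cost, here is a sketch of how an unwinder might split a concatenated .sframe section into elements. It rests on several assumptions that the current format does not guarantee: every element starts with the 28-byte header layout shown earlier (in host byte order), its extent is header + auxiliary header + sfh_freoff + sfh_fre_len, and no padding sits between elements. Element and split_elements are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Element {
  size_t offset;   // where the element starts within the section
  size_t size;     // header + auxiliary header + FDEs + FREs
  uint8_t version; // sfp_version, so callers can skip versions they don't know
};

// Split a .sframe section into back-to-back elements (assumptions in lead-in).
std::vector<Element> split_elements(const std::vector<uint8_t> &sec) {
  constexpr size_t kHeaderSize = 28;
  std::vector<Element> out;
  size_t pos = 0;
  while (sec.size() - pos >= kHeaderSize) {
    const uint8_t *h = sec.data() + pos;
    uint32_t fre_len, freoff;
    std::memcpy(&fre_len, h + 16, 4); // sfh_fre_len
    std::memcpy(&freoff, h + 24, 4);  // sfh_freoff, relative to the header end
    size_t size = kHeaderSize + h[7] /* sfh_auxhdr_len */ +
                  static_cast<size_t>(freoff) + fre_len;
    if (size > sec.size() - pos)
      break; // truncated or corrupt element
    out.push_back({pos, size, h[2] /* sfp_version */});
    pos += size;
  }
  return out;
}
```

An unwinder could then build its own index per element, or ignore elements whose version it does not recognize, instead of relying on the linker to have merged everything into one.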

Optional index construction. When the opt-in option--sframe-index is requested, the linker builds an indexfrom recognized versions while reporting warnings for unrecognized ones.This is analogous to --gdb-indexand --debug-names.

With this approach, the linker builds .sframe_idx from input .sframe sections. To support the Linux kernel workflow (ld -r for kernel modules), ld -r --sframe-index must also generate the indexed format.

The index construction happens before section matching in linker scripts. The output section description .sframe_idx : { *(.sframe_idx) } places the synthesized .sframe_idx into the .sframe_idx output section. .sframe input sections have been replaced by the linker-synthesized .sframe_idx, so we don't write *(.sframe).
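The opt-in index construction could look roughly like the sketch below. All types here (`InputElement`, `IndexEntry`, `IndexResult`) are hypothetical in-memory views, not part of any linker, and the set of recognized versions in `isRecognized` is an assumption chosen only for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical view of one parsed input .sframe element.
struct InputElement {
  std::string file;                  // originating object file, for diagnostics
  uint8_t version;                   // format version found in the element
  std::vector<uint64_t> funcStarts;  // start addresses of described functions
};

struct IndexEntry {
  uint64_t funcStart;
  uint32_t element;  // index into the input element list
  uint32_t entry;    // index of the function within that element
};

struct IndexResult {
  std::vector<IndexEntry> entries;    // sorted by funcStart for binary search
  std::vector<std::string> warnings;  // one per element left out of the index
};

// Assumption for this sketch: versions 2 and 3 are the ones the linker knows.
bool isRecognized(uint8_t version) { return version == 2 || version == 3; }

IndexResult buildIndex(const std::vector<InputElement> &inputs) {
  IndexResult r;
  for (uint32_t i = 0; i < inputs.size(); ++i) {
    const InputElement &e = inputs[i];
    if (!isRecognized(e.version)) {
      // Warn and skip; the element is not indexed but the link still succeeds.
      r.warnings.push_back(e.file + ": unrecognized .sframe version " +
                           std::to_string(e.version) + ", not indexed");
      continue;
    }
    for (uint32_t j = 0; j < e.funcStarts.size(); ++j)
      r.entries.push_back({e.funcStarts[j], i, j});
  }
  std::sort(r.entries.begin(), r.entries.end(),
            [](const IndexEntry &a, const IndexEntry &b) {
              return a.funcStart < b.funcStart;
            });
  return r;
}
```

The point of the design is that an unknown version degrades to a warning instead of a link failure, so a new format revision does not force a simultaneous update of every linker.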

Alternative: Deriving SFrame from .eh_frame

An alternative approach could eliminate the need for assemblers to generate .sframe sections directly. Instead, the linker would merge and optimize .eh_frame as usual (which requires CIE and FDE boundary information), then derive .sframe (or .sframe_idx) from the optimized .eh_frame.

This approach offers a significant advantage: since the linker only reads the stable .eh_frame format and produces .sframe or .sframe_idx as output, version compatibility concerns disappear entirely.

While CFI instruction decoding introduces additional complexity (a step previously unneeded), this is balanced by the architectural advantage of centralizing the conversion logic. Rather than scattering format-specific processing code throughout the linker (similar to how SHF_MERGE and .eh_frame require special internal representations), the transformation logic remains localized.
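To give a feel for the decoding step, here is a minimal sketch that turns a handful of DWARF CFA opcodes into rows of "from this PC offset onward, CFA = register + offset". The opcode values are the standard DWARF ones, but everything else is an assumption for illustration: the `Row` type and `cfiToRows` name are made up, only a small opcode subset is handled, and saved-register rules (which a real SFrame producer would also need for FP and RA) are skipped.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One unwind row: from pcOffset onward, CFA = value of cfaReg + cfaOffset.
struct Row {
  uint32_t pcOffset;
  uint8_t cfaReg;  // DWARF register number
  int64_t cfaOffset;
};

// Read a ULEB128 value and advance pos.
static uint64_t readULEB128(const std::vector<uint8_t> &b, size_t &pos) {
  uint64_t v = 0;
  unsigned shift = 0;
  while (pos < b.size()) {
    uint8_t byte = b[pos++];
    if (shift < 64)
      v |= uint64_t(byte & 0x7f) << shift;
    if (!(byte & 0x80))
      break;
    shift += 7;
  }
  return v;
}

// Decode a small subset of CFA opcodes; returns false on anything else.
bool cfiToRows(const std::vector<uint8_t> &insts, Row initial,
               unsigned codeAlign, std::vector<Row> &rows) {
  Row cur = initial;
  rows.push_back(cur);
  // Start a new row only when the location advanced since the last row.
  auto commit = [&] {
    if (rows.back().pcOffset == cur.pcOffset)
      rows.back() = cur;
    else
      rows.push_back(cur);
  };
  for (size_t pos = 0; pos < insts.size();) {
    uint8_t op = insts[pos++];
    switch (op & 0xc0) {
    case 0x40:  // DW_CFA_advance_loc: low 6 bits hold the delta
      cur.pcOffset += (op & 0x3f) * codeAlign;
      continue;
    case 0x80:  // DW_CFA_offset: saved-register rule, ignored in this sketch
      readULEB128(insts, pos);
      continue;
    case 0xc0:  // DW_CFA_restore: ignored in this sketch
      continue;
    }
    switch (op) {
    case 0x00:  // DW_CFA_nop
      break;
    case 0x0c:  // DW_CFA_def_cfa: register, offset
      cur.cfaReg = static_cast<uint8_t>(readULEB128(insts, pos));
      cur.cfaOffset = static_cast<int64_t>(readULEB128(insts, pos));
      commit();
      break;
    case 0x0d:  // DW_CFA_def_cfa_register
      cur.cfaReg = static_cast<uint8_t>(readULEB128(insts, pos));
      commit();
      break;
    case 0x0e:  // DW_CFA_def_cfa_offset
      cur.cfaOffset = static_cast<int64_t>(readULEB128(insts, pos));
      commit();
      break;
    default:
      return false;  // outside the subset handled here
    }
  }
  return true;
}
```

For a typical x86-64 `push %rbp; mov %rsp,%rbp` prologue, the FDE bytes `41 0e 10 86 02 43 0d 06` with an initial rule of CFA = r7 + 8 decode to three rows: (0, r7, 8), (1, r7, 16), and (4, r6, 16).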

The counterargument centers on maintenance burden. This fine-grained knowledge of the SFrame format may expose the linker to more frequent updates as the format evolves. That is a serious risk, given that the linker's foundational role in the build process demands exceptional stability and robustness.

Post-processing alternative

A more cautious intermediate strategy could leverage existing Linux distribution post-processing tools, modifying them to append .sframe sections to executable and shared object files after linking completes. While this introduces more friction than native linker support and requires integration into package build systems, it offers several compelling advantages:

  • Allows .sframe format experimentation without imposing linker complexity
  • Provides time for the format to mature and prove its value before committing to linker integration
  • Enables testing across diverse userspace packages in real-world scenarios
  • Post-link tools can optimize and even overwrite sections in-place without linker constraints
  • For cases where optimization significantly shrinks the section, .sframe can be placed at the end of the file (similar to BOLT moving .rodata)

However, this approach faces practical challenges. Post-processing adds build complexity, particularly with features like build-ids and read-only file systems. The success of .gdb_index, where linker support (--gdb-index) proved more popular than post-link tools, suggests that native linker support eventually becomes necessary for widespread adoption.

The key question is timing: should linker integration be the starting point or the outcome of proven stability?

SHF_ALLOC considerations

The .sframe section carries the SHF_ALLOC flag, meaning it's loaded as part of the program's read-only data segment. This design choice creates tradeoffs:

With SHF_ALLOC:

  • .sframe contributes to initial read-only data segment consumption
  • Can be accessed directly as part of the memory-mapped area
  • No runtime mmap cost for tracers

Without SHF_ALLOC:

  • No upfront memory cost
  • Tracers must open the file and mmap the section on demand
  • Runtime cost may not amortize well for frequent tracing

Analysis of 337 files in /usr/bin and /usr/lib/x86_64-linux-gnu/ shows .eh_frame typically consumes 5.2% (median: 5.1%) of file size:

EH_Frame size distribution:
Min: 0.3% Max: 11.5% Mean: 5.2% Median: 5.1%

0%-1%: 7 files 5%-6%: 62 files
1%-2%: 17 files 6%-7%: 33 files
2%-3%: 37 files 7%-8%: 36 files
3%-4%: 49 files 8%-9%: 20 files
4%-5%: 50 files 9%-10%: 20 files
10%-12%: 6 files
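As a quick sanity check on the table, the bucket counts can be summed and a mean estimated from bucket midpoints. This is only a rough cross-check under the assumption that files are spread evenly within each bucket; `Bucket`, `totalFiles`, and `midpointMean` are names made up for this sketch.

```cpp
struct Bucket {
  double lo, hi;  // bucket bounds, in percent of file size
  int files;      // number of files in the bucket
};

// Counts copied from the distribution above.
constexpr Bucket kBuckets[] = {
    {0, 1, 7},  {1, 2, 17}, {2, 3, 37}, {3, 4, 49},  {4, 5, 50},  {5, 6, 62},
    {6, 7, 33}, {7, 8, 36}, {8, 9, 20}, {9, 10, 20}, {10, 12, 6},
};

int totalFiles() {
  int n = 0;
  for (const Bucket &b : kBuckets)
    n += b.files;
  return n;
}

// Mean estimated by treating every file as sitting at its bucket's midpoint.
double midpointMean() {
  double weighted = 0;
  for (const Bucket &b : kBuckets)
    weighted += (b.lo + b.hi) / 2 * b.files;
  return weighted / totalFiles();
}
```

The counts add up to the 337 files analyzed, and the midpoint estimate comes out at roughly 5.25%, in line with the reported 5.2% mean.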

If .sframe size is comparable to .eh_frame, this represents significant overhead for applications that never use stack tracing, which is likely the majority. This raises the question of whether having .sframe always loaded is an acceptable overhead for distributions shipping it by default.

perf supports .debug_frame (tools/perf/util/unwind-libunwind-local.c), which does not have SHF_ALLOC. While there's a difference between status quo and what's optimal, the non-SHF_ALLOC approach deserves consideration for scenarios where runtime tracing overhead can be amortized or where memory footprint matters more than immediate access.

Kernel challenges

The .sframe section may not be resident in physical memory. SFrame proposers are attempting to defer user stack traces until syscall boundaries.

Ian Rogers points out that BPF programs can no longer simply stack trace user code. This change breaks stack trace deduplication, a commonly used BPF primitive.

Miscellaneous minor considerations

Linker relaxation considerations:

Since .sframe carries the SHF_ALLOC flag, it affects text section addresses and consequently influences linker relaxation on architectures like RISC-V and LoongArch.

If variable-length encoding is introduced to the format, .sframe would behave as an address-dependent section similar to .relr.dyn. However, this dependency should not pose significant implementation challenges.

Endianness considerations:

The SFrame format currently supports endianness variants, which complicates toolchain implementation. While runtime consumers typically target a single endianness, development tools must handle both variants to support cross-compilation workflows.

The endianness discussion in "The future of 32-bit support in the kernel" reinforces my belief in preferring universal little-endian for new formats. A universal little-endian approach would reduce implementation complexity by eliminating the need for:

  • Endianness-aware function calls like read32le(config, p) where config->endian specifies the object file's byte order
  • Template-based abstractions such as template <class Endian> that must wrap every data access function

Instead, toolchain code could use straightforward calls like read32le(p), streamlining both implementation and maintenance.
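A minimal sketch of the difference, assuming a hypothetical `Config` type that carries the object file's byte order. The `read32` name for the dual-endian variant is mine; the helpers are written with byte shifts so they behave the same on little-endian and big-endian hosts.

```cpp
#include <cstdint>

enum class Endian { Little, Big };

// Hypothetical per-link configuration carrying the object file's byte order.
struct Config {
  Endian endian;
};

// Host-independent little-endian read: assembles the value byte by byte.
inline uint32_t read32le(const uint8_t *p) {
  return uint32_t(p[0]) | uint32_t(p[1]) << 8 | uint32_t(p[2]) << 16 |
         uint32_t(p[3]) << 24;
}

// Host-independent big-endian read.
inline uint32_t read32be(const uint8_t *p) {
  return uint32_t(p[3]) | uint32_t(p[2]) << 8 | uint32_t(p[1]) << 16 |
         uint32_t(p[0]) << 24;
}

// What a dual-endian format forces on every access: the byte order has to be
// threaded through each call site.
inline uint32_t read32(const Config &config, const uint8_t *p) {
  return config.endian == Endian::Little ? read32le(p) : read32be(p);
}
```

With a little-endian-only format, only `read32le(p)` is needed and `Config` never has to reach the parsing code.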

This approach remains efficient even on big-endian architectures like IBM z/Architecture and POWER. z/Architecture's LOAD REVERSED instructions, for instance, handle byte swapping with minimal overhead, often requiring no additional instructions beyond normal loads. While slight performance differences may exist compared to native endian operations, the toolchain simplification benefits generally outweigh these concerns.

// Compare native loads with byte-swapped loads at 16, 32, and 64 bits.
// The aligned(1) typedef tells the compiler the pointer may be unaligned.
#define WIDTH(x) \
typedef __UINT##x##_TYPE__ [[gnu::aligned(1)]] uint##x; \
uint##x load_inc##x(uint##x *p) { return *p+1; } \
uint##x load_bswap_inc##x(uint##x *p) { return __builtin_bswap##x(*p)+1; } \
uint##x load_eq##x(uint##x *p) { return *p==3; } \
uint##x load_bswap_eq##x(uint##x *p) { return __builtin_bswap##x(*p)==3; }

WIDTH(16);
WIDTH(32);
WIDTH(64);

However, I understand that my opinion is probably not popular within the object file format community and faces resistance from stakeholders with significant big-endian investments.

Questioned benefits

SFrame's primary benefit centers on enabling frame pointer omission while preserving unwinding capabilities. In scenarios where users already omit leaf frame pointers, SFrame could theoretically allow switching from -fno-omit-frame-pointer -momit-leaf-frame-pointer to -fomit-frame-pointer -momit-leaf-frame-pointer. This benefit appears most significant on x86-64, which has limited general-purpose registers (without APX). Performance analyses show mixed results: some studies claim frame pointers degrade performance by less than 1%, while others suggest 1-2%. However, this argument overlooks a critical tradeoff: SFrame unwinding itself performs worse than frame pointer unwinding, potentially negating any performance gains from register availability.

Another claimed advantage is SFrame's ability to provide coverage in function prologues and epilogues, where frame-pointer-based unwinding may miss frames. Yet this overlooks a straightforward alternative: frame pointer unwinding can be enhanced to detect prologue and epilogue patterns by disassembling instructions at the program counter.
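As a rough illustration of that alternative, the sketch below pattern-matches the bytes around the program counter for the common x86-64 `push %rbp; mov %rsp,%rbp` prologue and for `ret`. It is a heuristic under stated assumptions, not a disassembler: it only knows the `48 89 e5` encoding of the `mov`, ignores a leading `endbr64`, and can be fooled by unusual code. `FrameState`, `classify`, and `returnAddressSlot` are names made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class FrameState {
  AtEntry,       // at `push %rbp`: return address is at [rsp]
  AfterPush,     // at `mov %rsp,%rbp`: return address is at [rsp+8]
  AtRet,         // at `ret`: return address is at [rsp]
  FramePointer,  // frame is set up: return address is at [rbp+8]
};

constexpr uint8_t kPushRbp = 0x55;
constexpr uint8_t kMovRspRbp[] = {0x48, 0x89, 0xe5};
constexpr uint8_t kRet = 0xc3;

static bool hasMovAt(const uint8_t *text, size_t size, size_t at) {
  return at + sizeof kMovRspRbp <= size &&
         std::memcmp(text + at, kMovRspRbp, sizeof kMovRspRbp) == 0;
}

// text/size describe a readable window of code; pc is an offset into it.
FrameState classify(const uint8_t *text, size_t size, size_t pc) {
  if (pc >= size)
    return FrameState::FramePointer;
  if (text[pc] == kPushRbp && hasMovAt(text, size, pc + 1))
    return FrameState::AtEntry;
  if (pc > 0 && text[pc - 1] == kPushRbp && hasMovAt(text, size, pc))
    return FrameState::AfterPush;
  if (text[pc] == kRet)
    return FrameState::AtRet;
  return FrameState::FramePointer;
}

// Address of the stack slot holding the return address for each state.
uint64_t returnAddressSlot(FrameState s, uint64_t rsp, uint64_t rbp) {
  switch (s) {
  case FrameState::AtEntry:
  case FrameState::AtRet:
    return rsp;
  case FrameState::AfterPush:
    return rsp + 8;
  case FrameState::FramePointer:
    break;
  }
  return rbp + 8;
}
```

An unwinder would apply this only to the innermost frame, where the interrupted PC may sit inside a prologue or epilogue; outer frames are at call sites and can use the plain frame pointer chain.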

SFrame also faces a practical consideration: the .sframe section likely requires kernel page-in during unwinding, while the process stack is more likely already resident in physical memory. As Ian Rogers noted in LWN, system-wide profiling encounters limitations when system calls haven't transitioned to user code, BPF helpers may return placeholder values, and JIT compilers require additional SFrame support.

Looking ahead, hardware-assisted unwinding through features like x86 Shadow Stack and AArch64 Guarded Control Stack may reshape the entire landscape, potentially reducing the relevance of metadata-based unwinding formats. Meanwhile, compact unwinding schemes like the one used on Windows ARM64 demonstrate that significantly smaller metadata formats remain viable alternatives to both SFrame and .eh_frame. Proposals like Asynchronous Compact Unwind Descriptors have demonstrated that compact unwind formats can work with shrink-wrapping optimizations. There is a feature request for compact unwind information for AArch64: https://github.com/ARM-software/abi-aa/issues/344

Summary

Beyond these fundamental questions about SFrame's value proposition, the format does offer a size improvement over the Linux kernel's ORC unwinder. Its design presents several implementation challenges that merit consideration for future versions:

  • Object file format design issues (mandatory index building, section group compliance, version compatibility)
  • Limited large text section support restricts deployment in modern binaries
  • Size issue

These technical concerns, combined with the fundamental value questions raised above, suggest that careful consideration is warranted before widespread adoption.

If we proceed, here is how to do it right

According to this comment on llvm-project #64449, "v3 is the version that will be submitted upstream when the time is right." Please share feedback on the format before it's finalized, even if you may not be impressed with the design.

To ensure rapid SFrame evolution without compatibility concerns, a better approach is to build a library that parses .eh_frame and generates SFrame. The Linux kernel can then use this library (in objtool?) to generate SFrame for vmlinux and modules. Relying on assembler/linker output for this critical metadata format requires a level of stability that is currently concerning.

The ongoing maintenance implications warrant particular attention. Observing the binutils mailing list reveals a significant volume of SFrame commits. Most linker features stabilize quickly after initial implementation, but SFrame appears to require continued evolution. Given the linker's foundational role in the build process, which demands exceptional stability and robustness, the long-term maintenance burden deserves careful consideration.

Early integration into the GNU toolchain has provided valuable feedback for format evolution, but this comes at the cost of coupling the format's maturity to linker stability. The SFrame GNU toolchain developers exhibit a concerning tendency to disregard ELF and linker conventions, a serious problem for all linker maintainers. It's unfortunate that there might not be time to consider compact unwinding schemes like Windows ARM64 before committing to SFrame.
