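2025 year in review

MaskRay

December 31, 2025 15:00

TODO

As always, I mainly worked in the toolchain field. However, because work was busy, I spent less time on the open-source community.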

Blogging

Excluding this summary, I wrote 18 posts in total.

Understanding and improving Clang -ftime-report
Natural loops
lld 20 ELF changes
Migrating comments to giscus
Compiling C++ with the Clang API
Relocation generation in assemblers
LLVM integrated assembler: Improving MCExpr and MCValue
LLVM integrated assembler: Improving expressions and relocations
GCC 13.3.0 miscompiles LLVM
LLVM integrated assembler: Engineering better fragments
LLVM integrated assembler: Improving sections and symbols
Understanding alignment - from source to object file
Benchmarking compression programs
lld 21 ELF changes
Remarks on SFrame
Stack walking: space and time trade-offs
Sacramento travel notes
Weak AVL Tree

llvm-project

Revamped the integrated assembler and wrote 4 related blog posts: https://maskray.me/blog/tags/assembler/
Reviewed numerous patches. The query is:pr created:>2025-01-01 reviewed-by:MaskRay => "989 Closed"

Linux kernel

Contributed two commits and was referenced once.

ccls

clang.prependArgs
Added support for LLVM 21 and 22

ELF specification

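Tried to advance the compact section header table proposal, but no consensus was reached. Some members prefer general compression (like zstd), compressing the section header table the way SHF_COMPRESSED does. Others, including me, dislike general compression.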

Misc

Reported 6 feature requests or bugs to binutils.

ld --build-id does not use symtab/strtab content
gas: monolithic .sframe violates COMDAT group rule
gas: Clarify whitespace between a label's symbol and its colon
ld: Add --print-gc-sections=file
ld riscv: Relocatable linking challenge with R_RISCV_ALIGN
ld: add --why-live

Travel

First visits: Tainan, Xi'an, Lanzhou, Tianshui, Sacramento, Puerto Vallarta (Jalisco, Mexico), Mazatlán (Sinaloa, Mexico)
Revisited: Taipei (the previous visit was nearly 11 years ago), Beijing

Weak AVL Tree

MaskRay

December 14, 2025 16:00

tl;dr: Weak AVL trees are replacements for AVL trees and red-black trees.

The 2014 paper Rank-Balanced Trees (Haeupler, Sen, Tarjan) presents a framework using ranks and rank differences to define binary search trees.

Each node has a non-negative integer rank r(x). Null nodes have rank -1.
The rank difference of a node x with parent p(x) is r(p(x)) − r(x).
A node is i,j if its children have rank differences i and j (unordered), e.g., a 1,2 node has children with rank differences 1 and 2.
A node is called a 1-node if its rank difference is 1.

Several balanced trees fit this framework:

AVL tree: Ranks are defined as heights. Every node is 1,1 or 1,2 (rank differences of children).
Red-black tree: All rank differences are 0 or 1, and no parent of a 0-child is a 0-child. (red: 0-child; black: 1-child; null nodes are black)
Weak AVL tree (new tree described by this paper): All rank differences are 1 or 2, and every leaf has rank 0.
- A weak AVL tree without 2,2 nodes is an AVL tree.

AVL trees ⫋ weak AVL trees ⫋ red-black trees
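
The rank rules above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical node layout with an explicit integer rank (the names `Node`, `is_wavl`, `has_22_node`, and `is_avl` are illustrative and do not come from the paper or any library).

```cpp
#include <cassert>

// Hypothetical node layout for illustration: explicit integer ranks.
struct Node {
  int rank;
  Node *ch[2];
};

// Null nodes have rank -1.
int rank_of(const Node *n) { return n ? n->rank : -1; }

// Weak AVL rule: all rank differences are 1 or 2, and every leaf has rank 0.
bool is_wavl(const Node *n) {
  if (!n) return true;
  if (!n->ch[0] && !n->ch[1] && n->rank != 0) return false;
  for (int d = 0; d < 2; d++) {
    int diff = n->rank - rank_of(n->ch[d]);
    if (diff != 1 && diff != 2) return false;
    if (!is_wavl(n->ch[d])) return false;
  }
  return true;
}

// True if some node in the subtree is a 2,2 node.
bool has_22_node(const Node *n) {
  if (!n) return false;
  bool is22 = n->rank - rank_of(n->ch[0]) == 2 &&
              n->rank - rank_of(n->ch[1]) == 2;
  return is22 || has_22_node(n->ch[0]) || has_22_node(n->ch[1]);
}

// A weak AVL tree without 2,2 nodes is an AVL tree.
bool is_avl(const Node *n) { return is_wavl(n) && !has_22_node(n); }
```

A checker like this is mostly useful in tests: run it after every insertion or deletion to confirm that the rebalancing code restored the rank rule.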

Weak AVL Tree

Weak AVL trees are replacements for AVL trees and red-black trees. A single insertion or deletion operation requires at most two rotations (forming a double rotation when two are needed). In contrast, AVL deletion requires O(log n) rotations, and red-black deletion requires up to three.

Without deletions, a weak AVL tree is exactly an AVL tree. With deletions, its height remains at most that of an AVL tree with the same number of insertions but no deletions.

The rank rules imply:

Null nodes have rank -1, leaves have rank 0, unary nodes have rank 1.

Insertion

The new node x has a rank of 0, changed from the null node of rank -1. There are three cases.

If the tree was previously empty, the new node becomes the root.
If the parent of the new node was previously a unary node (1,2 node), it is now a 1,1 binary node.
If the parent of the new node was previously a leaf (1,1 node), it is now a 0,1 binary node, leading to a rank violation.

When the tree was previously non-empty, x has a parent node. We call the following subroutine with x indicating the new node to handle the second and third cases.

The following subroutine handles the rank increase of x. We call break if there is no more rank violation, i.e. we are done.

The 2014 paper isn't very clear about the conditions.

// Assume that x's rank has just increased by 1 and rank_diff(x) has been updated.

p = x->parent;
if (rank_diff(x) == 1) {
  // x was previously a 2-child before increasing x->rank.
  // Done.
} else {
  for (;;) {
    // Otherwise, p is a 0,1 node (previously a 1,1 node before increasing x->rank).
    // x being a 0-child is a violation.

    Promote p.
    // Since we have promoted both x and p, it's as if rank_diff(x's sibling) is flipped.
    // p is now a 1,2 node.

    x = p;
    p = p->parent;
    // x is a 1,2 node.
    if (!p) break;
    d = p->ch[1] == x;

    if (rank_diff(x) == 1) { break; }
    // Otherwise, x is a 0-child, leading to a new rank violation.

    auto sib = p->ch[!d];
    if (rank_diff(sib) == 2) { // p is a 0,2 node
      auto y = x->ch[d^1];
      if (y && rank_diff(y) == 1) {
        // y is a 1-child. y must be the previous `x` in the last iteration.
        Perform a double rotation involving `p`, `x`, and `y`.
      } else {
        // Otherwise, y is null or a 2-child.
        Perform a single rotation involving `p` and `x`.
        x is now a 1,1 node and there is no more violation.
      }
      break;
    }

    // Otherwise, p is a 0,1 node. Goto the next iteration.
  }
}
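
The subroutine above can be turned into a small runnable version. This is a sketch under stated assumptions, not the implementation linked later in this post: it stores an explicit `rank` integer per node instead of flag bits, allocates nodes with `new`, sends equal keys to the right, and the names `Node`, `rotate`, and `insert` are hypothetical.

```cpp
#include <cassert>

// Hypothetical node layout: explicit ranks instead of packed flag bits.
struct Node {
  int key;
  int rank = 0;  // a new leaf has rank 0
  Node *ch[2] = {nullptr, nullptr};
  Node *par = nullptr;
  explicit Node(int k) : key(k) {}
};

int rank_of(const Node *n) { return n ? n->rank : -1; }

// Lift p->ch[d] above p. The caller is responsible for fixing ranks.
void rotate(Node *&root, Node *p, int d) {
  Node *x = p->ch[d], *g = p->par;
  p->ch[d] = x->ch[!d];
  if (x->ch[!d]) x->ch[!d]->par = p;
  x->ch[!d] = p;
  p->par = x;
  x->par = g;
  if (!g) root = x;
  else g->ch[g->ch[1] == p] = x;
}

void insert(Node *&root, int key) {
  Node *x = new Node(key), *p = nullptr;
  for (Node *c = root; c; c = c->ch[key >= c->key]) p = c;
  x->par = p;
  if (!p) { root = x; return; }
  p->ch[key >= p->key] = x;
  // Loop while x is a 0-child of p (a rank violation).
  while (p && p->rank == x->rank) {
    int d = p->ch[1] == x;
    if (p->rank - rank_of(p->ch[!d]) == 1) {
      // p is a 0,1 node: promote p and continue one level up.
      p->rank++;
      x = p;
      p = p->par;
      continue;
    }
    // p is a 0,2 node: one single or double rotation finishes the job.
    Node *y = x->ch[!d];
    if (y && x->rank - y->rank == 1) {
      rotate(root, x, !d);  // double rotation: y ends up above x and p
      rotate(root, p, d);
      y->rank++; x->rank--; p->rank--;
    } else {
      rotate(root, p, d);   // single rotation: x ends up above p
      p->rank--;
    }
    break;
  }
}
```

Inserting the keys 1 through 7 in increasing order yields a perfectly balanced tree rooted at 4 with rank 2, the same shape AVL insertion produces.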

Insertion never introduces a 2,2 node, so insertion-only sequences produce AVL trees.

Deletion

TODO: Describe deletion

if (!was_2 && !x && !p->ch[0] && !p->ch[1] && p->rp()) {
  // p was unary and becomes 2,2. Demote it.
}

Implementation

Since valid rank differences can only be 1 or 2, ranks can be encoded efficiently using bit flags. There are three approaches:

Store two bits representing the rank differences to each child. Bit 0: rank difference to the left child (1 = diff is 2, 0 = diff is 1). Bit 1: rank difference to the right child.
Store a single bit representing the parity (even/odd) of the node's absolute rank. The rank difference to a child is computed by comparing parities. Same parity → rank difference of 2. Different parity → rank difference of 1.
Store a 1-bit rank difference parity in each node.

FreeBSD's sys/tree.h (https://reviews.freebsd.org/D25480, 2020) uses the first approach. The rb_ prefix remains as it can also indicate Rank-Balanced :) Note: its insertion operation can be further optimized as the following code demonstrates.

https://github.com/pvachon/wavl_tree and https://crates.io/crates/wavltree use the second approach.
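
To make the second approach concrete, here is a minimal sketch of how a rank difference can be recovered from parity bits alone. The `Node` layout and the `odd` field are hypothetical and are not taken from either of the projects above. The sketch assumes the tree currently satisfies the weak AVL rule, since differences of 0 and 2 are indistinguishable by parity.

```cpp
#include <cassert>

// Hypothetical node: one bit stores the parity of the node's rank.
struct Node {
  Node *ch[2] = {nullptr, nullptr};
  bool odd = false;  // a leaf has rank 0, which is even
};

// Rank difference between p and its child in direction d, assuming the tree
// currently satisfies the weak AVL rule (all differences are 1 or 2).
int rank_diff(const Node *p, int d) {
  // A null node has rank -1, which is odd.
  bool child_odd = p->ch[d] ? p->ch[d]->odd : true;
  return p->odd == child_odd ? 2 : 1;
}
```

Because a null node always has odd parity, no sibling probe is needed for missing children.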

The third approach is less efficient because a null node can be either a 1-child (parent is binary) or a 2-child (parent is unary), requiring the sibling node to be probed to determine the rank difference:

int rank_diff(Node *p, int d) { return p->ch[d] ? p->ch[d]->par_and_flg & 1 : p->ch[!d] ? 2 : 1; }

https://maskray.me/blog/2025-12-14-weak-avl-tree is a C++ implementation covering both approaches, supporting the following operations:

insert: insert a node
remove: remove a node
rank: count elements less than a key
select: find the k-th smallest element (0-indexed)
prev: find the largest element less than a key
next: find the smallest element greater than a key

Node structure:

ch[2]: left and right child pointers.
par_and_flg: packs the parent pointer with 2 flag bits in the low bits. Bit 0 indicates whether the left child has rank difference 2; bit 1 indicates whether the right child has rank difference 2. A cleared bit means rank difference 1.
i: the key value.
sum, size: augmented data maintained by mconcat for order statistics operations.

Helper methods:

rd2(d): returns true if child d has rank difference 2.
flip(d): toggles the rank difference of child d between 1 and 2.
clr_flags(): sets both children to rank difference 1 (used after rotations to reset a node to 1,1).
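
The packed field and its helpers could look roughly like the following. This is a sketch reconstructed from the descriptions above rather than a copy of the linked implementation; `parent()`, `set_parent()`, and `flags()` are assumed accessor names, and the key and augmented fields are omitted.

```cpp
#include <cassert>
#include <cstdint>

struct Node {
  Node *ch[2] = {nullptr, nullptr};
  // Parent pointer in the high bits, two rank-difference flags in bits 0-1.
  std::uintptr_t par_and_flg = 0;

  Node *parent() const {
    return reinterpret_cast<Node *>(par_and_flg & ~std::uintptr_t(3));
  }
  void set_parent(Node *p) {
    par_and_flg = reinterpret_cast<std::uintptr_t>(p) | (par_and_flg & 3);
  }
  unsigned flags() const { return par_and_flg & 3; }
  // True if child d has rank difference 2.
  bool rd2(int d) const { return (par_and_flg >> d) & 1; }
  // Toggle the rank difference of child d between 1 and 2.
  void flip(int d) { par_and_flg ^= std::uintptr_t(1) << d; }
  // Reset both children to rank difference 1.
  void clr_flags() { par_and_flg &= ~std::uintptr_t(3); }
};

// The two low bits of a Node pointer must be zero for the packing to work.
static_assert(alignof(Node) >= 4, "two low pointer bits must be free");
```

Packing the flags into the parent pointer keeps the node at three words, the same size as a plain binary tree node with a parent pointer.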

Invariants:

Leaves always have flags() == 0, meaning both null children are 1-children (null nodes have rank -1, leaf has rank 0).
After each insertion or deletion, mconcat is called along the path to the root to update augmented data.

Rotations:

The rotate(x, d) function rotates node x in direction d. It lifts x->ch[d] to replace x, and updates the augmented data for x. The caller is responsible for updating rank differences.

Misc

Visualization: https://tjkendev.github.io/bst-visualization/avl-tree/bu-weak.html

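Sacramento travel notes

MaskRay

December 7, 2025 16:00

Over the weekend I went from the southern San Francisco Bay Area to visit Sacramento.

Saturday

On Saturday morning I visited the Crocker Art Museum, which was quite good. The museum is named after Edwin Bryant Crocker (the brother of Charles Crocker, one of the Big Four of the Central Pacific Railroad).

In the nineteenth century, Sacramento was known as the "Second Port" (二埠), while San Francisco was the "Big Port" (大埠). I was curious whether Sacramento still had a Chinatown. After a quick look around DOCO - Downtown Commons (a shopping mall), I followed the "Chinatown" signs north to the intersection of 4th St and J St. The tall building across the street, the Soo Yuen Benevolent Association (溯源堂), was not open. After crossing J St, I saw a gateway to the left of Soo Yuen inscribed with "沙加缅度华埠" (Sacramento Chinatown). Beyond the gateway is a plaza, where I saw no one.

In the 1950s and 1960s, the construction of Interstate 5 and urban renewal projects demolished part of Chinatown; I-5 now lies just west of the Chinatown remains. The surviving neighborhood is limited to the two blocks between J Street and I Street and between 3rd Street and 5th Street. Residents gradually moved out as well, and the place is now a ghost town, very deserted.

The place Google Maps shows as "Chinatown Mall" appears to be the remains of the Chinese Benevolent Association (中华会馆). It is unoccupied and cannot be entered. A Reddit post indicates that it was built in 1959 and is now abandoned. Another building is labeled "邓高密公所" and is also empty. The Sun Yat-sen Memorial Hall (中山纪念馆, 415 J St, apparently built in 1971) was the only open venue we found. It is open from 13:00 to 15:00 on Saturdays and Sundays, and displays a portrait of Sun Yat-sen, the US flag, and the flag of the Republic of China.

In the afternoon I visited the California State Railroad Museum. Parking nearby is all a flat rate of $20.

Chinese Railroad Workers Historic Photos & Painting Exhibition

The historical narrative felt like reliving The Iron Horse (1924 film).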

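Sunday

Arrived at the Leland Stanford Mansion at 11:57. There happened to be a tour at 12:00, so I was able to go in. Entry is only possible with the free guided tour.

The mansion was originally built in 1856. In 1861, Leland Stanford purchased the property. After being elected governor, he conducted official business in 1862 and 1863 in an addition built onto the house. After him, two more governors (the 9th and the 10th) also had their offices here.

In 1868, the Stanfords' only child, Leland Jr., was born here. The mansion was expanded in 1872. The Stanfords and their son lived here until 1876, when they moved to San Francisco (their San Francisco mansion was later destroyed in the 1906 San Francisco earthquake). In 1900, Leland Stanford's widow donated the mansion to the church for use as an orphanage. The orphanage used the site until 1978, when the state government purchased this historic property. The building was then renovated and eventually opened to the public as a museum.

As one of the Big Four of the Central Pacific Railroad, Leland built the western section of the First Transcontinental Railroad. There are twelve train emblems inside the mansion.

Stanford University's full name is Leland Stanford Junior University, in memory of Leland Jr., who died young while traveling in Europe.

In the afternoon I went to the California Museum, with free admission through Bank of America's Museums on Us program.

I saw descriptions of the building of the first transcontinental railroad, the 1942 incarceration of Japanese Americans (incidentally, today, December 7, is the 84th anniversary of Japan's attack on Pearl Harbor on December 7, 1941), and Chinatown. After reading about the town of Locke (乐居) at around 14:46, I decided to leave Sacramento immediately and drive to Locke.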

Locke

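Much of the following description is translated and summarized from a 2023 article by the Locke Foundation. https://locke-foundation.org/wp-content/uploads/2023/09/LF-newsletter-Fall-2023-final.pdf

Locke was founded in 1915 and is the largest and most complete surviving rural Chinese community in the United States. (It is the only remaining Chinatown in the Sacramento-San Joaquin Delta.) It is not a traditional "Chinatown" but an independent town built by Chinese for Chinese. Chinese immigrants came to California chasing the "Gold Mountain dream", but the Alien Land Law barred them from owning land, so they could only lease land to build houses. (Some commentators call it the only town in the United States founded by Chinese, inhabited exclusively by Chinese, and run by Chinese: an independent town rather than a neighborhood of a city.)

Locke later declined, possibly for the following reasons:

Chinese Exclusion Act (1882). The town could not be replenished by immigrants from overseas.
Economic shocks. The Great Depression of the 1930s hit the town first. After Prohibition ended in 1933, the pleasure seekers that Locke had attracted for gambling and entertainment under its nickname "California's Monte Carlo" disappeared.
Agricultural mechanization. Large-scale mechanization in the 1940s and 1950s cost many farm workers their jobs, and small tenant farmers also struggled to make a living. Demand for the manual labor that Chinese workers depended on dropped sharply.
Population outflow. After World War II, many young Chinese Americans left the town for cities in search of better economic opportunities.
Land ownership. Because California's 1913 Alien Land Law barred Asians from buying land, the Chinese could only lease land from the George Locke family. Although the law was ruled unconstitutional in 1952, Locke's residents were never able to buy the land under their houses.

In 1976, the family of Hong Kong businessman Clarence Chu purchased the Locke estate. He speaks the Zhongshan dialect and built trust with longtime residents. He eventually pushed the county to create a land subdivision, allowing residents to buy the land under their houses at very low prices (only $3,000-5,000 per lot). In 2004, a nearly century-old historical injustice was finally corrected.

The Locke management company (LMC) handles the town's day-to-day management and acts as the homeowners' association for the whole town. The Locke Foundation (LF) focuses on education, preservation, and promotion: it runs the museums, hosts festivals, awards scholarships, and records oral histories.

https://locke-foundation.org/locke-museums/ lists four buildings and a park.

Dai Loy Gambling House (大来赌场). The only surviving building among the eight former gambling houses. It closed in 1951.
Boarding House Museum (寄宿公寓)
Jan Ying Chinese Association Museum (俊英工商会)
Joe Shoong Chinese School (中文学校)
Memorial Park

The closing times on Google Maps are inaccurate; all four buildings close at 16:00. I later learned from the video https://www.youtube.com/watch?v=wtzcOgaMYcQ that the staff member locking up was actually the owner of these buildings, Clarence Chu! He purchased the town of Locke from the George Locke family in 1976.

Dai Loy Gambling House

Boarding House

Jan Ying Chinese Association

Chinese School

Memorial Park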

Stack walking: space and time trade-offs

MaskRay

October 26, 2025 15:00

On most Linux platforms (except AArch32, which uses .ARM.exidx), DWARF .eh_frame is required for C++ exception handling and stack unwinding to restore callee-saved registers. While .eh_frame can be used for call trace recording, it is often criticized for its runtime overhead. As an alternative, developers can enable frame pointers, or adopt SFrame, a newer format designed specifically for profiling. This article examines the size overhead of enabling non-DWARF stack walking mechanisms when building several LLVM executables.

Runtime performance analysis will be added in a future update.

Stack walking mechanisms

Here is a survey of mechanisms available for x86-64:

Frame pointers: Fast and simple, but costs a register.
DWARF .eh_frame: Comprehensive but slower. It supports additional features such as C++ exception handling.
SFrame: A new experimental format that only supports profiling. .eh_frame remains necessary for debugging and C++ exception handling. Check out Remarks on SFrame for details.
LLVM's Compact Unwinding Format: A highly space-efficient format, implemented by Apple for Mach-O binaries. It has LLVM, lld/MachO, and libunwind implementations and supports x86-64 and AArch64. It can mostly replace DWARF CFI, though some entries need a DWARF escape (the __eh_frame section would be tiny). OpenVMS modified it for their x86-64 port.
x86 Last Branch Record (LBR): A hardware feature that captures a limited history of the most recent branches (up to 32 on Skylake+). When configured to track branches for SamplePGO, the limited depth means it won't reliably capture deep stack traces. Traditionally Intel only, but AMD Zen 4 has since implemented Last Branch Record Extension Version 2 (LbrExtV2).
Control-flow Enforcement Technology (CET) Shadow Stack: This hardware security hardening feature can be used to get stack traces. While it introduces some overhead, it offers the flexibility of process-specific enablement.

Space overhead analysis

Frame pointer size impact

For most architectures, GCC defaults to -fomit-frame-pointer in -O compilation to free up a register for general use. To enable frame pointers, specify -fno-omit-frame-pointer, which reserves the frame pointer register (e.g., rbp on x86-64) and emits push/pop instructions in function prologues/epilogues.

For leaf functions (those that don't call other functions), while the frame pointer register should still be reserved for consistency, the push/pop operations are often unnecessary. Compilers provide -momit-leaf-frame-pointer (with target-specific defaults) to reduce code size.

The viability of this optimization depends on the target architecture:

On AArch64, the return address is available in the link register (X30). The immediate caller can be retrieved by inspecting X30, so -momit-leaf-frame-pointer does not compromise unwinding.
On x86-64, after the prologue instructions execute, the return address is stored at RSP plus an offset. An unwinder needs to know the stack frame size to retrieve the return address, or it must utilize DWARF information for the leaf frame and then switch to the FP chain for parent frames.

Beyond this architectural consideration, there are additional practical reasons to use -momit-leaf-frame-pointer on x86-64:

Many hand-written assembly implementations (including numerous glibc functions) don't establish frame pointers, creating gaps in the frame pointer chain anyway.
In the prologue sequence push rbp; mov rbp, rsp, after the first instruction executes, RBP does not yet reference the current stack frame. When shrink-wrapping optimizations are enabled, the instruction region where RBP still holds the old value becomes larger, increasing the window where the frame pointer is unreliable.

Given these trade-offs, three common configurations have emerged:

omitting FP: -fomit-frame-pointer -momit-leaf-frame-pointer (smallest overhead)
reserving FP, but removing FP push/pop for leaf functions: -fno-omit-frame-pointer -momit-leaf-frame-pointer (frame pointer chain omitting the leaf frame)
reserving FP: -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer (complete frame pointer chain, largest overhead)

The size impact varies significantly by program. Here's a Ruby script section_size.rb that compares section sizes:

% ~/Dev/object-file-size-analyzer/section_size.rb /tmp/out/custom-{none,nonleaf,all}/bin/{llvm-mc,opt}
Filename                            |       .text size |        EH size |  VM size | VM increase
------------------------------------+------------------+----------------+----------+------------
/tmp/out/custom-none/bin/llvm-mc    |  2114687 (23.7%) |  367992 (4.1%) |  8914057 |           -
/tmp/out/custom-nonleaf/bin/llvm-mc |  2124143 (24.0%) |  301688 (3.4%) |  8856713 |       -0.6%
/tmp/out/custom-all/bin/llvm-mc     |  2149535 (24.0%) |  362408 (4.1%) |  8942729 |       +0.3%
/tmp/out/custom-none/bin/opt        | 39018511 (70.2%) | 4561112 (8.2%) | 55583965 |           -
/tmp/out/custom-nonleaf/bin/opt     | 38879897 (71.4%) | 3542288 (6.5%) | 54424789 |       -2.1%
/tmp/out/custom-all/bin/opt         | 38980905 (71.0%) | 3888624 (7.1%) | 54871285 |       -1.3%

For instance, llvm-mc is dominated by read-only data, making the relative .text percentage quite small, so the frame pointer impact on the VM size is minimal. ("VM size" is a metric used by bloaty, representing the total p_memsz size of PT_LOAD segments, excluding alignment padding.) As expected, the .text section of llvm-mc grows larger as more functions set up the frame pointer chain. However, opt actually becomes smaller when -fno-omit-frame-pointer is enabled, a counterintuitive result that warrants explanation.

Without a frame pointer, the compiler uses RSP-relative addressing to access stack objects. When using the register-indirect + disp8/disp32 addressing mode, RSP needs an extra SIB byte while RBP doesn't. For larger functions accessing many local variables, the savings from shorter RBP-relative encodings can outweigh the additional push rbp; mov rbp, rsp; pop rbp instructions in the prologues/epilogues.

% echo 'mov rax, [rsp+8]; mov rax, [rbp-8]' | /tmp/Rel/bin/llvm-mc -x86-asm-syntax=intel -output-asm-variant=1 -show-encoding
        mov     rax, qword ptr [rsp + 8]        # encoding: [0x48,0x8b,0x44,0x24,0x08]
        mov     rax, qword ptr [rbp - 8]        # encoding: [0x48,0x8b,0x45,0xf8]

# ModR/M byte 0x44: Mod=01 (register-indirect addressing + disp8), Reg=0 (dest reg RAX), R/M=100 (SIB byte follows)
# ModR/M byte 0x45: Mod=01 (register-indirect addressing + disp8), Reg=0 (dest reg RAX), R/M=101 (RBP)
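
The encoding rule can be sketched as a small length calculator. This is an illustrative helper, not code from LLVM. It assumes a 64-bit load `mov r64, [base + disp]` with no index register, where `base` is the hardware register number (RSP is 4, RBP is 5); the function name `mov_load_length` is hypothetical.

```cpp
#include <cassert>

// x86-64 hardware register numbers of the two base registers of interest.
constexpr int RSP = 4, RBP = 5;

// Encoded length in bytes of `mov r64, [base + disp]` with no index register.
int mov_load_length(int base, int disp) {
  int len = 3;                    // REX.W prefix + opcode 0x8B + ModR/M
  if ((base & 7) == 4) len += 1;  // R/M=100 means a SIB byte follows (RSP, R12)
  if (disp == 0 && (base & 7) != 5) {
    // Mod=00: no displacement byte. RBP and R13 cannot use this form.
  } else if (disp >= -128 && disp <= 127) {
    len += 1;  // Mod=01: disp8
  } else {
    len += 4;  // Mod=10: disp32
  }
  return len;
}
```

For the two instructions shown above, `mov_load_length(RSP, 8)` gives 5 and `mov_load_length(RBP, -8)` gives 4, matching the encodings printed by llvm-mc.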

SFrame vs .eh_frame

Oracle is advocating for SFrame adoption in Linux distributions. The SFrame implementation is handled by the assembler and linker rather than the compiler. Let's build the latest binutils-gdb to test it.

Building test program

We'll use the clang compiler from https://github.com/llvm/llvm-project/tree/release/21.x as our test program.

There are still issues related to garbage collection (object file format design issue), so I'll just disable -Wl,--gc-sections.

--- i/llvm/cmake/modules/AddLLVM.cmake
+++ w/llvm/cmake/modules/AddLLVM.cmake
@@ -331,4 +331,4 @@ function(add_link_opts target_name)
         # TODO Revisit this later on z/OS.
-        set_property(TARGET ${target_name} APPEND_STRING PROPERTY
-                     LINK_FLAGS " -Wl,--gc-sections")
+        #set_property(TARGET ${target_name} APPEND_STRING PROPERTY
+        #             LINK_FLAGS " -Wl,--gc-sections")
       endif()

configure-llvm custom-sframe -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang' -DLLVM_ENABLE_UNWIND_TABLES=on -DLLVM_ENABLE_LLD=off -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-fuse-ld=bfd -DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc -DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++ -DCMAKE_C_FLAGS="-B$HOME/opt/binutils/bin -Wa,--gsframe" -DCMAKE_CXX_FLAGS="-B$HOME/opt/binutils/bin -Wa,--gsframe"
ninja -C /tmp/out/custom-sframe clang

% ~/Dev/bloaty/out/release/bloaty /tmp/out/custom-sframe/bin/clang
    FILE SIZE        VM SIZE
 --------------  --------------
  63.9%  88.0Mi  73.9%  88.0Mi    .text
  11.1%  15.2Mi   0.0%       0    .strtab
   7.2%  9.96Mi   8.4%  9.96Mi    .rodata
   6.4%  8.87Mi   7.5%  8.87Mi    .sframe
   5.1%  7.07Mi   5.9%  7.07Mi    .eh_frame
   2.9%  3.96Mi   0.0%       0    .symtab
   1.4%  1.98Mi   1.7%  1.98Mi    .data.rel.ro
   0.9%  1.23Mi   1.0%  1.23Mi    [LOAD #4 [R]]
   0.7%   999Ki   0.8%   999Ki    .eh_frame_hdr
   0.0%       0   0.5%   614Ki    .bss
   0.2%   294Ki   0.2%   294Ki    .data
   0.0%  23.1Ki   0.0%  23.1Ki    .rela.dyn
   0.0%  8.99Ki   0.0%  8.99Ki    .dynstr
   0.0%  8.77Ki   0.0%  8.77Ki    .dynsym
   0.0%  7.24Ki   0.0%  7.24Ki    .rela.plt
   0.0%  6.73Ki   0.0%       0    [Unmapped]
   0.0%  6.29Ki   0.0%  3.84Ki    [21 Others]
   0.0%  4.84Ki   0.0%  4.84Ki    .plt
   0.0%  3.36Ki   0.0%  3.30Ki    .init_array
   0.0%  2.50Ki   0.0%  2.50Ki    .hash
   0.0%  2.44Ki   0.0%  2.44Ki    .got.plt
 100.0%   137Mi 100.0%   119Mi    TOTAL
% ~/Dev/object-file-size-analyzer/eh_size.rb /tmp/out/custom-sframe/bin/clang
clang: sframe=9303875 eh_frame=7408976 eh_frame_hdr=1023004 eh=8431980 sframe/eh_frame=1.2558 sframe/eh=1.1034

The results show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (7.07 + 0.98 = 8.04 MiB). While SFrame is designed for efficiency during stack walking, it carries a non-trivial space overhead compared to traditional DWARF unwind information.
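
The ratio can be recomputed from the byte counts that eh_size.rb printed above. The helper names below are hypothetical; this is only a sketch of the arithmetic, not part of the Ruby script.

```cpp
#include <cassert>
#include <cmath>

// Ratio of the .sframe size to the combined .eh_frame + .eh_frame_hdr size.
double sframe_to_eh_ratio(long long sframe, long long eh_frame,
                          long long eh_frame_hdr) {
  return static_cast<double>(sframe) /
         static_cast<double>(eh_frame + eh_frame_hdr);
}

// Convert a byte count to MiB (2^20 bytes).
double to_mib(long long bytes) { return bytes / 1048576.0; }
```

With sframe=9303875, eh_frame=7408976, and eh_frame_hdr=1023004, the ratio is about 1.1034, which is the sframe/eh value reported by the script.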

SFrame vs FP

Having examined SFrame's overhead compared to .eh_frame, let's now compare the two primary approaches for non-hardware-assisted stack walking.

Frame pointer approach: Reserve FP but omit push/pop for leaf functions: g++ -fno-omit-frame-pointer -momit-leaf-frame-pointer
SFrame approach: Omit FP and use SFrame metadata: g++ -fomit-frame-pointer -momit-leaf-frame-pointer -Wa,--gsframe

To conduct a fair comparison, we build LLVM executables using both approaches with both Clang and GCC compilers. The following script configures and builds test binaries with each combination:

#!/bin/zsh
conf() {
  configure-llvm $1 -DCMAKE_EXE_LINKER_FLAGS='-fuse-ld=bfd -pie -Wl,-z,pack-relative-relocs' \
    -DCMAKE_SHARED_LINKER_FLAGS=-fuse-ld=bfd -DLLVM_ENABLE_UNWIND_TABLES=on -DLLVM_ENABLE_LLD=off ${@:2}
}

clang=(-DCMAKE_CXX_COMPILER=/tmp/Rel/bin/clang++ -DCMAKE_C_COMPILER=/tmp/Rel/bin/clang)
gcc=("-DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc" "-DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++")

compact="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -mllvm -elf-compact-unwind -mllvm -x86-epilog-cfi=0"
fp="-fno-omit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe=no"
sframe="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe"

conf custom-compact -DCMAKE_{C,CXX}_FLAGS="$compact" ${clang[@]} \
  -DCMAKE_EXE_LINKER_FLAGS='-fuse-ld=lld -pie -Wl,-z,pack-relative-relocs' \
  -DCMAKE_SHARED_LINKER_FLAGS=-fuse-ld=lld

conf custom-fp -DCMAKE_{C,CXX}_FLAGS="-fno-integrated-as $fp" ${clang[@]}
conf custom-sframe -DCMAKE_{C,CXX}_FLAGS="-fno-integrated-as $sframe" ${clang[@]}

conf custom-fp-gcc -DCMAKE_{C,CXX}_FLAGS="$fp" ${gcc[@]}
conf custom-sframe-gcc -DCMAKE_{C,CXX}_FLAGS="$sframe" ${gcc[@]}

for i in compact fp sframe  fp-gcc sframe-gcc; do ninja -C /tmp/out/custom-$i llvm-mc opt; done

The results reveal interesting differences between compiler implementations:

% ~/Dev/object-file-size-analyzer/section_size.rb /tmp/out/custom-{fp,sframe,compact,fp-gcc,sframe-gcc}/bin/{llvm-mc,opt}
Filename                               |       .text size |        EH size |   .sframe size |  VM size | VM increase
---------------------------------------+------------------+----------------+----------------+----------+------------
/tmp/out/custom-fp/bin/llvm-mc         |  2120895 (23.5%) |  301528 (3.3%) |       0 (0.0%) |  9043221 |           -
/tmp/out/custom-sframe/bin/llvm-mc     |  2109231 (22.3%) |  367424 (3.9%) |  348041 (3.7%) |  9474085 |       +4.8%
/tmp/out/custom-compact/bin/llvm-mc    |  2109519 (24.4%) |  106288 (1.2%) |       0 (0.0%) |  8639637 |       -4.5%
/tmp/out/custom-fp-gcc/bin/llvm-mc     |  2744214 (29.2%) |  301836 (3.2%) |       0 (0.0%) |  9389677 |       +3.8%
/tmp/out/custom-sframe-gcc/bin/llvm-mc |  2705860 (27.7%) |  354292 (3.6%) |  356073 (3.6%) |  9780985 |       +8.2%
/tmp/out/custom-fp/bin/opt             | 38769545 (69.9%) | 3547688 (6.4%) |       0 (0.0%) | 55425217 |           -
/tmp/out/custom-sframe/bin/opt         | 38891295 (62.4%) | 4559644 (7.3%) | 4448874 (7.1%) | 62292133 |      +12.4%
/tmp/out/custom-compact/bin/opt        | 38898415 (74.8%) | 1200764 (2.3%) |       0 (0.0%) | 52020449 |       -6.1%
/tmp/out/custom-fp-gcc/bin/opt         | 54654215 (78.1%) | 3631196 (5.2%) |       0 (0.0%) | 70001373 |      +26.3%
/tmp/out/custom-sframe-gcc/bin/opt     | 53644895 (70.4%) | 4857364 (6.4%) | 5263676 (6.9%) | 76206149 |      +37.5%

% ruby ~/Dev/object-file-size-analyzer/eh_size.rb  /tmp/out/custom-compact/bin/opt
opt: sframe=0 eh_frame=267008 eh_frame_hdr=933756 eh=1200764 sframe/eh_frame=0.0 sframe/eh=0.0
% ruby ~/Dev/object-file-size-analyzer/eh_size.rb  /tmp/out/custom-sframe/bin/opt
opt: sframe=4448874 eh_frame=3938448 eh_frame_hdr=621196 eh=4559644 sframe/eh_frame=1.1296 sframe/eh=0.9757

% ~/Dev/object-file-size-analyzer/section_size.rb /tmp/out/custom-{fp-sync,sframe-sync,compact-sync}/bin/{llvm-mc,opt}
Filename                                 |       .text size |        EH size |   .sframe size |  VM size | VM increase
-----------------------------------------+------------------+----------------+----------------+----------+------------
/tmp/out/custom-fp-sync/bin/llvm-mc      |  2120895 (24.1%) |  263396 (3.0%) |       0 (0.0%) |  8802093 |           -
/tmp/out/custom-sframe-sync/bin/llvm-mc  |  2109231 (23.2%) |  291084 (3.2%) |  248654 (2.7%) |  9090325 |       +3.3%
/tmp/out/custom-compact-sync/bin/llvm-mc |  2109519 (24.4%) |  106288 (1.2%) |       0 (0.0%) |  8639637 |       -1.8%
/tmp/out/custom-fp-sync/bin/opt          | 38769545 (72.2%) | 2997572 (5.6%) |       0 (0.0%) | 53706041 |           -
/tmp/out/custom-sframe-sync/bin/opt      | 38891295 (66.9%) | 3425116 (5.9%) | 2951292 (5.1%) | 58091421 |       +8.2%
/tmp/out/custom-compact-sync/bin/opt     | 38898415 (74.8%) | 1200764 (2.3%) |       0 (0.0%) | 52020449 |       -3.1%

SFrame incurs a significant VM size increase.
GCC-built binaries are significantly larger than their Clang counterparts, probably due to more aggressive inlining or vectorization strategies.
/tmp/out/custom-compact has significantly smaller EH size. See details below.

With Clang-built binaries, the frame pointer configuration produces a smaller opt executable (VM size 55.4 MB) compared to the SFrame configuration (62.3 MB). This reinforces our earlier observation that RBP addressing can be more compact than RSP-relative addressing for large functions with frequent local variable accesses.

Assembly comparison reveals that functions using RBP and RSP addressing produce quite similar code.

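GCC-built binaries show the same ordering in VM size: the frame pointer version of opt (70.0 MB) is smaller than the SFrame version (76.2 MB). The .text sections, however, show the opposite trend from Clang: the GCC frame pointer build has a larger .text (54654215 bytes) than the GCC SFrame build (53644895 bytes).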

The generated assembly differs significantly between omit-FP and non-omit-FP builds. I have compared symbol sizes between two GCC builds.

nvim -d =(/tmp/Rel/bin/llvm-nm -U --size-sort /tmp/out/custom-fp-gcc/bin/llvm-mc) =(/tmp/Rel/bin/llvm-nm -U --size-sort /tmp/out/custom-sframe-gcc/bin/llvm-mc)

Many functions, such as _ZN4llvm15ELFObjectWriter24executePostLayoutBindingEv, have significantly more instructions in the keep-FP build. This suggests that GCC's frame pointer code generation may not be as optimized as its default omit-FP path.

The /tmp/out/custom-compact build uses my llvm-project branch (http://github.com/MaskRay/llvm-project/tree/demo-unwind) that ports Mach-O compact unwind to ELF, allowing the majority of .eh_frame FDEs to replace CFI instructions with unwind descriptors. Linker behavior:

Split FDEs into two groups: descriptor-based (augmentation 'C') and instruction-based
Generate .eh_frame_hdr version 2 with 12-byte table entries when compact FDEs are present: (pc_ptr, unwind_descriptor_or_fde_ptr). Compact FDEs described by .eh_frame_hdr inline are removed from the output .eh_frame section.

Note: .ARM.exidx and MIPS compact exception tables also describe unwind descriptors inline in a binary search index table.

FDEs not representable by compact unwind (e.g. shrink wrapping optimization) use the traditional CFI instructions (called DWARF escape in the Mach-O compact unwind information).

This implementation involves several components:

-mllvm -elf-compact-unwind: Emits .eh_frame CIEs with augmentation character 'C' and FDEs using unwind descriptors.
-mllvm -x86-epilog-cfi=0: Disables epilogue CFI for x86 (primarily implemented by D42848 in 2018, notably disabled for Darwin and Windows). Without this option most frames will not utilize unwind descriptors because the current Mach-O compact unwind implementation does not support popq %rbp; .cfi_def_cfa %rsp, 8; ret. I believe this is still fair as we expect to use an 8-byte descriptor, sufficient to describe epilogue CFI.
lld/ELF changes: FDEs are split into descriptor-based (augmentation 'C') and CFI-instruction-based groups. When compact FDEs are present, .eh_frame_hdr version 2 is generated with 12-byte table entries containing (pc_ptr, unwind_descriptor_or_fde_ptr). The PC pointer remains 4 bytes, while the 8-byte entry indicates either an unwind descriptor (odd value) or an FDE pointer (even value).
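
A reader of such a table might decode the 12-byte entries along these lines. This is a minimal sketch based only on the description above: the names `TableEntry`, `read_entry`, and `is_unwind_descriptor` are hypothetical, the fields are read in host byte order, and how `pc` and `value` are relocated or interpreted beyond the odd/even tag is left out because the text does not specify it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Each table entry is 12 bytes: a 4-byte PC field followed by an 8-byte value.
constexpr std::size_t kEntrySize = 12;

struct TableEntry {
  std::int32_t pc;      // 4-byte PC pointer
  std::uint64_t value;  // unwind descriptor (odd) or FDE pointer (even)
};

// Read the i-th entry from a packed table. memcpy avoids unaligned access,
// since 12-byte entries do not keep the 8-byte field naturally aligned.
TableEntry read_entry(const unsigned char *table, std::size_t i) {
  TableEntry e;
  std::memcpy(&e.pc, table + i * kEntrySize, sizeof e.pc);
  std::memcpy(&e.value, table + i * kEntrySize + 4, sizeof e.value);
  return e;
}

// Odd values are unwind descriptors; even values are FDE pointers.
bool is_unwind_descriptor(const TableEntry &e) { return (e.value & 1) != 0; }
```

Tagging with the low bit works because an FDE pointer is even in this scheme, so an unwinder can tell the two kinds of entries apart without a separate flag field.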

With the current implementation, 4937 out of 77648 FDEs (6.36%) require a DWARF escape, while the remaining FDEs can be replaced with unwind descriptors.

.eh_frame_hdr will become even smaller if we implement the two-level page table structure in Mach-O __unwind_info.

Runtime performance analysis

TODO

perf record overhead with EH

perf record overhead with FP

Here is a benchmark run from llvm-compile-time-tracker.com.

The stable2-O3 benchmark is relevant. When enabling FP for non-leaf functions, the instructions:u metric increases by 2.44% while wall-time (a noisy metric) increases by just 0.56%.

Summary

This article examines the space overhead of different stack walking mechanisms when building LLVM executables.

Frame pointer configurations: Enabling frame pointers (-fno-omit-frame-pointer) can paradoxically reduce x86-64 binary size when stack object accesses are frequent. This occurs because RBP-relative addressing produces more compact encodings than RSP-relative addressing, which requires an extra SIB byte. The savings from shorter instructions can outweigh the prologue/epilogue overhead.

SFrame vs .eh_frame: For the x86-64 clang executable, SFrame metadata is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr. Given the significant VM size overhead and the lack of clear advantages over established alternatives, I am skeptical about SFrame's viability as the future of stack walking for userspace programs. While SFrame will receive a major revision V3 in the upcoming months, it needs to achieve substantial size reductions comparable to existing compact unwinding schemes to justify its adoption over frame pointers. I hope interested folks can implement something similar to macOS's compact unwind descriptors (with x86-64 support) and OpenVMS's.

ELF compact unwind: My prototype porting Mach-O compact unwind to ELF demonstrates significant promise. The approach reduces VM size by 4.5-6.1% compared to frame pointers, achieving the smallest binaries in my benchmarks. By replacing verbose CFI instructions with 8-byte unwind descriptors (with DWARF escape for complex cases like shrink-wrapped functions), .eh_frame shrinks dramatically: only 6.36% of FDEs require the traditional CFI format. This approach, once completed, offers a compelling alternative to SFrame: better compression, compatibility with existing .eh_frame infrastructure, and a clear path to implementation.

LLVM community: I need your support. I've raised technical objections to the SFrame RFC as maintainer. Some engineers dismissed them. Now they're escalating to Project Council to override technical review. This looks OKR-driven, not merit-driven.

GCC's frame pointer code generation appears less optimized than its default omit-frame-pointer path, as evidenced by substantial differences in generated assembly.

Runtime performance analysis remains to be conducted to complete the trade-off evaluation.

Appendix: `configure-llvm`

This script specifies common options when configuring llvm-project: https://github.com/MaskRay/Config/blob/master/home/bin/configure-llvm

-DCMAKE_CXX_ARCHIVE_CREATE="$HOME/Stable/bin/llvm-ar qc --thin <TARGET> <OBJECTS>" -DCMAKE_CXX_ARCHIVE_FINISH=:: Use thin archives to reduce disk usage
-DLLVM_TARGETS_TO_BUILD=host: Build a single target
-DCLANG_ENABLE_OBJC_REWRITER=off -DCLANG_ENABLE_STATIC_ANALYZER=off: Disable less popular components
-DLLVM_ENABLE_PLUGINS=off -DCLANG_PLUGIN_SUPPORT=off: Disable -Wl,--export-dynamic, preventing large .dynsym and .dynstr sections

Appendix: My SFrame build

mkdir -p out/release && cd out/release
../../configure --prefix=$HOME/opt/binutils --disable-multilib
make -j $(nproc) all-ld all-binutils all-gas
make -j $(nproc) install-ld install-binutils install-gas

gcc -B$HOME/opt/binutils/bin and clang -B$HOME/opt/binutils/bin -fno-integrated-as will use as and ld from the install directory.

Appendix: Scripts

Ruby scripts used by this post are available at https://github.com/MaskRay/object-file-size-analyzer/

Remarks on SFrame

MaskRay

September 28, 2025 15:00

SFrame is a new stack walking format for userspace profiling, inspired by Linux's in-kernel ORC unwind format. While SFrame eliminates some .eh_frame CIE/FDE overhead, it sacrifices functionality (e.g., personality, LSDA, callee-saved registers) and flexibility, and its stack offsets are less compact than .eh_frame's bytecode-style CFI instructions. In llvm-project executables I've tested on x86-64, the .sframe section is 20% larger than .eh_frame. It also remains significantly larger than highly compact schemes like Windows ARM64 unwind codes.

SFrame describes three elements for each function:

Canonical Frame Address (CFA): The base address for stack frame calculations
Return address
Frame pointer

An .sframe section follows a straightforward layout:

Header: Contains metadata and offset information
Auxiliary header (optional): Reserved for future extensions
Function Descriptor Entries (FDEs): Array describing each function
Frame Row Entries (FREs): Arrays of unwinding information per function

struct [[gnu::packed]] sframe_header {
  struct {
    uint16_t sfp_magic;
    uint8_t sfp_version;
    uint8_t sfp_flags;
  } sfh_preamble;
  uint8_t sfh_abi_arch;
  int8_t sfh_cfa_fixed_fp_offset;
  // Used by x86-64 to define the return address slot relative to CFA
  int8_t sfh_cfa_fixed_ra_offset;
  // Size in bytes of the auxiliary header, allowing extensibility
  uint8_t sfh_auxhdr_len;
  // Numbers of FDEs and FREs
  uint32_t sfh_num_fdes;
  uint32_t sfh_num_fres;
  // Size in bytes of FREs
  uint32_t sfh_fre_len;
  // Offsets in bytes of FDEs and FREs
  uint32_t sfh_fdeoff;
  uint32_t sfh_freoff;
};

While magic numbers are a popular choice for file formats, they deviate from established ELF conventions, which simply use the section type for distinction.

The version field resembles similar uses within DWARF section headers. SFrame will likely evolve over time, unlike ELF's more stable control structures. This means we'll probably need to keep producers and consumers evolving in lockstep, which creates a stronger case for internal versioning. An internal version field would allow linkers to upgrade or ignore unsupported low-version input pieces, providing more flexibility in handling version mismatches.

Data structures

Function Descriptor Entries (FDEs)

Function Descriptor Entries serve as the bridge between functions and their unwinding information. Each FDE describes a function's location and provides a direct link to its corresponding Frame Row Entries (FREs), which contain the actual unwinding data.

struct [[gnu::packed]] sframe_func_desc_entry {
  int32_t sfde_func_start_address;
  uint32_t sfde_func_size;
  uint32_t sfde_func_start_fre_off;
  uint32_t sfde_func_num_fres;
  // bits 0-3 fretype: sfre_start_address type
  // bit 4 fdetype: SFRAME_FDE_TYPE_PCINC or SFRAME_FDE_TYPE_PCMASK
  // bit 5 pauth_key: (AArch64 only) the signing key for the return address
  uint8_t sfde_func_info;
  // The size of the repetitive code block for SFRAME_FDE_TYPE_PCMASK; used by .plt
  uint8_t sfde_func_rep_size;
  uint16_t sfde_func_padding2;
};

The current design has room for optimization. The sfde_func_num_fres field uses a full 32 bits, which is wasteful for most functions. We could use uint16_t instead, requiring exceptionally large functions to be split across multiple FDEs.

It's important to note that SFrame's function concept represents code ranges rather than logical program functions. This distinction becomes particularly relevant with compiler optimizations like hot-cold splitting, where a single logical function may span multiple non-contiguous code ranges, each requiring its own FDE.

The padding field sfde_func_padding2 represents unnecessary overhead in modern architectures where unaligned memory access performs efficiently, making the alignment benefits negligible.

To enable binary search on sfde_func_start_address, FDEs must maintain a fixed size, which precludes the use of variable-length integer encodings like PrefixVarInt.
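The lookup that fixed-size FDEs enable might look like the sketch below. The Fde struct is a simplified stand-in that keeps only the two fields needed for lookup, findFde is a hypothetical name, and the start addresses are assumed to be already resolved into the same address space as pc.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified FDE: only the fields needed for lookup. `start` is assumed to be
// already resolved so that it is directly comparable with `pc`.
struct Fde {
  int32_t start;
  uint32_t size;
};

// Returns the index of the FDE covering pc, or -1 if none does.
// Requires fdes to be sorted by start.
inline int findFde(const std::vector<Fde> &fdes, int32_t pc) {
  // Find the first FDE whose start is greater than pc.
  auto it = std::upper_bound(
      fdes.begin(), fdes.end(), pc,
      [](int32_t addr, const Fde &f) { return addr < f.start; });
  if (it == fdes.begin())
    return -1;
  --it;
  // Use 64-bit arithmetic so that start + size cannot overflow.
  if (int64_t(pc) >= int64_t(it->start) + it->size)
    return -1;
  return int(it - fdes.begin());
}
```

Because every entry has the same size, the array can be indexed directly inside the mapped section; a variable-length encoding would require a separate index or a linear scan.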

Frame Row Entries (FREs)

Frame Row Entries contain the actual unwinding information for specific program counter ranges within a function. The template design allows for different address sizes based on the function's characteristics.

template <class AddrType>
struct [[gnu::packed]] sframe_frame_row_entry {
  // If the fdetype is SFRAME_FDE_TYPE_PCINC, this is an offset relative to sfde_func_start_address
  AddrType sfre_start_address;
  // bit 0 fre_cfa_base_reg_id: define BASE_REG as either FP or SP
  // bits 1-4 fre_offset_count: typically 1 to 3, describing CFA, FP, and RA
  // bits 5-6 fre_offset_size: byte size of offset entries (1, 2, or 4 bytes)
  sframe_fre_info sfre_info;
};

Each FRE contains variable-length stack offsets stored as trailing data. The fre_offset_size field determines whether offsets use 1, 2, or 4 bytes (int8_t, int16_t, or int32_t; the offsets are signed because FP and RA slots typically lie below the CFA), allowing optimal space usage based on stack frame sizes.
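A decoder for the info byte, following the bit layout in the struct comments above, could be sketched as follows. FreInfo, decodeFreInfo, and freSize are hypothetical names, and the mapping of size codes 0, 1, 2 to 1, 2, 4 bytes is an assumption about the encoding.

```cpp
#include <cstddef>
#include <cstdint>

struct FreInfo {
  unsigned baseRegId;    // bit 0: selects FP or SP as BASE_REG
  unsigned offsetCount;  // bits 1-4: number of trailing offsets
  unsigned offsetBytes;  // decoded from bits 5-6: 1, 2, or 4; 0 if invalid
};

inline FreInfo decodeFreInfo(uint8_t info) {
  FreInfo r;
  r.baseRegId = info & 1;
  r.offsetCount = (info >> 1) & 0xf;
  unsigned sizeCode = (info >> 5) & 3;
  // Assumed encoding: 0 -> 1 byte, 1 -> 2 bytes, 2 -> 4 bytes.
  r.offsetBytes = sizeCode < 3 ? 1u << sizeCode : 0;
  return r;
}

// Total size of one FRE: start address field + info byte + trailing offsets.
inline std::size_t freSize(std::size_t addrBytes, uint8_t info) {
  FreInfo r = decodeFreInfo(info);
  return addrBytes + 1 + std::size_t(r.offsetCount) * r.offsetBytes;
}
```

Since FREs have variable length, a consumer has to walk them sequentially from sfde_func_start_fre_off, advancing by freSize each time.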

Architecture-specific stack offsets

SFrame adapts to different processor architectures by varying its offset encoding to match their respective calling conventions and architectural constraints.

x86-64

The x86-64 implementation takes advantage of the architecture's predictable stack layout:

First offset: Encodes CFA as BASE_REG + offset
Second offset (if present): Encodes FP as CFA + offset
Return address: Computed implicitly as CFA + sfh_cfa_fixed_ra_offset (using the header field)
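The three x86-64 rules above can be sketched as a single function. Regs, Frame, and recoverFrame are hypothetical names, and the function returns the addresses of the stack slots rather than reading memory, so that it stays self-contained.

```cpp
#include <cstdint>

struct Regs {
  uint64_t sp, fp;
};

struct Frame {
  uint64_t cfa;
  uint64_t raSlot;  // address of the slot holding the return address
  bool hasFpSlot;   // false if the FRE has no second offset
  uint64_t fpSlot;  // address of the slot holding the saved FP
};

// baseIsFp: BASE_REG is FP when true, SP otherwise.
// cfaOffset: first offset. fpOffset: second offset, used if hasFpOffset.
// fixedRaOffset: sfh_cfa_fixed_ra_offset from the header.
inline Frame recoverFrame(const Regs &regs, bool baseIsFp, int32_t cfaOffset,
                          bool hasFpOffset, int32_t fpOffset,
                          int8_t fixedRaOffset) {
  Frame f;
  uint64_t base = baseIsFp ? regs.fp : regs.sp;
  f.cfa = base + int64_t(cfaOffset);
  f.raSlot = f.cfa + int64_t(fixedRaOffset);
  f.hasFpSlot = hasFpOffset;
  f.fpSlot = hasFpOffset ? f.cfa + int64_t(fpOffset) : 0;
  return f;
}
```

For example, with sp = 0x7000, a CFA offset of 16, and a fixed RA offset of -8, the CFA is 0x7010 and the return address is read from 0x7008.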

AArch64

AArch64's more flexible calling conventions require explicit return address tracking:

First offset: Encodes CFA as BASE_REG + offset
Second offset: Encodes return address as CFA + offset
Third offset (if present): Encodes FP as CFA + offset

The explicit return address encoding accommodates AArch64's variable stack layouts and link register usage patterns.

s390x

FP and return address may not be saved at the same time. In leaf functions GCC might save the return address and FP to floating-point registers.

First offset: Encodes CFA as BASE_REG + offset
Second offset (if present): Encodes the return address as one of
- stack slot: CFA + offset2, if (offset2 & 1) == 0
- register number: offset2 >> 1, if (offset2 & 1) == 1
- not saved: if offset2 == SFRAME_FRE_RA_OFFSET_INVALID
Third offset (if present): Encodes FP as one of
- stack slot: CFA + offset3, if (offset3 & 1) == 0
- register number: offset3 >> 1, if (offset3 & 1) == 1

The format uses 0 (an invalid SFrame RA offset from CFA value) to indicate that the return address is not saved, while FP is saved.
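The s390x tagging scheme can be sketched as a pair of decoders. Loc, Decoded, decodeRa, and decodeFp are hypothetical names; kRaOffsetInvalid mirrors SFRAME_FRE_RA_OFFSET_INVALID, which the text above gives as 0.

```cpp
#include <cstdint>

enum class Loc { StackSlot, Register, NotSaved };

struct Decoded {
  Loc kind;
  int32_t value;  // offset from CFA for StackSlot, register number for Register
};

// Mirrors SFRAME_FRE_RA_OFFSET_INVALID (0 according to the text above).
constexpr int32_t kRaOffsetInvalid = 0;

inline Decoded decodeRa(int32_t offset2) {
  // Check the sentinel first: 0 is even and would otherwise look like a slot.
  if (offset2 == kRaOffsetInvalid)
    return {Loc::NotSaved, 0};
  if (offset2 & 1)
    return {Loc::Register, offset2 >> 1};
  return {Loc::StackSlot, offset2};
}

inline Decoded decodeFp(int32_t offset3) {
  if (offset3 & 1)
    return {Loc::Register, offset3 >> 1};
  return {Loc::StackSlot, offset3};
}
```

The low-bit tag costs nothing for stack slots because slot offsets are always even on this target.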

Toolchain implementation

In the GNU toolchain, the assembler in GNU Binutils reinterprets CFI directives and generates the .sframe section, while GCC itself has no knowledge of SFrame.

Some scenarios that cannot be described by .eh_frame in the absence of the frame pointer are equally inexpressible in SFrame. Additionally, SFrame has extra limitations, as certain CFI directives cannot be re-encoded into the SFrame format. You can take a look at the as_warn code in binutils-gdb gas/gen-sframe.c to learn some cases.

On the other hand, the assembler approach allows SFrame to work with hand-written assembly files with CFI directives.

ORC and `.sframe`

TODO

`.eh_frame` and `.sframe`

SFrame reduces header size compared to .eh_frame plus .eh_frame_hdr by:

Eliminating .eh_frame_hdr through sorted sfde_func_start_address fields
Replacing CIE pointers with direct FDE-to-FRE references
Using variable-width sfre_start_address fields (1 or 2 bytes) for small functions
Storing start addresses instead of address ranges, whereas .eh_frame stores address ranges
Start addresses in a small function use 1 or 2 byte fields, more efficient than .eh_frame initial_location, which needs at least 4 bytes (DW_EH_PE_sdata4).
Hard-coding stack offsets rather than using flexible register specifications

However, the bytecode design of .eh_frame can sometimes be more efficient than .sframe, as demonstrated on x86-64.

SFrame serves as a specialized complement to .eh_frame rather than a complete replacement. The current version does not include personality routines, Language Specific Data Area (LSDA) information, or the ability to encode extra callee-saved registers. While these constraints make SFrame ideal for profilers, they prevent it from supporting C++ exception handling, where libstdc++/libc++abi requires the full .eh_frame feature set.

In practice, executables and shared objects will likely contain all three sections:

.eh_frame: Complete unwinding information for exception handling
.eh_frame_hdr (encompassed by the PT_GNU_EH_FRAME program header): Fast lookup table for .eh_frame
.sframe (encompassed by the PT_GNU_SFRAME program header)

The auxiliary header, currently unused, provides a pathway for future enhancements. It could potentially accommodate .eh_frame augmentation data such as personality routines, language-specific data areas (LSDAs), and signal frame handling, bridging some of the current functionality gaps.

Large text section support

The sfde_func_start_address field uses a signed 32-bit offset to reference functions, providing a ±2GB addressing range from the field's location. This signed encoding offers flexibility in section ordering: .sframe can be placed either before or after text sections.

However, this approach faces limitations with large binaries, particularly when LLVM generates .ltext sections for x86-64. The typical section layout creates significant gaps between .sframe and .ltext:

.ltext          // Large text section
.lrodata        // Large read-only data
.rodata         // Regular read-only data
// .eh_frame and .sframe position
.text           // Regular text section
.data
.bss
.ldata          // Large data
.lbss           // Large BSS
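Whether a given function is reachable from an sfde_func_start_address field reduces to a range check on the signed distance. A minimal sketch, where fitsPcRel32 is a hypothetical helper and both arguments are virtual addresses:

```cpp
#include <cstdint>
#include <limits>

// True if `target` can be encoded as a signed 32-bit offset from `fieldAddr`.
inline bool fitsPcRel32(uint64_t fieldAddr, uint64_t target) {
  // Unsigned subtraction wraps; reinterpreting as signed gives the distance.
  int64_t delta = int64_t(target - fieldAddr);
  return delta >= std::numeric_limits<int32_t>::min() &&
         delta <= std::numeric_limits<int32_t>::max();
}
```

With the layout above, .lrodata and .rodata sit between .ltext and .sframe, so once their combined size approaches 2GB the functions in .ltext fall outside this range.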

Object file format design issues

Mandatory index building problems

Currently, Binutils enforces a single-element structure within each .sframe section, regardless of whether it resides in a relocatable object or final executable. While the SFRAME_F_FDE_SORTED flag can be cleared to permit unsorted FDEs, proposed unwinder implementations for the Linux kernel do not seem to support multiple elements in a single section. The design choice makes linker merging mandatory rather than optional.

This design choice stems from Linux kernel requirements, where kernel modules are relocatable files created with ld -r. The pending SFrame support for linux-perf expects each module to contain a single indexed format for efficient runtime processing. Consequently, GNU ld merges all input .sframe sections into a single indexed element, even when producing relocatable files. This behavior deviates from standard relocatable linking conventions that suppress synthetic section finalization.

This approach differs from almost every metadata section, which support multiple concatenated elements, each with its own header and body. LLVM supports numerous well-behaved metadata sections (__asan_globals, .stack_sizes, __patchable_function_entries, __llvm_prf_cnts, __sancov_bools, __llvm_covmap, __llvm_gcov_ctr_section, .llvmcmd, and llvm_offload_entries) that concatenate without issues. SFrame stands apart as the only metadata section demanding version-specific merging as default linker behavior, creating unprecedented maintenance burden. For optimal portability, unwinders should support multiple-element structures within a .sframe section.

For optimal portability, we must support object files from diverse origins, not just those built from a single toolchain. In environments where almost everything is built from source with a single toolchain offering strong SFrame support, forcing default-on index building may be acceptable. However, we must also accommodate environments with prebuilt object files using older SFrame versions, or toolchains that don't support old formats. I believe unwinders should support multiple-element structures within a .sframe section. When a linker builds an index for .sframe, it should be viewed as an optimization that relieves the unwinder from constructing its own index at runtime. This index construction should remain optional rather than required.

Section group compliance and garbage collection issues

GNU Assembler generates a single .sframe section containing relocations to STB_LOCAL symbols from multiple text sections, including those in different section groups.

This creates ELF specification violations when a referenced text section is discarded by the COMDAT section group rule. The ELF specification states:

A symbol table entry with STB_LOCAL binding that is defined relative to one of a group's sections, and that is contained in a symbol table section that is not part of the group, must be discarded if the group members are discarded. References to this symbol table entry from outside the group are not allowed.

The problem manifests when inline functions are deduplicated:

cat > a.cc <<'eof'
[[gnu::noinline]] inline int inl() { return 0; }
auto *fa = inl;
eof
cat > b.cc <<'eof'
[[gnu::noinline]] inline int inl() { return 0; }
auto *fb = inl;
eof
~/opt/gcc-15/bin/g++ -Wa,--gsframe -c a.cc b.cc

Linkers correctly reject this violation:

% ld.lld a.o b.o
ld.lld: error: relocation refers to a discarded section: .text._Z3inlv
>>> defined in b.o
>>> referenced by b.cc
>>>               b.o:(.sframe+0x1c)

% gold a.o b.o
b.o(.sframe+0x1c): error: relocation refers to local symbol ".text._Z3inlv" [2], which is defined in a discarded section
  section group signature: "inl()"
  prevailing definition is from a.o

(In 2020, I reported a similar issue for GCC -fpatchable-function-entry=.)

Some linkers don't implement this error check. A separate issue arises with garbage collection: by default, an unreferenced .sframe section will be discarded. If the linker implements a workaround to force-retain .sframe, it might inadvertently retain all text sections referenced by .sframe, even those that would otherwise be garbage collected.

The solution requires restructuring the assembler's output strategy. Instead of creating a monolithic .sframe section, the assembler should generate individual SFrame sections corresponding to each text section. When a text section belongs to a COMDAT group, its associated SFrame section must join the same group. For standalone text sections, the SHF_LINK_ORDER flag should establish the proper association.

This approach would create multiple SFrame sections within relocatable files, making the size optimization benefits of a simplified linking view format even more compelling. While this comes with the overhead of additional section headers (where each Elf64_Shdr consumes 64 bytes), it's a cost we should pay to be a good ELF citizen. This reinforces the value of my section header reduction proposal.

Version compatibility challenges

The current design creates significant version compatibility problems. When a linker only supports v3 but encounters object files with v2 .sframe sections, it faces impossible choices:

Discard v2 sections: Silently losing functionality
Report errors: Breaking builds with mixed-version object files
Concatenate sections: Currently unsupported by unwinders
Upgrade v2 to v3: Requires maintaining version-specific merge logic for every version

This differs fundamentally from reading a format: each version needs version-specific merging logic in every linker. Consider the scenario where v2 uses layout A, v3 uses layout B, and v4 uses layout C. A linker receiving objects with all three versions must produce coherent output with proper indexing while maintaining version-specific merge logic for each.

Real-world mixing scenarios include:

Third-party vendor libraries built with older toolchains
Users linking against prebuilt libraries from different sources
Users who don't need SFrame but must handle prebuilt libraries with older versions
Users updating their linker to a newer version that drops legacy SFrame support

Most users will not need stack tracing features. This may change eventually, but that will take many years. In the meantime, they must accept unneeded information while handling the resulting compatibility issues.

Requiring version-specific merging as default behavior would create maintenance burden unmatched by any other loadable metadata section.

Proposed format separation

A future version should distinguish between linking and execution views to resolve the compatibility and maintenance challenges outlined above. This separation has precedent in existing debug formats: .debug_pubnames/.gdb_index provides an excellent model for separate linking and execution views. DWARF v5's .debug_names takes a different approach, unifying both views at the cost of larger linking formats, a reasonable tradeoff since relocatable files contain only a single .debug_names section, and debuggers can efficiently load sections with concatenated name tables.

For SFrame, the separation would work as follows:

Separate linking format. Assemblers produce a simpler format, omitting index-specific metadata fields such as sfh_num_fdes, sfh_num_fres, sfh_fdeoff, and sfh_freoff.

Default concatenation behavior. Linkers concatenate .sframe input sections by default, consistent with DWARF and other metadata sections. Linkers can handle mixed-version scenarios gracefully without requiring version-specific merge logic, eliminating the impossible maintenance burden of keeping version-specific merge logic for every SFrame version in every linker implementation. Distributions can roll out SFrame support incrementally without requiring all linkers to support index building immediately.

The unwinder implementation cost is manageable. Stack unwinders already need to support .sframe sections across the main executable and all shared objects. Supporting multiple concatenated elements within a single .sframe section presents no fundamental technical barrier: this is a one-time implementation cost that provides forward and backward compatibility.
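As a rough illustration of what walking concatenated elements could involve, the sketch below splits a .sframe section into elements using the current header layout. It makes several assumptions that a real format revision would need to pin down: each element starts with a header carrying SFRAME_MAGIC (0xdee2), the FRE sub-section is the last part of an element so its end marks the element's end, elements are packed without padding, and the data matches host endianness. SFrameHeader and splitElements are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Flattened copy of the header fields shown earlier in this post.
struct [[gnu::packed]] SFrameHeader {
  uint16_t magic;
  uint8_t version;
  uint8_t flags;
  uint8_t abi_arch;
  int8_t cfa_fixed_fp_offset;
  int8_t cfa_fixed_ra_offset;
  uint8_t auxhdr_len;
  uint32_t num_fdes;
  uint32_t num_fres;
  uint32_t fre_len;
  uint32_t fdeoff;
  uint32_t freoff;
};
static_assert(sizeof(SFrameHeader) == 28, "unexpected header size");

constexpr uint16_t kSFrameMagic = 0xdee2;  // SFRAME_MAGIC

// Returns the start offset of each element in the section. Stops at the first
// malformed or truncated element.
inline std::vector<std::size_t> splitElements(const uint8_t *data,
                                              std::size_t size) {
  std::vector<std::size_t> starts;
  std::size_t pos = 0;
  while (size - pos >= sizeof(SFrameHeader)) {
    SFrameHeader h;
    std::memcpy(&h, data + pos, sizeof h);
    if (h.magic != kSFrameMagic)
      break;
    // Assumption: the FRE sub-section comes last, so its end is the element's
    // end. Offsets are relative to the end of the (auxiliary) header.
    uint64_t len = sizeof h + h.auxhdr_len + uint64_t(h.freoff) + h.fre_len;
    if (len > size - pos)
      break;
    starts.push_back(pos);
    pos += std::size_t(len);
  }
  return starts;
}
```

An unwinder could then either search each element in turn or build its own merged index on first use, which is the work a linker-built index would save.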

Optional index construction. When the opt-in option --sframe-index is requested, the linker builds an index from recognized versions while reporting warnings for unrecognized ones. This is analogous to --gdb-index and --debug-names.

With this approach, the linker builds .sframe_idx from input .sframe sections. To support the Linux kernel workflow (ld -r for kernel modules), ld -r --sframe-index must also generate the indexed format.

The index construction happens before section matching in linker scripts. The output section description .sframe_idx : { *(.sframe_idx) } places the synthesized .sframe_idx into the .sframe_idx output section. .sframe input sections have been replaced by the linker-synthesized .sframe_idx, so we don't write *(.sframe).

Alternative: Deriving SFrame from .eh_frame

An alternative approach could eliminate the need for assemblers to generate .sframe sections directly. Instead, the linker would merge and optimize .eh_frame as usual (which requires CIE and FDE boundary information), then derive .sframe (or .sframe_idx) from the optimized .eh_frame.

This approach offers a significant advantage: since the linker only reads the stable .eh_frame format and produces .sframe or .sframe_idx as output, version compatibility concerns disappear entirely.

While CFI instruction decoding introduces additional complexity (a step previously unneeded), this is balanced by the architectural advantage of centralizing the conversion logic. Rather than scattering format-specific processing code throughout the linker (similar to how SHF_MERGE and .eh_frame require special internal representations), the transformation logic remains localized.

The counterargument centers on maintenance burden. This fine-grained knowledge of the SFrame format may expose the linker to more frequent updates as the format evolves, a serious risk given that the linker's foundational role in the build process demands exceptional stability and robustness.

Post-processing alternative

A more cautious intermediate strategy could leverage existing Linux distribution post-processing tools, modifying them to append .sframe sections to executable and shared object files after linking completes. While this introduces more friction than native linker support and requires integration into package build systems, it offers several compelling advantages:

Allows .sframe format experimentation without imposing linker complexity
Provides time for the format to mature and prove its value before committing to linker integration
Enables testing across diverse userspace packages in real-world scenarios
Post-link tools can optimize and even overwrite sections in-place without linker constraints
For cases where optimization significantly shrinks the section, .sframe can be placed at the end of the file (similar to BOLT moving .rodata)

However, this approach faces practical challenges. Post-processing adds build complexity, particularly with features like build-ids and read-only file systems. The success of .gdb_index, where linker support (--gdb-index) proved more popular than post-link tools, suggests that native linker support eventually becomes necessary for widespread adoption.

The key question is timing: should linker integration be the starting point or the outcome of proven stability?

SHF_ALLOC considerations

The .sframe section carries the SHF_ALLOC flag, meaning it's loaded as part of the program's read-only data segment. This design choice creates tradeoffs:

With SHF_ALLOC:
- .sframe contributes to initial read-only data segment consumption
- Can be accessed directly as part of the memory-mapped area, relying on the kernel's on-demand page fault mechanism

Without SHF_ALLOC:
- No upfront memory cost
- Tracers must open the file and initiate IO to mmap the section on demand
- Runtime cost may not amortize well for frequent tracing

Analysis of 337 files in /usr/bin and /usr/lib/x86_64-linux-gnu/ shows .eh_frame typically consumes 5.2% (median: 5.1%) of file size:

EH_Frame size distribution:
  Min: 0.3%    Max: 11.5%    Mean: 5.2%    Median: 5.1%

  0%-1%: 7 files      5%-6%: 62 files
  1%-2%: 17 files     6%-7%: 33 files
  2%-3%: 37 files     7%-8%: 36 files
  3%-4%: 49 files     8%-9%: 20 files
  4%-5%: 50 files     9%-10%: 20 files
                      10%-12%: 6 files

If .sframe size is comparable to .eh_frame, this represents significant overhead for applications that never use stack tracing, likely the majority of users. Most users will not need stack trace features, raising the question of whether having .sframe always loaded is an acceptable overhead for distributions shipping it by default.

perf supports .debug_frame (tools/perf/util/unwind-libunwind-local.c), which does not have SHF_ALLOC. While there's a difference between status quo and what's optimal, the non-SHF_ALLOC approach deserves consideration for scenarios where runtime tracing overhead can be amortized or where memory footprint matters more than immediate access.

Kernel challenges

The .sframe section may not be resident in physical memory. SFrame proposers are attempting to defer user stack traces until syscall boundaries.

Ian Rogers points out that BPF programs can no longer simply stack trace user code. This change breaks stack trace deduplication, a commonly used BPF primitive.

Miscellaneous minor considerations

Linker relaxation considerations:

Since .sframe carries the SHF_ALLOC flag, it affects text section addresses and consequently influences linker relaxation on architectures like RISC-V and LoongArch.

If variable-length encoding is introduced to the format, .sframe would behave as an address-dependent section similar to .relr.dyn. However, this dependency should not pose significant implementation challenges.

Endianness considerations:

The SFrame format currently supports endianness variants, which complicates toolchain implementation. While runtime consumers typically target a single endianness, development tools must handle both variants to support cross-compilation workflows.

The endianness discussion in The future of 32-bit support in the kernel reinforces my belief in preferring universal little-endian for new formats. A universal little-endian approach would reduce implementation complexity by eliminating the need for:

Endianness-aware function calls like read32le(config, p) where config->endian specifies the object file's byte order
Template-based abstractions such as template <class Endian> that must wrap every data access function

Instead, toolchain code could use straightforward calls like read32le(p), streamlining both implementation and maintenance.
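For reference, a host-independent read32le can be written in a few lines. This is a minimal sketch, not the helper of any particular linker:

```cpp
#include <cstdint>

// Reads a 32-bit little-endian value from possibly unaligned memory.
// Works identically on little-endian and big-endian hosts.
inline uint32_t read32le(const void *p) {
  const uint8_t *b = static_cast<const uint8_t *>(p);
  return uint32_t(b[0]) | uint32_t(b[1]) << 8 | uint32_t(b[2]) << 16 |
         uint32_t(b[3]) << 24;
}
```

Compilers typically recognize this pattern and emit a single load on little-endian targets, or a byte-reversed load on big-endian ones.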

This approach remains efficient even on big-endian architectures like IBM z/Architecture and POWER. z/Architecture's LOAD REVERSED instructions, for instance, handle byte swapping with minimal overhead, often requiring no additional instructions beyond normal loads. While slight performance differences may exist compared to native endian operations, the toolchain simplification benefits generally outweigh these concerns.

#define WIDTH(x) \
typedef __UINT##x##_TYPE__ [[gnu::aligned(1)]] uint##x; \
uint##x load_inc##x(uint##x *p) { return *p+1; } \
uint##x load_bswap_inc##x(uint##x *p) { return __builtin_bswap##x(*p)+1; }; \
uint##x load_eq##x(uint##x *p) { return *p==3; } \
uint##x load_bswap_eq##x(uint##x *p) { return __builtin_bswap##x(*p)==3; }; \

WIDTH(16);
WIDTH(32);
WIDTH(64);

However, I understand that my opinion is probably not popular within the object file format community and faces resistance from stakeholders with significant big-endian investments.

Questioned benefits

SFrame's primary benefit centers on enabling frame pointer omission while preserving unwinding capabilities. In scenarios where users already omit leaf frame pointers, SFrame could theoretically allow switching from -fno-omit-frame-pointer -momit-leaf-frame-pointer to -fomit-frame-pointer -momit-leaf-frame-pointer. This benefit appears most significant on x86-64, which has limited general-purpose registers (without APX). Performance analyses show mixed results: some studies claim frame pointers degrade performance by less than 1%, while others suggest 1-2%. However, this argument overlooks a critical tradeoff: SFrame unwinding itself performs worse than frame pointer unwinding, potentially negating any performance gains from register availability.

Another claimed advantage is SFrame's ability to provide coverage in function prologues and epilogues, where frame-pointer-based unwinding may miss frames. Yet this overlooks a straightforward alternative: frame pointer unwinding can be enhanced to detect prologue and epilogue patterns by disassembling instructions at the program counter.

SFrame also faces a practical consideration: the .sframe section likely requires kernel page-in during unwinding, while the process stack is more likely already resident in physical memory. As Ian Rogers noted in LWN, system-wide profiling encounters limitations when system calls haven't transitioned to user code, BPF helpers may return placeholder values, and JIT compilers require additional SFrame support.

Looking ahead, hardware-assisted unwinding through features like x86 Shadow Stack and AArch64 Guarded Control Stack may reshape the entire landscape, potentially reducing the relevance of metadata-based unwinding formats. Meanwhile, compact unwinding schemes like Windows ARM64 demonstrate that significantly smaller metadata formats remain viable alternatives to both SFrame and .eh_frame. Proposals like Asynchronous Compact Unwind Descriptors have demonstrated that compact unwind formats can work with shrink-wrapping optimizations. There is a feature request for a compact unwind information format for AArch64: https://github.com/ARM-software/abi-aa/issues/344

Summary

Beyond these fundamental questions about SFrame's value proposition, the format presents a size improvement over the Linux kernel's ORC unwind format. Its design presents several implementation challenges that merit consideration for future versions:

Object file format design issues (mandatory index building, section group compliance, version compatibility)
Limited large text section support restricts deployment in modern binaries
Size issue

These technical concerns, combined with the fundamental value questions raised above, suggest that careful consideration is warranted before widespread adoption.

If we proceed, here is how to do it right

According to this comment on llvm-project #64449, "v3 is the version that will be submitted upstream when the time is right." Please share feedback on the format before it's finalized, even if you may not be impressed with the design.

To ensure rapid SFrame evolution without compatibility concerns, a better approach is to build a library that parses .eh_frame and generates SFrame. The Linux kernel can then use this library (in objtool?) to generate SFrame for vmlinux and modules. Relying on assembler/linker output for this critical metadata format requires a level of stability that is currently concerning.

The ongoing maintenance implications warrant particular attention. Observing the binutils mailing list reveals a significant volume of SFrame commits. Most linker features stabilize quickly after initial implementation, but SFrame appears to require continued evolution. Given the linker's foundational role in the build process, which demands exceptional stability and robustness, the long-term maintenance burden deserves careful consideration.

Early integration into GNU toolchain has provided valuable feedback for format evolution, but this comes at the cost of coupling the format's maturity to linker stability. The SFrame GNU toolchain developers exhibit a concerning tendency to disregard ELF and linker conventions, a serious problem for all linker maintainers.

Learning from existing compact unwind implementations

LLVM has had a battle-tested compact unwind format in production use since 2009 with OS X 10.6. The efficiency gains are dramatic even if it might only cover synchronous unwinding needs. OpenVMS's x86-64 port, which is ELF-based, also adopted this format as documented in their "VSI OpenVMS Calling Standard" and their 2018 post on LLVM Discourse. This isn't to suggest we should simply adopt the existing compact unwind format wholesale. The x86-64 design dates back to 2009 or earlier, and there are likely improvements we can make. However, we should aim for similar or better efficiency gains.

On AArch64, there are at least two formats the ELF one can learn from: LLVM's compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.

lld 21 ELF changes

MaskRay

September 7, 2025 15:00

LLVM 21.1 has been released. As usual, I maintain lld/ELF and have added some notes to https://github.com/llvm/llvm-project/blob/release/21.x/lld/docs/ReleaseNotes.rst. I've meticulously reviewed nearly all the patches that are not authored by me. I'll delve into some of the key changes.

Added -z dynamic-undefined-weak to make undefined weak symbols dynamic when the dynamic symbol table is present. (#143831)
For -z undefs (default for -shared), relocations referencing undefined strong symbols now behave like relocations referencing undefined weak symbols.
--why-live=<glob> prints for each symbol matching <glob> a chain of items that kept it live during garbage collection. This is inspired by the Mach-O LLD feature of the same name.
--thinlto-distributor= and --thinlto-remote-compiler= options are added to support Integrated Distributed ThinLTO. (#142757)
Linker script OVERLAY descriptions now support virtual memory regions (e.g. >region) and NOCROSSREFS.
When the last PT_LOAD segment is executable and includes BSS sections, its p_memsz member is now correct. (#139207)
Spurious ASSERT errors before the layout converges are now fixed.
For ARM and AArch64, --xosegment and --no-xosegment control whether to place executable-only and readable-executable sections in the same segment. The default option is --no-xosegment. (#132412)
For AArch64, added support for the SHF_AARCH64_PURECODE section flag, which indicates that the section only contains program code and no data. An output section will only have this flag set if all input sections also have it set. (#125689, #134798)
For AArch64 and ARM, added -zexecute-only-report, which checks for missing SHF_AARCH64_PURECODE and SHF_ARM_PURECODE section flags on executable sections. (#128883)
For AArch64, -z nopac-plt has been added.
For AArch64 and X86_64, added --branch-to-branch, which rewrites branches that point to another branch instruction to instead branch directly to the target of the second instruction. Enabled by default at -O2.
For AArch64, added support for -zgcs-report-dynamic, enabling checks for GNU GCS Attribute Flags in Dynamic Objects when GCS is enabled. Inherits value from -zgcs-report (capped at warning level) unless user-defined, ensuring compatibility with GNU ld linker.
The default Hexagon architecture version in ELF object files produced by lld is changed to v68. This change is only effective when the version is not provided in the command line by the user and cannot be inferred from inputs.
For LoongArch, the initial-exec to local-exec TLS optimization has been implemented.
For LoongArch, several relaxation optimizations are supported, including relaxation for R_LARCH_PCALA_HI20/LO12 and R_LARCH_GOT_PC_HI20/LO12 relocations, instruction relaxation for R_LARCH_CALL36, TLS local-exec (LE)/global dynamic (GD)/local dynamic (LD) model relaxation, and TLSDESC code sequence relaxation.
For RISCV, an oscillation bug due to call relaxation is now fixed. (#142899)
For x86-64, the .ltext section is now placed before .rodata.

Link: lld 20 ELF changes

Benchmarking compression programs

MaskRay

August 31, 2025 15:00

tl;dr https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7 is a single-file Ruby program that downloads and compiles multiple compression utilities, benchmarks their compression and decompression performance on a specified input file, and finally generates an HTML file with scatter charts. Scroll to the end to view example HTML pages.

Compression algorithms can be broadly categorized into three groups based on their typical compression ratio and decompression speed:

Low ratio, high speed: lz4, snappy, Oodle Selkie.
Medium ratio, medium speed: zlib, zstd, brotli, Oodle Kraken.
High ratio, low speed: LZMA, bzip2, bzip3, bsc, zpaq, kanzi, Oodle Leviathan.

Low ratio: Codecs in this category prioritize speed above all else. The compression and decompression speeds are comparable. They are designed to decompress so quickly that they don't introduce a noticeable delay when reading data from storage like solid-state drives. These codecs typically produce byte-aligned output and often skip the final step of entropy encoding, which, while crucial for high compression, is computationally intensive. They are excellent choices for applications where latency is critical, such as kernel features like zswap.

Medium ratio: This is the sweet spot for many tasks. The codecs achieve a better compression ratio by employing entropy encoding, usually Huffman coding.

zstd has emerged as a clear leader, gaining popularity and effectively supplanting older codecs like the venerable DEFLATE (zlib).

High ratio: These codecs are designed to squeeze every last bit of redundancy out of the data, often at the cost of significantly longer compression and decompression times, and large memory usage. They are perfect for archival purposes or data distribution where the files are compressed once and decompressed infrequently. Codecs typically have 3 important components:

Transforms: Codecs typically implement strong transforms to increase redundancy, even very specific ones like branch/call/jump filters for machine code.
Prediction model: This model anticipates the next piece of data based on what has already been processed.
Entropy encoding: Traditional codecs use an arithmetic encoder, which newer codecs replace with the more efficient range variant of Asymmetric Numeral Systems (rANS).

Some projects apply neural network models, such as Recurrent Neural Network, Long Short-Term Memory, and Transformer, to the prediction model. They are usually very slow.

This categorization is loose. Many modern programs offer a wide range of compression levels that allow them to essentially span multiple categories. For example, a high-level zstd compression can achieve a ratio comparable to xz (a high-compression codec) by using more RAM and CPU. While zstd's compression speed or ratio is generally lower, its decompression speed is often much faster than that of xz.

Benchmarking

I want to benchmark the single worker performance of a few compression programs:

lz4: Focuses on speed over compression ratio. Memory usage is extremely low. It seems Pareto superior to Google's Snappy.
zstd: Gained significant traction and obsoleted many existing codecs. Its LZ77 variant uses three recent match offsets like LZX. For entropy encoding, it employs Huffman coding for literals and 2-way interleaved Finite State Entropy for Huffman weights, literal lengths, match lengths, and offset codes. The large alphabet of literals makes Huffman a good choice, as compressing them with FSE provides little gain for a speed cost. However, other symbols have a small range, making them a sweet spot for FSE. zstd works on multiple streams at the same time to utilize instruction-level parallelism. zstd is supported by the Accept-Encoding: zstd HTTP header. Decompression memory usage is very low.
brotli: Uses a combination of LZ77, 2nd order context model, Huffman coding, and static dictionary. The decompression speed is similar to gzip with a higher ratio. At lower levels, its performance is overshadowed by zstd. Compared with DEFLATE, it employs a larger sliding window (from 16KiB-16B to 16MiB-16B) and a smaller minimum match length (2 instead of 3). It has a predefined dictionary that works well for web content (but feels less elegant) and supports 120 transforms. brotli is supported by the Accept-Encoding: br HTTP header. Decompression memory usage is quite low.
bzip3: Combines BWT, RLE, and LZP and uses an arithmetic encoder. Memory usage is large.
xz: LZMA2 with a few filters. The filters must be enabled explicitly.
lzham: Provides a compression ratio similar to LZMA but with faster decompression. Compression is slightly slower while memory usage is larger. The build system is not well-polished for Linux. I have forked it, fixed stdint.h build errors, and installed lzhamtest. The command line program lzhamtest should really be renamed to lzham.
zpaq: Functions as a command-line archiver supporting multiple files. It combines context mixing with an arithmetic encoder but operates very slowly.
kanzi: There are a wide variety of transforms and entropy encoders, unusual for a compression program. For the compression speed of enwik8, it's Pareto superior to xz, but decompression is slower. Levels 8 and 9 belong to the PAQ8 family and consume substantial memory.

I'd like to test lzham (not updated for a few years), but I'm having trouble getting it to compile due to a cstdio header issue.

Many modern compressors are parallel by default. I have to disable this behavior by using options like -T1. Still, zstd uses a worker thread for I/O overlap, but I don't bother with --single-thread.

To ensure fairness, each program is built with consistent compiler optimizations, such as -O3 -march=native.

Below is a Ruby program that downloads and compiles multiple compression utilities, then compresses and decompresses a specified input file. It collects performance metrics including execution time, memory usage, and compression ratio, and finally generates an HTML file with scatter charts visualizing the results. The program has several notable features:

Adding new compressors is easy: just modify COMPRESSORS.
Benchmark results are cached in files named cache_$basename_$digest.json, allowing reuse of previous runs for the same input file.
Adding a new compression level does not invalidate existing benchmark results for other levels.
The script generates an HTML file with interactive scatter charts. Each compressor is assigned a unique, deterministic color based on a hash of its name (using the hsl function in CSS).

The single file Ruby program is available at https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7

Limitation

A single run might not be representative.

Running the executable incurs initialization overhead, which would be amortized in a library setup. However, library setup would make updating libraries more difficult.

Demo

ruby bench.rb enwik8
# The first iframe below

ruby bench.rb clang
# The second iframe below

Many programs exhibit a stable decompression speed (uncompressed size / decompression time). There is typically a slightly higher decompression speed at higher compression levels. If you think of the compressed content as a form of "byte code", a more highly compressed file means there are fewer bytes for the decompression algorithm to process, resulting in faster decompression. Some programs, like zpaq and kanzi, use different algorithms that can result in significantly different decompression speeds.

xz -9 doesn't use parallelism on the two files under ~100 MiB because their uncompressed size is smaller than the default block size for level 9.

From install/include/lzma/container.h:

For each thread, about 3 * block_size bytes of memory will be allocated. This may change in later liblzma versions. If so, the memory usage will probably be reduced, not increased.

Understanding alignment - from source to object file

MaskRay

August 24, 2025 15:00

Alignment refers to the practice of placing data or code at memory addresses that are multiples of a specific value, typically a power of 2. This is typically done to meet the requirements of the programming language, ABI, or the underlying hardware. Misaligned memory accesses might be expensive or will cause traps on certain architectures.

This blog post explores how alignment is represented and managed as C++ code is transformed through the compilation pipeline: from source code to LLVM IR, assembly, and finally the object file. We'll focus on alignment for both variables and functions.

Alignment in C++ source code

C++ [basic.align] specifies

Object types have alignment requirements ([basic.fundamental], [basic.compound]) which place restrictions on the addresses at which an object of that type may be allocated. An alignment is an implementation-defined integer value representing the number of bytes between successive addresses at which a given object can be allocated. An object type imposes an alignment requirement on every object of that type; stricter alignment can be requested using the alignment specifier ([dcl.align]). Attempting to create an object ([intro.object]) in storage that does not meet the alignment requirements of the object's type is undefined behavior.

alignas can be used to request a stricter alignment. [dcl.align]

An alignment-specifier may be applied to a variable or to a class data member, but it shall not be applied to a bit-field, a function parameter, or an exception-declaration ([except.handle]). An alignment-specifier may also be applied to the declaration of a class (in an elaborated-type-specifier ([dcl.type.elab]) or class-head ([class]), respectively). An alignment-specifier with an ellipsis is a pack expansion ([temp.variadic]).

Example:

alignas(16) int i0;
struct alignas(8) S {};

If the strictest alignas on a declaration is weaker than the alignment it would have without any alignas specifiers, the program is ill-formed.

% echo 'alignas(2) int v;' | clang -fsyntax-only -xc++ -
<stdin>:1:1: error: requested alignment is less than minimum alignment of 4 for type 'int'
    1 | alignas(2) int v;
      | ^
1 error generated.
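
The alignas rules above can be checked at compile time. The following is a minimal sketch, assuming an implementation that supports an extended alignment of 32 (mainstream compilers do); the struct T and its members are hypothetical names introduced only for illustration.

```cpp
alignas(16) int i0;        // the variable is over-aligned; its type is still int
struct alignas(8) S {};    // every object of type S is at least 8-byte aligned
struct T {                 // hypothetical type, for illustration only
  char c;
  alignas(32) int x;       // an over-aligned member raises the alignment of T
};

static_assert(alignof(S) == 8, "alignas on the class-head sets the type's alignment");
static_assert(alignof(T) == 32, "the strictest member alignment wins");
static_assert(sizeof(T) % alignof(T) == 0, "the size is a multiple of the alignment");
static_assert(alignof(decltype(i0)) == alignof(int),
              "alignas on a variable does not change its type");
```

Note that alignas on a variable affects only that object, while alignas on a class or a member becomes part of the type's alignment requirement.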

However, the GNU extension __attribute__((aligned(1))) can request a weaker alignment.

typedef int32_t __attribute__((aligned(1))) unaligned_int32_t;

Further reading: What is the Strict Aliasing Rule and Why do we care?

LLVM IR representation

In the LLVM Intermediate Representation (IR), both global variables and functions can have an align attribute to specify their required alignment.

Global variable alignment:

An explicit alignment may be specified for a global, which must be a power of 2. If not present, or if the alignment is set to zero, the alignment of the global is set by the target to whatever it feels convenient. If an explicit alignment is specified, the global is forced to have exactly that alignment. Targets and optimizers are not allowed to over-align the global if the global has an assigned section. In this case, the extra alignment could be observable: for example, code could assume that the globals are densely packed in their section and try to iterate over them as an array, alignment padding would break this iteration. For TLS variables, the module flag MaxTLSAlign, if present, limits the alignment to the given value. Optimizers are not allowed to impose a stronger alignment on these variables. The maximum alignment is 1 << 32.

Function alignment:

An explicit alignment may be specified for a function. If not present, or if the alignment is set to zero, the alignment of the function is set by the target to whatever it feels convenient. If an explicit alignment is specified, the function is forced to have at least that much alignment. All alignments must be a power of 2.

A backend can override this with a preferred function alignment (STI->getTargetLowering()->getPrefFunctionAlignment()), if that is larger than the specified align value. (https://discourse.llvm.org/t/rfc-enhancing-function-alignment-attributes/88019/3)

In addition, align can be used in parameter attributes to decorate a pointer or vector of pointers.

LLVM back end representation

Global variables: AsmPrinter::emitGlobalVariable determines the alignment for global variables based on a set of nuanced rules:

With an explicit alignment (explicit),
- If the variable has a section attribute, return explicit.
- Otherwise, compute a preferred alignment for the data layout (getPrefTypeAlign, referred to as pref). Return pref < explicit ? explicit : max(explicit, getABITypeAlign).
Without an explicit alignment: return getPrefTypeAlign.
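
The rules listed above can be written down as a small function. This is only a toy model under stated assumptions, not LLVM's code: globalVariableAlignment and maxAlign are hypothetical names, an explicit alignment of 0 models "no explicit alignment", and the prefTypeAlign and abiTypeAlign parameters stand in for the results of getPrefTypeAlign and getABITypeAlign.

```cpp
#include <cstdint>

constexpr uint64_t maxAlign(uint64_t a, uint64_t b) { return a < b ? b : a; }

// Toy model of the rules above. explicitAlign == 0 means "not specified".
constexpr uint64_t globalVariableAlignment(uint64_t explicitAlign, bool hasSection,
                                           uint64_t prefTypeAlign,
                                           uint64_t abiTypeAlign) {
  return explicitAlign == 0 ? prefTypeAlign   // no explicit alignment
         : hasSection       ? explicitAlign   // section attribute: exactly explicit
         : prefTypeAlign < explicitAlign
             ? explicitAlign                  // explicit is stricter than pref
             : maxAlign(explicitAlign, abiTypeAlign);
}

// An int-like type with preferred and ABI alignment 4 (illustrative values).
static_assert(globalVariableAlignment(0, false, 4, 4) == 4, "no explicit alignment");
static_assert(globalVariableAlignment(1, true, 4, 4) == 1, "section attribute");
static_assert(globalVariableAlignment(1, false, 4, 4) == 4, "raised to ABI alignment");
static_assert(globalVariableAlignment(16, false, 4, 4) == 16, "explicit is stricter");
```

The second and third cases show why the section attribute matters: the same under-aligned request is honored exactly when a section is assigned, and raised to the ABI alignment otherwise.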

getPrefTypeAlign employs a heuristic for global variable definitions: if the variable's size exceeds 16 bytes and the preferred alignment is less than 16 bytes, it sets the alignment to 16 bytes. This heuristic balances performance and memory efficiency for common cases, though it may not be optimal for all scenarios. (See Preferred alignment of globals > 16 bytes in 2012)

For assembly output, AsmPrinter emits .p2align (power of 2 alignment) directives with a zero fill value (i.e. the padding bytes are zeros).

% echo 'int v0;' | clang --target=x86_64 -S -xc - -o -
        .file   "-"
        .type   v0,@object                      # @v0
        .bss
        .globl  v0
        .p2align        2, 0x0
v0:
        .long   0                               # 0x0
        .size   v0, 4
...

Functions: For functions, AsmPrinter::emitFunctionHeader emits alignment directives based on the machine function's alignment settings.

void MachineFunction::init() {
...
  Alignment = STI.getTargetLowering()->getMinFunctionAlignment();

  // FIXME: Shouldn't use pref alignment if explicit alignment is set on F.
  if (!F.hasOptSize())
    Alignment = std::max(Alignment,
                         STI.getTargetLowering()->getPrefFunctionAlignment());

Start with the subtarget's minimum function alignment.
If the function is not optimized for size (i.e. not compiled with -Os or -Oz), take the maximum of the minimum alignment and the preferred alignment. For example, X86TargetLowering sets the preferred function alignment to 16.

% echo 'void f(){} [[gnu::aligned(32)]] void g(){}' | clang --target=x86_64 -S -xc - -o -
        .file   "-"
        .text
        .globl  f                               # -- Begin function f
        .p2align        4
        .type   f,@function
f:                                      # @f
...
        .globl  g                               # -- Begin function g
        .p2align        5
        .type   g,@function
g:                                      # @g

The emitted .p2align directives omit the fill value argument: for code sections, this space is filled with no-op instructions.
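
The way the minimum, preferred, and explicit alignments combine into the .p2align operand can be sketched as follows. This is a toy model, not LLVM's code: functionAlignment, log2Align, and maxAlign are hypothetical names, an explicit alignment of 0 models "not specified", and the minimum alignment of 1 used in the checks is an illustrative value.

```cpp
#include <cstdint>

constexpr uint64_t maxAlign(uint64_t a, uint64_t b) { return a < b ? b : a; }

// log2 of a power-of-two alignment, i.e. the operand of .p2align.
constexpr unsigned log2Align(uint64_t a) { return a <= 1 ? 0 : 1 + log2Align(a >> 1); }

// Toy model: start from the minimum, apply the preferred alignment unless
// optimizing for size, then honor an explicit alignment as a lower bound.
constexpr uint64_t functionAlignment(uint64_t minAlign, uint64_t prefAlign,
                                     bool optSize, uint64_t explicitAlign) {
  return maxAlign(optSize ? minAlign : maxAlign(minAlign, prefAlign), explicitAlign);
}

// f: no attribute, preferred alignment 16 -> .p2align 4
static_assert(log2Align(functionAlignment(1, 16, false, 0)) == 4, "f");
// g: [[gnu::aligned(32)]] -> .p2align 5
static_assert(log2Align(functionAlignment(1, 16, false, 32)) == 5, "g");
```

With an explicit alignment of 8 the result is still 16 in this model, which mirrors the FIXME in MachineFunction::init: the preferred alignment is applied even when an explicit alignment is set.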

Assembly representation

GNU Assembler supports multiple alignment directives:

.p2align 3: align to 2**3
.balign 8: align to 8
.align 8: this is identical to .balign on some targets and .p2align on the others.

Clang supports "direct object emission" (clang -c typically bypasses a separate assembler): the LLVM AsmPrinter directly uses the MCObjectStreamer API. This allows Clang to emit the machine code directly into the object file, bypassing the need to parse and interpret alignment directives and instructions from a text-based assembly file.

These alignment directives have an optional third argument: the maximum number of bytes to skip. If doing the alignment would require skipping more bytes than the specified maximum, the alignment is not done at all. GCC's -falign-functions=m:n utilizes this feature.
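
The max-skip behavior can be illustrated with a little arithmetic. The sketch below is a toy model, not assembler code: paddingTo and paddingWithMaxSkip are hypothetical names, and passing 0 for maxSkip is this sketch's own convention for "operand omitted".

```cpp
#include <cstdint>

// Bytes needed to advance `offset` to the next multiple of `align`.
constexpr uint64_t paddingTo(uint64_t offset, uint64_t align) {
  return (align - offset % align) % align;
}

// Toy model of the third operand: if more than maxSkip bytes would be
// needed, the alignment is not done at all (0 bytes are emitted).
// maxSkip == 0 models an omitted operand in this sketch.
constexpr uint64_t paddingWithMaxSkip(uint64_t offset, uint64_t align,
                                      uint64_t maxSkip) {
  return (maxSkip != 0 && paddingTo(offset, align) > maxSkip)
             ? 0
             : paddingTo(offset, align);
}

// Modeling `.p2align 4,,3` at different offsets:
static_assert(paddingWithMaxSkip(13, 16, 3) == 3, "3 bytes needed: aligned");
static_assert(paddingWithMaxSkip(12, 16, 3) == 0, "4 bytes needed: skipped");
static_assert(paddingWithMaxSkip(12, 16, 0) == 4, "no limit: aligned");
```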

Object file format

In an object file, the section alignment is determined by the strictest alignment directive present in that section. The assembler sets the section's overall alignment to the maximum of all these directives, as if an implicit directive were at the start.

.section .text.a,"ax"
# implicit alignment max(4, 8)

.long 0
.balign 4
.long 0
.balign 8

This alignment is stored in the sh_addralign field within the ELF section header table. You can inspect this value using tools such as readelf -WS (llvm-readelf -S) or objdump -h (llvm-objdump -h).

Linker considerations

The linker combines multiple object files into a single executable. When it maps input sections from each object file into output sections in the final executable, it ensures that section alignments specified in the object files are preserved.

How the linker handles section alignment

Output section alignment: This is the maximum sh_addralign value among all its contributing input sections. This ensures the strictest alignment requirements are met.

Section placement: The linker also uses input sh_addralign information to position each input section within the output section. As illustrated in the following example, each input section (like a.o:.text.f or b.o:.text) is aligned according to its sh_addralign value before being placed sequentially.

output .text
  # align to sh_addralign(a.o:.text). No-op if this is the first section without any preceding DOT assignment or data command.
  a.o:.text
  # align to sh_addralign(a.o:.text.f)
  a.o:.text.f
  # align to sh_addralign(b.o:.text)
  b.o:.text
  # align to sh_addralign(b.o:.text.g)
  b.o:.text.g
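
The placement rule above can be sketched in a few lines. This is a toy model under stated assumptions, not linker code: InputSection, OutputSection, layout, and example are hypothetical names, the sizes are made up, offsets are relative to the start of the output section, and linker scripts, DOT assignments, and data commands are ignored.

```cpp
#include <cstdint>

struct InputSection { uint64_t size; uint64_t addralign; };
struct OutputSection { uint64_t size; uint64_t addralign; };

// Round `value` up to a multiple of `align`.
constexpr uint64_t alignTo(uint64_t value, uint64_t align) {
  return (value + align - 1) / align * align;
}

// Place input sections sequentially, aligning each to its sh_addralign.
// The output alignment is the maximum sh_addralign of the inputs.
constexpr OutputSection layout(const InputSection *inputs, int n) {
  OutputSection out{0, 1};
  for (int i = 0; i < n; ++i) {
    out.size = alignTo(out.size, inputs[i].addralign) + inputs[i].size;
    if (inputs[i].addralign > out.addralign)
      out.addralign = inputs[i].addralign;
  }
  return out;
}

// Three made-up sections: offsets 0, 16 (11 bytes of padding), and 24 (5 bytes).
constexpr OutputSection example() {
  InputSection inputs[] = {{5, 4}, {3, 16}, {8, 8}};
  return layout(inputs, 3);
}
static_assert(example().size == 32, "total size including padding");
static_assert(example().addralign == 16, "maximum of the input alignments");
```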

Linker script control: A linker script can override the default alignment behavior. The ALIGN keyword enforces a stricter alignment. For example, .text : ALIGN(32) { ... } aligns the section to at least a 32-byte boundary. This is often done to optimize for specific hardware or for memory mapping requirements.

The SUBALIGN keyword on an output section overrides the input section alignments.

Padding: To achieve the required alignment, the linker may insert padding between sections or before the first input section (if there is a gap after the output section start). The fill value is determined by the following rules:

If specified, use the =fillexp output section attribute (within an output section description).
If a non-code section, use zero.
Otherwise, use a trap or no-op instruction.

Padding and section reordering

Linkers typically preserve the order of input sections from object files. To minimize the padding required between sections, linker scripts can use a SORT_BY_ALIGNMENT keyword to arrange input sections in descending order of their alignment requirements. Similarly, GNU ld supports --sort-common to sort COMMON symbols by decreasing alignment.

While this sorting can reduce wasted space, modern linking strategies often prioritize other factors, such as cache locality (for performance) and data similarity (for Lempel–Ziv compression ratio), which can conflict with sorting by alignment. (Search --bp-compression-sort= on Explain GNU style linker options).

System page size

The alignment of a variable or function can be as large as the system page size. Some implementations allow a larger alignment. (Over-aligned segment)

ABI compliance

Some platforms have special rules. For example,

On SystemZ, the larl (load address relative long) instruction cannot generate odd addresses. To prevent GOT indirection, compilers ensure that symbols are at least aligned by 2. (Toolchain notes on z/Architecture)
On AIX, the default alignment mode is power: for double and long double, the first member of this data type is aligned according to its natural alignment value; subsequent members of the aggregate are aligned on 4-byte boundaries. (https://reviews.llvm.org/D79719)
z/OS caps the maximum alignment of static storage variables to 16. (https://reviews.llvm.org/D98864)

The standard representation of the Itanium C++ ABI requires member functions to have even addresses, so that member function pointers can distinguish between virtual and non-virtual functions.

In the standard representation, a member function pointer for a virtual function is represented with ptr set to 1 plus the function's v-table entry offset (in bytes), converted to a function pointer as if by reinterpret_cast<fnptr_t>(uintfnptr_t(1 + offset)), where uintfnptr_t is an unsigned integer of the same size as fnptr_t.

Conceptually, a pointer to member function is a tuple:

A function pointer or virtual table index, discriminated by the least significant bit
A displacement to apply to the this pointer
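
The least significant bit discrimination can be modeled with plain integers. This is a toy sketch of the ptr field in the standard representation only; the function names and the addresses are hypothetical, and the this-pointer displacement is left out.

```cpp
#include <cstdint>

// Virtual: ptr is 1 plus the v-table entry offset in bytes.
constexpr uint64_t encodeVirtual(uint64_t vtableOffset) { return 1 + vtableOffset; }
// Non-virtual: ptr is the function address itself.
constexpr uint64_t encodeNonVirtual(uint64_t functionAddress) { return functionAddress; }

// The least significant bit tells the two cases apart.
constexpr bool isVirtual(uint64_t ptr) { return (ptr & 1) != 0; }
constexpr uint64_t vtableOffsetOf(uint64_t ptr) { return ptr - 1; }

static_assert(isVirtual(encodeVirtual(16)), "odd value: virtual");
static_assert(vtableOffsetOf(encodeVirtual(16)) == 16, "offset round-trips");
static_assert(!isVirtual(encodeNonVirtual(0x401000)), "even address: non-virtual");
// A function at an odd address would be misread as a v-table offset.
static_assert(isVirtual(encodeNonVirtual(0x401001)), "odd address is ambiguous");
```

The last check is the reason the scheme needs function addresses to be even.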

Due to the least significant bit discriminator, member functions need a stricter alignment even if __attribute__((aligned(1))) is specified:

virtual void bar1() __attribute__((aligned(1)));

Side note: check out MSVC C++ ABI Member Function Pointers for a comparison with the MSVC C++ ABI.

Architecture considerations

Contemporary architectures generally support unaligned memory access, likely with very small performance penalties. However, some implementations might restrict or penalize unaligned accesses heavily, or require specific handling. Even on architectures supporting unaligned access, atomic operations might still require alignment.

On AArch64, a bit in the system control register sctlr_el1 enables alignment check.
On x86, if the AM bit is set in the CR0 register and the AC bit is set in the EFLAGS register, alignment checking of user-mode data accesses is enabled.

Linux's RISC-V port supports prctl(PR_SET_UNALIGN, PR_UNALIGN_SIGBUS); to enable strict alignment.

clang -fsanitize=alignment can detect misaligned memory access. Check out my write-up.

In 1989, US Patent 4814976, which covers "RISC computer with unaligned reference handling and method for the same" (4 instructions: lwl, lwr, swl, and swr), was granted to MIPS Computer Systems Inc. It created a barrier for other RISC processors; see The Lexra Story.

Almost every microprocessor in the world can emulate the functionality of unaligned loads and stores in software. MIPS Technologies did not invent that. By any reasonable interpretation of the MIPS Technologies' patent, Lexra did not infringe. In mid-2001 Lexra received a ruling from the USPTO that all claims in the the lawsuit were invalid because of prior art in an IBM CISC patent. However, MIPS Technologies appealed the USPTO ruling in Federal court, adding to Lexra's legal costs and hurting its sales. That forced Lexra into an unfavorable settlement. The patent expired on December 23, 2006 at which point it became legal for anybody to implement the complete MIPS-I instruction set, including unaligned loads and stores.

Aligning code for performance

GCC offers a family of performance-tuning options named -falign-* that instruct the compiler to align certain code segments to specific memory boundaries. These options might improve performance by preventing certain instructions from crossing cache line boundaries (or instruction fetch boundaries), which can otherwise cause an extra cache miss.

-falign-functions=n: Align functions.
-falign-labels=n: Align branch targets.
-falign-jumps=n: Align branch targets, for branch targets where the targets can only be reached by jumping.
-falign-loops=n: Align the beginning of loops.

Important considerations

Inefficiency with small functions: Aligning small functions can be inefficient and may not be worth the overhead. To address this, GCC introduced -flimit-function-alignment in 2016. The option sets the .p2align directive's max-skip operand to the estimated function size minus one.

% echo 'int add1(int a){return a+1;}' | gcc -O2 -S -fcf-protection=none -xc - -o - -falign-functions=16 | grep p2align
        .p2align 4
% echo 'int add1(int a){return a+1;}' | gcc -O2 -S -fcf-protection=none -xc - -o - -falign-functions=16 -flimit-function-alignment | grep p2align
        .p2align 4,,3

The max-skip operand, if present, is evaluated at parse time, so you cannot do:

.p2align 4, , b-a
a:
  nop
b:

In LLVM, the x86 backend does not implement TargetInstrInfo::getInstSizeInBytes, making it challenging to implement -flimit-function-alignment.

Cold code: These options don't apply to cold functions. To ensure that cold functions are also aligned, use -fmin-function-alignment=n instead.

Benchmarking: Aligning functions can make benchmarks more reliable. For example, on x86-64, a hot function less than 32 bytes might be placed in a way that uses one or two cache lines (determined by function_addr % cache_line_size), making benchmark results noisy. Using -falign-functions=32 can ensure the function always occupies a single cache line, leading to more consistent performance measurements.
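
The effect of function_addr % cache_line_size can be made concrete with a small calculation. This is a sketch assuming a 64-byte cache line; cacheLinesSpanned is a hypothetical helper and the addresses are made up.

```cpp
#include <cstdint>

// Number of cache lines touched by `size` bytes (size > 0) starting at `addr`.
constexpr uint64_t cacheLinesSpanned(uint64_t addr, uint64_t size, uint64_t line) {
  return (addr % line + size + line - 1) / line;
}

// A 24-byte function, 64-byte cache lines:
static_assert(cacheLinesSpanned(0x1000, 24, 64) == 1, "starts at a line boundary");
static_assert(cacheLinesSpanned(0x1030, 24, 64) == 2, "offset 48: straddles two lines");
// With 32-byte alignment the offset within a line is 0 or 32, so a function
// of at most 32 bytes always fits in one line.
static_assert(cacheLinesSpanned(0x1020, 24, 64) == 1, "offset 32: one line");
static_assert(cacheLinesSpanned(0x1020, 32, 64) == 1, "offset 32, 32 bytes: one line");
```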

LLVM notes: In clang/lib/CodeGen/CodeGenModule.cpp, -falign-functions=N sets the alignment if a function does not have the gnu::aligned attribute.

A low-overhead loop (also called a zero-overhead loop) is a hardware-assisted looping mechanism found in many processor architectures, particularly digital signal processors (DSPs). The processor includes dedicated registers that store the loop start address, loop end address, and loop count. A hardware loop typically consists of three components:

Loop setup instruction: Sets the loop end address and iteration count
Loop body: Contains the actual instructions to be repeated
Loop end instruction: Jumps back to the loop body if further iterations are required

Here is an example from Arm v8.1-M low-overhead branch extension.

1:
  dls lr, Rn    // Setup loop with count in Rn
  ...           // Loop body instructions
2:
  le lr, 1b     // Loop end - branch back to label 1 if needed

To minimize the number of cache lines used by the loop body, ideally the loop body (the instruction immediately following DLS) should be aligned to a 64-byte boundary. However, GNU Assembler lacks a directive to specify alignment like "align DLS to a multiple of 64 plus 60 bytes." Inserting an alignment after the DLS is counterproductive, as it would introduce unwanted NOP instructions at the beginning of the loop body, negating the performance benefits of the low-overhead loop mechanism.

It would be desirable to simulate the functionality with .org ((.+4+63) & -64) - 4 // ensure that .+4 is aligned to 64-byte boundary, but this complex expression involves bitwise AND and is not a relocatable expression. LLVM integrated assembler would report expected absolute expression while GNU Assembler has a similar error.

A potential solution would be to extend the alignment directives with an optional offset parameter:

# Align to 64-byte boundary with 60-byte offset, using NOP padding in code sections
.balign 64, , , 60

# Same alignment with offset, but skip at most 16 bytes of padding
.balign 64, , 16, 60
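
The arithmetic behind such an offset parameter is simple. The sketch below is only a model of the proposed (not existing) directive: paddingForOffsetAlign and orgTarget are hypothetical names, and DLS is assumed to be a 4-byte instruction, as the .+4 in the .org expression suggests.

```cpp
#include <cstdint>

// Padding needed so that (pc + padding) % align == offset % align.
constexpr uint64_t paddingForOffsetAlign(uint64_t pc, uint64_t align,
                                         uint64_t offset) {
  return (offset % align + align - pc % align) % align;
}

// The .org expression from the text: ((. + 4 + 63) & -64) - 4.
constexpr uint64_t orgTarget(uint64_t pc) {
  return ((pc + 4 + 63) & ~uint64_t(63)) - 4;
}

static_assert(paddingForOffsetAlign(0, 64, 60) == 60, "DLS lands at 60, body at 64");
static_assert(paddingForOffsetAlign(60, 64, 60) == 0, "already in place");
static_assert(100 + paddingForOffsetAlign(100, 64, 60) == 124, "DLS at 64 + 60");
// Both formulations agree on where DLS should be placed.
static_assert(100 + paddingForOffsetAlign(100, 64, 60) == orgTarget(100), "same target");
```

In each case the instruction after DLS, 4 bytes later, starts on a 64-byte boundary.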

Xtensa's LOOP instructions have similar alignment requirements, but I am not familiar with the details. The GNU Assembler implements the special alignment as a special machine-dependent fragment. (https://sourceware.org/binutils/docs/as/Xtensa-Automatic-Alignment.html)

LLVM integrated assembler: Improving sections and symbols

MaskRay

August 17, 2025 15:00

My previous post, LLVM integrated assembler: Improving expressions and relocations, delved into enhancements made to LLVM's expression resolving and relocation generation. This post covers recent refinements to MC, focusing on sections and symbols.

Sections

Sections are named, contiguous blocks of code or data within an object file. They allow you to logically group related parts of your program. The assembler places code and data into these sections as it processes the source file.

class MCSection {
...
  enum SectionVariant {
    SV_COFF = 0,
    SV_ELF,
    SV_GOFF,
    SV_MachO,
    SV_Wasm,
    SV_XCOFF,
    SV_SPIRV,
    SV_DXContainer,
  };

In LLVM 20, the MCSection class used an enum called SectionVariant to differentiate between various object file formats, such as ELF, Mach-O, and COFF. This enum has been removed. The format-specific subclasses are used in contexts where the section type is known at compile time, such as in MCStreamer and MCObjectTargetWriter. This change eliminates the need for runtime type information (RTTI) checks, simplifying the codebase and improving efficiency.

Additionally, the storage for fragments' fixups (adjustments to addresses and offsets) has been moved into the MCSection class.

Symbols

Symbols are names that represent memory addresses or values.

class MCSymbol {
protected:
  /// The kind of the symbol.  If it is any value other than unset then this
  /// class is actually one of the appropriate subclasses of MCSymbol.
  enum SymbolKind {
    SymbolKindUnset,
    SymbolKindCOFF,
    SymbolKindELF,
    SymbolKindGOFF,
    SymbolKindMachO,
    SymbolKindWasm,
    SymbolKindXCOFF,
  };

  /// A symbol can contain an Offset, or Value, or be Common, but never more
  /// than one of these.
  enum Contents : uint8_t {
    SymContentsUnset,
    SymContentsOffset,
    SymContentsVariable,
    SymContentsCommon,
    SymContentsTargetCommon, // Index stores the section index
  };

Similar to sections, the MCSymbol class also used a discriminator enum, SymbolKind, to distinguish between object file formats. This enum has also been removed.

Furthermore, the MCSymbol class had an enum Contents to specify the kind of symbol. This name was a bit confusing, so it has been renamed to enum Kind for clarity.

regular symbol
equated symbol
common symbol

A special enumerator, SymContentsTargetCommon, which was used by AMDGPU for a specific type of common symbol, has also been removed. The functionality it provided is now handled by updating ELFObjectWriter to respect the symbol's section index (SHN_AMDGPU_LDS for this special AMDGPU symbol).

sizeof(MCSymbol) has been reduced to 24 bytes on 64-bit systems.

The previous blog post LLVM integrated assembler: Improving expressions and relocations describes other changes:

The MCSymbol::IsUsed flag was a workaround for detecting a subset of invalid reassignments and is removed.
The MCSymbol::IsResolving flag is added to detect cyclic dependencies of equated symbols.

LLVM integrated assembler: Engineering better fragments

MaskRay

July 27, 2025, 15:00

In my previous assembler posts, I've discussed improvements on expression resolving and relocation generation. Now, let's turn our attention to recent refinements within section fragments. Understanding how an assembler utilizes these fragments is key to appreciating the improvements we've made. At a high level, the process unfolds in three main stages:

Parsing phase: The assembler constructs section fragments. These fragments represent sequences of regular instructions or data, span-dependent instructions, alignment directives, and other elements.
Section layout phase: Once fragments are built, the assembler assigns offsets to them and finalizes the span-dependent content.
Relocation decision phase: In the final stage, the assembler evaluates fixups and, if necessary, updates the content of the fragments.

When the LLVM integrated assembler was introduced in 2009, its section and fragment design was quite basic. Performance wasn't the concern at the time. As LLVM evolved, many assembler features added over the years came to rely heavily on this original design. This created a complex web that made optimizing the fragment representation increasingly challenging.

Here's a look at some of the features that added to this complexity over the years:

2010: Mach-O .subsection_via_symbols and atoms
2012: Native Client's bundle alignment mode. I've created a dedicated chapter for this.
2015: Hexagon instruction bundle
2016: CodeView variable definition ranges
2018: RISC-V linker relaxation
2020: x86 -mbranches-within-32B-boundaries
2023: LoongArch linker relaxation. This is largely identical to RISC-V linker relaxation. Any refactoring or improvements to the RISC-V linker relaxation often necessitate corresponding changes to the LoongArch implementation.
2023: z/OS GOFF (Generalized Object File Format)

I've included the start year for each feature to indicate when it was initially introduced, to the best of my knowledge. This doesn't imply that maintenance stopped after that year. On the contrary, many of these features, like RISC-V linker relaxation, require ongoing, active maintenance.

Despite the intricate history, I've managed to untangle these dependencies and implement the necessary fixes. And that, in a nutshell, is what this blog post is all about!

Reducing sizeof(MCFragment)

A significant aspect of optimizing fragment management involved directly reducing the memory footprint of the MCFragment object itself. Several targeted changes contributed to making sizeof(MCFragment) smaller, as mentioned in my previous blog post: Integrated assembler improvements in LLVM 19.

The fragment management system has also been streamlined by transitioning from a doubly-linked list (llvm::iplist) to a singly-linked list, eliminating unnecessary overhead. A few prerequisite commits removed backward iterator requirements. It's worth noting that the complexities introduced by features like NaCl's bundle alignment mode, x86's -mbranches-within-32B-boundaries option, and Hexagon's instruction bundles presented challenges.

The quest for trivially destructible fragments

Historically, MCFragment subclasses, specifically MCDataFragment and MCRelaxableFragment, relied on SmallVector member variables to store their content and fixups. This approach, while functional, presented two key inefficiencies:

Inefficient storage of small objects: The content and fixups for individual fragments are typically very small. Storing a multitude of these tiny objects individually within SmallVectors led to less-than-optimal memory utilization.
Non-trivial destructors: When deallocating sections, the ~MCSection destructor had to meticulously traverse the fragment list and explicitly destroy each fragment.

In 2024, @aengelke initiated a draft to store fragment content out-of-line. Building upon that foundation, I've extended this approach to also store fixups out-of-line, and ensured compatibility with the aforementioned features that cause complexity (especially RISC-V and LoongArch linker relaxation).

Furthermore, MCRelaxableFragment previously contained MCInst Inst;, which also necessitated a non-trivial destructor. To address this, I've redesigned its data structure. Operands are now stored within the parent MCSection, and the MCRelaxableFragment itself only holds references:

uint32_t Opcode = 0;
uint32_t Flags = 0; // x86-only for the EVEX prefix
uint32_t OperandStart = 0;
uint32_t OperandSize = 0;

Unfortunately, we still need to encode MCInst::Flags to support the x86 EVEX prefix, e.g., {evex} xorw $foo, %ax. My hope is that the x86 maintainers might refactor X86MCCodeEmitter::encodeInstruction to make this flag storage unnecessary.

The new design of MCFragment and MCSection is as follows:

class MCFragment {
  ...
  // Track content and fixups for the fixed-size part as fragments are
  // appended to the section. The content remains immutable, except when
  // modified by applyFixup.
  uint32_t ContentStart = 0;
  uint32_t ContentEnd = 0;
  uint32_t FixupStart = 0;
  uint32_t FixupEnd = 0;

  // Track content and fixups for the optional variable-size tail part,
  // typically modified during relaxation.
  uint32_t VarContentStart = 0;
  uint32_t VarContentEnd = 0;
  uint32_t VarFixupStart = 0;
  uint32_t VarFixupEnd = 0;
};

class MCSection {
  ...
  // Content and fixup storage for fragments
  SmallVector<char, 0> ContentStorage;
  SmallVector<MCFixup, 0> FixupStorage;
  SmallVector<MCOperand, 0> MCOperandStorage;
};
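
The out-of-line storage idea can be illustrated with a minimal sketch. This is an assumption-laden simplification rather than LLVM's actual classes: Fragment, Section, newFragment, and append are hypothetical names, std::vector stands in for SmallVector, and only the fixed-size content range is modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <type_traits>
#include <vector>

struct Section;

// A fragment only records a [start, end) range into its section's storage.
struct Fragment {
  uint32_t contentStart = 0;
  uint32_t contentEnd = 0;
  std::string_view content(const Section &s) const;
};

// No owning members, so destroying a fragment is a no-op.
static_assert(std::is_trivially_destructible_v<Fragment>,
              "fragments must not own heap memory");

struct Section {
  std::vector<char> contentStorage; // shared by all fragments in the section
  std::vector<Fragment> fragments;

  // Start a new, empty fragment at the current end of the storage.
  void newFragment() {
    auto pos = static_cast<uint32_t>(contentStorage.size());
    fragments.push_back(Fragment{pos, pos});
  }

  // Append bytes to the last fragment.
  void append(std::string_view bytes) {
    assert(!fragments.empty());
    contentStorage.insert(contentStorage.end(), bytes.begin(), bytes.end());
    fragments.back().contentEnd =
        static_cast<uint32_t>(contentStorage.size());
  }
};

inline std::string_view Fragment::content(const Section &s) const {
  return {s.contentStorage.data() + contentStart,
          static_cast<size_t>(contentEnd - contentStart)};
}
```

Because the bytes live in one vector per section, freeing a section releases a handful of large buffers instead of visiting every fragment.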

(As a side note, the LLVM CamelCase variables are odd. As the MC maintainer, I'd be delighted to see them refactored to camelBack or snake_case if people agree on the direction.)

Key changes:

Fewer fragments: fixed-size part and variable tail

Prior to LLVM 21.1, the assembler, operating with a fragment design dating back to 2009, placed every span-dependent instruction into its own distinct fragment. The x86 code sequence push rax; jmp foo; nop; jmp foo would be represented with numerous fragments: MCDataFragment(push rax); MCRelaxableFragment(jmp foo); MCDataFragment(nop); MCRelaxableFragment(jmp foo).

A more efficient approach emerged: storing both a fixed-size part and an optional variable-size tail within a single fragment.

The fixed-size part maintains a consistent size throughout the assembly process.
The variable-size tail, if present, encodes elements that can change in size or content, such as a span-dependent instruction, an alignment directive, a fill directive, or other similar span-dependent constructs.

The new design led to significantly fewer fragments:

MCFragment(fixed: push rax, variable: jmp foo)
MCFragment(fixed: nop, variable: jmp foo)
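
The difference in fragment counts can be reproduced with a small sketch. The types and functions below (Inst, Frag, oldDesign, newDesign) are hypothetical and only model which instructions end up in which fragment, not their encoding.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Inst {
  std::string text;
  bool spanDependent; // true for instructions such as `jmp foo`
};

struct Frag {
  std::vector<std::string> fixed; // fixed-size part
  std::string variable;           // optional variable-size tail
};

// Old design: every span-dependent instruction gets its own fragment.
inline std::vector<Frag> oldDesign(const std::vector<Inst> &insts) {
  std::vector<Frag> frags;
  bool open = false; // can frags.back() still take fixed-size content?
  for (const Inst &i : insts) {
    if (i.spanDependent) {
      frags.push_back(Frag{{}, i.text});
      open = false;
    } else {
      if (!open) {
        frags.push_back(Frag{});
        open = true;
      }
      frags.back().fixed.push_back(i.text);
    }
  }
  return frags;
}

// New design: a span-dependent instruction becomes the tail of the fragment
// that already holds the preceding fixed-size content.
inline std::vector<Frag> newDesign(const std::vector<Inst> &insts) {
  std::vector<Frag> frags;
  bool open = false; // does frags.back() still lack a variable-size tail?
  for (const Inst &i : insts) {
    if (!open) {
      frags.push_back(Frag{});
      open = true;
    }
    if (i.spanDependent) {
      frags.back().variable = i.text;
      open = false; // setting a tail closes the fragment
    } else {
      frags.back().fixed.push_back(i.text);
    }
  }
  return frags;
}
```

For push rax; jmp foo; nop; jmp foo, oldDesign yields four fragments and newDesign yields two.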

Key changes:

Reducing instruction encoding overhead

Encoding individual instructions is the most performance-critical operation within MCObjectStreamer. Recognizing this, significant effort has been dedicated to reducing this overhead since May 2023.

[MC] Always encode instruction into SmallVector
[MC] Remove the legacy overload of encodeInstruction with a lot of prior cleanups
[MC][ELF] Emit instructions directly into fragment
[MC][X86] Avoid copying MCInst in emitInstrEnd in 2024-06
X86AsmBackend: Remove some overhead from auto padding feature
X86AsmBackend: Simplify isRightAfterData for the auto-pad feature

It's worth mentioning that x86's instruction padding features, introduced in 2020, have imposed considerable overhead. Specifically, these features are:

-mbranches-within-32B-boundaries. See Align branches within 32-Byte boundary (NOP padding)
[X86] Relax existing instructions to reduce the number of nops needed for alignment purposes
"Enhanced relaxation": The feature allows x86 prefix padding for all instructions, effectively making all instructions span-dependent and requiring each to have its own fragment. My D94542 disabled this by default due to concerns about -g vs -g0 differences.

My recent optimization efforts demanded careful attention to this particularly complex and performance-sensitive code.

Eager fragment creation

Encoding an instruction is a far more frequent operation than appending a variable-size tail to the current fragment. In the previous design, the instruction encoder was burdened with an extra check: it had to determine if the current fragment already had a variable-size tail.

encodeInstruction:
  if (current fragment has a variable-size tail)
    start a new fragment
  append data to the current fragment

emitValueToAlignment:
  Encode the alignment in the variable-size tail of the current fragment

emitDwarfLocDirective:
  Encode the .loc in the variable-size tail of the current fragment

Our new strategy optimizes this by maintaining a current fragment that is guaranteed not to have a variable-size tail. This means functions appending data to the fixed-size part no longer need to perform this check. Instead, any function that sets a variable-size tail will now immediately start a new fragment.

Here's how the workflow looks with this optimization:

encodeInstruction:
  assert(current fragment doesn't have a variable-size tail)
  append data to the current fragment

emitValueToAlignment:
  Encode the alignment in the variable-size tail of the current fragment
  start a new fragment

emitDwarfLocDirective:
  Encode the .loc in the variable-size tail of the current fragment
  start a new fragment

Key changes:

MC: Simplify fragment reuse determination
MC: Optimize getOrCreateDataFragment

It's worth noting that the first patch was made possible thanks to the removal of the bundle alignment mode.

Fragment content in trailing data

Our MCFragment class manages four distinct sets of appendable data: fixed-size content, fixed-size fixups, variable-size tail content, and variable-size tail fixups. Of these, the fixed-size content is typically the largest. We can optimize its storage by placing it as trailing data after the fragment object, akin to a flexible array member.

This approach offers several compelling advantages:

Improved data locality: Storing the content after the MCFragment object improves cache utilization.
Simplified metadata: We can replace the pair of uint32_t ContentStart = 0; uint32_t ContentEnd = 0; with a single uint32_t ContentSize;.

This optimization leverages a clever technique made possible by using a special-purpose bump allocator. After allocating sizeof(MCFragment) bytes for a new fragment, we know that any remaining space within the current bump allocator block immediately follows the fragment's end. This contiguous space can then be efficiently used for the fragment's trailing data.

However, this design introduces a few important considerations:

Tail fragment appends only: Data can only be appended to the tail fragment of a subsection. Fragments located in the middle of a subsection are immutable in their fixed-size content. Any post-assembler-layout adjustments must target the variable-size tail.
Dynamic allocation management: When new data needs to be appended, a function is invoked to ensure the current bump allocator block has sufficient space. If not, the current fragment is closed (its fixed-size content is finalized), and a new fragment is started. For instance, an 8-byte sequence could be stored as one single fragment, or, if space constraints dictate, as two fragments each encoding 4 bytes.
New block allocation: If the available space in the current block is insufficient, a new block large enough to accommodate both an MCFragment and the required bytes for its trailing data is allocated.
Section/subsection switching: The previously saved fragment list tail cannot be simply reused. This is because it's tied to the memory space of the previous bump allocator block. Instead, a new fragment must be allocated using the current bump allocator block and appended to the new subsection's tail.
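
The considerations above can be sketched with a toy bump allocator. Everything here is an illustrative assumption rather than LLVM's implementation: Frag, FragmentList, and kBlockSize are hypothetical, the block size is deliberately tiny, and unlike the description above the sketch never splits a single append across two fragments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <new>
#include <vector>

// Fragment header; the fixed-size content is stored right after it.
struct Frag {
  Frag *next = nullptr;
  uint32_t contentSize = 0;
  const char *data() const { return reinterpret_cast<const char *>(this + 1); }
};

constexpr size_t kBlockSize = 64; // tiny, so that block switches are visible

class FragmentList {
public:
  // Start a new tail fragment with room for at least `reserve` content bytes.
  void startFragment(size_t reserve = 0) {
    size_t start = (used + alignof(Frag) - 1) & ~(alignof(Frag) - 1);
    if (!base || start + sizeof(Frag) + reserve > cap) {
      // Not enough room: allocate a block that fits header plus content.
      cap = std::max(kBlockSize, sizeof(Frag) + reserve);
      blocks.push_back(std::make_unique<unsigned char[]>(cap));
      base = blocks.back().get();
      start = 0;
    }
    Frag *f = new (base + start) Frag;
    used = start + sizeof(Frag);
    if (tail)
      tail->next = f;
    else
      head = f;
    tail = f;
  }

  // Append bytes to the tail fragment; earlier fragments stay untouched.
  void append(const void *bytes, size_t n) {
    if (!tail || cap - used < n)
      startFragment(n); // closes the old tail and opens a new fragment
    std::memcpy(base + used, bytes, n);
    used += n;
    tail->contentSize += static_cast<uint32_t>(n);
  }

  const Frag *first() const { return head; }

  size_t fragmentCount() const {
    size_t n = 0;
    for (const Frag *f = head; f; f = f->next)
      ++n;
    return n;
  }

  size_t totalContent() const {
    size_t n = 0;
    for (const Frag *f = head; f; f = f->next)
      n += f->contentSize;
    return n;
  }

private:
  std::vector<std::unique_ptr<unsigned char[]>> blocks;
  unsigned char *base = nullptr; // current block
  size_t used = 0, cap = 0;      // bytes used / capacity of the current block
  Frag *head = nullptr, *tail = nullptr;
};
```

Appending nine 8-byte chunks overflows the first 64-byte block, so the list ends up with two fragments that together hold 72 bytes.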

I have thought about making the variable-size content immediately follow the fixed-size content, but leb128 and x86's potentially very long instructions (up to 15 bytes) stopped me from doing it. There is certainly room for future improvements, though.

Key changes:

GOFF: Only register sections within MCObjectStreamer::changeSection
MC: Allocate initial fragment and define section symbol in changeSection
MCFragment: Use trailing data for fixed-size part

Fragment fixups stored in section

TODO

MCFragment should not hold references to fixups stored in the parent MCSection. Instead, fixups reference the fragment.

The optional variable-size tail of a fragment can have at most one fixup.

Deprecating complexity: Native Client's bundle alignment mode

Google's now-discontinued Native Client (NaCl) project provided a sandboxing environment through a combination of Software Fault Isolation (SFI) and memory segmentation. A distinctive feature of its SFI implementation was the "bundle alignment mode", which adds NOP padding to ensure that no instruction crosses a 32-byte alignment boundary. The verifier's job is to check all instructions starting at 32-byte-multiple addresses.
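
The padding rule itself is simple arithmetic. The following sketch is not taken from LLVM or NaCl: bundlePadding is a hypothetical helper that computes how many NOP bytes must precede an instruction so that it does not cross a bundle boundary, assuming the instruction is no longer than a bundle.

```cpp
#include <cassert>
#include <cstdint>

// Number of NOP bytes to insert before an instruction of `size` bytes placed
// at `offset`, so that it does not straddle a `bundle`-byte boundary.
constexpr uint64_t bundlePadding(uint64_t offset, uint64_t size,
                                 uint64_t bundle = 32) {
  assert(size <= bundle && "instruction cannot fit in one bundle");
  uint64_t posInBundle = offset % bundle;
  // If the instruction would run past the end of the current bundle, pad up
  // to the next bundle start. Ending exactly at the boundary is fine.
  return posInBundle + size > bundle ? bundle - posInBundle : 0;
}
```

For example, a 5-byte instruction at offset 30 needs 2 bytes of padding, while the same instruction at offset 27 ends exactly at the boundary and needs none.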

While the core concept of aligned bundling is intriguing, its implementation within the LLVM assembler proved problematic. Introduced in 2012, this feature imposed noticeable performance penalties on users who had no need for NaCl and, perhaps more critically, significantly increased the complexity of MC's internal workings. I was particularly concerned by its pervasive modifications to MCObjectStreamer and MCAssembler.

The complexity deepened with the introduction of:

2014: MCStreamer's pending labels, which led to more complexity:
- 2015: [MC] Ensure that pending labels are flushed when -mc-relax-all flag is used
- 2019: [MC] Match labels to existing fragments even when switching sections, by an Apple developer. In a nutshell, the pending labels mechanism was causing headaches for Mach-O, requiring additional code to manage.
2015: NaCl's mc-relax-all optimization

In MCObjectStreamer, newly defined labels were put into a "pending label" list and initially assigned to a MCDummyFragment associated with the current section. The symbols would be reassigned to a new fragment when the next instruction or directive was parsed. This pending label system introduced complexity, and a missing flushPendingLabels could lead to subtle bugs related to incorrect symbol values. flushPendingLabels was called by many MCObjectStreamer functions, notably once for each new fragment, adding overhead. It also complicated the label difference evaluation due to MCDummyFragment in MCExpr.cpp:AttemptToFoldSymbolOffsetDifference.

For the following code, aligned bundling requires that .Ltmp0 is defined at addl.

$ clang var.c -S -o - -fPIC -m32
...
.bundle_lock align_to_end
  calll   .L0$pb
.bundle_unlock
.L0$pb:
  popl    %eax
.Ltmp0:
  addl    $_GLOBAL_OFFSET_TABLE_+(.Ltmp0-.L0$pb), %eax

Recognizing these long-standing issues, a series of pivotal changes were undertaken:

2024: [MC] Aligned bundling: remove special handling for RelaxAll removed an optimization for NaCl in the mc-relax-all mode
2024: [MC] Remove pending labels
2024: [MC] AttemptToFoldSymbolOffsetDifference: remove MCDummyFragment check. NFC
2025: Finally, MC: Remove bundle alignment mode, after Derek Schuff agreed to drop NaCl support from LLVM.

Should future features require a variant of bundle alignment, I firmly believe a much cleaner implementation is necessary. This could potentially be achieved through a backend hook within X86AsmBackend::finishLayout, applied after the primary assembler layout phase, similar to how the -mbranches-within-32B-boundaries option is handled, though even that implementation warrants an extensive revisit itself.

Lessons learned

The cost of missing early optimization

Early design choices can have a far-reaching impact on future code. The initial LLVM MC design, while admirably simple in its inception, inadvertently created a rigid foundation. As new features piled on, each relying more and more on the specific fragment internals, rectifying foundational inefficiencies became incredibly challenging. Hyrum's Law was evident: features built on this foundation inevitably depended on all its observable behaviors. Optimizing the underlying structure required not just a change to the core, but also a thorough fix for all its unsuspecting users. I encountered significant struggles with the deeply ingrained complexities stemming from NaCl's bundle alignment mode, x86's -mbranches-within-32B-boundaries option, and the intricacies of RISC-V linker relaxation.

Cargo cult programming and snowball effect

I observed instances of "cargo cult programming", where existing solutions were copied without a full understanding of their underlying rationale or applicability. For example:

The WebAssembly implementation heavily mirrored that of ELF. Consequently, many improvements made to the ELF component often necessitated corresponding, sometimes redundant, changes to the WebAssembly implementation. In addition, the WebAssembly implementation copied ELF-specific code that was irrelevant for WebAssembly's architecture, adding unnecessary bloat and complexity.
LoongArch's RISC-V replication: LoongArch's linker relaxation implementation directly copied the approach taken for RISC-V. Refactoring or improvements to RISC-V's linker relaxation frequently require mirrored changes in the LoongArch codebase, creating parallel maintenance burdens. I am particularly glad that I landed my foundational [RISCV] Make linker-relaxable instructions terminate MCDataFragment and [RISCV] Allow delayed decision for ADD/SUB relocations in 2023, before the LoongArch team replicated the RISC-V approach. This timing, I hope, mitigated some future headaches for their implementation.

These patterns illustrate how initial design choices, or the expedience of copying existing solutions, can lead to a "snowball effect" of accumulating complexity and redundant code that makes future optimization and maintenance significantly harder. On a positive note, I'm also pleased that the streamlining of the relocation generation framework was completed before Apple's upstreaming of their Mach-O support for 32-bit RISC-V. This critical work should provide a more robust and less complex base for their contributions, and reduce maintenance on my end.

The cost of features

Specific features, particularly those designed for niche or specialized use cases like NaCl's bundle alignment mode, introduced disproportionate complexity and performance overhead across the entire assembler. Even though NaCl itself was deprecated in 2020, it took until 2025 to finally excise its complex support from LLVM. This highlights a common challenge in large, open-source projects: while many developers are motivated to add new features, there's often far less incentive or dedicated effort to streamline or remove their underlying implementation complexities once they're no longer strictly necessary or have become a performance drain.

I want to acknowledge the work of individuals like Rafael Ávila de Espíndola, Saleem Abdulrasool, and Nirav Dave, whose improvements to LLVM MC were vital. Without their contributions, the MC layer would undoubtedly be in a far less optimized state today.

Epilogue

This extensive work on fragment optimization would not have been possible without the invaluable contributions of Alexis Engelke. My sincere thanks go to Alexis for his meticulous reviews of numerous patches, his insightful suggestions, and for contributing many significant improvements himself.

What have I learned through the process?

Appendix: How GNU Assembler mastered fragments decades ago

After dedicating several paragraphs to explaining the historical shortcomings of LLVM MC's fragment representation, a natural question arises: how does GNU Assembler (GAS), arguably the other most popular assembler on Linux systems, approach fragment handling?

Delving into its history reveals a fascinating answer. The earliest commit I could locate is a cvs2svn-generated record from April 1991. Given the 1987 copyright notice within the code, it's highly probable that this foundational work on fragments was laid down as early as 1987.

You can explore this initial structure in as.h here: https://github.com/bminor/binutils-gdb/commit/3a69b3aca678a3caf3ade7f9d42d18233b097ec6#diff-0771d3312685417eb5061a8f0856da4f0406ca8bd6c7d68b6a50a026a4e48c9dR212. Please check out as.h and frags.c.

Observing the frag struct, a few points stand out:

While the exact purpose of fr_offset isn't immediately clear to me, fr_fix and fr_var bear a striking resemblance to the concepts we've recently introduced in MCFragment. It might make the variable-size content immediately follow the fixed-size content, though.
The char fr_literal[1] demonstrates an early use of what we now call a flexible array member. Today, GCC and Clang's -fstrict-flex-arrays=2 would report a warning.
fr_symbol could be more appropriately placed within a union.
fr_pcrel_adjust and fr_bsr would ideally be architecture-specific data.
Fragments are allocated using obstacks, which appear to be a more sophisticated form of a bump allocator, with additional bookkeeping overhead.

But truly, I should stop the minor nit-picking. What astonishes me is the sheer foresight demonstrated in GAS's fragment allocator design. Conceived in 1987 or even earlier, it masterfully anticipated solutions that LLVM MC, first conceived in 2009, has only now achieved decades later. This design held the lead on fragment architecture for nearly four decades!

My greatest tribute goes to the original authors of GNU Assembler for this remarkable piece of engineering.

/*
 * A code fragment (frag) is some known number of chars, followed by some
 * unknown number of chars. Typically the unknown number of chars is an
 * instruction address whose size is yet unknown. We always know the greatest
 * possible size the unknown number of chars may become, and reserve that
 * much room at the end of the frag.
 * Once created, frags do not change address during assembly.
 * We chain the frags in (a) forward-linked list(s). The object-file address
 * of the 1st char of a frag is generally not known until after relax().
 * Many things at assembly time describe an address by {object-file-address
 * of a particular frag}+offset.

 BUG: it may be smarter to have a single pointer off to various different
notes for different frag kinds. See how code pans 
 */
struct frag                        /* a code fragment */
{
        unsigned long fr_address; /* Object file address. */
        struct frag *fr_next;        /* Chain forward; ascending address order. */
                                /* Rooted in frch_root. */

        long fr_fix;        /* (Fixed) number of chars we know we have. */
                                /* May be 0. */
        long fr_var;        /* (Variable) number of chars after above. */
                                /* May be 0. */
        struct symbol *fr_symbol; /* For variable-length tail. */
        long fr_offset;        /* For variable-length tail. */
        char        *fr_opcode;        /*->opcode low addr byte,for relax()ation*/
        relax_stateT fr_type;   /* What state is my tail in? */
        relax_substateT        fr_subtype;
                /* These are needed only on the NS32K machines */
        char        fr_pcrel_adjust;
        char        fr_bsr;
        char        fr_literal [1];        /* Chars begin here. */
                                /* One day we will compile fr_literal[0]. */
};

GCC 13.3.0 miscompiles LLVM

MaskRay

July 13, 2025, 15:00

For years, I've been involved in updating LLVM's MC layer. A recent journey led me to eliminate the FK_PCRel_ fixup kinds:

MCFixup: Remove FK_PCRel_

The generic FK_Data_ fixup kinds handle both absolute and PC-relative fixups. ELFObjectWriter sets IsPCRel to true for `.long foo-.`, so the backend has to handle PC-relative FK_Data_.

However, the existence of FK_PCRel_ encouraged backends to implement it as a separate fixup type, leading to redundant and error-prone code.

Removing FK_PCRel_ simplifies the overall fixup mechanism.

As a prerequisite, I had to update several backends that relied on the now-deleted fixup kinds. It was during this process that something unexpected happened. Contributors reported that when built by GCC 13.3.0, the LLVM integrated assembler had test failures.

To investigate, I downloaded and built GCC 13.3.0 locally:

../../configure --prefix=$HOME/opt/gcc-13.3.0 --disable-bootstrap --enable-languages=c,c++ --disable-libsanitizer --disable-multilib
make -j 30 && make -j 30 install

I then built a Release build (-O3) of LLVM. Sure enough, the failure was reproducible:

% /tmp/out/custom-gcc-13/bin/llc llvm/test/CodeGen/X86/2008-08-06-RewriterBug.ll -mtriple=i686 -o s -filetype=obj
Unknown immediate size
UNREACHABLE executed at /home/ray/llvm/llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h:904!
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: /tmp/out/custom-gcc-13/bin/llc llvm/test/CodeGen/X86/2008-08-06-RewriterBug.ll -mtriple=i686 -o s -filetype=obj
1.      Running pass 'Function Pass Manager' on module 'llvm/test/CodeGen/X86/2008-08-06-RewriterBug.ll'.
2.      Running pass 'X86 Assembly Printer' on function '@foo'
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  llc 0x0000000002f06bcb
fish: Job 1, '/tmp/out/custom-gcc-13/bin/llc …' terminated by signal SIGABRT (Abort)

Interestingly, a RelWithDebInfo build (-O2 -g) of LLVM did not reproduce the failure, suggesting either an undefined behavior, or an optimization-related issue within GCC 13.3.0.

The Bisection trail

I built GCC at the releases/gcc-13 branch, and the issue vanished. This strongly indicated that the problem lay somewhere between the releases/gcc-13.3.0 tag and the releases/gcc-13 branch.

The bisection led me to a specific commit, directing me to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109934#c6.

I developed a workaround at the code block with a typo "RemaningOps". Although I had observed it before, I was hesitant to introduce a commit solely for a typo fix. However, it became clear this was the perfect opportunity to address both the typo and implement a workaround for the GCC miscompilation. This led to the landing of this commit, resolving the miscompilation.

Sam James from Gentoo mentioned that the miscompilation was introduced by a commit cherry-picked into GCC 13.3.0. GCC 13.2.0 and GCC 13.4.0 are good.

LLVM integrated assembler: Improving expressions and relocations

MaskRay

May 26, 2025, 15:00

My previous post, LLVM integrated assembler: Improving MCExpr and MCValue, delved into enhancements made to LLVM's internal MCExpr and MCValue representations. This post covers recent refinements to MC, focusing on expression resolving and relocation generation.

Preventing cyclic dependencies

Equated symbols may form a cycle, which is not allowed.

# CHECK: [[#@LINE+2]]:7: error: cyclic dependency detected for symbol 'a'
# CHECK: [[#@LINE+1]]:7: error: expression could not be evaluated
a = a + 1

# CHECK: [[#@LINE+3]]:6: error: cyclic dependency detected for symbol 'b1'
# CHECK: [[#@LINE+1]]:6: error: expression could not be evaluated
b0 = b1
b1 = b2
b2 = b0

Previously, LLVM's integrated assembler used an occurs check to detect these cycles when parsing symbol equating directives.

bool parseAssignmentExpression(StringRef Name, bool allow_redef,
                               MCAsmParser &Parser, MCSymbol *&Sym,
                               const MCExpr *&Value) {
  ...
  // Validate that the LHS is allowed to be a variable (either it has not been
  // used as a symbol, or it is an absolute symbol).
  Sym = Parser.getContext().lookupSymbol(Name);
  if (Sym) {
    // Diagnose assignment to a label.
    //
    // FIXME: Diagnostics. Note the location of the definition as a label.
    // FIXME: Diagnose assignment to protected identifier (e.g., register name).
    if (Value->isSymbolUsedInExpression(Sym))
      return Parser.Error(EqualLoc, "Recursive use of '" + Name + "'");
    ...
  }

isSymbolUsedInExpression implemented the occurs check as a tree (or more accurately, a DAG) traversal.

bool MCExpr::isSymbolUsedInExpression(const MCSymbol *Sym) const {
  switch (getKind()) {
  case MCExpr::Binary: {
    const MCBinaryExpr *BE = static_cast<const MCBinaryExpr *>(this);
    return BE->getLHS()->isSymbolUsedInExpression(Sym) ||
           BE->getRHS()->isSymbolUsedInExpression(Sym);
  }
  case MCExpr::Target: {
    const MCTargetExpr *TE = static_cast<const MCTargetExpr *>(this);
    return TE->isSymbolUsedInExpression(Sym);
  }
  case MCExpr::Constant:
    return false;
  case MCExpr::SymbolRef: {
    const MCSymbol &S = static_cast<const MCSymbolRefExpr *>(this)->getSymbol();
    if (S.isVariable() && !S.isWeakExternal())
      return S.getVariableValue()->isSymbolUsedInExpression(Sym);
    return &S == Sym;
  }
  case MCExpr::Unary: {
    const MCExpr *SubExpr =
        static_cast<const MCUnaryExpr *>(this)->getSubExpr();
    return SubExpr->isSymbolUsedInExpression(Sym);
  }
  }

  llvm_unreachable("Unknown expr kind!");
}

While generally effective, this routine wasn't universally applied across all symbol equating scenarios, such as with .weakref or some target-specific parsing code, leading to potentially undetected cycles and, therefore, infinite loops in assembler execution.

To address this, I adopted a 2-color depth-first search (DFS) algorithm. While a 3-color DFS is typical for DAGs, a 2-color approach suffices for our trees, although this might lead to more work when a symbol is visited multiple times. Shared subexpressions are very rare in LLVM.

Here is the relevant change to evaluateAsRelocatableImpl. I also needed a new bit in MCSymbol.

@@ -497,13 +498,25 @@ bool MCExpr::evaluateAsRelocatableImpl(MCValue &Res, const MCAssembler *Asm,

   case SymbolRef: {
     const MCSymbolRefExpr *SRE = cast<MCSymbolRefExpr>(this);
-    const MCSymbol &Sym = SRE->getSymbol();
+    MCSymbol &Sym = const_cast<MCSymbol &>(SRE->getSymbol());
     const auto Kind = SRE->getKind();
     bool Layout = Asm && Asm->hasLayout();

     // Evaluate recursively if this is a variable.
+    if (Sym.isResolving()) {
+      if (Asm && Asm->hasFinalLayout()) {
+        Asm->getContext().reportError(
+            Sym.getVariableValue()->getLoc(),
+            "cyclic dependency detected for symbol '" + Sym.getName() + "'");
+        Sym.IsUsed = false;
+        Sym.setVariableValue(MCConstantExpr::create(0, Asm->getContext()));
+      }
+      return false;
+    }
     if (Sym.isVariable() && (Kind == MCSymbolRefExpr::VK_None || Layout) &&
         canExpand(Sym, InSet)) {
+      Sym.setIsResolving(true);
+      auto _ = make_scope_exit([&] { Sym.setIsResolving(false); });
       bool IsMachO =
           Asm && Asm->getContext().getAsmInfo()->hasSubsectionsViaSymbols();
       if (Sym.getVariableValue()->evaluateAsRelocatableImpl(Res, Asm,

Unfortunately, I cannot remove MCExpr::isSymbolUsedInExpression, as it is still used by AMDGPU ([AMDGPU] Avoid resource propagation for recursion through multiple functions).
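
Stripped of the MC details, the 2-color idea can be sketched as follows. This is a simplified model under stated assumptions, not the code in MCExpr.cpp: Symbol, SymbolTable, and evaluate are hypothetical, and each equated symbol is just a constant plus the sum of other symbols.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Simplified equated symbol: value = constant + sum of the referenced symbols.
struct Symbol {
  long constant = 0;
  std::vector<std::string> refs;
  bool isResolving = false; // the single "color" bit: on the DFS stack or not
};

using SymbolTable = std::map<std::string, Symbol>;

// Returns the value of `name`, or std::nullopt if the symbol is unknown or
// takes part in a cyclic dependency.
inline std::optional<long> evaluate(SymbolTable &syms,
                                    const std::string &name) {
  auto it = syms.find(name);
  if (it == syms.end())
    return std::nullopt;
  Symbol &sym = it->second;
  if (sym.isResolving)
    return std::nullopt; // reached a symbol that is still being resolved
  sym.isResolving = true;
  long value = sym.constant;
  bool ok = true;
  for (const std::string &ref : sym.refs) {
    std::optional<long> v = evaluate(syms, ref);
    if (!v) {
      ok = false;
      break;
    }
    value += *v;
  }
  sym.isResolving = false; // no "done" color: a later visit recomputes
  if (!ok)
    return std::nullopt;
  return value;
}
```

Since there is no third "finished" color, a symbol referenced twice is evaluated twice, which is the extra work mentioned above.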

Revisiting the `.weakref` directive

The .weakref directive had an intricate impact on the expression resolving framework.

.weakref enables the creation of weak aliases without directly modifying the target symbol's binding. This allows a header file in library A to optionally depend on symbols from library B. When the target symbol is otherwise not referenced, the object file affected by the weakref directive will include an undefined weak symbol. However, when the target symbol is defined or referenced (by the user), it can retain STB_GLOBAL binding to support archive member extraction. GCC's [[gnu::weakref]] attribute, as used in runtime library headers like libgcc/gthr-posix.h, utilizes this feature.

I've noticed a few issues:

Unreferenced .weakref alias, target created undefined target.
Crash when alias was already defined.
VK_WEAKREF was mis-reused by the alias directive of llvm-ml (MASM replacement).

And addressed them with

[MC] Ignore VK_WEAKREF in MCValue::getAccessVariant (2019-12). Wow, it's interesting to realize I'd actually delved into this a few years ago!
MC: Rework .weakref (2025-05)

Expression resolving and reassignments

= and its equivalents (.set, .equ) allow a symbol to be equated multiple times. This means when a symbol is referenced, its current value is captured at that moment, and subsequent reassignments do not alter prior references.

.data
.set x, 0
.long x         // reference the first instance
x = .-.data
.long x         // reference the second instance
.set x,.-.data
.long x         // reference the third instance

The assembly code evaluates to .long 0; .long 4; .long 8.

Historically, the LLVM integrated assembler restricted reassigning symbols whose value wasn't a parse-time integer constant (MCConstantExpr). This was a safeguard against potentially unsafe reassignments, as an old value might still be referenced.

% clang -c g.s
g.s:6:8: error: invalid reassignment of non-absolute variable 'x'
.set x,.-.data
       ^

The safeguard was implemented with multiple conditions, aided by a mysterious IsUsed variable.

// Diagnose assignment to a label.
//
// FIXME: Diagnostics. Note the location of the definition as a label.
// FIXME: Diagnose assignment to protected identifier (e.g., register name).
if (Value->isSymbolUsedInExpression(Sym))
  return Parser.Error(EqualLoc, "Recursive use of '" + Name + "'");
else if (Sym->isUndefined(/*SetUsed*/ false) && !Sym->isUsed() &&
         !Sym->isVariable())
  ; // Allow redefinitions of undefined symbols only used in directives.
else if (Sym->isVariable() && !Sym->isUsed() && allow_redef)
  ; // Allow redefinitions of variables that haven't yet been used.
else if (!Sym->isUndefined() && (!Sym->isVariable() || !allow_redef))
  return Parser.Error(EqualLoc, "redefinition of '" + Name + "'");
else if (!Sym->isVariable())
  return Parser.Error(EqualLoc, "invalid assignment to '" + Name + "'");
else if (!isa<MCConstantExpr>(Sym->getVariableValue()))
  return Parser.Error(EqualLoc,
                      "invalid reassignment of non-absolute variable '" +
                          Name + "'");

Over the past few years, while getting various Linux kernel ports to build with Clang, we worked around this by modifying the assembly code itself:

ARM: 8971/1: replace the sole use of a symbol with its definition in 2020-04
crypto: aesni - add compatibility with IAS in 2020-07
powerpc/64/asm: Do not reassign labels in 2021-12

This prior behavior wasn't ideal. I've since enabled proper reassignment by implementing a system where the symbol is cloned upon redefinition, and the symbol table is updated accordingly. Crucially, any existing references to the original symbol remain unchanged, and the original symbol is no longer included in the final emitted symbol table.
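
The clone-on-redefinition idea can be illustrated with a toy symbol table. This is only a sketch under stated assumptions, not the MC implementation: `Symbol`, `SymbolTable`, `assign`, `reference`, and `emitted` are hypothetical names, and a symbol's value is a plain integer.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified symbol: just a name and a value.
struct Symbol {
  std::string name;
  int64_t value;
  bool redefined; // true once a newer instance with the same name exists
};

class SymbolTable {
  std::vector<std::unique_ptr<Symbol>> storage;      // owns every instance
  std::unordered_map<std::string, Symbol *> current; // name -> latest instance

public:
  // `name = value`: create a fresh instance instead of mutating the old one.
  Symbol *assign(const std::string &name, int64_t value) {
    auto it = current.find(name);
    if (it != current.end())
      it->second->redefined = true; // the old instance will not be emitted
    storage.push_back(std::unique_ptr<Symbol>(new Symbol{name, value, false}));
    current[name] = storage.back().get();
    return storage.back().get();
  }

  // A reference captures whichever instance is current at this point.
  const Symbol *reference(const std::string &name) const {
    auto it = current.find(name);
    return it == current.end() ? nullptr : it->second;
  }

  // Instances that would appear in the final symbol table.
  std::vector<const Symbol *> emitted() const {
    std::vector<const Symbol *> out;
    for (const auto &s : storage)
      if (!s->redefined)
        out.push_back(s.get());
    return out;
  }
};
```

Replaying the `.set x, 0` / `x = 4` / `.set x, 8` sequence against this table gives three references holding 0, 4, and 8, while `emitted()` contains only the last instance of `x`.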

Before rolling out this improvement, I discovered problematic uses in the AMDGPU and ARM64EC backends that required specific fixes or workarounds. This is a common challenge when making general improvements to LLVM's MC layer: you often need to untangle and resolve individual backend-specific "hacks" before a more generic interface enhancement can be applied.

MCParser: Error when .set reassigns a non-redefinable variable
MC: Allow .set to reassign non-MCConstantExpr expressions

For the following assembly, newer Clang emits relocations referencing foo, foo, bar, foo like GNU Assembler.

b = a
a = foo
call a
call b
a = bar
call a
call b

Relocation generation

For a deeper dive into the concepts of relocation generation, you might find my previous post, Relocation generation in assemblers, helpful.

Driven by the need to support new RISC-V vendor relocations (e.g., Xqci extensions from Qualcomm) and my preference against introducing an extra MCAsmBackend hook, I've significantly refactored LLVM's relocation generation framework. This effort generalized existing RISC-V/LoongArch ADD/SUB relocation logic and enabled its customization for other targets like AVR and PowerPC.

MC: Generalize RISCV/LoongArch handleAddSubRelocations and AVR shouldForceRelocation

The linker relaxation framework sometimes generated redundant relocations that could have been resolved. This occurred in several scenarios, including:

.option norelax
j label
// For assembly input, RISCVAsmParser::ParseInstruction sets ForceRelocs (https://reviews.llvm.org/D46423).
// For direct object emission, RISCVELFStreamer sets ForceRelocs (#77436)
.option relax
call foo  // linker-relaxable

.option norelax
j label   // redundant relocation due to ForceRelocs
.option relax

label:

And also with label differences within a section without linker-relaxable instructions:

call foo

.section .text1,"ax"
# No linker-relaxable instruction. Label differences should be resolved.
w1:
  nop
w2:

.data
# Redundant R_RISCV_SET32 and R_RISCV_SUB32
.long w2-w1

These issues have now been resolved through a series of patches, significantly revamping the target-neutral relocation generation framework. Key contributions include:

[MC] Refactor fixup evaluation and relocation generation
RISCV,LoongArch: Encode RELAX relocation implicitly
RISCV: Remove shouldForceRelocation and unneeded relocations
MC: Remove redundant relocations for label differences

I've also streamlined relocation generation within the SPARC backend. Given its minimal number of relocations, the SPARC implementation could serve as a valuable reference for downstream targets seeking to customize their own relocation handling.

Simplification to assembly and machine code emission

For a dive into the core classes involved in LLVM's assembly and machine code emission, you might read my Notes on LLVM assembly and machine code emission.

The MCAssembler class orchestrates the emission process, managing MCAsmBackend, MCCodeEmitter, and MCObjectWriter. In turn, MCObjectWriter oversees MCObjectTargetWriter.

Historically, many member functions within the subclasses of MCAsmBackend, MCObjectWriter, and MCObjectTargetWriter accepted a MCAssembler * argument. This was often redundant, as it was typically only used to access the MCContext instance. To streamline this, I've added a MCAssembler * member variable directly to MCAsmBackend, MCObjectWriter, and MCObjectTargetWriter, along with convenient helper functions like getContext. This change cleans up the interfaces and improves code clarity.

MCAsmBackend: Add member variable MCAssembler * and define getContext
ELFObjectWriter: Remove the MCContext argument from getRelocType
MachObjectWriter: Remove the MCAssembler argument from getSymbolAddress
WinCOFFObjectWriter: Simplify code with member MCAssembler *

Previously, the ARM, Hexagon, and RISC-V backends had unique requirements that led to extra arguments being passed to MCAsmBackend hooks. These arguments were often unneeded by other targets. I've since refactored these interfaces, replacing those specialized arguments with more generalized and cleaner approaches.

ELFObjectWriter: Move Thumb-specific condition to ARMELFObjectWriter
MCAsmBackend: Remove MCSubtargetInfo argument
MCAsmBackend,X86: Pass MCValue to fixupNeedsRelaxationAdvanced. NFC
MCAsmBackend,Hexagon: Remove MCRelaxableFragment from fixupNeedsRelaxationAdvanced
MCAsmBackend: Simplify applyFixup

Future plan

The assembler's ARM port has a limitation where only relocations with implicit addends (REL) are handled. For CREL, we aim to use explicit addends across all targets to simplify linker/tooling implementation, but this is incompatible with ARMAsmBackend's current design. See this ARM CREL assembler issue https://github.com/llvm/llvm-project/issues/141678.

To address this issue, we should

In MCAssembler::evaluateFixup, generalize MCFixupKindInfo::FKF_IsAlignedDownTo32Bits (ARM hack, also used by other backends) to support more fixups, including ARM::fixup_arm_uncondbl (R_ARM_CALL). Create a new hook in MCAsmBackend.
In ARMAsmBackend, move the Value -= 8 code from adjustFixupValue to the new hook.

unsigned ARMAsmBackend::adjustFixupValue(const MCAssembler &Asm,
...
  case ARM::fixup_arm_condbranch:
  case ARM::fixup_arm_uncondbranch:
  case ARM::fixup_arm_uncondbl:
  case ARM::fixup_arm_condbl:
  case ARM::fixup_arm_blx:
    // Check that the relocation value is legal.
    Value -= 8;
    if (!isInt<26>(Value)) {
      Ctx.reportError(Fixup.getLoc(), "Relocation out of range");
      return 0;

Enabling RELA/CREL support requires significant effort and exceeds my expertise or willingness to address for AArch32. However, I do want to add a new MCAsmBackend hook to minimize AArch32's invasive modifications to the generic relocation generation framework.

For reference, the arm-vxworks port in binutils introduced RELA support in 2006.

LLVM integrated assembler: Improving MCExpr and MCValue

MaskRay

2025年4月6日 15:00

In my previous post, Relocation Generation in Assemblers, I explored some key concepts behind LLVM’s integrated assemblers. This post dives into recent improvements I’ve made to refine that system.

The LLVM integrated assembler handles fixups and relocatable expressions as distinct entities. Relocatable expressions, in particular, are encoded using the MCValue class, which originally looked like this:

class MCValue {
  const MCSymbolRefExpr *SymA = nullptr, *SymB = nullptr;
  int64_t Cst = 0;
  uint32_t RefKind = 0;
};

In this structure:

RefKind acts as an optional relocation specifier, though only a handful of targets actually use it.
SymA represents an optional symbol reference (the addend).
SymB represents another optional symbol reference (the subtrahend).
Cst holds a constant value.

While functional, this design had its flaws. For one, the way relocation specifiers were encoded varied across architectures:

Targets like COFF, Mach-O, and ELF's PowerPC, SystemZ, and X86 embed the relocation specifier within MCSymbolRefExpr *SymA as part of SubclassData.
Conversely, ELF targets such as AArch64, MIPS, and RISC-V store it as a target-specific subclass of MCTargetExpr, and convert it to MCValue::RefKind during MCValue::evaluateAsRelocatable.

Another issue was with SymB. Despite being typed as const MCSymbolRefExpr *, its MCSymbolRefExpr::VariantKind field went unused. This is because expressions like add - sub@got are not relocatable.

Over the weekend, I tackled these inconsistencies and reworked the representation into something cleaner:

class MCValue {
  const MCSymbol *SymA = nullptr, *SymB = nullptr;
  int64_t Cst = 0;
  uint32_t Specifier = 0;
};

This updated design not only aligns more closely with the concept of relocatable expressions but also shaves off some compile time in LLVM. The ambiguous RefKind has been renamed to Specifier for clarity. Additionally, targets that previously encoded the relocation specifier within MCSymbolRefExpr (rather than using MCTargetExpr) can now access it directly via MCValue::Specifier.

To support this change, I made a few adjustments:

Introduced getAddSym and getSubSym methods, returning const MCSymbol *, as replacements for getSymA and getSymB.
Eliminated dependencies on the old accessors, MCValue::getSymA and MCValue::getSymB.
Reworked the expression folding code that handles + and -
Stored the const MCSymbolRefExpr *SymA specifier at MCValue::Specifier
Some targets relied on PC-relative fixups with explicit specifiers forcing relocations. I have defined MCAsmBackend::shouldForceRelocation for SystemZ and cleaned up ARM and PowerPC
Changed the type of SymA and SymB to const MCSymbol *
Replaced the temporary getSymSpecifier with getSpecifier
Replaced the legacy getAccessVariant with getSpecifier

Streamlining Mach-O support

Mach-O assembler support in LLVM has accumulated significant technical debt, impacting both target-specific and generic code. One particularly nagging issue was the const SectionAddrMap *Addrs parameter in MCExpr::evaluateAs* functions. This parameter existed to handle cross-section label differences, primarily for generating (compact) unwind information in Mach-O. A typical example of this can be seen in assembly like:

        .section        __TEXT,__text,regular,pure_instructions
Leh_func_begin0:
        .section        __TEXT,__eh_frame,coalesced,no_toc+strip_static_syms+live_support
Ltmp3:
Ltmp4 = Leh_func_begin0-Ltmp3
        .long   Ltmp4

The SectionAddrMap *Addrs parameter always felt like a clunky workaround to me. It wasn’t until I dug into the Mach-O AArch64 object writer that I realized this hack wasn't necessary for that writer. This discovery prompted a cleanup effort to remove the dependency on SectionAddrMap for ARM and X86 and eliminate the parameter:

[MC,MachO] Replace SectionAddrMap workaround with cleaner variable handling
MCExpr: Remove unused SectionAddrMap workaround

While I was at it, I also tidied up MCSymbolRefExpr by removing the clunky HasSubsectionsViaSymbolsBit, further simplifying the codebase.

Streamlining InstPrinter

The MCExpr code also determines how expression operands in assembly instructions are printed. I have made improvements in this area as well:

[MC] Don't print () around $ names
[MC] Simplify MCBinaryExpr/MCUnaryExpr printing by reducing parentheses

Relocation generation in assemblers

MaskRay

2025年3月16日 15:00

This post explores how GNU Assembler and LLVM integrated assembler generate relocations, an important step to generate a relocatable file. Relocations identify parts of instructions or data that cannot be fully determined during assembly because they depend on the final memory layout, which is only established at link time or load time. These are essentially placeholders that will be filled in (typically with absolute addresses or PC-relative offsets) during the linking process.

Relocation generation: the basics

Symbol references are the primary candidates for relocations. For instance, in the x86-64 instruction movl sym(%rip), %eax (GNU syntax), the assembler calculates the displacement between the program counter (PC) and sym. This distance affects the instruction's encoding and typically triggers an R_X86_64_PC32 relocation, unless sym is a local symbol defined within the current section.

Both the GNU assembler and LLVM integrated assembler utilize multiple passes during assembly, with several key phases relevant to relocation generation:

Parsing phase

During parsing, the assembler builds section fragments that contain instructions and other directives. It parses each instruction into its opcode (e.g., movl) and operands (e.g., sym(%rip), %eax). It identifies registers, immediate values (like 3 in movl $3, %eax), and expressions.

Expressions can be constants, symbol references (like sym), or unary and binary operators (-sym, sym0-sym1). Those unresolvable at parse time (potential relocation candidates) turn into "fixups". These often skip immediate operand range checks, as shown here:

% echo 'addi a0, a0, 2048' | llvm-mc -triple=riscv64
<stdin>:1:14: error: operand must be a symbol with %lo/%pcrel_lo/%tprel_lo modifier or an integer in the range [-2048, 2047]
addi a0, a0, 2048
             ^
% echo 'addi a0, a0, %lo(x)' | llvm-mc -triple riscv64 -show-encoding
        addi    a0, a0, %lo(x)                  # encoding: [0x13,0x05,0bAAAA0101,A]
                                        #   fixup A - offset: 0, value: %lo(x), kind: fixup_riscv_lo12_i

A fixup ties to a specific location (an offset within a fragment), with its value being the expression (which must eventually evaluate to a relocatable expression).

Meanwhile, the assembler tracks defined and referenced symbols, and for ELF, it tracks symbol bindings (STB_LOCAL, STB_GLOBAL, STB_WEAK) from directives like .globl, .weak, or the rarely used .local.

Section layout phase

After parsing, the assembler arranges each section by assigning precise offsets to its fragments: instructions, data, or other directives (e.g., .line, .uleb128). It calculates sizes and adjusts for alignment. This phase finalizes symbol offsets (e.g., start: at offset 0x10) while leaving external ones for the linker.

This phase, which employs a fixed-point iteration, is quite complex. I won't go into details, but you might find Clang's -O0 output: branch displacement and size increase interesting.

Relocation decision phase

Then the assembler evaluates each fixup to determine if it can be resolved directly or requires a relocation entry. This process starts by attempting to convert fixups into relocatable expressions.

Evaluating relocatable expressions

In their most general form, relocatable expressions follow the pattern relocation_specifier(sym_a - sym_b + offset), where

relocation_specifier: This is optional. I will explain this concept later.
sym_a is a symbol reference (the "addend")
sym_b is an optional symbol reference (the "subtrahend")
offset is a constant value

Most common cases involve only sym_a or offset (e.g., movl sym(%rip), %eax or movl $3, %eax). Only a few target architectures support the subtrahend term (sym_b). Notable examples include AVR and RISC-V, as explored in The dark side of RISC-V linker relaxation.
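
To make the `sym_a - sym_b + offset` shape concrete, here is a minimal sketch of folding `+` and `-` into that form. It is an illustration under simplifying assumptions, not the GNU Assembler or LLVM implementation: `RelocValue`, `makeConst`, `makeSym`, `foldNeg`, `foldAdd`, and `foldSub` are hypothetical names, symbols are plain strings, and identical symbols are not cancelled.

```cpp
#include <cstdint>
#include <string>

// Hypothetical relocatable value: symA - symB + cst. Empty string = absent.
struct RelocValue {
  std::string symA, symB;
  int64_t cst;
  RelocValue() : cst(0) {}
};

RelocValue makeConst(int64_t c) { RelocValue v; v.cst = c; return v; }
RelocValue makeSym(const std::string &s) { RelocValue v; v.symA = s; return v; }

// Negation swaps the addend and subtrahend slots.
RelocValue foldNeg(const RelocValue &v) {
  RelocValue r;
  r.symA = v.symB;
  r.symB = v.symA;
  r.cst = static_cast<int64_t>(0 - static_cast<uint64_t>(v.cst));
  return r;
}

// Folds l + r. Fails when the result would need two addends or two
// subtrahends, i.e. when it no longer fits the relocatable form.
bool foldAdd(const RelocValue &l, const RelocValue &r, RelocValue &out) {
  RelocValue res = l;
  res.cst = static_cast<int64_t>(static_cast<uint64_t>(l.cst) +
                                 static_cast<uint64_t>(r.cst));
  if (!r.symA.empty()) {
    if (!res.symA.empty())
      return false; // e.g. a + b
    res.symA = r.symA;
  }
  if (!r.symB.empty()) {
    if (!res.symB.empty())
      return false; // e.g. -a - b
    res.symB = r.symB;
  }
  out = res;
  return true;
}

bool foldSub(const RelocValue &l, const RelocValue &r, RelocValue &out) {
  return foldAdd(l, foldNeg(r), out);
}
```

In this model `a - b + 8` folds successfully, while `a + b` is rejected because both operands want the single addend slot. Whether a successfully folded `a - b` is then accepted still depends on the target and on whether `b` is defined.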

Attempting to use unsupported expression forms will result in assembly errors:

% echo -e 'movl a+b, %eax\nmovl a-b, %eax' | clang -c -xassembler -
<stdin>:1:1: error: expected relocatable expression
movl a+b, %eax
^
<stdin>:2:1: error: symbol 'b' can not be undefined in a subtraction expression
movl a-b, %eax
^

Let's use some notations from the AArch64 psABI.

S is the address of the symbol.
A is the addend for the relocation.
P is the address of the place being relocated (derived from r_offset).
GOT is the address of the Global Offset Table, the table of code and data addresses to be resolved at dynamic link time.
GDAT(S+A) represents a pointer-sized entry in the GOT for address S+A.

PC-relative fixups

PC-relative fixups compute their values as sym_a - current_location + offset (S - P + A) and can be seen as a special case that uses sym_b. (I’ve skipped - sym_b, since no target I know permits a subtrahend here.)

When sym_a is a non-ifunc local symbol defined within the current section, these PC-relative fixups evaluate to constants. But if sym_a is a global or weak symbol in the same section, a relocation entry is generated. This ensures ELF symbol interposition stays in play.

In contrast, label differences (e.g. .quad g-f) can be resolved even if f and g are global.

On some targets (e.g., AArch64, PowerPC, RISC-V), the PC-relative offset is relative to the start of the instruction (P), while others (e.g., AArch32, x86) are relative to P plus a constant.

Resolution Outcomes

The assembler's evaluation of fixups leads to one of three outcomes:

Error: When the expression isn't supported.
Resolved fixups: The assembler updates the relevant bits in the instruction directly. No relocation entry is needed.
- There are target-specific exceptions that make the fixup unresolved. In AArch64 adrp x0, l0; l0:, the immediate might be either 0 or 1, dependent on the instruction address. In RISC-V, linker relaxation might make fixups unresolved.
Unresolved fixups: When the fixup evaluates to a relocatable expression but not a constant, the assembler
- Generates an appropriate relocation (offset, type, symbol, addend).
- For targets that use RELA, usually zeros out the bits in the instruction field that will be modified by the linker.
- For targets that use REL, leaves the addend in the instruction field.
- If the referenced symbol is defined and local, and the relocation type is not in exceptions (gas tc_fix_adjustable), the relocation references the section symbol instead of the local symbol.

Fixup resolution depends on the fixup type:

PC-relative fixups that describe the symbol itself (the relocation operation looks like S - P + A) resolve to a constant if sym_a is a non-ifunc local symbol defined in the current section.
relocation_specifier(S + A) style fixups resolve when S refers to an absolute symbol.
Other fixups, including TLS and GOT related ones, remain unresolved.
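
The first rule can be sketched as a small decision function. This is a simplified model rather than GNU Assembler or LLVM code: `Binding`, `SymInfo`, and `evaluatePCRel` are hypothetical names, sections are identified by an integer index, and target-specific exceptions (such as linker relaxation or the AArch64 adrp case mentioned above) are ignored.

```cpp
#include <cstdint>

enum class Binding { Local, Global, Weak };

// Hypothetical description of sym_a as seen by the assembler.
struct SymInfo {
  bool defined;    // defined in this object file
  Binding binding; // ELF symbol binding
  bool isIFunc;    // ifunc symbols always need a relocation
  int section;     // index of the defining section (if defined)
  uint64_t offset; // offset within that section (if defined)
};

// Tries to resolve an `S - P + A` fixup placed at fixupOffset in
// fixupSection. Returns true and sets `value` when no relocation is needed;
// returns false when a relocation entry must be emitted instead.
bool evaluatePCRel(const SymInfo &sym, int fixupSection, uint64_t fixupOffset,
                   int64_t addend, int64_t &value) {
  if (!sym.defined || sym.binding != Binding::Local || sym.isIFunc ||
      sym.section != fixupSection)
    return false; // the distance is not known, or the symbol may be preempted
  value = static_cast<int64_t>(sym.offset - fixupOffset) + addend;
  return true;
}
```

For a local symbol at offset 0x10 and a fixup at offset 0x4 in the same section with addend -4, the value is 8 and no relocation is produced. Changing the binding to global, or moving the fixup to another section, makes the function return false.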

For ELF targets, if a non-TLS relocation operation references the symbol itself S (not GDAT), it may be adjusted to reference the section symbol instead.

If you are interested in relocation representations in different object file formats, please check out my post Exploring object file formats.

If an equated symbol sym is resolved relative to a section, relocations are generated against sym. Otherwise, if it resolves to a constant or an undefined symbol, relocations are generated against that constant or undefined symbol.

Fixup overflow check

For .long x, GAS accepts x if its value is in the range (-2**32, 2**32). This design allows .long x to work regardless of signedness. When a symbol is involved, GAS supports both .long sym-0xffffffff and .long sym+1, as well as .long sym+0xffffffff and .long sym-1. However, .long sym+0x100000000 is rejected in favor of .long sym+0.

The underlying check asks: "can this value be truncated to 32 bits without losing bit-pattern information?" The accepted range is the union of:

uint32_t values: [0, 2**32)
int32_t values: [-2**31, 2**31)
Negative values that fit in 33 bits: (-2**32, -2**31)

The union gives (-2**32, 2**32).

Note: the union of just int32_t and uint32_t is [-2**31, 2**32), which matches checkIntUInt in lld/ELF (https://reviews.llvm.org/D63690).
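
The two ranges can be written down as predicates. This is just the arithmetic from the text for the 32-bit case, not code taken from GAS or lld; `fitsGasLong` and `fitsIntOrUInt32` are hypothetical names.

```cpp
#include <cstdint>

// The range the text describes for `.long x` in GAS: (-2**32, 2**32).
bool fitsGasLong(int64_t v) {
  const int64_t limit = int64_t(1) << 32;
  return v > -limit && v < limit;
}

// Union of int32_t and uint32_t only: [-2**31, 2**32).
bool fitsIntOrUInt32(int64_t v) {
  return v >= -(int64_t(1) << 31) && v < (int64_t(1) << 32);
}
```

The two predicates differ only for values in (-2**32, -2**31): for example, -0xffffffff passes `fitsGasLong` but fails `fitsIntOrUInt32`, while 0x100000000 fails both.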

Examples in action

Branches

% echo -e 'call fun\njmp fun' | clang -c -xassembler - -o - | fob -dr -
...
       0: e8 00 00 00 00                callq   0x5 <.text+0x5>
                0000000000000001:  R_X86_64_PLT32       fun-0x4
       5: e9 00 00 00 00                jmp     0xa <.text+0xa>
                0000000000000006:  R_X86_64_PLT32       fun-0x4
% echo -e 'bl fun\nb fun' | clang --target=aarch64 -c -xassembler - -o - | fob -dr -
...
       0: 94000000      bl      0x0 <.text>
                0000000000000000:  R_AARCH64_CALL26     fun
       4: 14000000      b       0x4 <.text+0x4>
                0000000000000004:  R_AARCH64_JUMP26     fun

Absolute and PC-relative symbol references

% echo -e 'movl a, %eax\nmovl a(%rip), %eax' | clang -c -xassembler - -o - | llvm-objdump -dr -
...
       0: 8b 04 25 00 00 00 00          movl    0x0, %eax
                0000000000000003:  R_X86_64_32S a
       7: 8b 05 00 00 00 00             movl    (%rip), %eax            # 0xd <.text+0xd>
                0000000000000009:  R_X86_64_PC32        a-0x4

(a-.)(%rip) would probably be more semantically correct but is not adopted by GNU Assembler.

Relocation specifiers

Relocation specifiers guide the assembler on how to resolve and encode expressions into instructions. They specify details like:

Whether to reference the symbol itself, its Procedure Linkage Table (PLT) entry, or its Global Offset Table (GOT) entry.
Which part of a symbol's address to use (e.g., lower or upper bits).
Whether to use an absolute address or a PC-relative one.

This concept appears across various architectures but with inconsistent terminology. The Arm architecture refers to elements like :lo12: and :lower16: as "relocation specifiers". IBM's AIX documentation also uses this term. Many GNU Binutils target documents simply call these "modifiers", while AVR documentation uses "relocatable expression modifiers".

Picking the right term was tricky. "Relocatable expression modifier" nails the idea of tweaking relocatable expressions but feels overly verbose. "Relocation modifier", though concise, suggests adjustments happen during the linker's relocation step rather than the assembler's expression evaluation. I landed on "relocation specifier" as the winner. It's clear, aligns with Arm and IBM’s usage, and fits the assembler's role seamlessly.

For example, RISC-V addi can be used with either an absolute address or a PC-relative address. Relocation specifiers %lo and %pcrel_lo could differentiate the two uses. Similarly, %hi, %pcrel_hi, and %got_pcrel_hi could differentiate the uses of lui and auipc.

# Position-dependent code (PDC) - absolute addressing
lui     a0, %hi(var)                    # Load upper immediate with high bits of symbol address
addi    a0, a0, %lo(var)                # Add lower 12 bits of symbol address

# Position-independent code (PIC) - PC-relative addressing
auipc   a0, %pcrel_hi(var)              # Add upper PC-relative offset to PC
addi    a0, a0, %pcrel_lo(.Lpcrel_hi1)  # Add lower 12 bits of PC-relative offset

# Position-independent code via Global Offset Table (GOT)
auipc   a0, %got_pcrel_hi(var)          # Calculate address of GOT entry relative to PC
ld      a0, %pcrel_lo(.Lpcrel_hi1)(a0)  # Load var's address from GOT

Why use %hi with lui if it's always paired? It's about clarity and explicitness. %hi ensures consistency with %lo and cleanly distinguishes it from %pcrel_hi. Since both lui and auipc share the U-type instruction format, tying relocation specifiers to formats rather than specific instructions is a smart, flexible design choice.
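
For readers unfamiliar with how the high and low parts fit together, the split performed by %hi and %lo can be written out as arithmetic. The sketch below assumes a 32-bit address (RV32); `hi20`, `lo12`, and `combine` are hypothetical helper names. The 0x800 rounding term exists because the 12-bit immediate of addi is sign-extended.

```cpp
#include <cstdint>

// Upper 20 bits, rounded up when the low 12 bits will be read as negative.
uint32_t hi20(uint32_t addr) { return (addr + 0x800u) >> 12; }

// Low 12 bits, sign-extended the way addi interprets its immediate.
int32_t lo12(uint32_t addr) {
  uint32_t low = addr & 0xfffu;
  return low >= 0x800u ? static_cast<int32_t>(low) - 0x1000
                       : static_cast<int32_t>(low);
}

// What a lui+addi pair computes on RV32: (hi << 12) + sign-extended lo.
uint32_t combine(uint32_t hi, int32_t lo) {
  return (hi << 12) + static_cast<uint32_t>(lo);
}
```

For the address 0x12345fff, `hi20` gives 0x12346 and `lo12` gives -1, and the pair recombines to the original address. The PC-relative specifiers apply the same split to the distance between the symbol and the auipc instruction rather than to the absolute address.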

Relocation specifier flavors

Assemblers use various syntaxes for relocation specifiers, reflecting architectural quirks and historical conventions. Below, we explore the main flavors, their usage across architectures, and some of their peculiarities.

expr@specifier

This is likely the most widespread syntax, adopted by many binutils targets, including ARC, C-SKY, Power, M68K, SuperH, SystemZ, and x86, among others. It's also used in Mach-O object files, e.g., adrp x8, _bar@GOTPAGE.

This suffix style puts the specifier after an @. It's intuitive: think sym@got. In PowerPC, operators can get elaborate, such as sym@toc@l(9). Here, @toc@l is a single, indivisible operator (not two separate @ pieces) indicating a TOC-relative reference with a low 16-bit extraction.

Parsing is loose: while both expr@specifier+expr and expr+expr@specifier are accepted (by many targets), conceptually it's just specifier(expr+expr). For example, x86 accepts sym@got+4 or sym+4@got, but don't misread it: @got applies to sym+4, not just sym.

%specifier(expr)

MIPS, SPARC, RISC-V, and LoongArch favor this prefix style, wrapping the expression in parentheses for clarity. In MIPS, parentheses are optional, and operators can nest, like

# MIPS
addiu   $2, $2, %lo(0x12345)
addiu   $2, $2, %lo 0x12345
lui     $1, %hi(%neg(%gp_rel(main)))
ld      $1, %got_page($.str)($gp)

Like expr@specifier, the specifier applies to the whole expression. Don't misinterpret %lo(3)+sym: it resolves as sym+3 with an R_MIPS_LO16 relocation.

# MIPS
addiu   $2, $2, %lo(3)+sym  # R_MIPS_LO16  sym+0x3
addiu   $2, $2, %lo 3+sym   # R_MIPS_LO16  sym+0x3

SPARC has an anti-pattern. Its %lo and %hi expand to different relocation types depending on whether gas's -KPIC option (llvm-mc -position-independent) is specified.

expr(specifier)

A simpler suffix style, this is used by AArch32 for data directives. It's less common but straightforward, placing the operator in parentheses after the expression.

.word sym(gotoff)
.long f(FUNCDESC)

.long f(got)+3    // allowed by GNU assembler and LLVM integrated assembler, but probably not used in the wild

:specifier:expr

AArch32 and AArch64 adopt this colon-framed prefix notation, avoiding the confusion that parentheses might introduce.

// AArch32
movw    r0, :lower16:x

// AArch64
add     x8, x8, :lo12:sym

adrp    x0, :got:var
ldr     x0, [x0, :got_lo12:var]

Applying this syntax to data directives or instructions' first operands, however, could create parsing ambiguity. In both GNU Assembler and LLVM, .word :plt:fun would be interpreted as .word: plt: fun, treating .word and plt as labels, rather than achieving the intended meaning.

One idea is to use # for disambiguation:

.word #:gotpcrel:var

Recommendation

For new architectures, I'd suggest adopting %specifier(expr), and never use @specifier. The % symbol works seamlessly with data directives, and during operand parsing, the parser can simply peek at the first token to check for a relocation specifier.

I favor %specifier(expr) over %specifier expr because it provides clearer scoping, especially in data directives with multiple operands, such as .long %lo(a), %lo(b).

( %specifier(...) resembles % expansion in GNU Assembler's altmacro mode.

.altmacro
.macro m arg; .long \arg; .endm
.data; m %(1+2)

)

Inelegance

RISC-V favors %specifier(expr) but clings to call sym@plt for legacy reasons.

AArch64 uses :specifier:expr, yet R_AARCH64_PLT32 (.word foo@plt - .) and PAuth ABI (.quad (g + 7)@AUTH(ia,0)) cannot use : after data directives due to parsing ambiguity. https://github.com/llvm/llvm-project/issues/132570

TLS symbols

When a symbol is defined in a section with the SHF_TLS flag (Thread-Local Storage), GNU assembler assigns it the type STT_TLS in the symbol table. For undefined TLS symbols, the process differs: GCC and Clang don’t emit explicit labels. Instead, assemblers identify these symbols through TLS-specific relocation specifiers in the code, deduce their thread-local nature, and set their type to STT_TLS accordingly.

// AArch64
add     x8, x8, :tprel_hi12:tls

// x86
movl    %fs:tls@TPOFF, %eax

Composed relocations

Most instructions trigger zero or one relocation, but some generate two. Often, one acts as a marker, paired with a standard relocation. For example:

PPC64 bl __tls_get_addr(x@tlsgd) pairs a marker R_PPC64_TLSGD with R_PPC64_REL24
PPC64's link-time GOT-indirect to PC-relative optimization (with Power10's prefixed instruction) generates an R_PPC64_PCREL_OPT relocation following a GOT relocation. https://reviews.llvm.org/D79864
RISC-V linker relaxation uses R_RISCV_RELAX alongside another relocation, and R_RISCV_ADD*/R_RISCV_SUB* pairs.
Mach-O scattered relocations for label differences.
XCOFF represents a label difference with a pair of R_POS and R_NEG relocations.

These marker cases tie into "composed relocations", as outlined in the Generic ABI:

If multiple consecutive relocation records are applied to the same relocation location (r_offset), they are composed instead of being applied independently, as described above. By consecutive, we mean that the relocation records are contiguous within a single relocation section. By composed, we mean that the standard application described above is modified as follows:

In all but the last relocation operation of a composed sequence, the result of the relocation expression is retained, rather than having part extracted and placed in the relocated field. The result is retained at full pointer precision of the applicable ABI processor supplement.

In all but the first relocation operation of a composed sequence, the addend used is the retained result of the previous relocation operation, rather than that implied by the relocation type.

Note that a consequence of the above rules is that the location specified by a relocation type is relevant for the first element of a composed sequence (and then only for relocation records that do not contain an explicit addend field) and for the last element, where the location determines where the relocated value will be placed. For all other relocation operands in a composed sequence, the location specified is ignored.

An ABI processor supplement may specify individual relocation types that always stop a composition sequence, or always start a new one.

Implicit addends

ELF SHT_REL and Mach-O utilize implicit addends. TODO

R_MIPS_HI16 (https://reviews.llvm.org/D101773)

GNU Assembler internals

GNU Assembler utilizes struct fix to represent both the fixup and the relocatable expression.

struct fix {
  ...
  /* NULL or Symbol whose value we add in.  */
  symbolS *fx_addsy;

  /* NULL or Symbol whose value we subtract.  */
  symbolS *fx_subsy;

  /* Absolute number we add in.  */
  valueT fx_offset;
};

The relocation specifier is part of the instruction instead of part of struct fix. Targets have different internal representations of instructions.

// gas/config/tc-aarch64.c
struct reloc
{
  bfd_reloc_code_real_type type;
  expressionS exp;
  int pc_rel;
  enum aarch64_opnd opnd;
  uint32_t flags;
  unsigned need_libopcodes_p : 1;
};

struct aarch64_instruction
{
  aarch64_inst base;
  aarch64_operand_error parsing_error;
  int cond;
  struct reloc reloc;
  unsigned gen_lit_pool : 1;
};

// gas/config/tc-ppc.c
struct ppc_fixup
 {
   expressionS exp;
   int opindex;
   bfd_reloc_code_real_type reloc;
 };

The 2002 message "stage one of gas reloc rewrite" describes the passes.

In PPC, the result of @l and @ha can be either signed or unsigned, determined by the instruction opcode.
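The arithmetic behind these specifiers is easy to state in code. The sketch below uses made-up helper names (`lo16`, `hi16`, `ha16`, `viaAddi`, `viaOri`) and covers only 32-bit values: @ha adds 0x8000 before shifting so that a sign-extending consumer of @l, such as addi, still reconstructs the original value, while a zero-extending consumer such as ori pairs with the plain high half @h.

```cpp
#include <cstdint>

// @l: low 16 bits of the value.
constexpr uint16_t lo16(uint64_t v) { return v & 0xffff; }
// @h: plain high 16 bits.
constexpr uint16_t hi16(uint64_t v) { return (v >> 16) & 0xffff; }
// @ha: high 16 bits, adjusted for a sign-extended low half.
constexpr uint16_t ha16(uint64_t v) { return ((v + 0x8000) >> 16) & 0xffff; }

// lis r, v@ha ; addi r, r, v@l   (addi sign-extends its immediate)
constexpr uint32_t viaAddi(uint32_t v) {
  return (uint32_t(ha16(v)) << 16) + uint32_t(int32_t(int16_t(lo16(v))));
}

// lis r, v@h ; ori r, r, v@l     (ori zero-extends its immediate)
constexpr uint32_t viaOri(uint32_t v) {
  return (uint32_t(hi16(v)) << 16) | lo16(v);
}
```

For 0x12348765 the low half has its top bit set, so @ha yields 0x1235 rather than 0x1234, and both instruction pairs still produce the original value.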

In md_apply_fix, TLS-related relocation specifiers call S_SET_THREAD_LOCAL (fixP->fx_addsy);.

LLVM internals

LLVM integrated assembler encodes fixups and relocatable expressions separately.

class MCFixup {
  /// The value to put into the fixup location. The exact interpretation of the
  /// expression is target dependent, usually it will be one of the operands to
  /// an instruction or an assembler directive.
  const MCExpr *Value = nullptr;

  /// The byte index of start of the relocation inside the MCFragment.
  uint32_t Offset = 0;

  /// The target dependent kind of fixup item this is. The kind is used to
  /// determine how the operand value should be encoded into the instruction.
  MCFixupKind Kind = FK_NONE;

  /// The source location which gave rise to the fixup, if any.
  SMLoc Loc;
};

LLVM encodes relocatable expressions as MCValue,

class MCValue {
  const MCSymbol *SymA = nullptr, *SymB = nullptr;
  int64_t Cst = 0;
  uint32_t Specifier = 0;
};

with:

Specifier as an optional relocation specifier (named RefKind before LLVM 21)
SymA as an optional symbol reference (addend)
SymB as an optional symbol reference (subtrahend)
Cst as a constant value

This mirrors the relocatable expression concept, but Specifier (added in 2014 for AArch64 as RefKind) remains rare among targets. (I've recently cleaned up some targets. For instance, I migrated PowerPC's @l and @ha folding to use Specifier.)
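To make the SymA - SymB + Cst shape concrete, here is a toy folding routine. It is not LLVM's code: `RelocValue` and `fold` are hypothetical names, symbols are plain strings, and the specifier is left out entirely.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Toy relocatable expression: symA - symB + cst. An empty string means the
// corresponding symbol is absent.
struct RelocValue {
  std::string symA;
  std::string symB;
  int64_t cst = 0;
  bool isAbsolute() const { return symA.empty() && symB.empty(); }
};

// Compute lhs + rhs (or lhs - rhs when `subtract` is set). Returns false if
// the result needs more than one added or one subtracted symbol and thus
// does not fit the symA - symB + cst form.
bool fold(const RelocValue &lhs, RelocValue rhs, bool subtract,
          RelocValue &out) {
  if (subtract) {
    std::swap(rhs.symA, rhs.symB);
    rhs.cst = -rhs.cst;
  }
  if ((!lhs.symA.empty() && !rhs.symA.empty()) ||
      (!lhs.symB.empty() && !rhs.symB.empty()))
    return false;
  out.symA = lhs.symA.empty() ? rhs.symA : lhs.symA;
  out.symB = lhs.symB.empty() ? rhs.symB : lhs.symB;
  out.cst = lhs.cst + rhs.cst;
  // The same symbol on both sides cancels, leaving an absolute value.
  if (!out.symA.empty() && out.symA == out.symB) {
    out.symA.clear();
    out.symB.clear();
  }
  return true;
}
```

In this model `a+4 - (b+1)` folds to symA=a, symB=b, cst=3, whereas `a + b` is rejected because it would need two added symbols.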

AArch64 implements a clean approach to select the relocation type. It dispatches on the fixup kind (an operand within a specific instruction format), then refines it with the relocation specifier.

// AArch64ELFObjectWriter::getRelocType
unsigned Kind = Fixup.getTargetKind();
switch (Kind) {
// Handle generic MCFixupKind.
case FK_Data_1:
case FK_Data_2:
  ...

// Handle target-specific MCFixupKind.
case AArch64::fixup_aarch64_add_imm12:
  if (RefKind == AArch64::S_DTPREL_HI12)
    return R_CLS(TLSLD_ADD_DTPREL_HI12);
  if (RefKind == AArch64::S_TPREL_HI12)
    return R_CLS(TLSLE_ADD_TPREL_HI12);
  ...
}
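The same two-level dispatch can be written as a self-contained toy. The enums below are hypothetical stand-ins for LLVM's fixup kinds and specifiers, and the relocation enumerators only echo AArch64 relocation names without their numeric values. Returning `None` stands in for the error a real object writer would report.

```cpp
// Hypothetical, simplified stand-ins for the target's fixup kinds,
// relocation specifiers, and relocation types.
enum class FixupKind { Data8, AddImm12 };
enum class Specifier { None, Lo12, DtprelHi12, TprelHi12 };
enum class RelocType {
  None,
  ABS64,
  ADD_ABS_LO12_NC,
  TLSLD_ADD_DTPREL_HI12,
  TLSLE_ADD_TPREL_HI12,
};

// First dispatch on the fixup kind (which operand of which instruction
// format), then refine the choice with the relocation specifier.
RelocType getRelocType(FixupKind kind, Specifier spec) {
  switch (kind) {
  case FixupKind::Data8:
    return spec == Specifier::None ? RelocType::ABS64 : RelocType::None;
  case FixupKind::AddImm12:
    switch (spec) {
    case Specifier::Lo12:
      return RelocType::ADD_ABS_LO12_NC;
    case Specifier::DtprelHi12:
      return RelocType::TLSLD_ADD_DTPREL_HI12;
    case Specifier::TprelHi12:
      return RelocType::TLSLE_ADD_TPREL_HI12;
    default:
      return RelocType::None; // unsupported combination
    }
  }
  return RelocType::None;
}
```

The outer switch keeps each instruction format's valid specifiers in one place, which makes invalid combinations easy to spot and reject.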

MCAssembler::evaluateFixup and ELFObjectWriter::recordRelocation record a relocation.

// MCAssembler::evaluateFixup
Evaluate `const MCExpr *Fixup::Value` to a relocatable expression.
Determine the fixup value. Adjust the value if FKF_IsPCRel.
If the relocatable expression is a constant, treat this fixup as resolved.

if (IsResolved && is_reloc_directive)
  IsResolved = false;
Backend.applyFixup(...)



// applyFixup
if (...)
  IsResolved = false;
if (!IsResolved) {
  // For exposition I've inlined ELFObjectWriter::recordRelocation here.
  // the function roughly maps to GNU Assembler's `md_apply_fix` and `tc_gen_reloc`,
  Type = TargetObjectWriter->getRelocType(Ctx, Target, Fixup, IsPCRel)
  Determine whether SymA can be converted to a section symbol.
  Relocations.push_back(...)
}
// Write a value to the relocated location. When using relocations with explicit addends, the function is a no-op when `IsResolved` is true.

FKF_IsPCRel applies to fixups whose relocation operations look like S - P + A, like branches and PC-relative operations, but not to GOT-related operations (e.g., GDAT - P + A).
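A rough model of the resolved/unresolved decision for S - P + A is shown below. The `Symbol`, `Fixup`, and `Evaluated` structs and the `evaluateFixup` function are simplified, hypothetical stand-ins, and the model deliberately ignores cases where a real assembler keeps a relocation anyway (for example preemptible symbols or linker relaxation).

```cpp
#include <cstdint>

struct Symbol {
  int section; // negative: undefined symbol
  uint64_t offset;
};

struct Fixup {
  int section;       // section containing the fixup
  uint64_t offset;   // P, the offset of the fixup within that section
  const Symbol *sym; // S, or nullptr for a plain constant
  int64_t addend;    // A
  bool isPCRel;
};

struct Evaluated {
  bool resolved; // true: no relocation is needed
  int64_t value; // fixup value, or the addend to record in the relocation
};

Evaluated evaluateFixup(const Fixup &f) {
  // A plain constant is resolved unless it must be made PC-relative.
  if (!f.sym)
    return {!f.isPCRel, f.addend};
  // S - P + A is a constant when S and P live in the same section, since
  // the unknown section address cancels out.
  if (f.isPCRel && f.sym->section == f.section)
    return {true, int64_t(f.sym->offset) - int64_t(f.offset) + f.addend};
  // Otherwise the linker has to finish the computation.
  return {false, f.addend};
}
```

A branch to a label 0x30 bytes ahead in the same section with addend -4 resolves to 0x2c, while the same branch to a symbol in another section stays unresolved.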

`MCSymbolRefExpr` issues

The expression structure follows a traditional object-oriented hierarchy:

MCExpr
  MCConstantExpr: Value
  MCSymbolRefExpr: VariantKind, Symbol
  MCUnaryExpr: Op, Expr
  MCBinaryExpr: Op, LHS, RHS
  MCTargetExpr:
    X86MCExpr: x86 register
  MCSpecifierExpr: expression with a relocation specifier

MCSymbolRefExpr::VariantKind enumerates the relocation specifiers, but it's a poor fit:

Other expressions, like MCConstantExpr (e.g., PPC 4@l) and MCBinaryExpr (e.g., PPC (a+1)@l), also need it.
Semantics blur when folding expressions with @, which is unavoidable when @ can occur at any position within the full expression.
The generic MCSymbolRefExpr lacks target-specific hooks, cluttering the interface with any target-specific logic.

Consider what happens with addition or subtraction:

MCBinaryExpr
  LHS(MCSymbolRefExpr): VariantKind, SymA
  RHS(MCSymbolRefExpr): SymB

Here, the specifier attaches only to the LHS, leaving the full result uncovered. This awkward design demands workarounds.

Parsing a+4@got exposes clumsiness. After AsmParser::parseExpression processes a+4, it detects @got and retrofits it onto MCSymbolRefExpr(a), which feels hacked together.
PowerPC's @l @ha optimization needs PPCAsmParser::extractSpecifier and PPCAsmParser::applySpecifier to convert a MCSymbolRefExpr to a MCSpecifierExpr.

Worse, the abstraction leaks: MCSymbolRefExpr is accessed widely in backend code, which introduces another problem. While an MCBinaryExpr with a constant RHS mimics MCSymbolRefExpr semantically, code often handles only the latter.

MCFixup should store MCValue instead of MCExpr

The const MCExpr *MCFixup::getValue() method feels inconvenient and less elegant compared to GNU Assembler's unified fixup/relocatable expression for these reasons:

Relocation specifier can be encoded by every sub-expression in the MCExpr tree, rather than the fixup itself (or the instruction, as in GNU Assembler). Supporting all of a+4@got, a@got+4, (a+4)@got requires extensive hacks in LLVM MCParser.
evaluateAsRelocatable converts an MCExpr to an MCValue without updating the MCExpr itself. This leads to redundant evaluations, as MCAssembler::evaluateFixup is called multiple times, such as in MCAssembler::fixupNeedsRelaxation and MCAssembler::layout.

Storing a MCValue directly in MCFixup, or adding a relocation specifier member, could eliminate the need for many target-specific MCTargetFixup classes that manage relocation specifiers. However, target-specific evaluation hooks would still be needed for specifiers like PowerPC @l or RISC-V %lo().

Computing label differences will be simplified as we can utilize SymA and SymB.

Our long-term goal is to encode the relocation specifier within MCFixup. (https://github.com/llvm/llvm-project/issues/135592)

MCSymbolRefExpr::VariantKind as the legacy way to encode relocations should be completely removed (probably in a distant future as many cleanups are required).

AsmParser: `expr@specifier`

In LLVM's assembly parser library (LLVMMCParser), the parsing of expr@specifier was supported for all targets until I updated it to be an opt-in feature in March 2025.

AsmParser's @specifier parsing is suboptimal, necessitating lexer workarounds.

The @ symbol can appear after a symbol or an expression (via parseExpression) and may occur multiple times within a single operand, making it challenging to validate and reject invalid cases.

In the GNU Assembler, COFF targets permit @ within identifier names, and MinGW supports constructs like .long ext24@secrel32. It appears that a recognized suffix is treated as a specifier, while an unrecognized suffix results in a symbol that includes the @.
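The apparent rule (recognized suffix becomes a specifier, anything else stays part of the symbol name) can be modeled as follows. This is only a sketch of the observed behavior, not GNU Assembler's code: `ParsedSymbol`, `splitSpecifier`, and the one-entry list of known suffixes are assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

struct ParsedSymbol {
  std::string name;      // symbol name, possibly still containing '@'
  std::string specifier; // empty when no recognized suffix was found
};

// Split "sym@suffix" only when the suffix is a known specifier. Otherwise
// the whole token, '@' included, is treated as the symbol name.
ParsedSymbol splitSpecifier(const std::string &token) {
  // Assumed list for illustration; a real assembler has its own table.
  static const char *const known[] = {"secrel32"};
  std::size_t at = token.rfind('@');
  if (at != std::string::npos) {
    std::string suffix = token.substr(at + 1);
    for (const char *k : known)
      if (suffix == k)
        return {token.substr(0, at), suffix};
  }
  return {token, ""};
}
```

Under this model `ext24@secrel32` yields the symbol `ext24` with specifier `secrel32`, while `foo@bar` remains a single symbol named `foo@bar`.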

The PowerPC AsmParser (llvm/lib/Target/PowerPC/AsmParser/PPCAsmParser.cpp) parses an operand and then calls PPCAsmParser::extractSpecifier to extract the optional @ specifier. When the @ specifier is detected and removed, it generates a PPCMCExpr. This functionality is currently implemented for @l and @ha, and it would be beneficial to extend this to include all specifiers.

AsmPrinter

In llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp, AsmPrinter::lowerConstant outlines how LLVM handles the emission of a global variable initializer. When processing ConstantExpr elements, this function may generate data directives in the assembly code that involve differences between symbols.

One significant use case for this intricate code is clang++ -fexperimental-relative-c++-abi-vtables. This feature produces a PC-relative relocation that points to either the PLT (Procedure Linkage Table) entry of a function or the function symbol directly.

Compiling C++ with the Clang API

MaskRay

March 9, 2025, 16:00

This post describes how to compile a single C++ source file to an object file with the Clang API. Here is the code. It behaves like a simplified clang executable that handles -c and -S.

cat > main.cc <<eof

#include <clang/CodeGen/CodeGenAction.h> // EmitObjAction
#include <clang/Driver/Compilation.h>
#include <clang/Driver/Driver.h>
#include <clang/Frontend/CompilerInstance.h>
#include <clang/Frontend/FrontendOptions.h>
#include <llvm/Config/llvm-config.h>   // LLVM_VERSION_MAJOR
#include <llvm/Support/TargetSelect.h> // LLVMInitialize*
#include <llvm/Support/VirtualFileSystem.h>

using namespace clang;

constexpr llvm::StringRef kTargetTriple = "x86_64-unknown-linux-gnu";

namespace {
struct DiagsSaver : DiagnosticConsumer {
  std::string message;
  llvm::raw_string_ostream os{message};

  void HandleDiagnostic(DiagnosticsEngine::Level diagLevel, const Diagnostic &info) override {
    DiagnosticConsumer::HandleDiagnostic(diagLevel, info);
    const char *level;
    switch (diagLevel) {
    default:
      return;
    case DiagnosticsEngine::Note:
      level = "note";
      break;
    case DiagnosticsEngine::Warning:
      level = "warning";
      break;
    case DiagnosticsEngine::Error:
    case DiagnosticsEngine::Fatal:
      level = "error";
      break;
    }

    llvm::SmallString<256> msg;
    info.FormatDiagnostic(msg);
    auto &sm = info.getSourceManager();
    auto loc = info.getLocation();
    auto fileLoc = sm.getFileLoc(loc);
    os << sm.getFilename(fileLoc) << ':' << sm.getSpellingLineNumber(fileLoc)
       << ':' << sm.getSpellingColumnNumber(fileLoc) << ": " << level << ": "
       << msg << '\n';
    if (loc.isMacroID()) {
      loc = sm.getSpellingLoc(loc);
      os << sm.getFilename(loc) << ':' << sm.getSpellingLineNumber(loc) << ':'
         << sm.getSpellingColumnNumber(loc) << ": note: expanded from macro\n";
    }
  }
};
}

static std::pair<bool, std::string> compile(int argc, char *argv[]) {
  auto fs = llvm::vfs::getRealFileSystem();
  DiagsSaver dc;
  std::vector<const char *> args{"clang"};
  args.insert(args.end(), argv + 1, argv + argc);
  auto diags = CompilerInstance::createDiagnostics(
#if LLVM_VERSION_MAJOR >= 20
      *fs,
#endif
      new DiagnosticOptions, &dc, false);
  driver::Driver d(args[0], kTargetTriple, *diags, "cc", fs);
  d.setCheckInputsExist(false);
  std::unique_ptr<driver::Compilation> comp(d.BuildCompilation(args));
  const auto &jobs = comp->getJobs();
  if (jobs.size() != 1)
    return {false, "only support one job"};
  const llvm::opt::ArgStringList &ccArgs = jobs.begin()->getArguments();

  auto invoc = std::make_unique<CompilerInvocation>();
  CompilerInvocation::CreateFromArgs(*invoc, ccArgs, *diags);
  auto ci = std::make_unique<CompilerInstance>();
  ci->setInvocation(std::move(invoc));
  ci->createDiagnostics(*fs, &dc, false);
  // Disable CompilerInstance::printDiagnosticStats, which might display "2 warnings generated."
  ci->getDiagnostics().getDiagnosticOptions().ShowCarets = false;
  ci->createFileManager(fs);
  ci->createSourceManager(ci->getFileManager());

  // Clang calls BuryPointer on the internal AST and CodeGen-related elements like TargetMachine.
  // This will cause memory leaks if `compile` is executed many times.
  ci->getCodeGenOpts().DisableFree = false;
  ci->getFrontendOpts().DisableFree = false;

  LLVMInitializeX86AsmParser();
  LLVMInitializeX86AsmPrinter();
  LLVMInitializeX86Target();
  LLVMInitializeX86TargetInfo();
  LLVMInitializeX86TargetMC();

  switch (ci->getFrontendOpts().ProgramAction) {
  case frontend::ActionKind::EmitObj: {
    EmitObjAction action;
    ci->ExecuteAction(action);
  } break;
  case frontend::ActionKind::EmitAssembly: {
    EmitAssemblyAction action;
    ci->ExecuteAction(action);
  } break;
  default:
    return {false, "unhandled action"};
  }
  return {true, std::move(dc.message)};
}

int main(int argc, char *argv[]) {
  auto [ok, err] = compile(argc, argv);
  llvm::errs() << err;
  // Report failure to the caller through the exit status.
  return ok ? 0 : 1;
}

eof

Building the code with CMake

Let's write a CMakeLists.txt that links against the needed Clang and LLVM libraries.

cat > CMakeLists.txt <<'eof'
cmake_minimum_required(VERSION 3.16)
project(cc)
find_package(LLVM REQUIRED CONFIG)
find_package(Clang REQUIRED CONFIG)

include_directories(${LLVM_INCLUDE_DIRS} ${CLANG_INCLUDE_DIRS})
add_executable(cc main.cc)

if(NOT LLVM_ENABLE_RTTI)
  target_compile_options(cc PRIVATE -fno-rtti)
endif()

if(CLANG_LINK_CLANG_DYLIB)
  target_link_libraries(cc PRIVATE clang-cpp)
else()
  target_link_libraries(cc PRIVATE
    clangAST
    clangBasic
    clangCodeGen
    clangDriver
    clangFrontend
    clangLex
    clangParse
    clangSema
  )
endif()

if(LLVM_LINK_LLVM_DYLIB)
  target_link_libraries(cc PRIVATE LLVM)
else()
  target_link_libraries(cc PRIVATE LLVMOption LLVMSupport LLVMTarget
    LLVMX86AsmParser LLVMX86CodeGen LLVMX86Desc LLVMX86Info)
endif()
eof

We need an LLVM and Clang installation that provides both lib/cmake/llvm/LLVMConfig.cmake and lib/cmake/clang/ClangConfig.cmake. You can grab these from system packages (dev versions may be required) or build LLVM yourself; I'll skip the detailed steps here. For a DIY build, use:

# cmake ... -DLLVM_ENABLE_PROJECTS='clang'

ninja -C out/stable clang-cmake-exports clang

No install step is needed. Next, create a build directory with the CMake configuration above:

cmake -S. -Bout/debug -G Ninja -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_COMPILER=$HOME/Stable/bin/clang++ -DCMAKE_PREFIX_PATH="$HOME/llvm/out/stable"
ninja -C out/debug

I've set a prebuilt Clang as CMAKE_CXX_COMPILER, just a habit of mine. llvm-project isn't guaranteed to build warning-free with GCC, since GCC -Wall -Wextra has many false positives and LLVM developers avoid cluttering the codebase.

% echo 'void f() {}' > a.cc
% out/debug/cc -S a.cc && head -n 5 a.s
        .file   "a.cc"
        .text
        .globl  _Z1fv                           # -- Begin function _Z1fv
        .p2align        4
        .type   _Z1fv,@function
% out/debug/cc -c a.cc && file a.o
a.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped

Anonymous files

The input source file and the output ELF file are stored in the filesystem. We could create a temporary file and delete it with a RAII class llvm::FileRemover:

int fdIn;
llvm::SmallString<128> tempPath;
std::error_code ec = llvm::sys::fs::createTemporaryFile("clang", "cc", fdIn, tempPath);
llvm::raw_fd_stream osIn(fdIn, /*ShouldClose=*/true);
llvm::FileRemover remover(tempPath);

On Linux, we could utilize memfd_create to create a file in RAM with a volatile backing storage.

int fdIn = memfd_create("input", MFD_CLOEXEC);
if (fdIn < 0)
  return {"", "failed to create input memfd"};
int fdOut = memfd_create("output", MFD_CLOEXEC);
if (fdOut < 0) {
  close(fdIn);
  return {"", "failed to create output memfd"};
}

std::string pathIn = "/proc/self/fd/" + std::to_string(fdIn);
std::string pathOut = "/proc/self/fd/" + std::to_string(fdOut);

// clang -c -xc++ /proc/self/fd/3 -o /proc/self/fd/4
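To see that such a /proc/self/fd path behaves like an ordinary file name, here is a small Linux-only round trip that does not involve Clang at all. The function name `roundTripThroughMemfd` is made up for this sketch, and it assumes glibc 2.27 or newer and a mounted /proc.

```cpp
#include <sys/mman.h> // memfd_create (Linux, glibc 2.27 or newer)
#include <unistd.h>   // write, close

#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Store `source` in an anonymous in-memory file, then read it back through
// its /proc/self/fd path, the same kind of path a compiler invocation would
// be given. Returns an empty string on failure.
std::string roundTripThroughMemfd(const std::string &source) {
  int fd = memfd_create("input", MFD_CLOEXEC);
  if (fd < 0)
    return "";
  for (std::size_t off = 0; off < source.size();) {
    ssize_t n = write(fd, source.data() + off, source.size() - off);
    if (n <= 0) {
      close(fd);
      return "";
    }
    off += static_cast<std::size_t>(n);
  }
  // Opening the /proc path creates a fresh file description at offset 0,
  // so the reader sees the whole contents regardless of fd's position.
  std::ifstream in("/proc/self/fd/" + std::to_string(fd), std::ios::binary);
  std::string contents((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
  close(fd);
  return contents;
}
```

The output side works the same way in reverse: after the compiler has written to the output path, the caller can seek the descriptor back to the start and read the object file from memory.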

`LLVMInitialize*`

To generate x86 code, we need a few LLVM X86 libraries defined by llvm/lib/Target/X86/**/CMakeLists.txt files.

LLVMInitializeX86AsmPrinter();
LLVMInitializeX86Target();
LLVMInitializeX86TargetInfo();
LLVMInitializeX86TargetMC();

If inline assembly is used, we will also need the AsmParser library:

LLVMInitializeX86AsmParser();

We could also call LLVMInitializeAll* functions instead, which initialize all supported targets (build-time LLVM_TARGETS_TO_BUILD).

Here are some notes about the LLVM X86 libraries:

LLVMX86Info: llvm/lib/Target/X86/TargetInfo/
LLVMX86Desc: llvm/lib/Target/X86/MCTargetDesc/ (depends on LLVMX86Info)
LLVMX86AsmParser: llvm/lib/Target/X86/AsmParser (depends on LLVMX86Info and LLVMX86Desc)
LLVMX86CodeGen: llvm/lib/Target/X86/ (depends on LLVMX86Info and LLVMX86Desc)

`EmitAssembly` and `EmitObj`

The code supports two frontend actions, EmitAssembly (-S) and EmitObj (-c).

You could also utilize the API in clang/include/clang/FrontendTool/Utils.h, but that would pull in another library clangFrontendTool (different from clangFrontend).

Diagnostics

The diagnostics system is quite complex. We have DiagnosticConsumer, DiagnosticsEngine, and DiagnosticOptions.

DiagnosticsEngine
├─ DiagnosticIDs (defines diagnostics)
├─ SourceManager (provides locations)
├─ DiagnosticOptions (configures output)
└─ DiagnosticConsumer (handles output)
   └─ Diagnostic (individual message)

We define a simple DiagnosticConsumer that handles notes, warnings, errors, and fatal errors. When macro expansion comes into play, we report two key locations:

The physical location (fileLoc), where the expanded token triggers an issue, matching Clang's error line, and
The spelling location within the macro's replacement list (sm.getSpellingLoc(loc)).

Although Clang also highlights intermediate locations for chained expansions, our simple approach offers a solid approximation.

% cat a.h
#define FOO(x) x + 1
% cat a.cc
#include "a.h"
#define BAR FOO
void f() {
  int y = BAR("abc");
}
% out/debug/cc -c -Wall a.cc
a.cc:4:11: warning: adding 'int' to a string does not append to the string
./a.h:1:18: note: expanded from macro
a.cc:4:11: note: use array indexing to silence this warning
./a.h:1:18: note: expanded from macro
a.cc:4:7: error: cannot initialize a variable of type 'int' with an rvalue of type 'const char *'
% clang -c -Wall a.cc
a.cc:4:11: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
    4 |   int y = BAR("abc");
      |           ^~~~~~~~~~
a.cc:2:13: note: expanded from macro 'BAR'
    2 | #define BAR FOO
      |             ^
./a.h:1:18: note: expanded from macro 'FOO'
    1 | #define FOO(x) x + 1
      |                ~~^~~
a.cc:4:11: note: use array indexing to silence this warning
a.cc:2:13: note: expanded from macro 'BAR'
    2 | #define BAR FOO
      |             ^
./a.h:1:18: note: expanded from macro 'FOO'
    1 | #define FOO(x) x + 1
      |                  ^
a.cc:4:7: error: cannot initialize a variable of type 'int' with an rvalue of type 'const char *'
    4 |   int y = BAR("abc");
      |       ^   ~~~~~~~~~~
1 warning and 1 error generated.

We call a convenience function CompilerInstance::ExecuteAction, which wraps lower-level APIs like BeginSource, Execute, and EndSource. However, it will print "1 warning and 1 error generated." unless we set ShowCarets to false.

`clang::createInvocation`

clang::createInvocation, renamed from createInvocationFromCommandLine in 2022, combines clang::Driver::BuildCompilation and clang::CompilerInvocation::CreateFromArgs. While it saves a few lines for certain tasks, it lacks the flexibility we need for our specific use cases.

Migrating comments to giscus

MaskRay

February 17, 2025, 16:00

Followed this guide: https://www.patrickthurmond.com/blog/2023/12/11/commenting-is-available-now-thanks-to-giscus

Add the following to layout/_partial/article.ejs

<% if (!index && post.comments) { %>
<section class="giscus"></section>
<script src="https://giscus.app/client.js"
 data-repo="MaskRay/maskray.me"
 data-repo-id="FILL IT UP"
 data-category="Blog Post Comments"
 data-category-id="FILL IT UP"
 data-mapping="pathname"
 data-strict="0"
 data-reactions-enabled="1"
 data-emit-metadata="0"
 data-input-position="bottom"
 data-theme="preferred_color_scheme"
 data-lang="en"
 data-loading="lazy"
 crossorigin="anonymous"
 async>
</script>
<% } %>

Unfortunately, comments from Disqus have not been migrated yet. If you've left comments in the past, thank you. Apologies that they are now gone.

While you can create GitHub Discussions via the GraphQL API, I haven't found a solution that works out of the box. https://www.davidangulo.xyz/posts/dirty-ruby-script-to-migrate-comments-from-disqus-to-giscus/ provides a Ruby solution, which is promising but no longer works.

Failed to define value method for :name, because EnterpriseOrderField already responds to that method. Use `value_method:` to override the method name or `value_method: false` to disable Enum value method generation.
Failed to define value method for :name, because EnvironmentOrderField already responds to that method. Use `value_method:` to override the method name or `value_method: false` to disable Enum value method generation.
Failed to define value method for :name, because LabelOrderField already responds to that method. Use `value_method:` to override the method name or `value_method: false` to disable Enum value method generation.
...
.local/share/gem/ruby/3.3.0/gems/graphql-client-0.25.0/lib/graphql/client.rb:338:in `query': wrong number of arguments (given 2, expected 1) (ArgumentError)
        from g.rb:42:in `create_discussion'

lld 20 ELF changes

MaskRay

February 2, 2025, 16:00

LLVM 20 will be released. As usual, I maintain lld/ELF and have added some notes to https://github.com/llvm/llvm-project/blob/release/20.x/lld/docs/ReleaseNotes.rst. I've meticulously reviewed nearly all the patches that are not authored by me. I'll delve into some of the key changes.

-z nosectionheader has been implemented to omit the section header table. The operation is similar to llvm-objcopy --strip-sections. (#101286)
--randomize-section-padding=<seed> is introduced to insert random padding between input sections and at the start of each segment. This can be used to control measurement bias in A/B experiments. (#117653)
The reproduce tarball created with --reproduce= now excludes directories specified in the --dependency-file argument (used by Ninja). This resolves an error where non-existent directories could cause issues when invoking ld.lld @response.txt.
--symbol-ordering-file= and call graph profile can now be used together.
When --call-graph-ordering-file= is specified, .llvm.call-graph-profile sections in relocatable files are no longer used.
--lto-basic-block-sections=labels is deprecated in favor of --lto-basic-block-address-map. (#110697)
In non-relocatable links, a .note.GNU-stack section with the SHF_EXECINSTR flag is now rejected unless -z execstack is specified. (#124068)
In relocatable links, the sh_entsize member of a SHF_MERGE section with relocations is now respected in the output.
Quoted names can now be used in output section phdr, memory region names, OVERLAY, the LHS of --defsym, and INSERT AFTER.
Section CLASS linker script syntax binds input sections to named classes, which are referenced later one or more times. This provides access to the automatic spilling mechanism of --enable-non-contiguous-regions without globally changing the semantics of section matching. It also independently increases the expressive power of linker scripts. (#95323)
INCLUDE cycle detection has been fixed. A linker script can now be included twice.
The archivename: syntax when matching input sections is now supported. (#119293)
To support Arm v6-M, short thunks using B.w are no longer generated. (#118111)
For AArch64, BTI-aware long branch thunks can now be created to a destination function without a BTI instruction. (#108989) (#116402)
Relocations related to GOT and TLSDESC for the AArch64 Pointer Authentication ABI are now supported.
Supported relocation types for x86-64 target:
- R_X86_64_CODE_4_GOTPCRELX (#109783) (#116737)
- R_X86_64_CODE_4_GOTTPOFF (#116634)
- R_X86_64_CODE_4_GOTPC32_TLSDESC (#116909)
- R_X86_64_CODE_6_GOTTPOFF (#117675)
Supported relocation types for LoongArch target: R_LARCH_TLS_{LD,GD,DESC}_PCREL20_S2. (#100105)

Linker scripts

The CLASS keyword, which separates section matching and referring, is a noteworthy addition to the linker script support. Here is the GNU ld feature request.

Section layout

If --symbol-ordering-file= is specified, the sections it names are placed first. In LLD 20, SHT_LLVM_CALL_GRAPH_PROFILE sections in relocatable files are still used for the other sections.

The next release will support options --bp-compression-sort=both and --bp-startup-sort=function --irpgo-profile=a.profdata that improve Lempel-Ziv compression and reduce page faults during program startup for mobile applications.

`.dynsym` computation

The purpose of Symbol::includeInDynsym was somewhat ambiguous, as it was used both to determine if a symbol should be exported to .dynsym and to conservatively suppress transformations in other contexts like MarkLive and ICF. LLD 20 clarifies this by introducing Symbol::isExported specifically for indicating whether a defined symbol should be exported. All previous uses of Symbol::includeInDynsym have been updated to use Symbol::isExported instead. The old confusing Symbol::exportDynamic has been removed.

A special case within Symbol::includeInDynsym checked for isUndefWeak() && ctx.arg.noDynamicLinker. (This could be generalized to isUndefined() && ctx.arg.noDynamicLinker, as non-weak undefined symbols led to errors. Nonetheless, noDynamicLinker has been removed to improve consistency.) This condition ensures that undefined symbols are not included in .dynsym for statically linked ET_DYN executables (created with clang -static-pie).

This condition has been generalized in LLD 20 to (ctx.arg.shared || !ctx.sharedFiles.empty()) && (sym->isUndefined() || sym->isExported). This means undefined symbols are excluded from .dynsym in both ld.lld -pie a.o and ld.lld -pie --no-dynamic-linker a.o, but not ld.lld -pie a.o b.so. This change brings LLD's behavior more in line with GNU ld.
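The generalized condition is small enough to restate as a standalone predicate. The structs and the function name `passesDynsymCondition` below are hypothetical simplifications of LLD's context and symbol types, and the sketch models only this one condition, not the later isPreemptible logic.

```cpp
// Hypothetical, flattened view of the link configuration and of a symbol.
struct LinkConfig {
  bool shared;         // -shared
  bool hasSharedFiles; // at least one shared object among the inputs
};

struct Sym {
  bool undefined;
  bool exported;
};

// Mirrors (ctx.arg.shared || !ctx.sharedFiles.empty()) &&
//         (sym->isUndefined() || sym->isExported).
bool passesDynsymCondition(const LinkConfig &c, const Sym &s) {
  return (c.shared || c.hasSharedFiles) && (s.undefined || s.exported);
}
```

With no shared object on the command line (`ld.lld -pie a.o`), the first factor is false and an undefined symbol is excluded; adding `b.so` flips that factor and the same symbol is included.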

Symbol::isPreemptible, indicating whether a symbol could be bound to another component, was calculated before relocation scanning and, in LLD 19, also during Identical Code Folding (ICF). In LLD 20, the ICF-related calculation has been moved to the symbol versioning parsing stage.

In LLD 20, isExported and isPreemptible are computed in the following passes.

Scan input files, interleaved with symbol resolution: set isExported when defined or referenced by shared objects
Clear isExported if influenced by --exclude-libs
parseVersionAndComputeIsPreemptible
- Clear isExported if localized due to hidden visibility.
- For undefined symbols, compute isPreemptible
- For defined symbols in relocatable files, or bitcode files when !ltoCanOmit, set isExported and compute isPreemptible
compileBitcodeFiles
Scan LTO compiled relocatable files
Clear isExported if influenced by --exclude-libs
finalizeSections: recompute isPreemptible

isPreemptible and isExported determine whether a symbol should be exported to .dynsym.

for (Symbol *sym : ctx.symtab->getSymbols()) {
  if (!sym->isUsedInRegularObj || !includeInSymtab(ctx, *sym))
    continue;
  if (!ctx.arg.relocatable)
    sym->binding = sym->computeBinding(ctx);
  if (ctx.in.symTab)
    ctx.in.symTab->addSymbol(sym);

  // computeBinding might localize a linker-synthesized hidden symbol
  // that was considered exported.
  if ((sym->isExported || sym->isPreemptible) && !sym->isLocal()) {
    ctx.partitions[sym->partition - 1].dynSymTab->addSymbol(sym);
    if (auto *file = dyn_cast<SharedFile>(sym->file))
      if (file->isNeeded && !sym->isUndefined())
        addVerneed(ctx, *sym);
  }
}

Link: lld 19 ELF changes

Natural loops

MaskRay

January 20, 2025, 13:00

A dominator tree can be used to compute natural loops.

For every node H in a post-order traversal of the dominator tree (or the original CFG), find all predecessors that are dominated by H. This identifies all back edges.
Each back edge T->H identifies a natural loop with H as the header.
- Perform a flood fill starting from T in the reversed CFG (from exiting block to header)
- All visited nodes reachable from the root belong to the natural loop associated with the back edge. These nodes are guaranteed to be reachable from H due to the dominator property.
- Visited nodes unreachable from the root should be ignored.
- Loops associated with visited nodes are considered subloops.

Here is a C++ implementation:

#include <cstdio>
#include <deque>
#include <numeric>
#include <vector>
using namespace std;

vector<vector<int>> e, ee, edom;
vector<int> dfn, dfn2, rdfn, uf, best, sdom, idom;
int tick;

void dfs(int u) {
  dfn[u] = tick;
  rdfn[tick++] = u;
  for (int v : e[u])
    if (dfn[v] < 0) {
      uf[v] = u;
      dfs(v);
    }
}

int eval(int v, int cur) {
  if (dfn[v] <= cur)
    return v;
  int u = uf[v], r = eval(u, cur);
  if (dfn[best[u]] < dfn[best[v]])
    best[v] = best[u];
  return uf[v] = r;
}

void semiNca(int n, int r) {
  idom.assign(n, -1);
  dfn.assign(n, -1);
  rdfn.resize(n); // initial values are unused
  uf.resize(n); // initial values are unused
  sdom.resize(n); // initial values are unused
  tick = 0;
  dfs(r);
  best.resize(n);
  iota(best.begin(), best.end(), 0);
  for (int i = tick; --i; ) {
    int v = rdfn[i];
    sdom[v] = v;
    for (int u : ee[v])
      if (~dfn[u]) {
        eval(u, i);
        if (dfn[best[u]] < dfn[sdom[v]])
          sdom[v] = best[u];
      }
    best[v] = sdom[v];
    idom[v] = uf[v];
  }
  edom.assign(n, vector<int>());
  for (int i = 1; i < tick; i++) {
    int v = rdfn[i];
    while (dfn[idom[v]] > dfn[sdom[v]])
      idom[v] = idom[idom[v]];
    edom[idom[v]].push_back(v);
  }
}

struct Loop {
  int idx, header;
  Loop *parent = nullptr, *child = nullptr, *next = nullptr;
  vector<int> nodes;
};
deque<Loop> loops;

void postorder(int u) {
  dfn[u] = tick;
  for (int v : edom[u])
    if (dfn[v] < 0)
      postorder(v);
  rdfn[tick++] = u;
  dfn2[u] = tick;
}

void identifyLoops(int n, int r) {
  vector<int> worklist;
  vector<Loop *> to_loop(n);
  dfn.assign(n, -1);
  dfn2.assign(n, -1);
  tick = 0;
  postorder(r);
  loops.clear();
  for (int i = 0; i < tick; i++) {
    int header = rdfn[i];
    for (int u : ee[header])
      if (dfn[header] <= dfn[u] && dfn2[u] <= dfn2[header])
        worklist.push_back(u);
    if (worklist.empty())
      continue;
    loops.push_back(Loop{(int)loops.size(), header});
    Loop *lp = &loops.back();
    while (worklist.size()) {
      int v = worklist.back();
      worklist.pop_back();
      if (!to_loop[v]) {
        if (dfn[v] < 0) // Skip unreachable node
          continue;
        // Find a node not in a loop.
        to_loop[v] = lp;
        lp->nodes.push_back(v);
        if (v == header)
          continue;
        for (int u : ee[v])
          worklist.push_back(u);
      } else {
        // Find a subloop.
        Loop *sub = to_loop[v];
        while (sub->parent)
          sub = sub->parent;
        if (sub == lp)
          continue;
        sub->parent = lp;
        sub->next = lp->child;
        lp->child = sub;
        for (int u : ee[sub->header])
          if (to_loop[u] != sub)
            worklist.push_back(u);
      }
    }
  }
}

int main() {
  int n, m;
  scanf("%d%d", &n, &m);
  e.resize(n);
  ee.resize(n);
  for (int i = 0; i < m; i++) {
    int u, v;
    scanf("%d%d", &u, &v);
    e[u].push_back(v);
    ee[v].push_back(u);
  }
  semiNca(n, 0);
  for (int i = 0; i < n; i++)
    printf("%d: %d\n", i, idom[i]);

  identifyLoops(n, 0);
  for (Loop &lp : loops) {
    printf("loop %d:", lp.idx);
    for (int v : lp.nodes)
      printf(" %d", v);
    for (Loop *c = lp.child; c; c = c->next)
      printf(" (loop %d)", c->idx);
    puts("");
  }
}

The code iterates over the dominator tree in post-order. Alternatively, a post-order traversal of the original control flow graph could be used.
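The "dominated by the header" test in identifyLoops relies on entry and exit timestamps from a DFS over the dominator tree. The standalone sketch below isolates that idea with a slightly different numbering (the entry counter advances on entry rather than on exit). `DominanceQuery` and its members are names invented for this example.

```cpp
#include <cassert>
#include <vector>

// Answers "does a dominate b?" in O(1) after one DFS over the dominator
// tree. `children[u]` lists the nodes immediately dominated by u.
class DominanceQuery {
  std::vector<int> in, out; // entry and exit timestamps, -1 if unreachable
  int tick = 0;

  void visit(const std::vector<std::vector<int>> &children, int u) {
    in[u] = tick++;
    for (int v : children[u])
      visit(children, v);
    out[u] = tick;
  }

public:
  DominanceQuery(const std::vector<std::vector<int>> &children, int root)
      : in(children.size(), -1), out(children.size(), -1) {
    assert(root >= 0 && root < static_cast<int>(children.size()));
    visit(children, root);
  }

  // a dominates b exactly when b's interval nests inside a's interval.
  // Nodes never visited (unreachable from the root) dominate nothing and
  // are dominated by nothing.
  bool dominates(int a, int b) const {
    return in[a] >= 0 && in[b] >= 0 && in[a] <= in[b] && out[b] <= out[a];
  }
};
```

A predecessor u of header h then forms a back edge u->h precisely when `dominates(h, u)` holds, which is the check that seeds the worklist.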

worklist may contain duplicate elements. This is acceptable. You could also deduplicate elements.

Importantly, the header predecessor of a subloop can be another subloop.

In the final loops array, parent loops are listed after their child loops.

This example examines multiple subtle details: a self-loop (node 6), an unreachable node (node 8), and a scenario where the header predecessor of one subloop (nodes 2 and 3) leads to another subloop (nodes 4 and 5).

Use awk 'BEGIN{print "digraph G{"} NR>1{print $1"->"$2} END{print "}"}' to generate a graphviz dot file.
