LLVM integrated assembler: Engineering better fragments
In my previous assembler posts, I've discussed improvements on
- Parsing phase: The assembler constructs section fragments. Thesefragments represent sequences of regular instructions or data,
span-dependentinstructions, alignment directives, and other elements. - Section layout phase: Once fragments are built, the assemblerassigns offsets to them and finalizes the span-dependent content.
Relocationdecision phase: In the final stage, the assembler evaluates fixupsand, if necessary, updates the content of the fragments.
When the LLVM integrated assembler was introduced in 2009, itssection and fragment design was quite basic. Performance wasn't theconcern at the time. As LLVM evolved, many assembler features added overthe years came to rely heavily on this original design. This created acomplex web that made optimizing the fragment representationincreasingly challenging.
Here's a look at some of the features that added to this complexityover the years:
- 2010: Mach-O
.subsection_via_symbols
and atoms - 2012: NativeClient's bundle alignment mode. I've created a dedicatedchapter for this.
- 2015: Hexagon instruction bundle
- 2016: CodeView variable definition ranges
- 2018: RISC-V linker relaxation
- 2020: x86
-mbranches-within-32B-boundaries
- 2023: LoongArch linker relaxation. This is largely identical toRISC-V linker relaxation. Any refactoring or improvements to the RISC-Vlinker relaxation often necessitate corresponding changes to theLoongArch implementation.
- 2023: z/OS GOFF(Generalized Object File Format)
I've included the start year for each feature to indicate when it wasinitially introduced, to the best of my knowledge. This doesn't implythat maintenance stopped after that year. On the contrary, many of thesefeatures, like RISC-V linker relaxation, require ongoing, activemaintenance.
Despite the intricate history, I've managed to untangle thesedependencies and implement the necessary fixes. And that, in a nutshell,is what this blog post is all about!
Reducing sizeof(MCFragment)
A significant aspect of optimizing fragment management involveddirectly reducing the memory footprint of the MCFragment object itself.Several targeted changes contributed to makingsizeof(MCFragment)
smaller, as mentioned by my previousblog post:
- MCInst:decrease inline element count to 6
- [MC]Reduce size of MCDataFragment by 8 bytes by @aengelke
- [MC] MoveMCFragment::Atom to MCSectionMachO::Atoms
The fragment management system has also been streamlined bytransitioning from a doubly-linked list (llvm::iplist
) to asingly-linked list, eliminating unnecessary overhead. A few prerequisitecommits removed backward iterator requirements. It's worth noting thatthe complexities introduced by features like NaCl's bundle alignmentmode, x86's -mbranches-within-32B-boundaries
option, andHexagon's instruction bundles presented challenges.
The quest fortrivially destructible fragments
Historically, MCFragment
subclasses, specificallyMCDataFragment
and MCRelaxableFragment
, reliedon SmallVector
member variables to store their content andfixups. This approach, while functional, presented two keyinefficiencies:
- Inefficient storage of small objects: The content and fixups forindividual fragments are typically very small. Storing a multitude ofthese tiny objects individually within
SmallVectors
led toless-than-optimal memory utilization. - Non-trivial destructors: When deallocating sections, the
~MCSection
destructor had to meticulously traverse thefragment list and explicitly destroy each fragment.
In 2024, @aengelke initiated a draft to storefragment content out-of-line. Building upon that foundation, I'veextended this approach to also store fixups out-of-line, and ensuredcompatibility with the aforementioned features that cause complexity(especially RISC-V and LoongArch linker relaxation.)
Furthermore, MCRelaxableFragment
previously containedMCInst Inst;
, which also necessitated a non-trivialdestructor. To address this, I've redesigned its data structure.operands are now stored within the parent MCSection, and theMCRelaxableFragment
itself only holds references:
1 |
uint32_t Opcode = 0; |
Unfortunately, we still need to encode MCInst::Flags
tosupport the x86 EVEX prefix, e.g., {evex} xorw $foo, %ax
.My hope is that the x86 maintainers might refactorX86MCCodeEmitter::encodeInstruction
to make this flagstorage unnecessary.
The new design of MCFragment
and MCSection
is as follows:
1 |
class MCFragment { |
(As a side note, CamelCase
variables are odd. As the MC maintainer, I'dbe delighted to see them refactored to camelBack
orsnake_case
if people agree on the direction.)
Key changes:
- MC: Storefragment content and fixups out-of-line
- CodeView:Move MCCVDefRangeFragment storage to MCContext/MCFragment. NFC
- MC: StoreMCRelaxableFragment MCInst out-of-line
Fewerfragments: fixed-size part and variable tail
Prior to LLVM 21.1, the assembler, operated with a fragment designdating back to 2009, placed every span-dependent instruction into itsown distinct fragment. The x86 code sequencepush rax; jmp foo; nop; jmp foo
would be represented withnumerous fragments:MCDataFragment(nop); MCRelaxableFragment(jmp foo); MCDataFragment(nop); MCRelaxableFragment(jmp foo)
.
A more efficient approach emerged: storing both a fixed-sizepart and an optional variable-size tail within a singlefragment.
- The fixed-size part maintains a consistent size throughout theassembly process.
- The variable-size tail, if present, encodes elements that can changein size or content, such as a span-dependent instruction, an alignmentdirective, a fill directive, or other similar span-dependentconstructs.
The new design led to significantly fewer fragments:
1 |
MCFragment(fixed: push rax, variable: jmp foo) |
Key changes:
- MC:Restructure MCFragment as a fixed part and a variable tail.
- MC:Encode FT_Align in fragment's variable-size tail
Reducing instructionencoding overhead
Encoding individual instructions is the most performance-criticaloperation within MCObjectStreamer
. Recognizing this,significant effort has been dedicated to reducing this overhead sinceMay 2023.
- [MC] Always encodeinstruction into SmallVector
[MC]Remove the legacy overload of encodeInstruction with a lot of priorcleanups - [MC][ELF]Emit instructions directly into fragment
- [MC][X86]Avoid copying MCInst in emitInstrEnd in 2024-06
X86AsmBackend:Remove some overhead from auto padding feature X86AsmBackend:Simplify isRightAfterData for the auto-pad feature
It's worth mentioning that x86's instruction padding features,introduced in 2020, have imposed considerable overhead. Specifically,these features are:
-
-mbranches-within-32B-boundaries
. SeeAlign branches within 32-Byteboundary(NOP padding) - [X86] Relax existinginstructions to reduce the number of nops needed for alignmentpurposes
- "Enhanced relaxation":The feature allows x86 prefix padding for all instructions, effectivelymaking all instructions span-dependent and requiring its ownfragment.
My recent optimization efforts demanded careful attention to theseparticularly complex and performance-sensitive code.
Eager fragment creation
Encoding an instruction is a far more frequent operation thanappending a variable-size tail to the current fragment. In the previousdesign, the instruction encoder was burdened with an extra check: it hadto determine if the current fragment already had a variable-sizetail.
1 |
encodeInstruction: |
Our new strategy optimizes this by maintaining a current fragmentthat is guaranteed not to have a variable-size tail. This meansfunctions appending data to the fixed-size part no longer need toperform this check. Instead, any function that sets a variable-size tailwill now immediately start a new fragment.
Here's how the workflow looks with this optimization:
1 |
encodeInstruction: |
Key changes:
- MC:Simplify fragment reuse determination
MC:Optimize getOrCreateDataFragment
It's worth noting that the first patch was made possible thanks tothe removal of the bundle alignment mode.
Fragment content in trailingdata
Our MCFragment
class manages four distinct sets ofappendable data: fixed-size content, fixed-size fixups, variable-sizetail content, and variable-size tail fixups. Of these, the fixed-sizecontent is typically the largest. We can optimize its storage byutilizing it as trailing data, akin to a flexible array member.
This approach offers several compelling advantages:
- Improved data locality: Storing the content after the MCFragmentobject enhances cache utility.
- Simplified metadata: We can replace the pair of
uint32_t ContentStart = 0; uint32_t ContentEnd = 0;
with asingleuint32_t ContentSize;
.
This optimization leverages a clever technique made possible by usinga special purpose bump allocator. After allocatingsizeof(MCFragment)
bytes for a new fragment, we know thatany remaining space within the current bump allocator block immediatelyfollows the fragment's end. This contiguous space can then beefficiently used for the fragment's trailing data.
However, this design introduces a few important considerations:
- Tail fragment appends only: Data can only be appended to the tailfragment of a subsection. Fragments located in the middle of asubsection are immutable in their fixed-size content. Anypost-assembler-layout adjustments must target the variable-sizetail.
- Dynamic Allocation Management: When new data needs to be appended, afunction is invoked to ensure the current bump allocator block hassufficient space. If not, the current fragment is closed (its fixed-sizecontent is finalized), and a new fragment is started. For instance, an8-byte sequence could be stored as one single fragment, or, if spaceconstraints dictate, as two fragments each encoding 4 bytes.
- New block allocation: If the available space in the current block isinsufficient, a new block large enough to accommodate both an MCFragmentand the required bytes for its trailing data is allocated.
- Section/subsection Switching: The previously saved fragment listtail cannot be simply reused. This is because it's tied to the memoryspace of the previous bump allocator block. Instead, a new fragment mustbe allocated using the current bump allocator block and appended to thenew subsection's tail.
I have thought about making the variable-size content immediatelyfollow the fixed-size content, but leb128 and x86's potentially verylong instruction (15 bytes) stopped me from doing it. There is certainlyroom for future improvements, though.
Key changes:
- GOFF:Only register sections within MCObjectStreamer::changeSection
- MC:Allocate initial fragment and define section symbol inchangeSection
MCFragment: Usetrailing data for fixed-size part
Fragment fixups stored insection
TODO
MCFragment should not hold references to fixups stored in the parentMCSection. Instead, fixups reference the fragment.
The optional variable-size tail of a fragment can have at most onefixup.
Google's now-discontinued Native Client (NaCl) project provided asandboxing environment through a combination of Software Fault Isolation(SFI) and memory segmentation. A distinctive feature of its SFIimplementation was the "bundle alignment mode", which adds NOP paddingto ensure that no instruction crosses a 32-byte alignment boundary. Theverifier's job is to check all instructions starting at 32-byte-multipleaddresses.
While the core concept of aligned bundling is intriguing, itsimplementation within the LLVM assembler proved problematic. Introducedin 2012, this feature imposed noticeable performance penalties on userswho had no need for NaCl, perhaps more critically, significantlyincreased the complexity of MC's internal workings. I was particularlyconcerned by its pervasive modifications toMCObjectStreamer
and MCAssembler
.
The complexity deepened with the introduction of
- 2014: MCStreamer's pendinglabels, which led to more complexity:
- 2015: [MC] Ensure thatpending labels are flushed when -mc-relax-all flag is used
- 2019: [MC] Match labels toexisting fragments even when switching sections. by an Appledeveloper. In a nutshell, the pending labels mechanism was causingheadache to Mach-O, requiring additional code to manage.
- 2015: NaCl's
mc-relax-all
optimization
In MCObjectStreamer
, newly defined labels were put intoa "pending label" list and initially assigned to aMCDummyFragment
associated with the current section. Thesymbols would be reassigned to a new fragment when the next instructionor directive was parsed. This pending label system introduced complexityand a missing flushPendingLabels
could lead to subtle bugsrelated to incorrect symbol values. flushPendingLabels
wascalled by many MCObjectStreamer
functions, noticeably oncefor each new fragment, adding overhead. It also complicated the labeldifference evaluation due to MCDummyFragment
inMCExpr.cpp:AttemptToFoldSymbolOffsetDifference
.
For the following code, aligned bundling requires that .Ltmp isdefined at addl.
1 |
$ clang var.c -S -o - -fPIC -m32 |
Recognizing these long-standing issues, a series of pivotal changeswere undertaken:
- 2024: [MC]Aligned bundling: remove special handling for RelaxAll removed anoptimization for NaCl in the
mc-relax-all
mode - 2024:
[MC]Remove pending labels - 2024:
[MC]AttemptToFoldSymbolOffsetDifference: remove MCDummyFragment check.NFC - 2025: Finally,
MC: Removebundle alignment mode, after Derek Schuff agreed to drop NaClsupport from LLVM.
Should future features require a variant of bundle alignment, Ifirmly believe a much cleaner implementation is necessary. This couldpotentially be achieved through a backend hook withinX86AsmBackend::finishLayout
, applied after the primaryassembler layout phase, similar to how the-mbranches-within-32B-boundaries
option is handled, thougheven that implementation warrants an extensive revisit itself.
Lessons learned
The cost of missing early optimization
Early design choices can have a far-reaching impact on future code.The initial LLVM MC design, while admirably simple in its inception,inadvertently created a rigid foundation. As new features piled on, eachrelying more and more on the specific fragment internals, rectifyingfoundational inefficiencies became incredibly challenging. The Hyrum'sLaw was evident: features built on this foundation inevitably dependedon all its observable behaviors. Optimizing the underlying structurerequired not just a change to the core, but also a thorough fix for allits unsuspecting users. I encountered significant struggles with thedeeply ingrained complexities stemming from NaCl's bundle alignmentmode, x86's -mbranches-within-32B-boundaries
option, andthe intricacies of RISC-V linker relaxation.
Cargo cult programming and snowball effect
I observed instances of "cargo cult programming", where existingsolutions were copied without a full understanding of their underlyingrationale or applicability. For example:
- The WebAssembly implementation heavily mirrored that of ELF.Consequently, many improvements made to the ELF component oftennecessitated corresponding, sometimes redundant, changes to theWebAssembly implementation. In additin, the WebAssembly implementationcopied ELF-specific code that was irrelevant for WebAssembly'sarchitecture, adding unnecessary bloat and complexity.
- LoongArch's RISC-V replication: LoongArch's linker relaxationimplementation directly copied the approach taken for RISC-V.Refactoring or improvements to RISC-V's linker relaxation frequentlyrequire mirrored changes in the LoongArch codebase, creating parallelmaintenance burdens. I am particularly glad that I landed myfoundational [RISCV] Makelinker-relaxable instructions terminate MCDataFragment and
[RISCV] Allow delayed decisionfor ADD/SUB relocations in 2023, before the LoongArch teamreplicated the RISC-V approach. This timing, I hope, mitigated somefuture headaches for their implementation.
These patterns illustrate how initial design choices, or theexpedience of copying existing solutions, can lead to a "snowballeffect" of accumulating complexity and redundant code that makes futureoptimization and maintenance significantly harder. On a positive note,I'm also pleased that
The cost of features
Specific features, particularly those designed for niche orspecialized use cases like NaCl's bundle alignment mode, introduceddisproportionate complexity and performance overhead across the entireassembler. Even though NaCl itself was deprecated in 2020, it took until2025 to finally excise its complex support from LLVM. This highlights acommon challenge in large, open-source projects: while many developersare motivated to add new features, there's often far less incentive ordedicated effort to streamline or remove their underlying implementationcomplexities once they're no longer strictly necessary or have become aperformance drain.
I want to acknowledge the work of individuals like Rafael Ávila deEspíndola, Saleem Abdulrasool, and Nirav Dave, whose improvements toLLVM MC were vital. Without their contributions, the MC layer wouldundoubtedly be in a far less optimized state today.
Epilogue
This extensive work on fragment optimization would not have beenpossible without the invaluable contributions of
What I have learnd through the process?
After dedicating several paragraphs to explaining the historicalshortcomings of LLVM MC's fragment representation, a natural questionarises: how does GNU Assembler (GAS), arguably the other most popularassembler on Linux systems, approach fragment handling?
Delving into its history reveals a fascinating answer. The earliestcommit I could locate is a cvs2svn-generated record from April 1991.Given the 1987 copyright notice within the code, it's highly probablethat this foundational work on fragments was laid down as early as1987.
You can explore this initial structure in as.h here: as.h
and frags.c
.
Observing the frag
struct, a few points stand out:
- While the exact purpose of
fr_offset
isn't immediatelyclear to me,fr_fix
andfr_var
bear a strikingresemblance to the concepts we've recently introduced in MCFragment. Itmight make the variable-size content immediately follow the fixed-sizecontent, though. - The
char fr_literal[1]
demonstrates an early use ofwhat we now call a flexible array member. Today, GCC and Clang's-fstrict-flex-arrays=2
would report a warning. -
fr_symbol
could be more appropriately placed within aunion -
fr_pcrel_adjust
andfr_bsr
would ideallybe architecture-specific data. - Fragments are allocated using
obstacks,which appear to be a more sophisticated form of a bump allocator, withadditional bookkeeping overhead.
But truly, I should stop the minor nit-picking. What astonishinglyimpresses me is the sheer foresight demonstrated in GAS's fragmentallocator design. Conceived in 1987 or even earlier, it masterfullyanticipated solutions that LLVM MC, first conceived in 2009, has onlynow achieved decades later. This design held the lead on fragmentarchitecture for nearly four decades!
My greatest tribute goes to the original authors of GNU Assembler forthis remarkable piece of engineering.
1 |
/* |