
Removing global state from LLD

Author: MaskRay
November 17, 2024 16:00

LLD, the LLVM linker, is a mature and fast linker supporting multiple binary formats (ELF, Mach-O, PE/COFF, WebAssembly). Designed as a standalone program, the code base relies heavily on global state, making it less than ideal for library integration. As outlined in RFC: Revisiting LLD-as-a-library design, two main hurdles exist:

  • Fatal errors: they exit the process without returning control to the caller. This was actually addressed for most scenarios in 2020 by utilizing llvm::sys::Process::Exit(val, /*NoCleanup=*/true) and CrashRecoveryContext (longjmp under the hood); a minimal usage sketch follows this list.
  • Global variable conflicts: shared global variables do not allow two concurrent invocations.
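For illustration, here is a minimal sketch of how a caller can keep control when a guarded fatal error unwinds through CrashRecoveryContext. It is not LLD's actual code: runLink is a hypothetical stand-in for a library entry point that may hit a fatal-error path.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/CrashRecoveryContext.h"

// Hypothetical stand-in for a library entry point that may call a
// fatal-error path in a real linker.
int runLink(llvm::ArrayRef<const char *> args) { return args.empty() ? 1 : 0; }

int safeLink(llvm::ArrayRef<const char *> args) {
  llvm::CrashRecoveryContext::Enable();
  llvm::CrashRecoveryContext crc;
  int ret = 1;
  // RunSafely returns false if the guarded code "crashed", i.e. a fatal
  // error unwound through the CrashRecoveryContext instead of exiting the
  // whole process.
  if (!crc.RunSafely([&] { ret = runLink(args); }))
    return 1;
  return ret;
}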

I understand that calling a linker API could be convenient, especially when you want to avoid shipping another executable (which can be large when you link against LLVM statically). However, I believe that invoking LLD as a separate process remains the recommended approach. There are several advantages:

  • Build system control: Build systems gain greater control over scheduling and resource allocation for LLD. In an edit-compile-link cycle, the link step may need more resources, so threading is more useful there.
  • Better parallelism management
  • Global state isolation: LLVM's global state (primarily cl::opt and ManagedStatic) is isolated.

While spawning a new process offers build system benefits, the issue of global state usage within LLD remains a concern. This is a factor to consider, especially for advanced use cases. Here are the global variables in the LLD 15 code base.

% rg '^extern [^(]* \w+;' lld/ELF
lld/ELF/SyntheticSections.h
1290:extern InStruct in;

lld/ELF/Symbols.h
51:extern SmallVector<SymbolAux, 0> symAux;

lld/ELF/SymbolTable.h
87:extern std::unique_ptr<SymbolTable> symtab;

lld/ELF/InputSection.h
33:extern std::vector<Partition> partitions;
403:extern SmallVector<InputSectionBase *, 0> inputSections;
408:extern llvm::DenseSet<std::pair<const Symbol *, uint64_t>> ppc64noTocRelax;

lld/ELF/OutputSections.h
156:extern llvm::SmallVector<OutputSection *, 0> outputSections;

lld/ELF/InputFiles.h
43:extern std::unique_ptr<llvm::TarWriter> tar;

lld/ELF/Driver.h
23:extern std::unique_ptr<class LinkerDriver> driver;

lld/ELF/LinkerScript.h
366:extern std::unique_ptr<LinkerScript> script;

lld/ELF/Config.h
372:extern std::unique_ptr<Configuration> config;
406:extern std::unique_ptr<Ctx> ctx;

Some global state exists as static member variables.

Cleaning up global variables

LLD has been undergoing a transformation to reduce its reliance on global variables. This improves its suitability for library integration.

  • In 2020, [LLD][COFF] Cover usage of LLD as a library enabled running the LLD driver multiple times even if there is a fatal error.
  • In 2021, global variables were removed from lld/Common.
  • The COFF port followed suit, eliminating most of its global variables.

Inspired by these advancements, I conceived a plan to eliminate global variables from the ELF port. In 2022, as part of the work to enable parallel section initialization, I introduced struct Ctx to lld/ELF/Config.h. Here is my plan (a schematic sketch follows the list):

  • Global variables will be migrated into Ctx.
  • Functions will be modified to accept a new Ctx &ctx parameter.
  • The previously global variable lld::elf::ctx will become a local variable within lld::elf::link.
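To make the plan concrete, here is a schematic before/after with made-up names (not actual LLD code) showing the pattern:

// Before: free functions consult a global context.
//   Ctx ctx;                                  // global
//   void scanRelocations() { if (ctx.arg.emitRelocs) { /* ... */ } }

// After: the context is created in link() and threaded through explicitly.
struct Ctx {
  struct { bool emitRelocs = false; } arg;     // command-line options live here
};

static void scanRelocations(Ctx &ctx) {
  if (ctx.arg.emitRelocs) {
    // ...
  }
}

bool link(/* ArrayRef<const char *> args, ... */) {
  Ctx ctx;              // one context per invocation, no global state
  scanRelocations(ctx); // every callee receives Ctx &ctx
  return true;
}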

Encapsulating global variables into Ctx

Over the past two and a half years, I have migrated global variables into the Ctx class, e.g.:

diff --git a/lld/ELF/Config.h b/lld/ELF/Config.h
index 590c19e6d88d..915c4d94e870 100644
--- a/lld/ELF/Config.h
+++ b/lld/ELF/Config.h
@@ -382,2 +382,10 @@ struct Ctx {
std::atomic<bool> hasSympart{false};
+ // A tuple of (reference, extractedFile, sym). Used by --why-extract=.
+ SmallVector<std::tuple<std::string, const InputFile *, const Symbol &>, 0>
+ whyExtractRecords;
+ // A mapping from a symbol to an InputFile referencing it backward. Used by
+ // --warn-backrefs.
+ llvm::DenseMap<const Symbol *,
+ std::pair<const InputFile *, const InputFile *>>
+ backwardReferences;
};
diff --git a/lld/ELF/Driver.cpp b/lld/ELF/Driver.cpp
index 8315d43c776e..2ab698c91b01 100644
--- a/lld/ELF/Driver.cpp
+++ b/lld/ELF/Driver.cpp
@@ -1776,3 +1776,3 @@ static void handleUndefined(Symbol *sym, const char *option) {
if (!config->whyExtract.empty())
- driver->whyExtract.emplace_back(option, sym->file, *sym);
+ ctx->whyExtractRecords.emplace_back(option, sym->file, *sym);
}
@@ -1812,3 +1812,3 @@ static void handleLibcall(StringRef name) {

-void LinkerDriver::writeArchiveStats() const {
+static void writeArchiveStats() {
if (config->printArchiveStats.empty())
@@ -1834,3 +1834,3 @@ void LinkerDriver::writeArchiveStats() const {
++extracted[CachedHashStringRef(file->archiveName)];
- for (std::pair<StringRef, unsigned> f : archiveFiles) {
+ for (std::pair<StringRef, unsigned> f : driver->archiveFiles) {
unsigned &v = extracted[CachedHashString(f.first)];

I did not do any further work on the global variables until July 2024, when the effort was resumed. I moved TarWriter, SymbolAux, Out, ElfSym, outputSections, etc. into Ctx.

struct Ctx {
Config arg;
LinkerDriver driver;
LinkerScript *script;
std::unique_ptr<TargetInfo> target;
...
};

The config variable, used to store command-line options, was pervasive throughout lld/ELF. To enhance code clarity and maintainability, I renamed it to ctx.arg (mold naming).

I've removed other instances of static storage variables throughout lld/ELF, e.g.:

  • static member LinkerDriver::nextGroupId
  • static member SharedFile::vernauxNum
  • sectionMap in lld/ELF/Arch/ARM.cpp

Passing Ctx &ctx as parameters

The subsequent phase involved adding Ctx &ctx as a parameter to numerous functions and classes, gradually eliminating references to the global ctx.

I incorporated Ctx &ctx as a member variable into a few classes (e.g. SyntheticSection, OutputSection) to minimize the modifications to member functions. This approach was not suitable for Symbol and InputSection, since even a single extra word could increase memory consumption significantly.

// Writer.cpp
template <class ELFT> class Writer {
public:
LLVM_ELF_IMPORT_TYPES_ELFT(ELFT)

Writer(Ctx &ctx) : ctx(ctx), buffer(ctx.e.outputBuffer) {}
...

template <class ELFT> void elf::writeResult(Ctx &ctx) {
Writer<ELFT>(ctx).run();
}
...

bool elf::includeInSymtab(Ctx &ctx, const Symbol &b) {
if (auto *d = dyn_cast<Defined>(&b)) {
// Always include absolute symbols.
SectionBase *sec = d->section;
if (!sec)
return true;
assert(sec->isLive());

if (auto *s = dyn_cast<MergeInputSection>(sec))
return s->getSectionPiece(d->value).live;
return true;
}
return b.used || !ctx.arg.gcSections;
}

Eliminating the global ctx variable

Once the global ctx variable's reference count reached zero, it was time to remove it entirely. I implemented the change on November 16, 2024.

diff --git a/lld/ELF/Config.h b/lld/ELF/Config.h
index 72feeb9d49cb..a9b7a98e5b54 100644
--- a/lld/ELF/Config.h
+++ b/lld/ELF/Config.h
@@ -539,4 +539,2 @@ struct InStruct {
std::unique_ptr<SymtabShndxSection> symTabShndx;
-
- void reset();
};
@@ -664,3 +662,2 @@ struct Ctx {
Ctx();
- void reset();

@@ -671,4 +668,2 @@ struct Ctx {

-LLVM_LIBRARY_VISIBILITY extern Ctx ctx;
-
// The first two elements of versionDefinitions represent VER_NDX_LOCAL and
diff --git a/lld/ELF/Driver.cpp b/lld/ELF/Driver.cpp
index 334dfc0e3ba1..631051c27381 100644
--- a/lld/ELF/Driver.cpp
+++ b/lld/ELF/Driver.cpp
@@ -81,4 +81,2 @@ using namespace lld::elf;

-Ctx elf::ctx;
-
static void setConfigs(Ctx &ctx, opt::InputArgList &args);
@@ -165,2 +114,3 @@ bool link(ArrayRef<const char *> args, llvm::raw_ostream &stdoutOS,
llvm::raw_ostream &stderrOS, bool exitEarly, bool disableOutput) {
+ Ctx ctx;
// This driver-specific context will be freed later by unsafeLldMain().
@@ -169,7 +119,2 @@ bool link(ArrayRef<const char *> args, llvm::raw_ostream &stdoutOS,
context->e.initialize(stdoutOS, stderrOS, exitEarly, disableOutput);
- context->e.cleanupCallback = []() {
- Ctx &ctx = elf::ctx;
- ctx.reset();
- ctx.partitions.emplace_back(ctx);
- };
context->e.logName = args::getFilenameWithoutExe(args[0]);

Previously, cleanupCallback was essential for resetting the global ctx when lld::elf::link was invoked multiple times. With the removal of the global variable, this callback is no longer necessary: we can now rely on the constructor to initialize Ctx and avoid the need for a reset function.

Removing global state from lld/Common

While significant progress has been made in lld/ELF, lld/Common needs a lot of work as well. A lot of shared utility code (diagnostics, bump allocator) utilizes the global lld::context().

/// Returns the default error handler.
ErrorHandler &errorHandler();

void error(const Twine &msg);
void error(const Twine &msg, ErrorTag tag, ArrayRef<StringRef> args);
[[noreturn]] void fatal(const Twine &msg);
void log(const Twine &msg);
void message(const Twine &msg, llvm::raw_ostream &s = outs());
void warn(const Twine &msg);
uint64_t errorCount();

Although thread-local variables are an option, worker threads spawned by llvm/lib/Support/Parallel.cpp don't inherit their values from the main thread. Given our direct access to Ctx &ctx, we can leverage context-aware APIs as replacements.

https://github.com/llvm/llvm-project/pull/112319 introduced context-aware diagnostic utilities (a schematic sketch of such stream-style helpers follows the list):

  • log("xxx") => Log(ctx) << "xxx"
  • message("xxx") => Msg(ctx) << "xxx"
  • warn("xxx") => Warn(ctx) << "xxx"
  • errorOrWarn(toString(f) + "xxx") => Err(ctx) << f << "xxx"
  • error(toString(f) + "xxx") => ErrAlways(ctx) << f << "xxx"
  • fatal("xxx") => Fatal(ctx) << "xxx"
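As a rough idea of how such stream-style helpers can be structured, here is a minimal sketch with simplified types; it is not LLD's actual implementation. A small RAII class accumulates the message pieces and reports them through the context when the statement ends:

#include <sstream>
#include <string>

// Simplified stand-in for lld/ELF's Ctx; the real one carries much more.
struct Ctx {
  int errorCount = 0;
  void report(const std::string &kind, const std::string &msg) {
    if (kind == "error")
      ++errorCount;
    // Write msg to the context's output stream rather than a global one.
  }
};

// Collects message pieces and reports them on destruction, so
// `Err(ctx) << f << "xxx";` behaves like error(toString(f) + "xxx").
class Diag {
  Ctx &ctx;
  std::string kind;
  std::ostringstream os;
public:
  Diag(Ctx &c, std::string k) : ctx(c), kind(std::move(k)) {}
  ~Diag() { ctx.report(kind, os.str()); }
  template <class T> Diag &operator<<(const T &v) { os << v; return *this; }
};

// C++17: the returned prvalue is constructed in place; its destructor runs
// at the end of the full expression, emitting the diagnostic.
inline Diag Warn(Ctx &ctx) { return Diag(ctx, "warning"); }
inline Diag Err(Ctx &ctx) { return Diag(ctx, "error"); }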

As of Nov 16, 2024, I have eliminated log/warn/error/fatal from lld/ELF.

The underlying functions (lld::ErrorHandler::fatal, and lld::ErrorHandler::error when the error limit is hit and exitEarly is true) call exitLld(1).

lld::make from lld/include/lld/Common/Memory.h is an allocation function that uses the global context. When the ownership is clear, std::make_unique might be a better choice.
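A small self-contained example of the ownership-based alternative (MySection is a hypothetical type, not an LLD class):

#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an LLD data structure.
struct MySection {
  std::string name;
  explicit MySection(std::string n) : name(std::move(n)) {}
};

int main() {
  // Instead of `lld::make<MySection>(".text")`, which parks the object in a
  // bump allocator tied to the global lld context, give it a clear owner:
  std::vector<std::unique_ptr<MySection>> sections;
  sections.push_back(std::make_unique<MySection>(".text"));
  // `sections` owns the objects; no global allocator or cleanup hook needed.
}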

Guideline:

  • Avoid lld::saver
  • Avoid void message(const Twine &msg, llvm::raw_ostream &s = outs());, which utilizes lld::outs()
  • Avoid lld::make from lld/include/lld/Common/Memory.h
  • Avoid fatal errors in a half-initialized object, e.g. a fatal error in a base class constructor (ELFFileBase::init) ([LLD][COFF] When using LLD-as-a-library, always prevent re-entrance on failures)

Global state in LLVM

LTO link jobs utilize LLVM. Understanding its global state is crucial.

While LLVM allows for multiple LLVMContext instances to be allocated and used concurrently, it's important to note that these instances share certain global states, such as cl::opt and ManagedStatic. Specifically, it's not possible to run two concurrent LLVM compilations (including LTO link jobs) with distinct sets of cl::opt option values. To link with distinct cl::opt values, even after removing LLD's global state, you'll need to spawn a new LLD process.
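A minimal illustration of why this is the case (the option name here is made up; the cl::opt mechanics are real): cl::opt instances register themselves in a process-wide registry, so every compilation in the process observes the same value.

#include "llvm/Support/CommandLine.h"

// Registered in a process-wide registry during static initialization;
// there is one Threshold for the whole process, not one per LLVMContext.
static llvm::cl::opt<unsigned>
    Threshold("example-threshold", llvm::cl::desc("example threshold"),
              llvm::cl::init(225));

void configureFromArgs(int argc, const char **argv) {
  // Parsing mutates the single global Threshold. A second in-process LTO
  // link that wants a different value would overwrite (or race with) it.
  llvm::cl::ParseCommandLineOptions(argc, argv);
}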

Any proposal that moves away from global state seems to complicate cl::opt usage, making it impractical.

LLD also utilizes functions from llvm/Support/Parallel.h for parallelism. These functions rely on global state like getDefaultExecutor and llvm::parallel::strategy. Ongoing work by Alexandre Ganea aims to make these functions context-aware. (It was nice to meet you in person at the LLVM Developers' Meeting last month.)

Supported library usage scenarios

You can repeatedly call lld::lldMain from lld/Common/Driver.h. If fatal has been invoked, it will not be safe to call lld::lldMain again in certain rare scenarios. Running lld::lldMain concurrently in two threads is not supported.
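For reference, here is a sketch of calling this entry point based on the LLVM 17+ lld/Common/Driver.h interface; older releases expose a different signature, so treat this as an assumption and check the header shipped with your version.

#include "lld/Common/Driver.h"
#include "llvm/Support/raw_ostream.h"

LLD_HAS_DRIVER(elf) // declares lld::elf::link for registration below

int runElfLink(llvm::ArrayRef<const char *> args) {
  lld::Result r = lld::lldMain(args, llvm::outs(), llvm::errs(),
                               {{lld::Gnu, &lld::elf::link}});
  // r.canRunAgain is false after a fatal error; in that case spawn a fresh
  // process instead of calling lldMain again.
  return r.retCode;
}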

The command LLD_IN_TEST=3 lld-link ... runs the link process three times, but only the final invocation outputs diagnostics to stdout/stderr. lld/test/lit.cfg.py has configured the COFF port to run tests twice ([lld] Add test suite mode for running LLD main twice). Other ports need work to make this mode work.

Keeping pace with LLVM: compatibility strategies

Author: MaskRay
November 10, 2024 16:00

LLVM's C++ API doesn't offer a stability guarantee. This means function signatures can change or be removed between versions, forcing projects to adapt.

On the other hand, LLVM has an extensive API surface. When a library like llvm/lib/Y relies on functionality from another library, the API is often exported in header files under llvm/include/llvm/X/, even if it is not intended to be user-facing.

To be compatible with multiple LLVM versions, many projects rely on #if directives based on the LLVM_VERSION_MAJOR macro. This post explores the specific techniques used by ccls to ensure compatibility with LLVM versions 7 to 19. For the latest release (ccls 0.20241108), support for LLVM versions 7 to 9 has been discontinued.

Given the tight coupling between LLVM and Clang, the LLVM_VERSION_MAJOR macro can be used to detect both versions. There's no need to check CLANG_VERSION_MAJOR.


Changed namespaces

In Oct 2018, https://reviews.llvm.org/D52783 moved the namespace clang::vfs to llvm::vfs. To remain compatible, I renamed clang::vfs uses and added a conditional namespace alias:

#if LLVM_VERSION_MAJOR < 8
// D52783 Lift VFS from clang to llvm
namespace llvm {
namespace vfs = clang::vfs;
}
#endif

Removed functions

In March 2019, https://reviews.llvm.org/D59377 removed the member variable VirtualFileSystem and removed setVirtualFileSystem. To adapt to this change, ccls employs an #if.

#if LLVM_VERSION_MAJOR >= 9 // rC357037
Clang->createFileManager(FS);
#else
Clang->setVirtualFileSystem(FS);
Clang->createFileManager();
#endif

Changed function parameters

In April 2020, the LLVM monorepo integrated a new subproject: flang. flang developers made many changes to clangDriver to reuse it for flang. https://reviews.llvm.org/D86089 changed the constructor clang::driver::Driver. I added:

#if LLVM_VERSION_MAJOR < 12 // llvmorg-12-init-5498-g257b29715bb
driver::Driver d(args[0], llvm::sys::getDefaultTargetTriple(), *diags, vfs);
#else
driver::Driver d(args[0], llvm::sys::getDefaultTargetTriple(), *diags, "ccls", vfs);
#endif

In November 2020, https://reviews.llvm.org/D90890 changed an argument of ComputePreambleBounds from const llvm::MemoryBuffer *Buffer to const llvm::MemoryBufferRef &Buffer.

std::unique_ptr<llvm::MemoryBuffer> buf =
llvm::MemoryBuffer::getMemBuffer(content);
#if LLVM_VERSION_MAJOR >= 12 // llvmorg-12-init-11522-g4c55c3b66de
auto bounds = ComputePreambleBounds(*ci.getLangOpts(), *buf, 0);
#else
auto bounds = ComputePreambleBounds(*ci.getLangOpts(), buf.get(), 0);
#endif

https://reviews.llvm.org/D91297 made a similar change and I adapted it similarly.

In Jan 2022, https://reviews.llvm.org/D116317 added a new parameter bool Braced to CodeCompleteConsumer::ProcessOverloadCandidates.

  void ProcessOverloadCandidates(Sema &s, unsigned currentArg,
OverloadCandidate *candidates,
unsigned numCandidates
#if LLVM_VERSION_MAJOR >= 8
,
SourceLocation openParLoc
#endif
#if LLVM_VERSION_MAJOR >= 14
,
bool braced
#endif
) override {

In late 2022 and early 2023, there were many changes to migrate from llvm::Optional to std::optional.

#if LLVM_VERSION_MAJOR >= 16 // llvmorg-16-init-12589-ge748db0f7f09
std::array<std::optional<StringRef>, 3>
#else
std::array<Optional<StringRef>, 3>
#endif
redir{StringRef(stdinPath), StringRef(path), StringRef()};
std::vector<StringRef> args{g_config->compilationDatabaseCommand, root};
if (sys::ExecuteAndWait(args[0], args, {}, redir, 0, 0, &err_msg) < 0) {

In Sep 2023, https://github.com/llvm/llvm-project/pull/65647 changed CompilerInvocationRefBase to CompilerInvocationBase. I duplicated the code with an #if:

#if LLVM_VERSION_MAJOR >= 18
ci->getLangOpts().SpellChecking = false;
ci->getLangOpts().RecoveryAST = true;
ci->getLangOpts().RecoveryASTType = true;
#else
ci->getLangOpts()->SpellChecking = false;
#if LLVM_VERSION_MAJOR >= 11
ci->getLangOpts()->RecoveryAST = true;
ci->getLangOpts()->RecoveryASTType = true;
#endif
#endif

In April 2024, https://github.com/llvm/llvm-project/pull/89548/ removed llvm::StringRef::startswith in favor of starts_with. starts_with has been available since Oct 2022 and startswith had been deprecated. I added the following snippet:

#if LLVM_VERSION_MAJOR >= 19
#define startswith starts_with
#define endswith ends_with
#endif

It's important to note that the converse approach

#define starts_with startswith
#define ends_with endswith

could break code that calls std::string_view::starts_with.
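A one-line illustration of that breakage (plain std::string_view usage that is valid C++20 on its own):

#include <string_view>

// With `#define starts_with startswith` in effect, this call is textually
// rewritten to sv.startswith("-o"), a member std::string_view does not
// have, so the translation unit no longer compiles.
bool isOutputOption(std::string_view sv) {
  return sv.starts_with("-o");
}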

Changed enumerators

In November 2023, https://github.com/llvm/llvm-project/pull/71160 changed an unnamed enumeration to a scoped enumeration. To keep the following snippet compiling,

switch (tag_d->getTagKind()) {
case TTK_Struct:
tag = "struct";
break;
case TTK_Interface:
tag = "__interface";
break;
case TTK_Union:
tag = "union";
break;
case TTK_Class:
tag = "class";
break;
case TTK_Enum:
tag = "enum";
break;
}

I introduced macros.

#if LLVM_VERSION_MAJOR >= 18 // llvmorg-18-init-10631-gedd690b02e16
#define TTK_Class TagTypeKind::Class
#define TTK_Enum TagTypeKind::Enum
#define TTK_Interface TagTypeKind::Interface
#define TTK_Struct TagTypeKind::Struct
#define TTK_Union TagTypeKind::Union
#endif

In April 2024, https://github.com/llvm/llvm-project/pull/89639 renamed an enumerator. I have made the following adaptation:

#if LLVM_VERSION_MAJOR >= 19 // llvmorg-19-init-9465-g39adc8f42329
case BuiltinType::ArraySection:
#else
case BuiltinType::OMPArraySection:
return "<OpenMP array section type>";
#endif
return "<array section type>";

Build system changes

In Dec 2022, https://reviews.llvm.org/D137838 added a new LLVM library LLVMTargetParser. I adjusted ccls's CMakeLists.txt:

target_link_libraries(ccls PRIVATE LLVMOption LLVMSupport)
if(LLVM_VERSION_MAJOR GREATER_EQUAL 16) # llvmorg-16-init-15123-gf09cf34d0062
target_link_libraries(ccls PRIVATE LLVMTargetParser)
endif()

Summary

The above examples illustrate how to adapt to changes in the LLVM and Clang APIs. It's important to remember that API changes are a natural part of software development, and testing with different releases is crucial for maintaining compatibility with a wide range of LLVM versions.

When introducing new interfaces, we should take care to reduce the chance that the interface will later be changed in a way that disrupts downstream projects. That said, changes are normal. When an API change is justified, do it.

Downstream projects should be mindful of the stability guarantees of different LLVM APIs. Some APIs may be more prone to change than others. It's essential to write code in a way that can easily adapt to changes in the LLVM API.

LLVM C API

While LLVM offers a C API with an effort made towards compatibility, its capabilities often fall short.

Clang provides a C API called libclang. While highly stable, libclang's limited functionality makes it unsuitable for many tasks.

In 2018, when creating ccls (a fork of cquery), I encountered multiple limitations in libclang's ability to handle code completion and indexing. This led to rewriting the relevant code to leverage the Clang C++ API for a more comprehensive solution. The following commits offer insights into how the C API and the mostly equivalent but better C++ API work:

  • First draft: replace libclang indexer with clangIndex
  • Use Clang C++ for completion and diagnostics

Tinkering with Neovim

Author: MaskRay
November 2, 2024 15:00

After migrating from Vim to Emacs as my primary C++ editor in 2015, I switched from Vim to Neovim for miscellaneous non-C++ tasks as it is more convenient in a terminal. Customizing the editor with a language you are comfortable with is important. I found myself increasingly drawn to Neovim's terminal-based simplicity for various tasks. Recently, I've refined my Neovim setup to the point where I can confidently migrate my entire C++ workflow away from Emacs.

This post explores the key improvements I've made to achieve this transition. My focus is on code navigation.

Key mapping

I've implemented custom functions that simplify key mappings.

local function map(mode, lhs, rhs, opts)
local options = {}
if opts then
if type(opts) == 'string' then
opts = {desc = opts}
end
options = vim.tbl_extend('force', options, opts)
end
vim.keymap.set(mode, lhs, rhs, options)
end
local function nmap(lhs, rhs, opts)
map('n', lhs, rhs, opts)
end
local function tmap(lhs, rhs, opts)
map('t', lhs, rhs, opts)
end

I've swapped ; and : for easier access to Ex commands, especially since leap.nvim renders ; less useful for repeating ftFT.

map({'n', 'x'}, ':', ';')
map({'n', 'x'}, ';', ':')

Cross references

Like many developers, I spend significantly more time reading code than writing it. Efficiently navigating definitions and references is crucial for productivity.

While the built-in LSP client's C-] is functional (see :h lsp-defaults tagfunc), I found it less convenient. Many Emacs and Neovim configurations advocate for gd. However, both G and D are placed on the left half of the QWERTY keyboard, making it slow to press them using the left hand.

For years, I relied on M-j to quickly jump to definitions.

To avoid a conflict with my recent zellij change (I adopted M-hjkl for pane navigation), I've reassigned J to trigger definition jumps. Although I've lost the original J (join lines) functionality, vJ provides a suitable workaround.

nmap('J', '<cmd>Telescope lsp_definitions<cr>', 'Definitions')
nmap('<M-,>', '<cmd>Telescope lsp_references<CR>', 'References')

After making an LSP-based jump, the jump list can quickly fill with irrelevant entries as I navigate the codebase. Thankfully, Telescope's LSP functionality sets push_tagstack_on_edit to push an entry to the tag stack (see :h tag-stack). To efficiently return to my previous position, I've mapped H to :pop and L to :tag.

nmap('H', '<cmd>pop<cr>', 'Tag stack backward')
nmap('L', '<cmd>tag<cr>', 'Tag stack forward')

I've adopted x as a prefix key for cross-referencing extensions. dl provides a suitable alternative for x's original functionality.

nmap('xB', '<cmd>CclsBaseHierarchy<cr>')
nmap('xC', '<cmd>CclsOutgoingCalls<cr>', 'callee')
nmap('xD', '<cmd>CclsDerivedHierarchy<cr>')
nmap('xM', '<cmd>CclsMemberHierarchy<cr>', 'member')
nmap('xb', '<cmd>CclsBase<cr>')
nmap('xc', '<cmd>CclsIncomingCalls<cr>', 'caller')
nmap('xd', '<cmd>CclsDerived<cr>')
nmap('xi', '<cmd>lua vim.lsp.buf.implementation()<cr>', 'Implementation')
nmap('xm', '<cmd>CclsMember<cr>', 'member')
nmap('xn', function() M.lsp.words.jump(vim.v.count1) end, 'Next reference')
nmap('xp', function() M.lsp.words.jump(-vim.v.count1) end, 'Prev reference')
nmap('xt', '<cmd>lua vim.lsp.buf.type_definition()<cr>', 'Type definition')
nmap('xv', '<cmd>CclsVars<cr>', 'vars')

I utilize xn and xp to find the next or previous reference. The implementation, copied from LazyVim, only works with references within the current file. I want to enable the xn map to automatically transition to the next file when reaching the last reference in the current file.

While using Emacs, I created a hydra with x as the prefix key to cycle through next references. Unfortunately, I haven't been able to replicate this behavior in Neovim.

-- This does not work.
local Hydra = require('hydra')
Hydra({
name = 'lsp xref',
mode = 'n',
body = 'x',
heads = {
{'n', function() M.lsp.words.jump(1) end},
{'p', function() M.lsp.words.jump(-1) end},
{ "q", nil, { exit = true, nowait = true } },
},
})

Movement

I use leap.nvim to quickly jump to specific identifiers (s{char1}{char2}), followed by telescope.nvim to explore definitions and references. Sometimes, I use the following binding:

nmap('U', function()
require'hop'.hint_words()
require'telescope.builtin'.lsp_definitions()
end, 'Hop+definition')

Semantic highlighting

I've implemented rainbow semantic highlighting using ccls. Please refer to ccls and LSP Semantic Tokens for my setup.

Other LSP features

I have configured the CursorHold event to trigger textDocument/documentHighlight. When using Emacs, lsp-ui-doc automatically requests textDocument/hover, which I now lose.

Additionally, the LspAttach and BufEnter events trigger textDocument/codeLens.

Window navigation

While I've been content with the traditional C-w + hjkl mapping for years, I've recently opted for the more efficient C-hjkl approach.

nmap('<C-h>', '<C-w>h')
nmap('<C-j>', '<C-w>j')
nmap('<C-k>', '<C-w>k')
nmap('<C-l>', '<C-w>l')

tmap('<C-h>', '<cmd>wincmd h<cr>')
tmap('<C-j>', '<cmd>wincmd j<cr>')
tmap('<C-k>', '<cmd>wincmd k<cr>')
tmap('<C-l>', '<cmd>wincmd l<cr>')

The keys mirror my pane navigation preferences in tmux and zellij, where I utilize M-hjkl.

# tmux select pane or window
bind -n M-h if -F '#{pane_at_left}' 'select-window -p' 'select-pane -L'
bind -n M-j if -F '#{pane_at_bottom}' 'select-window -p' 'select-pane -D'
bind -n M-k if -F '#{pane_at_top}' 'select-window -n' 'select-pane -U'
bind -n M-l if -F '#{pane_at_right}' 'select-window -n' 'select-pane -R'
// zellij M-hjkl
keybinds {
normal clear-defaults=true {
...
bind "Alt h" { MoveFocusOrTab "Left"; }
bind "Alt j" { MoveFocus "Down"; }
bind "Alt k" { MoveFocus "Up"; }
bind "Alt l" { MoveFocusOrTab "Right"; }
}
}

To accommodate this change, I've shifted my tmux prefix key from C-l to C-Space. Consequently, I've also adjusted my input method toggling from C-Space to C-S-Space.

Debugging

For C++ debugging, I primarily rely on cgdb. I find it superior to GDB's single-key mode and significantly more user-friendly than LLDB's gui command.

cgdb --args ./a.out args

rr record ./a.out args
rr replay -d cgdb

I typically arrange Neovim and cgdb side-by-side in tmux or zellij. During single-stepping, when encountering interesting code snippets, I often need to manually input filenames into Neovim. While Telescope aids in this process, automatic file and line updates would be ideal.

Given these considerations, nvim-dap appears to be a promising solution. However, I haven't yet determined the configuration for integrating rr with nvim-dap.

Live grep

I've defined mappings to streamline directory and project-wide searches using Telescope's live grep functionality:

nmap('<leader>sd', '<cmd>lua require("telescope.builtin").live_grep({cwd=vim.fn.expand("%:p:h")})<cr>', 'Search directory')
nmap('<leader>sp', '<cmd>lua require("telescope.builtin").live_grep({cwd=MyProject()})<cr>', 'Search project')

Additionally, I've mapped M-n to insert the word under the cursor, mimicking Emacs Ivy's M-n (ivy-next-history-element) behavior.

Task runner

I use overseer.nvim to run build commands like ninja -C /tmp/Debug llc llvm-mc. This plugin allows me to view build errors directly in Neovim's quickfix window.

Following LazyVim, I use <leader>oo to run builds and <leader>ow to toggle the overseer window. To navigate errors, I use trouble.nvim with the ]q and [q keys.

nmap('<leader>oo', '<cmd>OverseerRun<cr>')
nmap('<leader>ow', '<cmd>OverseerToggle<cr>')

nmap('[q', function()
if require('trouble').is_open() then
require('trouble').prev({ skip_groups = true, jump = true })
else
local ok, err = pcall(vim.cmd.cprev)
if not ok then
vim.notify(err, vim.log.levels.ERROR)
end
end
end)
nmap(']q', function()
if require('trouble').is_open() then
require('trouble').next({ skip_groups = true, jump = true })
else
local ok, err = pcall(vim.cmd.cnext)
if not ok then
vim.notify(err, vim.log.levels.ERROR)
end
end
end)

Reducing reliance on terminal multiplexers

As https://rutar.org/writing/from-vim-and-tmux-to-neovim/ nicely summarizes, running Neovim under tmux has some annoyances. I've been experimenting with reducing my reliance on zellij. Instead, I'll make more use of Neovim's terminal functionality.

toggleterm.nvim is a particularly useful plugin that allows me to easily split windows, open terminals, and hide them when not in use.

The default command <C-\><C-n> (switch to Normal mode) is clumsy. I've mapped it to <C-s> (whose default functions, pausing terminal output and fwd-i-search in zsh, I don't need).

nmap('<leader>tf', function() require'toggleterm'.toggle(vim.v.count, nil, MyProject(), 'float', nil) end)
nmap('<leader>th', function() require'toggleterm'.toggle(vim.v.count, 10, MyProject(), 'horizontal', nil) end)
nmap('<leader>tv', function() require'toggleterm'.toggle(vim.v.count, 80, MyProject(), 'vertical', nil) end)

tmap('<C-s>', '<C-\\><C-n>')
-- Binding C-/ doesn't work in tmux/zellij
map({'n', 't'}, '<C-/>', '<cmd>ToggleTerm<cr>')
-- This actually binds C-/ in tmux/zellij
map({'n', 't'}, '<C-_>', '<cmd>ToggleTerm<cr>')

neovim-remote allows me to open files without starting a nested Neovim process.

I use mini.sessions to manage sessions.

Config switcher

Neovim's NVIM_APPNAME feature is fantastic for exploring pre-configured distributions to get inspiration.

Lua

Neovim embraces Lua 5.1 as a preferred scripting language. While Lua's syntax is lightweight and easy to learn, it doesn't shy away from convenience features like func 'arg' and func {a=42}.

LuaJIT offers exceptional performance.

LuaJIT with the JIT enabled is much faster than all of the other languages benchmarked, including Wren, because Mike Pall is a robot from the future. -- wren.io

This translates into noticeably smoother editing with LSP, especially for hefty C++ files – a significant advantage over Emacs. With Emacs, I've always felt that editing a large C++ file is slow.

The non-default local variables (variables are global unless declared local) and 1-based indexing (shared with languages like Awk and Julia) are annoyances that I can live with in a configuration language. So far, I've only needed index-sensitive looping in one specific location.

-- For LSP semantic tokens
for type, colors in pairs(all_colors) do
for i = 1,#colors do
vim.api.nvim_set_hl(0, string.format('@lsp.typemod.%s.id%s.cpp', type, i-1), {fg=colors[i]})
end
end

ccls and LSP Semantic Tokens

Author: MaskRay
October 20, 2024 15:00

I've spent countless hours writing and reading C++ code. For many years, Emacs has been my primary editor, and I leverage the rainbow semantic highlighting feature of ccls (my C++ language server).

The feature relies on two custom notification messages, $ccls/publishSemanticHighlight and $ccls/publishSkippedRanges. $ccls/publishSemanticHighlight provides a list of symbols, each with kind information (function, type, or variable) of itself and its semantic parent (e.g. a member function's parent is a class), storage duration, and a list of ranges.

struct CclsSemanticHighlightSymbol {
int id = 0;
SymbolKind parentKind;
SymbolKind kind;
uint8_t storage;
std::vector<std::pair<int, int>> ranges;

std::vector<lsRange> lsRanges; // Only used by vscode-ccls
};

struct CclsSemanticHighlight {
DocumentUri uri;
std::vector<CclsSemanticHighlightSymbol> symbols;
};

An editor can use consistent colors to highlight different occurrences of a symbol. Different colors can be assigned to different symbols.

Tobias Pisani created emacs-cquery (the predecessor to emacs-ccls) in Nov 2017. Despite not being a fan of Emacs Lisp, I added the rainbow semantic highlighting feature for my own use in early 2018. My setup also relied heavily on these two settings:

  • Bolding and underlining variables of static storage duration
  • Italicizing member functions and variables
(setq ccls-sem-highlight-method 'font-lock)
(ccls-use-default-rainbow-sem-highlight)

Key symbol properties (member, static) were visually prominent in my Emacs environment.

My Emacs hacking days are a distant memory – beyond basic configuration tweaks, I haven't touched elisp code since 2018. As my Elisp skills faded, I increasingly turned to Neovim for various editing tasks. Naturally, I wanted to migrate my C++ development workflow to Neovim as well. However, a major hurdle emerged: Neovim lacked the beloved rainbow highlighting I enjoyed in Emacs.

Thankfully, Neovim supports "semantic tokens" from LSP 3.16, a standardized approach adopted by many editors.

I've made changes to ccls (available on a branch; PR) to support semantic tokens. This involves adapting the $ccls/publishSemanticHighlight code to additionally support textDocument/semanticTokens/full and textDocument/semanticTokens/range.

I utilize a few token modifiers (static, classScope, functionScope, namespaceScope) for highlighting:

vim.cmd([[
hi @lsp.mod.classScope.cpp gui=italic
hi @lsp.mod.static.cpp gui=bold
hi @lsp.typemod.variable.namespaceScope.cpp gui=bold,underline
]])

While this approach is a significant improvement over relying solely on nvim-treesitter, I'm still eager to implement rainbow semantic tokens. Although LSP semantic tokens don't directly distinguish symbols, we can create custom modifiers to achieve similar results.

tokenModifiers: {
"declaration", "definition", "static", ...

"id0", "id1", ... "id9",
}

In the user-provided initialization options, I set highlight.rainbow to 10.

ccls assigns the same modifier ID to tokens belonging to the same symbol, aiming for unique IDs for different symbols. While we only have a few predefined IDs (each linked to a specific color), there's a slight possibility of collisions. However, this is uncommon and generally acceptable.
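Conceptually, the server-side assignment can be as simple as hashing a stable per-symbol key into one of highlight.rainbow buckets. The sketch below illustrates the idea only; it is not ccls's exact code.

#include <cstdint>
#include <functional>
#include <string>

// Tokens of the same symbol receive the same "idN" modifier; distinct
// symbols usually receive different ones, with occasional collisions.
unsigned rainbowModifier(const std::string &symbolKey, unsigned rainbow = 10) {
  std::uint64_t h = std::hash<std::string>{}(symbolKey);
  return static_cast<unsigned>(h % rainbow); // encoded as modifier "id0".."id9"
}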

For a token with type variable, Neovim's built-in LSP plugin assigns a highlight group @lsp.typemod.variable.id$i.cpp where $i is an integer between 0 and 9. This allows us to customize a unique foreground color for each modifier ID.

local func_colors = {
'#e5b124', '#927754', '#eb992c', '#e2bf8f', '#d67c17',
'#88651e', '#e4b953', '#a36526', '#b28927', '#d69855',
}
local type_colors = {
'#e1afc3', '#d533bb', '#9b677f', '#e350b6', '#a04360',
'#dd82bc', '#de3864', '#ad3f87', '#dd7a90', '#e0438a',
}
local param_colors = {
'#e5b124', '#927754', '#eb992c', '#e2bf8f', '#d67c17',
'#88651e', '#e4b953', '#a36526', '#b28927', '#d69855',
}
local var_colors = {
'#429921', '#58c1a4', '#5ec648', '#36815b', '#83c65d',
'#419b2f', '#43cc71', '#7eb769', '#58bf89', '#3e9f4a',
}
local all_colors = {
class = type_colors,
constructor = func_colors,
enum = type_colors,
enumMember = var_colors,
field = var_colors,
['function'] = func_colors,
method = func_colors,
parameter = param_colors,
struct = type_colors,
typeAlias = type_colors,
typeParameter = type_colors,
variable = var_colors
}
for type, colors in pairs(all_colors) do
for i = 1,#colors do
for _, lang in pairs({'c', 'cpp'}) do
vim.api.nvim_set_hl(0, string.format('@lsp.typemod.%s.id%s.%s', type, i-1, lang), {fg=colors[i]})
end
end
end

vim.cmd([[
hi @lsp.mod.classScope.cpp gui=italic
hi @lsp.mod.static.cpp gui=bold
hi @lsp.typemod.variable.namespaceScope.cpp gui=bold,underline
]])

Now, let's analyze the C++ code above using this configuration.

While the results are visually pleasing, I need help implementing code lens functionality.

Inactive code highlighting

Inactive code regions (skipped ranges in Clang) are typically displayed in grey. While this can be helpful for identifying unused code, it can sometimes hinder understanding the details. I simply disabled the inactive code feature.

#ifdef X
... // colorful
#else
... // normal instead of grey
#endif

Refresh

When opening a large project, the initial indexing or cache loading process can be time-consuming, often leading to empty lists of semantic tokens for the initially opened files. While ccls prioritizes indexing these files, it's unclear how to notify the client to refresh the files. The existing workspace/semanticTokens/refresh request, unfortunately, doesn't accept text document parameters.

In contrast, with $ccls/publishSemanticHighlight, ccls proactively sends the notification after an index update (see main_OnIndexed).

void main_OnIndexed(DB *db, WorkingFiles *wfiles, IndexUpdate *update) {
...

db->applyIndexUpdate(update);

// Update indexed content, skipped ranges, and semantic highlighting.
if (update->files_def_update) {
auto &def_u = *update->files_def_update;
if (WorkingFile *wfile = wfiles->getFile(def_u.first.path)) {
wfile->setIndexContent(g_config->index.onChange ? wfile->buffer_content
: def_u.second);
QueryFile &file = db->files[update->file_id];
// Publish notifications to the file.
emitSkippedRanges(wfile, file);
emitSemanticHighlight(db, wfile, file);
// But how do we send a workspace/semanticTokens/refresh request?????
}
}
}

While the semantic token request supports partial results in the specification, Neovim lacks this implementation. Even if it were implemented, I believe a notification message with a text document parameter would be a more efficient and direct approach.

export interface SemanticTokensParams extends WorkDoneProgressParams,
PartialResultParams {
/**
* The text document.
*/
textDocument: TextDocumentIdentifier;
}

Other clients

emacs-ccls

Once this feature branch is merged, Emacs users can simply remove the following lines:

(setq ccls-sem-highlight-method 'font-lock)
(ccls-use-default-rainbow-sem-highlight)

How to change lsp-semantic-token-modifier-faces to support rainbow semantic tokens in lsp-mode and emacs-ccls?

The general approach is similar to the following, but we need a feature from lsp-mode (https://github.com/emacs-lsp/lsp-mode/issues/4590).

(setq lsp-semantic-tokens-enable t)
(defface lsp-face-semhl-namespace-scope
'((t :weight bold)) "highlight for namespace scope symbols" :group 'lsp-semantic-tokens)
(cl-loop for color in '("#429921" "#58c1a4" "#5ec648" "#36815b" "#83c65d"
"#417b2f" "#43cc71" "#7eb769" "#58bf89" "#3e9f4a")
for i = 0 then (1+ i)
do (custom-declare-face (intern (format "lsp-face-semhl-id%d" i))
`((t :foreground ,color))
"" :group 'lsp-semantic-tokens))
(setq lsp-semantic-token-modifier-faces
`(("declaration" . lsp-face-semhl-interface)
("definition" . lsp-face-semhl-definition)
("implementation" . lsp-face-semhl-implementation)
("readonly" . lsp-face-semhl-constant)
("static" . lsp-face-semhl-static)
("deprecated" . lsp-face-semhl-deprecated)
("abstract" . lsp-face-semhl-keyword)
("async" . lsp-face-semhl-macro)
("modification" . lsp-face-semhl-operator)
("documentation" . lsp-face-semhl-comment)
("defaultLibrary" . lsp-face-semhl-default-library)
("classScope" . lsp-face-semhl-member)
("namespaceScope" . lsp-face-semhl-namespace-scope)
,@(cl-loop for i from 0 to 10
collect (cons (format "id%d" i)
(intern (format "lsp-face-semhl-id%d" i))))
))

vscode-ccls

We require assistance to eliminate the $ccls/publishSemanticHighlight feature and adopt built-in semantic tokens support. Due to the lack of active maintenance for vscode-ccls, I'm unable to maintain this plugin for an editor I don't frequently use.

Misc

I use a trick to switch ccls builds without changing editor configurations.

#!/bin/zsh
#export CCLS_TRACEME=s
export LD_PRELOAD=/usr/lib/libmimalloc.so

type=
[[ -f /tmp/ccls-build ]] && type=$(</tmp/ccls-build)

case $type in
strace)
exec strace -s999 -e read,write -o /tmp/strace.log -f ~/ccls/out/debug/ccls --log-file=/tmp/cc.log -v=1 "$@";;
debug)
exec ~/ccls/out/debug/ccls --log-file=/tmp/cc.log -v=2 "$@";;
release)
exec ~/ccls/out/release/ccls --log-file=/tmp/cc.log -v=1 "$@";;
*)
exec /usr/bin/ccls --log-file=/tmp/cc.log -v=1 "$@";;
esac

Usage:

echo debug > /tmp/ccls-build
nvim # out/debug/ccls is now used

My involvement with LLVM 19

Author: MaskRay
August 18, 2024 15:00

LLVM 19.1 will soon be released. This post provides a summary of my contributions in this release cycle to record my learning progress.

LLVM binary utilities

  • [llvm-readobj,ELF] Support --decompress/-z
  • [llvm-objcopy] Improve help messages
  • [llvm-readelf] Print a blank line for the first hex/string dump
  • [llvm-objcopy] Add --compress-sections
  • [llvm-readelf] Print more information for RELR

Hashing

I optimized the bit mixer used by llvm::DenseMap<std::pair<X, Y>> and llvm::DenseMap<std::tuple<X...>>. llvm/ADT/Hashing.h, used by StringRef hashing and DenseMap, was supposed to be non-deterministic. Despite this, a lot of code relied on a specific iteration order. I made multiple fixes across the code base and landed [Hashing] Use a non-deterministic seed if LLVM_ENABLE_ABI_BREAKING_CHECKS to improve test coverage (e.g. assertion builds) and ensure future flexibility to replace the algorithm.
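The fixes typically had the shape sketched below (illustrative only, not one of the actual patches): instead of emitting entries in DenseMap iteration order, which depends on the hash seed, sort by key first.

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include <string>
#include <vector>

std::vector<std::string>
stableKeys(const llvm::DenseMap<llvm::StringRef, unsigned> &m) {
  llvm::SmallVector<llvm::StringRef, 0> keys;
  for (const auto &kv : m)
    keys.push_back(kv.first);
  // Order by key rather than by hash/iteration order, so the output does
  // not change when the hash seed becomes non-deterministic.
  llvm::sort(keys);
  std::vector<std::string> out;
  for (llvm::StringRef k : keys)
    out.push_back(k.str());
  return out;
}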

The change has a noticeable code size reduction

# old
movq _ZN4llvm7hashing6detail19fixed_seed_overrideE@GOTPCREL(%rip), %rax
movq (%rax), %rax
testq %rax, %rax
movabsq $-49064778989728563, %rcx # imm = 0xFF51AFD7ED558CCD
cmoveq %rcx, %rax

# new
movabsq $-49064778989728563, %rcx

... and a significant compile time improvement.

I optimized DenseMap::{find,erase}, yielding a compile time improvement.

Optimizations to the bit mixer in Hashing.h and the DenseMap code have yielded significant benefits, reducing both compile time and code size. This suggests there's further potential for improvement in this area.

However, the reduced code size also highlights the potential for a significant code size increase if we were to adopt faster unordered map implementations like boost::unordered_flat_map, Abseil's SwissTable, or Folly's F14. While these libraries may offer better performance, they often come with a significant increase in code complexity and size.

Introducing a new container alongside DenseMap to selectively replace performance-critical instances could lead to substantial code modifications. This approach requires careful consideration to balance potential performance gains with the additional complexity.

NumericalStabilitySanitizer

NumericalStabilitySanitizer is a new feature for the 19.x releases. I have made many changes to the compiler-rt part.

Clang

Driver maintenance

Options used by the LLVM integrated assembler are currently handled in an ad-hoc way. There is duplication between the LTO and non-LTO code paths. Eventually we might want to adopt TableGen for these -Wa, options.

Others:

Code review

I reviewed a wide range of patches, including areas like ADT/Support, binary utilities, MC, lld, clangDriver, LTO, sanitizers, LoongArch, RISC-V, and new features like NumericalStabilitySanitizer and RealTimeSanitizer.

To quantify my involvement, a search for patches I commented on (repo:llvm/llvm-project is:pr -author:MaskRay commenter:MaskRay created:>2024-01-23) yields 780 results.

Link: My involvement with LLVM 18

lld 19 ELF changes

Author: MaskRay
August 4, 2024 15:00

LLVM 19 will be released. As usual, I maintain lld/ELF and have added some notes to https://github.com/llvm/llvm-project/blob/release/19.x/lld/docs/ReleaseNotes.rst. I've meticulously reviewed nearly all the patches that are not authored by me. I'll delve into some of the key changes.

  • Experimental CREL relocations with explicit addends are now supported using the temporary section type code 0x40000020 (clang -c -Wa,--crel,--allow-experimental-crel). LLVM will change the code and break compatibility (Clang and lld of different versions are not guaranteed to cooperate, unlike other features). CREL with implicit addends is not supported. (#98115)
  • EI_OSABI in the output is now inferred from input object files. (#97144)
  • --compress-sections <section-glob>={none,zlib,zstd}[:level] is added to compress matched output sections without the SHF_ALLOC flag. (#84855) (#90567)
  • The default compression level for zlib is now independent of linker optimization level (Z_BEST_SPEED).
  • zstd compression parallelism no longer requires ZSTD_MULTITHREAD build.
  • GNU_PROPERTY_AARCH64_FEATURE_PAUTH notes, R_AARCH64_AUTH_ABS64 and R_AARCH64_AUTH_RELATIVE relocations are now supported. (#72714)
  • --no-allow-shlib-undefined now rejects non-exported definitions in the def-hidden.so ref.so case. (#86777)
  • --debug-names is added to create a merged .debug_names index from input .debug_names sections. Type units are not handled yet. (#86508)
  • The --enable-non-contiguous-regions option allows automatically packing input sections into memory regions by spilling to later matches if a region would overflow. This reduces the toil of manually packing regions (typical for embedded). It also makes full LTO feasible in such cases, since IR merging currently prevents the linker script from referring to input files. (#90007)
  • --default-script/-dT is implemented to specify a default script that is processed if --script/-T is not specified. (#89327)
  • --force-group-allocation is implemented to discard SHT_GROUP sections and combine relocation sections if their relocated section group members are placed to the same output section. (#94704)
  • --build-id now defaults to generating a 20-byte digest ("sha1") instead of an 8-byte one ("fast"). This improves compatibility with RPM packaging tools. (#93943)
  • -z lrodata-after-bss is implemented to place .lrodata after .bss. (#81224)
  • --export-dynamic no longer creates dynamic sections for -no-pie static linking.
  • --lto-emit-asm is now added as the canonical spelling of --plugin-opt=emit-asm.
  • --lto-emit-llvm now uses the pre-codegen module. (#97480)
  • When AArch64 PAuth is enabled, -z pack-relative-relocs now encodes R_AARCH64_AUTH_RELATIVE relocations in .rela.auth.dyn. (#96496)
  • -z gcs and -z gcs-report are now supported for the AArch64 Guarded Control Stack extension.
  • -r now forces -Bstatic.
  • Thumb2 PLT is now supported for Cortex-M processors. (#93644)
  • DW_EH_sdata4 of addresses larger than 0x80000000 is now supported for MIPS32. (#92438)
  • Certain unknown section types are rejected. (#85173)
  • PROVIDE(lhs = rhs) PROVIDE(rhs = ...): lhs is now defined only if rhs is needed. (#74771) (#87530)
  • OUTPUT_FORMAT(binary) is now supported. (#98837)
  • NOCROSSREFS and NOCROSSREFS_TO commands are now supported to prohibit cross references between certain output sections. (#98773)
  • Orphan placement is refined to prefer the last similar section when its rank <= orphan's rank. (#94099) Non-alloc orphan sections are now placed at the end. (#94519)
  • R_X86_64_REX_GOTPCRELX of the addq form is no longer incorrectly optimized when the address is larger than 0x80000000.

CREL

I've developed CREL (compact relocations) to reduce relocatable file sizes tremendously for LLVM 19. LLD now supports CREL with explicit addends. Clang and lld of different versions are not guaranteed to cooperate, unlike other features.

See Integrated assembler improvements in LLVM 19 for details.

--compress-sections

The --compress-sections option has been enhanced. You can choose between zlib and zstd for compression, along with specifying the desired compression level. Looking ahead, zlib is deprecated in favor of zstd. While zstd offers additional tuning options, we only provide the compression level.

My post Compressed arbitrary sections analyzes potential use cases.

Orphan sections

My post Understanding orphan sections explains the changes in detail.

Linker scripts

There are quite a few enhancements to linker script support. NOCROSSREFS and --enable-non-contiguous-regions are noteworthy new features. There is now an increasing demand for features for embedded programming.

The world of embedded programming is a fascinating mix of open and closed ecosystems. Developers of proprietary hardware and closed-source software are increasingly interested in migrating their toolchains to the LLVM Linker (LLD). The allure of faster link speeds, a clean codebase, and seamless LTO integration is undeniable. However, as LLD's maintainer, I must tread carefully. While accommodating these users is nice for LLD's growth, incorporating custom linker extensions risks compromising the project's code quality and maintainability. Striking the right balance between flexibility and code integrity is essential to ensure LLD remains a robust and efficient linker for a wide range of users.

GNU ld also supports extensions for embedded programming. I categorize these extensions into two groups: mature and experimental. Many of the established extensions exhibit well-defined semantics and have been incorporated into LLD. However, some newer extensions in GNU ld appear less thoughtfully designed and inflexible.

When considering a specific extension, we should prioritize practical needs over arbitrary adherence to GNU ld's implementation. If compelling reasons justify a particular feature and GNU ld's approach proves restrictive, we should feel empowered to innovate within LLD.

Conversely, when developing new extensions, it's essential to engage with the broader community. I often submit feature requests to GNU ld to inform decisions we are going to make. I believe this collaborative approach fosters knowledge sharing.


There is no performance-specific change.

In the future, we should refactor RelocationScanner::scanOne to make Arch/*.cpp drive the relocation process, removing the virtual function overhead.

Link: lld 18 ELF changes

Mapping symbols: rethinking for efficiency

Author: MaskRay
July 21, 2024 15:00

In object files, certain code patterns embed data within instructions, or transitions occur between instruction sets. This can create hurdles for disassemblers, which might misinterpret data as code, resulting in inaccurate output. Furthermore, code written for one instruction set could be incorrectly disassembled as another. To address these issues, some architectures (Arm, C-SKY, NDS32, RISC-V, etc) define mapping symbols to explicitly denote state transitions. Let's explore this concept using an AArch32 code example:

.text
adr r0, .LJTI0_0
vldr d0, .LCPI0_0
bl thumb_callee
.LBB0_1:
nop
.LBB0_2:
nop

.LCPI0_0:
.long 3367254360 @ double 1.234
.long 1072938614

nop

.LJTI0_0:
.long .LBB0_1
.long .LBB0_2

.thumb
.type thumb_callee, %function
thumb_callee:
bx lr

Jump tables (.LJTI0_0): Jump tables can reside in either data or text sections, each with its trade-offs. Here we see a jump table in the text section (MachineJumpTableInfo::EK_Inline in LLVM), allowing a single instruction to take its address. Other architectures generally prefer to place jump tables in data sections. While avoiding data in code, RISC architectures typically require two instructions to materialize the address, since the text/data distance can be pretty large.

Constant pool (.LCPI0_0): The vldr instruction loads an 8-byte floating-point literal into the SIMD&FP register.

ISA transition: This code blends A32 and T32 instructions (the latter used in thumb_callee).

In these cases, a dumb disassembler might treat data as code and try disassembling it as instructions. Assemblers create mapping symbols to assist disassemblers. For this example, the assembled object file looks like the following:

$a:
...

$d:
.long 3367254360 @ double 1.234
.long 1072938614

$a:
nop

$d:
.long .LBB0_1
.long .LBB0_2
.long .LBB0_3

$t:
thumb_callee:
bx lr

Toolchain

Now, let's delve into how mapping symbols are managed within the toolchain.

Disassemblers

llvm-objdump sorts symbols, including mapping symbols, relative to the current section, presenting interleaved labels and instructions. Mapping symbols act as signals for the disassembler to switch states.

% llvm-objdump -d --triple=armv7 --show-all-symbols a.o

a.o: file format elf32-littlearm

Disassembly of section .text:

00000000 <$a.0>:
0: e28f0018 add r0, pc, #24
4: ed9f0b02 vldr d0, [pc, #8] @ 0x14
8: ebfffffe bl 0x8 @ imm = #-0x8
c: e320f000 hint #0x0
10: e320f000 hint #0x0

00000014 <$d.1>:
14: 58 39 b4 c8 .word 0xc8b43958
18: 76 be f3 3f .word 0x3ff3be76

0000001c <$a.2>:
1c: e320f000 nop

00000020 <$d.3>:
20: 0c 00 00 00 .word 0x0000000c
24: 10 00 00 00 .word 0x00000010

00000028 <$t.4>:
00000028 <thumb_callee>:
28: 4770 bx lr

I changed llvm-objdump 18 to not display mapping symbols as labels unless --show-all-symbols is specified.

nm

Both llvm-nm and GNU nm typically conceal mapping symbols alongside STT_FILE and STT_SECTION symbols. However, you can reveal these special symbols using the --special-syms option.

% cat a.s
foo:
bl thumb_callee
.long 42
.thumb
thumb_callee:
bx lr
% clang --target=arm-linux-gnueabi -c a.s
% llvm-nm a.o
00000000 t foo
00000008 t thumb_callee
% llvm-nm --special-syms a.o
00000000 t $a.0
00000004 t $d.1
00000008 t $t.2
00000000 t foo
00000008 t thumb_callee

GNU nm behaves similarly, but with a slight quirk. If the default BFD target isn't AArch32, mapping symbols are displayed even without --special-syms.

% arm-linux-gnueabi-nm a.o
00000000 t foo
00000008 t thumb_callee
% nm a.o
00000000 t $a.0
00000004 t $d.1
00000008 t $t.2
00000000 t foo
00000008 t thumb_callee

Symbolizers

Mapping symbols, being non-unique and lacking descriptive names, are intentionally omitted by symbolizers like addr2line and llvm-symbolizer. Their primary role lies in guiding the disassembly process rather than providing human-readable context.

Size problem: symbol table bloat

While mapping symbols are useful, they can significantly inflate the symbol table, particularly in 64-bit architectures (sizeof(Elf64_Sym) == 24) with larger programs. This issue becomes more pronounced when using -ffunction-sections -fdata-sections, which generates numerous small sections.

% cat a.c
void f0() {}
void f1() {}
void f2() {}
int d1 = 1;
int d2 = 2;
% clang -c --target=aarch64 -ffunction-sections -fdata-sections a.c
% llvm-objdump -d --show-all-symbols a.o # GNU objdump --show-all-symbols does not display mapping symbols

a.o: file format elf64-littleaarch64

Disassembly of section .text.f0:

0000000000000000 <$x>:
0000000000000000 <f0>:
0: d65f03c0 ret

Disassembly of section .text.f1:

0000000000000000 <$x>:
0000000000000000 <f1>:
0: d65f03c0 ret

Disassembly of section .text.f2:

0000000000000000 <$x>:
0000000000000000 <f2>:
0: d65f03c0 ret
% llvm-readelf -sX a.o

Symbol table '.symtab' contains 16 entries:
Num: Value Size Type Bind Vis+Other Ndx(SecName) Name [+ Version Info]
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS a.c
2: 0000000000000000 0 SECTION LOCAL DEFAULT 3 (.text.f0) .text.f0
3: 0000000000000000 0 NOTYPE LOCAL DEFAULT 3 (.text.f0) $x
4: 0000000000000000 0 SECTION LOCAL DEFAULT 4 (.text.f1) .text.f1
5: 0000000000000000 0 NOTYPE LOCAL DEFAULT 4 (.text.f1) $x
6: 0000000000000000 0 SECTION LOCAL DEFAULT 5 (.text.f2) .text.f2
7: 0000000000000000 0 NOTYPE LOCAL DEFAULT 5 (.text.f2) $x
8: 0000000000000000 0 NOTYPE LOCAL DEFAULT 6 (.data) $d
9: 0000000000000000 0 NOTYPE LOCAL DEFAULT 7 (.comment) $d
10: 0000000000000000 0 NOTYPE LOCAL DEFAULT 9 (.eh_frame) $d
11: 0000000000000000 4 FUNC GLOBAL DEFAULT 3 (.text.f0) f0
12: 0000000000000000 4 FUNC GLOBAL DEFAULT 4 (.text.f1) f1
13: 0000000000000000 4 FUNC GLOBAL DEFAULT 5 (.text.f2) f2
14: 0000000000000000 4 OBJECT GLOBAL DEFAULT 6 (.data) d1
15: 0000000000000004 4 OBJECT GLOBAL DEFAULT 6 (.data) d2

Except in trivial cases (e.g. an empty section), in both the GNU assembler's and the LLVM integrated assembler's AArch64 ports:

  • A non-text section (data, debug, etc) almost always starts with an initial $d.
  • A text section almost always starts with an initial $x. The ABI requires a mapping symbol at offset 0.

These behaviors ensure that each function or data symbol has a corresponding mapping symbol, while extra mapping symbols might occur in rare cases. Therefore, the proportion of mapping symbols in the output symbol table usually exceeds 50%.

Most text sections have 2 or 3 symbols:

  • A STT_FUNC symbol.
  • A STT_SECTION symbol due to a reference from .eh_frame. This symbol is absent with -fno-asynchronous-unwind-tables.
  • A $x mapping symbol.

During the linking process, the linker combines input sections andeliminates STT_SECTION symbols.

Note: LLVM integrated assemblers used to create unique$x.<digit> due to an assembler limitation. I haveupdated LLVM 19 to drop.<digit> suffixes.

In LLVM's ARM port, data sections do not have mapping symbols, unless there are A32 or T32 instructions (D30724).

Alternative mapping symbol scheme

I have proposed an alternative scheme to address the size concern.

  • Text sections: Assume an implicit $x at offset 0. Add an ending $x if the final data isn't instructions.
  • Non-text sections: Assume an implicit $d at offset 0. Add an ending $d only if the final data isn't data directives.

This approach eliminates most mapping symbols while ensuring correct disassembly. Here is an illustrative assembly example:

1
2
3
4
5
6
7
8
.section .text.f0,"ax"
ret
// emit $d
.long 42
// emit $x. Without this, .text.f1 might be interpreted as data.

.section .text.f1,"ax"
ret

The ending mapping symbol ensures that the subsequent section in the linker output starts with the desired state. The data-in-code case is extremely rare for AArch64, as jump tables are placed in .rodata.

Impressive results

I have developed LLVM patches to add an opt-in option, clang -Wa,-mmapsyms=implicit. Experiments with a Clang build using this alternative scheme have shown impressive results, eliminating over 50% of symbol table entries.

1
2
3
4
.o size   | build | scheme
261345384 | a64-0 | standard
260443728 | a64-1 | optimizing $d
254106784 | a64-2 | optimizing both $d and $x
1
2
3
4
5
6
% bloaty a64-2/bin/clang -- a64-0/bin/clang
FILE SIZE VM SIZE
-------------- --------------
-5.4% -1.13Mi [ = ] 0 .strtab
-50.9% -4.09Mi [ = ] 0 .symtab
-4.0% -5.22Mi [ = ] 0 TOTAL

However, omitting a mapping symbol at offset 0 for sections with instructions is currently non-conformant. An ABI update has been requested to address this, though an update is unlikely in the near term due to the lack of GNU toolchain support and interoperability concerns.

I'll elaborate on the interoperability concerns below. Note that they're unlikely to impact the majority of users.

For instance, if a text section with trailing data is assembled using the traditional behavior, the last mapping symbol will be $d. When linked with another text section assembled using the new behavior (lacking an initial $x), disassemblers might misinterpret the start of the latter section as data.

Similarly, linker scripts that combine non-text and text sections could lead to text sections appearing in a data state.

1
2
3
4
SECTIONS {
...
mix : { *(.data.*) *(.text.foo) }
}

However, many developers would classify these scenarios as errorconditions.

A text section may rarely start with data directives (e.g., -fsanitize=function, LLVM prefix data). When the linker combines two such sections, the ending $x of the first section and the initial $d of the second might have the same address.

1
2
3
4
5
6
7
8
9
.section .text.0, "ax"
// $d
.word 0
// ending $x

.section .text.1, "ax"
// initial $d; may have the same address as the ending $x above
.word 0
// $x

In a straightforward implementation, symbols are stable-sorted by address and the last symbol at an address wins. Ideally we want $d $x $d $x. If the sections are in different files, a linker that respects input order naturally achieves this. If they're in the same file, the assembler should output $d $x $d $x instead of $d $d $x $x. This works if .text.0 precedes .text.1 in the linker output, but the other section order might be unexpected. In the worst case, where the linker's section order mismatches the assembler's section order (--symbol-ordering-file=, --call-graph-profile-sort, linker scripts), the initial data directives could be mistakenly identified as code. The code that follows won't be, though, making this an acceptable risk for certain users.

Teaching linkers to scan and insert missing mapping symbols is technically possible but inelegant and impacts performance. There's a strong emphasis on the philosophy of "smart format, dumb linker," which favors keeping the format itself intelligent and minimizing the complexity of the linker.

Ultimately, the proposed alternative scheme effectively addresses symbol table bloat, but requires careful consideration for compliance and interoperability. With this optimization enabled, the remaining symbols would primarily stem from range extension thunks, prebuilt libraries, or highly specialized assembly code.

Mapping symbols for range extension thunks

When lld creates an AArch64 range extension thunk, it defines a $x symbol to signify the A64 state. This symbol is only relevant when the preceding section ends in the data state, a scenario that's only possible with the traditional assembler behavior.

Given the infrequency of range extension thunks, the $xsymbol overhead is generally tolerable.

Peculiar alignment behavior in GNU assembler

In contrast to LLVM's integrated assembler, which restricts state transitions to instructions and data directives, GNU assembler introduces additional state transitions for alignments. These alignments can be either implicit (arising from alignment requirements) or explicit (specified through directives). This behavior has led to some interesting edge cases and bug fixes over time. (See related code beside [PATCH][GAS][AARCH64] Fix "align directive causes MAP_DATA symbol to be lost" https://sourceware.org/bugzilla/show_bug.cgi?id=20364)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
.section .foo1,"a"
// no $d
.word 0

.section .foo2,"a"
// $d
.balign 4
.word 0

.section .foo3,"a"
// $d
.word 0
// $a
nop

In the example, .foo1 only contains data directives and there is no $d. However, .foo2 includes an alignment directive, triggering the creation of a $d symbol. Interestingly, .foo3 starts with data but ends with an instruction, necessitating both a $d and an $a mapping symbol.

It's worth noting that DWARF sections, typically generated by thecompiler, don't include explicit alignment directives. They behavesimilarly to the .foo1 example and lack an associated$d mapping symbol.

AArch32 ld --be8

The BE-8 mode (byte-invariant addressing big-endian mode) requires the linker to convert big-endian code to little-endian. This is implemented by scanning mapping symbols. See Linker notes on AArch32#--be8 for context.

RISC-V ISA extension

RISC-V mapping symbols are similar to AArch64, but with a notableextension:

1
2
$x<ISA>       | Start of a sequence of instructions with <ISA> extension.
$x<ISA>.<any>

The alternative scheme for optimizing symbol table size can be adapted to accommodate RISC-V's $x<ISA> symbols. The approach remains the same: add an ending $x<ISA> only if the final data in a text section doesn't belong to the desired ISA.

This adaptation works seamlessly as long as all relocatable files provided to the linker share the same baseline ISA. However, in scenarios where the relocatable files are more heterogeneous, a crucial question arises: which state should be restored at the section end? Would the subsequent section in the linker output be compiled with different ISA extensions?

Technically, we could teach linkers to insert $xsymbols, but scanning each input text section isn't elegant.

Mach-O LC_DATA_IN_CODE load command

In contrast to ELF's symbol pair approach, Mach-O employs the LC_DATA_IN_CODE load command to store non-instruction ranges within code sections. This method is remarkably compact, with each entry requiring only 8 bytes. ELF, on the other hand, needs two symbols ($d and $x) per data region, consuming 48 bytes (in ELFCLASS64) in the symbol table.

1
2
3
4
5
struct data_in_code_entry {
uint32_t offset; /* from mach_header to start of data range*/
uint16_t length; /* number of bytes in data range */
uint16_t kind; /* a DICE_KIND_* value */
};

In llvm-project, the possible kind values are defined in llvm/include/llvm/BinaryFormat/MachO.h. I recently refactored the generic MCAssembler to move this Mach-O specific handling, alongside others, into MachObjectWriter.

1
2
3
4
5
6
7
8
enum DataRegionType {
// Constants for the "kind" field in a data_in_code_entry structure
DICE_KIND_DATA = 1u,
DICE_KIND_JUMP_TABLE8 = 2u,
DICE_KIND_JUMP_TABLE16 = 3u,
DICE_KIND_JUMP_TABLE32 = 4u,
DICE_KIND_ABS_JUMP_TABLE32 = 5u
};

Achieving Mach-O's efficiency in ELF

Given ELF's symbol table bloat due to the st_size member (my previous analysis), how can it attain Mach-O's level of efficiency? Instead of introducing a new format, we can leverage a standard ELF feature: SHF_COMPRESSED.

Both .symtab and .strtab lack the SHF_ALLOC flag, making them eligible for compression without requiring any changes to the ELF specification.
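As background (not the proposal itself), a SHF_COMPRESSED section simply starts with a compression header followed by the compressed payload; a minimal C sketch using the glibc <elf.h> definitions:

#include <elf.h>     /* Elf64_Chdr, ELFCOMPRESS_*, SHF_COMPRESSED */
#include <stdio.h>

int main(void) {
  /* A compressed .symtab would begin with this 24-byte header,
     recording the uncompressed size and alignment. */
  Elf64_Chdr chdr = {
      .ch_type = ELFCOMPRESS_ZSTD, /* ELFCOMPRESS_ZLIB on older systems */
      .ch_size = 4096,             /* uncompressed section size (example) */
      .ch_addralign = 8,           /* alignment of the uncompressed data */
  };
  printf("header size: %zu\n", sizeof chdr); /* 24 on ELFCLASS64 */
}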

  • LLVM discussion
  • A feature request has already been submitted to binutils to explore this possibility.

The implementation within LLVM shouldn't be overly complex, and I'mmore than willing to contribute if there's interest from thecommunity.

Linker compatibility and the "User-Agent" problem

作者 MaskRay
2024年7月7日 15:00

The output of ld.lld -v includes the message "compatible with GNU linkers" to address the detection mechanism used by GNU Libtool. This problem is described by Software compatibility and our own "User-Agent" problem.

The latest m4/libtool.m4 continues to rely on a GNU check.

1
2
3
4
5
6
7
8
9
10
[AC_CACHE_CHECK([if the linker ($LD) is GNU ld], lt_cv_prog_gnu_ld,
[# I'd rather use --version here, but apparently some GNU lds only accept -v.
case `$LD -v 2>&1 </dev/null` in
*GNU* | *'with BFD'*)
lt_cv_prog_gnu_ld=yes
;;
*)
lt_cv_prog_gnu_ld=no
;;
esac])

Check-based configuration can be a valuable tool, ensuring software remains functional in the future. However, this example highlights how overly specific checks can lead to unintended consequences.

If Libtool needs to check whether certain options are available, itcan utilize -v.

1
2
3
4
5
6
7
% ld.bfd -v --whole-archive
GNU ld (GNU Binutils) 2.42.0
% ld.bfd -v --whole-archivex; echo $?
GNU ld (GNU Binutils) 2.42.0
ld.bfd: unrecognized option '--whole-archivex'
ld.bfd: use the --help option for usage information
1

This blog post explores more forms of the "User-Agent" problemexposed by an LLD patch changing the version message format.

LLD supports many object file formats. It largely emulates the behavior of GNU ld for ELF, while emulating the behavior of MSVC link.exe for PE/COFF. Previously, LLD's ELF port displayed the version information like this:

1
2
% /tmp/out/custom2/bin/ld.lld --version
LLD 19.0.0 (compatible with GNU linkers)

A recent patch (llvm-project#97323) changed it to one of the following formats, depending on the build-time variable LLVM_APPEND_VC_REV:

With LLVM_APPEND_VC_REV=on:

1
2
% /tmp/out/custom2/bin/ld.lld --version
LLD 19.0.0 (git@github.com:llvm/llvm-project.git 0f9fbbb63cfcd2069441aa2ebef622c9716f8dbb), compatible with GNU linkers

With LLVM_APPEND_VC_REV=off:

1
2
% /tmp/out/custom2/bin/ld.lld --version
LLD 19.0.0, compatible with GNU linkers

Meson

In Meson, mesonbuild/linkers/detect.py:guess_win_linker checks the --version output to determine whether the LLD invocation is for ELF or PE/COFF. It performed an overly strict check for "(compatible with GNU linkers)", which failed when the parentheses were stripped by #97323.

1
2
3
4
5
6
7
8
9
10
# mesonbuild/linkers/detect.py
if 'LLD' in o.split('\n', maxsplit=1)[0]:
if '(compatible with GNU linkers)' in o:
return linkers.LLVMDynamicLinker(
compiler, for_machine, comp_class.LINKER_PREFIX,
override, version=search_version(o))
elif not invoked_directly:
return linkers.ClangClDynamicLinker(
for_machine, override, exelist=compiler, prefix=comp_class.LINKER_PREFIX,
version=search_version(o), direct=False, machine=None)

The latest Meson has loosened the check (meson#13383).

It seems that the linker detection has a larger problem: --target= is not taken into account with Clang (#6662).

Linux kernel

The Linux kernel's scripts/ld-version.sh script detects linker versions. Introduced in 2014, it initially checked for GNU ld compatibility with GCC LTO (though LTO support remains unmerged). It was later revamped to handle LLD versions as well. While it can handle suffixes like 2.34-4.fc32, it struggles with versions containing a comma suffix (19.0.0,).

1
2
% scripts/ld-version.sh /tmp/out/custom2/bin/ld.lld
scripts/ld-version.sh: line 19: 10000 * 19 + 100 * 0 + 0,: syntax error: operand expected (error token is ",")

The script extracts the version string from the--version output and parses it as major.minor.patch.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Get the first line of the --version output.
IFS='
'
set -- $(LC_ALL=C "$@" --version)

# Split the line on spaces.
IFS=' '
set -- $1

...

# Some distributions append a package release number, as in 2.34-4.fc32
# Trim the hyphen and any characters that follow.
version=${version%-*}

To support suffixes starting with either - or ,, the script will employ a POSIX shell trick utilizing the "Remove Largest Suffix Pattern" feature:

1
version=${version%%[!0-9.]*}

More fun with versions

llvm-nm and llvm-objcopy also claim GNU compatibility.

1
2
3
4
5
6
7
8
9
10
% /tmp/Rel/bin/llvm-nm --version
llvm-nm, compatible with GNU nm
LLVM (http://llvm.org/):
LLVM version 19.0.0git
Optimized build with assertions.
% /tmp/Rel/bin/llvm-objcopy --version
llvm-objcopy, compatible with GNU objcopy
LLVM (http://llvm.org/):
LLVM version 19.0.0git
Optimized build with assertions.

Ever wondered what the subtle differences are between -v, -V, and --version when using GNU ld? Let's break it down:

  • --version skips linker input processing and displays brief copyright information.
  • -v and -V keep processing command-line arguments and perform the linking step. This behavior gives an easy way to check whether an option is supported.
  • -V goes a step further than -v by including a list of supported BFD emulations alongside the version information.

Prior to September 2022, -V in ld.lld used to be an alias for --version. This caused issues when using gcc -v -fuse-ld=lld on certain targets like *-freebsd and powerpc-*: gcc passes -V to the linker, expecting it to process the input files and complete the linking step. However, ld.lld's behavior with -V skipped this process.

I made an adjustment by making -V an alias for -v instead. This ensures that gcc -v -fuse-ld=lld performs the linking step.

GCC has a similar -v and --versionbehavior, but -V does not exist.

Clang's GNU driver emulates GCC 4.2.1, but you can change the versionwith -fgnuc-version=.

1
2
3
4
5
6
7
8
9
10
% clang -E -dM -xc /dev/null | grep GNU
#define __GNUC_MINOR__ 2
#define __GNUC_PATCHLEVEL__ 1
#define __GNUC_STDC_INLINE__ 1
#define __GNUC__ 4
% clang -E -dM -xc /dev/null -fgnuc-version=5.3.2 | grep GNU
#define __GNUC_MINOR__ 3
#define __GNUC_PATCHLEVEL__ 2
#define __GNUC_STDC_INLINE__ 1
#define __GNUC__ 5

Integrated assembler improvements in LLVM 19

作者 MaskRay
2024年6月30日 15:00

Within the LLVM project, MC is a library responsible for handling assembly, disassembly, and object file formats. Intro to the LLVM MC Project, which was written back in 2010, remains a good source to understand the high-level structures.

In the latest release cycle, substantial effort has been dedicated to refining MC's internal representation for improved performance and readability. These changes have decreased compile time significantly. This blog post will delve into the details, providing insights into the specific changes.

Merged MCAsmLayout into MCAssembler

MCAssembler manages assembler states (including sections and symbols) and implements post-parsing passes (computing a layout and writing an object file). MCAsmLayout, tightly coupled with MCAssembler, was in charge of symbol and fragment offsets during MCAssembler::Finish. MCAsmLayout was a wrapper of MCAssembler and a section order vector (actually Mach-O specific). Many MCAssembler and MCExpr member functions have a const MCAsmLayout & parameter, contributing to slight overhead. Here are some functions that are called frequently:

  • MCAssembler::computeFragmentSize is called a lot in the layout process.
  • MCAsmBackend::handleFixup and MCAsmBackend::applyFixup evaluate each fixup and produce relocations.
  • MCAssembler::fixupNeedsRelaxation determines whether a MCRelaxableFragment needs relaxation due to a MCFixup.
  • MCAssembler::relaxFragment and MCAssembler::relaxInstruction relax a fragment.

I started to merge MCAsmLayout into MCAssembler and simplify MC code, and eventually removed llvm/include/llvm/MC/MCAsmLayout.h.

Fragments

Fragments represent sequences of non-relaxable instructions, a relaxable instruction, alignment directives, and other elements. MCDataFragment and MCRelaxableFragment, whose sizes are crucial for memory consumption, have undergone several optimizations.

The fragment management system has also been streamlined by transitioning from a doubly-linked list (llvm::iplist) to a singly-linked list, eliminating unnecessary overhead. A few prerequisite commits removed backward iterator requirements.

Furthermore, I introduced the "current fragment" concept (MCStreamer::CurFrag), allowing for faster appending of new fragments.

I have also simplified and optimized fragment offset computation:

  • [MC] Relax fragments eagerly

Previously, fragment offset computation happened lazily in the getFragmentOffset function. All sections were iteratively relaxed until they all converged. This process was inefficient, as the slowest-converging section determined the number of iterations for all others, resulting in extra calculations.

The new layout algorithm assigns fragment offsets and iteratively refines them for each section until it converges. Then, it moves on to the next section. If relaxation doesn't change anything, fragment offset assignment will be skipped. This way, sections that converge quickly don't have to wait for the slowest ones, resulting in a significant decrease in compile time for full LTO.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
bool MCAssembler::relaxOnce() {
bool ChangedAny = false;
for (MCSection &Sec : *this) {
auto MaxIter = NumFrags + 1;
uint64_t OldSize = getSectionAddressSize(Sec);
bool Changed; // whether any fragment offset changed in this iteration
do {
uint64_t Offset = 0;
Changed = false;
for (MCFragment &F : Sec) {
if (F.Offset != Offset) {
F.Offset = Offset;
Changed = true;
}
relaxFragment(F);
Offset += computeFragmentSize(F);
}

Changed |= OldSize != Offset;
ChangedAny |= Changed;
OldSize = Offset;
} while (Changed && --MaxIter);
if (MaxIter == 0)
return false;
}
return ChangedAny;
}

Symbols

@aengelke made two noticeable performance improvements:

In MCObjectStreamer, newly defined labels were put into a "pending label" list and initially assigned to a MCDummyFragment associated with the current section. The symbols would be reassigned to a new fragment when the next instruction or directive was parsed. This pending label system, while necessary for aligned bundling, introduced complexity and potential for subtle bugs.

To streamline this, I revamped the implementation by directly adjusting offsets of existing fragments, eliminating over 100 lines of code and reducing the potential for errors.

Details: In 2014, [MC] Attach labels to existing fragments instead of using a separate fragment introduced flushPendingLabels for the aligned bundling assembler extension for Native Client. [MC] Match labels to existing fragments even when switching sections., built on top of flushPendingLabels, added further complication.

In MCObjectStreamer, a newly defined label was temporarily assigned to a MCDummyFragment. The symbol would be reassigned to a new fragment when the next instruction or directive was parsed. The MCDummyFragment was not in the section's fragment list. However, during expression evaluation, it should be considered as the temporary end of the section.

For the following code, aligned bundling requires that .Ltmp0 is defined at the addl instruction.

1
2
3
4
5
6
7
8
9
$ clang var.c -S -o - -fPIC -m32
...
.bundle_lock align_to_end
calll .L0$pb
.bundle_unlock
.L0$pb:
popl %eax
.Ltmp0:
addl $_GLOBAL_OFFSET_TABLE_+(.Ltmp0-.L0$pb), %eax

Worse, a lot of directive handling code had to call flushPendingLabels, and a missing flushPendingLabels call could lead to subtle bugs related to incorrect symbol values.

(MCAsmStreamer doesn't call flushPendingLabels in its handlers. This is the reason that we cannot change MCAsmStreamer::getAssemblerPtr to use a MCAssembler and change AsmParser::parseExpression.)

Sections

Section handling was also refined. MCStreamer maintains a section stack for features like the .pushsection/.popsection/.previous directives. Many functions relied on the section stack for loading the current section, which introduced overhead due to the additional indirection and nullable return values.

By leveraging the "current fragment" concept, the need for the section stack was eliminated in most cases, simplifying the codebase and improving efficiency.

I have eliminated nullable getCurrentSectionOnly uses and changed getCurrentSectionOnly to leverage the "current fragment" concept. This change also revealed an interesting quirk in NVPTX assembly related to DWARF sections.

Section symbols

Many section creation functions (MCContext::get*Section) had a const char *BeginSymName parameter to support the section symbol concept. This led to issues when we wanted to treat the section name as a symbol. In 2017, the parameter was removed for ELF, streamlining section symbol handling.

I changed the way MC handles section symbols for COFF and removed the unused parameters for WebAssembly. The work planned for XCOFF is outlined in https://github.com/llvm/llvm-project/issues/96810.

Expression evaluation

Expression evaluation in MCAssembler::layout previously employed a complex lazy evaluation algorithm, which aimed to minimize the number of fragment relaxations. It proved difficult to understand and resulted in complex recursion detection.

To address this, I removed lazy evaluation in favor of eager fragment relaxation. This simplification improved the reliability of the layout process, eliminating the need for intricate workarounds like the MCFragment::IsBeingLaidOut flag introduced earlier.

Note: the benefit of lazy evaluation largely diminished when https://reviews.llvm.org/D76114 invalidated all sectionsto fix the correctness issue for the following assembly:

1
2
3
4
5
6
7
8
.section .text1,"ax"
.skip after-before,0x0
.L0:

.section .text2
before:
jmp .L0
after:

In addition, I removed anoverload of isSymbolRefDifferenceFullyResolvedImpl, enablingconstant folding for variable differences in Mach-O.

Target-specific features misplaced in the generic implementation

I have made efforts to relocate target-specific functionalities totheir respective target implementations:

The class hierarchy has been cleaned up by making moreMC*ObjectWriter public and accessing them fromMC*Streamer.

CREL

The integrated assembler now supports CREL (compact relocations) for ELF.

Once the Clang and lld patches are merged, enabling compact relocations is as simple as this:

clang -c -Wa,--crel,--allow-experimental-crel a.c && clang -fuse-ld=lld a.o

Note: This feature is unstable. While relocatable files created with Clang version A will work with lld version A, they might not be compatible with a different lld version B (where A is older than B).

As the future of the generic ABI remains uncertain, CREL might not get "standardized". In that case, I will just get the section code agreed upon with the GNU community to ensure wider compatibility.

Assembly parser

  • \+, the per-macro invocation count, is now available for .irp/.irpc/.rept.
  • [MCParser] .altmacro: Support argument expansion not preceded by \
  • [MC] Support .cfi_label

Summary

I've been contributing to MC for several years. Back then, while manycontributed, most focused on adding specific features. Rafael Ávila deEspíndola was the last to systematically review and improve the MClayer. Unfortunately, simplification efforts stalled after Rafael'sdeparture in 2018.

Picking up where Rafael left off, I'm diving into the MC layer to streamline its design. A big thanks to @aengelke for his invaluable performance-centric contributions in this release cycle. LLVM 19 introduces significant enhancements to the integrated assembler, resulting in notable performance gains, reduced memory usage, and a more streamlined codebase. These optimizations pave the way for future improvements.

I compiled the preprocessed SQLite Amalgamation (from llvm-test-suite)using a Release build of clang:

build  | 2024-05-14 | 2024-07-02
-O0    | 0.5304     | 0.4930
-O0 -g | 0.8818     | 0.7967
-O2    | 6.249      | 6.098
-O2 -g | 7.931      | 7.682

clang -c -w sqlite3.i

The AsmPrinter pass, which is coupled with the assembler, dominates the -O0 compile time. I have modified the -ftime-report mechanism to decrease the per-instruction overhead. The decrease in compile time matches the decrease in the time spent in AsmPrinter. Coupled with a recent observation that BOLT, which heavily utilizes MC, is ~8% faster, it's clear that MC modifications have yielded substantial improvements.

Noticeable optimizations in previous releases

[MC] Always encode instruction into SmallVector optimized MCCodeEmitter::encodeInstruction for x86 by avoiding raw_ostream::write overhead. I have migrated other targets and removed the extra overload.

1
raw_ostream::write =(inlinable)=> flush_tied_then_write (unneeded TiedStream check) =(virtual function call)=> raw_svector_ostream::write_impl ==> SmallVector append(ItTy in_start, ItTy in_end) (range; less efficient than push_back).

[MC]Optimize relaxInstruction: remove SmallVector copy. NFC removes codeand fixup copy for relaxInstruction.

Roadmap

Symbol redefinition

llvm-mc: Diagnose misuse (mix) of defined symbols and labels. added a redefinition error. This was refined many times. I hope to fix this in the future.

Addressing Mach-O weakness

The Mach-O assembler lacks the robustness of its ELF counterpart. Notably, certain aspects of the Mach-O implementation, such as the conditions for constant folding in MachObjectWriter::isSymbolRefDifferenceFullyResolvedImpl (different for x86-64 and AArch64), warrant revisiting.

Additionally, the Mach-O port has a hack to maintain compatibility with the Apple cctools assembler when the relocation addend is non-zero.

1
2
3
4
5
6
7
8
9
10
.data
a = b + 4
.long a # ARM64_RELOC_UNSIGNED(a) instead of b; This might work around the linker bug(?) when the referenced symbol is b and the addend is 4.

c = d
.long c # ARM64_RELOC_UNSIGNED(d)

y:
x = y + 4
.long x # ARM64_RELOC_UNSIGNED(x) instead of y

This leads to another workaround in MCFragment.cpp:getSymbolOffsetImpl ([MC] Recursively calculate symbol offset), which is to support the following assembly:

1
2
3
4
l_a:
l_b = l_a + 1
l_c = l_b
.long l_c

Misc

  • emitLabel at switchSection was for DWARF sections, which might no longer be useful

Understanding orphan sections

作者 MaskRay
2024年6月2日 15:00

GNU ld's output section layout is determined by a linker script, which can be either internal (default) or external (specified with -T or -dT). Within the linker script, SECTIONS commands define how input sections are mapped into output sections.

Input sections not explicitly placed by SECTIONS commands are termed "orphan sections".

Orphan sections are sections present in the input files which are not explicitly placed into the output file by the linker script. The linker will still copy these sections into the output file by either finding, or creating, a suitable output section in which to place the orphaned input section.

GNU ld's default behavior is to create output sections to hold these orphan sections and insert these output sections into appropriate places.

Orphan section placement is crucial because GNU ld's built-in linker scripts, while understanding common sections like .text/.rodata/.data, are unaware of custom sections. These custom sections should still be included in the final output file.

  • Grouping: Orphan input sections are grouped into orphan output sections that share the same name.
  • Placement: These grouped orphan output sections are then inserted among the output sections defined in the linker script. They are placed near similar sections to minimize the number of PT_LOAD segments needed.

GNU ld's algorithm

GNU ld's orphan section placement algorithm is primarily specified within ld/ldlang.c:lang_place_orphans and ld/ldelf.c:ldelf_place_orphan. lang_place_orphans is a linker pass that runs between INSERT processing and SHF_MERGE section merging.

The algorithm utilizes a structure (orphan_save) to associate desired BFD flags (e.g., SEC_ALLOC, SEC_LOAD) with special section names (e.g., .text, .rodata) and a reference to the associated output section. The associated output section is initialized to the special section name (e.g., .text, .rodata), if present.

For each orphan section:

  • If an output section of the same name is present and --unique is not specified, the orphan section is placed in it.
  • Otherwise, GNU ld identifies the matching orphan_save element based on the section's flags.
  • If an associated output section exists related to the orphan_save element, the orphan section is placed after it.
  • Otherwise, heuristics are applied to place the orphan section after a similar existing section. For example:
    • .text-like sections follow SHF_ALLOC|SHF_WRITE|SHF_EXECINSTR sections.
    • .rodata-like sections follow .text-like sections.
    • .tdata-like sections follow .data-like sections.
    • .sdata-like sections follow .data-like sections.
    • .data-like sections can follow .rodata-like sections.
  • The associated output section is replaced with the new output section. The next orphan output section of similar flags will be placed after the current output section.

As a special case, if an orphan section is placed after the last output section (else if (as != snew && as->prev != snew)), it will be adjusted to be placed after all trailing commands (sym = expr, . = expr, etc).

For example, a custom code section mytext (with SHF_ALLOC | SHF_EXECINSTR) would typically be placed after .text, and a custom data section mydata (with SHF_ALLOC | SHF_WRITE) after .data.

1
2
3
4
5
6
7
8
9
10
11
static struct orphan_save hold[] = {
{ ".text", SEC_HAS_CONTENTS | SEC_ALLOC | SEC_LOAD | SEC_READONLY | SEC_CODE, 0, 0, 0, 0 },
{ ".rodata", SEC_HAS_CONTENTS | SEC_ALLOC | SEC_LOAD | SEC_READONLY | SEC_DATA, 0, 0, 0, 0 },
{ ".tdata", SEC_HAS_CONTENTS | SEC_ALLOC | SEC_LOAD | SEC_DATA | SEC_THREAD_LOCAL, 0, 0, 0, 0 },
{ ".data", SEC_HAS_CONTENTS | SEC_ALLOC | SEC_LOAD | SEC_DATA, 0, 0, 0, 0 },
{ ".bss", SEC_ALLOC, 0, 0, 0, 0 },
{ 0, SEC_HAS_CONTENTS | SEC_ALLOC | SEC_LOAD | SEC_READONLY | SEC_DATA, 0, 0, 0, 0 },
{ ".interp", SEC_HAS_CONTENTS | SEC_ALLOC | SEC_LOAD | SEC_READONLY | SEC_DATA, 0, 0, 0, 0 },
{ ".sdata", SEC_HAS_CONTENTS | SEC_ALLOC | SEC_LOAD | SEC_DATA | SEC_SMALL_DATA, 0, 0, 0, 0 },
{ ".comment", SEC_HAS_CONTENTS, 0, 0, 0, 0 },
};

Noteworthy details:

.interp and .rodata have the same BFD flags, but they are anchors for different sections. SHT_NOTE sections go after .interp, while other read-only sections go after .rodata.

Consider a scenario where a linker script defines .data and .rw1 sections with identical BFD flags. If we have orphan sections that share the same flags, GNU ld would insert these orphans after .data, even if it might seem more logical to place them after .rw1.

1
2
3
.data : { *(.data .data.*) }
// .rw2 .rw3 orphans are inserted here
.rw1 : { *(.rw1) }

Renaming the output section .data will achieve the desired placement:

1
2
3
.mydata : { *(.data .data.*) }
.rw1 : { *(.rw1) }
// .rw2 .rw3 orphans are inserted here

lld's algorithm

The LLVM linker lld implements a large subset of the GNU ld linker script. However, due to the complexity of GNU ld and the lack of an official specification, there can be subtle differences in behavior.

While lld strives to provide a similar linker script behavior, it occasionally makes informed decisions to deviate where deemed beneficial. We balance compatibility with practicality and interpretability.

Users should be aware of these potential discrepancies whentransitioning from GNU ld to lld, especially when dealing with intricatelinker script features.

lld does not have built-in linker scripts. When no SECTIONS command is specified, all input sections are orphan sections.

Rank-based sorting

lld assigns a rank to each output section, calculated using various flags like RF_NOT_ALLOC, RF_EXEC, RF_RODATA, etc. Orphan output sections are then sorted by these ranks.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
enum RankFlags {
RF_NOT_ADDR_SET = 1 << 27,
RF_NOT_ALLOC = 1 << 26,
RF_PARTITION = 1 << 18, // Partition number (8 bits)
RF_LARGE_ALT = 1 << 15,
RF_WRITE = 1 << 14,
RF_EXEC_WRITE = 1 << 13,
RF_EXEC = 1 << 12,
RF_RODATA = 1 << 11,
RF_LARGE = 1 << 10,
RF_NOT_RELRO = 1 << 9,
RF_NOT_TLS = 1 << 8,
RF_BSS = 1 << 7,
};

The important ranks are:

  • .interp
  • SHT_NOTE
  • read-only (non-SHF_WRITE, non-SHF_EXECINSTR)
  • SHF_EXECINSTR
  • SHF_WRITE (RELRO, SHF_TLS)
  • SHF_WRITE (RELRO, non-SHF_TLS)
  • SHF_WRITE (non-RELRO)
  • Sections without the SHF_ALLOC flag

Special case: non-alloc sections

Non-alloc sections are placed at the end.

Finding the most similar section

For each orphan section, lld identifies the output section with the most similar rank. The similarity is determined by counting the number of leading zeros in the XOR of the two ranks.

1
2
3
4
5
6
7
8
9
10
// We want to find how similar two ranks are.
// The more branches in getSectionRank that match, the more similar they are.
// Since each branch corresponds to a bit flag, we can just use
// countLeadingZeros.
static int getRankProximity(OutputSection *a, SectionCommand *b) {
auto *osd = dyn_cast<OutputDesc>(b);
return (osd && osd->osec.hasInputSections)
? llvm::countl_zero(a->sortRank ^ osd->osec.sortRank)
: -1;
}

When multiple output sections share the maximum similarity with an orphan section, resolving the ambiguity is crucial. I refined the behavior for lld 19: if the orphan section's rank is not lower than the similar sections' rank, the last similar section is chosen for placement.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// Find the most similar output section as the anchor. Rank Proximity is a
// value in the range [-1, 32] where [0, 32] indicates potential anchors (0:
// least similar; 32: identical). -1 means not an anchor.
//
// In the event of proximity ties, we select the first or last section
// depending on whether the orphan's rank is smaller.
int maxP = 0;
auto i = e;
for (auto j = b; j != e; ++j) {
int p = getRankProximity(sec, *j);
if (p > maxP ||
(p == maxP && cast<OutputDesc>(*j)->osec.sortRank <= sec->sortRank)) {
maxP = p;
i = j;
}
}
if (i == e)
return e;

For example, when inserting .bss orphan sections (SHF_ALLOC|SHF_WRITE, SHT_NOBITS), lld should find the last output section that carries the flags/type SHF_ALLOC|SHF_WRITE SHT_PROGBITS.

1
2
3
4
5
6
WA PROGBITS    (not here)
A
WA PROGBITS
AX
WA PROGBITS (here)
<== WA NOBITS

Placement decision

The orphan section is placed either before or after the most similar section, based on a complex rule involving:

  • The relative ranks of the orphan and the similar section.
  • The presence of PHDRS or MEMORY commands in the linker script.
  • Scanning backward or forward through the script for a suitable insertion point.

In essence:

  • If the orphan section's rank is lower than the similar section's rank, and no PHDRS or MEMORY commands exist, it's placed before the similar section.
  • Otherwise, it's placed after the similar section, potentially skipping symbol assignments or output sections without input sections in the process.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
auto isOutputSecWithInputSections = [](SectionCommand *cmd) {
auto *osd = dyn_cast<OutputDesc>(cmd);
return osd && osd->osec.hasInputSections;
};

// If i's rank is larger, the orphan section can be placed before i.
//
// However, don't do this if custom program headers are defined. Otherwise,
// adding the orphan to a previous segment can change its flags, for example,
// making a read-only segment writable. If memory regions are defined, an
// orphan section should continue the same region as the found section to
// better resemble the behavior of GNU ld.
bool mustAfter = script->hasPhdrsCommands() || !script->memoryRegions.empty();
if (cast<OutputDesc>(*i)->osec.sortRank <= sec->sortRank || mustAfter) {
for (auto j = ++i; j != e; ++j) {
if (!isOutputSecWithInputSections(*j))
continue;
if (getRankProximity(sec, *j) != proximity)
break;
i = j + 1;
}
} else {
for (; i != b; --i)
if (isOutputSecWithInputSections(i[-1]))
break;
}

// As a special case, if the orphan section is the last section, put
// it at the very end, past any other commands.
// This matches bfd's behavior and is convenient when the linker script fully
// specifies the start of the file, but doesn't care about the end (the non
// alloc sections for example).
if (std::find_if(i, e, isOutputSecWithInputSections) == e)
return e;

while (i != e && shouldSkip(*i))
++i;
return i;

Special case: last section

If the orphan section happens to be the last one, it's placed at the very end of the output, mimicking GNU ld's behavior for cases where the linker script fully specifies the beginning but not the end of the file.

Special case: skipping symbol assignments

It is common to surround an output section description with encapsulation symbols. lld has a special case to not place orphans between an output section foo and a following symbol assignment.

Backward scan example:

1
2
3
4
5
previous_start = .;
previous : { *(previous) } // Found output section with a backward scan
previous_end = .; // The orphan should be after here

similar : { *(similar) } // The most similar section found by the first step

Forward scan example:

1
2
3
4
similar0 : { *(similar0) }
similar1_start = .;
similar1 : { *(similar1) } // The most similar section found by the first step
similar1_end = .; // The orphan should be after here

However, an assignment to the location counter serves as a barrier tostop the forward scan.

1
2
3
4
5
6
7
previous_start = .;
previous : { *(previous) } // Found output section with a backward scan
previous_end = .; // The orphan should be after here
symbol = .; // We conservatively assume any symbol as a probable "end" symbol.
. = ALIGN(CONSTANT(MAXPAGESIZE)); // Barrier

similar : { *(similar) } // The most similar section found by the first step

Special case: initial location counter

In addition, if there is a location counter assignment before the first output section, orphan sections cannot be inserted before the initial location counter assignment. This is to recognize the common pattern that the initial location counter assignment specifies the load address.

1
2
3
sym0 = .;
. = initial_location; // Initial assignment to DOT. Orphan sections cannot be inserted before here.
.text : {} // First output section

Presence of PHDRS or MEMORY

The presence of PHDRS or MEMORY commands prevents lld from placing the orphan section before the anchor. This condition was introduced in two patches.

When a linker script defines PHDRS, it typically specifies the initial section within each PT_LOAD segment. These sections often have address requirements, indicated by a preceding . = expr statement. If an orphan section is associated with such a section as its anchor, lld avoids inserting the orphan before the anchor to maintain the intended segment structure and address alignment.

For instance, consider this linker script excerpt:

1
2
3
4
5
6
7
8
9
PHDRS {
...
rodata PT_LOAD;
}

SECTIONS {
...
. = ALIGN(CONSTANT(MAXPAGESIZE));
.rodata : { *(.rodata .rodata.*) } : rodata

Here, .rodata is the first section in a PT_LOAD segment, and it's aligned to MAXPAGESIZE. If an orphan section were inserted before .rodata, it would inherit the previous segment's flags and break the intended address requirement.

Program headers propagation

After orphan section placement, if the PHDRS command isspecified, lld will propagate program headers to output sections that donot specify :phdr.

Case study

By employing this rank-based approach, lld provides an elegant implementation that does not hard-code specific section names (e.g., .text/.rodata/.data). In GNU ld, if you rename the special section names .text/.rodata/.data in the linker script, the output could become subtly different.

Orphan sections matching an output section name

The following output section description does not match .foo input sections, but .foo orphan sections will still be placed inside .foo.

1
.foo : { *(.bar) }

Read-only sections

Among read-only sections (e.g., .dynsym, .dynstr, .gnu.hash, .rela.dyn, .rodata, .eh_frame_hdr, .eh_frame), lld prioritizes the placement of SHT_PROGBITS sections (.rodata, .eh_frame_hdr, and .eh_frame) closer to code sections. This is achieved by assigning them a higher rank. The rationale behind this design is to mitigate the risk of relocation overflow in the absence of an explicit linker script.

These non-SHT_PROGBITS sections do not containrelocations to code sections and can be placed away from codesections.

1
2
3
4
5
6
7
8
9
10
11
.dynsym
.gnu.version
.gnu.version_r
.gnu.hash
.dynstr
.rela.dyn
.rela.plt
.rodata // closer to .text
.eh_frame_hdr // closer to .text
.eh_frame // closer to .text
.text

If a linker script explicitly includes a SECTIONS command specifying .rodata without mentioning other read-only sections, orphan sections like .dynsym might be placed before .rodata.

1
.rodata : { *(.rodata .rodata.*) }

This behavior can be further influenced by the presence of PHDRS commands. If an output section phdr is specified with .rodata, orphan sections like .dynsym would not be placed before .rodata, ensuring that the orphans would not affect the flags of the preceding program header.

1
2
3
4
5
6
7
8
9
PHDRS {
text PT_LOAD;
}

SECTIONS {
...
.rodata : { *(.rodata .rodata.*) } : text
// .dynsym cannot be placed before .rodata with specified program headers
}

Symbol assignment between two output sections

A symbol assignment placed between output sections can be interpreted in two ways: as marking the end of the preceding section or the start of the following section. lld doesn't attempt to guess the intended meaning, leading to potential ambiguity in scenarios like this:

1
2
3
.data : { *(.data .data.*) }
bss_start = .;
.bss : { *(.bss .bss.*) }

versus

1
2
3
.data : { *(.data .data.*) }
data_end = .;
.bss : { *(.bss .bss.*) }

In both cases, lld might place SHF_ALLOC|SHF_WRITE SHT_PROGBITS orphan sections before .bss, potentially disrupting the intended behavior if the goal was to mark the start of the .bss section with bss_start = ..

To avoid this ambiguity and ensure consistent behavior, the recommended practice is to place symbol assignments within the output section descriptions (FreeBSD example):

1
2
.data : { *(.data) }
.bss : { bss_start = .; *(.bss) bss_end = .; }

Portability

To maximize the portability of linker scripts across different linkers, it's essential to establish clear boundaries for PT_LOAD segments. This can be achieved by:

  • Explicit alignment: Utilizing MAXPAGESIZE alignment to distinctly separate sections within the linker script.
  • Anchoring sections: Ensuring that the first section in each PT_LOAD segment includes at least one input section, preventing ambiguous placement decisions by the linker. When the PHDRS command is present, ensure that the first sections have :phdr.

By adhering to these guidelines, you can reduce reliance onlinker-specific orphan section placement algorithms, promotingconsistency across GNU ld and lld.

When linking a regular position-dependent executable, you may also supply a minimal linker script like the following for a -no-pie link:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
SECTIONS
{
PROVIDE (__executable_start = SEGMENT_START("text-segment", 0x400000)); . = SEGMENT_START("text-segment", 0x400000) + SIZEOF_HEADERS;

. = DATA_SEGMENT_ALIGN (CONSTANT (MAXPAGESIZE), CONSTANT (COMMONPAGESIZE));
.init_array :
{
PROVIDE_HIDDEN (__init_array_start = .);
KEEP (*(SORT_BY_INIT_PRIORITY(.init_array.*)))
KEEP (*(.init_array))
PROVIDE_HIDDEN (__init_array_end = .);
}
.fini_array :
{
PROVIDE_HIDDEN (__fini_array_start = .);
KEEP (*(SORT_BY_INIT_PRIORITY(.fini_array.*)))
KEEP (*(.fini_array))
PROVIDE_HIDDEN (__fini_array_end = .);
}
. = DATA_SEGMENT_RELRO_END (0, .);
.data : { *(.data .data.*) }
. = .;
.bss : { *(.bss .bss.*) *(COMMON) }
. = DATA_SEGMENT_END (.);
}

A better style may define .text and .rodata as well. This linker script works with both GNU ld and lld.

1
2
clang -fuse-ld=bfd -Wl,-T,a.lds -no-pie a.c -o a.bfd
clang -fuse-ld=lld -Wl,-T,a.lds -no-pie a.c -o a.lld

Disabling orphan sections

For projects that require absolute control over section placement, GNU ld version 2.26 and later provides --orphan-handling=[place|warn|error|discard]. This allows you to choose how orphan sections are handled:

  • place (default): The linker places orphan sections according to its internal algorithm.
  • warn: The linker places orphan sections but also issues warnings for each instance.
  • error: The linker treats orphan sections as errors, preventing the linking process from completing.
  • discard: The linker discards orphan sections entirely.

--unique

TODO

Evolution of the ELF object file format

作者 MaskRay
2024年5月26日 15:00

The ELF object file format is adopted by many UNIX-like operating systems. While I've previously delved into the control structures of ELF and its predecessors, tracing the historical evolution of ELF and its relationship with the System V ABI can be interesting in itself.

The format consists of the generic specification, processor-specific specifications, and OS-specific specifications. Three key documents often surface when searching for the generic specification:

  • Tool Interface Standard (TIS) Portable Formats Specification, version 1.2 on https://refspecs.linuxfoundation.org/
  • System V Application Binary Interface - DRAFT - 10 June 2013 on www.sco.com
  • Oracle Solaris Linkers and Libraries Guide

The TIS specification breaks ELF into the generic specification, a processor-specific specification (x86), and an OS-specific specification (System V Release 4). However, it has not been updated since 1995. The Solaris guide, though well-written, includes Solaris-specific extensions not applicable to Linux and *BSD. This leaves us primarily with the System V ABI hosted on www.sco.com, which dedicates Chapters 4 and 5 to the ELF format.

Let's trace the ELF history to understand its relationship with theSystem V ABI.

History

Unix System Laboratories (USL) created ELF for their System V Release 4 in the late 1980s. USL also maintained the System V Application Binary Interface, of which ELF was a core component. The dynamic shared library system was contributed by Sun Microsystems from their SunOS 4.x (in 1988, SunOS 4.0 got an extended a.out format with dynamic shared library support).

USL intended ELF to be an open standard and published documents aboutthe format, e.g.

  • In Proceedings of the Summer 1990 USENIX Conference, ELF: AnObject File to Mitigate Mischievous Misoneism by James Q.Arnold
  • UNIX System V Release 4 Programmer's Guide: ANSI C andProgramming Support Tools (ISBN 0-13-933706-7) published in1990
  • System V Application Binary Interface (Standards) (ISBN0-13-104670-5) published in 1993

In 1993, the Tool Interface Standard (TIS) Committee, a consortium ofindustry leaders, adopted ELF and developed the "Tool Interface Standard(TIS) Portable Formats Specification". Version 1.2 was released in May1995.

ELF has been very influential. In the 1990s, many Unix and Unix-likeoperating systems, including Solaris, IRIX, HP-UX, Linux, and FreeBSD,switched to ELF. The 86open Project's FAQ specified:

Q18: How can you get a single binary to work identically across allthese diverse systems?

Most Unix-on-Intel binary packages are already largely similar.Almost all such operating systems use the "ELF" binary 'packaging'; thevarious operating systems have small but significant differences,though, that make each system's ELF binary unusable on others'.

The evolving stewardship of the System V ABI

The Tool Interface Standard (TIS) Committee essentially dissolved after 1995. The stewardship of the System V ABI, and consequently the generic ELF specification, has followed a complex path mirroring the transfer of Unix software assets.

Between 1993 and 2011, Unix assets underwent a few transfers.

  • In 1993, Novell acquired Unix assets including all copyrights, trademarks, and licensing contracts.
  • In September 1995, Novell sold the "develop and sell licenses to Unix binaries" plus "handle source licencees" business to The Santa Cruz Operation (sometimes referred to as "old SCO"). Novell still owned the copyrights (SCO vs Novell verdict).
  • In 2001, The Santa Cruz Operation sold its Unix software assets to Caldera Systems (later renamed The SCO Group, Inc; sometimes referred to as "new SCO" or "SCOX").
  • In 2011, The SCO Group's Unix software assets were sold off to UnXis (later renamed Xinuos).

The task of maintaining and updating the generic ABI fell to these successive owners of Unix software assets. The Santa Cruz Operation, and later The SCO Group and Xinuos, managed updates and extensions to the ABI, including the ELF specification.

In this binutils commit from November 2000, it was said that e_machine values should eventually be sent to registry@sco.com for blessing (now registry@xinuos.com).

Dave Prosser had maintained the System V ABI at USL, then The Santa Cruz Operation, and then The SCO Group. The last maintainer at The SCO Group and UnXis/Xinuos was John Wolfe, who oversaw updates until his departure from Xinuos in 2015. The generic ABI (including the ELF specification) then became unmaintained.

The final functional update on https://www.sco.com/developers/gabi/latest/contents.html was made in June 2013 for SHF_COMPRESSED. Since then, the specification has remained frozen.

"All rights reserved"?

The copyright notices on the SCO website's documentation for theSystem V ABI seem potentially misleading.

The footnotes of https://www.sco.com/developers/gabi/1998-04-29/contents.htmlpages today (and in 2003 per web.archive.org) specify:

© 1997, 1998, 1998 The Santa Cruz Operation, Inc. All rightsreserved.

The footnotes of https://www.sco.com/developers/gabi/latest/contents.htmlpages specify:

© 1997, 1998, 1999, 2000, 2001 The Santa Cruz Operation, Inc. Allrights reserved. © 2002 Caldera International. All rights reserved. ©2003-2010 The SCO Group. All rights reserved. © 2011-2015 Xinuos Inc.All rights reserved.

The repeated phrase "All rights reserved" could be interpreted as implying exclusive ownership over the ELF format itself. This is inaccurate, as ELF is an open standard developed through the collaboration of many organizations and individuals. The Santa Cruz Operation's role in the evolution of the System V ABI seems to have been more of an editor than an innovator. After The Santa Cruz Operation sold its Unix assets in 2001, the specification has largely stayed unchanged, with occasional constant updates.

The earliest available snapshot on the Wayback Machine dates back to2003, a time when The SCO Group had assumed ownership and initiated alawsuit against IBM, alleging that the success of Linux was due to themisappropriation of SCO's technology. Regrettably, earlier snapshots areunavailable to provide a more complete historical context.

Tool Interface Standard (TIS) Portable Formats Specification,version 1.2 effectively putthe specification in the public domain:

The TIS Committee grants you a non-exclusive, worldwide, royalty-freelicense to use the information disclosed in this Specification to makeyour software TIS-compliant; no other license, express or implied, isgranted or intended hereby.

Further reading:

The generic-abi Google Group

A neutral Google Group not affiliated with The SCO Group/Xinuos exists for discussing the generic ABI. Hongjiu Lu might be the owner. The group served as a platform for OS and toolchain vendors to collaborate. In recent years, participation has dwindled to primarily representatives from Oracle Solaris (just Ali Bahrami) and the GNU toolchain.

The reduced activity might not seem critical, as significantnon-OS-specific changes to the ELF format are infrequent.

Evolution of the generic ABI

https://www.sco.com/developers/gabi/latest/revision.html outlines the evolution of ELF from 1998 to 2013. Important features were all available as of April 2001.

  • Symbol visibility
  • Section groups
  • EI_OSABI and EI_ABIVERSION
  • SHF_MERGE and SHF_STRINGS
  • SHF_LINK_ORDER

However, discussions regarding these specific features seemunavailable. Please let me know if you have any information onthem.

From April 2001 to June 2013, there were merely updates to constants. SHF_COMPRESSED was added in June 2013.

The generic-abi Google Group reached consensus onproposals that haven't been reflected on the www.sco.comwebsite:

  • 2018: RELRrelative relocation format
  • 2022: ELFCOMPRESS_ZSTD

A future in flux

In April 2020, Cary Coutant reached a preliminary agreement with Xinuos, but the future remains uncertain. While some constants (e.g., e_machine and EI_OSABI values, ELFCOMPRESS_ZSTD) have been defined, no functional updates to the ABI have materialized.

The absence of a centralized, up-to-date repository for thespecification complicates matters. While some clarificationsand consensus have been reached within the generic-abi group, accessingthe latest, definitive text remains a challenge.

A potential solution could be to decouple the ELFspecification from the broader System V ABI, as was done in thepast with the TIS specification. This would create a dedicated andaccessible reference for ELF, independent of the broader System Vspecificities that are of less general interest.

Despite this uncertainty, innovation within the ELF ecosystem should continue. Efforts like my own to replace ELF control structures to reduce object file sizes (e.g., compact relocations) can still move forward. In practice, achieving consensus among major toolchain vendors (GNU and LLVM) may be sufficient, even without formal approval from the generic ABI. While aligning with Solaris would be ideal, and I will try doing so, this might not always be feasible due to varying priorities.

FreeBSD, which Xinuos's OpenServer is based on, utilizes the LLVM toolchain. Xinuos might indirectly benefit from my heavy involvement in the LLVM toolchain.

System V ABI Processor Supplement (psABI)

Processor-specific details for the System V ABI are found in the psABI documents. Actively maintained psABIs exist for various architectures including AArch32, AArch64, LoongArch, PPC64, RISC-V, s390x, i386, and x86-64. (These links refer to my notes.)

Many architectures have older or unavailable psABIs. Forinstance:

  • ppc32: Power Architecture® 32-bit Application Binary Interface Supplement 1.0 - Linux & Embedded was published in 2011. My notes
  • MIPS: The most recent o32 ABI dates back to February 1996, and the n64 ABI is unavailable. The n32 ABI accompanies the discontinued compiler MIPSpro: MIPSpro N32 ABI Handbook. My notes

Noteworthy details

  • Architectures like Motorola 6800, which have 16-bit address spaces, use the ELFCLASS32 format.
  • Many architectures have never been ported to any System V derivative OS, but their psABI documents still use the "System V" name.
  • Some behaviors are not formally documented and can only be found in the binutils project's source code.

Operating System Specific ABIs

In the System V ABI, Operating System Specific ABIs (OSABI) areextensions that provide operating system-specific details to supplementthe generic ABI.

For example, Oracle Solaris Linkers and Libraries Guidedefines the OSABI for Solaris.

The term OSABI is vague and might not be one single document. ForLinux, we need the following two documents:

The Linux Standard Base (LSB) is a related document that aims tostandardize the Linux system interface.

Until the recent introduction of the gABI supplement for program loading and dynamic linking on GNU, SHT_GNU_HASH, while widely adopted, was absent from any official documentation.

Interestingly, many Linux ABI extensions are generic enough to beadopted by other operating systems like FreeBSD, suggesting that adedicated FreeBSD OSABI document may not be necessary.

Exploring GNU extensions in the Linux kernel

作者 MaskRay
2024年5月12日 15:00

The Linux kernel is written in C, but it also leverages extensions provided by GCC. In 2022, it moved from GCC/Clang -std=gnu89 to -std=gnu11. This article explores my notes on how these GNU extensions are utilized within the kernel.

Statement expressions

Statement expressions are commonly used in macros.
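As an illustrative sketch (a hypothetical macro, not taken from the kernel): the ({ ... }) statement expression yields the value of its last expression, letting a macro use temporaries to avoid double evaluation.

#include <stdio.h>

/* A hypothetical max macro built on a statement expression. */
#define max_of(a, b) ({      \
    typeof(a) _a = (a);      \
    typeof(b) _b = (b);      \
    _a > _b ? _a : _b;       \
})

int main(void) {
  int calls = 0;
  int m = max_of(calls++, 3); /* calls++ is evaluated exactly once */
  printf("%d %d\n", m, calls); /* 3 1 */
}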

Local labels

Some macros use this extension to restart a for loop in a macro'sreplacement list.

// include/linux/wait.h
#define ___wait_event(wq_head, condition, state, exclusive, ret, cmd) \
({ \
__label__ __out; \
struct wait_queue_entry __wq_entry; \
long __ret = ret; /* explicit shadow */ \
\
init_wait_entry(&__wq_entry, exclusive ? WQ_FLAG_EXCLUSIVE : 0); \
for (;;) { \
long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);\
\
if (condition) \
break; \
\
if (___wait_is_interruptible(state) && __int) { \
__ret = __int; \
goto __out; \
} \
\
cmd; \
} \
finish_wait(&wq_head, &__wq_entry); \
__out: __ret; \
})

include/linux/instruction_pointer.h utilizes the featureto obtain the current instruction pointer:

#define _THIS_IP_  ({ __label__ __here; __here: (unsigned long)&&__here; })

Labels as values and computed goto statements

Bytecode interpreters often leverage this extension. BPF, forinstance, utilizes this extension. We can also see this extension indrm_exec_retry_on_contention for the Direct RenderingManager.

When it comes to Position-Dependent Code (PDC), a switch statementwith a default label marked as __builtin_unreachable can bejust as efficient as using labels as values. However, forPosition-Independent Code (PIC), absolute addresses offer a slightperformance edge, although at the cost of requiring more dynamicrelocations.

static void *tbl[] = {&&do_ret, &&do_inc, &&do_dec};
#define NEXT goto *tbl[*pc++]
NEXT;
for(;;) {
do_ret: return val;
do_inc: val++; NEXT;
do_dec: val--; NEXT;
}

typeof and __auto_type

typeof is used in numerous code.__auto_type has a few occurrences.

C23 standardizes typeof and auto.
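A minimal sketch of both extensions (my own example, not kernel code; the SWAP macro is hypothetical):

#include <stdio.h>

/* typeof lets a macro declare a temporary of the argument's type. */
#define SWAP(a, b) do { __typeof__(a) _tmp = (a); (a) = (b); (b) = _tmp; } while (0)

int main(void) {
  int x = 1, y = 2;
  SWAP(x, y);
  __auto_type z = x + y; /* deduced as int; C23 spells this "auto" */
  printf("%d %d %d\n", x, y, z);
  return 0;
}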

Conditionals with omitted operands

x ?: y is known as the "Elvis operator". This is used innumerous code.

Empty structures

The C standard (at least C11 and C23) specifies that:

If the member declaration list does not contain any named members,either directly or via an anonymous structure or anonymous union, thebehavior is undefined.

The empty structure extension enables a dummy structure when aconfiguration option disables the functionality.

#ifdef CONFIG_GENERIC_ENTRY

struct syscall_user_dispatch {
char __user *selector;
unsigned long offset;
unsigned long len;
bool on_dispatch;
};

#else

struct syscall_user_dispatch {};

#endif

Case ranges

This extension (case low ... high:), frequently used in the kernel, allows us to specify a range of consecutive values in a single case label within a switch statement.
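A minimal sketch of the extension (my own example, not kernel code):

#include <stdio.h>

static const char *classify(int c) {
  switch (c) {
  case '0' ... '9': return "digit";   /* GNU case range */
  case 'a' ... 'z':
  case 'A' ... 'Z': return "letter";
  default:          return "other";
  }
}

int main(void) {
  printf("%s %s %s\n", classify('7'), classify('q'), classify('!'));
  return 0;
}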

Object size checking

glibc 2.3.4 introduced _FORTIFY_SOURCE in 2004 to catchsecurity errors due to misuse of some C library functions (primarilybuffer overflow). The implementation leverages inline functions and__builtin_object_size.

Linux kernel introduced CONFIG_FORTIFY_SOURCE in 2017-07.Like the userspace, the implementation relies on optimizations likeinlining and constant folding.

For example, include/linux/string.h combines thisfeature with BUILD_BUG_ON(__attribute__((error(...)))) to provide compile-timeerrors:

#define strtomem(dest, src)     do {                                    \
const size_t _dest_len = __builtin_object_size(dest, 1); \
const size_t _src_len = __builtin_object_size(src, 1); \
\
BUILD_BUG_ON(!__builtin_constant_p(_dest_len) || \
_dest_len == (size_t)-1); \
memcpy(dest, src, strnlen(src, min(_src_len, _dest_len))); \
} while (0)

The Clang implementation utilizes __attribute__((pass_object_size(type))) and __attribute__((overloadable)).

Clang introduced__builtin_dynamic_object_size in 2019 to possiblyevaluate the size dynamically. The feature was ported to GCC in 2021 andis used by the kernel.

Pragmas

#pragma GCC diagnosticis occasionally used to disable a local diagnostic.

include/linux/hidden.h uses #pragma GCC visibility push(hidden) to force direct access (instead of GOT indirection) to external symbols for -fpie/-fpic.

bpf uses #pragma GCC poisonto forbid some undesired identifiers.

Structure-layoutpragmas are commonly used.

Inline assembly

The Linux kernel uses inline assembly for a few reasons:

  • Hardware interaction: Certain operations require direct interaction with the underlying hardware capabilities, which might not be expressible in standard C.
  • Performance optimization: In critical code paths, inline assembly can be used to squeeze out the last bit of performance by utilizing processor-specific instructions or bypassing certain C language constructs that might introduce overhead.
  • Special facilities: Some features, such as static keys and alternatives runtime patching, cannot be implemented with standard C.

Built-in functions

Special machine codeinstructions

GCC provides built-in functions to generate machine code instructions designed for certain tasks like popcount/clz/ctz. The compiler has more information about the function's purpose and leverages internal optimizations tailored for these tasks.
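A minimal sketch (my own example): these built-ins typically lower to dedicated instructions such as POPCNT/LZCNT/TZCNT on x86 when the target supports them.

#include <stdio.h>

int main(void) {
  unsigned x = 0x00f0u;
  printf("popcount=%d clz=%d ctz=%d\n",
         __builtin_popcount(x),
         __builtin_clz(x),  /* undefined if x == 0 */
         __builtin_ctz(x)); /* undefined if x == 0 */
  return 0;
}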

__builtin_choose_expr

This is analogous to ? : but the condition is a constantexpression and the return type is not altered by promotion rules. InGCC, __builtin_choose_expr is not available in C++.

// include/linux/math.h
#define abs(x) __abs_choose_expr(x, long long, \
__abs_choose_expr(x, long, \
__abs_choose_expr(x, int, \
__abs_choose_expr(x, short, \
__abs_choose_expr(x, char, \
__builtin_choose_expr( \
__builtin_types_compatible_p(typeof(x), char), \
(char)({ signed char __x = (x); __x<0?-__x:__x; }), \
((void)0)))))))

#define __abs_choose_expr(x, type, other) __builtin_choose_expr( \
__builtin_types_compatible_p(typeof(x), signed type) || \
__builtin_types_compatible_p(typeof(x), unsigned type), \
({ signed type __x = (x); __x < 0 ? -__x : __x; }), other)

__builtin_constant_p

__builtin_constant_p identifies whether an expressioncan be evaluated to a constant. This capability unlocks several codepatterns.

Conditional static assertions:

BUILD_BUG_ON(__builtin_constant_p(nr) && nr != 1);

__builtin_constant_p decides whetherMAYBE_BUILD_BUG_ON expands to a compile-time static assertmechanism (__attribute__((error(...)))) or a runtimemechanism.

#define MAYBE_BUILD_BUG_ON(cond)                        \
do { \
if (__builtin_constant_p((cond))) \
BUILD_BUG_ON(cond); \
else \
BUG_ON(cond); \
} while (0)

Alternative code paths:

Sometimes, the constant expression case might enable a more efficientcode path.

#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))

Constant Folding Optimizations:

By knowing expressions are constant, the compiler can perform constant folding optimizations: pre-calculating the expression's value during compilation and replacing it with the actual constant in the code. This leads to potentially smaller and faster code, as the calculation doesn't need to be done at runtime. However, I notice that __builtin_constant_p is often abused. This is one of the reasons why the kernel cannot be built with -O0.

__builtin_expect

Used by likely and unlikely macros to givethe compiler hints for optimization.

// include/linux/compiler.h
# define likely(x) __builtin_expect(!!(x), 1)
# define unlikely(x) __builtin_expect(!!(x), 0)

__builtin_frame_address

Used by stack trace utilities.

__builtin_offsetof

Used to implement offsetof.

__builtin_*_overflow

Used by include/linux/overflow.h for overflow checking.
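A minimal sketch in the spirit of checked addition (my own example, not the kernel's wrapper):

#include <limits.h>
#include <stdio.h>

int main(void) {
  int sum;
  /* Returns true if the mathematically correct result does not fit in sum. */
  if (__builtin_add_overflow(INT_MAX, 1, &sum))
    puts("overflow detected");
  else
    printf("sum=%d\n", sum);
  return 0;
}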

__builtin_prefetch

Used as the default implementation of prefetch andprefetchw.

__builtin_unreachable

Used by include/linux/compiler.h:unreachable to informunreachability for optimization or diagnostics suppressing purposes.

__builtin_types_compatible_p

This is often used to emulate C11 static_assert

#define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))

// include/linux/highmem-internal.h
#define kunmap_atomic(__addr) \
do { \
BUILD_BUG_ON(__same_type((__addr), struct page *)); \
__kunmap_atomic(__addr); \
} while (0)

or implement C++ function overloading, e.g.

// fs/bcachefs/util.h
#define strtoi_h(cp, res) \
( type_is(*res, int) ? bch2_strtoint_h(cp, (void *) res)\
: type_is(*res, long) ? bch2_strtol_h(cp, (void *) res)\
: type_is(*res, long long) ? bch2_strtoll_h(cp, (void *) res)\
: type_is(*res, unsigned) ? bch2_strtouint_h(cp, (void *) res)\
: type_is(*res, unsigned long) ? bch2_strtoul_h(cp, (void *) res)\
: type_is(*res, unsigned long long) ? bch2_strtoull_h(cp, (void *) res)\
: -EINVAL)

include/linux/jump_label.h combines__builtin_types_compatible_p and dead code elimination toemulate C++17 if constexpr.____wrong_branch_error is not defined. If the unexpectedcode path is taken, there will be a linker error.

#define static_branch_likely(x)                                                 \
({ \
bool branch; \
if (__builtin_types_compatible_p(typeof(*x), struct static_key_true)) \
branch = !arch_static_branch(&(x)->key, true); \
else if (__builtin_types_compatible_p(typeof(*x), struct static_key_false)) \
branch = !arch_static_branch_jump(&(x)->key, true); \
else \
branch = ____wrong_branch_error(); \
likely_notrace(branch); \
})

BTF built-in functions

Clang thread safety analysis

See perf mutex: Add thread safety annotations.

Function attributes

tools/include/linux/compiler.h defines macros for manycommonly used attributes. Many of them are commonly used by otherprojects and probably not worth describing.

tools/testing/selftests has a few__attribute__((constructor)) for initialization.

__attribute__((error(...))) is used with inlining anddead code elimination to provide static assertion.

// include/linux/compiler_types.h
#ifdef __OPTIMIZE__
# define __compiletime_assert(condition, msg, prefix, suffix) \
do { \
__noreturn extern void prefix ## suffix(void) \
__compiletime_error(msg); \
if (!(condition)) \
prefix ## suffix(); \
} while (0)
#else
# define __compiletime_assert(condition, msg, prefix, suffix) do { } while (0)
#endif

__attribute__((naked)) (__naked) is used byarm and bpf code.

__attribute__((section(...))) is used to place functionsin the specified section, e.g. .init.text.

__attribute__((weak)) is used to allow a generic function to be replaced by an arch-specific implementation. The weak reference feature is used by some __start_<sectionname>/__stop_<sectionname> encapsulation symbols.
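A minimal userspace sketch of both uses (my own example; it assumes a GNU-style linker that synthesizes __start_/__stop_ symbols for sections whose names are valid C identifiers):

#include <stdio.h>

/* Generic fallback that a strong, arch-specific definition could override. */
__attribute__((weak)) int arch_setup(void) { return 0; }

/* Weak references: still link even if no "mydata" section exists. */
extern const int __start_mydata[] __attribute__((weak));
extern const int __stop_mydata[] __attribute__((weak));

static const int value __attribute__((section("mydata"), used)) = 42;

int main(void) {
  printf("arch_setup=%d\n", arch_setup());
  for (const int *p = __start_mydata; p != __stop_mydata; p++)
    printf("entry %d\n", *p);
  return 0;
}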

#define __noendbr __attribute__((nocf_check)) is usedto disable Intel CET for some special functions.

Variable attributes

__attribute__((cleanup(...))) runs a function when thevariable goes out of scope. include/linux/cleanup.h definesDEFINE_FREE and __free based on thisfeature.

// include/linux/slab.h
DEFINE_FREE(kfree, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T))
// kernel/irq/irq_sim.c
struct irq_sim_work_ctx *work_ctx __free(kfree) =
kmalloc(sizeof(*work_ctx), GFP_KERNEL);

Type attributes

__attribute__((aligned(...))) and __attribute__((packed)) are often used to control structure layouts.

Language dialects

-fno-delete-null-pointer-checks

Assume that programs can safely dereference null pointers, and that code or data elements may reside at address zero.

-fno-strict-aliasing

The C11 standard (section 6.5) defines strict aliasing rules.

An object shall have its stored value accessed only by an lvalueexpression that has one of the following types:

— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the object,
— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
— a character type.

Compilers leverage these aliasing rules for optimizations, which can be disabled with -fno-strict-aliasing.
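A minimal sketch (my own example) of the kind of type punning the rules forbid; memcpy is the portable alternative:

#include <stdio.h>
#include <string.h>

static unsigned bits_of(float f) {
  unsigned u;
  memcpy(&u, &f, sizeof u); /* well-defined */
  return u;
}

int main(void) {
  float f = 1.0f;
  unsigned punned = *(unsigned *)&f; /* violates strict aliasing; tolerated with -fno-strict-aliasing */
  printf("%08x %08x\n", punned, bits_of(f));
  return 0;
}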

Linus Torvalds has expressed reservations about strict aliasing. Hereis a discussionfrom 2018.

-fno-strict-overflow

In GCC, this option is identical to -fwrapv -fwrapv-pointer: signed integer (and pointer) overflow becomes defined with wraparound semantics. While this avoids undefined behavior, unexpected arithmetic overflow bugs might still be lurking.
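A minimal sketch (my own example) of the difference:

#include <limits.h>
#include <stdio.h>

/* Without -fwrapv, signed overflow is undefined and the compiler may fold
   this to "return 1". With -fwrapv, INT_MAX + 1 wraps to INT_MIN and the
   function returns 0 for x == INT_MAX. */
static int plus_one_is_greater(int x) { return x + 1 > x; }

int main(void) {
  printf("%d\n", plus_one_is_greater(INT_MAX));
  return 0;
}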

[RFC]Mitigating unexpected arithmetic overflow is a 2024-05 thread aboutdetecting and mitigating unsigned integer overflow.

Clang's -O0 output: branch displacement and size increase

作者 MaskRay
2024年4月27日 15:00

tl;dr Clang 19 will remove the -mrelax-all default at-O0, significantly decreasing the text section size forx86.

Span-dependent instructions

In assembly languages, some instructions with an immediate operand can be encoded in two (or more) forms with different sizes. On x86-64, a direct JMP/JCC can be encoded either in 2 bytes with an 8-bit relative offset or in 6 bytes with a 32-bit relative offset. A short jump is preferred because it takes less space. However, when the target of the jump is too far away (out of range for an 8-bit relative offset), a near jump must be used.

ja foo    # jump short if above, 77 <rel8>
ja foo # jump near if above, 0f 87 <rel32>
.nops 126
foo: ret

A 1978 paper by Thomas G. Szymanski ("Assembling Code for Machines with Span-Dependent Instructions") used the term "span-dependent instructions" to refer to such instructions with short and long forms. Assemblers grapple with the challenge of choosing the optimal size for these instructions, often referred to as the "branch displacement problem" since branches are the most common type. A good resource for understanding Szymanski's work is Assembling Span-Dependent Instructions.

Start small and grow

Popular assemblers still used today tend to favor a "start small and grow" approach, typically requiring one more pass than Szymanski's "start big and shrink" method. This approach often results in smaller code and can handle additional complexities like alignment directives.
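A minimal sketch of the idea (my own simplification, not LLVM MC's data structures): every branch starts in its 2-byte short form and is widened to the 6-byte near form when its displacement no longer fits, iterating to a fixed point because widening one branch can push another out of range.

#include <stdbool.h>

struct branch { int index, target_index, size; }; /* size: 2 (short) or 6 (near) */

/* Hypothetical layout: every non-branch instruction occupies 1 byte. */
static int offset_of(const struct branch *b, int nb, int index) {
  int off = index;
  for (int i = 0; i < nb; i++)
    if (b[i].index < index)
      off += b[i].size - 1;
  return off;
}

static void relax(struct branch *b, int nb) {
  bool changed;
  do {
    changed = false;
    for (int i = 0; i < nb; i++) {
      if (b[i].size == 6) continue;
      int disp = offset_of(b, nb, b[i].target_index) -
                 (offset_of(b, nb, b[i].index) + b[i].size);
      if (disp < -128 || disp > 127) { b[i].size = 6; changed = true; }
    }
  } while (changed);
}

int main(void) {
  struct branch b[] = {{0, 200, 2}}; /* a forward branch over ~200 bytes */
  relax(b, 1);
  return b[0].size == 6 ? 0 : 1;     /* must have grown to the near form */
}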

In LLVM, the MC library (Machine Code) is responsible for assembly, disassembly, and object file formats. Within MC, "assembler relaxation" deals with span-dependent instructions. This is distinct from linker relaxation.

Eli Bendersky provides a detailed explanation in a 2013blog post and highlights an interesting behavior:

For example, when compiling with -O0, the LLVM assembler simplyrelaxes all jumps it encounters on first sight. This allows it to putall instructions immediately into data fragments, which ensures there'smuch fewer fragments overall, so the assembly process is faster andconsumes less memory.

When -O0 is enabled and the integrated assembler is used (the common default), the clang driver passes the -mrelax-all flag to the LLVM MC library. This sets the MCRelaxAll flag in MCTargetOptions, instructing the assembler to potentially start with the long (near) form for JMP and JCC instructions, on the x86 target only. Other instructions like ADD/SUB/CMP and non-x86 architectures remain unaffected.

-mrelax-all tradeoff

Here is an example:

void foo(int a) {
// -mrelax-all: near jump (6 bytes)
// -mno-relax-all or -fno-integrated-as: short jump (2 bytes)
if (a) bar();
}

The assembly (clang -S) looks like:

foo:                                    # @foo
# %bb.0: # %entry
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl %edi, -4(%rbp)
cmpl $0, -4(%rbp)
je .LBB0_2
# %bb.1: # %if.then
movb $0, %al
callq bar@PLT
.LBB0_2: # %if.end
addq $16, %rsp
popq %rbp
retq

The JE instruction assembles to either a short jump (8-bit relativeoffset) or near jump (32-bit relative offset).

# -mrelax-all
MCSection
MCDataFragment: empty
MCAlignFragment: alignment=4
MCDataFragment: instructions including JE (jump near if equal, 6 bytes)

# -mno-relax-all
MCSection
MCDataFragment: empty
MCAlignFragment: alignment=4
MCDataFragment: instructions before JE (push; mov; sub; mov; cmp)
MCRelaxableFragment: JE (jump short if equal, 2 bytes). This JE could be expanded, but not in this case.
MCDataFragment: instructions after JE (mov; call; add; pop; ret)

The impact of -mrelax-all on text section size issignificant, especially when there are many branch instructions. In anx86-64 release build of lld, -mrelax-all increased the.text section size by 7.9%. This translates to a 5.4%increase in VM size and a 4.6% increase in the overall file size. In aRISC-V rv64gc release build of lld, -mrelax-all increasedthe .text section size by 13.7%. This translates to a 9.0%increase in VM size and a 7.2% increase in the overall file size.

Dean Michael Berris proposed to remove the -mrelax-all default for -O0 in 2016, but it stalled. -mrelax-all caused undesired interaction issues with RISC-V's conditional branch transforms, leading Craig Topper to remove -mrelax-all at -O0 for RISC-V recently.

This actually indicated a size regression when the conditional branch transform patch landed in 2023.

blt a1, a2, .Lfoo
beqz a1, .Lfoo
.Lfoo:

# llvm-mc -filetype=obj -triple=riscv64 -mattr=+relax,+c -mc-relax-all
blt a1, a2, .Lfoo # R_RISCV_BRANCH(.Lfoo), range: +-4KiB
c.beqz a1, .Lfoo # R_RISCV_BRANCH(.Lfoo)
.Lfoo:
# llvm-mc -filetype=obj -triple=riscv64 -mattr=+relax,+c
bge a1, a2, .+8
jal zero, .Lfoo # R_RISCV_JAL(.Lfoo), range: +-2MiB
c.bnez a1, .+8
jal zero, .Lfoo # R_RISCV_JAL(.Lfoo)

While -mrelax-all might have offered slight compile timebenefits in the past, the gains are negligible today. Benchmarking usingstage 2 builds of Clang showed no measurable difference between-mrelax-all and -mno-relax-all. Onllvm-compile-time-tracker running the llvm-test-suite/CTMark benchmark,compile time actually increasedslightly by 0.62% while the text section size decreasedby 4.44%.

A difference for assembly output at different optimization levels would be quite surprising. GCC and GNU assembler don't exhibit similar expansion of JMP/JCC instructions even at -O0.

These arguments strengthen the case for removing-mrelax-all as the default for -O0. My patch haslanded and will be included in the next major release, LLVM 19.1. I havealso changed Clang to respect -mrelax-all for assemblyinput: clang -c --target=x86_64 -mrelax-all a.s

Understanding the compile time difference

I have studied a notoriously huge file, llvm/lib/Target/X86/X86ISelLowering.cpp.

Fragment count: A significant difference exists inthe number of assembler fragments generated:

  • -mrelax-all: 89633
  • -mno-relax-all: 143852

With -mrelax-all, the number ofMCRelaxableFragments is substantially reduced (to zero whenbuilding Clang). This reduction likely contributes to the compile timedifference.

Fixed-point iteration: -mrelax-allensures the fixed-point iteration algorithm (almost always) converges ina single iteration. In contrast, with -mno-relax-all,around 6% of sections require additional iterations. However, thisdifference is likely not the primary factor affecting compile time.

// -mrelax-all
1: 13919
2: 1

// -mno-relax-all
1: 13103
2: 793
3: 23
4: 1

Why didn't people complain about the code size increase?

Because people generally care less about -O0 codesize.

-O0 is frequently used with -g to includedebugging information. This debug information can overshadow the sizeincrease caused by -mrelax-all. (-O1 or abovesacrifices some debuggability.)

In addition, not all projects can be successfully built with-O0 optimization. This is typically due to issues like verylarge programs or mandatory inlining behavior.

For a discussion on size reduction ideas in ELF relocatable files,please check out my blog post about LightELF.

PNaCl aligned bundling and -mrelax-all

PNaCl's Software Fault Isolation mechanism requires that certain instructions and groups of instructions do not cross a bundle boundary. When .bundle_align_mode is in effect, each instruction is placed in its own fragment, allowing flexible NOP padding. -mrelax-all enables an optimization that may cause some data fragments to combine, potentially decreasing the total number of fragments.


You might also be interested in my notes about GNU assembler andLLVM integrated assembler.

When QOI meets XZ

作者 MaskRay
2024年4月23日 15:00

QOI, the Quite OK Image format,has been gaining in popularity. Chris Wellons offers a great analysis.

QOI's key advantage is its simplicity. Being a byte-oriented format without entropy encoding, it can be further compressed with generic data compression programs like LZ4, XZ, and zstd. PNG, on the other hand, uses DEFLATE compression internally and is typically resistant to further compression. By applying a stronger compression algorithm to QOI output, you can often achieve a smaller file size compared to PNG.

XZ

Lasse Collin has shared some effective options for compressinguncompressed BMP/TIFF files. I tested them on the QOI benchmark images.

When the color table (palette) is used, a delta filter would increasethe compressed size and should be disabled.

% cat ~/tmp/b.sh
#!/bin/zsh -ue
f() {
pngcrush -fix -m 1 -l 0 $1 ${1/.png/.uncompressed.png}
[[ -f ${1/.png/.uncompressed.png} ]] || cp $1 ${1/.png/.uncompressed.png}
/tmp/p/qoi/qoiconv $1 ${1/.png/.qoi}
convert $1 ${1/.png/.bmp}
convert $1 -compress none ${1/.png/.tiff}
xz --lzma2=pb=0 -fk ${1/.png/.qoi}
if [[ $(file $1) =~ RGBA ]]; then
pnm=${1/.png/.pam}
convert $1 $pnm
xz --delta=dist=4 --lzma2=lc=4 -fk $pnm
xz --delta=dist=4 --lzma2=lc=4 -fk ${1/.png/.bmp}
xz --delta=dist=4 --lzma2=lc=4 -fk ${1/.png/.tiff}
elif [[ $(file $1) =~ 'colormap' ]]; then
pnm=${1/.png/.ppm}
convert $1 $pnm
xz --lzma2=pb=0 -fk $pnm
xz --lzma2=pb=0 -fk ${1/.png/.bmp}
xz --lzma2=pb=0 -fk ${1/.png/.tiff}
else
pnm=${1/.png/.ppm}
convert $1 $pnm
xz --delta=dist=3 --lzma2=pb=0 -fk $pnm
xz --delta=dist=3 --lzma2=pb=0 -fk ${1/.png/.bmp}
xz --delta=dist=3 --lzma2=pb=0 -fk ${1/.png/.tiff}
fi
stat -c '%n %s' $1 ${1/.png/.qoi.xz} $pnm.xz ${1/.png/.bmp.xz} ${1/.png/.tiff.xz}
}

f $1
cd /tmp/dc-img/images/
ls -1 **/*.png | rush ~/tmp/b.sh '"{}"'
ls -1 **/*.uncompressed.png | rush 'xz -fk --lzma2=pb=0 "{}"'

ruby -e 'puts "directory\t.png\t.png.xz\t.qoi.xz\t.bmp.xz\t.tiff.xz\t.p[ap]m.xz"; Dir.glob("*").each{|dir| next unless File.directory? dir;
png=pngxz=qoi=bmp=pnm=tiff=0; Dir.glob("#{dir}/*.qoi.xz").each{|f|
png+=File.size(f.sub(/\.qoi.xz/,".png"));
pngxz+=File.size(f.sub(/\.qoi.xz/,".uncompressed.png.xz"));
qoi+=File.size(f); bmp+=File.size(f.sub(/\.qoi/,".bmp")); ppm=f.sub(/\.qoi/,".ppm"); pnm+=File.exists?(ppm) ? File.size(ppm) : File.size(f.sub(/\.qoi/,".pam")); tiff+=File.size(f.sub(/\.qoi/,".tiff"));
};
puts "#{dir}\t#{png}\t#{pngxz}\t#{qoi}\t#{bmp}\t#{tiff}\t#{pnm}"
}'

While DEFLATE-compressed PNG files can hardly be further compressed,we can convert these PNG files to uncompressed ones then apply xz. The.png.xz results below do not apply a filter, and the filesare generally larger than .qoi.xz.

directory .png .png.xz .qoi.xz .bmp.xz .tiff.xz .p[ap]m.xz
icon_512 11154424 7861652 7476640 8042032 8064476 8039192
icon_64 828119 750836 708480 730472 757760 735296
photo_kodak 15394305 14464504 12902852 13612440 13616140 13610844
photo_tecnick 237834256 254803292 213268188 210591724 210508596 210468412
photo_wikipedia 88339751 100449996 86679696 86380124 86274480 86241296
pngimg 229608249 134233476 193382668 186389368 186654256 186420564
screenshot_game 266238855 237976536 218915316 216626004 216847500 216765748
screenshot_web 40272678 24690360 21321460 21458496 21532360 21533432
textures_photo 37854634 36393340 28967008 30054968 30064236 30059784
textures_pk 43523493 40868036 54117600 41990596 40632916 46695172
textures_pk01 18946769 15734348 14950836 14835648 14853420 14839312
textures_pk02 102962935 86037000 82279000 79374112 79348768 79336276
textures_plants 51765329 53044044 43681548 44913260 45021996 45048652

While compressing QOI with XZ (.qoi.xz) can achieve goodresults, using a delta filter directly on the uncompressed BMP format(.bmp.xz) can sometimes lead to even smaller files. (TIFFand PPM/PAM, when compressed, can achieve similar file sizes to.bmp.xz.) This suggests that QOI is probably not betterthan a plain delta filter.

It's important to note that uncompressed BMP/TIFF files are huge.This can be problematic if the decompressed data can't be streameddirectly into the program's internal structures. In such cases, a largetemporary buffer would be needed, wasting memory.

Drop LZ match finders

QOI_OP_INDEX essentially does length-1 LZ77 using a conceptual window that contains 64 unique pixels. When further compressed, another match finder seems to help very little.
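A sketch of that mechanism, following the published QOI specification (the hash and the 64-entry index come from the spec; the surrounding driver is my own):

#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t r, g, b, a; } px;

/* Index position from the QOI specification. */
static unsigned qoi_hash(px p) { return (p.r * 3 + p.g * 5 + p.b * 7 + p.a * 11) % 64; }

int main(void) {
  px index[64] = {{0}};
  px stream[] = {{10, 20, 30, 255}, {1, 2, 3, 255}, {10, 20, 30, 255}};
  for (unsigned i = 0; i < sizeof stream / sizeof stream[0]; i++) {
    unsigned h = qoi_hash(stream[i]);
    if (index[h].r == stream[i].r && index[h].g == stream[i].g &&
        index[h].b == stream[i].b && index[h].a == stream[i].a) {
      printf("QOI_OP_INDEX %u\n", h); /* 1-byte back-reference into the window */
    } else {
      index[h] = stream[i];
      printf("literal pixel\n");      /* QOI_OP_RGB/RGBA or a diff op */
    }
  }
  return 0;
}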

Lasse Collin mentioned that the LZ layer cannot be disabled but itcan be made really weak usingxz --lzma2=dict=4KiB,mode=fast,nice=2,mf=hc3,depth=1. Let'stry it.

% =time -f '%e' xz -fk Prune_video_game_screenshot_2.qoi && stat -c %s Prune_video_game_screenshot_2.qoi.xz
0.76
2462360
% =time -f '%e' xz --lzma2=dict=4KiB,mode=fast,nice=2,mf=hc3,depth=1 -fk Prune_video_game_screenshot_2.qoi && stat -c %s Prune_video_game_screenshot_2.qoi.xz
0.27
2526664

Indeed, weakening the LZ layer improves compression speed significantly. Now, let's test all benchmark images.

% cat ~/tmp/qoi-weak-xz.sh
#!/bin/zsh
/tmp/p/qoi/qoiconv $1 ${1/.png/.qoi}
xz --lzma2=pb=0 -fk ${1/.png/.qoi}
xz --lzma2=dict=4KiB,mode=fast,nice=2,mf=hc3,depth=1 -c ${1/.png/.qoi} > ${1/.png/.qoi.weak-lz.xz}
% cd /tmp/dc-img/images
% ls -1 **/*.png | rush ~/tmp/qoi-weak-xz.sh '"{}"'

ruby -e 'puts "directory\tstrong\tweak\tincrease"; Dir.glob("*").each{|dir| next unless File.directory? dir;
strong=weak=0; Dir.glob("#{dir}/*.qoi.weak-lz.xz").each{|f| weak+=File.size(f); strong+=File.size(f.sub(/\.weak-lz/,""));};
puts "#{dir}\t#{strong}\t#{weak}\t#{(100.0*weak/strong-100).round(2)}%"
}'
directory strong weak increase
icon_512 7476640 8629900 15.42%
icon_64 708480 735036 3.75%
photo_kodak 12902852 13464072 4.35%
photo_tecnick 213268188 217460392 1.97%
photo_wikipedia 86679696 88609716 2.23%
pngimg 193382668 206679224 6.88%
screenshot_game 218915316 234889060 7.3%
screenshot_web 21321460 24820020 16.41%
textures_photo 28967008 31249492 7.88%
textures_pk 54117600 57956168 7.09%
textures_pk01 14950836 15749556 5.34%
textures_pk02 82279000 87747576 6.65%
textures_plants 43681548 45494084 4.15%

The size increase is small for certain directories but quite large for the others. For the directories with small size increases, relying purely on delta coding and a fast entropy encoder would give a strong competitor.

TheDark Horse of the Image Codec World: Near-Lossless Image Formats UsingUltra-Fast LZ Codecs remarks that fast LZ can make strongcontenders.

PNG

The PNG International Standard defines the compression method 0 asDEFLATE with a sliding window of at most 32768 bytes. Technically newcompression methods can be defined, but that would break compatibilityof existing decoders and stakeholders would just resort to new imageformats. However, it would be a nice experiment to check that after thecompression part is improved, how PNG compares with newer imageformats.

Light ELF: exploring potential size reduction

作者 MaskRay
2024年4月1日 15:00

ELF's design emphasizes natural size and alignment guidelines for its control structures. While this ensured efficient processing in the old days, it can lead to larger file sizes. I propose "Light ELF" (EV_LIGHT, version 2), a whimsical exploration inspired by the Light Elves of Tolkien's legendarium (who had seen the light of the Two Trees in Valinor).

In a light ELF file, the e_version member of the ELFheader is set to 2. EV_CURRENT remains 1 for backwardcompatibility.

#define EV_NONE 0
#define EV_CURRENT 1
#define EV_LIGHT 2

When linking a program, traditional ELF (version 1) and light ELF(version 2) files can be mixed together.

Relocations

Light ELF utilizes CREL forrelocations. RELand RELA from traditional ELF are unused.

Existing lazy binding schemes rely on random access to relocationentries within the DT_JMPREL table. Due to CREL'ssequential nature, keeping lazy binding requires a memory allocationthat holds decoded JUMP_SLOT relocations.

Section header table

Traditional ELF (version 1) sectionheader tables can be large. Light ELF addresses this through acompact section header table format signaled bye_shentsize == 0 in the ELF header.

A compact section header table for ELF contains the details. Its current version is copied below for your convenience.

nshdr denotes the number of sections (includingSHN_UNDEF). The section header table (located ate_shoff) begins with nshdrElf_Word values. These values specify the offset of eachsection header relative to e_shoff.

Following these offsets, nshdr section headers areencoded. Each header begins with a presence byte indicatingwhich subsequent Elf_Shdr members use explicit values vs.defaults:

  • sh_name, ULEB128 encoded
  • sh_type, ULEB128 encoded (if presence & 1), defaults to SHT_PROGBITS
  • sh_flags, ULEB128 encoded (if presence & 2), defaults to 0
  • sh_addr, ULEB128 encoded (if presence & 4), defaults to 0
  • sh_offset, ULEB128 encoded
  • sh_size, ULEB128 encoded (if presence & 8), defaults to 0
  • sh_link, ULEB128 encoded (if presence & 16), defaults to 0
  • sh_info, ULEB128 encoded (if presence & 32), defaults to 0
  • sh_addralign, ULEB128 encoded as log2 value (if presence & 64), defaults to 1
  • sh_entsize, ULEB128 encoded (if presence & 128), defaults to 0

In traditional ELF, sh_addralign can be 0 or a positiveintegral power of two, where 0 and 1 mean the section has no alignmentconstraints. While the compact encoding cannot encodesh_addralign value of 0, there is no loss ofgenerality.

Example C++ code that decodes a specific section header:

// readULEB128(const uint8_t *&p);

const uint8_t *sht = base + ehdr->e_shoff;
const uint8_t *p = sht + ((Elf_Word*)sht)[i];
uint8_t presence = *p++;
Elf_Shdr shdr = {};
shdr.sh_name = readULEB128(p);
shdr.sh_type = presence & 1 ? readULEB128(p) : ELF::SHT_PROGBITS;
shdr.sh_flags = presence & 2 ? readULEB128(p) : 0;
shdr.sh_addr = presence & 4 ? readULEB128(p) : 0;
shdr.sh_offset = readULEB128(p);
shdr.sh_size = presence & 8 ? readULEB128(p) : 0;
shdr.sh_link = presence & 16 ? readULEB128(p) : 0;
shdr.sh_info = presence & 32 ? readULEB128(p) : 0;
shdr.sh_addralign = presence & 64 ? 1UL << readULEB128(p) : 1;
shdr.sh_entsize = presence & 128 ? readULEB128(p) : 0;

While the current format allows for O(1) in-place random access ofsection headers using offsets at the beginning of the table, this accesspattern seems uncommon in practice. At least, I haven't encountered (orremembered) any instances within the llvm-project codebase. Therefore,I'm considering removing this functionality.

In a release build of llvm-project (-O3 -ffunction-sections -fdata-sections -Wa,--crel), the traditional section header tables occupy 16.4% of the .o file size, while the compact section header table drastically reduces the ratio to 4.7%.

Symbol table

Like other sections, symboltable and string table sections (SHT_SYMTAB andSHT_STRTAB) can be compressed throughSHF_COMPRESSED. However, compressing the dynamic symboltable (.dynsym) and its associated string table(.dynstr) is not recommended.

Symbol table sections have a non-zero sh_entsize, whichremains unchanged after compression.

The string table, which stores symbol names (also section names inLLVM output), is typically much larger than the symbol table itself. Toreduce its size, we can utilize a text compression algorithm. Whilecompressing the string table, compressing the symbol table along with itmight make sense, but using a compact encoding for the symbol tableitself won't provide significant benefits.

Program headers

Program headers, while individually large (eachElf64_Phdr is 56 bytes) and no random access is needed,typically have a limited quantity within an executable or shared object.Consequently, their overall size contribution is relatively small. LightELF maintains the existing format.

Section compression

Compressed sections face a challenge due to header overheadespecially for ELFCLASS64.

typedef struct {
  Elf32_Word ch_type;
  Elf32_Word ch_size;
  Elf32_Word ch_addralign;
} Elf32_Chdr;

typedef struct {
  Elf64_Word ch_type;
  Elf64_Word ch_reserved;
  Elf64_Xword ch_size;
  Elf64_Xword ch_addralign;
} Elf64_Chdr;

The overhead and alignment padding limit the effectiveness when used with features like -ffunction-sections and -fdata-sections that generate many smaller sections. For example, I have found that the large Elf64_Chdr makes evaluating compressed .rela.* sections difficult. Light ELF addresses this challenge by introducing a header format with a smaller footprint:

  • ch_type, ULEB128 encoded
  • ch_size, ULEB128 encoded
  • ch_addralign, ULEB128 encoded as log2 value

This approach allows Light ELF to represent the header information in just 3 bytes for smaller sections, compared to the 24 bytes required by the traditional format. The content is no longer guaranteed to be word-aligned, a property that most compression libraries don't require anyway.
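A hypothetical decoder for this proposed header, in the spirit of the readULEB128 helper used by the section header example above (my own sketch; the byte values in main are hand-encoded for illustration):

#include <stdint.h>
#include <stdio.h>

static uint64_t readULEB128(const uint8_t **p) {
  uint64_t v = 0;
  int shift = 0;
  uint8_t b;
  do { b = *(*p)++; v |= (uint64_t)(b & 0x7f) << shift; shift += 7; } while (b & 0x80);
  return v;
}

int main(void) {
  /* ch_type=1 (ELFCOMPRESS_ZLIB), ch_size=300, ch_addralign=2^0 */
  const uint8_t buf[] = {0x01, 0xac, 0x02, 0x00};
  const uint8_t *p = buf;
  uint64_t ch_type = readULEB128(&p);
  uint64_t ch_size = readULEB128(&p);
  uint64_t ch_addralign = (uint64_t)1 << readULEB128(&p); /* log2-encoded */
  printf("type=%llu size=%llu addralign=%llu header_bytes=%d (vs 24 for Elf64_Chdr)\n",
         (unsigned long long)ch_type, (unsigned long long)ch_size,
         (unsigned long long)ch_addralign, (int)(p - buf));
  return 0;
}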

Furthermore, compressedsections with the SHF_ALLOC flag are allowed. Usingthem outside of relocatable files needs caution, though.

Experiments

I have developed a Clang/lld prototype that implements compactsection header table and CREL (https://github.com/MaskRay/llvm-project/tree/april-2024).

.o size sht size build
136599656 18429824 -O3
112088088 18431616 -O3 -Wa,--crel
97517175 3860703 -O3 -Wa,--crel,--cshdr
2166435360 260303360 -g
1755222784 260305152 -g -Wa,--crel
1557420523 62502891 -g -Wa,--crel,--cshdr

WebAssembly has a dense object file format. We can make a comparisonwith WebAssembly.

clang -c --target=wasm64 a.cc
clang -c -Wa,--crel,--allow-experimental-crel,--cshdr -fno-asynchronous-unwind-tables -fno-exceptions -fno-ident -fno-addrsig -fno-unique-section-names a.cc

Light ELF: a thoughtexperiment

By now, you might have realized that this post is a joke. While bumping e_version and modifying Elf_Chdr might not be feasible, it's interesting to consider the possibilities of compact section headers and compressed symbol/string tables. Perhaps this can spark some interesting discussions!

A compact section header table for ELF

作者 MaskRay
2024年3月31日 15:00

ELF's design emphasizes natural sizeand alignment guidelines for its control structures. However, thisapproach has substantial size drawbacks.

In a release build of llvm-project (-O3 -ffunction-sections -fdata-sections), the section header tables occupy 13.4% of the .o file size.

I propose an alternative section header table format that is signaledby e_shentsize == 0 in the ELF header.e_shentsize == sizeof(Elf64_Shdr) (or the 32-bitcounterpart) selects the traditional section header table format.

The content begins with a ULEB128-encoded value nshdr: the number of sections (including SHN_UNDEF). nshdr section headers follow.

Each header begins with a presence byte indicating whichsubsequent Elf_Shdr members use explicit values vs.defaults:

  • sh_name, ULEB128 encoded
  • sh_type, ULEB128 encoded (if presence & 1), defaults to SHT_PROGBITS
  • sh_flags, ULEB128 encoded (if presence & 2), defaults to 0
  • sh_addr, ULEB128 encoded (if presence & 4), defaults to 0
  • sh_offset, ULEB128 encoded
  • sh_size, ULEB128 encoded (if presence & 8), defaults to 0
  • sh_link, ULEB128 encoded (if presence & 16), defaults to 0
  • sh_info, ULEB128 encoded (if presence & 32), defaults to 0
  • sh_addralign, ULEB128 encoded as log2 value (if presence & 64), defaults to 1
  • sh_entsize, ULEB128 encoded (if presence & 128), defaults to 0

In traditional ELF, sh_addralign can be 0 or a positiveintegral power of two, where 0 and 1 mean the section has no alignmentconstraints. While the compact encoding cannot encodesh_addralign value of 0, there is no loss ofgenerality.

Example C++ code that decodes a section header:

// readULEB128(const uint8_t *&p);

const uint8_t *sht = base + ehdr->e_shoff;
const uint8_t *p = sht + offsets[i];
uint8_t presence = *p++;
Elf_Shdr shdr = {};
shdr.sh_name = readULEB128(p);
shdr.sh_type = presence & 1 ? readULEB128(p) : ELF::SHT_PROGBITS;
shdr.sh_flags = presence & 2 ? readULEB128(p) : 0;
shdr.sh_addr = presence & 4 ? readULEB128(p) : 0;
shdr.sh_offset = readULEB128(p);
shdr.sh_size = presence & 8 ? readULEB128(p) : 0;
shdr.sh_link = presence & 16 ? readULEB128(p) : 0;
shdr.sh_info = presence & 32 ? readULEB128(p) : 0;
shdr.sh_addralign = presence & 64 ? 1UL << readULEB128(p) : 1;
shdr.sh_entsize = presence & 128 ? readULEB128(p) : 0;

While O(1) random access isn't supported, this is often addressed byapplications building their own data representations. Alternatively, forsimpler applications, a prescan can be performed to determine thestarting offset of each header beforehand.

In a release build of llvm-project (-O3 -ffunction-sections -fdata-sections -Wa,--crel), the traditional section header tables occupy 16.4% of the .o file size, while the compact section header table drastically reduces the ratio to 4.7%.

Experiments

I have developed a Clang/lld prototype that implements compactsection header table and CREL. https://github.com/MaskRay/llvm-project/tree/demo-cshdr

.o size sht size build
136599656 18429824 -O3
112088088 18431616 -O3 -Wa,--crel
97517175 3860703 -O3 -Wa,--crel,--cshdr
2166435360 260303360 -g
1755222784 260305152 -g -Wa,--crel
1557420523 62502891 -g -Wa,--crel,--cshdr

Alternatives

The WebAssembly object file format implemented by LLVM adopts the following design, which does not use section headers. Consumers need to perform a few random accesses to collect all section information.

# start section foo
section_id
size
content
# end section

# start_section bar
section_id
size
content
# end section

We could adopt a similar design by adding metadata beside the sizefield. We would be able to remove the sh_offset member.

We could take inspiration from DWARF .debug_abbrev: define a few shapes with fixed fields (e.g. sh_type==SHT_PROGBITS) and make a section header fill in the remaining fields.

We could also adopt a scheme similar to.subsections_via_symbols, but using metadata instead ofsymbols to describe subsection boundaries.

More ideas

Like other sections, symboltable and string table sections (SHT_SYMTAB andSHT_STRTAB) can be compressed throughSHF_COMPRESSED. However, compressing the dynamic symboltable (.dynsym) and its associated string table(.dynstr) is not recommended.

Symbol table sections have a non-zero sh_entsize, whichremains unchanged after compression.

The string table, which stores symbol names (also section names inLLVM output), is typically much larger than the symbol table itself. Toreduce its size, we can utilize a text compression algorithm. Whilecompressing the string table, compressing the symbol table along with itmight make sense, but using a compact encoding for the symbol tableitself won't provide significant benefits.

C++ exit-time destructors

作者 MaskRay
2024年3月17日 15:00

In ISO C++ standards, [basic.start.term] specifies that:

Constructed objects ([dcl.init]) with static storage duration aredestroyed and functions registered with std::atexit are called as partof a call to std::exit ([support.start.term]). The call to std::exit issequenced before the destructions and the registered functions. [Note1: Returning from main invokes std::exit ([basic.start.main]). — endnote]

For example, consider the following code:

struct A { ~A(); } a;

The destructor for object a will be registered for execution atprogram termination.

__cxa_atexit

The Itanium C++ ABI employs __cxa_atexit rather thanatexit for object destructor registration for two primary reasons:

  • Limited atexit guarantee: ISO C (up to C23) guaranteessupport for 32 registered functions, although most implementationssupport many more.
  • Dynamic library unloading: __cxa_atexit provides amechanism for handling destructors when dynamic libraries are unloadedvia dlclose before program termination.

Several standard libraries, including glibc, musl, and FreeBSD libc,implement atexit using __cxa_atexit.

  • In glibc, atexit returns__cxa_atexit ((void (*) (void *)) func, NULL, __dso_handle),where __dso_handle is part of libc itself.
  • musl uses 0 instead of __dso_handle.

https://itanium-cxx-abi.github.io/cxx-abi/abi.html#dso-dtor-runtime-apiprovides detailed documentation on object destruction mechanisms. Let'sillustrate this with a GCC and glibc example:

cat > a.cc <<'eof'
#include <dlfcn.h>
int main() {
void *h = dlopen("./b.so", RTLD_NOW);
((void (*)())dlsym(h, "foo"))();
dlclose(h);
}
eof
cat > b.cc <<'eof'
#include <stdio.h>
struct A { ~A(); } ga;
A::~A() { printf("~A %p\n", this); }
extern "C" void foo() {
static A a;
puts("foo");
}
eof
g++ -fpic -shared b.cc -o b.so
g++ a.cc -o a

An invocation yields:

foo
~A 0x7f70d66c4c79 // for the static-local variable
~A 0x7f70d66c4c78 // for the global variable

Key points:

  • The compiler registers destructors with __cxa_atexit using the __dso_handle symbol as an argument.
  • crtbeginS.o defines the .fini_array section (triggering __do_global_dtors_aux) and the hidden symbol __dso_handle.
  • Since 2017, lld defines __dso_handle as a hidden symbol if crtbegin does not.
  • dlclose invokes .fini_array functions. __cxa_finalize(d) iterates through the termination function list, calling matching destructors based on the DSO handle.
  • __cxa_atexit implementations typically allocate memory dynamically and may fail. The failures are simply ignored.

Note: In glibc, the DF_1_NODELETE flag marks a sharedobject as unloadable. Additionally, symbol lookups withSTB_GNU_UNIQUE automatically set this flag.

musl provides a no-opimplementation for dlclose and__cxa_finalize.

Thread storage duration variables

Objects with thread storage duration that have non-trivialdestructors will register those destructors using__cxa_thread_atexit during construction.

When exit-time destructors are undesired

Exit-time destructors for static and thread storage durationvariables can be undesired due to

  • Unnecessary overhead and complexity: This includes operating systemkernels and memory-constrained systems.
  • Potential race conditions: Destructors might execute during threadtermination, while other threads still attempt to access the object.Examples: webkit

Clang provides -Wexit-time-destructors (disabled bydefault) to warn about exit-time destructors.

% clang++ -c -Wexit-time-destructors g.cc
g.cc:1:20: warning: declaration requires an exit-time destructor [-Wexit-time-destructors]
1 | struct A { ~A(); } a;
| ^
1 warning generated.

Disabling exit-time destructors

Then, I will describe some approaches to disable exit-timedestructors.

Pointer/reference to a dynamically-allocated object

We can use a reference or pointer that refers to adynamically-allocated object.

struct A { int v; ~A(); };
A &g = *new A; // or A *const g = new A;
A &foo() {
static A &a = *new A;
return a; // or static A *a = new A; return *a
}

This approach prevents the destructor from running at program exit,as pointers and references have a trivial destructor. Note that thisdoes not create a memory leak, since the pointer/reference is part ofthe root set.

The primary downside is unnecessary pointer indirection whenaccessing the object. Additionally, this approach uses a mutable pointerin the data segment and requires a memory allocation.

# %bb.2:                                 // initializer
movl $4, %edi
callq _Znwm@PLT
movq %rax, _ZZ3foovE1a(%rip) // store pointer of the heap-allocated object to _ZZ3foovE1a
...
movq _ZZ3foovE1a(%rip), %rax // load a pointer from _ZZ3foovE1a

Class template with an empty destructor

A common approach, as outlined in P1247, is to use a class templatewith an empty destructor to prevent exit-time destruction:

template <class T> class no_destroy {
alignas(T) std::byte data[sizeof(T)];
public:
template <class... Ts> no_destroy(Ts&&... ts) { new (data) T(std::forward<Ts>(ts)...); }
T &get() { return *reinterpret_cast<T *>(data); }
};

no_destroy<widget> my_widget;

libstdc++ employs a variant that uses a union member.

struct A { ~A(); };

namespace {
struct constant_init {
union { A obj; };
constexpr constant_init() : obj() { }
~constant_init() { /* do nothing, union member is not destroyed */ }
};
constinit constant_init global;
}

A* get() { return &global.obj; }

C++20 supports constexpr destructors:

template <class T> union no_destroy {
  template <typename... Ts>
  explicit constexpr no_destroy(Ts&&... args) : obj(std::forward<Ts>(args)...) {}
  constexpr ~no_destroy() {}
  T obj;
};

Libraries like absl::NoDestructorand folly::Indestructibleoffer similar functionality. The absl version optimizes for triviallydestructible types.

Compiler optimization for no-op destructors

Ideally, compilers should optimize out exit-time destructors forempty user-provided destructors:

struct C { C(); ~C() {} };
void foo() { static C c; }

LLVM has addressed this since2011. Its GlobalOpt pass eliminates __cxa_atexit callsrelated to empty destructors, along with other global variableoptimizations.

In contrast, GCC has had an open feature request for this optimization since 2005.

no_destroy attribute

Clang supports [[clang::no_destroy]] (alternative form:__attribute__((no_destroy))) to disable exit-timedestructors for variables of static or thread storage duration. Its-fno-c++-static-destructors option allows disablingexit-time destructors globally.

  • July 2018 discussion: https://discourse.llvm.org/t/rfc-suppress-c-static-destructor-registration/49128
  • Patch: https://reviews.llvm.org/D50994 with follow-up https://reviews.llvm.org/D54344
  • Documentation: https://clang.llvm.org/docs/AttributeReference.html#no-destroy

Standardization efforts for this attribute are underway P1247R0.

I recently encountered a scenario where the no_destroyattribute would have been beneficial. I've filed a GCC feature request(PR114357) after I learnedthat GCC doesn't have the attribute.

Case study

LLVM provides ManagedStatic to construct an object on demand (good for reducing startup time) and to make destruction explicit through llvm_shutdown. ManagedStatic is intended to be used at namespace scope. A prime example is LLVM's statistics mechanism (-stats and -time-passes).

Programs using LLVM can strategically avoid callingllvm_shutdown for fast teardown by skipping somedestructors. The lld linker employs this approach unless theLLD_IN_TEST environment variable is set to a non-zerointeger.

DSO plugin users requiring library unloading may findManagedStatic unsuitable. This is because:

  • A DSO may not be able to determine if other active LLVM users existwithin the process, making it unsafe to callllvm_shutdown.
  • If llvm_shutdown is deferred until around program exit,executing destructors becomes unsafe once the DSO's code has beenremoved.

The mold linker improves perceived linking speed by spawning aseparate process for the linking task. This allows the parent process(the one launched from the shell or other programs) to exit early. Thisapproach eliminates overhead associated with static destructors andother operations.

A compact relocation format for ELF

作者 MaskRay
2024年3月9日 16:00

This article introduces CREL (previously known as RELLEB), a newrelocation format offering incredible size reduction (LLVMimplementation in my fork).

ELF's design emphasizes natural size and alignment guidelines for itscontrol structures. This principle, outlined in Proceedings of theSummer 1990 USENIX Conference, ELF: An Object File to MitigateMischievous Misoneism, promotes ease of random access forstructures like program headers, section headers, and symbols.

All data structures that the object file format defines follow the"natural" size and alignment guidelines for the relevant class. Ifnecessary, data structures contain explicit padding to ensure 4-bytealignment for 4-byte objects, to force structure sizes to a multiple offour, etc. Data also have suitable alignment from the beginning of thefile. Thus, for example, a structure containing an Elf32_Addr memberwill be aligned on a 4-byte boundary within the file. Other classeswould have appropriately scaled definitions. To illustrate, the 64-bitclass would define Elf64 Addr as an 8-byte object, aligned on an 8-byteboundary. Following the strictest alignment for each object allows theformat to work on any machine in a class. That is, all ELF structures onall 32-bit machines have congruent templates. For portability, ELF usesneither bit-fields nor floating-point values, because theirrepresentations vary, even among pro- cessors with the same byte order.Of course the programs in an ELF file may use these types, but theformat itself does not.

While beneficial for many control structures, the natural sizeguideline presents significant drawbacks for relocations. Sincerelocations are typically processed sequentially, they don't gain thesame random-access advantages. The large 24-byte Elf64_Relastructure highlights the drawback. For a detailed comparison ofrelocation formats, see Exploringobject file formats#Relocations.

Furthermore, Elf32_Rel and Elf32_Relasacrifice flexibility to maintain a smaller size, limiting relocationtypes to a maximum of 255. This constraint has become noticeable forAArch32 and RISC-V, and especially when platform-specific relocationsare needed. While the 24-bit symbol index field is less elegant, ithasn't posed significant issues in real-world use cases.

In contrast, the WebAssembly object file format uses LEB128 encoding for relocations and other control structures, offering a significant size advantage over ELF.

Inspired by WebAssembly, I will start discussion with a genericcompression algorithm and then propose an alternative format (CREL) thataddresses ELF's limitations.

Compressed relocations

While the standard SHF_COMPRESSED feature is commonlyused for debug sections, its application can easily extend to relocationsections. I have developed a Clang/lld prototype that demonstrates thisby compressing SHT_RELA sections.

The compressed SHT_RELA section occupiessizeof(Elf64_Chdr) + size(compressed) bytes. Theimplementation retains uncompressed content if compression would resultin a larger size.

In scenarios with numerous smaller relocation sections (such as whenusing -ffunction-sections -fdata-sections), the 24-byteElf64_Chdr header can introduce significant overhead. Thisobservation raises the question of whether encodingElf64_Chdr fields using ULEB128 could further optimize filesizes. With larger monolithic sections (.text,.data, .eh_frame), compression ratio would behigher as well.

# configure-llvm is my wrapper of cmake that specifies some useful options.
configure-llvm s2-custom0 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld'
configure-llvm s2-custom1 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS=-Xclang=--compress-relocations=zstd
ninja -C /tmp/out/s2-custom0 lld
ninja -C /tmp/out/s2-custom1 lld

ruby -e 'p Dir.glob("/tmp/out/s2-custom0/**/*.o").sum{|f| File.size(f)}' # 135996752
ruby -e 'p Dir.glob("/tmp/out/s2-custom1/**/*.o").sum{|f| File.size(f)}' # 116424688

Relocations consume a significant portion (approximately 20.9%) ofthe file size. Despite the overhead of-ffunction-sections -fdata-sections, the compressiontechnique yields a significant reduction of 14.5%!

However, dropping in-place relocation processing is a downside.

CREL relocation format

The 1990 ELF paper ELF: An Object File to Mitigate Mischievous Misoneism says "ELF allows extension and redefinition for other control structures." Let's explore CREL, a new and more compact relocation format designed to replace REL and RELA. Our emphasis is on simplicity over absolutely minimal encoding. This is achieved by using a byte-oriented encoding that avoids complex compression techniques (e.g., dictionary-based compression, entropy encoders). As a byte-oriented format, CREL relocations can be further compressed by other codecs, if desired. Using CREL in relocatable files can also decrease memory usage.

See the end of the article for a detailed format description.

A SHT_CREL section (preferred name: .crel<name>) holds compact relocation entries that decode to Elf32_Rela or Elf64_Rela depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded relocation count, followed by entries encoding r_offset, r_type, r_symidx, and r_addend. The entries use ULEB128 and SLEB128 exclusively, and there is no endianness difference.

Here are key design choices:

Relocation count (ULEB128):

This allows for efficient retrieval of the relocation count without decoding the entire section. While a uint32_t (like SHT_HASH) could be used, ULEB128 aligns with subsequent entries, removes endianness differences, and offers a slight size advantage in most cases, when the relocation count can be encoded in one to three bytes.

Shifted offset:

64-bit data sections frequently have absolute relocations spaced 8 bytes apart. Additionally, in RISC architectures, offsets are often multiples of 2 or 4. A shift value of 2 allows delta offsets within the [0, 64) range to be encoded in a single byte, often avoiding the need for two-byte encoding. In an AArch64 -O3 build, the shifted offset technique reduces size(.crel*) by 12.8%.

Many C++ virtual tables have the first relocation at offset 0x10. Inthe absence of the shifted offset technique, the relocation at offset0x10 cannot be encoded in one byte.

Relocation section '.crel.data.rel.ro._ZTVN12_GLOBAL__N_113InlineSpillerE' at offset 0x116fe contains 5 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000010 0000007e00000001 R_X86_64_64 0000000000000000 _ZN4llvm7Spiller6anchorEv + 0
0000000000000018 0000000c00000001 R_X86_64_64 0000000000000000 .text._ZN12_GLOBAL__N_113InlineSpillerD2Ev + 0
0000000000000020 0000000f00000001 R_X86_64_64 0000000000000000 .text._ZN12_GLOBAL__N_113InlineSpillerD0Ev + 0
0000000000000028 0000001100000001 R_X86_64_64 0000000000000000 .text._ZN12_GLOBAL__N_113InlineSpiller5spillERN4llvm13LiveRangeEditE + 0
0000000000000030 0000001a00000001 R_X86_64_64 0000000000000000 .text._ZN12_GLOBAL__N_113InlineSpiller16postOptimizationEv + 0

The shifted offset also works really well for dynamic relocations, whose offset differences are almost always a multiple of sizeof(Elf_Addr).
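A small illustration of the idea (my own sketch; it ignores the flag bits stolen from the offset, which are described later): with shift=3, entries 8 bytes apart yield a delta of 1, which always fits in one ULEB128 byte.

#include <stdint.h>
#include <stdio.h>

int main(void) {
  const uint64_t offsets[] = {0x10, 0x18, 0x20, 0x28, 0x30};
  const unsigned shift = 3;
  uint64_t prev = 0;
  for (unsigned i = 0; i < 5; i++) {
    uint64_t delta = (offsets[i] - prev) >> shift; /* offsets are multiples of 8 */
    printf("offset %#llx -> shifted delta %llu\n",
           (unsigned long long)offsets[i], (unsigned long long)delta);
    prev = offsets[i];
  }
  return 0;
}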

Addend bit:

CREL was initially designed with explicit addends in each relocation entry. However, this approach created redundancy when I extended CREL to dynamic relocations (discussed thoroughly in another section). While RELR remains an option, the goal is for CREL to be a viable alternative even without RELR. In addition, when implicit addends are used, I feel sad if one bit in every relocation entry is wasted.

To address these concerns, a single bit flag has been introduced inthe CREL header:

  • addend_bit==1: The flags include a bit to signal the presence of the delta addend.
  • addend_bit==0: The flag bit is stolen to encode one more bit for the offset. The addend is implicit and not encoded within the relocation.

This bit flag effectively balances efficiency by avoiding unnecessarystorage for dynamic relocations while maintaining flexibility for casesrequiring explicit addend values.

Assembler produced SHT_CREL sections are supposed toalways set the addend_bit bit.

Delta encoding for r_offset (ULEB128):

Section offsets can be large, and relocations are typically ordered.Storing the difference between consecutive offsets offers compressionpotential. In most cases, a single byte will suffice. While there areexceptions (general dynamic TLS model of s390/s390x uses a local"out-of-order" pair:R_390_PLT32DBL(offset=o) R_390_TLS_GDCALL(offset=o-2)), weare optimizing for the common case.

For ELFCLASS32, r_offset members are calculated using modular arithmetic modulo 4294967296.

Delta encoding for r_symidx (SLEB128):

This is done more for consistency than for the size benefit.

Absolute symbol indexes allow one-byte encoding for symbols in the range [0,128) and offer a minor size advantage for static relocations when the symbol table is sorted by usage frequency. Delta encoding, on the other hand, might optimize for the scenario when the symbol table exhibits locality: neighboring symbols are frequently mutually called.

Delta symbol indexes enable one-byte encoding for GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by symbol index. For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically be encoded with repeated 0x05 0x01 (when addend_bit==0 && shift==3, offset++, symidx++). Delta encoding has a disadvantage, though: absolute encoding can partially claim this optimization by arranging symbols in a "cold0 hot cold1" pattern.

In my experiments, absolute encoding with ULEB128 results in slightly larger .o file sizes for both x86-64 and AArch64 builds.

Delta encoding for r_type (SLEB128):

Some psABIs utilize relocation types greater than 128. AArch64's static relocation types begin at 257 and dynamic relocation types begin at 1024, necessitating two bytes with ULEB128/SLEB128 encoding in the absence of delta encoding. Delta encoding allows all but the first relocation's type to be encoded in a single byte. An alternative design is to define a base type in the header and encode types relative to the base type, which would introduce slight complexity.

If the AArch32 psABI could be redesigned, allocating [0,64) for Thumb relocation types and [64,*) for ARM relocation types would optimize delta encoding even further.

While sharing a single type code for multiple relocations would be efficient, it would require reordering relocations. This conflicts with order requirements imposed by several psABIs and could complicate linker implementations.

Delta encoding for addend (SLEB128):

Encoding the delta addend offers a slight size advantage andoptimizes for cases like:

1
2
3
4
.quad .data + 0x78
.quad .data + 0x80
.quad .data + 0x88
...

Symbol index/type/addend omission

Relocations often exhibit patterns that can be exploited for size reduction:

  • Symbol index/type remain constant with varying addends, common for STT_SECTION symbols, e.g. .rodata, .eh_frame, .debug_str_offsets, .debug_names, .debug_line, and .debug_addr.
  • Type/addend remain constant with varying symbol indexes, common for non-section symbols, e.g. function calls, C++ virtual tables, and dynamic relocations.
1
2
3
4
// Type/addend do not change.
// R_AARCH64_CALL(g0), ...
// R_X86_64_PLT32(g0-4), ...
void f() { g0(); g1(); g2(); }

We use the least significant bits of the offset member to signal the presence of symbol index, type, and addend information. This allows us to omit delta fields when they match the previous entry.

CREL steals 3 bits from the offset member. I have tried stealing just a bit and utilizing a negative symbol index to signal type/addend omission, but offsets generally require fewer bits to encode, and stealing bits from offsets is superior.

Omitting the symbol index information is especially beneficial for reducing debug build sizes. For example, most .debug_names relocations can be encoded using only 4 bytes (offset and delta addend), instead of the 7 bytes required otherwise.

While RISC architectures often require multiple relocations with different types to access global data, making type omission slightly less beneficial than on x86, the frequent use of call instructions offers a large size reduction with type omission.

With a limited number of types and frequent zero addends (exceptR_*_RELATIVE and R_*_IRELATIVE), dynamicrelocations also benefit from type/addend omission.


I have developed a prototype at https://github.com/MaskRay/llvm-project/tree/demo-crel. CREL demonstrates superior size reduction compared to the SHF_COMPRESSED SHT_RELA approach.

LEB128 amongvariable-length integer encodings

LEB128 and UTF-8 stand out as the two most commonly used byte-oriented, variable-length integer encoding schemes. Binary encodings often employ LEB128. While alternatives like PrefixVarInt (or a suffix-based variant) might excel when encoding larger integers, LEB128 offers advantages when most integers fit within one or two bytes, as it avoids the need for shift operations in the common one-byte representation.

While we could utilize zigzag encoding (i>>31) ^ (i<<1) to convert SLEB128-encoded type/addend to use ULEB128 instead, the generated code is inferior to or on par with SLEB128 for one-byte encodings on x86, AArch64, and RISC-V.

1
2
3
4
5
6
7
8
9
// One-byte case for SLEB128
int64_t from_signext(uint64_t v) {
  return v < 64 ? v : v - 128;
}

// One-byte case for ULEB128 with zig-zag encoding
int64_t from_zigzag(uint64_t z) {
  return (z >> 1) ^ -(z & 1);
}
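For completeness, a sketch of the full 64-bit zigzag transform (the (i>>31) ^ (i<<1) form quoted above is the 32-bit version):

#include <cstdint>

uint64_t zigzag_encode(int64_t i) { return (uint64_t(i) << 1) ^ uint64_t(i >> 63); }
int64_t zigzag_decode(uint64_t z) { return int64_t(z >> 1) ^ -int64_t(z & 1); }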

While some variable-length integer schemes allocate more integers within the one-byte bucket, I do not believe they would lead to a noticeable improvement over LEB128, and their complexity is higher than LEB128's. For example, when I assign one extra bit to offsets (by clearing addend_bit in the header), .crel.dyn merely decreases by 2.5%.

Here is an extremely simple C decoder implementation for ULEB128 and SLEB128. The clever use of 64/128 is from Stefan O'Rear. The return type uint64_t can be changed to size_t when used in a dynamic loader.

1
2
3
4
5
6
7
8
9
10
11
12
static uint64_t read_leb128(unsigned char **buf, uint64_t sleb_uleb) {
  uint64_t acc = 0, shift = 0, byte;
  do {
    byte = *(*buf)++;
    acc |= (byte - 128*(byte >= sleb_uleb)) << shift;
    shift += 7;
  } while (byte >= 128);
  return acc;
}

uint64_t read_uleb128(unsigned char **buf) { return read_leb128(buf, 128); }
int64_t read_sleb128(unsigned char **buf) { return read_leb128(buf, 64); }
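The matching encoders are equally short (a sketch of mine, not from the prototype); unlike the shared decoder above, SLEB128 needs its own termination test on the sign bit.

#include <stdint.h>

static void write_uleb128(unsigned char **buf, uint64_t v) {
  while (v >= 128) {
    *(*buf)++ = 0x80 | (v & 0x7f);
    v >>= 7;
  }
  *(*buf)++ = v;
}

static void write_sleb128(unsigned char **buf, int64_t v) {
  for (;;) {
    unsigned char byte = v & 0x7f;
    v >>= 7;  // arithmetic shift on common implementations
    if ((v == 0 && !(byte & 0x40)) || (v == -1 && (byte & 0x40))) {
      *(*buf)++ = byte;
      return;
    }
    *(*buf)++ = byte | 0x80;
  }
}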

I have used a modified lld to analyze the LEB128 length distribution in an x86-64 release build of lld with CREL enabled.

1
2
3
4
5
6
7
8
9
10
zo[std::min(getULEB128Size(offset-old_offset), 3u) - 1]++;
if (b & 1) {
auto x = readSLEB128(p); symidx += x; zs[std::min(getSLEB128Size(x), 3u) - 1]++;
}
if (b & 2) {
auto x = readSLEB128(p); type += x; zt[std::min(getSLEB128Size(x), 3u) - 1]++;
}
if (b & 4) {
auto x = readSLEB128(p); addend += x; za[std::min(getSLEB128Size(x), 3u) - 1]++;
}

The distribution of ULEB128/SLEB128 lengths is:

1
2
3
4
5
        1        2        3+     any
offset  633056   48846    0      681902
type    187759   0        0      187759
symidx  360230   229610   879    590719
addend  191523   52293    2899   246715

80.5% of the LEB128 encodings are 1 byte and 19.2% are 2 bytes. 2-byte delta symidx members are quite common, but I do not plan to steal bits from other members for symidx.

VLQ (variable-length quantity) used by the MIDI file format is a big-endian variant of LEB128. While VLQ offers some advantages:

  • Lexicographic ordering: This can be beneficial for database sorting but isn't a factor for object files.
  • Potential for better sign extension.

git uses a bijective variant of VLQ. While bijectivity is a niceproperty, it is not useful.

The potential benefits are outweighed by a slightly more complexencoder and advantages of sticking with popular LEB128 in objectfiles.

https://sqlite.org/src4/doc/trunk/www/varint.wiki describes a scheme that encodes 0~240 in one byte, but it is a bit complex. A simplified version I explored led to larger relocatable files and dynamic relocations compared to LEB128. Additionally, implementing 67-bit delta offsets and flags without 128-bit integer support would be cumbersome.

1
2
3
4
5
6
7
8
9
uint64_t read_vu128(const uint8_t *&p) {
  uint8_t b = *p++;
  if (b < 0xf0) return b;
  b &= 0x0f;
  uint64_t value = 0xf0 + *p++;
  for (unsigned i = 1; i <= b; ++i)
    value += uint64_t(*p++) << i * 8;
  return value;
}

Experiments

build                format  .o size     size(.rel*)  .o size decrease
-O3                  RELA    136012504   28235448
-O3                  CREL    111583312   3806234      18.0%
aarch64 -O3          RELA    124965808   25855800
aarch64 -O3          CREL    102529784   3388307      18.0%
ppc64le -O3          RELA    129017272   26589192
ppc64le -O3          CREL    105860576   3432419      17.9%
riscv64 -O3          RELA    227189744   91396344
riscv64 -O3          CREL    149343352   13549699     34.3%
-O1 -g               RELA    1506173760  340965576
-O1 -g               CREL    1202445768  37237274     20.2%
-O3 -g $SPLIT        RELA    549003848   104227128
-O3 -g $SPLIT        CREL    459768736   14992114     16.3%

SPLIT="-gpubnames -gsplit-dwarf"

Let's compare the x86_64 -O3 builds of lld. size(.crel*)/size(.rel*) = 3806234 / 28235448 = 13.5%. The total .o file size has decreased by 18.0%. In addition, the maximum resident set size of the linker (also lld) using mimalloc has decreased by 4.2%.

It would be interesting to explore the potential gains of combiningzstd compression with CREL.

1
2
3
4
configure-llvm s2-custom3 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS='-Wa,--crel -Xclang --compress-relocations=zstd'
ninja -C /tmp/out/s2-custom3 lld

ruby -e 'p Dir.glob("/tmp/out/s2-custom3/**/*.o").sum{|f| File.size(f)}' # 111383192

I debated whether to name the new section SHT_RELOC (.reloc<name>) or SHT_RELLEB (.relleb<name>). Ultimately, I chose SHT_CREL because its unique name minimizes potential confusion, whereas SHT_RELOC could be confused with SHT_REL and SHT_RELA, and the LEB128 part is not that strong.

Case study

Let's explore some real-world scenarios where relocation size iscritical.

Marker relocations

Marker relocations are utilized to indicate that certain linker optimizations/relaxations are applicable. While many marker relocations are rarely used, RISC-V relocatable files are typically filled up with R_RISCV_RELAX relocations. Their size contribution is quite substantial.

.llvm_addrsig

On many Linux targets, Clang emits a special section called .llvm_addrsig (type SHT_LLVM_ADDRSIG, LLVM address-significance table) by default to allow ld.lld --icf=safe. The .llvm_addrsig section stores symbol indexes in ULEB128 format, independent of relocations. Consequently, tools like ld -r and objcopy risk invalidating the section due to symbol table modifications.

Ideally, using relocations would allow certain operations. However,the size concern of REL/RELA in ELF hinders this approach. In contrast,lld's Mach-O port chosea relocation-based representation for__DATA,__llvm_addrsig.

If CREL is adopted, we can consider switching to the relocationrepresentation.

.llvm.call-graph-profile

LLVM leverages a special section called .llvm.call-graph-profile (type SHT_LLVM_CALL_GRAPH_PROFILE) for both instrumentation- and sample-based profile-guided optimization (PGO). lld utilizes this information ((from_symbol, to_symbol, weight) tuples) to optimize section ordering within an input section description, enhancing cache utilization and minimizing TLB thrashing.

Similar to .llvm_addrsig, the.llvm.call-graph-profile section initially faced the symbolindex invalidation problem, which was solved by switching torelocations. I opted for REL over RELA to reduce code size.

DWARF sections

In a non-split-DWARF build, .rela.debug_str_offsets and .rela.debug_addr consume a significant portion of the file size.

DWARF v5 accelerated name-based access with the introduction of the .debug_names section. However, in a clang -g -gsplit-dwarf -gpubnames generated relocatable file, the .rela.debug_names section can consume a significant portion (approximately 10%) of the file size.

1
2
3
4
5
6
7
Relocation section '.crel.debug_names' at offset 0x65c0 contains 200 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
000000000000002c 0000002f0000000a R_X86_64_32 0000000000000000 .debug_info + 0
00000000000004d8 000000320000000a R_X86_64_32 0000000000000000 .debug_str + 1ab
00000000000004dc 000000320000000a R_X86_64_32 0000000000000000 .debug_str + f61
00000000000004e0 000000320000000a R_X86_64_32 0000000000000000 .debug_str + f7f
...

This size increase has sparked discussions within the LLVM community about potentially altering the file format for linking purposes.

.debug_line and .debug_addr also contribute a lot of relocations.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
Relocation section '.crel.debug_addr' at offset 0x64f1 contains 51 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000008 0000004300000001 R_X86_64_64 0000000000000000 _ZN4llvm30VerifyDisableABIBreakingChecksE + 0
0000000000000010 0000002d00000001 R_X86_64_64 0000000000000000 .rodata.str1.1 + 0
0000000000000018 0000002d00000001 R_X86_64_64 0000000000000000 .rodata.str1.1 + b
0000000000000020 0000002d00000001 R_X86_64_64 0000000000000000 .rodata.str1.1 + 13
...

Relocation section '.crel.debug_line' at offset 0x69a5 contains 81 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000022 000000350000000a R_X86_64_32 0000000000000000 .debug_line_str + 0
0000000000000026 000000350000000a R_X86_64_32 0000000000000000 .debug_line_str + 18
000000000000002a 000000350000000a R_X86_64_32 0000000000000000 .debug_line_str + 2c
...

Many adjacent relocations share the same section symbol. CREL cancompress most relocations into a few bytes depending on addend sizes(offset+=O, addend+=A).

By teaching the assembler to use implicit addends, we achieve even greater size reduction by compressing most relocations into a single byte (mostly 0x04, offset += 1<<3). However, this might harm compression in the presence of --compress-debug-sections=zstd. Personally, I recommend that we don't use implicit addends.

CREL for dynamic relocations

CREL excels with static relocations, but what about the dynamiccase?

A substantial part of position-independent executables (PIEs) and dynamic shared objects (DSOs) is occupied by dynamic relocations. While RELR (a compact relative relocation format) offers size-saving benefits for relative relocations, other dynamic relocations can benefit from a compact relocation format. There are a few properties:

  • There are far fewer relocation types.
  • The offsets are often adjacent by Elf_Addr. No two dynamic relocations can share the same offset.
  • Each symbol is associated with very few dynamic relocations, typically 1 or 2 (R_*_JUMP_SLOT and R_*_GLOB_DAT). When a symbol is associated with more dynamic relocations, it is typically a base class function residing in multiple C++ virtual tables, e.g. __cxa_pure_virtual. -fexperimental-relative-c++-abi-vtables would eliminate such dynamic relocations.

Android's packed relocation format (linker implementation:ld.lld --pack-dyn-relocs=android) was an earlier designthat applies to all dynamic relocations at the cost of complexity. Itreplaces .rel.dyn/.rela.dyn but does notchange the section name.

Additionally, Apple linkers and dyld use LEB128 encoding for bindopcodes.

Here is a one-liner to dump the relative relocation and non-relativerelocation sizes for each shared object in a system library directory:

1
ruby -e 'Dir.glob("/usr/lib/x86_64-linux-gnu/*.so.*").each{|f| next if File.symlink?(f) || `file #{f}`!~/shared object/; s=`readelf -Wr #{f}`.lines; nr=s.count{|x|x=~/R_/&&x !~/_RELATIVE/}*24; r=s.count{|x|x=~/_RELATIVE/}*24; s=File.size(f); puts "#{f}\t#{s}\t#{r} (#{(r*100.0/s).round(2)}%)\t#{nr} (#{(nr*100.0/s).round(2)}%)" }'

I believe CREL compresses dynamic relocations well, but it is far from an optimal dynamic relocation format. A generalized RELR format would leverage the dynamic relocation properties well. Here's a possible encoding:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// R_*_RELATIVE group
Encode(length of the group)
Encode(R_*_RELATIVE)
RELR

// R_*_GLOB_DAT/absolute address relocation group
Encode(length of the group)
Encode(R_*_GLOB_DAT)
Use RELR to encode offsets
Encode symbol indexes separately

// R_*_JUMP_SLOT group
Encode(length of the group)
Encode(R_*_JUMP_SLOT)
Use RELR to encode offsets
Encode symbol indexes separately

...

We need to enumerate all dynamic relocation types, including R_*_IRELATIVE and R_*_TLSDESC used by some ports. Some R_*_TLSDESC relocations have a symbol index of zero, but the straightforward encoding does not utilize this property.

Traditionally, we have two dynamic relocation ranges for executables and shared objects (except static position-dependent executables):

  • .rela.dyn ([DT_RELA, DT_RELA + DT_RELASZ)) or .rel.dyn ([DT_REL, DT_REL + DT_RELSZ))
  • .rela.plt ([DT_JMPREL, DT_JMPREL + DT_PLTRELSZ)): Stores JUMP_SLOT relocations. DT_PLTREL specifies DT_REL or DT_RELA.

IRELATIVE relocations can be placed in either range, but preferably in .rel[a].dyn.

Some GNU ld ports (e.g. SPARC) treat .rela.plt as asubset of .rela.dyn, introducing complexity for dynamicloaders.

CREL adoption considerations

  • New dynamic tag (DT_CREL): To identify CREL relocations, separate from the existing DT_REL/DT_RELA.
  • No DT_CRELSZ: The relocation count can be derived from the CREL header.
  • The output section description .rela.dyn : { *(.rela.dyn) *(.rela.plt) } is incompatible with CREL.

Challenges with lazy binding

glibc's lazy binding scheme relies on random access to relocation entries within the DT_JMPREL table. CREL's sequential nature prevents this. However, eager binding doesn't require random access. Therefore, when -z now (eager binding) is enabled, we can:

  • Set DT_PLTREL to DT_CREL.
  • Replace .rel[a].plt with .crel.plt.

Challenges with statically linked position-dependentexecutables

glibc introduces additional complexity for IRELATIVE relocations in statically linked position-dependent executables. They should only contain IRELATIVE relocations and no other dynamic relocations.

glibc's csu/libc-start.c processes IRELATIVE relocations in the range [__rela_iplt_start, __rela_iplt_end) (or [__rel_iplt_start, __rel_iplt_end), determined at build time through ELF_MACHINE_IREL). While CREL relocations cannot be decoded starting from the middle of the section, we can still place IRELATIVE relocations in .crel.dyn because there wouldn't be any other relocation types (position-dependent executables don't have RELATIVE relocations). When CREL is enabled, we can define __crel_iplt_start and __crel_iplt_end for statically linked position-dependent executables.

If glibc only intends to support addend_bit==0, the codecan simply be:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
extern const uint8_t __crel_iplt_start[] __attribute__ ((weak));
extern const uint8_t __crel_iplt_end[] __attribute__ ((weak));
if (&__crel_iplt_start != &__crel_iplt_end) {
  const uint8_t *p = __crel_iplt_start;
  size_t offset = 0, count = read_uleb128 (&p), shift = count & 3;
  for (count >>= 3; count; count--) {
    uint8_t rel_head = *p++;
    offset += rel_head >> 2;
    if (rel_head & 128)
      offset += (read_uleb128 (&p) << 5) - 32;
    if (rel_head & 2)
      read_sleb128 (&p);
    elf_crel_irel ((ElfW (Addr) *) (offset << shift));
  }
}

Considering implicit addends for CREL

Many dynamic relocations have zero addends:

  • COPY/GLOB_DAT/JUMP_SLOT relocations only use zero addends.
  • Absolute relocations could use non-zero addends with an STT_SECTION symbol, but linkers convert them to relative relocations.

Usually only RELATIVE/IRELATIVE and potentially TPREL/TPOFF mightrequire non-zero addends. Switching from DT_RELA toDT_REL offers a minor size advantage.

I considered defining two separate dynamic tags (DT_CRELand DT_CRELA) to distinguish between implicit and explicitaddends. However, this would have introduced complexity:

  • Should llvm-readelf -r dump the zero addends forDT_CRELA?
  • Should dynamic loaders support both dynamic tags?

I placed the delta addend bit next to the offset bits so that it can be reused for offsets. Thanks to Stefan O'Rear for convincing me that my original idea of reserving a single bit flag (addend_bit) within the CREL header is elegant. Dynamic loaders prioritizing simplicity can hardcode the desired addend_bit value.

ld.lld -z crel defaults to implicit addends(addend_bit==0), but the option of using in-relocationaddends is available with -z crel -z rela.

DT_AARCH64_AUTH_RELR vs CREL

The AArch64 PAuth ABI introduces DT_AARCH64_AUTH_RELR asa variant of RELR for signed relocations. However, its benefit seemslimited.

In a release build of Clang 16, using -z crel -z rela resulted in a .crel.dyn section size of only 1.0% of the file size. Notably, enabling implicit addends with -z crel -z rel further reduced the size to just 0.3%. While DT_AARCH64_AUTH_RELR would achieve a noticeably smaller relocation size if most relative relocations were encoded with it, the advantage seems less significant considering CREL's already compact size.

Furthermore, DT_AARCH64_AUTH_RELR introduces additional complexity to the linker due to its 32-bit addend limitation: the in-place 64-bit value encodes a 32-bit schema, leaving just 32 bits for the implicit addend. If the addend does not fit into 32 bits, DT_AARCH64_AUTH_RELR cannot be used. CREL with addends would avoid this complexity.

I have filed Quantifying thebenefits of DT_AARCH64_AUTH_RELR.


I've implemented ld.lld -z crel to replace.rel[a].dyn and .rel[a].plt with.crel.dyn and .crel.plt. Dynamic relocationsare sorted by (r_type, r_offset) to better utilizeCREL.

Let's link clang-16-release using RELA,-z pack-relative-relocs,--pack-dyn-relocs=android+relr, and-z pack-relative-relocs -z crel and analyze theresults.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
% fld.lld @response.txt -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
[ 8] .rela.dyn RELA 00000000005df318 5df318 c3a980 18 A 3 0 8
[ 9] .rela.plt RELA 0000000001219c98 1219c98 001f38 18 AI 3 26 8
% fld.lld @response.txt -z pack-relative-relocs -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
[ 8] .rela.dyn RELA 00000000005df340 5df340 011088 18 A 3 0 8
[ 9] .relr.dyn RELR 00000000005f03c8 5f03c8 0259d0 08 A 0 0 8
[10] .rela.plt RELA 0000000000615d98 615d98 001f38 18 AI 3 27 8
% fld.lld @response.txt --pack-dyn-relocs=android+relr -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
[ 8] .rela.dyn ANDROID_RELA 00000000005df318 5df318 0011fc 01 A 3 0 8
[ 9] .relr.dyn RELR 00000000005e0518 5e0518 0259d0 08 A 0 0 8
[10] .rela.plt RELA 0000000000605ee8 605ee8 001f38 18 AI 3 27 8
% fld.lld @response.txt -z pack-relative-relocs -z crel -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
[ 8] .crel.dyn CREL 00000000005df340 5df340 000fbc 00 A 3 0 8
[ 9] .relr.dyn RELR 00000000005e0300 5e0300 0259d0 08 A 0 0 8
[10] .rel.plt REL 0000000000605cd0 605cd0 0014d0 10 AI 3 27 8
% fld.lld @response.txt -z crel -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
[ 8] .crel.dyn CREL 00000000005df318 5df318 082c29 00 A 3 0 8
[ 9] .rel.plt REL 0000000000661f48 661f48 0014d0 10 AI 3 26 8
% fld.lld @response.txt -z crel -z rela -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
[ 8] .crel.dyn CREL 00000000005df318 5df318 1b8c69 00 A 3 0 8
[ 9] .rela.plt RELA 0000000000797f88 797f88 001f38 18 AI 3 26 8

Analysis

  • Relative relocations usually outnumber non-relative relocations.
  • RELR significantly optimizes relative relocations, offering the largest size reduction.
  • CREL further improves the non-relative portion, compressing that portion to 5.77%, even better than Android packed relocations (6.60%)! Android's r_info sharing with RELOCATION_GROUPED_BY_INFO_FLAG has been overshadowed by our shifted offset technique.
  • The non-relative relocation advantage is less pronounced since .relr.dyn still accounts for a significant portion of the size.
  • .crel.dyn using DT_CREL (implicit addends) without RELR is more than 3x as large as RELR .relr.dyn+.rela.dyn.

Decoding ULEB128/SLEB128 would necessitate more work in the dynamicloader.

I have a glibc patch for DT_CREL support. In an x86-64 -O2 build, CREL support linked with -z pack-relative-relocs -z crel increases the size of libc.so by just 200 bytes (0.0256%) compared to a non-CREL build with -z pack-relative-relocs.

However, DT_CREL currently uses the non-standard value 0x40000026(see this comment: https://github.com/llvm/llvm-project/pull/91280#discussion_r1659384454).I hope that we can switch to 0x26 before lld gets -z crelsupport.

Linker notes

--emit-relocs and -r necessitate combining relocation sections. The output size may differ from the sum of the input sections. The total relocation count must be determined, a new header written, and the section content regenerated, as symbol indexes and addends may have changed. If the linker does not attempt to determine the offset shift in another relocation scan, the offset shift in the header can be set to 0. Debug sections, .eh_frame, and .gcc_except_table require special handling to rewrite relocations referencing a dead symbol to R_*_NONE. This also necessitates updating the relocation type.

--emit-relocs and -r copy CREL relocationsections (e.g. .crel.text) to the output. When.rela.text is also present, linkers are required to merge.rela.text into .crel.text.

GNU ld allows certain unknown section types:

  • [SHT_LOUSER,SHT_HIUSER] andnon-SHF_ALLOC
  • [SHT_LOOS,SHT_HIOS] andnon-SHF_OS_NONCONFORMING

but reports errors and stops linking for others (unless--no-warn-mismatch is specified). When linking arelocatable file using SHT_CREL, you might encounter errorslike the following:

1
2
3
4
5
6
% clang -Wa,--crel -fuse-ld=bfd a.c b.c
/usr/bin/ld.bfd: unknown architecture of input file `/tmp/a-1e0778.o' is incompatible with i386:x86-64 output
/usr/bin/ld.bfd: unknown architecture of input file `/tmp/b-9963f0.o' is incompatible with i386:x86-64 output
/usr/bin/ld.bfd: error in /tmp/a-1e0778.o(.eh_frame); no .eh_frame_hdr table will be created
/usr/bin/ld.bfd: error in /tmp/b-9963f0.o(.eh_frame); no .eh_frame_hdr table will be created
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Older lld and mold do not report errors. I have filed:

In addition, when there is one .eh_frame section withCIE pieces but no relocation, _bfd_elf_parse_eh_frame willreport an error.

mips64el

mips64el has an incorrect r_info: a 32-bit little-endiansymbol index followed by a 32-bit big-endian type. If mips64el decidesto adopt CREL, they can utilize this opportunity to fixr_info.

Data compression

I analyzed relocatable files in lld 18 (x86_64, -O3 builds) and extracted RELA and CREL relocations. I investigated the potential benefits of compressing these combined sections within the file.

It's important to consider that data compression, while beneficial for size reduction, can introduce challenges for random access and increase memory usage, especially for the linker. Therefore, it might not be a suitable solution.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
with open(f'{f}.bin', 'wb') as out, \
     open(f'{f}.bin.lz4', 'wb') as out_lz4, \
     open(f'{f}.bin.zst', 'wb') as out_zstd, \
     open(f'{f}.bin.xz', 'wb') as out_xz:
  for ofile in Path(where).glob('**/*.o'):
    content = bytes()
    elf = lief.parse(str(ofile))
    for sec in elf.sections:
      if sec.name.startswith(f'.{f}'):
        out.write(sec.content)
        content += sec.content
    sub = subprocess.run(['lz4'], input=content, capture_output=True, check=True)
    out_lz4.write(sub.stdout if len(sub.stdout) < len(content) else content)
    sub = subprocess.run(['zstd'], input=content, capture_output=True, check=True)
    out_zstd.write(sub.stdout if len(sub.stdout) < len(content) else content)
    sub = subprocess.run(['xz'], input=content, capture_output=True, check=True)
    out_xz.write(sub.stdout if len(sub.stdout) < len(content) else content)

CREL, being a byte-oriented format, allows for further compression. It can be regarded as a very efficient filter before a Lempel-Ziv compressor.

Even more surprisingly, CREL outperforms RELA compressed with lz4 (level 9) and zstd (level 6).

1
2
3
4
5
6
7
8
9
% stat -c '%s %n' *.bin*
3808775 crel.bin
2855535 crel.bin.lz4
2184912 crel.bin.xz
2319041 crel.bin.zst
28238016 rela.bin
9679414 rela.bin.lz4
2835180 rela.bin.xz
4281547 rela.bin.zst

Dynamic relocations are also worth investigating.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
ld.lld @response.txt -z now --pack-dyn-relocs=relr -o clang.relr
ld.lld @response.txt -z now --pack-dyn-relocs=android+relr -o clang.relr+android
ld.lld @response.txt -z now --pack-dyn-relocs=relr -z crel -o clang.relr+crel
llvm-objcopy --dump-section .rela.dyn=reladyn clang.relr /dev/null
llvm-objcopy --dump-section .crel.dyn=creldyn clang.relr+crel /dev/null
llvm-objcopy --dump-section .rela.dyn=androiddyn clang.relr+android /dev/null
xz -fk reladyn androiddyn creldyn
zstd -fk reladyn androiddyn creldyn
% stat -c '%s %n' reladyn* androiddyn* creldyn*
69768 reladyn
4236 reladyn.xz
4449 reladyn.zst
4604 androiddyn
1484 androiddyn.xz
1481 androiddyn.zst
3980 creldyn
1344 creldyn.xz
1367 creldyn.zst

Interestingly, both zstd and LZMA2's default levels on RELAoutperform Android's packed relocation format. Even better, CRELoutperforms them both!

The results do not suggest that we add a Lempel-Ziv compressor toCREL. It would significantly increase complexity and decoder memoryusage. The filesystem's transparent compression can handle this for usconveniently.

CREL proposal for the genericABI

The latest revision has been proposed at https://groups.google.com/g/generic-abi/c/ppkaxtLb0P0. Ihave also created:

In https://www.sco.com/developers/gabi/latest/ch4.sheader.html,make the following changes.

In Figure 4-9: Section Types,sh_type, append a row

SHT_CREL | 20

Add text:

SHT_CREL - The section holds compact relocation entries with explicitaddends. An object file may have multiple relocation sections. See''Relocation'' below for details.

In Figure 4-16: Special Sections, append

.crelname | SHT_CREL | see below

Change the text below:

.relname, .relaname, and .crelname

These sections hold relocation information, as described in''Relocation''. If the file has a loadable segment that includesrelocation, the sections' attributes will include the SHF_ALLOC bit;otherwise, that bit will be off. Conventionally, name is supplied by thesection to which the relocations apply. Thus a relocation section for.text normally would have the name .rel.text, .rela.text, or.crel.text.

In Figure 4-23: Relocation Entries, add:

1
2
3
4
5
6
7
8
9
10
11
12
13
typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_symidx;
  Elf32_Word r_type;
  Elf32_Sword r_addend;
} Elf32_Crel;

typedef struct {
  Elf64_Addr r_offset;
  Elf64_Word r_symidx;
  Elf64_Word r_type;
  Elf64_Sxword r_addend;
} Elf64_Crel;

Add text above "A relocation section references two othersections":

A SHT_CREL section holds compact relocation entries that decode to Elf32_Crel or Elf64_Crel depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded value count * 8 + addend_bit * 4 + shift (35-bit or 67-bit unsigned), where:

  • count: Relocation count (32-bit or 64-bit unsigned).
  • addend_bit: 1 indicates that relocation entries encode addends. 0 indicates implicit addends (stored in the location to be modified).
  • shift: The shift value (0 to 3) applies to delta_offset in relocation entries.

Relocation entries (which encode r_offset, r_symidx, r_type, and r_addend) follow the header. Note: r_info in traditional REL/RELA formats has been split into r_symidx and r_type, allowing uint32_t relocation types for ELFCLASS32 as well.

  • Delta offset and flags (ULEB128): Holds delta_offset * (addend_bit ? 8 : 4) + flags (35-bit or 67-bit unsigned), where:
    • delta_offset: Difference in r_offset from the previous entry (truncated to Elf32_Addr or Elf64_Addr), right shifted by shift.
    • flags: 0 to 7 if addend_bit is 1; otherwise 0 to 3.
    • flags & 1: Indicates whether a delta symbol index is present.
    • flags & 2: Indicates whether a delta type is present.
    • flags & 4: Indicates whether a delta addend is present.
  • Delta symbol index (SLEB128, if present): The difference in symbol index from the previous entry, truncated to a 32-bit signed integer.
  • Delta type (SLEB128, if present): The difference in relocation type from the previous entry, truncated to a 32-bit signed integer.
  • Delta addend (SLEB128, if present): The difference in addend from the previous entry, truncated to a 32-bit or 64-bit signed integer depending on the object file class.

ULEB128 or SLEB128 encoded values use the canonical representation(i.e., the shortest byte sequence). For the first relocation entry, theprevious offset, symbol index, type, and addend members are treated aszero.

Encoding/decoding the delta offset and flags does not need multi-precision arithmetic. We can just unroll and special-case the first iteration. The header can be encoded/decoded in a similar way. An implementation can assume that the relocation count cannot be larger than 2**61 and simplify the code.
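For instance, here is a sketch of decoding the header without 128-bit arithmetic (a hypothetical helper of mine, assuming per the note above that the count fits in 64 bits): the first byte carries shift, addend_bit, and the low 4 bits of count, and each continuation byte contributes 7 more count bits.

#include <cstdint>

// Returns the relocation count.
uint64_t decodeCrelHeader(const uint8_t *&p, bool &addendBit, unsigned &shift) {
  uint8_t b = *p++;
  shift = b & 3;
  addendBit = (b >> 2) & 1;
  uint64_t count = (b >> 3) & 0x0f;
  for (unsigned s = 4; b >= 0x80; s += 7) {
    b = *p++;
    count |= uint64_t(b & 0x7f) << s;
  }
  return count;
}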

Example C++ encoder:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
// encodeULEB128(uint64_t, raw_ostream &os);
// encodeSLEB128(int64_t, raw_ostream &os);

const uint8_t addendBit = config->isRela ? 4 : 0, flagBits = config->isRela ? 3 : 2;
Elf_Addr offsetMask = 8, offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (const Elf_Crel &rel : relocs)
  offsetMask |= rel.r_offset;
int shift = std::countr_zero(offsetMask);
encodeULEB128(relocs.size() * 8 + addendBit + shift, os);
for (const Elf_Crel &rel : relocs) {
  Elf_Addr deltaOffset = (rel.r_offset - offset) >> shift;
  offset = rel.r_offset;
  uint8_t b = (deltaOffset << flagBits) + (symidx != rel.r_symidx) +
              (type != rel.r_type ? 2 : 0) + (addend != rel.r_addend ? 4 : 0);
  if (deltaOffset < (0x80 >> flagBits)) {
    os << char(b);
  } else {
    os << char(b | 0x80);
    encodeULEB128(deltaOffset >> (7 - flagBits), os);
  }
  if (b & 1) {
    encodeSLEB128(static_cast<int32_t>(rel.r_symidx - symidx), os);
    symidx = rel.r_symidx;
  }
  if (b & 2) {
    encodeSLEB128(static_cast<int32_t>(rel.r_type - type), os);
    type = rel.r_type;
  }
  if (b & 4 & addendBit) {
    encodeSLEB128(std::make_signed_t<Elf_Addr>(rel.r_addend - addend), os);
    addend = rel.r_addend;
  }
}

Example C++ decoder:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// uint64_t decodeULEB128(uint8_t *&p);
// int64_t decodeSLEB128(uint8_t *&p);

const auto hdr = decodeULEB128(p);
const size_t count = hdr / 8, flagBits = hdr & 4 ? 3 : 2, shift = hdr % 4;
Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (size_t i = 0; i != count; ++i) {
  const uint8_t b = *p++;
  offset += b >> flagBits;
  if (b >= 0x80)
    offset += (decodeULEB128(p) << (7 - flagBits)) - (0x80 >> flagBits);
  if (b & 1)
    symidx += decodeSLEB128(p);
  if (b & 2)
    type += decodeSLEB128(p);
  if (b & 4 & hdr)
    addend += decodeSLEB128(p);
  rels[i] = {offset << shift, symidx, type, addend};
}

Both encoder and decoder can be simplified if the desiredaddend_bit is hardcoded, making flagBits aninteger literal.

In Figure 5-10: Dynamic Array Tags, d_tag, add:

DT_CREL | 38 | d_ptr | optional |optional

Add text below:

  • DT_CREL - This element is similar toDT_REL, except its table uses the CREL format. Therelocation count can be inferred from the header.

Update DT_PLTREL andDT_PLTRELSZ:

  • DT_PLTRELSZ: This element holds the total size, in bytes, of the relocation entries associated with the procedure linkage table. If an entry of type DT_JMPREL is present and the DT_PLTREL entry value is DT_REL or DT_RELA, a DT_PLTRELSZ must accompany it.
  • DT_PLTREL: This member specifies the type of relocation entry to which the procedure linkage table refers. The d_val member holds DT_REL, DT_RELA, or DT_CREL, as appropriate. All relocations in a procedure linkage table must use the same relocation type.

Abandoned proposal RELLEB (last revision)

A SHT_CREL section holds compact relocation entries thatdecode to Elf32_Crel or Elf64_Crel dependingon the object file class (32-bit or 64-bit). Its content begins with aULEB128-encoded relocation count, followed by entries encodingr_offset, r_symidx, r_type, andr_addend. Note that the r_info member intraditional REL/RELA formats has been split into separater_symidx and r_type members, allowinguint32_t relocation types for ELFCLASS32 as well.

In the following description,Elf_Addr/Elf_SAddr denoteuint32_t/int32_t for ELFCLASS32 oruint64_t/int64_t for ELFCLASS64.

  • First member (ULEB128): Holds 2 * delta_offset + eq (33-bit or 65-bit unsigned), where:
    • delta_offset: Difference in r_offset from the previous entry (Elf_Addr).
    • eq: Indicates if the symbol index/type match the previous entry (1 for match, 0 otherwise).
  • Second member (SLEB128) if eq is 1:
    • Difference in r_addend from the previous entry (Elf_SAddr).
  • Second member (SLEB128) if eq is 0:
    • If the type and addend match the previous entry, the encoded value is the symbol index; type and addend are omitted.
    • Otherwise, the bitwise NOT of the encoded value (33-bit signed) is the symbol index; delta type and delta addend follow:
      • Delta type (SLEB128): The difference in relocation type from the previous entry (32-bit signed).
      • Delta addend (SLEB128): The difference in r_addend relative to the previous entry (signed Elf_Addr).

The bitwise NOT of symbol index 0xffffffff is -0x100000000 (33-bit)instead of 0 (32-bit).

Encoder in pseudo-code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
encodeULEB128(relocs.size());
for (const Reloc &rel : relocs) {
  if (symidx == rel.r_symidx && type == rel.r_type) {
    // Symbol index/type match the previous entry. Encode the addend.
    encodeULEB128(2 * uint128_t(rel.r_offset - offset) + 1); // at most 65-bit
    encodeSLEB128(rel.r_addend - addend);
    addend = rel.r_addend;
  } else {
    encodeULEB128(2 * uint128_t(rel.r_offset - offset)); // at most 65-bit
    if (type == rel.r_type && addend == rel.r_addend) {
      // Type/addend match the previous entry. Encode the symbol index.
      encodeSLEB128(rel.r_symidx);
    } else {
      // No optimization is applied. Encode symbol index, type, and addend.
      encodeSLEB128(~static_cast<int64_t>(rel.r_symidx));
      encodeSLEB128(static_cast<int32_t>(rel.r_type - type));
      type = rel.r_type;
      encodeSLEB128(static_cast<Elf_SAddr>(rel.r_addend - addend));
      addend = rel.r_addend;
    }
    symidx = rel.r_symidx;
  }
  offset = rel.r_offset;
}

Encoding/decoding an unsigned 65-bit value does not need multi-precision arithmetic. We can just unroll and special-case the first iteration. Example C++ encoder:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
// encodeULEB128(uint64_t, raw_ostream &os);
// encodeSLEB128(int64_t, raw_ostream &os);

Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
encodeULEB128(relocs.size(), os);
for (const Reloc &rel : relocs) {
  auto deltaOffset = static_cast<uint64_t>(rel.r_offset - offset);
  offset = rel.r_offset;
  uint8_t odd = rel.r_symidx == symidx && rel.r_type == type, b = deltaOffset * 2 + odd;
  if (deltaOffset < 0x40) {
    os << char(b);
  } else {
    os << char(b | 0x80);
    encodeULEB128(deltaOffset >> 6, os);
  }
  symidx = rel.r_symidx;
  if (!odd && type == rel.r_type && addend == rel.r_addend) {
    encodeSLEB128(symidx, os);
  } else {
    if (!odd) {
      encodeSLEB128(~static_cast<int64_t>(symidx), os);
      encodeSLEB128(static_cast<int32_t>(rel.r_type - type), os);
      type = rel.r_type;
    }
    encodeSLEB128(std::make_signed_t<Elf_Addr>(rel.r_addend - addend), os);
    addend = rel.r_addend;
  }
}

Example C++ decoder:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// uint64_t decodeULEB128(uint8_t *&p);
// int64_t decodeSLEB128(uint8_t *&p);

size_t count = decodeULEB128(p);
Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (size_t i = 0; i != count; ++i) {
  const uint8_t b = *p++;
  offset += b >> 1;
  if (b >= 0x80)
    offset += (decodeULEB128(p) << 6) - 0x40;
  int64_t x = decodeSLEB128(p);
  if (b & 1) {
    addend += x;
  } else {
    if (x < 0) {
      x = ~x;
      type += decodeSLEB128(p);
      addend += decodeSLEB128(p);
    }
    symidx = x;
  }
  rels[i] = {offset, symidx, type, addend};
}

A new relocation format for ELF

Author: MaskRay
March 9, 2024 16:00

ELF's design emphasizes natural size and alignment guidelines for its control structures. This principle, outlined in Proceedings of the Summer 1990 USENIX Conference, ELF: An Object File to Mitigate Mischievous Misoneism, promotes ease of random access for structures like program headers, section headers, and symbols.

All data structures that the object file format defines follow the "natural" size and alignment guidelines for the relevant class. If necessary, data structures contain explicit padding to ensure 4-byte alignment for 4-byte objects, to force structure sizes to a multiple of four, etc. Data also have suitable alignment from the beginning of the file. Thus, for example, a structure containing an Elf32_Addr member will be aligned on a 4-byte boundary within the file. Other classes would have appropriately scaled definitions. To illustrate, the 64-bit class would define Elf64_Addr as an 8-byte object, aligned on an 8-byte boundary. Following the strictest alignment for each object allows the format to work on any machine in a class. That is, all ELF structures on all 32-bit machines have congruent templates. For portability, ELF uses neither bit-fields nor floating-point values, because their representations vary, even among processors with the same byte order. Of course the programs in an ELF file may use these types, but the format itself does not.

While beneficial for many control structures, the natural size guideline presents significant drawbacks for relocations. Since relocations are typically processed sequentially, they don't gain the same random-access advantages. The large 24-byte Elf64_Rela structure highlights the drawback. For a detailed comparison of relocation formats, see Exploring object file formats#Relocations.

Furthermore, Elf32_Rel and Elf32_Relasacrifice flexibility to maintain a smaller size, limiting relocationtypes to a maximum of 255. This constraint has become noticeable forAArch32 and RISC-V. While the 24-bit symbol index field is less elegant,it hasn't posed significant issues in real-world use cases.

In contrast, the WebAssembly object file format uses LEB128 encoding for relocations and other control structures, offering a significant size advantage over ELF.

Inspired by WebAssembly, I will explore real-world scenarios whererelocation size is critical and propose an alternative format (RELLEB)that addresses ELF's limitations.

Use cases

Dynamic relocations

A substantial part of position-independent executables (PIEs) anddynamic shared objects (DSOs) is occupied by dynamic relocations. WhileRELR (acompact relative relocation format) offers size-saving benefits forrelative relocations, other dynamic relocations can benefit from acompact relocation format.

ld.lld --pack-dyn-relocs=android was an earlier designthat applies to all dynamic relocations at the cost of complexity.

Additionally, Apple linkers and dyld use LEB128 encoding for bindopcodes.

Marker relocations

Marker relocations are utilized to indicate that certain linker optimizations/relaxations are applicable. While many marker relocations are rarely used, RISC-V relocatable files are typically filled up with R_RISCV_RELAX relocations. Their size contribution is quite substantial.

.llvm_addrsig

On many Linux targets, Clang emits a special section called.llvm_addrsig (type SHT_LLVM_ADDRSIG, LLVMaddress-significance table) by default to allowld.lld --icf=safe. The .llvm_addrsig sectionstores symbol indexes in ULEB128 format, independent of relocations.Consequently, tools like ld -r and objcopy risk invalidatethe section due to symbol table modifications.

Ideally, using relocations would allow certain operations. However,the size concern of REL/RELA in ELF hinders this approach. In contrast,lld's Mach-O port chosea relocation-based representation for__DATA,__llvm_addrsig.

.llvm.call-graph-profile

LLVM leverages a special section called.llvm.call-graph-profile (typeSHT_LLVM_CALL_GRAPH_PROFILE) for both instrumentation- andsample-based profile-guided optimization (PGO). lld utilizesthis information ((from_symbol, to_symbol, weight) tuples) tooptimize section ordering within an input section description, enhancingcache utilization and minimizing TLB thrashing.

Similar to .llvm_addrsig, the.llvm.call-graph-profile section initially faced the symbolindex invalidation problem, which was solved by switching torelocations. I opted for REL over RELA to reduce code size.

.debug_names

DWARF v5 accelerated name-based access with the introduction of the.debug_names section. However, in aclang -g -gpubnames generated relocatable file, the.rela.debug_names section can consume a significant portion(approximately 10%) of the file size. This size increase has sparkeddiscussions within the LLVM community about potentially alteringthe file format for linking purposes.

The availability of a more compact relocation format would likelyalleviate the need for such format changes.

Compressed relocations

While the standard SHF_COMPRESSED feature is commonlyused for debug sections, its application can easily extend to relocationsections. I have developed a Clang/lld prototype that demonstrates thisby compressing SHT_RELA sections.

The compressed SHT_RELA section occupiessizeof(Elf64_Chdr) + size(compressed) bytes. Theimplementation retains uncompressed content if compression would resultin a larger size.

In scenarios with numerous smaller relocation sections (such as whenusing -ffunction-sections -fdata-sections), the 24-byteElf64_Chdr header can introduce significant overhead. Thisobservation raises the question of whether encodingElf64_Chdr fields using ULEB128 could further optimize filesizes. With larger monolithic sections (.text,.data, .eh_frame), compression ratio would behigher as well.

1
2
3
4
5
6
7
8
# configure-llvm is my wrapper of cmake that specifies some useful options.
configure-llvm s2-custom0 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld'
configure-llvm s2-custom1 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS=-Xclang=--compress-relocations=zstd
ninja -C /tmp/out/s2-custom0 lld
ninja -C /tmp/out/s2-custom1 lld

ruby -e 'p Dir.glob("/tmp/out/s2-custom0/**/*.o").sum{|f| File.size(f)}' # 137725952
ruby -e 'p Dir.glob("/tmp/out/s2-custom1/**/*.o").sum{|f| File.size(f)}' # 117747320

Despite the overhead of-ffunction-sections -fdata-sections, the compressiontechnique yields a significant reduction of 14.5%!

However, dropping in-place relocation processing is a downside.

RELLEB relocation format

The 1990 ELF paper ELF: An Object File to Mitigate MischievousMisoneism says "ELF allows extension and redefinition for othercontrol structures." Inspired by WebAssembly, let's explore RELLEB, anew and more compact relocation format designed to replace RELA. Ouremphasis is on simplicity over absolute minimal encoding. See the end ofthe article for a detailed format description.

A SHT_RELLEB section (preferred name: .relleb<name>) holds compact relocation entries that decode to Elf32_Relleb or Elf64_Relleb depending on the object file class (32-bit or 64-bit). The content begins with a ULEB128-encoded relocation count, followed by a sequence of relocation entries. Individual entries encode r_offset, r_type, and r_symidx using ULEB128/SLEB128, with an optional member encoding r_addend.

Here are key design choices:

Relocation count (ULEB128):

This allows for efficient retrieval of the relocation count withoutdecoding the entire section. While a uint32_t (like SHT_HASH)could be used, ULEB128 aligns with subsequent entries and offers aslight size advantage in most cases when the number of symbols can beencoded in one to three bytes.

Delta encoding for r_offset (ULEB128):

Section offsets can be large, and relocations are typically ordered.Storing the difference between consecutive offsets offers compressionpotential. In most cases, a single byte will suffice. While there areexceptions (general dynamic TLS model of s390/s390x uses a local"out-of-order" pair:R_390_PLT32DBL(offset=o) R_390_TLS_GDCALL(offset=o-2)), weare optimizing for the common case. Switching to SLEB128 would increasethe total .o size by 0.1%.

For ELFCLASS32, r_offset members are calculated using modular arithmetic modulo 4294967296.

Delta encoding for r_type (SLEB128):

Some psABIs utilize relocation types greater than 128. AArch64'sstatic relocation types begin at 257 and dynamic relocation types beginat 1024, necessitating two bytes with ULEB128/SLEB128 encoding in theabsence of delta encoding. Delta encoding allows all but the firstrelocation's type to be encoded in a single byte. An alternative designis to define a base type in the header and encode types relative to thebase type, which would introduce slight complexity.

If the AArch32 psABI could be redesigned, allocating[0,64) for Thumb relocation types and [64,*)for ARM relocation types would optimize delta encoding even further.

Symbol index and addend presence (SLEB128):

ULEB128 optimizes for the common case when the symbol index isencodable in one or two bytes. Using SLEB128 and delta encoding insteadof ULEB128 for the symbol index field would increase the total size by0.4%. Potential gains for dynamic relocations with consecutive indexesare outweighed by the loss in static relocations and added complexity,hence avoiding delta encoding. We indicate addend omission using thesign bit (see below).

Delta encoding for addend (SLEB128):

Consecutive static relocations often have identical addends,especially on architectures with frequent zero addends (AArch64,PowerPC, RISC-V, etc). Addend omission optimizes this scenario.Additionally, it benefits architectures like AArch32 and x86, whichoften have identical negative addends (call instructions) forconsecutive relocations. Dynamic relocations (exceptR_*_RELATIVE and R_*_IRELATIVE) typically havezero addends, also benefiting from our scheme.

The sign bit of the previous member flags whether the addend haschanged relative to the prior entry. When the addend changes, we use anaddend delta. This offers a slight size advantage (about 0.20%) andoptimizes for cases like:

1
2
3
4
.quad .data + 0x78
.quad .data + 0x80
.quad .data + 0x88
...

Note: when the bitwise NOT code path is taken, the zero delta addendis not utilized.

1
2
// Many R_X86_64_PLT32(g-4)
void f() { g(); g(); g(); ... }

There is no RELLEB replacement for .rela.plt. In glibc,there is complexity due to __rela_iplt_start.

I have developed a prototype at https://github.com/MaskRay/llvm-project/tree/demo-relleb. RELLEB demonstrates superior size reduction compared to the SHF_COMPRESSED SHT_RELA approach.

1
2
3
4
5
configure-llvm s2-custom2 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS=-mrelleb
# Change /usr/local/bin/ld to point to an updated lld
ninja -C /tmp/out/s2-custom2 lld

ruby -e 'p Dir.glob("/tmp/out/s2-custom2/**/*.o").sum{|f| File.size(f)}' # 114305360

RELLEB yields a significant reduction of 20.5%! The total relocation section size has decreased from 28768176 to 5318839, 18.4% of the original.

It would be interesting to explore the potential gains of combiningzstd compression with RELLEB.

1
2
3
4
configure-llvm s2-custom3 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS='-mrelleb -Xclang --compress-relocations=zstd'
ninja -C /tmp/out/s2-custom3 lld

ruby -e 'p Dir.glob("/tmp/out/s2-custom3/**/*.o").sum{|f| File.size(f)}' # 113061168

While the 26.4% reduction in RELLEB section size suggests room forfurther optimization, the overall decrease of only 1.088% in .o filesizes indicates that the current compact relocation format offers areasonable compromise. (In the absence of the addend presence and deltaaddend technique, the overall decrease is about 1.5%.)

I debated whether to name the new section SHT_RELOC(.reloc<name>) or SHT_RELLEB(.relleb<name>). Ultimately, I choseSHT_RELLEB because its unique nature minimizes potentialconfusion, whereas SHT_RELOC could be confused withSHT_REL and SHT_RELA.

RELLEB for dynamicrelocations

RELLEB excels with static relocations, but what about the dynamiccase? A truly optimal dynamic relocation format would differsubstantially. Since dynamic relocations are often adjacent in offsetsby 4 or 8 and include fewer types, a generalized RELR format would beideal. Here's a possible encoding:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
// For R_*_RELATIVE
Encode(RELR bytes)
Encode(R_*_RELATIVE)
RELR

// For R_*_GLOB_DAT or the symbolic relocation type
Encode(R_*_GLOB_DAT)
Use RELR to encode offsets
Encode symbol indexes separately

// For R_*_JUMP_SLOT
Encode(R_*_JUMP_SLOT)
Use RELR to encode offsets
Encode symbol indexes separately

While RELLEB is a step up from REL/RELA for dynamic relocations, it'snot perfect. Android's packed relocation format, despite its complexityand lack of bitmap encoding, provides greater space savings.

I've implemented -z relleb in my prototype for testing.Let's link clang-16-debug using several methods and compare: RELA,--pack-dyn-relocs=relr,--pack-dyn-relocs=android+relr, and--pack-dyn-relocs=relr -z relleb.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
% llvm-readelf -S clang | grep ' \.rel.*\.'
[ 8] .rela.dyn RELA 0000000000f28998 f28998 e177d0 18 A 3 0 8
[ 9] .rela.plt RELA 0000000001d40168 1d40168 0025b0 18 AI 3 26 8
% llvm-readelf -S clang.relr | grep ' \.rel.*\.'
[ 8] .rela.dyn RELA 0000000000f28998 f28998 016e90 18 A 3 0 8
[ 9] .relr.dyn RELR 0000000000f3f828 f3f828 02ccd0 08 A 0 0 8
[10] .rela.plt RELA 0000000000f6c4f8 f6c4f8 0025b0 18 AI 3 27 8
% llvm-readelf -S clang.relr+android | grep ' \.rel.*\.'
[ 8] .rela.dyn ANDROID_RELA 0000000000f28998 f28998 001707 01 A 3 0 8
[ 9] .relr.dyn RELR 0000000000f2a0a0 f2a0a0 02ccd0 08 A 0 0 8
[10] .rela.plt RELA 0000000000f56d70 f56d70 0025b0 18 AI 3 27 8
% llvm-readelf -S clang.relr+relleb | grep ' \.rel.*\.'
[ 8] .relleb.dyn 0x14: <unknown> 0000000000f28998 f28998 003bcd 01 A 3 0 8
[ 9] .relr.dyn RELR 0000000000f2c568 f2c568 02ccd0 08 A 0 0 8
[10] .rela.plt RELA 0000000000f59238 f59238 0025b0 18 AI 3 27 8

Analysis

  • RELR significantly optimizes relative relocations, offering the largest size reduction.
  • RELLEB further improves the non-relative portion, compressing it to 16.3% of its RELA size. However, it's overshadowed by Android packed relocations (6.3%).
  • While Android packed relocations have a smaller footprint, their advantage is less pronounced since .relr.dyn still accounts for a significant portion of the size.

RELLEB proposal for thegeneric ABI

In https://www.sco.com/developers/gabi/latest/ch4.sheader.html,make the following changes.

In Figure 4-9: Section Types,sh_type, append a row

SHT_RELLEB | 20

Add text:

SHT_RELLEB - The section holds compact relocation entries with explicit addends. An object file may have multiple relocation sections. See ''Relocation'' below for details.

In Figure 4-16: Special Sections, append a row

.rellebname | SHT_RELLEB | see below

Change the text below:

.relname, .relaname, and .rellebname

These sections hold relocation information, as described in''Relocation''. If the file has a loadable segment that includesrelocation, the sections' attributes will include the SHF_ALLOC bit;otherwise, that bit will be off. Conventionally, name is supplied by thesection to which the relocations apply. Thus a relocation section for.text normally would have the name .rel.text, .rela.text, or.relleb.text.

In Figure 4-23: Relocation Entries, add:

1
2
3
4
5
6
7
8
9
10
11
12
13
typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_type;
  Elf32_Word r_symidx;
  Elf32_Sword r_addend;
} Elf32_Relleb;

typedef struct {
  Elf64_Addr r_offset;
  Elf64_Word r_type;
  Elf64_Word r_symidx;
  Elf64_Sxword r_addend;
} Elf64_Relleb;

Add text before "A relocation section references two othersections":

A SHT_RELLEB section holds compact relocation entries that decode to Elf32_Relleb or Elf64_Relleb depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded relocation count, followed by entries encoding r_offset, r_type, r_symidx, and r_addend. Note that the r_info member in traditional REL/RELA formats has been split into separate r_type and r_symidx members, allowing uint32_t relocation types for ELFCLASS32 as well.

  • Delta offset: (ULEB128-encoded) The difference in r_offset relative to the previous entry. The difference is truncated to 32-bit/64-bit for ELFCLASS32/ELFCLASS64, respectively.
  • Delta type: (SLEB128-encoded) The difference in relocation type relative to the previous entry.
  • Symbol table index and addend presence: (SLEB128-encoded) If the addend is equal to the previous entry's r_addend, the encoded value represents the symbol index; otherwise, the bitwise NOT of the encoded value indicates the symbol index.
  • Delta addend: (SLEB128-encoded, only if indicated in the previous member) The difference in r_addend relative to the previous entry. The difference is truncated to 32-bit/64-bit for ELFCLASS32/ELFCLASS64, respectively.

For the first relocation entry, the previous offset, type, and addendmembers are treated as zero.
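A minimal decoder sketch for this encoding, following the conventions of the example decoders elsewhere in this pair of posts (decodeULEB128/decodeSLEB128 advance p; Elf_Addr and rels are assumed from the surrounding context):

// uint64_t decodeULEB128(uint8_t *&p);
// int64_t decodeSLEB128(uint8_t *&p);

Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
size_t count = decodeULEB128(p);
for (size_t i = 0; i != count; ++i) {
  offset += decodeULEB128(p);    // delta offset
  type += decodeSLEB128(p);      // delta type
  int64_t x = decodeSLEB128(p);  // symbol index, bitwise-NOTed if the addend changed
  if (x < 0) {
    x = ~x;
    addend += decodeSLEB128(p);  // delta addend
  }
  symidx = x;
  rels[i] = {offset, type, symidx, addend};
}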

In Figure 5-10: Dynamic Array Tags, d_tag, add:

DT_RELLEB | 38 | d_ptr | optional | optional

Add text below:

  • DT_RELLEB - This element is similar toDT_RELA, except its table uses the RELLEB format. If thiselement is present, the dynamic structure must also have aDT_RELLEBSZ element.