Understanding alignment - from source to object file
Alignment refers to the practice of placing data or code at memory addresses that are multiples of a specific value, typically a power of 2. This is done either to satisfy a hardware requirement or to make accesses efficient: misaligned memory accesses might be expensive or cause traps on certain architectures.
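As a minimal sketch of the arithmetic behind "a multiple of a power of 2", the helpers below check and round up addresses with bit masks. The function names are hypothetical and the code assumes C++17 (for message-less static_assert) and that `align` is a nonzero power of 2.

```cpp
#include <cstdint>

// Hypothetical helpers; `align` is assumed to be a nonzero power of 2.
constexpr bool is_power_of_2(std::uint64_t x) {
  return x != 0 && (x & (x - 1)) == 0;
}

// True if `addr` is a multiple of `align`.
constexpr bool is_aligned(std::uint64_t addr, std::uint64_t align) {
  return (addr & (align - 1)) == 0;
}

// Round `addr` up to the next multiple of `align` (no change if already aligned).
constexpr std::uint64_t align_up(std::uint64_t addr, std::uint64_t align) {
  return (addr + align - 1) & ~(align - 1);
}

static_assert(is_power_of_2(16) && !is_power_of_2(12));
static_assert(is_aligned(0x1000, 16) && !is_aligned(0x1004, 16));
static_assert(align_up(13, 8) == 16 && align_up(16, 8) == 16);
```

The checks run at compile time, so `clang -std=c++17 -fsyntax-only` is enough to exercise them.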
This blog post explores how alignment is represented and managed as C++ code is transformed through the compilation pipeline: from source code to LLVM IR, assembly, and finally the object file. We'll focus on alignment for both variables and functions.
Alignment in C++ source code
C++ [basic.align] specifies:
Object types have alignment requirements ([basic.fundamental], [basic.compound]) which place restrictions on the addresses at which an object of that type may be allocated. An alignment is an implementation-defined integer value representing the number of bytes between successive addresses at which a given object can be allocated. An object type imposes an alignment requirement on every object of that type; stricter alignment can be requested using the alignment specifier ([dcl.align]). Attempting to create an object ([intro.object]) in storage that does not meet the alignment requirements of the object's type is undefined behavior.
alignas can be used to request a stricter alignment.
An alignment-specifier may be applied to a variable or to a class data member, but it shall not be applied to a bit-field, a function parameter, or an exception-declaration ([except.handle]). An alignment-specifier may also be applied to the declaration of a class (in an elaborated-type-specifier ([dcl.type.elab]) or class-head ([class]), respectively). An alignment-specifier with an ellipsis is a pack expansion ([temp.variadic]).
Example:
alignas(16) int i0;
struct alignas(8) S {};
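To see what a stricter class alignment does to layout, here is a small compile-time sketch. The type names Vec4 and Holder are hypothetical, and the numeric checks assume a mainstream target where float is 4 bytes.

```cpp
#include <cstddef>

// Hypothetical types for illustration; assumes sizeof(float) == 4.
struct alignas(16) Vec4 {
  float v[4];
};

// A member with a strict alignment forces padding after `tag`
// and raises the alignment of the enclosing class.
struct Holder {
  char tag;
  Vec4 payload;
};

static_assert(alignof(Vec4) == 16);
static_assert(sizeof(Vec4) == 16);
static_assert(offsetof(Holder, payload) == 16); // 15 padding bytes after tag
static_assert(alignof(Holder) == 16);
static_assert(sizeof(Holder) == 32);
```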
If the strictest alignas on a declaration is weaker than the alignment it would have without any alignas specifiers, the program is ill-formed.
% echo 'alignas(2) int v;' | clang -fsyntax-only -xc++ -
However, the GNU extension __attribute__((aligned(1))) can request a weaker alignment.
typedef int32_t __attribute__((aligned(1))) unaligned_int32_t;
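A short sketch of how such a typedef behaves, assuming GCC or Clang (the attribute is a GNU extension). The helper load_unaligned is a hypothetical name; it shows the portable alternative of reading a possibly misaligned value with memcpy.

```cpp
#include <cstdint>
#include <cstring>

// GNU extension: same size as int32_t, but only 1-byte alignment is assumed.
typedef std::int32_t __attribute__((aligned(1))) unaligned_int32_t;

static_assert(sizeof(unaligned_int32_t) == sizeof(std::int32_t));
static_assert(alignof(unaligned_int32_t) == 1);

// Hypothetical helper: portable way to read a 32-bit value from an
// address that may not be suitably aligned for int32_t.
inline std::int32_t load_unaligned(const void *p) {
  std::int32_t v;
  std::memcpy(&v, p, sizeof v);
  return v;
}
```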
LLVM IR representation
In the LLVM Intermediate Representation (IR), both global variables and functions can have an align attribute to specify their required alignment.
An explicit alignment may be specified for a global, which must be a power of 2. If not present, or if the alignment is set to zero, the alignment of the global is set by the target to whatever it feels convenient. If an explicit alignment is specified, the global is forced to have exactly that alignment. Targets and optimizers are not allowed to over-align the global if the global has an assigned section. In this case, the extra alignment could be observable: for example, code could assume that the globals are densely packed in their section and try to iterate over them as an array, alignment padding would break this iteration. For TLS variables, the module flag MaxTLSAlign, if present, limits the alignment to the given value. Optimizers are not allowed to impose a stronger alignment on these variables. The maximum alignment is 1 << 32.
Function alignment
An explicit alignment may be specified for a function. If not present, or if the alignment is set to zero, the alignment of the function is set by the target to whatever it feels convenient. If an explicit alignment is specified, the function is forced to have at least that much alignment. All alignments must be a power of 2.
A backend can override this with a preferred function alignment (STI->getTargetLowering()->getPrefFunctionAlignment()), if that is larger than the specified align value. (
In addition, align can be used in parameter attributes to decorate a pointer or
LLVM back end representation
Global variables: AsmPrinter::emitGlobalVariable determines the alignment for global variables based on a set of nuanced rules:
- With an explicit alignment (explicit),
  - If the variable has a section attribute, return explicit.
  - Otherwise, compute a preferred alignment for the data layout (getPrefTypeAlign, referred to as pref). Return pref < explicit ? explicit : max(explicit, getABITypeAlign).
- Without an explicit alignment: return getPrefTypeAlign.
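The rules above can be sketched as a small decision function. This is a simplified model, not LLVM's code: the GlobalInfo struct and its field names are hypothetical, an explicit alignment of 0 stands for "not specified", and the size-based heuristic is left out.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical model of the inputs; all alignments are in bytes.
struct GlobalInfo {
  std::uint64_t explicit_align; // 0 means no explicit alignment
  bool has_section;             // variable has a section attribute
  std::uint64_t pref_align;     // stands in for getPrefTypeAlign
  std::uint64_t abi_align;      // stands in for getABITypeAlign
};

constexpr std::uint64_t global_alignment(const GlobalInfo &g) {
  if (g.explicit_align == 0)
    return g.pref_align;        // no explicit alignment
  if (g.has_section)
    return g.explicit_align;    // honor the explicit value exactly
  return g.pref_align < g.explicit_align
             ? g.explicit_align
             : std::max(g.explicit_align, g.abi_align);
}

// No explicit alignment: the preferred alignment wins.
static_assert(global_alignment({0, false, 8, 4}) == 8);
// Under-aligned global in a named section: kept at 1.
static_assert(global_alignment({1, true, 8, 4}) == 1);
// Under-aligned global without a section: raised to the ABI alignment.
static_assert(global_alignment({1, false, 8, 4}) == 4);
// Over-aligned global: the explicit value wins.
static_assert(global_alignment({32, false, 8, 4}) == 32);
```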
getPrefTypeAlign employs a heuristic for global variable definitions: if the variable's size exceeds 16 bytes and the preferred alignment is less than 16 bytes, it sets the alignment to 16 bytes. This heuristic balances performance and memory efficiency for common cases, though it may not be optimal for all scenarios. (See
For assembly output, AsmPrinter emits .p2align (power of 2 alignment) directives with a zero fill value (i.e. the padding bytes are zeros).
% echo 'int v0;' | clang --target=x86_64 -S -xc - -o -
        .file   "-"
        .type   v0,@object                      # @v0
        .bss
        .globl  v0
        .p2align        2, 0x0
v0:
        .long   0                               # 0x0
        .size   v0, 4
...
Functions: For functions, AsmPrinter::emitFunctionHeader emits alignment directives based on the machine function's alignment settings.
void MachineFunction::init() {
- The subtarget's minimum function alignment
- If the function is not optimized for size (i.e. not compiled with -Os or -Oz), take the maximum of the minimum alignment and the preferred alignment. For example, X86TargetLowering sets the preferred function alignment to 16.
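The two steps above amount to a one-line computation. The sketch below uses a hypothetical function name and plain integers instead of LLVM's Align type; the minimum alignment of 1 in the checks is an assumed value for illustration, while 16 is the x86 preferred alignment mentioned above.

```cpp
#include <algorithm>

// Hypothetical model: start from the minimum function alignment and,
// unless optimizing for size, raise it to the preferred alignment.
constexpr unsigned function_alignment(unsigned min_align, unsigned pref_align,
                                      bool opt_for_size) {
  return opt_for_size ? min_align : std::max(min_align, pref_align);
}

static_assert(function_alignment(1, 16, /*opt_for_size=*/false) == 16);
static_assert(function_alignment(1, 16, /*opt_for_size=*/true) == 1);
// A minimum larger than the preferred alignment is never lowered.
static_assert(function_alignment(4, 2, /*opt_for_size=*/false) == 4);
```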
% echo 'void f(){} [[gnu::aligned(32)]] void g(){}' | clang --target=x86_64 -S -xc - -o -
The emitted .p2align directives omit the fill value argument: for code sections, this space is filled with no-op instructions.
Assembly representation
GNU Assembler supports multiple alignment directives:
- .p2align 3: align to 2**3
- .balign 8: align to 8
- .align 8: this is identical to .balign on some targets and .p2align on the others.
Clang supports "direct object emission" (clang -c typically bypasses a separate assembler): the LLVM AsmPrinter directly uses the MCObjectStreamer API. This allows Clang to emit the machine code directly into the object file, bypassing the need to parse and interpret alignment directives and instructions from a text-based assembly file.
These alignment directives have an optional third argument: the maximum number of bytes to skip. If doing the alignment would require skipping more bytes than the specified maximum, the alignment is not done at all. GCC's -falign-functions=m:n utilizes this feature.
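A minimal sketch of the max-skip rule, modelling a `.p2align log2_align, , max_skip` directive at a given section offset. The function name is hypothetical, and treating max_skip == 0 as "no limit" is a modelling choice of this sketch.

```cpp
#include <cstdint>

// Hypothetical model: number of padding bytes emitted at `offset`.
// max_skip == 0 is used here to mean that no limit was given.
constexpr std::uint64_t padding_for(std::uint64_t offset, unsigned log2_align,
                                    std::uint64_t max_skip) {
  const std::uint64_t align = std::uint64_t{1} << log2_align;
  const std::uint64_t pad = (align - offset % align) % align;
  if (max_skip != 0 && pad > max_skip)
    return 0; // too many bytes to skip: the alignment is not done at all
  return pad;
}

// .p2align 4 at offset 0x13: 13 bytes of padding reach 0x20.
static_assert(padding_for(0x13, 4, 0) == 13);
// .p2align 4,,7 at offset 0x13: 13 > 7, so nothing is emitted.
static_assert(padding_for(0x13, 4, 7) == 0);
// .p2align 4,,7 at offset 0x1a: 6 <= 7, so the alignment is done.
static_assert(padding_for(0x1a, 4, 7) == 6);
```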
Object file format
In an object file, the section alignment is determined by the strictest alignment directive present in that section. The assembler sets the section's overall alignment to the maximum of all these directives, as if an implicit directive were at the start.
.section .text.a,"ax"
This alignment is stored in the sh_addralign field within the ELF section header table. You can inspect this value using tools such as readelf -WS (llvm-readelf -S) or objdump -h (llvm-objdump -h).
Linker considerations
The linker combines multiple object files into a single executable. When it maps input sections from each object file into output sections in the final executable, it ensures that section alignments specified in the object files are preserved.
How the linker handles section alignment
Output section alignment: This is the maximum sh_addralign value among all its contributing input sections. This ensures the strictest alignment requirements are met.
Section placement: The linker also uses input sh_addralign information to position each input section within the output section. As illustrated in the following example, each input section (like a.o:.text.f or b.o:.text) is aligned according to its sh_addralign value before being placed sequentially.
output .text
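The output alignment and sequential placement rules can be sketched as follows. This is a simplified model with hypothetical type and function names; the section sizes and alignments in kInputs are made up for illustration, and offsets are relative to an output section start that is itself suitably aligned.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model of an input section: only size and sh_addralign matter.
struct InputSection {
  std::uint64_t size;
  std::uint64_t addralign; // power of 2
};

struct Layout {
  std::uint64_t output_align; // maximum sh_addralign of the inputs
  std::uint64_t output_size;  // end offset after the last input
};

template <std::size_t N>
constexpr Layout lay_out(const std::array<InputSection, N> &inputs) {
  Layout out{1, 0};
  for (const InputSection &s : inputs) {
    if (s.addralign > out.output_align)
      out.output_align = s.addralign;
    // Align the current offset for this input, then place it.
    const std::uint64_t start =
        (out.output_size + s.addralign - 1) & ~(s.addralign - 1);
    out.output_size = start + s.size;
  }
  return out;
}

// Made-up inputs: placed at offsets 0, 16 and 19.
constexpr std::array<InputSection, 3> kInputs{{{5, 4}, {3, 16}, {2, 1}}};
static_assert(lay_out(kInputs).output_align == 16);
static_assert(lay_out(kInputs).output_size == 21);
```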
Linker script control: A linker script can override the default alignment behavior. The ALIGN keyword enforces a stricter alignment. For example, .text : ALIGN(32) { ... } aligns the section to at least a 32-byte boundary. This is often done to optimize for specific hardware or for memory mapping requirements.
The SUBALIGN keyword on an output section overrides the input section alignments.
Padding: To achieve the required alignment, the linker may insert padding between sections or before the first input section (if there is a gap after the output section start). The fill value is determined by the following rules:
- If specified, use the =fillexp output section attribute (within an output section description).
- If a non-code section, use zero.
- Otherwise, use a trap or no-op instruction.
Padding and section reordering
Linkers typically preserve the order of input sections from object files. To minimize the padding required between sections, linker scripts can use a SORT_BY_ALIGNMENT keyword to arrange input sections in descending order of their alignment requirements. Similarly, GNU ld supports --sort-common to sort COMMON symbols by decreasing alignment.
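To illustrate why descending alignment order reduces padding, the sketch below compares the end offset of the same four sections in their original order and after sorting. All names and the section sizes are hypothetical; the sort is a simple stable insertion sort written out so it can run at compile time in C++17.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct Sec {
  std::uint64_t size;
  std::uint64_t align; // power of 2
};

// End offset after placing the sections sequentially, aligning each one.
template <std::size_t N>
constexpr std::uint64_t end_offset(const std::array<Sec, N> &secs) {
  std::uint64_t off = 0;
  for (std::size_t i = 0; i < N; ++i) {
    off = (off + secs[i].align - 1) & ~(secs[i].align - 1);
    off += secs[i].size;
  }
  return off;
}

// Stable insertion sort by descending alignment.
template <std::size_t N>
constexpr std::array<Sec, N> sort_by_alignment(std::array<Sec, N> secs) {
  for (std::size_t i = 1; i < N; ++i) {
    const Sec key = secs[i];
    std::size_t j = i;
    while (j > 0 && secs[j - 1].align < key.align) {
      secs[j] = secs[j - 1];
      --j;
    }
    secs[j] = key;
  }
  return secs;
}

// Made-up sections: 18 bytes of content in total.
constexpr std::array<Sec, 4> kSecs{{{1, 1}, {8, 8}, {1, 1}, {8, 8}}};
static_assert(end_offset(kSecs) == 32);                    // 14 padding bytes
static_assert(end_offset(sort_by_alignment(kSecs)) == 18); // no padding
```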
While this sorting can reduce wasted space, modern linking strategies often prioritize other factors, such as cache locality (for performance) and data similarity (for Lempel–Ziv compression ratio), which can conflict with sorting by alignment. (Search --bp-compression-sort= on
ABI compliance
Some platforms have special rules. For example,
- On SystemZ, the larl (load address relative long) instruction cannot generate odd addresses. To prevent GOT indirection, compilers ensure that symbols are at least aligned by 2. (Toolchain notes on z/Architecture)
- On AIX, the default alignment mode is power: for double and long double, the first member of this data type is aligned according to its natural alignment value; subsequent members of the aggregate are aligned on 4-byte boundaries. (https://reviews.llvm.org/D79719)
- z/OS caps the maximum alignment of static storage variables to 16. (https://reviews.llvm.org/D98864)
The standard representation of the Itanium C++ ABI requires member function pointers to be even, to distinguish between virtual and non-virtual functions.
In the standard representation, a member function pointer for a virtual function is represented with ptr set to 1 plus the function's v-table entry offset (in bytes), converted to a function pointer as if by reinterpret_cast<fnptr_t>(uintfnptr_t(1 + offset)), where uintfnptr_t is an unsigned integer of the same size as fnptr_t.
Conceptually, a pointer to member function is a tuple:
- A function pointer or virtual table index, discriminated by the least significant bit
- A displacement to apply to the this pointer
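The tuple above can be modelled with plain integers. This is a conceptual sketch only: MemFnPtr and the helper names are hypothetical, and real member function pointers are opaque compiler-generated values rather than a user-visible struct.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of the two-word representation.
struct MemFnPtr {
  std::uintptr_t ptr; // function address, or 1 + v-table entry offset
  std::ptrdiff_t adj; // displacement to apply to `this`
};

constexpr MemFnPtr make_nonvirtual(std::uintptr_t fn_addr, std::ptrdiff_t adj) {
  return {fn_addr, adj};
}

constexpr MemFnPtr make_virtual(std::uintptr_t vtable_offset, std::ptrdiff_t adj) {
  return {1 + vtable_offset, adj};
}

// The least significant bit of `ptr` is the discriminator.
constexpr bool is_virtual(MemFnPtr p) { return (p.ptr & 1) != 0; }

static_assert(is_virtual(make_virtual(16, 0)));
static_assert(!is_virtual(make_nonvirtual(0x401000, 0)));
// A function at an odd address would be misread as virtual, which is
// why functions need at least 2-byte alignment under this scheme.
static_assert(is_virtual(make_nonvirtual(0x401001, 0)));
```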
Due to the least significant bit discriminator, member functions need a stricter alignment even if __attribute__((aligned(1))) is specified:
virtual void bar1() __attribute__((aligned(1)));
Side note: check out
Architecture considerations
Contemporary architectures generally support unaligned memory access, likely with very small performance penalties. However, some implementations might restrict or penalize unaligned accesses heavily, or require specific handling. Even on architectures supporting unaligned access, atomic operations might still require alignment.
- On AArch64, a bit in the system control register sctlr_el1 enables alignment check.
- On x86, if the AM bit is set in the CR0 register and the AC bit is set in the EFLAGS register, alignment checking of user-mode data accesses is enabled.
Linux's RISC-V port supports prctl(PR_SET_UNALIGN, PR_UNALIGN_SIGBUS); to enable strict alignment.
clang -fsanitize=alignment can detect misaligned memory access. Check out my
MIPS Computer Systems Inc was granted a patent in 1989:
Aligning code for performance
GCC offers a family of performance-tuning options named -falign-*, that instruct the compiler to align certain code segments to specific memory boundaries. These options might improve performance by preventing certain instructions from crossing cache line boundaries (or instruction fetch boundaries), which can otherwise cause an extra cache miss.
- -falign-functions=n: Align functions.
- -falign-labels=n: Align branch targets.
- -falign-jumps=n: Align branch targets, for branch targets where the targets can only be reached by jumping.
- -falign-loops=n: Align the beginning of loops.
Important considerations
Inefficiency with small functions: Aligning small functions can be inefficient and may not be worth the overhead. To address this, GCC introduced -flimit-function-alignment in 2016. The option sets the .p2align directive's max-skip operand to the estimated function size minus one.
% echo 'int add1(int a){return a+1;}' | gcc -O2 -S -fcf-protection=none -xc - -o - -falign-functions=16 | grep p2align
In LLVM, the x86 backend does not implement TargetInstrInfo::getInstSizeInBytes, making it challenging to implement -flimit-function-alignment.
Cold code: These options don't apply to cold functions. To ensure that cold functions are also aligned, use -fmin-function-alignment=n instead.
Benchmarking: Aligning functions can make benchmarks more reliable. For example, on x86-64, a hot function less than 32 bytes might be placed in a way that uses one or two cache lines (determined by function_addr % cache_line_size), making benchmark results noisy. Using -falign-functions=32 can ensure the function always occupies a single cache line, leading to more consistent performance measurements.
LLVM notes: In clang/lib/CodeGen/CodeGenModule.cpp, -falign-functions=N sets the alignment if a function does not have the gnu::aligned attribute.
A low-overhead loop (also called a zero-overhead loop) is a hardware-assisted looping mechanism found in many processor architectures, particularly digital signal processors (DSPs). The processor includes dedicated registers that store the loop start address, loop end address, and loop count. A hardware loop typically consists of three components:
- Loop setup instruction: Sets the loop end address and iteration count
- Loop body: Contains the actual instructions to be repeated
- Loop end instruction: Jumps back to the loop body if further iterations are required
Here is an example from Arm v8.1-M low-overhead branch extension.
1:
To minimize the number of cache lines used by the loop body, ideally the loop body (the instruction immediately following DLS) should be aligned to a 64-byte boundary. However, GNU Assembler lacks a directive to specify alignment like "align DLS to a multiple of 64 plus 60 bytes." Inserting an alignment after the DLS is counterproductive, as it would introduce unwanted NOP instructions at the beginning of the loop body, negating the performance benefits of the low-overhead loop mechanism.
A potential solution would be to extend the alignment directives with an optional offset parameter:
# Align to 64-byte boundary with 60-byte offset, using NOP padding in code sections
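The padding such an offset-aware directive would need to emit can be computed as below. This is a sketch of the arithmetic only, with a hypothetical function name; it assumes `align` is nonzero and `offset` is smaller than `align`.

```cpp
#include <cstdint>

// Hypothetical helper: bytes of padding needed so that
// (addr + padding) % align == offset. Assumes offset < align.
constexpr std::uint64_t pad_to_offset(std::uint64_t addr, std::uint64_t align,
                                      std::uint64_t offset) {
  return (offset + align - addr % align) % align;
}

// Place a 4-byte instruction at 64n + 60 so the next one starts at 64(n+1).
static_assert(pad_to_offset(0, 64, 60) == 60);
static_assert(pad_to_offset(60, 64, 60) == 0);
static_assert(pad_to_offset(61, 64, 60) == 63);
static_assert((266 + pad_to_offset(266, 64, 60)) % 64 == 60);
```

With offset 0 this reduces to the usual padding computation for an ordinary alignment directive.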
Xtensa's LOOP instructions have similar alignment requirements, but I am not familiar with the details. The GNU Assembler implements the special alignment as a special machine-dependent fragment. (