Diving Into Lazy Binding on Linux for DEC Alpha

2023-09-17

This article was originally part of “Fixing a 16+ Year Old Bug in Linux for DEC Alpha With Only One Line of Code”, but as the two subjects are only tangentially related, I figured this deserved its own treatment. This article is the first bit in what I hope I can make into a running series - much information about the DEC Alpha software ecosystem and what makes it all tick is either trapped in old manuals or, more often than not (doubly so in the case of the Linux port), was never properly documented. Linux on Alpha’s Application Binary Interface (ABI) is perhaps chief among these, as much of it was developed by hobbyists taking the Digital UNIX/Tru64 ABI and hitting it with hammers until it was compatible with Linux and its ELF format.

A Bit Lot of Background

Starting from the top, when a program binary is created, whether as an object file or executable, the vast majority of the time, it will be created as “position independent code”, aka PIC. PIC meaning that the code within the binary may be loaded at any position within the address space, without directly modifying the binary image. This has some advantages over absolute addressing, such as avoiding conflicts between different pieces of code that would otherwise want to load at the same address, and thwarting certain classes of security vulnerabilities that rely on a program being loaded at a predefined address known to the attacker. However, this also means that the program generating that binary won’t be able to know at generation time exactly where every symbol (functions and global variables) will be laid out in memory.

Thus, rather than trying to guess where symbols will be placed, references are written where these symbols are used, and “relocations”, effectively small notes containing bits of metadata, are written into the binary to fill in the aforementioned references with proper offsets and addresses. Most relocations are resolved by the static linker when the program executable is being created, but as most programs depend on dynamic libraries, such as .dll files on Windows or .so files on Linux - any relocations depending on those dynamic libraries are unfilled by necessity. Therefore, the dynamic linker (also commonly referred to as the runtime linker) that comes with the operating system will have to finish resolving the remaining relocations, as these shared libraries can get loaded at any arbitrary location.

Even PIC binaries need to be able to know exactly where certain symbols are located. Theoretically the easiest way to fix this would be to merely modify any instructions that reference a symbol so that they’d directly point to the address they need. However, this is unworkable for a number of reasons in practice:

One of the major reasons touted for using shared libraries is the ability to save both disk space and RAM usage. Rather than every program loading individual copies of many megabytes of the same functions each, they can instead share one copy of shared libraries both on disk and in RAM. If relocations were directly applied to any executable code, this would cause any processes sharing that same library, but that use different address space layouts, to break.
In order for the dynamic linker to be able to modify instructions, it’d have to map the program and its shared libraries' code to be writable, but this introduces a number of potential security vulnerabilities. Forget ROP gadgets - hackers could easily modify or insert code at will.
Programs may have many thousands, or even millions of symbol references scattered throughout their code! Imagine if every time you wanted to compile some code with gcc or clang, in addition to the actual compilation times, you’d have to wait an additional several seconds or even minutes just for every single symbol reference within the compiler and its shared libraries to be resolved!
Different processors may implement different versions of the architectural specification. It’d be a bit unfortunate if your shiny new Emerald Rapids chip was unable to harness its AVX512 capabilities because the programs you’re using were compiled exclusively for desktop Raptor Lake, which lack those fancy SIMD extensions. Wouldn’t it be nice if the program could easily swap certain function references from their AVX512-less variant to one that’s specifically optimized for AVX512 without having to maintain a bunch of switch cases at every call site?
Like many other RISC architectures, Alpha uses instructions of a fixed length of 32 bits. As any immediates passed to instructions are encoded as a part of the instruction, this naturally limits how many bits can be used to encode the immediate. In the case of memory instructions, only 16 bits are available for encoding the immediates used to specify offsets - a far cry from the whole 64-bit address space Alpha touts.

Fortunately, the ELF specification, and really just about every other executable format still in common use, has a solution to all of the aforementioned problems (with the exception of the aforementioned IFUNC method of selecting CPU microarchitecture optimized functions: it’s a nonstandard GNU extension). This is where the Global Offset Table (GOT) steps in. The GOT serves as something of a central database for all the various symbols and their absolute addresses in memory. Thus, when a piece of code wants to look up the address of a symbol, whether to jump to that symbol if it’s a function, or to read and write to it if it’s a variable, it can peek into the GOT via a known base pointer, and an offset corresponding to that symbol’s index within the GOT. Specifically on the Alpha architecture, this base pointer is always stored in the gp, or global pointer, register, and so symbol addresses are loaded via a load quadword instruction as follows:

ldq $dest_reg, symbol_offset(gp)

This symbol offset is filled in by the static linker at program link-time, even on PIC binaries. As the Executable and Linking Format Specification, Version 1.2 states:

“Though the system chooses virtual addresses for individual processes, it maintains the segments’ relative positions. Because position-independent code uses relative addressing between segments, the difference between virtual addresses in memory must match the difference between virtual addresses in the file. The difference between the virtual address of any segment in memory and the corresponding virtual address in the file is thus a single constant value for any one executable or shared object in a given process. This difference is the base address.”

Put simply, various segments and the data within them will always be at specific offsets relative to one another once a program or shared library has been loaded into memory, regardless of the base address the program or shared library starts at. These offsets are computed ahead of time by the static linker, and thus, while the static linker won’t know where the global offset table will end up, it does know that printf will be located at, say, index 4 of the GOT once the program has been loaded.

We already discussed that the dynamic linker is responsible for selecting the base address of PIC binaries and their shared libraries to prevent conflicts. However, it’s worthwhile to discuss how this change in base address is applied. As one may guess, it’s carried out by the dynamic linker applying a series of relocations, specifically “dynamic relocations.” Dynamic because they’re resolved at runtime rather than via the static linker. While there’s a variety of dynamic relocations, we’re most interested in R_ALPHA_RELATIVE, R_ALPHA_GLOB_DAT, and R_ALPHA_JMP_SLOT.

Dynamic Relocation	Address Calculation
R_ALPHA_RELATIVE	B + A
R_ALPHA_GLOB_DAT	S
R_ALPHA_JMP_SLOT	S (stub)

The static linker will fill some fields with pointers relative to a particular “preferred” base address. However, given that we’re talking about PIC code, the dynamic linker can move executables and shared libraries around to start at any location in memory. Thus, R_ALPHA_RELATIVE is employed to add the new base address to fields already populated with pointers. For example, certain GOT entries in some binaries will be prepopulated with symbol pointers relative to the “preferred” base address if that symbol is defined within the binary. After the R_ALPHA_RELATIVE relocations are processed, those function pointers will now be relative to the proper base address.

R_ALPHA_GLOB_DAT is relatively straightforward. Any empty GOT entries with the relocation applied to them will be filled in with the address of the symbol they represent once loaded by the dynamic linker. As for R_ALPHA_JMP_SLOT, it does the same - except with a twist.

Binding, but Lazy

Rather than filling in GOT entries for functions with memory addresses pointing directly to those functions, upon encountering an R_ALPHA_JMP_SLOT, the dynamic linker will update those GOT entries to point towards function-specific executable stubs located in the Procedure Linkage Table, or PLT. Why not directly towards the functions themselves? Every single symbol to be resolved requires the dynamic linker to search the executable and every loaded shared library until the symbol definition is found. Naturally, this takes some time to complete if searches must be conducted for hundreds, thousands, or even tens of thousands of functions at program startup. Doubly so if the program being loaded was written in a language that uses name mangling for its symbols, as these names once mangled can get quite long. As symbol lookups partly involve name hashing, a symbol name with dozens or even hundreds of characters only adds more time to the symbol resolution process.

Once these stubs are executed by the program, they’ll cause the program to jump into the dynamic linker itself and locate the actual symbol address, before saving that symbol address into the GOT. This entire process is what constitutes the subject of this article: lazy binding.

Procedure Linkage Table as loaded into memory, examined under GDB. 2 stub entries are shown, plus the primary stub each of them calls at the beginning of the PLT.

Lazy Binding on Linux for DEC Alpha

On Alpha, there’s two ways the PLT and GOT get filled in by the dynamic linker, however, I’m going to focus on the newer “secureplt” (more on the name in a bit) method implemented in Binutils and the GNU C Library by the incredibly prolific Richard Henderson as it’s been the default for many years now. When the linker goes through and emits relocations for function calls that will end up being dynamically linked, those relocations will point to a specific index within the GOT, represented by an offset from the global pointer register, commonly referred to as the gp register. “Global pointer” because it holds a pointer to the GOT.

Each of these GOT entries will be 8 bytes long (Alpha is a 64-bit architecture after all), and be filled in with an address to jump to. Thus as you can see in the disassembly above, when an application wants to branch to another function, it will load the function address stored within the GOT via an ldq, or load quadword, instruction, and execute a jsr, or jump to subroutine, instruction. If the program was linked with eager binding enabled, this would be the end of the story; the program would directly jump to the function via the address saved within the GOT.

In order to see what happens during lazy binding, we’ll need to take a peek at the GOT the linker generated. Luckily, finding the GOT is pretty easy, it has its own section within the binary after all.

Upon inspecting it though, you’ll pretty quickly realize trying to match specific GOT entries to the functions they’re suppose to contain isn’t so simple (ignore the objdump-generated instructions, it’s mistaking pointers for valid Alpha instructions). Additionally, figuring out which GOT index the calling function you’re coming from is just as confusing, why is the ldq instruction referring to an offset seemingly coming before the GOT section even begins?. The reason for the former is because rather than holding function pointers, during lazy binding, the static linker will instead fill in each GOT entry with pointers to PLT entries. As for the latter, well, it primarily comes down to limitations in the Alpha memory instruction encoding format.

As the “Tru64 UNIX Object File and Symbol Table Format Specification” points out, Alpha memory instruction encodings are only capable of holding a 16-bit value for their offset. This immediate is treated as a signed value, meaning that memory instructions can address both 32KB behind the address stored in the register containing the base memory pointer, or 32KB ahead.

In order to work around this limitation, the GOT on Alpha contains a maximum of 8192 entries (each pointer is 8 bytes, so 8×8192 = 64KB). Additionally, as the ABI manual points out, in order for the memory instruction encoding offset to be able to reach the full 64KB of the GOT, the gp register has to be set to +32KB ahead of the start of the GOT at the beginning of each function, as well as when a subroutine call returns back to the calling function.

Knowing this, we can figure out where in the GOT our prior procedure call assembly code is pointing to, As there’s only one GOT in this program, and it starts at 0x188100, +32KB away gives us 0x190100 as our global pointer. Moving -32752 bytes away from this GP gives us 0x188110 as the pointer to the GOT index our assembly code was referring to.

Following the pointer stored in this GOT entry, 0x2e694 (Alpha is a little endian architecture), we end up at the PLT section.

Now at the PLT, the program begins the process of finally binding the actual function pointer to the GOT.

The first instruction executed is the instruction at 0x2e964, which objdump has decided to represent as a pseudoinstruction, in actuality the instruction executed by the Alpha is

br $31, -8 // Register $31 is hardwired to 0 on Alpha.

The program then jumps to 0x2e690, where it executes what has again been represented as a pseudoinstruction by objdump:

br at, -36

This exploits the fact that the Alpha saves the address of the branch instruction it’s executing + 4 to into the provided register before branching to the program counter + the provided offset. The program then lands at 0x2e670

subq t12, at, t11

At this point, we now have the address of the PLT entry we initially jumped to saved into register t12, as well as the address of the beginning of the PLT’s indices saved into register at. From this, we can do a 64-bit subtraction of the two values, giving us an offset of 0, which is then saved into register t11.

ldah at, 22(at)

ldah, or load address high, treats the 16-bit value passed as the high 16-bits of a 32-bit addend. Thus, the 22 passed is multiplied by 65536, resulting in 0x160000, or 1441792, which then gets added to the present contents of the at register (0x2e694), resulting in 0x18e694. This new value then gets saved back into the at register.

lda at, -26004(at)

lda is similar to ldah, except rather than shifting the 16-bit value 16-bits to the left, it simply adds the addend directly to the value currently in the source register, and saves it back into the destination register. Thus, we get 0x18e694 - 26004, resulting in 0x188100 being saved back to the at register. The astute reader may recall that 0x188100 is actually the address of the GOT in this binary we’re examining. Well, hold onto that thought, since that’s not a coincidence.

ldq t12, 0(at) 
// ...
jmp (t12)

Skipping over a few instructions, we now end up loading the address at the very beginning of the GOT, and then jumping to that loaded address. Confusingly though, if we peek into the region of the GOT we attempted to load, we realize that area of the GOT is all zeros. So what’s going on?

Into the Dynamic Linker!

If we take a look at the glibc commit that introduced the secureplt functionality, we find a certain modified function that stands out:

/* Set up the loaded object described by L so its unrelocated PLT
   entries will jump to the on-demand fixup code in dl-runtime.c.  */

static inline int
elf_machine_runtime_setup (struct link_map *map, struct r_scope_elem *scope[],
			   int lazy, int profile)
{
  // ...

I’m not going to copy and paste the entire function, as it’s a bit long, but you can read it for yourself here. The main takeaways are that it looks for the section with the PLTGOT dynamic tag assigned to it. If a corresponding section is found, elf_machine_runtime_setup will write the function address for _dl_runtime_resolve_new to the absolute beginning of the .got, as well as the program’s “link_map” struct, an internal data structure used by glibc for various functions, including in our case, lazy binding. Needless to say, the reason the first 16 bytes of the .got are empty in the binary is because anything written there will inevitably end up getting overwritten by the dynamic linker anyways.

Quick side note before we continue: most architectures use the .got.plt for storing their function addresses - but not Alpha. In fact, within Binutils, Alpha’s ABI is one of only a few that specifically require BFD not to generate a .got.plt section. Other architectures can index into their .got.plt via tricks such as PC-relative load instructions; however, Alpha only contains register-relative loads. Perhaps one could get around this by using another register besides the gp register for storing the .got.plt address. However, this would consume precious register space and require additional code at the start of each function and after every call site to calculate both addresses. On Alpha, .got is shared by all symbol types, and thus it has the DT_PLTGOT tag assigned.

Picking up the trail where execution left off, we now find ourselves in the aforementioned _dl_runtime_resolve_new function:

_dl_runtime_resolve_new:
  // Allocate a stack frame and save a bunch of registers for preservation's-sake...

  mov	$28, $16		/* link_map from .got.plt */

  // Save more registers to the stack...

  mov	$25, $17		/* offset of reloc entry */
  
  // Save even more registers to the stack...

  bsr	$26, _dl_fixup		!samegp
  mov	$0, $27 

  // Reload the registers from the stack and then destroy the stack frame...

  jmp	$31, ($27), 0 // Jump to the fully-resolved function address.

I replaced a bunch of assembly with single-line comments describing their functionality - all the comments surrounded by multi-line demarcation come from the original source, while all the single-line comments are my doing. Once most of the boilerplate stack construction/destruction code is removed for clarity’s sake, _dl_runtime_resolve_new becomes a relatively simple function. It’s more or less a wrapper to go from the PLT stub we just looked at to a function written in C, _dl_fixup. As higher-level languages such as C expect a certain calling standard to be met, _dl_runtime_resolve_new shuffles various registers to comply with the Digital UNIX Calling Standard for Alpha Systems. That’s why there are plenty of moves into registers $16-$21 (aka a0-a5), as they’re used to pass the first six integer arguments in high-level languages such as C, and $0 (aka v0) holds the return value.

_dl_fixup is an incredibly hairy function, even in a somewhat simplified form. I stripped out a lot of the “extra” functionality it provides, such as retrieving cached PLT entries if lazy binding for the requested entry has already occurred, providing auditing hooks, locking, etc. Even so, let’s take it a few steps at a time:

DL_FIXUP_VALUE_TYPE
attribute_hidden __attribute ((noinline)) DL_ARCH_FIXUP_ATTRIBUTE
_dl_fixup (struct link_map *l, ElfW(Word) reloc_arg)
{
  const ElfW(Sym) *const symtab
    = (const void *) D_PTR (l, l_info[DT_SYMTAB]);
  const char *strtab = (const void *) D_PTR (l, l_info[DT_STRTAB]);

  const uintptr_t pltgot = (uintptr_t) D_PTR (l, l_info[DT_PLTGOT]);

While fairly wordy, this initial bit of the function is somewhat self explanatory. The ElfW macro is used in architecture independent glibc code to ensure that the appropriate ELF type is used for a given architecture, as the ELF spec defines separate types for various structures depending on whether the host has a native 32-bit or 64-bit word size. Really the primary tricky part of this code is the D_PTR macro. ELF has a concept of a “dynamic table” for dynamically linked code, which are indexed via “dynamic tags”, and dynamic tags in turn classify the data being referenced via dynamic table entries. Under some architectures, dynamic table entries aren’t relocated until they’re accessed, but since _dl_fixup is supposed to be architecture independent, it needs some way to perform this relocation if required. Ignoring the possible relocation, it’s possible to interpret, say,

D_PTR (l, l_info[DT_STRTAB])

as being

l->l_info[DT_STRTAB]->d_un.d_ptr

or in other words, it indexes into the dynamic table at the DT_STRTAB index and grabs the stored pointer to the ELF string table of the executable/shared library being operated on.

Moving on,

  PLTREL* reloc = l->l_info[DT_JMPREL].d_ptr + reloc_offset(pltgot, reloc_arg);

  Elf64_TYPE* sym = &(symtab[reloc->r_info]);
  Elf64_TYPE* refsym = sym;
  Elf64_TYPE* rel_addr = (l->l_addr + reloc->r_offset);

_dl_fixup next grabs the R_ALPHA_JMP_SLOT relocation for the PLT entry that needs to be fixed up, the symbol it’s being applied for, and the address it’ll be applied to.

  lookup_t result;
  DL_FIXUP_VALUE_TYPE value;

  // ...

  /* Look up the target symbol.  If the normal lookup rules are not
      used don't look in the global scope.  */
  if (__builtin_expect (ELFW(ST_VISIBILITY) (sym->st_other), 0) == 0)
    {
      // ...

      result = _dl_lookup_symbol_x (strtab + sym->st_name, l, &sym, l->l_scope,
				    version, ELF_RTYPE_CLASS_PLT, flags, NULL);

Now that we have the name of the symbol we want to discover the address of, the relocation entry for it, and where we want to apply the relocation to in memory, we can finally begin searching for the symbol definition within the symbol hash tables embedded within the ELF binaries of the loaded executable and its shared libraries, thus taking us into _dl_lookup_symbol_x.

I could push _dl_lookup_symbol_x onto our stack of lazy binding functions we’re exploring, but I won’t 😉. Ali Bahrami of Sun/Oracle’s Linker Aliens team has aleady done an excellent pair of writeups about ELF symbol hashtable lookups, and especially relevant to Linux, the additional bloom filter GNU Binutils adds to speed up searching the various hash tables loaded into memory.

      // ...
      
      value = DL_FIXUP_MAKE_VALUE (result,
				   SYMBOL_ADDRESS (result, sym, false));
    }
  
  // ...
  
  /* And now perhaps the relocation addend.  */
  value = elf_machine_plt_value (l, reloc, value);

The symbol address we found becomes the relocation value applied by the dynamic linker, and just in case there’s an additional relocation addend (there almost certainly won’t be for any R_ALPHA_JMP_SLOT relocations), the addend is applied to the relocation value.

  return elf_machine_fixup_plt (l, result, refsym, sym, reloc, rel_addr, value);
}

By contrast, the elf_machine_fixup_plt function barely even merits a mention, as under the secureplt method of lazy binding, it is quite literally only responsible for writing the resolved function address to the proper GOT index. This is also why the new method of filling out the PLT is referred to as “secureplt”, the prior method used to modify both the executable PLT, and read-only PLT, potentially allowing attackers to insert malicious new code. Once the updated function address is written, elf_machine_fixup_plt returns the resolved function address to _dl_fixup, _dl_fixup returns the resolved function address to _dl_runtime_resolve_new, and _dl_runtime_resolve_new tears down the stack frame it previously allocated, before finally jumping to the resolved function address.

At this point, lazy binding is done, the GOT index of the function we were looking for has been filled in with the proper function address, and any future function calls will go directly to the function rather than the PLT and its stubs.

Except, what was the function that was supposed to be filled in? Taking a look under readelf, we find this:

__ctype_toupper_loc from the GNU C Library. The symbol name itself isn’t directly encoded into the relocation. Intead, the symbol index is encoded as part of the r_info member of the relocation struct, and by checking the symbol table at that index, we can find an offset to the symbol name string.