Write your own print function in x86 assembly - Part 1
Preface
A while ago I did a small course in assembly. I did this mostly because I keep getting drawn to the question “why does it work this way?”. And if you keep questioning this over and over, at some point you’ll end up staring at assembly code. I hadn’t written any for a while though, so I figured it would be a good exercise to write some assembly code again. I decided on writing a print function callable from C code. Nothing fancy or complex, but fun enough to work on and share a thing or two about.
What this covers
- Part 1 covers the assembly basics to get you up to speed with CPU basics, registers, basic x86 (32 bit) instructions, program memory and assembling;
- Part 2 covers writing the actual Print function.
What is Assembly?
Assembly is less a programming language in the usual sense and more the closest human-readable representation of machine code. It is readable enough to make sense of what’s going on, but too low level to practically write large-scale software in (with only a few well-established exceptions). Where C code goes through a compiler, assembly goes through an assembler, which translates assembly instructions into machine code (opcodes) for the target platform.
Assembly files usually have the .asm suffix. There are several flavours of assembly. Some work nicely together (32-bit x86 code runs fine on an x86-64 processor), but some do not (ARM code will not run on x86-64). This blog will focus only on x86, as I wanted my print function to be compatible with both x86 and x86-64.
The CPU
The Central Processing Unit (CPU) lies at the core of modern-day computing. It is an electronic circuit that executes instructions. A CPU is typically composed of several cores, or only one core if you’re using old hardware. Every core has its own set of registers. The more cores a CPU has, the more tasks it can run concurrently, since the workload can be spread across the available cores.
The CPU also has its own set of caches (L1, L2, L3 and rarely L4).
- L1 Cache is located as close to a CPU core as possible and thus offers the highest speed. Each CPU core has its own L1 cache. The downside of L1 caches is that they are small, typically tens of kilobytes;
- L2 Cache is bigger than L1 (hundreds of kilobytes to a few megabytes) and sits a little further away from the core’s execution units. It is slower than the L1 Cache and typically still private to each core;
- L3 Cache is bigger than L1 and L2 and typically shared across multiple CPU cores. It is still faster than Random Access Memory (RAM), but slower than the L1 and L2 cache.
Registers
Registers are the fastest form of storage in the CPU. They hold small amounts of data (typically 64, 32, 16 or 8 bits depending on the register and architecture), which the CPU operates on to perform arithmetic, logic or control functions. The registers differ per architecture, but for the x86 architecture the following are the most important registers:
| Name | 32-bit | 16-bit | 8-bit high (bits 8-15) | 8-bit low (bits 0-7) |
|---|---|---|---|---|
| Accumulator | EAX | AX | AH | AL |
| Base | EBX | BX | BH | BL |
| Counter | ECX | CX | CH | CL |
| Data | EDX | DX | DH | DL |
| Stack Pointer | ESP | SP | - | - |
| Source Index | ESI | SI | - | - |
| Base Pointer | EBP | BP | - | - |
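To make the sub-register columns a bit more concrete, here’s a minimal C sketch of how AX, AH and AL are just different views on the same 32 bits of EAX. The helper names are my own and purely illustrative, they are not x86 instructions.

```c
#include <stdint.h>

/* Hypothetical helpers: each returns the part of a 32-bit value that the
   named sub-register would expose. */

/* AX: bits 0-15 of EAX. */
uint16_t low_word(uint32_t eax) {
    return (uint16_t)(eax & 0xFFFFu);
}

/* AH: bits 8-15 of EAX. */
uint8_t high_byte(uint32_t eax) {
    return (uint8_t)((eax >> 8) & 0xFFu);
}

/* AL: bits 0-7 of EAX. */
uint8_t low_byte(uint32_t eax) {
    return (uint8_t)(eax & 0xFFu);
}

/* Writing AL replaces only bits 0-7 and leaves the rest of EAX untouched. */
uint32_t write_low_byte(uint32_t eax, uint8_t al) {
    return (eax & 0xFFFFFF00u) | al;
}
```

With EAX holding 0x12345678, low_word gives 0x5678, high_byte gives 0x56 and low_byte gives 0x78. Writing 0xFF through write_low_byte turns the value into 0x123456FF.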
Hierarchy of memory speed
With this information in mind, we can make a model of fastest to slowest form of accessible memory to easily remember which kind of memory a machine has and how fast it can be retrieved. This information is great when writing code or reverse engineering a binary.
| Level | Speed | Size | Shared? |
|---|---|---|---|
| Registers | Fastest | Bits | Per core |
| L1 Cache | Very fast | ~KB | Per core |
| L2 Cache | Fast | ~KB–MB | Per core |
| L3 Cache | Moderate | ~MB | Shared |
| RAM | Slow | GB | Shared |
Instructions
Instructions can be categorized in 3 different camps:
- Data movement: e.g., mov, push, pop, xchg;
- Arithmetic and logic: e.g., add, sub, and, or, xor, not;
- Control flow: e.g., jmp, call, ret, int.
A very basic assembly program can be seen below, where mov and int will be used:
- mov: copies a value into a register. For example, mov eax, 1 places the number 1 into the eax register;
- int: short for interrupt. It hands control over to a specific interrupt handler, identified by the number that comes after it.
With that in mind, here’s a small but functional program: example_1.asm
mov eax, 1
mov ebx, 1
int 0x80
Step by step:
- mov eax, 1 puts the number 1 into eax.
- mov ebx, 1 puts the number 1 into ebx.
- int 0x80 triggers interrupt 0x80, which on Linux is the handler for system calls.
When interrupt 0x80 is triggered, control is handed over to the kernel, which looks at eax to decide which system call to run. System call 1 is exit. The kernel then looks at the value in ebx to determine which exit code to use, so this program exits with status 1.
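For comparison, here’s a rough C sketch of the same exit system call, assuming a Linux machine with glibc’s syscall wrapper. The function name is hypothetical. It runs the call in a child process so the parent can read back the status, much like the shell does. Note that SYS_exit resolves to the right number for the platform you compile on: it is 1 for the 32-bit int 0x80 interface used in this post, but 60 on x86-64.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs the raw exit system call in a child process and
   returns the exit status the parent observes, or -1 if something failed. */
int exit_status_via_syscall(int status) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1; /* fork failed */
    }
    if (pid == 0) {
        /* Child: the C counterpart of loading eax and ebx, then int 0x80. */
        syscall(SYS_exit, status);
        _exit(127); /* only reached if the system call somehow returned */
    }
    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) < 0) {
        return -1;
    }
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

Calling exit_status_via_syscall(1) should hand back 1, which is the same status example_1.asm exits with.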
Program Memory
While a program runs, it uses two main regions of memory for its information:
- Stack: holds data in a LIFO (Last In First Out) structure. Values are added with push and removed with pop, and the CPU tracks the current top via the stack pointer (ESP);
- Heap: used for dynamic allocation. This is memory requested at runtime, whose size or lifetime is not known in advance. It must be managed explicitly (freed when it is no longer needed) to avoid unnecessary data lingering in the heap.
The Stack
Some examples in assembly of using the stack (assuming origin is 1000):
push 0           ; put 0 on the stack at address 996
push 1           ; put 1 on the stack at address 992
mov eax, [esp]   ; read the value at the top of the stack (address 992, value 1) into eax
mov ebx, [esp+4] ; read the value at address 996 (value 0) into ebx
int 0x80         ; eax = 1 selects exit, ebx = 0 is the exit status
Notice that every push moves esp by 4. That is because in 32-bit x86 a push stores a 4-byte (32-bit) value by default.
It is important to note that the stack grows downwards in memory. If the origin of the stack is at address 10000, pushing 4-byte values moves the stack pointer to 9996, 9992, and so on. The stack pointer should never exceed its starting value (e.g., esp = 10004), otherwise a stack underflow occurs. The other way around, if the stack grows below the lowest address reserved for it, a stack overflow occurs.
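The push and pop behaviour described above can be modelled in a few lines of C. This is only a toy sketch: the Machine struct and the function names are hypothetical, the origin of 1000 is taken from the assembly example, and there is no overflow or underflow checking.

```c
#include <stdint.h>
#include <string.h>

#define STACK_ORIGIN 1000u /* same origin as the assembly example */

/* Hypothetical model: 'memory' stands in for RAM and 'esp' for the stack
   pointer register. */
typedef struct {
    uint8_t  memory[STACK_ORIGIN];
    uint32_t esp;
} Machine;

void machine_init(Machine *m) {
    memset(m->memory, 0, sizeof m->memory);
    m->esp = STACK_ORIGIN; /* empty stack: esp sits at the origin */
}

/* push: move esp down by 4, then store the value at the new top. */
void push32(Machine *m, uint32_t value) {
    m->esp -= 4;
    memcpy(&m->memory[m->esp], &value, sizeof value);
}

/* Read the 4-byte value stored at 'address', like mov reg, [address]. */
uint32_t load32(const Machine *m, uint32_t address) {
    uint32_t value;
    memcpy(&value, &m->memory[address], sizeof value);
    return value;
}

/* pop: read the value at the top, then move esp back up by 4. */
uint32_t pop32(Machine *m) {
    uint32_t value = load32(m, m->esp);
    m->esp += 4;
    return value;
}
```

After machine_init, pushing 0 and then 1 leaves esp at 992. Loading from esp gives 1 and loading from esp + 4 gives 0, which lines up with the assembly snippet above.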
Stack memory is rather straightforward to manage. In high-level languages in particular, once a variable goes out of scope, its stack location may be reused or overwritten.
#include <stdio.h>

int main(void) {
    {
        int a = 1;                 // a is initialized and is in scope.
        printf("int a = %d\n", a); // a is being used.
    }
    // a is no longer in scope and its previous stack location may be reused.
    return 0;
}
The Heap
The heap is less organized than the stack. Unlike the stack, the heap conventionally grows upwards in memory and does not follow a strict order. Memory on the heap must be manually requested and released by the programmer.
In low level languages like C, this is done explicitly:
int *a = malloc(4); // Request 4 bytes on the heap, store the address in 'a'.
*a = 42; // Write the value 42 to that address.
free(a); // Release the memory back to the heap.
If memory is allocated on the heap but never released, it will remain occupied for the lifetime of the program. This is known as a memory leak.
Unlike the stack, heap memory is not automatically cleaned up when a variable goes out of scope. The programmer is fully responsible for managing it. This is why higher level languages like Java or Python use a garbage collector, which automatically detects and frees heap memory that is no longer referenced.
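As a small illustration of memory whose size is only known at runtime, here’s a sketch of a function that returns a heap-allocated array. The name make_squares is made up for this example. The point is that the memory outlives the function call, so the caller becomes responsible for freeing it.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns a heap array holding 0*0, 1*1, 2*2, ... for
   'count' elements, or NULL if the allocation fails. The caller owns the
   memory and must free() it. */
int *make_squares(size_t count) {
    int *squares = malloc(count * sizeof *squares);
    if (squares == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < count; i++) {
        squares[i] = (int)(i * i);
    }
    return squares;
}
```

A caller would do something like int *s = make_squares(4); use s[0] through s[3], and then call free(s). Forgetting that last call is exactly the memory leak described above.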
In assembly, heap memory is requested from the operating system via system
calls. On Linux x86, for example, the brk or mmap syscall can be used to
request a region of memory:
mov eax, 45 ; syscall number for 'brk'
mov ebx, 0 ; pass 0 to get the current heap boundary
int 0x80 ; the OS returns the current end of the heap in eax
Because heap memory is unstructured, accessing memory that has already been freed or was never allocated leads to undefined behavior and is a common source of bugs and security vulnerabilities.
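The brk call from the assembly snippet can also be tried from C. This is a minimal sketch, assuming Linux with glibc’s syscall wrapper. The function name is hypothetical, and SYS_brk resolves to the right system call number for the platform you compile on (45 for the 32-bit interface used in this post).

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: asks the kernel for the current end of the heap (the
   "program break"), like calling brk with an argument of 0 in assembly. */
unsigned long current_program_break(void) {
    return (unsigned long)syscall(SYS_brk, 0L);
}
```

The returned address is only meaningful inside the running process. In normal C code you would leave this to malloc rather than calling brk yourself.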
Assembling
Once assembly code has been written and needs to be tested, it first needs to be assembled by an assembler. For this, the Netwide Assembler (nasm) will be used. As an example, let’s add a _start label (the entry point ld looks for by default) to example_1.asm and start assembling it:
section .text
global _start

_start:
    mov eax, 1  ; system call number 1 = exit
    mov ebx, 1  ; exit status
    int 0x80    ; hand control over to the kernel
To assemble this piece of code we run:
nasm -f elf32 example_1.asm -o example_1.o
This assembles the code into an object file, which can be linked or used in other code. For now it can be linked using ld:
ld -m elf_i386 example_1.o -o example_1
Now executing example_1 will start the program, which immediately exits with status code 1. With echo $?, we can print the exit code of the last program that ran:
echo $?
1
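To make “assembling” a bit more concrete, here’s a sketch in C of what an assembler does for the three instructions in example_1.asm: turn each one into its bytes of machine code. The helper names are hypothetical and only these two instruction forms are covered. A real assembler like nasm handles far more than this.

```c
#include <stddef.h>
#include <stdint.h>

/* Register numbers as they appear in the x86 instruction encoding. */
enum { REG_EAX = 0, REG_ECX = 1, REG_EDX = 2, REG_EBX = 3 };

/* Hypothetical helper: encodes 'mov r32, imm32' as opcode 0xB8 + register
   number, followed by the immediate in little-endian byte order. Returns
   the number of bytes written (always 5). */
size_t encode_mov_imm32(uint8_t *out, int reg, uint32_t imm) {
    out[0] = (uint8_t)(0xB8 + reg);
    out[1] = (uint8_t)(imm & 0xFF);
    out[2] = (uint8_t)((imm >> 8) & 0xFF);
    out[3] = (uint8_t)((imm >> 16) & 0xFF);
    out[4] = (uint8_t)((imm >> 24) & 0xFF);
    return 5;
}

/* Hypothetical helper: encodes 'int imm8' as opcode 0xCD followed by the
   interrupt number. Returns the number of bytes written (always 2). */
size_t encode_int(uint8_t *out, uint8_t number) {
    out[0] = 0xCD;
    out[1] = number;
    return 2;
}

/* Encodes the three instructions of example_1.asm; 'out' needs 12 bytes. */
size_t encode_example_1(uint8_t *out) {
    size_t n = 0;
    n += encode_mov_imm32(out + n, REG_EAX, 1); /* mov eax, 1 */
    n += encode_mov_imm32(out + n, REG_EBX, 1); /* mov ebx, 1 */
    n += encode_int(out + n, 0x80);             /* int 0x80   */
    return n;
}
```

The result is 12 bytes: B8 01 00 00 00, BB 01 00 00 00 and CD 80. Those are the instruction bytes that end up in the code section of the object file, wrapped in the ELF structure the linker needs.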
Key Takeaways / TL;DR
- Assembly has several different flavours, of which x86 is compatible with x86-64;
- Registers and caches involve a tradeoff: the more data a component can hold, the slower it tends to be;
- Program memory is divided into the stack (LIFO) and the heap (dynamic);
- Instructions cover data movement, arithmetic and logic, or control flow;
- Basic instructions are mov and int: mov copies a value into another place, and int calls an interrupt;
- Assemblers like nasm can be used to assemble the assembly code into an object file;
- Linkers like ld can be used to link the necessary libraries with the code and build an executable.