Preface

A while ago I did a small course in assembly, mostly because I keep getting drawn to the question “why does it work this way?”. If you keep asking that question over and over, at some point you’ll end up staring at assembly code. I hadn’t written any in a while, though, so I figured it would be a good exercise to write some assembly code again. I decided on writing a print function callable from C code. Nothing fancy or complex, but fun enough to work on and share a thing or two about.


What this covers

  • Part 1 covers the basics to get you up to speed: the CPU, registers, basic x86 (32-bit) instructions, program memory and assembling;
  • Part 2 covers writing the actual Print function.

What is Assembly?

Assembly is less a programming language and more the closest human-readable representation of machine code. It is readable enough to make sense of what’s going on, but too low level to practically write large-scale software in (with only a few well-established exceptions). Instead of the compiler we’d use for C code, assembly is processed by an assembler, which translates assembly instructions into machine code (opcodes) for the target platform.

Assembly files usually have the .asm suffix. There are several flavours of assembly; some work nicely together (x86 code runs fine on an x86-64 architecture), but some do not (like ARM on x86-64). This blog will focus only on x86, as I wanted my print function to be compatible with both x86 and x86-64.


The CPU

The Central Processing Unit (CPU) lies at the core of modern-day computing. It is an electronic circuit that executes instructions. A CPU is typically composed of several cores (or a single core on older hardware), and every core has its own set of registers. The more cores a CPU has, the more concurrent tasks it can perform, as the workload can be spread across the available cores.

The CPU also has its own set of caches (L1, L2, L3 and rarely L4).

  • L1 Cache is located as close to a CPU core as possible and thus offers the highest speed. Each CPU core has its own L1 cache. The downside is that L1 caches are typically only a few kilobytes in size;
  • L2 Cache is bigger than L1 in size (hundreds of kilobytes to a few megabytes) but slower, and is usually still private to each core;
  • L3 Cache is bigger than L1 and L2 and typically shared across multiple CPU cores. It is still faster than Random Access Memory (RAM), but slower than the L1 and L2 caches.

Registers

Registers are the fastest form of storage in the CPU. They hold small amounts of data (typically 64, 32, 16 or 8 bits depending on the register and architecture), which the CPU uses to perform arithmetic, logic or control functions. Registers differ per architecture, but for the x86 architecture the following are the most important:

Name            32-bit  16-bit  8-bit high (bits 8-15)  8-bit low (bits 0-7)
Accumulator     EAX     AX      AH                      AL
Base            EBX     BX      BH                      BL
Counter         ECX     CX      CH                      CL
Data            EDX     DX      DH                      DL
Stack Pointer   ESP     SP      -                       -
Source Index    ESI     SI      -                       -
Base Pointer    EBP     BP      -                       -

Hierarchy of memory speed

With this information in mind, we can build a model of memory from fastest to slowest, to easily remember which kinds of memory a machine has and how quickly each can be accessed. This is useful to keep in mind when writing code or reverse engineering a binary.

Level       Speed      Size      Shared?
Registers   Fastest    Bits      Per core
L1 Cache    Very fast  ~KB       Per core
L2 Cache    Fast       ~MB       Per core
L3 Cache    Moderate   ~MB–GB    Shared
RAM         Slow       GB        Shared

Instructions

Instructions can be grouped into three categories:

  • Data movement: e.g., mov, push, pop, xchg;
  • Arithmetic and logic: e.g., add, sub, and, or, xor, not;
  • Control flow: e.g., jmp, call, ret, int.

A very basic assembly program can be seen below, where mov and int will be used:

  • mov: copies a value into a register. For example, mov eax, 1 places the number 1 into the eax register;
  • int: short for interrupt. It hands control over to a specific interrupt handler, identified by the number that comes after it.

With that in mind, here’s a small but functional program: example_1.asm

    mov eax, 1
    mov ebx, 1
    int 0x80

Step by step:

  1. mov eax, 1 puts the number 1 into eax.
  2. mov ebx, 1 puts the number 1 into ebx.
  3. int 0x80 triggers interrupt 0x80, which on Linux is the handler for system calls.

When interrupt 0x80 is triggered, control is handed over to the kernel, which looks at eax to decide which system call to run. System call 1 is exit; the kernel then reads ebx to determine the exit code. So this program exits with status 1.


Program Memory

While a program runs, it uses two main regions of memory for its information:

  • Stack: Holds data in a LIFO (Last In, First Out) structure. Values are added with push and removed with pop, and the CPU tracks the current top via the stack pointer (ESP);
  • Heap: Used for dynamic allocation, i.e. memory requested at runtime whose size or lifetime is not known in advance. Heap memory must be managed explicitly (freed when it is no longer needed) to avoid unnecessary data lingering on the heap.

The Stack

Some examples in assembly of using the stack (assuming esp starts at 1000):

    push 0          ; store 0 on the stack; esp moves to 996
    push 1          ; store 1 on the stack; esp moves to 992
    mov eax, [esp]  ; read the value at the top of the stack (992), i.e. 1, into 'eax'
    mov ebx, [esp+4]; read the value at 992+4=996, i.e. 0, into 'ebx'
    int 0x80        ; eax=1 (exit), ebx=0: exit with status 0

Notice that each push moves esp down by 4: in 32-bit mode, push stores a 4-byte (32-bit) value on the stack.

It is important to note that the stack grows downwards in memory. If the stack starts at address 10000, it grows towards lower addresses (9996, 9992, and so on). The stack pointer should never exceed its starting value (e.g., esp = 10004), otherwise a stack underflow occurs. Conversely, if the stack grows beyond the region allocated for it, a stack overflow occurs.

Stack memory is rather straightforward to manage. In high-level languages especially, once variables are no longer in scope, they may be removed or overwritten.

{
    int a = 1;                  // A is initialized and will be in the scope.
    printf("int a = %d", a);    // A is being used.
}
// A is no longer in scope and its previous stack location may be reused.

The heap

The heap is less organized than the stack. Unlike the stack, the heap grows upwards in memory and does not follow a strict order. Memory on the heap must be manually requested and released by the programmer.

In low level languages like C, this is done explicitly:

int *a = malloc(4);   // Request 4 bytes on the heap, store the address in 'a'.
*a = 42;              // Write the value 42 to that address.
free(a);              // Release the memory back to the heap.

If memory is allocated on the heap but never released, it will remain occupied for the lifetime of the program. This is known as a memory leak.

Unlike the stack, heap memory is not automatically cleaned up when a variable goes out of scope. The programmer is fully responsible for managing it. This is why higher level languages like Java or Python use a garbage collector, which automatically detects and frees heap memory that is no longer referenced.

In assembly, heap memory is requested from the operating system via system calls. On Linux x86, for example, the brk or mmap syscall can be used to request a region of memory:

    mov eax, 45       ; syscall number for 'brk'
    mov ebx, 0        ; pass 0 to get the current heap boundary
    int 0x80          ; the OS returns the current end of the heap in eax

Because heap memory is unstructured, accessing memory that has already been freed or was never allocated leads to undefined behavior and is a common source of bugs and security vulnerabilities.


Assembling

Once assembly code has been written, it first needs to be assembled by an assembler before it can be tested. For this, the Netwide Assembler (nasm) will be used. As an example, let’s add a _start label to example_1.asm and assemble it:

global _start
_start:
    mov eax, 1
    mov ebx, 1
    int 0x80

To assemble this piece of code we run:

nasm -f elf32 example_1.asm -o example_1.o

This assembles the code into an object file, which can be linked or used in other code. For now it can be linked using ld:

ld -m elf_i386 example_1.o -o example_1

Now executing example_1 will start the program, which immediately exits with status code 1. With echo, we can retrieve the exit status of the last command:

echo $?
1

Key Takeaways / TL;DR

  • Assembly comes in several different flavours, of which x86 is also compatible with x86-64;
  • Registers and caches trade capacity for speed: the more data a component can hold, the slower it is;
  • Program memory is divided into the Stack (LIFO) and the Heap (dynamic);
  • Instructions cover data movement, arithmetic and logic, or control flow;
  • Basic instructions include mov, which copies a value into another place, and int, which calls an interrupt;
  • Assemblers like nasm can be used to assemble the assembly code into an object file;
  • Linkers like ld can be used to link the necessary libraries with the code and build an executable.