A Guide To x86 Assembly
A beginner's guide to x86 assembly language, a low-level programming language that is often loosely called "assembler".
Assembly language is any low-level programming language in which there is a very strong correspondence between the program's statements and the architecture's machine code instructions. Some of you may know it from computer science courses where you were expected to read lots of ones and zeros.
What exactly is a low-level programming language?
A low-level programming language provides little to no abstraction from the computer's instruction set architecture: its statements map closely onto the processor's own instructions. The word "low" refers to this small distance between the language and the processor.
Assembly languages are known as second-generation programming languages. An assembler translates an assembly program into machine code, which is the only language a processor can execute without further translation.
[Figure: machine code in hexadecimal form for generating Fibonacci series terms]
Just as we can't easily read machine code, a processor can't understand our language. That's where assembly language comes into play.
An assembler and a linker together act as the translator: the assembler converts the mnemonics we write into machine code, and the linker turns the result into a program we can execute.
What are Mnemonics?
In programming, a mnemonic is a short, readable name for a machine operation. Each mnemonic in assembly stands for a low-level machine instruction (opcode). add, mul, lea, cmp, and je are examples of mnemonics.
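To make the link between a mnemonic and its machine code concrete, here is a minimal sketch of a toy encoder, written in Python so it can run anywhere. It covers only two 32-bit instruction forms, and the function names encode_mov_imm32 and encode_int are hypothetical helpers invented for this illustration, not part of any real assembler.

```python
# Toy encoder for two 32-bit x86 instruction forms. Illustrative only:
# a real assembler handles many more forms, labels, and addressing modes.
import struct

# Register numbers as used inside x86 instruction encodings.
REG_NUM = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3}


def encode_mov_imm32(reg: str, value: int) -> bytes:
    """Encode `mov reg, imm32`: opcode 0xB8 plus the register number,
    followed by the 32-bit immediate in little-endian byte order."""
    return bytes([0xB8 + REG_NUM[reg]]) + struct.pack("<I", value)


def encode_int(vector: int) -> bytes:
    """Encode `int imm8`: opcode 0xCD followed by the interrupt number."""
    return bytes([0xCD, vector])


print(encode_mov_imm32("eax", 4).hex(" "))  # b8 04 00 00 00
print(encode_int(0x80).hex(" "))            # cd 80
```

Each mnemonic becomes a handful of bytes, which is the "strong correspondence" between assembly statements and machine code.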
What Are Registers?
Registers are small, fast storage locations inside the processor. In assembly programming they can be thought of as a fixed set of global variables, like the ones we use in higher-level programming languages for general operations.
Some Different Types of Registers:
- General purpose - EAX, EBX, ESP, EBP
- Segment - CS, DS
- Control - EIP
General Purpose Registers
These are some of the general purpose registers in the x86 architecture. Each of them holds 32 bits of data. Take EAX: its lower 16 bits are called AX, and AX is further divided into two 8-bit halves, AH (the high byte) and AL (the low byte). The same goes for EBX, ECX, and EDX.
EAX - Accumulator Register - used for storing operands and result data
EBX - Base Register - points to data
ECX - Counter Register - Loop operations
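The way AX, AH, and AL overlap with EAX can be modeled with bit masks. The following is a small sketch in Python rather than assembly. The helper names sub_registers and write_al are made up for this example.

```python
def sub_registers(eax: int) -> dict:
    """Split a 32-bit EAX value into its AX, AH, and AL views."""
    eax &= 0xFFFFFFFF             # keep only 32 bits
    return {
        "ax": eax & 0xFFFF,        # low 16 bits
        "ah": (eax >> 8) & 0xFF,   # bits 8-15
        "al": eax & 0xFF,          # bits 0-7
    }


def write_al(eax: int, value: int) -> int:
    """Writing AL replaces only the lowest byte of EAX."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


views = sub_registers(0x12345678)
print({name: hex(v) for name, v in views.items()})
# {'ax': '0x5678', 'ah': '0x56', 'al': '0x78'}
print(hex(write_al(0x12345678, 0xFF)))  # 0x123456ff
```

The sub-registers are not separate storage: they are windows onto the same 32 bits.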
Unlike the registers we saw before, ESP, EBP, ESI, and EDI cannot be divided into 8-bit parts. Only their lower 16 bits can be addressed separately, as SP, BP, SI, and DI.
Registers in a CPU are limited, so you can't use them to store larger chunks of data. That's where memory comes into play. Data can be stored in memory in a stack data structure, and the ESP register serves as an indirect memory operand pointing to the top of the stack at any time.
For example, if the stack currently contains only the integer value 2, then 2 is at the top of the stack and ESP points to it. In the same way, EBP points to the base of the current stack frame.
What doesn't fit in registers lives in memory
Memory is accessed either with loads and stores at addresses as if it were a big array, or through PUSH and POP operations on a stack.
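The PUSH and POP behavior described above can be sketched as a toy model in Python. The class Stack32 and its 64-byte memory are assumptions made for this illustration. The addresses are just offsets into a byte array, not real process addresses.

```python
import struct


class Stack32:
    """Toy model of a 32-bit x86 stack inside a small block of memory."""

    def __init__(self, size: int = 64):
        self.memory = bytearray(size)
        self.esp = size   # empty stack: ESP sits just past the last byte
        self.ebp = size   # base of the (only) frame in this model

    def push(self, value: int) -> None:
        self.esp -= 4     # the stack grows toward lower addresses
        struct.pack_into("<I", self.memory, self.esp, value & 0xFFFFFFFF)

    def pop(self) -> int:
        (value,) = struct.unpack_from("<I", self.memory, self.esp)
        self.esp += 4
        return value


stack = Stack32()
stack.push(2)
print(stack.esp)    # 60: ESP now points at the value 2
stack.push(7)
print(stack.pop())  # 7
print(stack.pop())  # 2
print(stack.esp)    # 64: the stack is empty again
```

Note that pushing makes ESP smaller: on x86 the stack grows downward in memory.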
In the general memory hierarchy of a computer, registers sit at the top: they are the fastest storage but also the smallest. Moving down the hierarchy, storage size increases and speed decreases.
How are data types stored in memory?
There are several ways in which multibyte data types can be stored. The two most common are little endian and big endian.
Little endian byte order is used by Intel x86 processors. Arm processors, which are common in mobile devices where battery and power consumption play an important role, can be configured for either byte order but in practice usually run little endian as well. Big endian is found mainly in network protocols ("network byte order") and some older architectures.
Consider how 0x01234567 would be stored in memory. In big endian the bytes are stored in the order written, most significant byte first: 01, 23, 45, 67. In little endian the bytes are stored in the reverse order: 67 is written first, then 45, then 23, and finally 01.
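The two byte orders for 0x01234567 can be checked with a few lines of Python, using only the built-in int.to_bytes and int.from_bytes methods:

```python
import sys

value = 0x01234567

big = value.to_bytes(4, "big")        # most significant byte first
little = value.to_bytes(4, "little")  # least significant byte first

print(big.hex(" "))     # 01 23 45 67
print(little.hex(" "))  # 67 45 23 01

# Reading the bytes back with the matching order recovers the value.
restored = int.from_bytes(little, "little")
print(hex(restored))    # 0x1234567

# Byte order of the machine running this script ('little' on x86).
print(sys.byteorder)
```

Reading little-endian bytes as if they were big endian would give 0x67452301, which is why the byte order matters when inspecting memory dumps.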
Let's talk about Memory Segments!
- Text - Contains instructions for the program
- Data - Contains data for the program, e.g. message strings
- BSS - Contains all uninitialized global variables
[Figure: structure of a Hello World program in assembly]
The entry point of the program is a global symbol called _start, and program execution starts there. The Text section contains the instructions to print the message and exit the program. The Data section contains the message string "Hello World!", which is used by the print instructions in the Text section.
One of the Most important Registers : EIP
As we discussed before, assembly is executed instruction by instruction, in the order the instructions are written.
_start:
- mov ecx, 5
- mov edx, 5
- cmp ecx, edx
In the assembly program above, execution starts at the symbol _start.
EIP points to the next instruction to execute.
Before the first instruction is executed, EIP holds its address. After it is executed, EIP has advanced by the length of that instruction in bytes, so it points to the second instruction. Program execution flows this way, so an attacker who wants to take control of a program has to manipulate the value of EIP.
Just like if/else statements in higher-level programming languages, assembly provides mnemonics to control the flow of a program, but let's first look at some basic mnemonics.
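The way EIP moves through the three instructions above can be traced with a short Python sketch. The instructions are written in Intel operand order. The start address 0x08048000 is a hypothetical value chosen for illustration. The byte lengths (5, 5, 2) are those of the usual 32-bit encodings of these forms and should be treated as an assumption of this sketch.

```python
# (instruction text, encoded length in bytes)
PROGRAM = [
    ("mov ecx, 5", 5),    # 1 opcode byte + 4-byte immediate
    ("mov edx, 5", 5),    # 1 opcode byte + 4-byte immediate
    ("cmp ecx, edx", 2),  # 1 opcode byte + 1 operand byte
]

START = 0x08048000  # hypothetical address of _start


def trace_eip(program, start):
    """Return (EIP before each instruction, text) pairs and the final EIP."""
    eip = start
    trace = []
    for text, length in program:
        trace.append((eip, text))
        eip += length  # EIP advances by the instruction's length
    return trace, eip


trace, final_eip = trace_eip(PROGRAM, START)
for eip, text in trace:
    print(f"{eip:#010x}  {text}")
print(f"{final_eip:#010x}  (next instruction)")
```

EIP moves by 5, 5, and then 2 bytes, not by one per instruction.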
A processor provides many mnemonics, and most of them are pretty much self-explanatory. Let's discuss the jmp instruction.
jmp - it's like the goto statement in C: it jumps to the specified location unconditionally. Consider the code below.
- mov ecx, 5
- mov edx, 5
- jmp 5
- mov ecx, 6
- cmp ecx, edx
- je function
- function:
In the snippet above, the 1st and 2nd instructions are executed one after another, leaving 5 in both ecx and edx. Then the jmp 5 instruction is encountered, so control is transferred directly to instruction number 5, counting from the top. Instruction number 4 is therefore never executed. Now let's look at the cmp instruction, which is where execution lands after the 3rd instruction:
cmp ecx, edx
cmp compares ecx and edx by subtracting one from the other. The result itself is discarded and only the flags are kept. If the result of the subtraction is zero, the values stored in ecx and edx are the same.
In that case the zero flag is set to one, indicating that the result is zero.
Next, a je instruction is encountered.
je checks the zero flag left by the previous instruction. je simply means "jump if equal": if ecx and edx are equal, je redirects the flow to the label function:
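The walkthrough above can be replayed with a tiny interpreter. This is a sketch in Python of a made-up mini instruction set, not a real x86 emulator. Jump targets are instruction numbers counted from 1, as in the listing, and the label function: is modeled as a seventh, do-nothing entry.

```python
def run(program):
    """Execute a tiny x86-like program and report what happened."""
    regs = {"ecx": 0, "edx": 0}
    zero_flag = False
    executed = []          # instruction numbers, in execution order
    ip = 1                 # 1-based instruction number
    while ip <= len(program):
        op, *args = program[ip - 1]
        executed.append(ip)
        next_ip = ip + 1
        if op == "mov":
            regs[args[0]] = args[1]
        elif op == "cmp":
            # Subtract, keep only the zero flag, discard the result.
            zero_flag = (regs[args[0]] - regs[args[1]]) == 0
        elif op == "jmp":
            next_ip = args[0]
        elif op == "je" and zero_flag:
            next_ip = args[0]
        ip = next_ip
    return regs, zero_flag, executed


PROGRAM = [
    ("mov", "ecx", 5),      # 1
    ("mov", "edx", 5),      # 2
    ("jmp", 5),             # 3: skip straight to instruction 5
    ("mov", "ecx", 6),      # 4: never executed
    ("cmp", "ecx", "edx"),  # 5
    ("je", 7),              # 6: taken, because the zero flag is set
    ("label",),             # 7: function:
]

regs, zero_flag, executed = run(PROGRAM)
print(regs)       # {'ecx': 5, 'edx': 5}
print(zero_flag)  # True
print(executed)   # [1, 2, 3, 5, 6, 7]
```

Instruction 4 never appears in the execution order, so ecx keeps the value 5 and the comparison sets the zero flag.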
Computers contain a layered structure
- Level 3 - Application Level Libraries
- Level 2 - System Level Libraries
- Level 1 - Operating System
- Level 0 - Bare Metal
The OS contains libraries and drivers
To interact with the OS we use system calls (syscalls). The operating system offers services to the applications running on it, and these services are accessible through system calls for opening files, mapping memory, reading directory contents, and so on. All of these actions require interaction with the hardware (the hard drive, the memory management unit) and are managed by the OS.
Every Linux system call is numbered, so it can be referenced by its number when making the call in assembly. On 32-bit x86, for example:
EXIT - 1
WRITE - 4
How do system calls work?
[Figure: how system calls work]
A user space program requests a system call by raising an interrupt. The interrupt is looked up in the interrupt handler table, which invokes the system call handler, which in turn invokes the specific system call. There are mainly two ways of invoking a system call:
int 0x80; and
SYSENTER
Every syscall takes some arguments, so before executing a syscall we need our parameters ready in registers.
EAX contains the syscall number and the other registers contain the arguments. We can get details about a specific syscall from its man page on Linux with "man 2 (syscall name)".
E.g. man 2 write
So, for the write syscall, we store the syscall number, 4, in EAX, the file descriptor in EBX, a pointer to the string we want to print in ECX, and the number of bytes to print in EDX. After storing all that, we simply invoke the interrupt with int 0x80.
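The same three arguments (file descriptor, buffer, byte count) show up when write is called from a higher-level language. As a rough illustration, Python's os.write is a thin wrapper over the write system call. The sketch below writes into a pipe instead of the terminal so the result can be read back and checked.

```python
import os

message = b"Hello World!"

# A pipe gives us a file descriptor to write to and one to read from.
read_fd, write_fd = os.pipe()

# os.write(fd, data) takes the file descriptor and the buffer; the byte
# count is the buffer's length. It returns the number of bytes written.
written = os.write(write_fd, message)
os.close(write_fd)

echoed = os.read(read_fd, written)
os.close(read_fd)

print(written)  # 12
print(echoed)   # b'Hello World!'

# Passing file descriptor 1 instead of write_fd would send the bytes
# to standard output, which is what the assembly program does.
```

In assembly, those same values go into EBX, ECX, and EDX, with the syscall number in EAX.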
Now let's try writing our first program of printing Hello world! in assembly
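global _start                    ; export the entry point for the linker
section .text
_start:
    mov eax, 0x4                 ; syscall number 4 = write
    mov ebx, 0x1                 ; file descriptor 1 = standard output
    mov ecx, message             ; pointer to the string
    mov edx, 12                  ; number of bytes to write
    int 0x80                     ; hand control to the kernel
    mov eax, 0x1                 ; syscall number 1 = exit
    mov ebx, 0x5                 ; exit status 5
    int 0x80                     ; hand control to the kernel
section .data
message: db "Hello World!"       ; define bytes (12 characters)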
We first declared _start as a global symbol,
then started the text section with section .text.
To execute write, we moved the syscall number of write, which is 4, into eax,
then the file descriptor 1 (standard output) into ebx,
then the pointer to message, from our .data section, into ecx.
We need to print 12 bytes, so we moved 12 into edx,
then raised the interrupt, which prints Hello World!
Next we moved 1 into eax, which is the syscall number of exit.
We want to exit with status code 5, so we moved 5 into ebx,
then raised int 0x80 again to execute it.
In the data section,
message: db "Hello World!" means we are defining message as a sequence of bytes (db stands for "define byte") holding the string "Hello World!".
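To see what such a program looks like from the outside, the sketch below mimics its two observable effects in Python: 12 bytes on standard output and an exit status of 5. It does not assemble or run the assembly itself. It starts a child Python process that performs the equivalent write and exit, then inspects the result. The CHILD snippet is an assumption made for this illustration.

```python
import subprocess
import sys

# Child program: write 12 bytes to file descriptor 1, then exit with
# status 5 immediately, mirroring the write and exit syscalls.
CHILD = 'import os; os.write(1, b"Hello World!"); os._exit(5)'

result = subprocess.run([sys.executable, "-c", CHILD], capture_output=True)

print(result.stdout)      # b'Hello World!'
print(result.returncode)  # 5
```

After running the real assembled program from a shell, the exit status can be inspected the same way, for example with echo $? in most Unix shells.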
It seems we have successfully written our first program in X86 assembly!
Thanks for reading, ask me questions by messaging me directly on twitter!
twitter : @malav_vyas1 | github : github.com/malavyas | web : malavvyas.tk
A special thanks to these creators for blogs and videos helping this article: Security Tube, LiveOverflow and 0x00sec.