Sysenter Based System Call Mechanism in Linux 2.6

Starting with version 2.5, linux kernel introduced a new system call entry mechanism on Pentium II+ processors. Due to performance issues on Pentium IV processors with existing software interrupt method, an alternative system call entry mechanism was implemented using SYSENTER/SYSEXIT instructions available on Pentium II+ processors. This article explores this new mechanism. Discussion is limited to x86 architecture and all source code listings are based on linux kernel 2.6.15.6.

1. What are system calls?

System calls provide userland processes a way to request services from the kernel. What kind of services? Services which are managed by operating system like storage, memory, network, process management etc. For example if a user process wants to read a file, it will have to make 'open' and 'read' system calls. Generally system calls are not called by processes directly. C library provides an interface to all system calls.

2. What happens in a system call?

A kernel code snippet is run on request of a user process. This code runs in ring 0 (with current privilege level -CPL- 0), which is the highest level of privilege in x86 architecture. All user processes run in ring 3 (CPL 3). So, to implement system call mechanism, what we need is 1) a way to call ring 0 code from ring 3 and 2) some kernel code to service the request.

3. Good old way of doing it

Until some time back, linux used to implement system calls on all x86 platforms using software interrupts. To execute a system call, user process will copy desired system call number to %eax and will execute 'int 0x80'. This will generate interrupt 0x80 and an interrupt service routine will be called. For interrupt 0x80, this routine is an "all system calls handling" routine. This routine will execute in ring 0. This routine, as defined in the file /usr/src/linux/arch/i386/kernel/entry.S, will save the current state and call appropriate system call handler based on the value in %eax.

4. New shiny way of doing it

It was found out that this software interrupt method was much slower on Pentium IV processors. To solve this issue, Linus implemented an alternative system call mechanism to take advantage of SYSENTER/SYSEXIT instructions provided by all Pentium II+ processors. Before going further with this new way of doing it, let's make ourselves more familiar with these instructions.

4.1. SYSENTER/SYSEXIT instructions:

Let's look at the authorized source, Intel manual itself. Intel manual says:

The SYSENTER instruction is part of the "Fast System Call" facility introduced on the Pentium® II processor. The SYSENTER instruction is optimized to provide the maximum performance for transitions to protection ring 0 (CPL = 0). The SYSENTER instruction sets the following registers according to values specified by the operating system in certain model-specific registers.
CS register set to the value of (SYSENTER_CS_MSR)
EIP register set to the value of (SYSENTER_EIP_MSR)
SS register set to the sum of (8 plus the value in SYSENTER_CS_MSR)
ESP register set to the value of (SYSENTER_ESP_MSR)

Looks like processor is trying to help us. Let's look at SYSEXIT also very quickly:

The SYSEXIT instruction is part of the "Fast System Call" facility introduced on the Pentium® II processor. The SYSEXIT instruction is optimized to provide the maximum performance for transitions to protection ring 3 (CPL = 3) from protection ring 0 (CPL = 0). The SYSEXIT instruction sets the following registers according to values specified by the operating system in certain model-specific or general purpose registers.

CS register set to the sum of (16 plus the value in SYSENTER_CS_MSR)
EIP register set to the value contained in the EDX register
SS register set to the sum of (24 plus the value in SYSENTER_CS_MSR)
ESP register set to the value contained in the ECX register

SYSENTER_CS_MSR,SYSENTER_ESP_MSR, and SYSENTER_EIP_MSR are not really names of the registers. Intel just defines the address of these registers as:

SYSENTER_CS_MSR   174h
SYSENTER_ESP_MSR  175h
SYSENTER_EIP_MSR  176h

In linux these registers are named as:

/usr/src/linux/include/asm/msr.h:
    101 #define MSR_IA32_SYSENTER_CS            0x174
    102 #define MSR_IA32_SYSENTER_ESP           0x175
    103 #define MSR_IA32_SYSENTER_EIP           0x176

4.2. How does linux 2.6 uses these instructions?

Linux sets up these registers during initialization itself.
```
/usr/src/linux/arch/i386/kernel/sysenter.c:
     36         wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
     37         wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp1, 0);
     38         wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);
```
Please note that 'tss' refers to the Task State Segment (TSS) and tss->esp1 thus points to the kernel mode stack. [4] explains the use of TSS in linux as:
The x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:
- When an 80 x 86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS.
- When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.
So during initialization kernel sets up these registers such that after SYSENTER instruction, ESP is set to kernel mode stack and EIP is set to sysenter_entry.

Kernel also setups system call entry/exit points for user processes. Kernel creates a single page in the memory and attaches it to all processes' address space when they are loaded into memory. This page contains the actual implementation of the system call entry/exit mechanism. Definition of this page can be found in the file /usr/src/linux/arch/i386/kernel/vsyscall-sysenter.S. Kernel calls this page virtual dynamic shared object (vdso). Existence of this page can be confirmed by looking at cat /proc/`pid`/maps:

slax ~ # cat /proc/self/maps
08048000-0804c000 r-xp 00000000 07:00 13         /bin/cat
0804c000-0804d000 rwxp 00003000 07:00 13         /bin/cat
0804d000-0806e000 rwxp 0804d000 00:00 0          [heap]
b7ea0000-b7ea1000 rwxp b7ea0000 00:00 0
b7ea1000-b7fca000 r-xp 00000000 07:03 1840       /lib/tls/libc-2.3.6.so
b7fca000-b7fcb000 r-xp 00128000 07:03 1840       /lib/tls/libc-2.3.6.so
b7fcb000-b7fce000 rwxp 00129000 07:03 1840       /lib/tls/libc-2.3.6.so
b7fce000-b7fd1000 rwxp b7fce000 00:00 0
b7fe7000-b7ffd000 r-xp 00000000 07:03 1730       /lib/ld-2.3.6.so
b7ffd000-b7fff000 rwxp 00015000 07:03 1730       /lib/ld-2.3.6.so
bffe7000-bfffd000 rwxp bffe7000 00:00 0          [stack]
ffffe000-fffff000 ---p 00000000 00:00 0          [vdso]

For binaries using shared libraries, this page can be seen using ldd also:

slax ~ # ldd /bin/ls
        linux-gate.so.1 =>  (0xffffe000)
        librt.so.1 => /lib/tls/librt.so.1 (0xb7f5f000)
        ...

Observe linux-gate.so.1. This is no physical file. Content of this vdso can be seen as follows:

==> dd if=/proc/self/mem of=linux-gate.dso bs=4096 skip=1048574 count=1
1+0 records in
1+0 records out

==> objdump -d --start-address=0xffffe400 --stop-address=0xffffe414 linux-gate.dso
ffffe400 <__kernel_vsyscall>:
ffffe400:       51                      push   %ecx
ffffe401:       52                      push   %edx
ffffe402:       55                      push   %ebp
ffffe403:       89 e5                   mov    %esp,%ebp
ffffe405:       0f 34                   sysenter 
...
ffffe40d:       90                      nop    
ffffe40e:       eb f3                   jmp    ffffe403 <__kernel_vsyscall+0x3>
ffffe410:       5d                      pop    %ebp
ffffe411:       5a                      pop    %edx
ffffe412:       59                      pop    %ecx
ffffe413:       c3                      ret

In all listings, ... stands for omitted irrelevant code.

Initiation: Userland processes (or C library on their behalf) call __kernel_vsyscall to execute system calls. Address of __kernel_vsyscall is not fixed. Kernel passes this address to userland processes using AT_SYSINFO elf parameter. AT_ elf parameters, a.k.a. elf auxiliary vectors, are loaded on the process stack at the time of startup, alongwith the process arguments and the environment variables. Look at [1] for more information on Elf auxiliary vectors.

After moving to this address, registers %ecx, %edx and %ebp are saved on the user stack and %esp is copied to %ebp before executing sysenter. This %ebp later helps kernel in restoring userland stack back. After executing sysenter instruction, processor starts execution at sysenter_entry. sysenter_entry is defined in /usr/src/linux/arch/i386/kernel/entry.S as: (See my comments in [ ])

    179 ENTRY(sysenter_entry)
    180         movl TSS_sysenter_esp0(%esp),%esp
    181 sysenter_past_esp:
    182         sti
    183         pushl $(__USER_DS)
    184         pushl %ebp			[%ebp contains userland %esp]
    185         pushfl
    186         pushl $(__USER_CS)
    187         pushl $SYSENTER_RETURN		[%userland return addr]
    188
		....
    201         pushl %eax			
    202         SAVE_ALL			[pushes registers on to stack]
    203         GET_THREAD_INFO(%ebp)
    204
    205         /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
    206         testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT),
                                                                             TI_flags(%ebp)
    207         jnz syscall_trace_entry
    208         cmpl $(nr_syscalls), %eax
    209         jae syscall_badsys
    210         call *sys_call_table(,%eax,4)
    211         movl %eax,EAX(%esp)
		......

Inside sysenter_entry: between line 183 and 202, kernel is saving the current state by pushing register values on to the stack.
Observe that $SYSENTER_RETURN is the userland return address as defined inside /usr/src/linux/arch/i386/kernel/vsyscall-sysenter.S and %ebp contains userland ESP as %esp was copied to %ebp before calling sysenter.
After saving the state, kernel validates the system call number stored in %eax. Finally appropriate system call is called using instruction:
```
    210 call *sys_call_table(,%eax,4)
```
This is very much similar to old way.

After system call is complete, processor resumes execution at line 211. Looking further in sysenter_entry definition:

    210         call *sys_call_table(,%eax,4)
    211         movl %eax,EAX(%esp)
    212         cli
    213         movl TI_flags(%ebp), %ecx
    214         testw $_TIF_ALLWORK_MASK, %cx
    215         jne syscall_exit_work
    216 /* if something modifies registers it must also disable sysexit */
    217         movl EIP(%esp), %edx			(EIP is 0x28)
    218         movl OLDESP(%esp), %ecx			(OLD ESP is 0x34)
    219         xorl %ebp,%ebp
    220         sti
    221         sysexit

Copies value in %eax to stack. Userland ESP and return address (to-be EIP) are copied from kernel stack to %edx and %ecx respectively. Observe that the userland return address, $SYSENTER_RETURN was pushed on to stack in line 187. After that 0x28 bytes have been pushed on to the stack. That's why 0x28(%esp) points to $SYSENTER_RETURN.
After that SYSEXIT instruction is executed. As we know from previous section, sysexit copies value in %edx to EIP and value in %ecx to ESP. sysexit transfers processor back to ring 3 and processor resumes execution in userland.

5. Some Code

#include <stdio.h>

int pid;

int main() {
        __asm__(
                "movl $20, %eax    \n"
                "call *%gs:0x10    \n"   /* offset 0x10 is not fixed across the systems */
                "movl %eax, pid    \n"
        );
        printf("pid is %d\n", pid);
        return 0;
}

This does the getpid() system call (__NR_getpid is 20) using __kernel_vsyscall instead of int 0x80. Why %gs:0x10? Parsing process stack to find out AT_SYSINFO's value can be a cumbersome task. So, when libc.so (C library) is loaded, it copies the value of AT_SYSINFO from the process stack to the TCB (Thread Control Block). Segment register %gs refers to the TCB.

Please note that the offset 0x10 is not fixed across the systems. I found it out for my system using GDB. A system independent way to find out AT_SYSINFO is given in [1].

Note: This example is taken from http://www.win.tue.nl/~aeb/linux/lk/lk-4.html after little modification to make it work on my system.

6. References

Here are some references that helped me understand this.

About Elf auxiliary vectors By Manu Garg
What is linux-gate.so.1? By Johan Peterson
This Linux kernel: System Calls By Andries Brouwer
Understanding the Linux Kernel, By Daniel P. Bovet, Marco Cesati
Linux Kernel source code