Introduction

Problem description:

The initial observation was, that the linux vm86 syscall, which allows to use the virtual-8086 mode from userspace for emulating of old 8086 software as done with dosemu, was prone to trigger FPU errors. Closer analysis showed, that in general, the handling of the FPU control register and unhandled FPU-exception could trigger CPU-exceptions at unexpected locations, also in ring-0 code. Key player is the emms instruction, which will fault when e.g. cr0 has bits set due to unhandled errors. This only affects kernels on some processor architectures, currently only AMD K7/K8 seems to be relevant.

Methods

Virtual86SwitchToEmmsFault.c was the first POC, that triggers kernel-panic via vm86 syscall. Depending on task layout and kernel scheduler timing, the program might just cause an OOPS without heavy side-effects on the system. OOPS might happen up to 1min after invocation, depending on the scheduler operation and which of the other tasks are using the FPU. Sometimes it causes recursive page faults, thus locking up the entire machine.

To allow reproducible tests on at least a local machine, the random code execution test tool (Virtual86RandomCode.c) might be useful. It still uses the vm86-syscall, but executes random code, thus causing the FPU and task schedule to trigger a multitude of faults and to faster lock-up the system. When executed via network, executed random data can be recorded and replayed even when target machine locks up completely. Network test:

socat TCP4-LISTEN:1234,reuseaddr=1,fork=1 EXEC:./Virtual86RandomCode,nofork=1 tee TestInput < /dev/urandom | socat - TCP4:x.x.x.x:1234 > ProcessedBlocks

An improved version allows to bring the FPU into the same state without using the vm86-syscall. The key instruction is fldcw (floating point unit load control word). When enabling exceptions in one process just before exit, the task switch of two other processes later on might fail. It seems that due to that failure, the task->nsproxy ends up being NULL, thus causing NULL-pointer dereference in exit_shm during do_exit.
When the NULL-page is mapped, the NULL-dereference could be used to fake a rw-semaphore data structure. In exit_shm, the kernel attemts to down_write the semaphore, which adds the value 0xffff0001 at a user-controllable location. Since the NULL-dereference does not allow arbitrary reads, the task memory layout is unknown, thus standard change of EUID of running task is not possible. Apart from that, we are in do_exit, so we would have to change another task. A suitable target is the shmem_xattr_handlers list, which is at an address known from System.map. Usually it contains two valid handlers and a NULL value to terminate the list. As we are lucky, the value after NULL is 1, thus adding 0xffff0001 to the position of the NULL-value plus 2 will will turn the NULL into 0x10000 (the first address above mmap_min_addr) and the following 1 value into NULL, thus terminating the handler list correctly again.
The code to perform those steps can be found in FpuStateTaskSwitchShmemXattrHandlersOverwriteWithNullPage.c

The modification of the shmem_xattr_handlers list is completely silent (could be a nice data-only backdoor) until someone performs a getxattr call on a mounted tempfs. Since such a file-system is mounted by default at /run/shm, another program can turn this into arbitrary ring-0 code execution. To avoid searching the process list to give EUID=0, an alternative approach was tested. When invoking the xattr-handlers, a single integer value write to another static address known from System.map (modprobe_path) will change the default modprobe userspace helper pathname from /sbin/modprobe to /tmp//modprobe. When unknown executable formats or network protocols are requested, the program /tmp//modprobe is executed as root, this demo just adds a script to turn /bin/dd into a SUID-binary. dd could then be used to modify libc to plant another backdoor there. The code to perform those steps can be found in ManipulatedXattrHandlerForPrivEscalation.c.

Results, Discussion

A closer analysis of the initial vm86-syscall problem showed, that root cause was missing handling of FPU exceptions during task switch at emms instruction. That was confirmed by Borislav Petkov. According to discussion on LKML, the problem should affect only AMD CPUs, both in i386 and amd64-mode, see also patch.

Exploitation of the bug allowed to kill tasks at random on all affected kernels under test, thus leading to a denial of service when killing e.g. init or the idle task. At least on i386 architectures without SMP (Debian Sid i486 kernel), the bug also triggers a NULL-pointer dereference, which was proven useful to gain root privileges when NULL-page was mapped, see POC.

Since only some architectures are affected and local exploitation usually only results in DOS (mmap_min_addr set to zero should be rare on most systems, so escalation should be very uncommon) this vulnerability should not have a high impact on total system security.

Timeline

Material, References

Test tools and output (sorted chronologically):

Mailing-list reports:

Bug reports:

Others:

Last modified 20171228
Contact e-mail: me (%) halfdog.net