Linux Kernel学习笔记

Loadable Kernel Modules(LKMs)

可加载核心模块 (或直接称为内核模块) 就像运行在内核空间的可执行程序，包括:

驱动程序（Device drivers）
- 设备驱动
- 文件系统驱动
- …
内核扩展模块 (modules)
LKMs 的文件格式和用户态的可执行程序相同，Linux 下为 ELF，Windows 下为 exe/dll，mac 下为 MACH-O，因此我们可以用 IDA 等工具来分析内核模块。
相关指令
insmod: 讲指定模块加载到内核中
rmmod: 从内核中卸载指定模块
lsmod: 列出已经加载的模块
modprobe: 添加或删除模块，modprobe 在加载模块时会查找依赖关系

ioctl

int ioctl(int fd, unsigned long request, …) 的第一个参数为打开设备 (open) 返回的文件描述符，第二个参数为用户程序对设备的控制命令，再后边的参数则是一些补充参数，与设备有关。

使用 ioctl 进行通信的原因：
操作系统提供了内核访问标准外部设备的系统调用，因为大多数硬件设备只能够在内核空间内直接寻址, 但是当访问非标准硬件设备这些系统调用显得不合适, 有时候用户模式可能需要直接访问设备。
比如，一个系统管理员可能要修改网卡的配置。现代操作系统提供了各种各样设备的支持，有一些设备可能没有被内核设计者考虑到，如此一来提供一个这样的系统调用来使用设备就变得不可能了。
为了解决这个问题，内核被设计成可扩展的，可以加入一个称为设备驱动的模块，驱动的代码允许在内核空间运行而且可以对设备直接寻址。一个 Ioctl 接口是一个独立的系统调用，通过它用户空间可以跟设备驱动沟通。对设备驱动的请求是一个以设备和请求号码为参数的 Ioctl 调用，如此内核就允许用户空间访问设备驱动进而访问设备而不需要了解具体的设备细节，同时也不需要一大堆针对不同设备的系统调用。

状态切换

user space to kernel space

当发生系统调用，产生异常，外设产生中断等事件时，会发生用户态到内核态的切换，具体的过程为：

1. 通过 swapgs 切换 GS 段寄存器，将 GS 寄存器值和一个特定位置的值进行交换，目的是保存 GS 值，同时将该位置的值作为内核执行时的 GS 值使用。
1. 将当前栈顶（用户空间栈顶）记录在 CPU 独占变量区域里，将 CPU 独占区域里记录的内核栈顶放入 rsp/esp。(这里和SROP的Signale Frame有些关系, 即sigreturn系统调用改不了rsp)

通过 push 保存各寄存器值，具体的代码如下:

ENTRY(entry_SYSCALL_64)
 /* SWAPGS_UNSAFE_STACK是一个宏，x86直接定义为swapgs指令 */
 SWAPGS_UNSAFE_STACK

 /* 保存栈值，并设置内核栈 */
 movq %rsp, PER_CPU_VAR(rsp_scratch)
 movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp


/* 通过push保存寄存器值，形成一个pt_regs结构 */
/* Construct struct pt_regs on stack */
pushq  $__USER_DS      /* pt_regs->ss */
pushq  PER_CPU_VAR(rsp_scratch)  /* pt_regs->sp */
pushq  %r11             /* pt_regs->flags */
pushq  $__USER_CS      /* pt_regs->cs */
pushq  %rcx             /* pt_regs->ip */
pushq  %rax             /* pt_regs->orig_ax */
pushq  %rdi             /* pt_regs->di */
pushq  %rsi             /* pt_regs->si */
pushq  %rdx             /* pt_regs->dx */
pushq  %rcx tuichu    /* pt_regs->cx */
pushq  $-ENOSYS        /* pt_regs->ax */
pushq  %r8              /* pt_regs->r8 */
pushq  %r9              /* pt_regs->r9 */
pushq  %r10             /* pt_regs->r10 */
pushq  %r11             /* pt_regs->r11 */
sub $(6*8), %rsp      /* pt_regs->bp, bx, r12-15 not saved */

1. 通过汇编指令判断是否为 x32_abi。
1. 通过系统调用号，跳到全局变量 sys_call_table 相应位置继续执行系统调用。
  kernel space to user space
1. 通过swapgs恢复GS值.
1. 通过 sysretq 或者 iretq 恢复到用户控件继续执行。如果使用 iretq 还需要给出用户空间的一些信息(CS, eflags/rflags, esp/rsp 等)

struct cred

之前提到 kernel 记录了进程的权限，更具体的，是用 cred 结构体记录的，每个进程中都有一个 cred 结构，这个结构保存了该进程的权限等信息（uid，gid 等），如果能修改某个进程的 cred，那么也就修改了这个进程的权限。

struct cred {
    atomic_t    usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
    atomic_t    subscribers;    /* number of processes subscribed */
    void        *put_addr;
    unsigned    magic;
#define CRED_MAGIC  0x43736564
#define CRED_MAGIC_DEAD 0x44656144
#endif
    kuid_t      uid;        /* real UID of the task */
    kgid_t      gid;        /* real GID of the task */
    kuid_t      suid;       /* saved UID of the task */
    kgid_t      sgid;       /* saved GID of the task */
    kuid_t      euid;       /* effective UID of the task */
    kgid_t      egid;       /* effective GID of the task */
    kuid_t      fsuid;      /* UID for VFS ops */
    kgid_t      fsgid;      /* GID for VFS ops */
    unsigned    securebits; /* SUID-less security management */
    kernel_cap_t    cap_inheritable; /* caps our children can inherit */
    kernel_cap_t    cap_permitted;  /* caps we're permitted */
    kernel_cap_t    cap_effective;  /* caps we can actually use */
    kernel_cap_t    cap_bset;   /* capability bounding set */
    kernel_cap_t    cap_ambient;    /* Ambient capability set */
#ifdef CONFIG_KEYS
    unsigned char   jit_keyring;    /* default keyring to attach requested
                     * keys to */
    struct key __rcu *session_keyring; /* keyring inherited over fork */
    struct key  *process_keyring; /* keyring private to this process */
    struct key  *thread_keyring; /* keyring private to this thread */
    struct key  *request_key_auth; /* assumed request_key authority */
#endif
#ifdef CONFIG_SECURITY
    void        *security;  /* subjective LSM security */
#endif
    struct user_struct *user;   /* real user ID subscription */
    struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
    struct group_info *group_info;  /* supplementary groups for euid/fsgid */
    struct rcu_head rcu;        /* RCU deletion hook */
} __randomize_layout;

内核态函数

printf() -> printk(), printk()不一定会把内容显示到终端上, 但是会在Kernel缓冲区中，可以通过dmesg查看.
memcpy() -> copy_from_user()/copy_to_user()
malloc() -> kmalloc() 使用的是slab/slub分配器.
free() -> kfree()
管理权限的函数:
int commit_creds(struct cred *new)
struct cred prepare_kernel_cred(struct task_struct daemon)
从函数名也可以看出，执行 commit_creds(prepare_kernel_cred(0)) 即可获得 root 权限，0 表示以 0 号进程作为参考准备新的 credentials.
这样我们就找到了常用的提权手段.
执行 commit_creds(prepare_kernel_cred(0)) 也是最常用的提权手段，两个函数的地址都可以在 /proc/kallsyms 中查看（较老的内核版本中是 /proc/ksyms.
Mitigation
这个暂时不知道干啥用的
smep: Supervisor Mode Execution Protection，当处理器处于 ring0 模式，执行用户空间的代码会触发页错误。（在 arm 中该保护称为 PXN)
smap: Superivisor Mode Access Protection，类似于 smep，通常是在访问数据时。

Loadable Kernel Modules(LKMs)

相关指令

ioctl

状态切换

user space to kernel space

kernel space to user space

struct cred

内核态函数

Mitigation