Introduction to seccomp

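Linux has many security features that can be combined into an application sandbox to restrict what third-party apps may access; [1] gives a rough overview. Secure Computing, or seccomp for short, is one of them: it controls which system calls an application may make.
The seccomp sandbox has two main modes. SECCOMP_SET_MODE_STRICT permits only four system calls: read(2), write(2), _exit(2) and sigreturn(2). SECCOMP_SET_MODE_FILTER lets you specify a blacklist or whitelist of system calls through BPF. [1]
Reference [2] provides an excellent example project for getting started with seccomp: using a minimal application, it demonstrates step by step how seccomp filters system calls.
This article is based mainly on [2]; it extracts the core descriptions and puts them into a clear line of reasoning.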

The history of seccomp

seccomp first appeared in Linux kernel 2.6.12, in 2005. Linux 3.5, released in 2012, added "seccomp mode 2" (also called "seccomp filter mode").
It added a second mode for seccomp: SECCOMP_MODE_FILTER. Using that mode, processes can specify which system calls are permitted. By using a mini-program in the Berkeley packet filter (BPF) language, processes could restrict system calls entirely or only for certain argument values.
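So what we use today is this seccomp mode 2, also known as seccomp-bpf. Why is it called seccomp-bpf, and what exactly is BPF?
BPF stands for Berkeley Packet Filter.
It was first proposed in 1992 to provide a way of filtering packets while avoiding useless copies of packets from kernel space to user space. It initially consisted of simple bytecode injected from user space into the kernel, where a verifier checked it (to prevent kernel crashes and security problems) before it was attached to a socket. Its simplified language, together with the just-in-time (JIT) compiler inside the kernel, made BPF a high-performance tool.
Why are these two things tied together?
Because seccomp filters syscalls using the filter rules defined by BPF, in the form of a mini-program written in the BPF language that runs inside the kernel. A seccomp that filters syscalls the BPF way is therefore called seccomp-bpf.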

How to use seccomp

The seccomp filter system uses the Berkley Packet Filter system. Combined with argument checking and the many possible filter return values (kill, trap, trace, errno), this is allows for extensive logic.[2]
[2] and this article mainly cover the most basic case; for advanced seccomp usage, see the seccomp kernel documentation [4].

Detect seccomp

root@kube-master:~# cat /boot/config-`uname -r` | grep CONFIG_SECCOMP
CONFIG_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
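Alternatively, following "Detecting seccomp features at runtime" (https://outflux.net/teach-seccomp/autodetect.html), you can write a Linux program that uses the prctl system call to detect seccomp at runtime.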
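The runtime check mentioned above can be sketched roughly as follows. This is a minimal sketch rather than the code from the linked page: the function name seccomp_filter_available is made up for illustration, and it assumes the usual kernel behaviour that probing with a NULL filter pointer fails with EFAULT when filter mode is compiled in and with EINVAL when it is not.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: returns 1 if seccomp filter mode looks available,
 * 0 otherwise. It never installs a filter.
 */
int seccomp_filter_available(void)
{
    /* PR_GET_SECCOMP fails with EINVAL when the kernel lacks CONFIG_SECCOMP. */
    if (prctl(PR_GET_SECCOMP, 0, 0, 0, 0) < 0)
        return 0;

    /*
     * Probe with a NULL program. A kernel built with CONFIG_SECCOMP_FILTER
     * tries to read the program and fails with EFAULT; a kernel without it
     * rejects the mode itself with EINVAL.
     */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, NULL, 0, 0) < 0 &&
        errno == EFAULT)
        return 1;

    return 0;
}
```

A caller would typically run this once at startup and fall back to another sandboxing strategy when it returns 0.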

Basic seccomp filtering

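Basic syscall filtering is very simple. The rough flow is:
  1. Define the filter array.
  2. Define the prog parameter.
  3. Call prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog).
The last two steps are nearly always the same, so the key is the first step: defining the syscall filter list.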
struct sock_filter filter[] = {
    /* Validate architecture. */
    VALIDATE_ARCHITECTURE,
    /* Grab the system call number. */
    EXAMINE_SYSCALL,
    /* List allowed syscalls. */
    ALLOW_SYSCALL(rt_sigreturn),
    ...
    KILL_PROCESS,
};
struct sock_fprog prog = {
    .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
    .filter = filter,
};

if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
    perror("prctl(NO_NEW_PRIVS)");
    goto failed;
}
if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
    perror("prctl(SECCOMP)");
    goto failed;
}
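To see the fragment above in action without sandboxing the calling process, the following sketch installs a similar whitelist in a forked child and reports how the child ended. It is an illustration under assumptions, not the example project's code: the names MY_ARCH, ALLOW, install_filter and run_sandboxed_child are invented here, the whitelist (rt_sigreturn, exit_group, write) is an arbitrary choice, and only x86-64, i386 and AArch64 are handled.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#if defined(__x86_64__)
# define MY_ARCH AUDIT_ARCH_X86_64
#elif defined(__i386__)
# define MY_ARCH AUDIT_ARCH_I386
#elif defined(__aarch64__)
# define MY_ARCH AUDIT_ARCH_AARCH64
#else
# error "add the AUDIT_ARCH_* value for this platform"
#endif

/* Hypothetical helper: allow one syscall number, otherwise fall through. */
#define ALLOW(nr) \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

/* Install a whitelist of rt_sigreturn, exit_group and write. */
static int install_filter(void)
{
    struct sock_filter filter[] = {
        /* Kill on an unexpected architecture. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, MY_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* Load the syscall number and compare it against the whitelist. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        ALLOW(__NR_rt_sigreturn),
        ALLOW(__NR_exit_group),
        ALLOW(__NR_write),
        /* Everything else is killed. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/*
 * Fork a child, sandbox it, and let it attempt a syscall that is not on
 * the whitelist. Returns the signal that terminated the child, 0 if the
 * child exited normally, or -1 on error.
 */
int run_sandboxed_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_filter() != 0)
            _exit(2);               /* filter could not be installed */
        (void)syscall(SYS_getpid);  /* not whitelisted: the child dies here */
        _exit(0);                   /* not reached when the filter works */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

With a working filter the child is expected to be terminated by SIGSYS, so the parent can compare the return value against SIGSYS to confirm the sandbox took effect.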

BPF

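Note that VALIDATE_ARCHITECTURE, EXAMINE_SYSCALL, ALLOW_SYSCALL and KILL_PROCESS are not provided by the standard library. As the example project's seccomp-bpf.h (https://outflux.net/teach-seccomp/step-3/seccomp-bpf.h) shows, they are all custom macros: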
#define VALIDATE_ARCHITECTURE \
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, arch_nr), \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ARCH_NR, 1, 0), \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)

#define EXAMINE_SYSCALL \
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_nr)

#define ALLOW_SYSCALL(name) \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_##name, 0, 1), \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)

#define KILL_PROCESS \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)
The Linux source shows how BPF_STMT itself is defined (include/uapi/linux/filter.h):
#ifndef BPF_STMT
#define BPF_STMT(code, k) { (unsigned short)(code), 0, 0, k }
#endif
If one is "feeling masochistic", they could write BPF programs numerically, but there are some constants and macros available to make it easier. [3]
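To make the "numeric" form concrete, the sketch below builds two instructions with the kernel macros so that their fields can be compared with hand-written numbers. The wrapper names load_word and jump_if_equal are invented for illustration; the struct layout (code, jt, jf, k) and the macros come from <linux/filter.h>.

```c
#include <linux/filter.h>

/* "ld [offset]": load a 32-bit word at an absolute offset into the accumulator. */
struct sock_filter load_word(unsigned int offset)
{
    struct sock_filter insn = BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offset);
    return insn;
}

/*
 * "jeq #value, jt, jf": skip jt instructions when the accumulator equals
 * value, otherwise skip jf instructions.
 */
struct sock_filter jump_if_equal(unsigned int value,
                                 unsigned char jt, unsigned char jf)
{
    struct sock_filter insn = BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, value, jt, jf);
    return insn;
}
```

Written numerically, load_word(4) is the same as { 0x20, 0, 0, 4 } and jump_if_equal(1, 0, 1) is { 0x15, 0, 1, 1 }, which is why the macros are the more readable choice.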
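BPF provides an instruction set for implementing filters. You can write the raw instructions by hand (much like assembly), or use the macros that Linux provides. For a higher-level API, see libseccomp, which also ships a good number of samples. Android uses minijail (https://github.com/google/minijail).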
There are a number of tools and resources that can make it easier to work with seccomp filters and BPF. Libseccomp provides a higher-level API for creating filters. He noted that the project also has man pages (for example, seccomp_init()) with lots of examples. [3]
Besides the higher-level API, [3] also mentions some ways of compiling BPF instructions:
There is also a BPF compiler (bpfc) that is part of the netsniff-ng toolkit project. LLVM has a BPF backend as of its 3.7 release that compiles a subset of C to BPF, though he noted that there is little documentation as yet.
Finally, the kernel has a just-in-time (JIT) compiler that turns the BPF bytecode into native machine code, which can achieve 2-3x performance (or even better in some cases). The JIT compiler is disabled by default, but it can be enabled by writing a "1" to: /proc/sys/net/core/bpf_jit_enable
Kerrisk's slides have a wealth of information, including additional resources for more information.

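A brief look at advanced seccomp usage

For advanced usage the seccomp kernel documentation [4] remains the reference; this section only lists a few items and pointers.
In the example code from [2], a call that does not match the filter is killed outright (SECCOMP_RET_KILL, an alias of SECCOMP_RET_KILL_THREAD). There are in fact many other return values, listed here in decreasing order of precedence: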
  • SECCOMP_RET_KILL_PROCESS
  • SECCOMP_RET_KILL_THREAD
  • SECCOMP_RET_TRAP
  • SECCOMP_RET_ERRNO
  • SECCOMP_RET_USER_NOTIF
  • SECCOMP_RET_TRACE
  • SECCOMP_RET_LOG
  • SECCOMP_RET_ALLOW
If multiple filters exist, the return value for the evaluation of a given system call will always use the highest precedent value.
Precedence is only determined using the SECCOMP_RET_ACTION mask. When multiple filters return values of the same precedence, only the SECCOMP_RET_DATA from the most recently installed filter will be returned.
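The precedence rule quoted above can be modelled in user space as follows. This is only a sketch: the function names are invented, and the signed comparison of the action bits reflects my reading of how the kernel orders actions, so treat it as an assumption rather than the kernel's actual code. The input array is ordered from the most recently installed filter to the oldest.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Action bits as a signed value, so that SECCOMP_RET_KILL_PROCESS sorts lowest. */
static int32_t action_only(uint32_t ret)
{
    return (int32_t)(ret & SECCOMP_RET_ACTION_FULL);
}

/*
 * Combine the return values of n stacked filters, results[0] being the
 * most recently installed one. The lowest action value has the highest
 * precedence; on a tie the earlier (more recent) entry and its
 * SECCOMP_RET_DATA are kept.
 */
uint32_t combine_filter_results(const uint32_t *results, size_t n)
{
    uint32_t ret = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++) {
        if (action_only(results[i]) < action_only(ret))
            ret = results[i];
    }
    return ret;
}
```

For instance, if the newest filter allows a call and two older ones return SECCOMP_RET_ERRNO with data 1 and 13 respectively, the combined result is SECCOMP_RET_ERRNO with data 1, because ERRNO outranks ALLOW and the more recent of the two ERRNO filters supplies the data.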

Pitfalls

Quoted from [5]; I am only passing it along.
The biggest pitfall to avoid during use is filtering on system call number without checking the architecture value. Why? On any architecture that supports multiple system call invocation conventions, the system call numbers may vary based on the specific invocation. If the numbers in the different calling conventions overlap, then checks in the filters may be abused. Always check the arch value!
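In other words, seccomp users must remember to validate the architecture before filtering on syscall numbers, because the numbers can differ between architectures and between calling conventions on the same machine (for instance 32-bit and 64-bit calls on x86-64). For example, the example in [2] contains: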
struct sock_filter filter[] = {
    /* Validate architecture. */
    VALIDATE_ARCHITECTURE,
    ...
};

// in seccomp-bpf.h
#define syscall_nr (offsetof(struct seccomp_data, nr))
#define arch_nr (offsetof(struct seccomp_data, arch))

#if defined(__i386__)
# define REG_SYSCALL    REG_EAX
# define ARCH_NR    AUDIT_ARCH_I386
#elif defined(__x86_64__)
# define REG_SYSCALL    REG_RAX
# define ARCH_NR    AUDIT_ARCH_X86_64
#else
# warning "Platform does not support seccomp filter yet"
# define REG_SYSCALL    0
# define ARCH_NR    0
#endif

#define VALIDATE_ARCHITECTURE \
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, arch_nr), \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ARCH_NR, 1, 0), \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)
What is seccomp_data?
/**
 * struct seccomp_data - the format the BPF program executes over.
 * @nr: the system call number
 * @arch: indicates system call convention as an AUDIT_ARCH_* value
 *        as defined in <linux/audit.h>.
 * @instruction_pointer: at the time of the system call.
 * @args: up to 6 system call arguments always stored as 64-bit values
 *        regardless of the architecture.
 */
struct seccomp_data {
    int nr;
    __u32 arch;
    __u64 instruction_pointer;
    __u64 args[6];
};
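My understanding: the application (example.c) loads the sock_filter program into the kernel through prctl. From then on, whenever the kernel receives a syscall, it builds a seccomp_data structure, runs the BPF code over it, and uses the result as the filtering decision.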
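Since the BPF program addresses seccomp_data by byte offset, it can help to spell those offsets out. The sketch below uses invented helper names (arg_lo_offset, arg_hi_offset) and assumes a little-endian machine. Classic BPF loads 32-bit words, so a 64-bit argument has to be checked as two halves.

```c
#include <stddef.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: byte offset of the low 32 bits of syscall
 * argument n inside struct seccomp_data (little-endian assumed).
 */
unsigned int arg_lo_offset(unsigned int n)
{
    return (unsigned int)(offsetof(struct seccomp_data, args) +
                          n * sizeof(__u64));
}

/* Hypothetical helper: byte offset of the high 32 bits of argument n. */
unsigned int arg_hi_offset(unsigned int n)
{
    return arg_lo_offset(n) + (unsigned int)sizeof(__u32);
}
```

With the struct shown above, nr sits at offset 0, arch at 4, instruction_pointer at 8 and args[0] at 16, so a filter that wants to inspect the first argument would load the words at offsets 16 and 20.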

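Summary

seccomp is a kernel mechanism for filtering syscalls. It builds on BPF: you load BPF filter code into the kernel, and that code decides what is allowed. The BPF rule language was originally designed for filtering network packets, where the scenarios are more complex. For syscalls the patterns are fairly fixed, so you can write the filter by hand or build it with the API provided by libseccomp.
Because a BPF filter is inherited by child processes across fork/clone and is preserved across execve, controlling the syscalls of a third-party program only requires installing a suitable sock_filter program after fork/clone and before execve.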
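The inheritance described above can be observed with a small experiment. In this sketch a child installs a filter and then forks a grandchild, which checks that the filter still applies to it. All names (MY_ARCH, install_deny_getppid, wait_exit_code, filter_survives_fork) are invented for illustration, and the choice of getppid as the denied syscall is arbitrary. A real launcher would call execve where the grandchild runs its check.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#if defined(__x86_64__)
# define MY_ARCH AUDIT_ARCH_X86_64
#elif defined(__i386__)
# define MY_ARCH AUDIT_ARCH_I386
#elif defined(__aarch64__)
# define MY_ARCH AUDIT_ARCH_AARCH64
#else
# error "add the AUDIT_ARCH_* value for this platform"
#endif

/* Make getppid fail with EPERM and allow every other syscall. */
static int install_deny_getppid(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, MY_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Exit code of a child that exited normally, or -1 otherwise. */
static int wait_exit_code(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/*
 * Returns 1 if a process forked after the filter was installed is still
 * subject to it, 0 if it is not, and -1 on error.
 */
int filter_survives_fork(void)
{
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        if (install_deny_getppid() != 0)
            _exit(2);
        pid_t grandchild = fork();  /* the new process inherits the filter */
        if (grandchild < 0)
            _exit(2);
        if (grandchild == 0) {
            errno = 0;
            long r = syscall(SYS_getppid);
            _exit(r == -1 && errno == EPERM ? 0 : 1);
        }
        int code = wait_exit_code(grandchild);
        _exit(code < 0 ? 2 : code);
    }

    int code = wait_exit_code(child);
    return code == 0 ? 1 : (code == 1 ? 0 : -1);
}
```

Using SECCOMP_RET_ERRNO instead of a kill action keeps the experiment observable: the grandchild survives and can report what it saw through its exit code.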

References

1. linux中的容器与沙箱初探 (A first look at containers and sandboxes in Linux). http://atum.li/2017/04/25/linuxsandbox/#linux%E4%B8%AD%E7%9A%84%E6%B2%99%E7%AE%B1%E6%8A%80%E6%9C%AF
2. Using simple seccomp filters. https://outflux.net/teach-seccomp/
3. A seccomp overview. https://lwn.net/Articles/656307/
4. seccomp kernel documentation. https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
5. kernel doc - Seccomp BPF (SECure COMPuting with filters). https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
6. Android中的seccomp (seccomp in Android). https://www.jianshu.com/p/62ede45cfb2e