Commit 7b1ea147 authored by Cyrill Gorcunov

Drop kernel/ directory

The kernel patches are carried in a separate
repo anyway, so a second place for patches
might be confusing.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
parent 681ef94f
crtools internals
=================
What CRtools is
---------------
In short -- crtools is a utility to checkpoint/restore (CR) processes. Unlike CR
implemented completely in kernel space, it tries to achieve the same goal by operating
in user space.
Since this tool (and the overall concept) is under heavy development, there are
some known limitations
- Only pure x86-64 environment is supported, no IA32 emulation.
- There is no way to use the cgroups freezer facility.
- No network or IPC CR is supported.
At the moment CR of the following resources is supported
- Process tree
- Files (with some limitations)
- Pipes
- Memory
Basic design
------------
Checkpoint
~~~~~~~~~~
The checkpoint procedure relies on the /proc file system (it is the general place
from which crtools takes all the information it needs), which includes
- File descriptors (via /proc/$pid/fd and /proc/$pid/fdinfo).
- Pipes parameters.
- Memory maps (via /proc/$pid/maps).
Process dumper (let's call it "dumper") performs the following steps during
the checkpoint stage
- A $pid of a process group leader is obtained from the command line.
- By using this $pid the dumper walks through /proc/$pid/status and gathers
  children $pid's recursively. At the end we have a complete process tree.
- Then it takes every $pid from the process tree, sends SIGSTOP to every process
  found and performs the following steps on each $pid
  - Collects VMA areas by parsing /proc/$pid/maps.
  - Seizes the task via the relatively new ptrace interface. Seizing a task means
    putting it into a special state in which the task has no idea that it is being
    operated on by ptrace.
  - Core parameters of the task (such as registers and friends) are dumped via
    the ptrace interface and by parsing the /proc/$pid/stat entry.
  - The dumper injects parasite code into the task via the ptrace interface. This allows
    us to dump pages of the task right from within the task's address space.
    The injection procedure is pretty simple
    - The dumper scans executable VMA areas of the task (which were previously collected)
      and tests if there is a place for a few instructions.
    - Then (by ptrace as well) it substitutes the original code with new instructions
      and creates a new VMA area inside the process address space.
    - Finally the parasite code gets copied into the new VMA, and the original code that
      was modified during the parasite bootstrap procedure is restored.
  - Then the dumper flushes the contents of the task's pages to a file and drops
    the parasite code block completely, since it is not needed anymore.
  - Once the parasite code is removed the task gets unseized via a ptrace call but
    remains stopped.
  - The dumper writes out parameters of opened files and pipes (flushing data to disk
    if needed).
- SIGCONT is sent to every task in the process tree (to continue execution).
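The VMA collection step above can be sketched in C as follows. This is only an illustration, not the crtools implementation: the `vma_entry` structure and both helper names are hypothetical, and only the first four columns of a /proc/$pid/maps line (address range, permissions, offset) are parsed.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one line of /proc/$pid/maps. */
struct vma_entry {
	unsigned long start;	/* first address of the area */
	unsigned long end;	/* first address past the area */
	char perms[5];		/* e.g. "r-xp" plus terminating NUL */
	unsigned long pgoff;	/* offset into the mapped file */
};

/*
 * Parse one maps line such as
 *   00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat
 * Returns 0 on success, -1 if the line does not look like a maps entry.
 */
static int parse_maps_line(const char *line, struct vma_entry *vma)
{
	int n = sscanf(line, "%lx-%lx %4s %lx",
		       &vma->start, &vma->end, vma->perms, &vma->pgoff);

	if (n != 4 || strlen(vma->perms) != 4 || vma->end <= vma->start)
		return -1;
	return 0;
}

/* Executable areas are the candidates for the parasite bootstrap code. */
static int vma_is_executable(const struct vma_entry *vma)
{
	return vma->perms[2] == 'x';
}
```

A dumper would call `parse_maps_line()` on every line read from /proc/$pid/maps and keep the entries for the later injection and page dumping steps.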
Restore
~~~~~~~
The restore procedure (aka restorer) proceeds through the following steps
- The process tree is read from a file.
- To restore the process tree the restorer executes the clone(CLONE_CHILD_USEPID)
  syscall, which creates a process with the $pid specified. Note that if for some reason
  a process with the same $pid is already up and running, the restoration
  procedure will refuse to proceed.
- Files and pipes are restored (i.e. opened with the file descriptors they had at
  checkpoint time and positioned exactly as they were before). If a pipe
  had some data buffered before checkpoint, the data is sent back to the pipe.
- Restoration of virtual memory (and memory pages) is a bit tricky and is implemented
  by the following steps
  - The restorer analyzes the current VMA map by parsing the /proc/$pid/maps file.
  - Since we are to create a completely new memory map, the restorer enumerates
    all VMA entries and figures out where there is a place (or hole) between VMAs
    big enough to hold all the code and parameters needed for the
    rest of the restore procedure.
  - Once such an area is found the restorer copies its own code and data to the new place.
  - Then the restorer passes execution there, where the relocated code in turn
    - Unmaps the currently active VMAs and maps the areas the process had at
      checkpoint time.
    - Reads page contents back into the newly mapped memory.
    - Prepares an rt-sigreturn frame on the stack and issues the __NR_rt_sigreturn
      syscall, so as a result the process resumes execution from the
      IP it had at checkpoint time.
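The search for a hole between VMAs described above could look roughly like the first-fit scan below. It is a sketch under assumptions: the `vma_range` type and `find_hole()` are hypothetical names, the VMA list is taken to be sorted by start address, and only a single list is scanned.

```c
#include <stddef.h>

/* Hypothetical half-open address range [start, end) of one VMA. */
struct vma_range {
	unsigned long start;
	unsigned long end;
};

/*
 * First-fit search for a gap of at least `len` bytes inside [lo, hi)
 * that no VMA occupies. `vmas` must be sorted by start address.
 * Returns the start of the hole, or 0 if none is found (so `lo` is
 * expected to be non-zero).
 */
static unsigned long find_hole(const struct vma_range *vmas, size_t nr,
			       unsigned long lo, unsigned long hi,
			       unsigned long len)
{
	unsigned long prev_end = lo;
	size_t i;

	for (i = 0; i < nr && prev_end < hi; i++) {
		/* The gap ends where the next VMA begins, clamped to hi. */
		unsigned long gap_end = vmas[i].start < hi ? vmas[i].start : hi;

		if (gap_end > prev_end && gap_end - prev_end >= len)
			return prev_end;
		if (vmas[i].end > prev_end)
			prev_end = vmas[i].end;
	}

	/* Tail gap between the last VMA and the upper bound. */
	if (hi > prev_end && hi - prev_end >= len)
		return prev_end;
	return 0;
}
```

The returned address is where the restorer code and its parameters could be copied before the old mappings are torn down.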
Kernel area
-----------
While CR is implemented in user space, some help from the Linux kernel
is still needed, so the following patches are required
- New directory /proc/$pid/map_files, which allows the CR to find and restore
  anonymous shared memory areas.
- Explicit "Children:" line added to the /proc/$pid/status file. This simplifies the code
  significantly (the kernel already has this information, it is simply not yet
  exported).
- An ability to call clone() with a specified $pid.
- start_data, end_data and a few more members of mm_struct.
  - Export added to /proc/$pid/stat.
  - Import implemented via new prctl codes.
- An ability to map the vDSO at a predefined address (implemented via
  a new prctl code as well).
@@ -17,28 +17,16 @@ Licensed under GPLv2 (http://www.gnu.org/licenses/gpl-2.0.txt)
Kernel patching
===============
-To have crtools up and running either
-1) use patches from kernel/ directory
-2) or clone git://github.com/cyrillos/linux-2.6.git
-and switch to branch "crtools". Note these patches
-are guaranteed to be up to date only with major
-release of crtool. If you're testing development
-version -- make sure you're applying series from
-kernel/ directory.
-It's based on Linux
-| commit 1ea6b8f48918282bdca0b32a34095504ee65bab5
-| Author: Linus Torvalds <torvalds@linux-foundation.org>
-| Date: Mon Nov 7 16:16:02 2011 -0800
-|
-| Linux 3.2-rc1
-The following patches are already in -mm tree
-fs-proc-Make-proc_get_link-to-use-dentry
-fs-proc-Introduce-the-proc-pid-map_files-directory
-procfs-introduce-the-proc-pid-map_files-directory-checkpatch
-sysfs-add-kernel.ns_last_pid
+To have crtools up and running clone
+git://github.com/cyrillos/linux-2.6.git
+and switch to branch "crtools".
+It's based on Linux
+commit 384703b8e6cd4c8ef08512e596024e028c91c339
+Author: Linus Torvalds <torvalds@linux-foundation.org>
+Date: Fri Dec 16 18:36:26 2011 -0800
+Linux 3.2-rc6
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: c/r: introduce CHECKPOINT_RESTORE symbol
For checkpoint/restore we need auxiliary features compiled into the
kernel, such as additional prctl codes, /proc/<pid>/map_files, etc.
At the same time these features are not mandatory for a regular kernel, so
the CHECKPOINT_RESTORE config symbol provides a way to disable them all at
once if one wishes to get rid of the additional functionality.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
init/Kconfig | 11 +++++++++++
1 file changed, 11 insertions(+)
Index: linux-2.6.git/init/Kconfig
===================================================================
--- linux-2.6.git.orig/init/Kconfig
+++ linux-2.6.git/init/Kconfig
@@ -773,6 +773,17 @@ config DEBUG_BLK_CGROUP
endif # CGROUPS
+config CHECKPOINT_RESTORE
+ bool "Checkpoint/restore support" if EXPERT
+ default n
+ help
+ Enables additional kernel features in a sake of checkpoint/restore.
+ In particular it adds auxiliary prctl codes to setup process text,
+ data and heap segment sizes, and a few additional /proc filesystem
+ entries.
+
+ If unsure, say N here.
+
menuconfig NAMESPACES
bool "Namespaces support" if EXPERT
default !EXPERT
From: Andrew Morton <akpm@linux-foundation.org>
Subject: c-r-prctl-add-pr_set_mm-codes-to-set-up-mm_struct-entries-fix
cache current->mm in a local, saving 200 bytes text
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
kernel/sys.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)
Index: linux-2.6.git/kernel/sys.c
===================================================================
--- linux-2.6.git.orig/kernel/sys.c
+++ linux-2.6.git/kernel/sys.c
@@ -1701,6 +1701,7 @@ static int prctl_set_mm(int opt, unsigne
unsigned long vm_bad_flags;
struct vm_area_struct *vma;
int error = 0;
+ struct mm_struct *mm = current->mm;
if (arg4 | arg5)
return -EINVAL;
@@ -1711,8 +1712,8 @@ static int prctl_set_mm(int opt, unsigne
if (addr >= TASK_SIZE)
return -EINVAL;
- down_read(&current->mm->mmap_sem);
- vma = find_vma(current->mm, addr);
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, addr);
if (opt != PR_SET_MM_START_BRK && opt != PR_SET_MM_BRK) {
/* It must be existing VMA */
@@ -1732,9 +1733,9 @@ static int prctl_set_mm(int opt, unsigne
goto out;
if (opt == PR_SET_MM_START_CODE)
- current->mm->start_code = addr;
+ mm->start_code = addr;
else
- current->mm->end_code = addr;
+ mm->end_code = addr;
break;
case PR_SET_MM_START_DATA:
@@ -1747,9 +1748,9 @@ static int prctl_set_mm(int opt, unsigne
goto out;
if (opt == PR_SET_MM_START_DATA)
- current->mm->start_data = addr;
+ mm->start_data = addr;
else
- current->mm->end_data = addr;
+ mm->end_data = addr;
break;
case PR_SET_MM_START_STACK:
@@ -1762,31 +1763,31 @@ static int prctl_set_mm(int opt, unsigne
if ((vma->vm_flags & vm_req_flags) != vm_req_flags)
goto out;
- current->mm->start_stack = addr;
+ mm->start_stack = addr;
break;
case PR_SET_MM_START_BRK:
- if (addr <= current->mm->end_data)
+ if (addr <= mm->end_data)
goto out;
if (rlim < RLIM_INFINITY &&
- (current->mm->brk - addr) +
- (current->mm->end_data - current->mm->start_data) > rlim)
+ (mm->brk - addr) +
+ (mm->end_data - mm->start_data) > rlim)
goto out;
- current->mm->start_brk = addr;
+ mm->start_brk = addr;
break;
case PR_SET_MM_BRK:
- if (addr <= current->mm->end_data)
+ if (addr <= mm->end_data)
goto out;
if (rlim < RLIM_INFINITY &&
- (addr - current->mm->start_brk) +
- (current->mm->end_data - current->mm->start_data) > rlim)
+ (addr - mm->start_brk) +
+ (mm->end_data - mm->start_data) > rlim)
goto out;
- current->mm->brk = addr;
+ mm->brk = addr;
break;
default:
@@ -1797,7 +1798,7 @@ static int prctl_set_mm(int opt, unsigne
error = 0;
out:
- up_read(&current->mm->mmap_sem);
+ up_read(&mm->mmap_sem);
return error;
}
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: c/r: prctl: add PR_SET_MM codes to set up mm_struct entries
When we restore a task we need to set up text, data and heap sizes
from userspace to the values a task had at checkpoint time. This patch
adds auxiliary prctl codes for that.
While most of them have a statistical nature (their values are involved
in the calculation of /proc/<pid>/statm output), the start_brk and brk values
are used to compute the allowed size of program data segment expansion,
which means arbitrary changes to these values might be a dangerous
operation. So to restrict access the following requirements are applied to
prctl calls:
- The process has to have CAP_SYS_ADMIN capability granted.
- For all opcodes except start_brk/brk members an appropriate
VMA area must exist and should fit certain VMA flags,
such as:
- code segment must be executable but not writable;
- data segment must not be executable.
start_brk/brk values must not intersect with data segment and must not
exceed RLIMIT_DATA resource limit.
Still the main guard is CAP_SYS_ADMIN capability check.
Note the kernel should be compiled with CONFIG_CHECKPOINT_RESTORE support
otherwise these prctl calls will return -EINVAL.
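From userspace, the new codes could be used roughly as in the sketch below. The helper name `restore_brk()` is hypothetical, and the fallback constant values are copied from the include/linux/prctl.h hunk of this patch for the case where the system headers do not define them. The patch checks CAP_SYS_ADMIN; later mainline kernels document a different capability for PR_SET_MM in prctl(2), so the exact privilege needed depends on the kernel in use.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks taken from this patch, for headers that predate PR_SET_MM. */
#ifndef PR_SET_MM
# define PR_SET_MM		35
# define PR_SET_MM_START_BRK	6
# define PR_SET_MM_BRK		7
#endif

/*
 * Hypothetical helper: set the heap boundaries of the calling process
 * to the values recorded at checkpoint time.
 * Returns 0 on success or a negative errno value on failure.
 */
static int restore_brk(unsigned long start_brk, unsigned long brk)
{
	/* arg4 and arg5 must be zero, otherwise the kernel rejects the call */
	if (prctl(PR_SET_MM, PR_SET_MM_START_BRK, start_brk, 0, 0))
		return -errno;
	if (prctl(PR_SET_MM, PR_SET_MM_BRK, brk, 0, 0))
		return -errno;
	return 0;
}
```

An unprivileged caller, a kernel built without CONFIG_CHECKPOINT_RESTORE, or an address at or above TASK_SIZE should all produce a negative return value.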
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/prctl.h | 12 +++++
kernel/sys.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 132 insertions(+)
Index: linux-2.6.git/include/linux/prctl.h
===================================================================
--- linux-2.6.git.orig/include/linux/prctl.h
+++ linux-2.6.git/include/linux/prctl.h
@@ -102,4 +102,16 @@
#define PR_MCE_KILL_GET 34
+/*
+ * Tune up process memory map specifics.
+ */
+#define PR_SET_MM 35
+# define PR_SET_MM_START_CODE 1
+# define PR_SET_MM_END_CODE 2
+# define PR_SET_MM_START_DATA 3
+# define PR_SET_MM_END_DATA 4
+# define PR_SET_MM_START_STACK 5
+# define PR_SET_MM_START_BRK 6
+# define PR_SET_MM_BRK 7
+
#endif /* _LINUX_PRCTL_H */
Index: linux-2.6.git/kernel/sys.c
===================================================================
--- linux-2.6.git.orig/kernel/sys.c
+++ linux-2.6.git/kernel/sys.c
@@ -1692,6 +1692,123 @@ SYSCALL_DEFINE1(umask, int, mask)
return mask;
}
+#ifdef CONFIG_CHECKPOINT_RESTORE
+static int prctl_set_mm(int opt, unsigned long addr,
+ unsigned long arg4, unsigned long arg5)
+{
+ unsigned long rlim = rlimit(RLIMIT_DATA);
+ unsigned long vm_req_flags;
+ unsigned long vm_bad_flags;
+ struct vm_area_struct *vma;
+ int error = 0;
+
+ if (arg4 | arg5)
+ return -EINVAL;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (addr >= TASK_SIZE)
+ return -EINVAL;
+
+ down_read(&current->mm->mmap_sem);
+ vma = find_vma(current->mm, addr);
+
+ if (opt != PR_SET_MM_START_BRK && opt != PR_SET_MM_BRK) {
+ /* It must be existing VMA */
+ if (!vma || vma->vm_start > addr)
+ goto out;
+ }
+
+ error = -EINVAL;
+ switch (opt) {
+ case PR_SET_MM_START_CODE:
+ case PR_SET_MM_END_CODE:
+ vm_req_flags = VM_READ | VM_EXEC;
+ vm_bad_flags = VM_WRITE | VM_MAYSHARE;
+
+ if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
+ (vma->vm_flags & vm_bad_flags))
+ goto out;
+
+ if (opt == PR_SET_MM_START_CODE)
+ current->mm->start_code = addr;
+ else
+ current->mm->end_code = addr;
+ break;
+
+ case PR_SET_MM_START_DATA:
+ case PR_SET_MM_END_DATA:
+ vm_req_flags = VM_READ | VM_WRITE;
+ vm_bad_flags = VM_EXEC | VM_MAYSHARE;
+
+ if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
+ (vma->vm_flags & vm_bad_flags))
+ goto out;
+
+ if (opt == PR_SET_MM_START_DATA)
+ current->mm->start_data = addr;
+ else
+ current->mm->end_data = addr;
+ break;
+
+ case PR_SET_MM_START_STACK:
+
+#ifdef CONFIG_STACK_GROWSUP
+ vm_req_flags = VM_READ | VM_WRITE | VM_GROWSUP;
+#else
+ vm_req_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
+#endif
+ if ((vma->vm_flags & vm_req_flags) != vm_req_flags)
+ goto out;
+
+ current->mm->start_stack = addr;
+ break;
+
+ case PR_SET_MM_START_BRK:
+ if (addr <= current->mm->end_data)
+ goto out;
+
+ if (rlim < RLIM_INFINITY &&
+ (current->mm->brk - addr) +
+ (current->mm->end_data - current->mm->start_data) > rlim)
+ goto out;
+
+ current->mm->start_brk = addr;
+ break;
+
+ case PR_SET_MM_BRK:
+ if (addr <= current->mm->end_data)
+ goto out;
+
+ if (rlim < RLIM_INFINITY &&
+ (addr - current->mm->start_brk) +
+ (current->mm->end_data - current->mm->start_data) > rlim)
+ goto out;
+
+ current->mm->brk = addr;
+ break;
+
+ default:
+ error = -EINVAL;
+ goto out;
+ }
+
+ error = 0;
+
+out:
+ up_read(&current->mm->mmap_sem);
+
+ return error;
+}
+#else /* CONFIG_CHECKPOINT_RESTORE */
+static int prctl_set_mm(int opt, unsigned long addr,
+ unsigned long arg4, unsigned long arg5)
+{
+ return -EINVAL;
+}
+#endif
+
SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
@@ -1841,6 +1958,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
else
error = PR_MCE_KILL_DEFAULT;
break;
+ case PR_SET_MM:
+ error = prctl_set_mm(arg2, arg3, arg4, arg5);
+ break;
default:
error = -EINVAL;
break;
From: Andrew Morton <akpm@linux-foundation.org>
Subject: c-r-procfs-add-start_data-end_data-start_brk-members-to-proc-pid-stat-v4-fix
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
Documentation/filesystems/proc.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -307,7 +307,7 @@ Table 1-4: Contents of the stat files (a
cgtime guest time of the task children in jiffies
start_data address above which program data+bss is placed
end_data address below which program data+bss is placed
- start_brk address above which program heap can be expaned with brk() call
+ start_brk address above which program heap can be expanded with brk()
..............................................................................
The /proc/PID/maps file containing the currently mapped memory regions and
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4
The mm->start_code/end_code, mm->start_data/end_data, mm->start_brk are
involved in the calculation of program text/data segment sizes (which might
be seen in /proc/<pid>/statm) and of the brk() call final address.
For restore we need to know all these values. While
mm->start_code/end_code are already present in /proc/$pid/stat, the remaining
members are not, so this patch brings them in.
The restore procedure of these members is addressed in another patch using
prctl().
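On the checkpoint side, reading the exported members back could look like the sketch below. The helper `stat_field()` is a hypothetical name, and the field numbers are an assumption: the patch appends the three values right after cgtime, which puts them at positions 45 to 47 if cgtime is field 44. The comm field may itself contain spaces and parentheses, so the scan starts after the last ')'.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed 1-based positions of the members appended by this patch. */
#define STAT_FIELD_START_DATA	45
#define STAT_FIELD_END_DATA	46
#define STAT_FIELD_START_BRK	47

/*
 * Hypothetical helper: fetch the n-th (1-based) field of a
 * /proc/$pid/stat line as an unsigned long. Only fields after
 * "state" (n >= 4) are supported, and the caller must know the
 * field is unsigned.
 * Returns 0 on success, -1 if the field is missing or not numeric.
 */
static int stat_field(const char *line, int n, unsigned long *val)
{
	const char *p = strrchr(line, ')');
	int cur = 2;	/* the last ')' closes field 2 (comm) */
	char *end;

	if (!p || n < 4)
		return -1;

	for (p++; *p; ) {
		while (*p == ' ')
			p++;
		if (!*p || *p == '\n')
			break;
		if (++cur == n) {
			*val = strtoul(p, &end, 10);
			return end == p ? -1 : 0;
		}
		/* skip the current token */
		while (*p && *p != ' ' && *p != '\n')
			p++;
	}
	return -1;
}
```

Note that, per the fs/proc/array.c hunk of this patch, the kernel prints 0 for these members when the reader is not permitted to see them, so a zero value does not necessarily mean an empty segment.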
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
Documentation/filesystems/proc.txt | 3 +++
fs/proc/array.c | 7 +++++--
2 files changed, 8 insertions(+), 2 deletions(-)
Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -305,6 +305,9 @@ Table 1-4: Contents of the stat files (a
blkio_ticks time spent waiting for block IO
gtime guest time of the task in jiffies
cgtime guest time of the task children in jiffies
+ start_data address above which program data+bss is placed
+ end_data address below which program data+bss is placed
+ start_brk address above which program heap can be expaned with brk() call
..............................................................................
The /proc/PID/maps file containing the currently mapped memory regions and
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -464,7 +464,7 @@ static int do_task_stat(struct seq_file
seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
pid_nr_ns(pid, ns),
tcomm,
state,
@@ -511,7 +511,10 @@ static int do_task_stat(struct seq_file
task->policy,
(unsigned long long)delayacct_blkio_ticks(task),
cputime_to_clock_t(gtime),
- cputime_to_clock_t(cgtime));
+ cputime_to_clock_t(cgtime),
+ (mm && permitted) ? mm->start_data : 0,
+ (mm && permitted) ? mm->end_data : 0,
+ (mm && permitted) ? mm->start_brk : 0);
if (mm)
mmput(mm);
return 0;
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] fs, proc: Make proc_get_link to use dentry instead of inode
From: Cyrill Gorcunov <gorcunov@openvz.org>
This patch prepares the ground for the next "map_files"
patch which needs a name of a link file to analyse.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Pavel Emelyanov <xemul@parallels.com>
CC: Tejun Heo <tj@kernel.org>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: "Kirill A. Shutemov" <kirill@shutemov.name>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Andrew Morton <akpm@linux-foundation.org>
---
fs/proc/base.c | 20 ++++++++++----------
include/linux/proc_fs.h | 2 +-
2 files changed, 11 insertions(+), 11 deletions(-)
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -165,9 +165,9 @@ static int get_task_root(struct task_str
return result;
}
-static int proc_cwd_link(struct inode *inode, struct path *path)
+static int proc_cwd_link(struct dentry *dentry, struct path *path)
{
- struct task_struct *task = get_proc_task(inode);
+ struct task_struct *task = get_proc_task(dentry->d_inode);
int result = -ENOENT;
if (task) {
@@ -182,9 +182,9 @@ static int proc_cwd_link(struct inode *i
return result;
}
-static int proc_root_link(struct inode *inode, struct path *path)
+static int proc_root_link(struct dentry *dentry, struct path *path)
{
- struct task_struct *task = get_proc_task(inode);
+ struct task_struct *task = get_proc_task(dentry->d_inode);
int result = -ENOENT;
if (task) {
@@ -1567,13 +1567,13 @@ static const struct file_operations proc
.release = single_release,
};
-static int proc_exe_link(struct inode *inode, struct path *exe_path)
+static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
{
struct task_struct *task;
struct mm_struct *mm;
struct file *exe_file;
- task = get_proc_task(inode);
+ task = get_proc_task(dentry->d_inode);
if (!task)
return -ENOENT;
mm = get_task_mm(task);
@@ -1603,7 +1603,7 @@ static void *proc_pid_follow_link(struct
if (!proc_fd_access_allowed(inode))
goto out;
- error = PROC_I(inode)->op.proc_get_link(inode, &nd->path);
+ error = PROC_I(inode)->op.proc_get_link(dentry, &nd->path);
out:
return ERR_PTR(error);
}
@@ -1642,7 +1642,7 @@ static int proc_pid_readlink(struct dent
if (!proc_fd_access_allowed(inode))
goto out;
- error = PROC_I(inode)->op.proc_get_link(inode, &path);
+ error = PROC_I(inode)->op.proc_get_link(dentry, &path);
if (error)
goto out;
@@ -1980,9 +1980,9 @@ out_task:
return rc;
}
-static int proc_fd_link(struct inode *inode, struct path *path)
+static int proc_fd_link(struct dentry *dentry, struct path *path)
{
- return proc_fd_info(inode, path, NULL);
+ return proc_fd_info(dentry->d_inode, path, NULL);
}
static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd)
Index: linux-2.6.git/include/linux/proc_fs.h
===================================================================
--- linux-2.6.git.orig/include/linux/proc_fs.h
+++ linux-2.6.git/include/linux/proc_fs.h
@@ -253,7 +253,7 @@ extern const struct proc_ns_operations u
extern const struct proc_ns_operations ipcns_operations;
union proc_op {
- int (*proc_get_link)(struct inode *, struct path *);
+ int (*proc_get_link)(struct dentry *, struct path *);
int (*proc_read)(struct task_struct *task, char *page);
int (*proc_show)(struct seq_file *m,
struct pid_namespace *ns, struct pid *pid,
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] fs, proc: Introduce the /proc/<pid>/children entry v4
There is no easy way to make a reverse parent->children chain
from arbitrary <pid> (while parent pid is provided in "PPid"
field of /proc/<pid>/status).
So instead of walking over all pids in the system to figure out which
children a task has, we add an explicit /proc/<pid>/children entry,
because the kernel already has this kind of information but it is not
yet exported. This lists first-level children only, not the whole process
tree; the process threads are not identified with this interface either.
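A userspace consumer of this entry might parse it as in the sketch below. The function name `parse_children()` is hypothetical; the format assumed is the one documented by this patch, a stream of pids separated by spaces with a new line at the end.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: parse the contents of /proc/<pid>/children,
 * e.g. " 12 34 56\n", into pids[].
 * Returns the number of pids stored, or -1 if the buffer is malformed
 * or holds more than `max` pids.
 */
static int parse_children(const char *buf, long *pids, int max)
{
	int nr = 0;
	char *end;

	for (;;) {
		/* strtol() skips leading white space by itself */
		long pid = strtol(buf, &end, 10);

		if (end == buf)
			break;		/* no further number found */
		if (pid <= 0 || nr == max)
			return -1;
		pids[nr++] = pid;
		buf = end;
	}

	/* Only the separators and the final new line may remain. */
	while (*buf == ' ' || *buf == '\n')
		buf++;
	return *buf ? -1 : nr;
}
```

To build the whole process tree a caller would repeat this for every pid returned, since only first-level children are listed; the tasks should be stopped or frozen first if precise results are needed.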
v2:
- Kame suggested to use a separated /proc/<pid>/children entry
instead of poking /proc/<pid>/status
- Andrew suggested to use rcu facility instead of locking
tasklist_lock
- Tejun pointed that non-seekable seq file might not be
enough for tasks with large number of children
v3:
- To be on a safe side use %lu format for pid_t printing
v4:
- New line get printed when sequence ends not at seq->stop,
a nit pointed by Tejun
- Documentation update
- tasklist_lock is back, Oleg pointed that ->children list
is actually not rcu-safe
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
---
Documentation/filesystems/proc.txt | 20 ++++
fs/proc/array.c | 163 +++++++++++++++++++++++++++++++++++++
fs/proc/base.c | 1
fs/proc/internal.h | 6 +
4 files changed, 190 insertions(+)
Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,6 +40,7 @@ Table of Contents
3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
3.5 /proc/<pid>/mountinfo - Information about mounts
3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
+ 3.7 /proc/<pid>/children - Information about task children
------------------------------------------------------------------------------
@@ -1545,3 +1546,22 @@ a task to set its own or one of its thre
is limited in size compared to the cmdline value, so writing anything longer
then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
comm value.
+
+3.7 /proc/<pid>/children - Information about task children
+--------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of a task pointed by <pid>. The format is a stream of pids separated
+by space with a new line at the end. If a task has no children at
+all -- only a new line returned.
+
+Note the "first level" here -- if a child has own children they will
+not be printed there, one need to read /proc/<children-pid>/children
+to obtain descendants. The same applies to threads -- they are not
+counted here.
+
+Because this interface is intended to be fast and cheap it doesn't
+guarantee to provide the precise results, which means if a child is
+exiting it might or might not be counted. The same applies to freshly
+created children -- they might or might not be counted. If one needs
+precise pids -- the task and children should be either stopped or
+frozen.
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,166 @@ int proc_pid_statm(struct seq_file *m, s
return 0;
}
+
+static struct list_head *
+children_get_at(struct proc_pid_children_iter *iter, loff_t pos)
+{
+ struct task_struct *t = iter->group_leader;
+ struct task_struct *task;
+
+ rcu_read_lock();
+ do {
+ list_for_each_entry(task, &t->children, sibling) {
+ if (list_empty(&task->sibling))
+ break;
+ if (pos-- == 0) {
+ put_task_struct(iter->last_group);
+ iter->last_group = t;
+ get_task_struct(iter->last_group);
+ get_task_struct(task);
+ rcu_read_unlock();
+ return &task->sibling;
+ }
+ }
+ } while_each_thread(iter->group_leader, t);
+ rcu_read_unlock();
+
+ return NULL;
+}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+ struct task_struct *task = container_of(v, struct task_struct, sibling);
+ unsigned long pid;
+ int ret = -1;
+
+ rcu_read_lock();
+ if (pid_alive(task)) {
+ pid = (unsigned long)pid_vnr(task_pid(task));
+ ret = 0;
+ }
+ rcu_read_unlock();
+
+ if (!ret)
+ ret = seq_printf(seq, " %lu", pid);
+ return ret;
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ return children_get_at(seq->private, *pos);
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct proc_pid_children_iter *iter = seq->private;
+ struct task_struct *task = container_of(v, struct task_struct, sibling);
+ struct list_head *next = NULL;
+
+ if (!iter->last_group)
+ goto out;
+
+ rcu_read_lock();
+
+ if (list_empty(&task->sibling) ||
+ list_is_last(v, &iter->last_group->children)) {
+ struct task_struct *t = iter->last_group;
+
+ while_each_thread(iter->group_leader, t) {
+ if (!list_empty(&t->children)) {
+ put_task_struct(task);
+ next = t->children.next;
+ task = container_of(next, struct task_struct, sibling);
+ get_task_struct(task);
+ put_task_struct(iter->last_group);
+ iter->last_group = t;
+ get_task_struct(iter->last_group);
+ goto out_unlock;
+ }
+ }
+
+ put_task_struct(task);
+ put_task_struct(iter->last_group);
+ iter->last_group = NULL;
+ } else {
+ next = ((struct list_head *)v)->next;
+ put_task_struct(task);
+ task = container_of(next, struct task_struct, sibling);
+ get_task_struct(task);
+ }
+out_unlock:
+ rcu_read_unlock();
+out:
+ ++*pos;
+ if (!next)
+ seq_printf(seq, "\n");
+ return next;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+ struct proc_pid_children_iter *iter = seq->private;
+ if (iter->last_group)
+ put_task_struct(iter->last_group);
+ iter->last_group = NULL;
+}
+
+static const struct seq_operations children_seq_ops = {
+ .start = children_seq_start,
+ .next = children_seq_next,
+ .stop = children_seq_stop,
+ .show = children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+ struct proc_pid_children_iter *iter = NULL;
+ struct task_struct *task = NULL;
+ int ret;
+
+ ret = -ENOMEM;
+ iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ goto err;
+
+ ret = -ENOENT;
+ task = get_proc_task(inode);
+ if (!task)
+ goto err;
+
+ ret = seq_open(file, &children_seq_ops);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = iter;
+ iter->group_leader = task;
+ iter->last_group = task;
+ get_task_struct(iter->last_group);
+ }
+
+err:
+ if (ret) {
+ if (task)
+ put_task_struct(task);
+ kfree(iter);
+ }
+
+ return ret;
+}
+
+int children_seq_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *m = file->private_data;
+ struct proc_pid_children_iter *iter = m->private;
+
+ put_task_struct(iter->group_leader);
+ kfree(iter);
+ seq_release(inode, file);
+ return 0;
+}
+
+const struct file_operations proc_pid_children_operations = {
+ .open = children_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3204,6 +3204,7 @@ static const struct pid_entry tgid_base_
INF("cmdline", S_IRUGO, proc_pid_cmdline),
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
+ REG("children", S_IRUGO, proc_pid_children_operations),
REG("maps", S_IRUGO, proc_maps_operations),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,12 @@ extern int proc_pid_statm(struct seq_fil
struct pid *pid, struct task_struct *task);
extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
+struct proc_pid_children_iter {
+ struct task_struct *group_leader;
+ struct task_struct *last_group;
+};
+
+extern const struct file_operations proc_pid_children_operations;
extern const struct file_operations proc_maps_operations;
extern const struct file_operations proc_numa_maps_operations;
extern const struct file_operations proc_smaps_operations;
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] mincore: Add named constant for reported present bit
From: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
include/linux/mman.h | 2 ++
mm/huge_memory.c | 2 +-
mm/mincore.c | 10 +++++-----
3 files changed, 8 insertions(+), 6 deletions(-)
Index: linux-2.6.git/include/linux/mman.h
===================================================================
--- linux-2.6.git.orig/include/linux/mman.h
+++ linux-2.6.git/include/linux/mman.h
@@ -10,6 +10,8 @@
#define OVERCOMMIT_ALWAYS 1
#define OVERCOMMIT_NEVER 2
+#define MINCORE_RESIDENT 0x1
+
#ifdef __KERNEL__
#include <linux/mm.h>
#include <linux/percpu_counter.h>
Index: linux-2.6.git/mm/huge_memory.c
===================================================================
--- linux-2.6.git.orig/mm/huge_memory.c
+++ linux-2.6.git/mm/huge_memory.c
@@ -1045,7 +1045,7 @@ int mincore_huge_pmd(struct vm_area_stru
* All logical pages in the range are present
* if backed by a huge page.
*/
- memset(vec, 1, (end - addr) >> PAGE_SHIFT);
+ memset(vec, MINCORE_RESIDENT, (end - addr) >> PAGE_SHIFT);
}
} else
spin_unlock(&vma->vm_mm->page_table_lock);
Index: linux-2.6.git/mm/mincore.c
===================================================================
--- linux-2.6.git.orig/mm/mincore.c
+++ linux-2.6.git/mm/mincore.c
@@ -38,7 +38,7 @@ static void mincore_hugetlb_page_range(s
addr & huge_page_mask(h));
present = ptep && !huge_pte_none(huge_ptep_get(ptep));
while (1) {
- *vec = present;
+ *vec = (present ? MINCORE_RESIDENT : 0);
vec++;
addr += PAGE_SIZE;
if (addr == end)
@@ -83,7 +83,7 @@ static unsigned char mincore_page(struct
page_cache_release(page);
}
- return present;
+ return present ? MINCORE_RESIDENT : 0;
}
static void mincore_unmapped_range(struct vm_area_struct *vma,
@@ -122,7 +122,7 @@ static void mincore_pte_range(struct vm_
if (pte_none(pte))
mincore_unmapped_range(vma, addr, next, vec);
else if (pte_present(pte))
- *vec = 1;
+ *vec = MINCORE_RESIDENT;
else if (pte_file(pte)) {
pgoff = pte_to_pgoff(pte);
*vec = mincore_page(vma->vm_file->f_mapping, pgoff);
@@ -131,14 +131,14 @@ static void mincore_pte_range(struct vm_
if (is_migration_entry(entry)) {
/* migration entries are always uptodate */
- *vec = 1;
+ *vec = MINCORE_RESIDENT;
} else {
#ifdef CONFIG_SWAP
pgoff = entry.val;
*vec = mincore_page(&swapper_space, pgoff);
#else
WARN_ON(1);
- *vec = 1;
+ *vec = MINCORE_RESIDENT;
#endif
}
}
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] mincore: Report whether page is anon or not
From: Pavel Emelyanov <xemul@parallels.com>
This is required so that we do not dump pages from private file mappings that
are not mapped or not yet COW-ed.
The thing is that mincore reports 1 for pages that are in memory, regardless
of whether they are mapped or not. For example, if we map a file of 2 pages and
read a single page, mincore will report 1 for both: the 1st one being mapped and
the 2nd one being in the page cache due to readahead.
With this fix both pages will be !PageAnon and we can skip them when dumping.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
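The page-selection rule described in the message above can be sketched from the
user-space side as follows. This is only an illustration: the two bit values are
the ones this series adds to include/linux/mman.h (they are not assumed to exist
in any installed header, so they are defined locally), the helper names
page_needs_dump() and count_dump_pages() are hypothetical, and the rule "dump
only pages that are both resident and anonymous" is an assumption about how a
dumper would consume the vector.

```c
#include <assert.h>
#include <stddef.h>

/* Bit values as introduced by the patches in this series (assumption:
 * defined locally because installed headers may not carry them). */
#define MINCORE_RESIDENT 0x1
#define MINCORE_ANON     0x2

/* Hypothetical helper: does one mincore(2) vector entry describe a page
 * that has to be written into the dump? */
static int page_needs_dump(unsigned char v)
{
	return (v & MINCORE_RESIDENT) && (v & MINCORE_ANON);
}

/* Hypothetical helper: count the pages to dump in a vector that was
 * filled in by mincore(2) for nr_pages pages. */
static size_t count_dump_pages(const unsigned char *vec, size_t nr_pages)
{
	size_t i, n = 0;

	assert(vec || nr_pages == 0);

	for (i = 0; i < nr_pages; i++)
		if (page_needs_dump(vec[i]))
			n++;

	return n;
}
```

With the two-page file example from the message, both entries would carry only
MINCORE_RESIDENT, so count_dump_pages() would return 0 for that mapping.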
include/linux/mman.h | 1 +
mm/mincore.c | 15 +++++++++++++--
2 files changed, 14 insertions(+), 2 deletions(-)
Index: linux-2.6.git/include/linux/mman.h
===================================================================
--- linux-2.6.git.orig/include/linux/mman.h
+++ linux-2.6.git/include/linux/mman.h
@@ -11,6 +11,7 @@
#define OVERCOMMIT_NEVER 2
#define MINCORE_RESIDENT 0x1
+#define MINCORE_ANON 0x2
#ifdef __KERNEL__
#include <linux/mm.h>
Index: linux-2.6.git/mm/mincore.c
===================================================================
--- linux-2.6.git.orig/mm/mincore.c
+++ linux-2.6.git/mm/mincore.c
@@ -38,7 +38,7 @@ static void mincore_hugetlb_page_range(s
addr & huge_page_mask(h));
present = ptep && !huge_pte_none(huge_ptep_get(ptep));
while (1) {
- *vec = (present ? MINCORE_RESIDENT : 0);
+ *vec = (present ? MINCORE_RESIDENT : 0) | MINCORE_ANON;
vec++;
addr += PAGE_SIZE;
if (addr == end)
@@ -86,6 +86,17 @@ static unsigned char mincore_page(struct
return present ? MINCORE_RESIDENT : 0;
}
+static unsigned char mincore_pte(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
+{
+ struct page *pg;
+
+ pg = vm_normal_page(vma, addr, pte);
+ if (!pg)
+ return 0;
+ else
+ return PageAnon(pg) ? MINCORE_ANON : 0;
+}
+
static void mincore_unmapped_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
unsigned char *vec)
@@ -122,7 +133,7 @@ static void mincore_pte_range(struct vm_
if (pte_none(pte))
mincore_unmapped_range(vma, addr, next, vec);
else if (pte_present(pte))
- *vec = MINCORE_RESIDENT;
+ *vec = MINCORE_RESIDENT | mincore_pte(vma, addr, pte);
else if (pte_file(pte)) {
pgoff = pte_to_pgoff(pte);
*vec = mincore_page(vma->vm_file->f_mapping, pgoff);
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] procfs-introduce-the-proc-pid-map_files-directory-checkpatch-fixes
From: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
WARNING: line over 80 characters
#286: FILE: fs/proc/base.c:2433:
+static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
WARNING: line over 80 characters
#351: FILE: fs/proc/base.c:2498:
+ fa = flex_array_alloc(sizeof(info), nr_files, GFP_KERNEL);
WARNING: line over 80 characters
#352: FILE: fs/proc/base.c:2499:
+ if (!fa || flex_array_prealloc(fa, 0, nr_files, GFP_KERNEL)) {
WARNING: line over 80 characters
#360: FILE: fs/proc/base.c:2507:
+ for (i = 0, vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
WARNING: line over 80 characters
#368: FILE: fs/proc/base.c:2515:
+ info.len = snprintf(info.name, sizeof(info.name),
WARNING: line over 80 characters
#424: FILE: fs/proc/base.c:3179:
+ DIR("map_files", S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
WARNING: line over 80 characters
#437: FILE: include/linux/mm.h:1497:
+find_exact_vma(struct mm_struct *mm, unsigned long vm_start, unsigned long vm_end)
total: 0 errors, 7 warnings, 387 lines checked
./patches/procfs-introduce-the-proc-pid-map_files-directory.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/proc/base.c | 18 +++++++++++-------
include/linux/mm.h | 4 ++--
2 files changed, 13 insertions(+), 9 deletions(-)
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2430,7 +2430,8 @@ static const struct inode_operations proc_map_files_inode_operations = {
.setattr = proc_setattr,
};
-static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
+static int
+proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
{
struct dentry *dentry = filp->f_path.dentry;
struct inode *inode = dentry->d_inode;
@@ -2495,8 +2496,10 @@ static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t fil
}
if (nr_files) {
- fa = flex_array_alloc(sizeof(info), nr_files, GFP_KERNEL);
- if (!fa || flex_array_prealloc(fa, 0, nr_files, GFP_KERNEL)) {
+ fa = flex_array_alloc(sizeof(info), nr_files,
+ GFP_KERNEL);
+ if (!fa || flex_array_prealloc(fa, 0, nr_files,
+ GFP_KERNEL)) {
ret = -ENOMEM;
if (fa)
flex_array_free(fa);
@@ -2504,7 +2507,8 @@ static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t fil
mmput(mm);
goto out_unlock;
}
- for (i = 0, vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+ for (i = 0, vma = mm->mmap, pos = 2; vma;
+ vma = vma->vm_next) {
if (!vma->vm_file)
continue;
if (++pos <= filp->f_pos)
@@ -2512,9 +2516,9 @@ static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t fil
get_file(vma->vm_file);
info.file = vma->vm_file;
- info.len = snprintf(info.name, sizeof(info.name),
- "%lx-%lx", vma->vm_start,
- vma->vm_end);
+ info.len = snprintf(info.name,
+ sizeof(info.name), "%lx-%lx",
+ vma->vm_start, vma->vm_end);
if (flex_array_put(fa, i++, &info, GFP_KERNEL))
BUG();
}
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1492,8 +1492,8 @@ static inline unsigned long vma_pages(struct vm_area_struct *vma)
}
/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
-static inline struct vm_area_struct *
-find_exact_vma(struct mm_struct *mm, unsigned long vm_start, unsigned long vm_end)
+static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
+ unsigned long vm_start, unsigned long vm_end)
{
struct vm_area_struct *vma = find_vma(mm, vm_start);
--
1.7.7.3
fs-proc-Make-proc_get_link-to-use-dentry
fs-proc-Introduce-the-proc-pid-map_files-directory
procfs-introduce-the-proc-pid-map_files-directory-checkpatch
sysfs-add-kernel.ns_last_pid
c-r-introduce-checkpoint_restore-symbol.patch
c-r-procfs-add-start_data-end_data-start_brk-members-to-proc-pid-stat-v4.patch
c-r-procfs-add-start_data-end_data-start_brk-members-to-proc-pid-stat-v4-fix.patch
c-r-prctl-add-pr_set_mm-codes-to-set-up-mm_struct-entries.patch
c-r-prctl-add-pr_set_mm-codes-to-set-up-mm_struct-entries-fix.patch
mincore-Add-named-constant-for-reported-present-bit
mincore-Report-whether-page-is-anon-or-not
fs-proc-add-children-entry-11
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] sysctl: Add the kernel.ns_last_pid control
From: Pavel Emelyanov <xemul@parallels.com>
The sysctl works on the current task's pid namespace, getting and setting its
last_pid field.
Writing is allowed for CAP_SYS_ADMIN-capable tasks, thus making it possible to
create a task with a desired pid value. This ability is badly needed for
checkpoint/restore in userspace.
This approach suits all the parties for now.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Tejun Heo <tj@kernel.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
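A restorer could use the sysctl described above roughly as sketched below. This
is a minimal sketch, not the crtools implementation: set_next_pid() is a
hypothetical helper, the path macro assumes the entry shows up under
/proc/sys/kernel/, and storing want_pid - 1 relies on the assumption that the
allocator starts its search at last_pid + 1. The path is passed as a parameter
so the helper can be tried on an ordinary file without CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed location of the sysctl added by this patch. */
#define NS_LAST_PID_PATH "/proc/sys/kernel/ns_last_pid"

/*
 * Hypothetical helper: request that the next fork in the caller's pid
 * namespace gets want_pid, by recording want_pid - 1 as the last
 * allocated pid. Returns 0 on success, -1 on failure.
 */
static int set_next_pid(const char *path, int want_pid)
{
	char buf[32];
	int fd, len, ret = -1;

	/* pid 1 cannot be requested this way. */
	assert(want_pid > 1);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	len = snprintf(buf, sizeof(buf), "%d", want_pid - 1);
	if (write(fd, buf, len) == len)
		ret = 0;

	close(fd);
	return ret;
}
```

A caller would invoke set_next_pid(NS_LAST_PID_PATH, pid) right before fork()
and then check that the child really got the requested pid, since any other
fork in the same namespace can slip in between the write and the fork. As the
comment in the handler says, the writer has to synchronize by external means.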
Documentation/sysctl/kernel.txt | 8 ++++++++
kernel/pid.c | 4 +++-
kernel/pid_namespace.c | 31 +++++++++++++++++++++++++++++++
3 files changed, 42 insertions(+), 1 deletion(-)
Index: linux-2.6.git/Documentation/sysctl/kernel.txt
===================================================================
--- linux-2.6.git.orig/Documentation/sysctl/kernel.txt
+++ linux-2.6.git/Documentation/sysctl/kernel.txt
@@ -401,6 +401,14 @@ PIDs of value pid_max or larger are not
==============================================================
+ns_last_pid:
+
+The last pid allocated in the current (the one task using this sysctl
+lives in) pid namespace. When selecting a pid for a next task on fork
+kernel tries to allocate a number starting from this one.
+
+==============================================================
+
powersave-nap: (PPC only)
If set, Linux-PPC will use the 'nap' mode of powersaving,
Index: linux-2.6.git/kernel/pid.c
===================================================================
--- linux-2.6.git.orig/kernel/pid.c
+++ linux-2.6.git/kernel/pid.c
@@ -137,7 +137,9 @@ static int pid_before(int base, int a, i
}
/*
- * We might be racing with someone else trying to set pid_ns->last_pid.
+ * We might be racing with someone else trying to set pid_ns->last_pid
+ * at the pid allocation time (there's also a sysctl for this, but racing
+ * with this one is OK, see comment in kernel/pid_namespace.c about it).
* We want the winner to have the "later" value, because if the
* "earlier" value prevails, then a pid may get reused immediately.
*
Index: linux-2.6.git/kernel/pid_namespace.c
===================================================================
--- linux-2.6.git.orig/kernel/pid_namespace.c
+++ linux-2.6.git/kernel/pid_namespace.c
@@ -191,9 +191,40 @@ void zap_pid_ns_processes(struct pid_nam
return;
}
+static int pid_ns_ctl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table tmp = *table;
+
+ if (write && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /*
+ * Writing directly to ns' last_pid field is OK, since this field
+ * is volatile in a living namespace anyway and a code writing to
+ * it should synchronize its usage with external means.
+ */
+
+ tmp.data = &current->nsproxy->pid_ns->last_pid;
+ return proc_dointvec(&tmp, write, buffer, lenp, ppos);
+}
+
+static struct ctl_table pid_ns_ctl_table[] = {
+ {
+ .procname = "ns_last_pid",
+ .maxlen = sizeof(int),
+ .mode = 0666, /* permissions are checked in the handler */
+ .proc_handler = pid_ns_ctl_handler,
+ },
+ { }
+};
+
+static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
+
static __init int pid_namespaces_init(void)
{
pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+ register_sysctl_paths(kern_path, pid_ns_ctl_table);
return 0;
}