docs: Add internals details

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

Commit 2e00d019 authored Nov 22, 2011 by Cyrill Gorcunov

Showing with 145 additions and 0 deletions

INTERNALS +145 -0

--- a/INTERNALS
+++ b/INTERNALS
+crtools internals
+=================
+What CRtools is
+---------------
+In short, crtools is a utility to checkpoint/restore (CR) processes. Unlike CR
+implemented completely in kernel space, it tries to achieve the same goal by
+operating in user space.
+Since this tool (and the overall concept) is under heavy development, there are
+some known limitations
+ - Only a pure x86-64 environment is supported, with no IA32 emulation.
+ - There is no way to use the cgroups freezer facility.
+ - No network or IPC CR is supported.
+At the moment, CR of the following resources is supported
+ - Process tree
+ - Files (with some limitations)
+ - Pipes
+ - Memory
+Basic design
+------------
+Checkpoint
+~~~~~~~~~~
+The checkpoint procedure relies on the /proc file system (the general place
+where crtools obtains all the information it needs), which includes
+ - File descriptors (via /proc/$pid/fd and /proc/$pid/fdinfo).
+ - Pipe parameters.
+ - Memory maps (via /proc/$pid/maps).
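As an illustration of the memory map source listed above, here is a minimal sketch of parsing a single /proc/$pid/maps line in C. The `struct vma_entry` type and the `parse_maps_line()` helper are hypothetical names chosen for this example, not crtools code, and the sketch assumes the usual `start-end perms offset dev inode [path]` column layout. Paths containing spaces would be cut at the first space.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one /proc/$pid/maps entry (not a crtools type). */
struct vma_entry {
	unsigned long start;	/* first address of the area */
	unsigned long end;	/* first address past the area */
	char perms[5];		/* e.g. "r-xp" */
	unsigned long pgoff;	/* offset into the backing file */
	char path[256];		/* backing file, empty for anonymous areas */
};

/* Parse one maps line; returns 0 on success, -1 on a malformed line. */
static int parse_maps_line(const char *line, struct vma_entry *vma)
{
	unsigned int dev_major, dev_minor;
	unsigned long inode;
	int n;

	assert(line && vma);
	memset(vma, 0, sizeof(*vma));

	n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255s",
		   &vma->start, &vma->end, vma->perms, &vma->pgoff,
		   &dev_major, &dev_minor, &inode, vma->path);

	/* The path column is optional, so 7 converted fields is still valid. */
	if (n < 7 || vma->start >= vma->end)
		return -1;
	return 0;
}
```

A caller would typically read /proc/$pid/maps line by line (for example with fgets()) and feed each line to this helper to build the VMA list.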
+The process dumper (let's call it the "dumper") performs the following steps
+during the checkpoint stage
+ - The $pid of a process group leader is obtained from the command line.
+ - Using this $pid, the dumper walks through /proc/$pid/status and gathers
+   children $pid's recursively. At the end we have a complete process tree.
+ - Then it takes every $pid from the process tree, sends SIGSTOP to every process
+   found and performs the following steps on each $pid
+   - Collects VMA areas by parsing /proc/$pid/maps.
+   - Seizes the task via the relatively new ptrace interface. Seizing a task means
+     putting it into a special state in which the task has no idea that it is being
+     operated on via ptrace.
+   - Core parameters of the task (such as registers and friends) are dumped via
+     the ptrace interface and by parsing the /proc/$pid/stat entry.
+   - The dumper injects parasite code into the task via the ptrace interface. This
+     allows us to dump pages of the task right from within the task's address space.
+     The injection procedure is fairly simple
+     - The dumper scans the executable VMA areas of the task (which were previously
+       collected) and tests whether there is room for a few instructions.
+     - Then (also via ptrace) it replaces the original code with the new instructions
+       and creates a new VMA area inside the process address space.
+     - Finally, the parasite code is copied into the new VMA, and the original code
+       that was modified during the parasite bootstrap procedure is restored.
+   - Then the dumper flushes the contents of the task's pages to a file and drops
+     the parasite code block completely, since it is no longer needed.
+   - Once the parasite code is removed, the task is unseized via a ptrace call but
+     still remains stopped.
+   - The dumper writes out the parameters of open files and pipes (flushing data to
+     disk if needed).
+ - SIGCONT is sent to every task in the process tree (to continue execution).
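The first step of the parasite injection described above (scanning executable VMA areas for room to place a few instructions) could look roughly like the following C sketch. The `struct vma_area` type, the `find_bootstrap_vma()` helper and the "first match wins" policy are assumptions made for illustration; the text does not say how crtools actually picks the area.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down VMA record collected from /proc/$pid/maps. */
struct vma_area {
	unsigned long start;	/* first address of the area */
	unsigned long end;	/* first address past the area */
	char perms[5];		/* permission column, e.g. "r-xp" */
};

/*
 * Return the first executable area that can hold `need` bytes of
 * bootstrap instructions, or NULL if no such area exists.
 */
static const struct vma_area *find_bootstrap_vma(const struct vma_area *vmas,
						 size_t nr, unsigned long need)
{
	size_t i;

	assert(need > 0);

	for (i = 0; i < nr; i++) {
		if (!strchr(vmas[i].perms, 'x'))
			continue;	/* not executable, skip it */
		if (vmas[i].end - vmas[i].start >= need)
			return &vmas[i];
	}
	return NULL;
}
```

The original bytes at the chosen address would have to be saved before they are overwritten, so that they can be restored once the parasite has been copied into its own VMA.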
+Restore
+~~~~~~~
+The restore procedure (aka the restorer) proceeds through the following steps
+ - The process tree is read from a file.
+ - To restore the process tree the restorer executes the clone(CLONE_CHILD_USEPID)
+   syscall, which creates a process with the specified $pid. Note that if a process
+   with the same $pid is already up and running, the restoration procedure will
+   refuse to proceed.
+ - Files and pipes are restored (i.e. opened with the file descriptors they had at
+   checkpoint time and positioned exactly as they were before; if a pipe had some
+   data buffered before the checkpoint, that data is written back into the pipe).
+ - Restoration of virtual memory (and memory pages) is a bit tricky and is
+   implemented in the following steps
+   - The restorer analyzes the current VMA map by parsing the /proc/$pid/maps file.
+   - Since we are about to create a completely new memory map, the restorer
+     enumerates all VMA entries and looks for a place (or hole) between VMAs
+     that is big enough to hold all the code and parameters needed for the
+     rest of the restore procedure.
+   - Once such an area is found, the restorer copies its own code and data there.
+   - Then the restorer passes execution to that copy, which in turn
+     - Unmaps the currently active VMAs and maps the areas the process had at
+       checkpoint time.
+     - Reads page contents back into the newly mapped memory.
+     - Prepares an rt-sigreturn frame on the stack and issues the
+       __NR_rt_sigreturn syscall, so that the process resumes execution from
+       the IP it had at checkpoint time.
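The hole search used during memory restoration can be sketched in C as follows. This is only an illustration under stated assumptions: `struct vma_range` and `find_hole()` are hypothetical names, the list is assumed to be sorted by start address and non-overlapping (as /proc/$pid/maps reports it), and only the list passed in is inspected. Whether the real restorer also checks the hole against the checkpoint-time layout is not stated in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical address range of one VMA: [start, end). */
struct vma_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Return the start address of the first gap between two neighbouring
 * VMAs that is at least `need` bytes long, or 0 if there is none.
 */
static unsigned long find_hole(const struct vma_range *vmas, size_t nr,
			       unsigned long need)
{
	size_t i;

	assert(need > 0);

	for (i = 0; i + 1 < nr; i++) {
		unsigned long gap = vmas[i + 1].start - vmas[i].end;

		if (gap >= need)
			return vmas[i].end;
	}
	return 0;
}
```

The returned address could then serve as the destination for the restorer's own code and data before the old mappings are torn down.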
+Kernel area
+-----------
+Although CR is implemented in user space, some help from the Linux kernel
+is still needed, so the following patches are required
+ - A new directory /proc/$pid/map_files, which allows the CR to find and restore
+   anonymous shared memory areas.
+ - An explicit "Children:" line added to the /proc/$pid/stat file. This simplifies
+   the code significantly (the kernel already has this information but does not
+   yet export it).
+ - The ability to call clone() with a specified $pid.
+ - start_data, end_data and a few more members of mm_struct.
+   - Export added to /proc/$pid/stat.
+   - Import implemented via new prctl codes.
+ - The ability to map the vDSO at a predefined address (also implemented via
+   a new prctl code).