crtools internals
=================


What CRtools is
---------------
In short, crtools is a utility to checkpoint/restore (CR) processes. Unlike CR
implemented completely in kernel space, it tries to achieve the same goal by operating
in user space.

Since this tool (and the overall concept) is under heavy development, there are
some known limitations:

 - Only a pure x86-64 environment is supported; there is no IA32 emulation.
 - There is no way to use the cgroups freezer facility.
 - Network and IPC CR are not supported.

At the moment, CR of the following resources is supported:

 - Process tree
 - Files (with some limitations)
 - Pipes
 - Memory


Basic design
------------

Checkpoint
~~~~~~~~~~

The checkpoint procedure relies on the /proc file system, which is the general place
where crtools takes all the information it needs. This includes:

 - File descriptors (via /proc/$pid/fd and /proc/$pid/fdinfo).
 - Pipe parameters.
 - Memory maps (via /proc/$pid/maps).
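
As an illustration, two of the text formats listed above can be parsed with a
few lines of Python. This is only a minimal sketch, not the parser crtools
itself uses: the `Vma` record and the helper names `parse_maps_line`,
`parse_maps` and `parse_fdinfo` are made up for this example, and the field
layout is assumed to be the one described in proc(5).

```python
from dataclasses import dataclass


@dataclass
class Vma:
    """One line of /proc/$pid/maps (hypothetical record for this sketch)."""
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    path: str


def parse_maps_line(line):
    # Layout: "start-end perms offset dev inode [pathname]".
    # The pathname is optional and may contain spaces, hence maxsplit=5.
    fields = line.split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = addr.split("-")
    return Vma(int(start, 16), int(end, 16), perms,
               int(offset, 16), dev, int(inode), path)


def parse_maps(text):
    """Parse the whole contents of a maps file into a list of Vma records."""
    return [parse_maps_line(line) for line in text.splitlines() if line.strip()]


def parse_fdinfo(text):
    """Extract the file position and open flags from /proc/$pid/fdinfo/$fd."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    # "pos" is printed in decimal, "flags" in octal.
    return {"pos": int(info["pos"]), "flags": int(info["flags"], 8)}
```

On a Linux system, `parse_maps(open("/proc/self/maps").read())` should return
the mappings of the calling process in the same form.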

The process dumper (let's call it the "dumper") performs the following steps during
the checkpoint stage:

 - A $pid of a process group leader is obtained from the command line.

 - Using this $pid, the dumper walks through /proc/$pid/status and gathers
   the children's $pid values recursively. At the end we have a complete process tree.

 - Then it takes every $pid from the process tree, sends SIGSTOP to every process
   found, and performs the following steps on each $pid:

   - Collects VMA areas by parsing /proc/$pid/maps.

   - Seizes the task via a relatively new ptrace interface. Seizing a task means
     putting it into a special state in which the task has no idea that it is being
     operated on by ptrace.

   - Core parameters of the task (such as registers and friends) are dumped via
     the ptrace interface and by parsing the /proc/$pid/stat entry.

   - The dumper injects parasite code into the task via the ptrace interface. This allows
     us to dump the pages of the task right from within the task's address space.
     The injection procedure is pretty simple:

     - The dumper scans the executable VMA areas of the task (which were previously collected)
       and tests whether there is room for a few instructions.

     - Then (via ptrace as well) it replaces the original code with new instructions
       and creates a new VMA area inside the process address space.

     - Finally, the parasite code is copied into the new VMA, and the original code that was
       modified during the parasite bootstrap procedure is restored.

   - Then the dumper flushes the contents of the task's pages to a file and removes
     the parasite code block completely, since we don't need it anymore.

   - Once the parasite code is removed, the task is unseized via a ptrace call but
     still remains stopped.

   - The dumper writes out the parameters of opened files and pipes (flushing data to disk
     if needed).

 - SIGCONT is sent to every task in the process tree (to continue execution).
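
The first steps of the list (gathering the process tree, then stopping and
resuming it) can be sketched roughly as follows. Note that this sketch swaps in
a different technique from the one described: instead of relying on a list of
children exported by the kernel, it reads the `PPid:` field that mainline
kernels already put in /proc/$pid/status and inverts the parent map. The
function names are hypothetical and the sketch ignores races with processes
that fork while the scan is running.

```python
import os
import signal


def parse_ppid(status_text):
    """Return the parent pid from the contents of /proc/$pid/status."""
    for line in status_text.splitlines():
        if line.startswith("PPid:"):
            return int(line.split()[1])
    raise ValueError("no PPid: line found")


def read_ppids(proc="/proc"):
    """Map every visible pid to its parent pid."""
    ppids = {}
    for name in os.listdir(proc):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc, name, "status")) as f:
                ppids[int(name)] = parse_ppid(f.read())
        except (OSError, ValueError):
            continue  # the process went away while we were scanning
    return ppids


def collect_tree(root, ppids):
    """Return root and all of its descendants, parents before children."""
    children = {}
    for pid, ppid in ppids.items():
        children.setdefault(ppid, []).append(pid)
    order, stack = [], [root]
    while stack:
        pid = stack.pop()
        order.append(pid)
        # Push in reverse so that smaller pids are visited first.
        stack.extend(sorted(children.get(pid, []), reverse=True))
    return order


def stop_all(pids):
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)


def resume_all(pids):
    for pid in pids:
        os.kill(pid, signal.SIGCONT)
```

A caller would typically do `tree = collect_tree(leader, read_ppids())`, then
`stop_all(tree)`, dump each task, and finish with `resume_all(tree)`.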

Restore
~~~~~~~

The restore procedure (aka the restorer) proceeds through the following steps:

 - The process tree is read from a file.

 - To restore the process tree, the restorer executes the clone(CLONE_CHILD_USEPID)
   syscall, which creates a process with the specified $pid. Note that if for some reason
   a process with the same $pid is already up and running, the restoration
   procedure will refuse to proceed.

 - Files and pipes are restored, i.e. opened with the file descriptors they had at
   checkpoint time and positioned exactly as they were before. If a pipe
   had some data buffered before the checkpoint, the data is written back to the pipe.

 - Restoration of virtual memory (and memory pages) is a bit tricky and is implemented
   by the following steps:

   - The restorer analyzes the current VMA map by parsing the /proc/$pid/maps file.

   - Since we are about to create a completely new memory map, the restorer enumerates
     all VMA entries and looks for a place (or hole) between VMAs
     that is big enough to hold all the code and parameters needed for the
     rest of the restore procedure.

   - Once such an area is found, the restorer copies its own code and data to the new place.

   - Then the restorer passes execution there, and the relocated code in turn:

     - Unmaps the currently active VMAs and maps the areas the process had at
       checkpoint time.

     - Reads page contents back into the newly mapped memory.

     - Prepares an rt-sigreturn frame on the stack and issues the __NR_rt_sigreturn
       syscall, so as a result the process resumes execution from the
       IP it had at checkpoint time.
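
The search for a hole between VMAs can be sketched as a simple first-fit scan
over the sorted mappings. This is an illustration under assumptions rather
than the restorer's actual algorithm: the function name `find_hole`, the
4 KiB page size and the default address limits are chosen for the example,
and the sketch only checks the list of VMAs it is given.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def find_hole(vmas, size, lo=0x10000, hi=0x7FFFFFFFF000):
    """Return the lowest address in [lo, hi) where `size` bytes fit.

    `vmas` is an iterable of (start, end) pairs with `end` exclusive.
    The default `lo` and `hi` limits are illustrative guesses for x86-64
    user space, not values taken from crtools. Returns None if no hole
    is large enough.
    """
    # Round the request up to a whole number of pages.
    size = (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)
    cursor = lo
    for start, end in sorted(vmas):
        if start - cursor >= size:
            return cursor  # the gap before this VMA is big enough
        cursor = max(cursor, end)
    if hi - cursor >= size:
        return cursor  # free space after the last VMA
    return None
```

In a real restorer the candidate area presumably also has to stay clear of the
mappings that are about to be recreated; passing both the current and the
target VMAs to `find_hole` would be one way to account for that.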


Kernel area
-----------

While CR is implemented in user space, some help from the Linux kernel
is still needed, so the following patches are required:

 - New directory /proc/$pid/map_files, which allows the CR to find and restore
   anonymous shared memory areas.

 - An explicit "Children:" line added to the /proc/$pid/status file. This simplifies the code
   significantly (the kernel already has this information but simply does not yet
   export it).

 - The ability to call clone() with a specified $pid.

 - start_data, end_data and a few more members of mm_struct.

   - Export added to /proc/$pid/stat.

   - Import implemented via new prctl codes.

 - The ability to map the vDSO at a predefined address (implemented via
   a new prctl code as well).
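
For the export side of the mm_struct members, reading the values back from
/proc/$pid/stat could look like the sketch below. It assumes the layout
documented in proc(5) for mainline kernels, where `start_data` and `end_data`
are fields 45 and 46; the patches discussed here may number them differently,
so treat the indices as an assumption. The helper names are hypothetical.

```python
def split_stat(text):
    """Split /proc/$pid/stat into (pid, comm, remaining fields).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ")" instead of on spaces.
    """
    head, _, tail = text.rpartition(")")
    pid, _, comm = head.partition("(")
    return int(pid), comm, tail.split()


def data_segment(text):
    """Return (start_data, end_data) from the contents of a stat file."""
    _pid, _comm, rest = split_stat(text)
    # rest[0] is field 3 (state), so field N is found at rest[N - 3].
    # Fields 45 and 46 are an assumption based on proc(5).
    if len(rest) < 46 - 2:
        raise ValueError("stat line has no start_data/end_data fields")
    return int(rest[45 - 3]), int(rest[46 - 3])
```

On a kernel that exports these fields,
`data_segment(open("/proc/self/stat").read())` should return the bounds of the
data segment of the calling process.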
