Commits · e9d0499cd1b2f739d56708cb6c1c0ab03691bfd8 · zhul / criu

19 Sep, 2014 7 commits

test: add a test for remap_dead_pid · e9d0499c

Tycho Andersen authored Sep 17, 2014

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

e9d0499c

remap: add a dead pid /proc remap · f020bef7

Tycho Andersen authored Sep 17, 2014

If a file like /proc/20/mountinfo is open, but 20 is a zombie (or doesn't exist
any more), we can't read this file at all, so a link remap won't work. Instead,
we add a new remap, called the dead process remap, which forks a TASK_HELPER as
that dead pid so that the restore task can open the new /proc/20/mountinfo
instead.

This commit also adds a new stage CR_STATE_RESTORE_SHARED. Since new
TASK_HELPERS are added when loading the shared resource images, we need to wait
to start forking tasks until after these resources are loaded.

v2: fix a mutex bug
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

f020bef7

restore: TASK_HELPERs live until RESTORE stage ends · c09ba04c

Tycho Andersen authored Sep 17, 2014

In order to use TASK_HELPERS to open files from dead processes, they should
persist until criu is done restoring the filesystem, which happens in the
RESTORE stage. To do this, we need to pass the helpers' PIDs to the restorer
blob, so that it can wait() on them when the restore stage is done.

This commit is in preparation for the remap_dead_pid commits.

v2: wait() on helpers after restore stage is over
v3: add CR_STATE_RESTORE_FS stage
v4: CR_STATE_RESTORE_FS waits for nr_tasks + nr_helpers, not nr_threads
v5: ditch CR_STATE_RESTORE_FS in favor of passing helpers to restorer blob
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

c09ba04c

mount: skip the criu's mount namespace if tasks live in another mntns · 5a101d83

Andrey Vagin authored Sep 18, 2014

Currently there is a bug: when we see criu's mount namespace,
we go to the "out" label and don't validate mounts.
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

5a101d83

proc: Don't use FILE * to reach children · 1ebd56b0

Pavel Emelyanov authored Sep 17, 2014

The same reasoning as for the personality file -- switch to
plain open + read + close.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

1ebd56b0

proc: Don't use FILE* for reading personality · f3bee6d5

Pavel Emelyanov authored Sep 17, 2014

It turned out that fdopen (used in fopen_proc) always maps
a 4k buffer for reads, and this buffer gets unmapped later
on fclose.

Taking into account the number of proc files we read (~20
per task plus one file per opened file descriptor), this
mmap+munmap results in quite a lot of wasted CPU time.

E.g. for a container of 20 tasks we have 1000 calls taking
~8% of total dump time.

So let's first stop doing this for the simple cases -- one-line
proc files.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

f3bee6d5
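
The open + read + close approach described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: read_small_file() and read_hex_file() are hypothetical names, and the sketch assumes the file content fits in a single read(), which holds for one-line proc files.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read a short file into buf using plain open + read + close, so no
 * stdio stream (and no stream buffer) is created.
 * Returns the number of bytes stored, not counting the terminating NUL,
 * or -1 on error. read_small_file() is a hypothetical name.
 */
static ssize_t read_small_file(const char *path, char *buf, size_t size)
{
	ssize_t n;
	int fd;

	if (size == 0)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Assumes the content fits in one read(), as one-line files do. */
	n = read(fd, buf, size - 1);
	close(fd);
	if (n < 0)
		return -1;

	buf[n] = '\0';
	return n;
}

/* Parse a file holding one hexadecimal number, e.g. a personality value. */
static int read_hex_file(const char *path, unsigned long *val)
{
	char buf[32];
	char *end;

	if (read_small_file(path, buf, sizeof(buf)) <= 0)
		return -1;

	*val = strtoul(buf, &end, 16);
	return end == buf ? -1 : 0;
}
```

A caller would pass a path such as /proc/self/personality; the buffer lives on the caller's stack, so nothing is mapped or unmapped per file.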

plugin: Explicit assign plugin hooks · d36c4058

Cyrill Gorcunov authored Sep 18, 2014

So it won't depend on the order of declaration.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

d36c4058

18 Sep, 2014 13 commits

helpers: Create helpers with shared files and fs · cc2f2ebb

Pavel Emelyanov authored Sep 12, 2014

They don't change these objects, so can share them
with parent (will be created slightly faster :) ).

The plan is to make them CLONE_VM, but it's not that
easy.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

cc2f2ebb

rst: Don't allocate page for child stack (v2) · cc4492e1

Pavel Emelyanov authored Sep 15, 2014

When clone-ing kids we can set their stack on current, as
it will anyway be COW-ed later. One thing to note -- we do
need to reserve some space on the stack for glibc's arguments
and retcode allocation. 128 bytes should be enough for 16
pointers while clone has 5 arguments.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

cc4492e1

proc: Use fopen_proc instead of fopen("/proc...") · d3b63428

Pavel Emelyanov authored Sep 17, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

d3b63428

proc: Use fopen_proc in fdinfo parsing · 6e960f1f

Pavel Emelyanov authored Sep 16, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

6e960f1f

remap: add remap_type field and use it · 6b70e4ad

Tycho Andersen authored Sep 17, 2014

Maintain backwards compatibility for old images, but don't set the REMAP_GHOST
bit going forward, only use the remap_type field.

v2: * preserve remap_id in GHOST_REMAP case
    * protobuf field is remap_type enum not u32
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

6b70e4ad

restore: return -1 if fail · 2dcafd14

Ruslan Kuprieiev authored Sep 12, 2014

In cr_dump_tasks() we expect restore_root_task to return < 0 if
an error occurs.
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

2dcafd14

cg: Fix separator search in parse_task_cgroup · 64a7aa55

Cyrill Gorcunov authored Sep 16, 2014

If there is no separator in the first place, we should
avoid the implicit + 1, which makes @name = 1 in the worst case.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

64a7aa55
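
The pitfall described in the commit above is that adding 1 to a NULL result from a separator search yields the bogus pointer value 1. A minimal sketch of the guarded form, assuming a strchr()-style search; after_separator() is a hypothetical helper, not the function from the patch:

```c
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer just past the first occurrence of sep in s,
 * or NULL when sep is absent. Writing strchr(s, sep) + 1 without
 * the check would turn a NULL result into the invalid pointer 1.
 * after_separator() is a hypothetical helper name.
 */
static char *after_separator(char *s, int sep)
{
	char *p = strchr(s, sep);

	return p ? p + 1 : NULL;
}
```

The caller can then treat NULL as a malformed line instead of dereferencing an invalid address.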

fixed kernel version detection · fc983814

Matthias Neuer authored Sep 18, 2014

My debian testing produces the following output for uname:
$ uname -r
3.14-2-amd64

and so:
$ set -- `uname -r | sed 's/\./ /g'`
$ echo $1
3
$ echo $2
14-2-amd64

this causes zdtm.sh to fail for me on line 293:
[ $1 -eq 3 -a $2 -ge 11 ] && return 0

because "14-2-amd64 -ge 11" is false.
Signed-off-by: Matthias Neuer <matthias.neuer@uni-ulm.de>
Reviewed-by: Christopher Covington <cov@codeaurora.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

fc983814
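
The failure above comes from treating "14-2-amd64" as a number. The same check can be sketched in C, where sscanf's %d stops at the first non-digit and so ignores the distribution suffix. This is only an illustration of the parsing idea, not the zdtm.sh fix; parse_release() and release_at_least() are hypothetical names, and the comparison is generalized beyond the "major equals 3" test used in the script.

```c
#include <stdio.h>

/*
 * Extract major and minor from a release string such as "3.14-2-amd64".
 * %d stops at the first non-digit, so the "-2-amd64" suffix is ignored.
 * Returns 0 on success, -1 if the string does not start with "N.N".
 */
static int parse_release(const char *release, int *major, int *minor)
{
	return sscanf(release, "%d.%d", major, minor) == 2 ? 0 : -1;
}

/* Return 1 if release is at least want_major.want_minor, else 0. */
static int release_at_least(const char *release, int want_major, int want_minor)
{
	int major, minor;

	if (parse_release(release, &major, &minor) < 0)
		return 0;
	if (major != want_major)
		return major > want_major;
	return minor >= want_minor;
}
```

The release string itself could come from uname(2), whose utsname.release field holds the same text that `uname -r` prints.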

security: change CR_FD_PERM from rw-rw-r-- to rw-r--r-- · ada46644

Ruslan Kuprieiev authored Sep 16, 2014

This allows only root to modify images by default.
When criu is used with the suid bit set, the group of the images is set
to the user's group, which is not safe given the current CR_FD_PERM.
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

ada46644
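
For reference, the two permission strings in the commit above correspond to the following mode bits. The macro names OLD_FD_PERM and NEW_FD_PERM are made up for this sketch; the commit itself only changes the value of CR_FD_PERM.

```c
#include <sys/stat.h>

/* rw-rw-r-- (0664): the group can modify image files. */
#define OLD_FD_PERM (S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH)

/* rw-r--r-- (0644): only the owner can modify image files. */
#define NEW_FD_PERM (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)
```

The only difference between the two is S_IWGRP, the group write bit.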

timerfd: Setup @ticks only if nonzero · 8f2cb6b2

Cyrill Gorcunov authored Sep 16, 2014

If @ticks is zero the kernel returns an error.
On creation @ticks is already zero, so set up
@ticks only if a real (nonzero) value is present.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

8f2cb6b2

ptrace: Skip GETREGS on exits from syscalls when possible · 6dc00746

Pavel Emelyanov authored Sep 15, 2014

PTRACE_SYSCALL traps a task twice -- first on entry into
and then on exit from a syscall. If we trace a single task (and
we do this twice per task on dump) we may skip half of all
getregs calls -- on exit we don't need them.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>

6dc00746

util: mkdirp -- Print exactly what is failed · 19018622

Cyrill Gorcunov authored Sep 17, 2014

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

19018622

parasite: remove useless check from parasite_stop_on_syscall() (v2) · 232cb4b3

Andrey Vagin authored Sep 17, 2014

We have the same check a few lines above.

v2: fix the subject
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

232cb4b3

16 Sep, 2014 3 commits

mount: handle a circular reference in mount tree · 7fa98a30

Andrey Vagin authored Sep 12, 2014

$ cat /proc/self/mountinfo
...
1 1 0:2 / / rw - rootfs rootfs rw,size=373396k,nr_inodes=93349
...

You can see that mnt_id and parent_mnt_id are equal here.
This patch interprets this case as a root mount in a tree.

0'th mount is rootfs, which is mounted in init_mount_tree().

We don't see it when the system does a chroot, because of

static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
	...
	/* mountpoints outside of chroot jail will give SEQ_SKIP on this */
	err = seq_path_root(m, &mnt_path, &root, " \t\n\\");

Cc: beproject criu <beprojectcriu@gmail.com>
Cc: Christopher Covington <cov@codeaurora.org>
Reported-by: beproject criu <beprojectcriu@gmail.com>
Reviewed-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

7fa98a30
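
The self-parented case from the commit above can be detected from the first two fields of a mountinfo line. The sketch below is a simplified illustration, not the CRIU parser; parse_mnt_ids() and is_self_parented() are hypothetical names.

```c
#include <stdio.h>

/*
 * Parse the first two fields of a mountinfo line: the mount ID and
 * the parent mount ID. Returns 0 on success, -1 on a malformed line.
 */
static int parse_mnt_ids(const char *line, int *mnt_id, int *parent_mnt_id)
{
	return sscanf(line, "%d %d", mnt_id, parent_mnt_id) == 2 ? 0 : -1;
}

/* Return 1 if the mount names itself as its parent, else 0. */
static int is_self_parented(const char *line)
{
	int id, parent;

	if (parse_mnt_ids(line, &id, &parent) < 0)
		return 0;
	return id == parent;
}
```

A tree builder could treat such an entry as the root instead of looking for a separate parent, which is how the commit message describes the new behaviour.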

rst: Don't allocate PATH_MAX for /proc/self realink · 4eec4c6e

Pavel Emelyanov authored Sep 12, 2014

Pid is 10 chars maximum.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

4eec4c6e

cg: proc_parse -- Don't compare cgroup paths · b99b76b0

Cyrill Gorcunov authored Sep 15, 2014

When we compare sets in cg_set_compare() we presume that controller
names are properly sorted, but because of the use of strcmp(cc->path, path)
that's not true. In particular, if there are two identical sets which
differ in paths only

(00.126812) cg:  `- New css ID 2
(00.127051) cg:     `- [memory] -> [/vz-1]
(00.127079) cg:     `- [name=systemd] -> [/vz-1]
(00.127108) cg:     `- [net_cls] -> [/vz-1]

(00.239829) cg:  `- New css ID 3
(00.240067) cg:     `- [memory] -> [/vz-1]
(00.240096) cg:     `- [net_cls] -> [/vz-1]
(00.240154) cg:     `- [name=systemd] -> [/vz-1/system.slice/dbus.service]

we currently refuse to dump such a configuration. Thus remove
the path comparison from the first place.

CC: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

b99b76b0

15 Sep, 2014 1 commit

mount: validate mounts only once on dump (v3) · aadc309a

Andrey Vagin authored Sep 12, 2014

mntinfo contains mounts from all namespaces, so we can validate it only
once after collecting mounts.

v2: add a fake comment about goto
v3: add a real comment about goto
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

aadc309a

12 Sep, 2014 2 commits

service: service should compile on Ubuntu 14.04 · 32b032b6

Tycho Andersen authored Sep 10, 2014

I'm not quite sure what the difference is (I have gcc 4.8, but there are
probably also header differences), but when I compile the service on 14.04 I
get:

  CC       cr-service.o
cr-service.c: In function ‘start_page_server_req’:
cr-service.c:536:8: error: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Werror=unused-result]
   write(start_pipe[1], &ret, sizeof(ret));
        ^
cr-service.c:544:6: error: ignoring return value of ‘read’, declared with attribute warn_unused_result [-Werror=unused-result]
  read(start_pipe[0], &ret, sizeof(ret));
      ^
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Tested-by: https://travis-ci.org/avagin/criu/builds/34990769
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

32b032b6
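
The warn_unused_result errors above go away once the return values of write() and read() are actually checked. How the commit does this is not shown here; one common pattern, sketched under that caveat, is a pair of helpers that loop until the whole buffer is transferred. write_all() and read_all() are hypothetical names.

```c
#include <errno.h>
#include <unistd.h>

/* Write exactly len bytes. Returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Read exactly len bytes. Returns 0 on success, -1 on error or early EOF. */
static int read_all(int fd, void *buf, size_t len)
{
	char *p = buf;

	while (len > 0) {
		ssize_t n = read(fd, p, len);

		if (n < 0 && errno == EINTR)
			continue;	/* interrupted, retry */
		if (n <= 0)
			return -1;	/* error or unexpected end of file */
		p += n;
		len -= (size_t)n;
	}
	return 0;
}
```

With helpers like these, a call such as write(start_pipe[1], &ret, sizeof(ret)) becomes a checked write_all(start_pipe[1], &ret, sizeof(ret)) whose failure can be reported.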

zdtm: fix msg reporting · 7493eef9

Konstantin Neumoin authored Sep 11, 2014

avoid err() for regular msg reporting
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

7493eef9

10 Sep, 2014 5 commits

zdtm/mountpoints: add "unknown" options to reproduce the previous bug · 75768982

Andrey Vagin authored Sep 09, 2014

tmpfs has the "size" option, which is not standard.

Execute zdtm/live/static/mountpoints
./mountpoints --pidfile=mountpoints.pid --outfile=mountpoints.out
Dump 2737
WARNING: mountpoints returned 1 and left running for debug needs
Test: zdtm/live/static/mountpoints, Result: FAIL
==================================== ERROR ====================================
Test: zdtm/live/static/mountpoints, Namespace:
Dump log   : /root/git/criu/test/dump/static/mountpoints/2737/1/dump.log
--------------------------------- grep Error ---------------------------------
(00.146444) Error (mount.c:399): Two shared mounts 50, 67 have different sets of children
(00.146460) Error (mount.c:402): 67:./zdtm_mpts/dev/share-1 doesn't have a proper point for 54:./zdtm_mpts/dev/share-3/test.mnt.share
(00.146820) Error (cr-dump.c:1921): Dumping FAILED.
------------------------------------- END -------------------------------------
================================= ERROR OVER =================================
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

75768982

mount: strip options for all mounts · f88d72d0

Andrey Vagin authored Sep 09, 2014

Currently we strip options from only one of the brothers, but
mount_equal() thinks that two brothers should have the same options.

Execute zdtm/live/static/mountpoints
./mountpoints --pidfile=mountpoints.pid --outfile=mountpoints.out
Dump 2737
WARNING: mountpoints returned 1 and left running for debug needs
Test: zdtm/live/static/mountpoints, Result: FAIL
==================================== ERROR ====================================
Test: zdtm/live/static/mountpoints, Namespace:
Dump log   : /root/git/criu/test/dump/static/mountpoints/2737/1/dump.log
--------------------------------- grep Error ---------------------------------
(00.146444) Error (mount.c:399): Two shared mounts 50, 67 have different sets of children
(00.146460) Error (mount.c:402): 67:./zdtm_mpts/dev/share-1 doesn't have a proper point for 54:./zdtm_mpts/dev/share-3/test.mnt.share
(00.146820) Error (cr-dump.c:1921): Dumping FAILED.
------------------------------------- END -------------------------------------
================================= ERROR OVER =================================
Reported-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

f88d72d0

mount: don't skip checks in validate_mounts() · cb738520

Andrey Vagin authored Sep 10, 2014

"continue" is called by mistake, so we skip a few checks for shared
mounts without siblings.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

cb738520

restore: Introduce the --restore-sibling option · 53957fad

Pavel Emelyanov authored Sep 10, 2014

We have a slight mess with how criu restores root task.
Right now we have the following options.

1) CLI
	a) Usually
	task calling criu
	 `- criu
	     `- root restored task

	b) when --restore-detached AND root has pdeath_sig

	task calling criu
	 `- criu
	 `- root restored task

2) Library/SWRK
	task using lib/swrk
	 `- criu
	 `- root restored task

3) Standalone service
	a) Usually
	service
	 `- service sub task
	     `- root restored task

	b) when root has pdeath_sig
	criu service
	 `- criu sub task
	 `- root restored task

It would be better if CRIU always restored the root task as a sibling,
but we have 3 constraints:

First, the case 1.a is kept for zdtm to run tests in pid namespaces
on 3.11, which in turn doesn't allow CLONE_PARENT | CLONE_NEWPID.

Second, CLI w/o --restore-detach waits for the restored task to die and
this behavior can be "expected" already.

Third, in case of standalone service tasks shouldn't become service's
children.

And I have one "plan". The p.haul project while live migrating tasks
on destination node starts a service, which uses library/swrk mode. In
this case the restored processes become p.haul service's kids which is
also not great.

That said, here's the option called --restore-child that pairs the
--restore-detach like this:

* detached AND child:

task
 `- criu restore (exits at the end)
 `- root task

The root task will become the task's child.
This will be the default for library/swrk.
This is what LXC needs.

* detach AND !child

task
 `- criu restore (exits at the end)
     `- root task

The root task will get re-parented to init.
This will be compatible with 1.3.
This will be the default for the standalone service and,
as I wish, for the p.haul case.

* !detach AND child

task
 `- criu restore (waits for root task to die)
 `- root task

This should be deprecated, so that criu restore doesn't mess
 with task <-> root task signalling.

* !detach AND !child

task
 `- criu restore (waits for root task to die)
     `- root task

This is how plain criu restore works now.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@openvz.org>

53957fad

restore: use root_as_sibling only after defining it · 1ff2500b

Tycho Andersen authored Sep 09, 2014

root_as_sibling was used in criu_signals_setup(), but was only defined later
(when forking the root task for the first time). This meant that the
SA_NOCLDSTOP was never masked off, which meant SIGCHLD was never delivered
after ptracing the root task. Thus, when a child of the root task died
(e.g. from cr_system), the root task sat in PTRACE_STOP, and the restore task
never PTRACE_CONT'd, resulting in a deadlock.

Instead, we only unmask SA_NOCLDSTOP right before we PTRACE_SEIZE, after the
value is defined.

v2: re-work the condition for CLONE_PARENT
v3: move unmasking of SA_NOCLDSTOP to restore_root_task
v4: keep all the comments in the original code
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

1ff2500b

09 Sep, 2014 2 commits

scripts: Fix path assignment · f45060e1

Pavel Emelyanov authored Sep 09, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

f45060e1

Add a convenience shell script for Docker container C/R · bd8b5dd5

Saied Kazemi authored Sep 03, 2014

Since the command line for checkpointing and restoring Docker containers
is very long and there are some manual steps involved before restoring
a container, it's much easier to use a shell script to automate the work.

One would simply do:

$ sudo docker_cr.sh -c
$ sudo docker_cr.sh -r
Signed-off-by: Saied Kazemi <saied@google.com>
Acked-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

bd8b5dd5

05 Sep, 2014 7 commits

zdtm/inotify00: fix expected sets of events · 5fcebf2d

Andrey Vagin authored Sep 05, 2014

zdtm.sh with zero iterations of dumping/restoring checks correctness of
tests.

$ bash test/zdtm.sh -i 0 zdtm/inotify00
Output file: /root/git/orig/criu/test/zdtm/live/static/inotify00.out
------------------------------------------------------------------------------
19:16:29.601:  6905: 	unlink 02       : event      0x200 -> IN_DELETE
19:16:29.602:  6905: 	unlink 02       : event      0x200 -> IN_DELETE
19:16:29.602:  6905: 	unlink 02       : event        0x8 -> IN_CLOSE_WRITE
19:16:29.602:  6905: 	unlink 02       : event        0x8 -> IN_CLOSE_WRITE
19:16:29.602:  6905: 	unlink 02       : event      0x400 -> IN_DELETE_SELF
19:16:29.602:  6905: 	unlink 02       : event     0x8000 -> IN_IGNORED
19:16:29.602:  6905: 	unlink 02       : read  6 events
19:16:29.614:  6905: 	after           : event        0x8 -> IN_CLOSE_WRITE
19:16:29.614:  6905: 	after           : read  1 events
19:16:29.614:  6905: FAIL: inotify00.c:217: Unhandled events in emask 0x200 -> IN_DELETE (errno = 11 (Resource temporarily unavailable))
------------------------------------- END -------------------------------------
================================= ERROR OVER =================================

This patch removes logic about linked files, because it's useless.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

5fcebf2d

locks: fix up a device returned by stat() for btrfs (v4) · e248e65c

Andrew Vagin authored Sep 03, 2014

BTRFS returns the subvolume dev-id instead of the superblock dev-id;
in such a case return the device obtained from mountinfo (i.e. subvolume0).

v2: fix up devices only for btrfs files.
v3: use phys_stat_dev_match instead of phys_stat_resolve_dev
v4: fix cosmetic whims

Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

e248e65c

zdtm: fix /proc/lock parsing · 61dbc976

Konstantin Neumoin authored Sep 05, 2014

11: POSIX  ADVISORY  WRITE 1 b6:a4111:136512 0 EOF
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

61dbc976
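
The sample line in the commit above has a device field whose parts are printed in hexadecimal (b6:a4111), which a decimal-only parser would mishandle. The following sketch parses such a line with sscanf, assuming the usual /proc/locks layout of id, lock kind, mode, access, pid, major:minor:inode, start and end. It is not the zdtm code; struct lock_entry and parse_lock_line() are hypothetical names.

```c
#include <stdio.h>

/* One parsed lock line; the field names are chosen for this sketch. */
struct lock_entry {
	int id;
	char kind[16];		/* e.g. POSIX */
	char mode[16];		/* e.g. ADVISORY */
	char access[16];	/* e.g. WRITE */
	int pid;
	unsigned int maj, min;	/* printed in hexadecimal */
	unsigned long ino;
	long long start;
	char end[32];		/* a number or the literal EOF */
};

/* Returns 0 if all ten fields were parsed, -1 otherwise. */
static int parse_lock_line(const char *line, struct lock_entry *e)
{
	int n = sscanf(line, "%d: %15s %15s %15s %d %x:%x:%lu %lld %31s",
		       &e->id, e->kind, e->mode, e->access, &e->pid,
		       &e->maj, &e->min, &e->ino, &e->start, e->end);

	return n == 10 ? 0 : -1;
}
```

The end field is kept as a string because it can be either an offset or EOF.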

fsnotify: Proceed dumping even if queue has data · 11d8642a

Cyrill Gorcunov authored Sep 05, 2014

It turns out that we can't be too strict about
queued events -- criu itself generates a number
of them and there is no clear way yet how to resolve
this situation. So defer "strict" mode for now
but print a warning.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

11d8642a

page-server: Don't setup options in parent task · b47b0201

Pavel Emelyanov authored Sep 04, 2014

When service starts page server all the preparations (log, wdir, img dir, etc.)
happen in parent task, then we fork page server.

This is OK for now, but when we serve several requests per connection, all
these resources will be leaked in the parent.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

b47b0201

service: Allow to server more requests after page-server start · 66170c6b

Pavel Emelyanov authored Sep 04, 2014

The problem with several requests is that criu leaks resources after
doing dump/restore. It's OK since the process exits anyway, but for
multiple requests per connection it's better to audit this thing.

For now -- allow further requests only after the page-server-start one.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

66170c6b

service: Do one exit point from cr_service_work · eed38acc

Pavel Emelyanov authored Sep 04, 2014

That's preparation to "several requests per connection" patch.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

eed38acc