Commits · 69535342cb9a02d922e15cf55e96eda14a8aa19a · zhul / criu

30 Oct, 2018 40 commits

mount: use propagation groups in propagate_mount replacing excess search · 69535342

Pavel Tikhomirov authored Jul 10, 2018

This also fixes the false-propagation problem of a mount propagating to
itself if it is in its parent's share.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

69535342

mount: improve can_mount_now using propagation groups · 545b0736

Pavel Tikhomirov authored Jul 10, 2018

1) redo waiting for the parents of a propagation group to be mounted, using
the pre-found propagation groups
2) for a shared mount, wait for the children of that shared group which have
no propagation in our shared mount

(2) is effectively support for non-uniform shares, meaning two
mounts of a shared group can now have different sets of children. We will
mount them in the right order, but propagate_mount and validate_shared
still prevent c/r-ing such shares; the former will be fixed and
the latter removed in separate (next) patches.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

545b0736

mount: put all mounts which propagate from each other to a list · bc930d12

Pavel Tikhomirov authored Jul 10, 2018

This information will help improve the restore of tricky mount
configurations.

Function same_propagation_group checks whether two mounts were created
simultaneously through shared mount propagation; the main part of
this check is that they must be in exactly the same place inside the share
of their parents.

Function root_path_from_parent prints the mountpoint path
relative to the root of the parent's share, by first subtracting the
parent's mountpoint from our mountpoint and then prepending the parent's
root path (relative to the root of its file system), e.g.:

id	parent_id	root	mountpoint
1	0		/	/
2	1		/	/parent_a
3	1		/dir	/parent_b
4	2		/	/parent_a/dir/a
5	3		/	/parent_b/a

(Let 2 and 3 be a shared group)

For mount 4 root_path_from_parent gives:
"/parent_a/dir/a" - "/parent_a" == "/dir/a"
"/" + "/dir/a" == "/dir/a"

For mount 5:
"/parent_b/a" - "/parent_b" == "/a"
"/dir" + "/a" == "/dir/a"

So mounts 4 and 5 are a propagation group.
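
The arithmetic above can be sketched in Python. This is only an illustration of the two steps (subtract, then prepend), not the C code from the patch; the function signature and the use of posixpath are assumptions made for the example.

```python
import posixpath


def root_path_from_parent(mountpoint, parent_mountpoint, parent_root):
    """Illustrative re-implementation; the real helper lives in criu/mount.c."""
    # Step 1: subtract the parent's mountpoint from our mountpoint.
    relative = posixpath.relpath(mountpoint, parent_mountpoint)
    # Step 2: prepend the parent's root path (relative to its file system root).
    return posixpath.normpath(posixpath.join(parent_root, relative))


# Mounts 4 and 5 from the table above, with the (mountpoint, root) of parents 2 and 3.
path_4 = root_path_from_parent("/parent_a/dir/a", "/parent_a", "/")
path_5 = root_path_from_parent("/parent_b/a", "/parent_b", "/dir")
```

Both calls yield the same path, which is the condition the example uses to place mounts 4 and 5 in one propagation group (given that 2 and 3 are a shared group).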
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

bc930d12

zdtm: check children of shared slaves restore · 24bd5fcf

Pavel Tikhomirov authored Jul 10, 2018

495 494 0:62 / /zdtm/static/shared_slave_mount_children.test/share rw,relatime shared:235 - tmpfs share rw
496 494 0:62 / /zdtm/static/shared_slave_mount_children.test/slave1 rw,relatime shared:236 master:235 - tmpfs share rw
497 494 0:62 / /zdtm/static/shared_slave_mount_children.test/slave2 rw,relatime shared:236 master:235 - tmpfs share rw
498 496 0:63 / /zdtm/static/shared_slave_mount_children.test/slave1/child rw,relatime shared:237 - tmpfs child rw
499 497 0:63 / /zdtm/static/shared_slave_mount_children.test/slave2/child rw,relatime shared:237 - tmpfs child rw

Before the fix we had:

(00.167574)      1: Error (criu/mount.c:1769): mnt: A few mount points can't be mounted
(00.167577)      1: Error (criu/mount.c:1773): mnt: 498:496 / /tmp/.criu.mntns.o2Op5j/9-0000000000/zdtm/static/shared_slave_mount_children.test/slave1/child child
(00.167580)      1: Error (criu/mount.c:1773): mnt: 497:494 / /tmp/.criu.mntns.o2Op5j/9-0000000000/zdtm/static/shared_slave_mount_children.test/slave2 share
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

24bd5fcf

mount: fix can_mount_now to wait children of master's share properly · 64586567

Pavel Tikhomirov authored Jul 10, 2018

We should not use the ->bind link for checking the master's children: if we
have two slaves shared between each other, the one mounted first will
replace the ->bind link for the other, and that will break restore.

Also, while at it: if we do not want doubled mounts and want to
prohibit propagation to slaves on restore, we likely want all children of
the whole master's share mounted before the slave.

JFYI: Actually this restriction is very strict and some cases will fail
to restore, for instance (hope nobody does so):

mkdir /test
mount -t tmpfs test /test
mount --make-private /test
mkdir /test/{share,slave}
mount -t tmpfs share /test/share --make-shared
mount --bind /test/share/ /test/slave/
mount --make-slave  /test/slave
mount --make-shared /test/slave
mkdir /test/share/slave
mount --bind /test/slave/ /test/share/slave/

cat /proc/self/mountinfo | grep test
524 612 0:69 / /test rw,relatime - tmpfs test rw
570 524 0:73 / /test/share rw,relatime shared:879 - tmpfs share rw
571 524 0:73 / /test/slave rw,relatime shared:942 master:879 - tmpfs share rw
602 570 0:73 / /test/share/slave rw,relatime shared:942 master:879 - tmpfs share rw
603 571 0:73 / /test/slave/slave rw,relatime shared:943 master:942 - tmpfs share rw

Here 603 is a propagation of 602 from master 570 to slave 571, and it is
the only way to get such a mount, as 571 and 602 are in one shared group
now and all later mounts to them will propagate between them and create
duplicated mounts. So to create the real 603 without dups we need to have
/test/slave mounted before /test/share/slave, which contradicts the
current assumption.
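
To make the sharing relations in the listing above easier to see, here is a small Python sketch that reads the optional "shared:N" and "master:N" tags from mountinfo lines. The helper name and the dictionary layout are hypothetical; CRIU does its own parsing in C.

```python
from collections import defaultdict

MOUNTINFO = """\
524 612 0:69 / /test rw,relatime - tmpfs test rw
570 524 0:73 / /test/share rw,relatime shared:879 - tmpfs share rw
571 524 0:73 / /test/slave rw,relatime shared:942 master:879 - tmpfs share rw
602 570 0:73 / /test/share/slave rw,relatime shared:942 master:879 - tmpfs share rw
603 571 0:73 / /test/slave/slave rw,relatime shared:943 master:942 - tmpfs share rw
"""


def parse_mountinfo_line(line):
    """Hypothetical helper: extract ids, paths and propagation tags."""
    fields = line.split()
    sep = fields.index("-")  # optional fields end at the lone "-"
    entry = {
        "id": int(fields[0]),
        "parent": int(fields[1]),
        "root": fields[3],
        "mountpoint": fields[4],
        "shared": None,
        "master": None,
    }
    for tag in fields[6:sep]:
        key, _, value = tag.partition(":")
        if key in ("shared", "master"):
            entry[key] = int(value)
    return entry


entries = [parse_mountinfo_line(line) for line in MOUNTINFO.splitlines()]

shared_groups = defaultdict(list)  # shared id -> mount ids in that group
slaves = defaultdict(list)         # master id -> mount ids slaved to it
for e in entries:
    if e["shared"] is not None:
        shared_groups[e["shared"]].append(e["id"])
    if e["master"] is not None:
        slaves[e["master"]].append(e["id"])
```

With this grouping, 571 and 602 land in the same shared group (942) and 603 is the only slave of that group, which is the situation described above.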
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

64586567

zdtm: add a test for unsupported children collision · fc01e18b

Pavel Tikhomirov authored Jul 10, 2018

This test is not automatic because the behaviour changes after kernel
v4.11; on older kernels we get a children collision:

817 188 0:48 / /zdtm/static/unsupported_children_collision.test/share1 rw,relatime shared:942 - tmpfs share rw
> 818 817 0:124 / /zdtm/static/unsupported_children_collision.test/share1/child rw,relatime shared:943 - tmpfs child1 rw
819 188 0:48 / /zdtm/static/unsupported_children_collision.test/share2 rw,relatime shared:942 - tmpfs share rw
820 819 0:125 / /zdtm/static/unsupported_children_collision.test/share2/child rw,relatime shared:944 - tmpfs child2 rw
> 821 817 0:125 / /zdtm/static/unsupported_children_collision.test/share1/child rw,relatime shared:944 - tmpfs child2 rw
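
The two marked lines share both parent and mountpoint. A simplified Python sketch of spotting that pattern follows; it assumes a collision means "same parent id and same mountpoint", which may be narrower or wider than what the actual helper checks, and the function name is made up for the example.

```python
from collections import defaultdict

# The listing above, without the "> " markers.
MOUNTINFO = """\
817 188 0:48 / /zdtm/static/unsupported_children_collision.test/share1 rw,relatime shared:942 - tmpfs share rw
818 817 0:124 / /zdtm/static/unsupported_children_collision.test/share1/child rw,relatime shared:943 - tmpfs child1 rw
819 188 0:48 / /zdtm/static/unsupported_children_collision.test/share2 rw,relatime shared:942 - tmpfs share rw
820 819 0:125 / /zdtm/static/unsupported_children_collision.test/share2/child rw,relatime shared:944 - tmpfs child2 rw
821 817 0:125 / /zdtm/static/unsupported_children_collision.test/share1/child rw,relatime shared:944 - tmpfs child2 rw
"""


def find_children_collisions(mountinfo_text):
    """Hypothetical check: group mounts by (parent id, mountpoint)."""
    by_place = defaultdict(list)
    for line in mountinfo_text.splitlines():
        fields = line.split()
        mount_id, parent_id, mountpoint = int(fields[0]), int(fields[1]), fields[4]
        by_place[(parent_id, mountpoint)].append(mount_id)
    # Keep only places occupied by more than one mount.
    return {place: ids for place, ids in by_place.items() if len(ids) > 1}


collisions = find_children_collisions(MOUNTINFO)
```

For this listing the only colliding place is share1/child under parent 817, occupied by mounts 818 and 821.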
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

fc01e18b

mount: add helper to check unsupported children collision · f167a1dd

Pavel Tikhomirov authored Jul 10, 2018

See the more detailed explanation in the in-code comment.

note: Actually, before we remove validate_mounts (later in this
patchset) we likely won't get to this check and will fail earlier, as having
a children collision implies shared mounts with different sets of
children.

note: from v4.11 and mainstream kernel commit 1064f874abc0 ("mnt: Tuck mounts
under others instead of creating shadow/side mounts.") there will be no
more mount collisions.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

f167a1dd

test rpc: remove unnecessary import, close fd · 35fbc373

Adrian Reber authored Jun 29, 2018

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

35fbc373

fdstore: Unlimit fdstore queue on start · 0a859275

Cyrill Gorcunov authored Jun 27, 2018

We use fdstore intensively, for example when handling
bindmounted sockets and ghost dgram sockets. The system
limit for the per-socket queue may not be enough if someone
generates lots of ghost sockets (150 and more have been
detected on default Fedora 27).

To make it operable, let's unlimit the fdstore queue size
on startup.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

0a859275

travis: fix rawhide test by also installing sudo · 077409c1

Adrian Reber authored Jul 09, 2018

Signed-off-by: Adrian Reber <areber@redhat.com>
Acked-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

077409c1

zdtm/static: add a test to check epoll file descriptors · bdbd7c8f

Andrei Vagin authored Jul 04, 2018

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

bdbd7c8f

epoll: Use epoll queues to speedup multiple duped fds · 4e8ca613

Cyrill Gorcunov authored Jul 04, 2018

When we are dumping an epoll and one of the target fds has been
duped, we can reuse the already collected fds rbtree to find the
proper target. We handle it in a lazy way:

 - try a plain regular bsearch first; if none of the
   targets are duped we checkpoint the epoll immediately

 - if bsearch fails we put this epoll entry into a queue
   and dump it later, when all other files in the
   process are already dumped. At that moment the fds tree
   should already have all target files in it, thus
   we can simply look them up
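
The lazy scheme can be modelled with a toy Python sketch. Everything here is an assumption made for illustration: the sorted list stands in for the fds searched with bsearch, the set stands in for the rbtree that is complete once all other files are dumped, and the function names are invented.

```python
import bisect


def bsearch(sorted_fds, fd):
    """Binary search for fd in a sorted list of fd numbers."""
    i = bisect.bisect_left(sorted_fds, fd)
    return i < len(sorted_fds) and sorted_fds[i] == fd


def dump_epolls(epolls, sorted_fds, full_tree):
    """Toy model: return (dumped order, names that had to be deferred)."""
    dumped, queue = [], []
    for name, targets in epolls:
        if all(bsearch(sorted_fds, t) for t in targets):
            dumped.append(name)            # fast path: dump immediately
        else:
            queue.append((name, targets))  # slow path: postpone
    # All other files are dumped by now, so the tree is complete.
    for name, targets in queue:
        missing = [t for t in targets if t not in full_tree]
        if missing:
            raise LookupError(f"{name}: target fds {missing} not found")
        dumped.append(name)
    return dumped, [name for name, _ in queue]


order, deferred = dump_epolls(
    epolls=[("ep_a", [3, 4]), ("ep_b", [4, 9])],
    sorted_fds=[3, 4, 5],
    full_tree={3, 4, 5, 9},
)
```

In this model ep_a takes the fast path, while ep_b is queued and only resolved after the tree is complete.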
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

4e8ca613

files: make_gen_id -- Promote to be general helper · 605b9be8

Cyrill Gorcunov authored Jul 04, 2018

It is used in files tree generation, and we will need
to reuse it for epoll's sake.

Also use the whole 64-bit offset to shuffle bits more.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

605b9be8

epoll: Add kid_lookup_epoll_tfd helper · 67bd254a

Cyrill Gorcunov authored Jul 04, 2018

To find target files with the help of our collected
rbtree.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

67bd254a

epoll: Exit with error if tfd is missing · d026ba1c

Cyrill Gorcunov authored Jul 04, 2018

If we can't find the target file descriptor we should
exit on dump with an error instead of skipping it.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

d026ba1c

epoll: Save fields of target files in eventpoll_tfd_entry · fa989970

Cyrill Gorcunov authored Jul 04, 2018

We will use them for fast lookup of target files.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

fa989970

epoll: Add kcmp_epoll check · ca8144b7

Cyrill Gorcunov authored Jul 04, 2018

To run epoll tests only where it is supported.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

ca8144b7

epoll: Align members in assignments · 071cc1e1

Cyrill Gorcunov authored Jul 04, 2018

For readability's sake
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

071cc1e1

epoll: Print efd id when showing targets · 3d280e02

Cyrill Gorcunov authored Jul 04, 2018

To make the efd:tfd mapping easier to figure out when reading the logs.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

3d280e02

epoll: Show tfd in decimal form · f15fe7cf

Cyrill Gorcunov authored Jul 04, 2018

For easier fd match when reading logs
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

f15fe7cf

epoll: Add support for multiple duped fds · 50e1be45

Cyrill Gorcunov authored Jul 04, 2018

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

50e1be45

epoll: Use kcmp to find proper target file · cfa3f405

Cyrill Gorcunov authored Jul 04, 2018

When the target file is obtained from epoll fdinfo (internally the
kernel keeps only the file _number_ inside) we have to check its
identity to make sure it is exactly the one which has been added
into the epoll engine. The only proper way is to use the kcmp syscall.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

cfa3f405

epoll: Use real file transferred for target fds check · 1804d6f3

Cyrill Gorcunov authored Jul 04, 2018

When we are checkpointing epoll targets we assume that the target
file belongs to the process we are on. This is of course not
necessarily true. Without kernel support the only thing we can do is
compare fd numbers with the ones present in epoll fdinfo. When an fd number
matches we assume that it is indeed the file which has been added into the epoll.

This won't cover the case when a file has been moved to some other
number and a new one is reopened in its place. Such a scenario will
trigger a false positive and we can't do anything about it.

In the next patches, with kernel help, we will make a precise check of
file identity.
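
The false-positive scenario can be reproduced in a few lines of Python. This is a sketch only: it compares (st_dev, st_ino) from fstat as a stand-in for file identity, which is coarser than the kernel-assisted check mentioned above, and it uses pipes instead of a real epoll set.

```python
import os


def file_identity(fd):
    """Illustrative identity: device and inode of the open file."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


def demo():
    r, w = os.pipe()
    watched_number = r                  # the number epoll fdinfo would show
    watched_identity = file_identity(r)

    moved = os.dup(r)                   # the watched file moves to another number
    other_r, other_w = os.pipe()
    os.dup2(other_r, r)                 # a different file takes over the old number

    number_matches = (r == watched_number)
    same_file_at_number = file_identity(r) == watched_identity
    original_still_open = file_identity(moved) == watched_identity

    for fd in (r, w, moved, other_r, other_w):
        os.close(fd)
    return number_matches, same_file_at_number, original_still_open


result = demo()
```

The number still matches, but the file behind it is no longer the one that was watched, while the original file lives on under the new number.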
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

1804d6f3

epoll: Pass drained fds to dump_one_file · 070720d4

Cyrill Gorcunov authored Jul 04, 2018

In epoll dumping we will need the whole set of fds to investigate
the targets, so pass this parameter down to epoll code.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

070720d4

epoll: kdat -- Check if we have KCMP_EPOLL_TFD support · 08603fa6

Cyrill Gorcunov authored Jul 04, 2018

We will need it to make sure the target files in epolls are present
in the current process.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

08603fa6

kcmp: Add epoll definitions · 7c72478f

Cyrill Gorcunov authored Jul 04, 2018

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

7c72478f

kcmp: Drop empty line at EOF · 40c986ca

Cyrill Gorcunov authored Jul 04, 2018

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

40c986ca

kcmp: Beautify kcmp-ids.h · 77a6fcb9

Cyrill Gorcunov authored Jul 04, 2018

 - align members
 - use pid_t type for PIDs
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

77a6fcb9

kcmp: Cleanup sources · 6cb36a1e

Cyrill Gorcunov authored Jul 04, 2018

 - switch to uintX types (to finally drop uX;
   it isn't worth carrying this type)

 - instead of including the huge util.h, include only the
   files which are really needed: log, xmalloc, compiler
   and bug
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

6cb36a1e

zdtm: handle errors of make · d6ec8347

Andrei Vagin authored Jul 02, 2018

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

d6ec8347

images: tty -- Reserve entries for multiple devpts support · b58eed2a

Cyrill Gorcunov authored Mar 17, 2017

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

b58eed2a

images: sk-netlink -- Reserve entries for netlink queued messages · c983d5f9

Cyrill Gorcunov authored Mar 17, 2017

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

c983d5f9

images: sk-inet -- Reserve entries for IP raw sockets · 7a680d7b

Cyrill Gorcunov authored Mar 17, 2017

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

7a680d7b

images: remap-file-path -- Reserve entries for spfs manager · 26aadb5a

Cyrill Gorcunov authored Mar 17, 2017

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

26aadb5a

forking: Use last_pid_mutex for synchronization during clone() · 53a11dfc

Kirill Tkhai authored May 16, 2017

Before this patch we used flock to order task creation,
but this approach is costly. It took 5 syscalls to synchronize
the creation of a single child:

1) open()
2) flock(LOCK_EX)
3) flock(LOCK_UN)
4) close() in parent
5) close() in child

This patch introduces a more efficient way of synchronization,
which executes only 2 syscalls. We use last_pid_mutex,
and the lower syscall count is clearly better.

v2: Don't use flock() at all
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

53a11dfc

forking: Introduce last_pid_mutex and helpers · 9a64e003

Kirill Tkhai authored May 16, 2017

Introduce a mutex for synchronizing access to the ns_last_pid file
on restore.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

9a64e003

cr-check: Make compat_cr warning arch-independent · dc1e3b59

Dmitry Safonov authored Jul 25, 2017

I think we should warn the user when we can't C/R compatible
applications. That's valid for archs other than x86 too.
Let's correct the message so that it also suits non-x86.
Reported-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

dc1e3b59

restore: don't call free_mappings for an uninitialized list · 33fb955e

Andrei Vagin authored Oct 20, 2017

    vma_area_list@entry=0x818) at criu/cr-dump.c:107
107             list_for_each_entry_safe(vma_area, p, &vma_area_list->h, list)
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

33fb955e

restore: set uid and gid to zero, otherwise PR_SET_MM_EXE_FILE can fail · 4d164327

Andrei Vagin authored Jul 10, 2018

When a non-root user runs "criu restore" and criu has the suid bit,
a process will run with non-zero uid and gid.

Before the 4.13 kernel (4d28df6152aa "prctl: Allow local CAP_SYS_ADMIN
changing exe_file"), PR_SET_MM_EXE_FILE fails if uid or gid isn't zero.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

4d164327

tests: fix builds on alpine and centos · ae55a6cc

Adrian Reber authored Jun 28, 2018

Install sudo, create test user with ID 1000, install bash,
fix pidfile creation and pidfile chmod.

v2:
 * use sleep to give the criu daemon some time to start up

v3:
 * Andrei is of course right and sleep is not a good solution.
   After adding --status-fd support to criu service, this
   is how we now detect that criu is ready.

v4:
 * This was much more complicated than expected which is related
   to the different versions of the tools on the different travis
   test targets. There seems to be a bug in bash on Ubuntu
    https://lists.gnu.org/archive/html/bug-bash/2017-07/msg00039.html
   which prevents using 'read -n1' on Ubuntu. As a workaround
   the result from CRIU's status FD is now read via python.

   Another problem was discovered on alpine with the loop restore test.
   CRIU says to use setsid even if the process is already using setsid.
   As a workaround, still with setsid, this process is now using
   shell-job true for checkpoint and restore.

Parts of v2 have been committed before. So the changes from this commit
are partially already in another commit.
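
As a rough idea of what "reading the status FD via python" can look like, here is a minimal sketch. It is not the script from this commit: the helper name is invented, and a local pipe with a single zero byte stands in for the status FD that the criu service would write to.

```python
import os


def wait_until_ready(status_fd):
    """Block until one byte arrives on status_fd; return False on EOF."""
    return len(os.read(status_fd, 1)) == 1


# Simulated daemon: a pipe whose write end signals readiness with one byte.
r, w = os.pipe()
os.write(w, b"\0")  # assumption: readiness is signalled by a single byte
ready = wait_until_ready(r)
os.close(r)
os.close(w)
```

Reading exactly one byte this way avoids relying on the shell's 'read -n1', which is the part that misbehaved on Ubuntu.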
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

ae55a6cc