Commits · 7b87f1635e311ef199e5ec57c5d53d821b7a0db4 · zhul / criu

02 Mar, 2018 36 commits

Kirill Tkhai authored Dec 28, 2017

Create a zombie with specific pgid and check that
pgid remains the same after restore.

Without either of the two previous patches, this test hangs criu restore:
1) without "restore: Call prepare_fds() in restore_one_zombie()",
  in 100% of cases;

2) without "restore: Split restore_one_helper() and wait exiting
  zombie children", the failure is racy, but you can add something like
criu/cr-restore.c:
## -1130,6 +1130,8 @@ static int restore_one_zombie(CoreEntry *core)

        if (task_entries != NULL) {
                restore_finish_stage(task_entries, CR_STATE_RESTORE);
+               if (current->parent->pid->state == TASK_ALIVE)
+                       sleep(2);
                zombie_prepare_signals();
        }

and it will fail with almost 100% probability.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

7b87f163

zdtm: Add sys_clone_unified() · fa098801

Kirill Tkhai authored Jul 14, 2017

Clean up the fork() definition and make a generic function
for all archs. It may be useful when you want to add
more clone flags to fork(), or if you want to pass more
than one argument to the child function (glibc's clone
allows only one).
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

fa098801

restore: Split restore_one_helper() and wait exiting zombie children · e23806f3

Kirill Tkhai authored Dec 28, 2017

A zombie can also be chosen as a parent for a task helper, like
any other task.

If the task helper exits between restore_finish_stage(CR_STATE_RESTORE)
and zombie_prepare_signals()->SIG_UNBLOCK, the standard criu SIGCHLD
handler is called, and the restore fails:

(00.057762)     41: Error (criu/cr-restore.c:1557): 40 exited, status=0
(00.057815) Error (criu/cr-restore.c:2465): Restoring FAILED.

This patch makes restore_one_zombie() behave like restore_one_helper()
and wait for children to exit before allowing SIGCHLD. This makes us
safe against races with exiting children.

See next patch for test details.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

e23806f3

restore: Call prepare_fds() in restore_one_zombie() · ec761499

Kirill Tkhai authored Dec 28, 2017

A zombie may be chosen as the parent for a task helper
while solving pgid dependencies. In this situation,
it starts sharing fdt with the helper, and it has
to call prepare_fds() to decrement fdt->nr.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

ec761499

zdtm: Export sys_clone_unified() to headers · 820bad96

Kirill Tkhai authored Dec 28, 2017

Make it possible to use this function by tests.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

820bad96

files: Kill unused CTL_TTY_OFF leftovers · 0da7d971

Kirill Tkhai authored Dec 28, 2017

CTL_TTY_OFF and reserve_service_fd() are unused now,
so purge them from the code.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

0da7d971

files: Move CTL_TTY_OFF fixup to generic file engine · 7a6fe6f0

Kirill Tkhai authored Dec 28, 2017

There are two problems. The first is that CTL_TTY_OFF occupies
one of the biggest available fds in the system. It's a number
near service_fd_rlim_cur. Next patches want to allocate
service fds lower than service_fd_rlim_cur, and they want
to know the max used fd from file fles after the image reading.

But since one of the fds is already set very big (CTL_TTY_OFF)
at the stage of collecting fles, the only available service
fds are near service_fd_rlim_cur. It's a vicious circle,
and the only way out is to change how the ctl tty fd is allocated.

The second problem is that ctl tty is an ugly fixup outside of the
generic file engine (see open_fd()). This was done because ctl tty
is the only slave fle that needs additional actions
(see tty_restore_ctl_terminal()). Other file types just
receive their slave fle and do not do anything else.

This patch moves ctl tty to the generic engine and solves both of
the above problems. To do that, we implement a new CTL_TTY
file type, whose open method waits till the slave tty is received
and then calls tty_restore_ctl_terminal() for it. It fits
the generic engine well, allocates its fd via find_unused_fd(),
and does not pollute the file table with big fd numbers.

The next patch will kill the now-unneeded CTL_TTY leftovers
and will remove the CTL_TTY_OFF service fd from criu.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

7a6fe6f0

files: Move prepare_ctl_tty() to criu/tty.c · 8a946000

Kirill Tkhai authored Dec 28, 2017

Move the function and reduce the number of its arguments.
This cleanup is needed to keep all tty code together.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

8a946000

files: Close ctl tty via generic engine · e4c25f2b

Kirill Tkhai authored Dec 28, 2017

Just mark the fle as "fake" and the engine will do all the work.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

e4c25f2b

files: Fix crossing unused and service fds of shared fd tables · 832c9aef

Kirill Tkhai authored Jun 28, 2017

service_fd_id is the id of a specific task, while other tasks
in a shared fd table group may have bigger id numbers.
In this case the given unused fd intersects with the service fds
of such tasks. This leads to undefined behaviour. Fix that.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

832c9aef

restore: Do not iterate over parent's files to find leftovers · 66f55b5e

Kirill Tkhai authored Jun 28, 2017

This patch speeds up creation of a child process by disabling
iteration over open files in most cases. Really, we don't
need that now, as previous patches make sure parent files do not leak:

mnt namespace fds are stored in fdstore, pid proc files
are closed directly.

So, now we can skip closing old files in most cases,
except some CLONE_FILES cases: we need that only if the parent
has CLONE_FILES in its flags (and for root_item).
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

66f55b5e

restore: Use vpid in log_init_by_pid() instead of getpid() · f11a0ce0

Kirill Tkhai authored Jun 28, 2017

When a task is in a pid namespace, getpid() can't be used
to identify it. So, use vpid instead.

Also, move log_init_by_pid() above pid check.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

f11a0ce0

forking: Always close pid proc before child creation · 2fe60818

Kirill Tkhai authored Jun 28, 2017

The child does not know about the parent's pid proc fd,
and it can't close it by fd. The next patch will make
close_old_files() optional, and it will rely on
the fact that there are no leftover fds. So, close pid
proc directly.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

2fe60818

mnt_ns: Use fdstore to keep mount namespaces · dcac6d66

Kirill Tkhai authored Jun 28, 2017

This allows decreasing the number of file descriptors
that are passed to children and that need to be
closed in close_old_files().

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

dcac6d66

mnt: Move ns_fd assignment down in prepare_mnt_ns() · 58b3b9ee

Kirill Tkhai authored Jun 28, 2017

No functional changes.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

58b3b9ee

utils: Introduce SWAP() helper to exchange two variables · d433a3e9

Kirill Tkhai authored Jun 28, 2017

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

d433a3e9
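
The commit above carries no description, so here is a minimal sketch of what a SWAP()-style helper commonly looks like in C. It is an assumption for illustration, not the definition from the CRIU tree: it uses the GCC/Clang `__typeof__` extension, and `order_pair()` is a hypothetical caller.

```c
#include <assert.h>

/* Hypothetical sketch, not copied from CRIU: exchange two lvalues of
 * the same type through a temporary. __typeof__ is a GCC/Clang
 * extension; do { } while (0) keeps the macro safe inside if/else. */
#define SWAP(x, y)                          \
	do {                                \
		__typeof__(x) tmp_ = (x);   \
		(x) = (y);                  \
		(y) = tmp_;                 \
	} while (0)

/* Example use: make sure *lo <= *hi. */
static void order_pair(int *lo, int *hi)
{
	assert(lo && hi);
	if (*lo > *hi)
		SWAP(*lo, *hi);
}
```

Both arguments are evaluated more than once, so expressions with side effects (such as `i++`) should not be passed to a macro written this way.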

mnt_ns: Move open_proc() up in prepare_mnt_ns() · 4c68ed7d

Kirill Tkhai authored Jun 28, 2017

Both branches need this, so move it up.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

4c68ed7d

mnt: Put root fd to fdstore · bda944e1

Kirill Tkhai authored May 05, 2017

mntns_get_root_fd() may be called by a task from
!root_user_ns, and it fails if so.

Put root fd to fdstore to allow every task to use it.

v3: New
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

bda944e1

proc: Close CR_PROC_FD_OFF and TRANSPORT_FD_OFF later · 0f7e6928

Kirill Tkhai authored Feb 23, 2017

CR_PROC_FD_OFF is needed for accessing foreign tasks'
fds, and will be used in the future.

TRANSPORT_FD_OFF is for uniformity.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

0f7e6928

cr-restore: Open transport socket earlier · 7952c6a7

Kirill Tkhai authored May 05, 2017

I need a named socket to communicate with pid_ns helpers
(see next patches) and receive answers from them
(it's impossible to send an answer to an unnamed socket).
As we already have the transport socket, we'll reuse it
for the above goal too.

This patch makes transport sockets be created before
creation of children tasks. Also, now they are created
not only for alive tasks (so we need additional
manipulations for TASK_HELPERS, e.g., to call prepare_fdt()).

v5: Return CLONE_FILES clone() argument during task helpers
creation. Also get rid of fdt_mutex as CLONE_FILES processes
do not close old files after clone, and we don't have
intersections between them. Also, the socket() system call
can't return an fd in the service fds range, which was the main
reason to have this mutex.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

7952c6a7

files: Make possible task helpers to use shared_fdt_prepare() · 72e50065

Kirill Tkhai authored May 05, 2017

Next patches will create transport sockets in task helpers.
As helpers are forked using CLONE_FILES, they must resolve
shared fds to create their own service fds. This patch allows
that.

I've dug into the code, and there is no reason we need
pid_rst_prio() when choosing the fdt restorer. So, this
case may be safely deleted, which guarantees that in case
of TASK_HELPER, the restorer of fdt will be the parent, i.e.,
no TASK_HELPER will be the restorer of fdt.

v5: New
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

72e50065

pstree: Change type of init_pstree_helper() and check for parent · 563efd3f

Kirill Tkhai authored May 05, 2017

This is a refactoring, which will be used in next patches.
BUG_ON() is just to mention that parent must be set before
calling this function.

v5: New
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

563efd3f

files: Do not close transport socket twice · 0fcaeea9

Kirill Tkhai authored Mar 20, 2017

We close it in sigreturn_restore() for unification with other
service fds, so kill the second close() from here.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

0fcaeea9

sfd: Lift up own fd limit on bootup · cc5dbf51

Cyrill Gorcunov authored May 30, 2017

This minimizes the chance of hitting a problem where files
used for page transfer try to use the same number
reserved for a service fd.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

cc5dbf51

kdat: Add fetching files stat · 28af7aa0

Cyrill Gorcunov authored May 30, 2017

Will need it to unlimit the files allocation
for service fd reserving and later for parasite code run
(which is implemented in vz7 instance and soon will be
ported into vanilla).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

28af7aa0

files: Unexport collect_task_fd() · 8b517795

Kirill Tkhai authored Jun 01, 2017

It has only one user, so unexport it.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

8b517795

autofs: Add FD_TYPES__AUTOFS_PIPE type · 6d40803e

Kirill Tkhai authored Jun 01, 2017

Add a fake fd type for autofs. This allows functions
like find_file_desc() to work as expected, without
having two different file_desc with the same type
and same id.

Also, later, it will allow deleting autofs_create_fle()
and using the generic helper.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

6d40803e

zdtm: improve tempfs_overmounted test · 8fdacca5

Pavel Tikhomirov authored Dec 11, 2017

Unchanged test provided by Andrew.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

8fdacca5

mount: do remaps for child-overmount of another overmount · 0709e3ce

Pavel Tikhomirov authored Dec 11, 2017

In case we have mounts:

1 /mnt/
2 /mnt/a with parent 1
3 /mnt/a/b with parent 1
4 /mnt/a with parent 2

We determine 2 as needing remap with does_mnt_overmount() and remap it.
Next we mount 4 on top of 2. Next in fixup_remap_mounts() we want to
move 2 back to its parent 1, but instead move 4 there. So in this case
children-overmounts need to be remapped too.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

0709e3ce

mount: fix try_remap_mount · a9ec5829

Pavel Tikhomirov authored Dec 11, 2017

Remaps in mnt_remap_list should follow same descending order which was
setup in mnt_resort_siblings(), so don't reorder them.

For instance if we have sibling mounts with mountpoints:
1) /dir1/dir2/dir3
2) /dir1/dir2
3) /dir1
Here (2) is sibling-overmount for (1). Mount (3) is sibling-overmount
for both (1) and (2). So when we move overmounts back in
fixup_remap_mounts() we should first move (2) and only then (3).
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

a9ec5829

mount: fix mnt_resort_siblings to work as described · 84d6c730

Pavel Tikhomirov authored Dec 11, 2017

We should add new entry _before_ first entry with less depth to sort in
descending order.

e.g: entries in list have depths [7,5,3], adding new entry m with depth
4 we would break list_for_each_entry loop on p with depth 3, before
patch we would get [7,5,3,4] after list_add, which is wrong.

Also we can relax "<=" check to "<" to avoid unnecessary reordering.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

84d6c730
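
The ordering rule from the commit above (insert the new entry before the first entry with a smaller depth) can be sketched with a plain singly linked list. This is an illustration only: `struct mnt` and `insert_desc()` are hypothetical names, while the real code works on CRIU's own mount structures and list helpers.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a mount entry. */
struct mnt {
	int depth;
	struct mnt *next;
};

/* Insert m before the first entry whose depth is strictly less than
 * m->depth, so the list stays sorted by descending depth. Entries of
 * equal depth are skipped, which corresponds to the "<" check. */
static void insert_desc(struct mnt **head, struct mnt *m)
{
	struct mnt **pp = head;

	while (*pp && (*pp)->depth >= m->depth)
		pp = &(*pp)->next;

	m->next = *pp;
	*pp = m;
}
```

With entries of depth 7, 5 and 3 already in the list, inserting an entry of depth 4 stops at the entry with depth 3 and links the new one in front of it, which gives 7, 5, 4, 3 rather than 7, 5, 3, 4.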

zdtm: now tempfs_overmounted will pass so remove crfail · dd104ddb

Pavel Tikhomirov authored Dec 11, 2017

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

dd104ddb

mount: make open_mountpoint handle overmouts properly · b364f4fd

Pavel Tikhomirov authored Dec 22, 2017

Dump of a VZ7 CT fails if we have an overmounted tmpfs inside:

[root@silo ~]# prlctl enter su-test-2
entered into CT
CT-829e7b28 /# mkdir /mnt/overmntedtmp
CT-829e7b28 /# mount -t tmpfs tmpfs /mnt/overmntedtmp/
CT-829e7b28 /# mount -t tmpfs tmpfs /mnt
CT-829e7b28 /# logout

[root@silo ~]# prlctl suspend su-test-2
Suspending the CT...
Failed to suspend the CT: PRL_ERR_VZCTL_OPERATION_FAILED (Details: Will skip in-flight TCP connections
(01.657913) Error (criu/mount.c:1202): mnt: Can't open ./mnt/overmntedtmp: No such file or directory
(01.662528) Error (criu/util.c:709): exited, status=1
(01.664329) Error (criu/util.c:709): exited, status=1
(01.664694) Error (criu/cr-dump.c:2005): Dumping FAILED.
Failed to checkpoint the Container
All dump files and logs were saved to /vz/private/829e7b28-f204-4bce-b09f-d203b99befd4/dump/Dump.fail
Checkpointing failed
)

Criu wants to dump the contents of /mnt/overmntedtmp/ mount but it is
unavailable. So we copy the mount namespace in such a case and unmount
overmounts to access what we want to dump.

The actual use case here is dumping a CT with active mariadb and an ssh
connection. Together they happen to create such an overmount. By default
systemd creates a separate mount namespace for mysql and also mounts
tmpfs to /run/user in it, and when ssh(root) is connected, systemd also
mounts tmpfs in the container root mount namespace to /run/user/0 for user
files. As /run is a slave mount, /run/user/0 also propagates to mysql's
mount namespace and initially becomes overmounted by /run/user.

https://jira.sw.ru/browse/PSBM-57362

remove __maybe_unused for mnt_is_overmounted and umount_overmounts

changes in v2:
1) Use clone not fork, share resources with parent same as in
call_in_child_process.
2) Do not enter userns (create helper) for non-overmounted mounts. Thus
bring back the setns/restore_ns logic.
3) Helper opens fd for parent directly due to CLONE_FILES, remove futex.
4) Check helper exit status properly.
5) Add get_clean_fd helper.
6) Add better comments.

changes in v3:
1) Pass fd from helper through args instead of ret code, fix ret code
checking.
2) Add \n to pr_err in open_mountpoint

changes in v5:
Make comments even better.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

b364f4fd

mount: add umount_overmounts helper to make mount visible · 83df8649

Pavel Tikhomirov authored Dec 11, 2017

also remove __maybe_unused for __umount_children_overmounts

note: leave it __maybe_unused yet
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

83df8649

mount: add __umount_children_overmounts helper to make mount visible · d17bad63

Pavel Tikhomirov authored Dec 11, 2017

note: leave it __maybe_unused yet
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

d17bad63

mount: add mnt_is_overmounted helper to check mount visibility · 2bed6e9f

Pavel Tikhomirov authored Dec 11, 2017

note: leave it __maybe_unused yet
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

2bed6e9f

15 Feb, 2018 4 commits

kerndat: call kerndat_link_nsid() · 0d9bed0e

Andrei Vagin authored Jan 11, 2018

It was dropped during one of the rebases.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

0d9bed0e

kdat/net: Init kerndat even if nsid aren't supported · 677b6cb0

Dmitry Safonov authored Jan 27, 2018

We should continue even if kdat feature isn't supported:

[criu]# ./criu/criu dump -t `pidof pypy` --shell-job
Warn  (criu/kerndat.c:804): Can't load /run/criu.kdat
Warn  (criu/libnetlink.c:55): ERROR -95 reported by netlink
Error (criu/net.c:3042): Unable to create a veth pair: -95
Warn  (criu/net.c:3064): NSID isn't reported for network links

Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>

677b6cb0

net: handle a case when --empty net is set only for criu dump · fc3ffd82

Andrei Vagin authored Oct 05, 2017

The original idea was to set --empty net for both criu dump and criu restore,
but before cde33dcb ("empty-ns: Don't C/R iptables too (v2)"),
criu restore worked without --empty net and we didn't notice that
docker doesn't set this option on restore.

After a small brainstorm, we decided that it is better to remove
this requirement. Docker has to set this option, but with these changes,
the docker issue will be less urgent.

https://github.com/checkpoint-restore/criu/issues/393
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

fc3ffd82

net: Fix links collection retcode · 71e2bdc9

Pavel Emelyanov authored Jun 16, 2017

There's a

   if (bad_thing) {
	   ret = -1;
	   break;
   }

code above this hunk, whose intention is to propagate -1 back to
caller. This propagation is obviously broken.
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

71e2bdc9
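
The commit above does not show the broken hunk, so the following is only a generic sketch of the pattern it describes: a loop that sets `ret = -1` and breaks, and a function tail that has to hand that value back to the caller. `collect_all()` and its negative-item check are hypothetical stand-ins, not code from criu/net.c.

```c
/* Hypothetical collector: returns 0 on success, -1 if any item is bad. */
static int collect_all(const int *items, int n)
{
	int ret = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (items[i] < 0) {	/* stand-in for "bad_thing" */
			ret = -1;
			break;
		}
	}

	/* Any cleanup would go here. The important part is to return
	 * ret, not a constant and not a value overwritten after the
	 * loop, otherwise the -1 never reaches the caller. */
	return ret;
}
```

Called with `{1, 2, 3}` the function reports success, and with `{1, -2, 3}` it stops at the second item and returns -1.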