Files · 204c1ef9e03bbfa210c63dfa00b2272cb6918451 · zhul / criu

restore: Fix deadlock when helper's child dies · 204c1ef9

Dmitry Safonov authored Jul 20, 2017

Since commit ced9c529 ("restore: fix race with helpers' kids dying
too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
This way we avoid the default criu SIGCHLD handler, as it doesn't
expect that children may die.

This is very racy: we wait on a futex for the next stage to start,
but the next stage can start only when all tasks have completed the
previous one. If a child of the helper dies, the helper may already
have blocked SIGCHLD and gone to sleep on the futex. Then the next
stage never comes and no one reads the helper's pending SIGCHLD.

A customer hit this situation on a node where the following
(unrelated) problem had occurred:
  Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
Then the helper's child criu process exited with an error code and
the lockup happened.

While we could fix it by aborting the futex at the end of
restore_task_with_children() for every task (non-root ones too),
that would not be completely correct:
1. All futex-waiting tasks would wake up after that, and they
   may not expect some tasks to still be on the previous stage,
   so they would spam the logs with unrelated errors and may
   also die painfully.
2. A child may die without aborting the futex due to:
   o segfault
   o OOM killer
   o User-sent SIGKILL
   o Another error path we forgot to cover with a futex abort

To fix this deadlock in TASK_HELPER, as suggested by Kirill,
let's check whether any child deaths are expected: if there are
none, don't block SIGCHLD; otherwise wait() and check that the
death happened on an expected stage of restore (not CR_STATE_RESTORE).

Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>

Conflicts:
criu/cr-restore.c
