    restore: Fix deadlock when helper's child dies · 204c1ef9
    Dmitry Safonov authored
    Since commit ced9c529 ("restore: fix race with helpers' kids dying
    too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
    This way we avoid the default criu sighandler, as it doesn't expect
    that children may die.
    
    This is very racy, as we wait on a futex for the next stage to start,
    but the next stage may start only when all the tasks complete the
    previous stage. If a child of the helper dies, the helper may already
    have blocked SIGCHLD and started sleeping on the futex. Then the next
    stage never comes and no one reads the pending SIGCHLD for the helper.
    
    A customer hit this situation on a node where the following
    (unrelated) problem had occurred:
    Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
    Then the helper's child criu process exited with an error code and
    the lockup happened.
    
    While we could fix it by aborting the futex at the end of
    restore_task_with_children() for every task (non-root ones too),
    that would not be completely correct:
    1. All futex-waiting tasks will wake up after that, and they
       may not expect that some tasks are still on the previous stage,
       so they will spam the logs with unrelated errors and may
       also die painfully.
    2. A child may die and miss aborting the futex due to:
       o segfault
       o OOM killer
       o User-sent SIGKILL
       o Other error path we forgot to cover with a futex abort
    
    To fix this deadlock in TASK_HELPER, as suggested by Kirill,
    let's check whether any child deaths are expected: if there
    are none, don't block SIGCHLD; otherwise wait() and check that
    the death happened on an expected stage of restore (not CR_STATE_RESTORE).
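
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from criu/cr-restore.c: the names `helper_reap_expected`, `spawn_exiting_child` and `enum sketch_stage` are hypothetical, and passing the stage as a plain parameter is an assumption made to keep the example self-contained. The point is that a helper with no expected child deaths leaves SIGCHLD alone, while a helper that does expect deaths reaps them with wait() and treats a death seen during the restore stage as an error.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stage names; only CR_STATE_RESTORE is named in the commit message. */
enum sketch_stage { SKETCH_STAGE_FORKING, SKETCH_STAGE_RESTORE };

/* Demo only: fork a child that exits immediately with the given code. */
static pid_t spawn_exiting_child(int code)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(code);
	return pid; /* child's pid in the parent, -1 if fork() failed */
}

/*
 * Reap the children whose death is expected.
 *
 * With expected_deaths == 0 this does nothing: SIGCHLD is neither blocked
 * nor waited for, so an unexpected death still reaches the signal handler.
 * Otherwise each child is collected with wait(); a death observed while
 * the (assumed) current stage is the restore stage counts as a failure.
 * Returns 0 on success, -1 on failure.
 */
static int helper_reap_expected(int expected_deaths, enum sketch_stage stage)
{
	int failed = 0;

	assert(expected_deaths >= 0);

	while (expected_deaths > 0) {
		int status;
		pid_t pid = wait(&status);

		if (pid < 0) {
			if (errno == EINTR)
				continue; /* interrupted by a signal, retry */
			return -1;    /* e.g. ECHILD: nothing left to reap */
		}
		if (stage == SKETCH_STAGE_RESTORE)
			failed = 1;   /* death on an unexpected stage */
		expected_deaths--;
	}
	return failed ? -1 : 0;
}
```

Because the helper blocks in wait() rather than sleeping on the futex with SIGCHLD blocked, a dying child always wakes it up, which is what removes the deadlock in this sketch.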
    Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
    Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
    Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
    
    Conflicts:
    	criu/cr-restore.c
Name
Documentation
compel
contrib
coredump
crit
criu
images
include/common
lib
scripts
soccr
test
.gitignore
.mailmap
.travis.yml
COPYING
CREDITS
INSTALL.md
Makefile
Makefile.compel
Makefile.config
Makefile.install
Makefile.versions
README.md