- 12 May, 2018 40 commits
-
Mike Rapoport authored
Make sure we handle various corner cases:
* we received fewer pages than requested
* the request was capped because of unmap/remap, etc.
* the process has exited underneath us
Currently we free the request once we've found the address to use with uffd_copy(). Instead, keep the request object around, use it to properly calculate the number of pages we pass to uffd_copy(), and then re-add the trailing range (if any) to the IOVs list.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
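A minimal, self-contained C sketch of the "re-add the trailing range" idea; this is not the CRIU code, and struct range, pending[] and requeue_tail() are made-up names for illustration:

    #include <stdio.h>

    #define PAGE_SIZE 4096

    struct range { unsigned long start, len; };

    static struct range pending[16];
    static int npending;

    /* copied may be smaller than requested (partial copy, unmap, exited process) */
    static void requeue_tail(struct range *req, unsigned long copied)
    {
        if (copied >= req->len)
            return;                       /* fully serviced, nothing to re-add */
        pending[npending].start = req->start + copied;
        pending[npending].len = req->len - copied;
        npending++;
    }

    int main(void)
    {
        struct range req = { 0x100000, 8 * PAGE_SIZE };

        requeue_tail(&req, 3 * PAGE_SIZE);    /* pretend only 3 pages were copied */
        printf("re-queued %lx..%lx\n", pending[0].start,
               pending[0].start + pending[0].len);
        return 0;
    }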
-
Mike Rapoport authored
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
Instead of merging unfinished requests with the child's IOVs, we queued them into the parent's IOV list. Fix it.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
Commit 9cb20327aa4 ("return to epoll_wait after completing forks") was only halfway there. Add the other half.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
It is possible that, when pages requested from the remote source arrive, part of the memory range covered by the request is already gone because of madvise(MADV_DONTNEED), mremap(), etc. Ensure we do not try to uffd_copy more than we are allowed to.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
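A hedged sketch of the capping idea: UFFDIO_COPY and struct uffdio_copy are the real userfaultfd API, but copy_capped() and its parameters are illustrative, not CRIU's implementation:

    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    /* Copy at most up to iov_end, even if the faulting request asked for more. */
    long copy_capped(int uffd, unsigned long dst, unsigned long src,
                     unsigned long len, unsigned long iov_end)
    {
        struct uffdio_copy copy = {
            .dst = dst,
            .src = src,
            /* never copy past the end of the range we still own */
            .len = dst + len > iov_end ? iov_end - dst : len,
            .mode = 0,
        };

        if (ioctl(uffd, UFFDIO_COPY, &copy))
            return -1;
        return copy.copy;    /* bytes actually copied */
    }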
-
Mike Rapoport authored
If we get a fork() event just before transferring the last IOV of the parent process, continuing the background fetch after completing the fork event handling will cause the lazy-pages daemon to exit, and nothing will monitor the child process memory.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
Since the memory mapping is now split between the ->iovs and ->reqs lists, any update to the memory layout should take both lists into account.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
Instead of recalculating the size required for lazy_pages_info->buf when copying IOVs at fork() time, keep the size of the buffer in the lazy_pages_info struct.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
When we return from epoll_run_rfds() with a positive return value, it means that the event handling loop was interrupted because the event should be handled outside of that loop. This is always the case with UFFD_EVENT_FORK. It may happen that the event occurred after we have completed the memory transfer and we are on the way to a successful return from handle_requests(), but instead of returning 0 we will return the positive value we got from epoll_run_rfds(). Explicitly assigning the return value of complete_forks() fixes this issue.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
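For clarity, a tiny self-contained sketch of the control flow being fixed; the *_stub() functions only mimic the return-value convention described above and are not CRIU code:

    #include <stdio.h>

    static int epoll_run_rfds_stub(void) { return 1; }   /* 1: a fork event is pending */
    static int complete_forks_stub(void) { return 0; }   /* 0: forks handled successfully */

    static int handle_requests_stub(void)
    {
        int ret = epoll_run_rfds_stub();

        if (ret > 0)
            ret = complete_forks_stub();  /* take this status; don't leak the positive value */
        return ret;
    }

    int main(void)
    {
        printf("handle_requests() -> %d\n", handle_requests_stub());
        return 0;
    }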
-
Mike Rapoport authored
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
With userfaultfd we cannot reliably service process_vm_readv() calls. The maps007 test, which uses these calls, previously passed by sheer luck.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
In the current model we do not start the background page transfer until POLL_TIMEOUT has elapsed since the last uffd or socket event. If the restored process accesses memory once every (POLL_TIMEOUT - epsilon), filling its memory can take ages. This patch changes the model in the following way:
* poll for events indefinitely until the restore is complete
* the restore completion event resets the poll timeout to zero and starts the background transfers
* after each transfer we return to check if there are any uffd events to handle

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
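A rough, runnable sketch of the polling model described above, using an eventfd to stand in for the "restore complete" notification; all names and the chunk accounting are illustrative, not CRIU's implementation:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    #define MAX_EVENTS 8

    static bool restore_finished;   /* flipped when "restore complete" arrives */
    static int chunks_left = 3;     /* stand-in for the remaining lazy memory */

    static int handle_event(struct epoll_event *ev)
    {
        uint64_t v;

        if (read(ev->data.fd, &v, sizeof(v)) != sizeof(v))
            return -1;
        restore_finished = true;    /* treat any event as "restore complete" here */
        return 0;
    }

    static int xfer_background_chunk(void)
    {
        printf("transferring a background chunk (%d to go)\n", chunks_left);
        return --chunks_left;       /* <= 0 once everything has been transferred */
    }

    static int event_loop(int epollfd)
    {
        struct epoll_event events[MAX_EVENTS];
        int timeout = -1;           /* poll indefinitely until restore completes */

        for (;;) {
            int i, n = epoll_wait(epollfd, events, MAX_EVENTS, timeout);

            if (n < 0)
                return -1;
            for (i = 0; i < n; i++)
                if (handle_event(&events[i]))
                    return -1;
            if (!restore_finished)
                continue;
            timeout = 0;            /* from now on only drain pending events */
            if (xfer_background_chunk() <= 0)
                return 0;           /* all memory transferred */
        }
    }

    int main(void)
    {
        int epollfd = epoll_create1(0);
        int efd = eventfd(0, 0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
        uint64_t one = 1;

        epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev);
        if (write(efd, &one, sizeof(one)) != sizeof(one))   /* fake "restore complete" */
            return 1;
        return event_loop(epollfd);
    }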
-
Mike Rapoport authored
Currently, once we get to transfer pages in the "background", we try to fetch the entire IOV at once. For large IOVs this may impact #PF latency for the #PF events that occur during the transfer. Let's add a simple heuristic for controlling the size of the background transfers. Initially, the transfer is limited to some default value. Every time we transfer a chunk we increase the transfer size until it reaches a pre-defined maximum size. A page fault event resets the background transfer size to its initial value.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
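A minimal sketch of such a grow-and-reset heuristic; the constants and function names are made up for illustration and are not CRIU's actual values:

    #include <stdio.h>

    #define PAGE_SIZE        4096
    #define XFER_LEN_DEFAULT (64 * PAGE_SIZE)
    #define XFER_LEN_MAX     (1024 * PAGE_SIZE)

    static unsigned long xfer_len = XFER_LEN_DEFAULT;

    /* called after every background chunk: grow towards the maximum */
    static void xfer_len_grow(void)
    {
        if (xfer_len < XFER_LEN_MAX)
            xfer_len *= 2;
        if (xfer_len > XFER_LEN_MAX)
            xfer_len = XFER_LEN_MAX;
    }

    /* called on every page-fault event: latency matters again, start small */
    static void xfer_len_reset(void)
    {
        xfer_len = XFER_LEN_DEFAULT;
    }

    int main(void)
    {
        int i;

        for (i = 0; i < 5; i++) {
            printf("chunk %d: %lu pages\n", i, xfer_len / PAGE_SIZE);
            xfer_len_grow();
        }
        xfer_len_reset();    /* a #PF arrived: back to the default size */
        printf("after #PF: %lu pages\n", xfer_len / PAGE_SIZE);
        return 0;
    }

Doubling keeps the number of growth steps small while ensuring the first chunk after every fault stays cheap.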
-
Mike Rapoport authored
The complete_forks function presumes that it always has work to do, because we assume that a fork event is the only case where we drop out of epoll_run_rfds() with a positive return value. Teach complete_forks to bail out when there are no pending forks to process, to allow exiting epoll_run_rfds() for other reasons.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
First check if there are pages we need to transfer and only afterwards check if there are outstanding requests. Also, instead of checking 'bool remaining' to see if there is more work to do, we can simply check whether all the lpi's have already been serviced.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
The intention is to use this function for transferring all the pages that didn't cause a #PF.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
The function picks the next page range to transfer anyway; it just does it in a very simple FIFO manner.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
We already have a queue of the requested memory ranges, which contains 'lp_req' objects. These objects hold the same information as lazy_iov: the start address of the range, the end address, and the address the range had at dump time. Rather than keeping this information twice and doing double bookkeeping, we can extract the requested range from lpi->iovs and move it to lpi->reqs.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
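A toy sketch of extracting a requested sub-range from one list and moving it to another, splitting the covering IOV when the request falls in its middle; the data structures here (sys/queue.h TAILQ lists, struct range) are illustrative and not CRIU's lazy_iov/lp_req model:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/queue.h>

    struct range {
        unsigned long start, end;
        TAILQ_ENTRY(range) l;
    };

    TAILQ_HEAD(rlist, range);

    static struct range *new_range(unsigned long start, unsigned long end)
    {
        struct range *r = malloc(sizeof(*r));   /* error handling omitted */

        r->start = start;
        r->end = end;
        return r;
    }

    /* Move [start, end) out of the IOV that covers it and onto the reqs list,
     * leaving any head/tail remainder on the iovs list. */
    static void extract_range(struct rlist *iovs, struct rlist *reqs,
                              unsigned long start, unsigned long end)
    {
        struct range *iov;

        TAILQ_FOREACH(iov, iovs, l) {
            if (start < iov->start || end > iov->end)
                continue;
            if (end < iov->end) {           /* keep the tail in iovs */
                struct range *tail = new_range(end, iov->end);
                TAILQ_INSERT_AFTER(iovs, iov, tail, l);
            }
            if (start > iov->start) {       /* keep the head in iovs */
                struct range *head = new_range(iov->start, start);
                TAILQ_INSERT_BEFORE(iov, head, l);
            }
            TAILQ_REMOVE(iovs, iov, l);
            iov->start = start;
            iov->end = end;
            TAILQ_INSERT_TAIL(reqs, iov, l);
            return;
        }
    }

    int main(void)
    {
        struct rlist iovs = TAILQ_HEAD_INITIALIZER(iovs);
        struct rlist reqs = TAILQ_HEAD_INITIALIZER(reqs);
        struct range *iov0 = new_range(0x1000, 0x9000);
        struct range *r;

        TAILQ_INSERT_TAIL(&iovs, iov0, l);
        extract_range(&iovs, &reqs, 0x3000, 0x5000);    /* fault in the middle */
        TAILQ_FOREACH(r, &iovs, l)
            printf("iov %lx-%lx\n", r->start, r->end);
        TAILQ_FOREACH(r, &reqs, l)
            printf("req %lx-%lx\n", r->start, r->end);
        return 0;
    }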
-
Mike Rapoport authored
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Mike Rapoport authored
Instead of relying on the length of various lists, add a boolean variable to lazy_pages_info to make it clear when the process has exited.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Andrey Vagin authored
Currently zdtm doesn't detect when restore has failed if it is executed with strace. With this patch, fake-restore.sh creates a test file, and zdtm is able to detect when restore failed.

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Andrei Vagin authored
The get() method requires a key, but we are currently passing an index. That will never work correctly as it is now.

Acked-by: Adrian Reber <adrian@lisas.de>
Reported-by: Adrian Reber <adrian@lisas.de>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Andrey Vagin authored
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Andrey Vagin authored
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Andrey Vagin authored
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Andrey Vagin authored
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Andrey Vagin authored
Currently we restore all sockets in the root mount namespace, because we were not able to get any information about the mount point where a socket is bound. This is obviously incorrect in some cases. In the 4.10 kernel, we added the SIOCUNIXFILE ioctl for unix sockets. This ioctl opens the file to which a socket is bound and returns a file descriptor. The new ioctl allows us to get mnt_id by reading fdinfo, and mnt_id is enough to find the proper mount point and mount namespace. The logic of this patch is straightforward: on dump, we save the mnt_id for sockets; on restore, we find the mount namespace by mnt_id and restore the socket in its mount namespace.

Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
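The kernel side of this is real: SIOCUNIXFILE (Linux >= 4.10, from linux/sockios.h) opens the file a unix socket is bound to and returns a descriptor whose fdinfo exposes mnt_id. A minimal demo with only rudimentary error handling; the socket path is arbitrary:

    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <linux/sockios.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        char path[64], line[256];
        int sk, fd;
        FILE *f;

        strcpy(addr.sun_path, "/tmp/siocunixfile.sock");
        unlink(addr.sun_path);

        sk = socket(AF_UNIX, SOCK_STREAM, 0);
        if (bind(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return 1;

        fd = ioctl(sk, SIOCUNIXFILE, 0);    /* fd referring to the bound file */
        if (fd < 0)
            return 1;

        snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
        f = fopen(path, "r");
        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f))
            if (strncmp(line, "mnt_id:", 7) == 0)
                printf("%s", line);         /* mount id of the socket's path */
        fclose(f);
        return 0;
    }

On kernels that lack the ioctl the call simply fails, so any user of this mechanism needs a fallback path.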
-
Andrey Vagin authored
unix_process_name() is called when sockets are being collected, but at that moment we don't have socket descriptors. A socket descriptor is required to get mnt_id, which will allow us to resolve a socket path in its mount namespace.

Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Andrey Vagin authored
This ioctl opens the file to which a socket is bound and returns a file descriptor. The file descriptor can be used to get the mnt_id and the file path.

Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Andrey Vagin authored
The USK_CALLBACK flag means that a socket is external and will be restored by a plugin. open_unixsk_standalone() should not be called for these sockets.

$ make -C test/others/unix-callback/ run
...
(00.109338) 7471: sk unix: Opening standalone socket (id 0xd ino 0 peer 0x63b)
(00.109376) 7471: Error (criu/sk-unix.c:1128): sk unix: BUG at criu/sk-unix.c:1128

Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Andrey Vagin authored
Unix file sockets have to be restored in their proper mount namespaces.

Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-
Cyrill Gorcunov authored
When walking over unix sockets, make sure the queuer is present before accessing it.

https://jira.sw.ru/browse/PSBM-82796

Reported-by: Vitaly Ostrosablin <vostrosablin@virtuozzo.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
-
Kir Kolyshkin authored
There was a "; done" leftover here, somehow ignored by dash but not bash. Remove it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
-
Kir Kolyshkin authored
It is not used; it was probably committed by mistake.

Fixes: 2d093a17 ("travis: add a job to test on the fedora rawhide")

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
-
Kir Kolyshkin authored
Fix Fedora rawhide CI failure caused by coreutils-single and our way of running under QEMU.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
-
Kir Kolyshkin authored
1. Sort lists of packages to be installed, unify indentation.
2. Merge "ccache -s" and "ccache -z".

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
-
Kir Kolyshkin authored
In Ubuntu Bionic for armhf, clang is compiled for armv8l rather than armv7l (as it was, and still is, for gcc), and so it uses armv8 by default. This breaks compilation of tests using smp_mb():

> error: instruction requires: data-barriers

The fix is to add "-march=armv7-a" to CFLAGS, which we already do, except not for the tests.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
-