Commits · 295090c1ea4de9b71ac248cdb605cd180dabb486 · zhul / criu

30 Sep, 2014 11 commits

img: Introduce the struct cr_img · 295090c1

Pavel Emelyanov authored Sep 29, 2014

We want to have buffered images to speed up dump and,
slightly, restore. Right now we use plain file descriptors
to write and read images to/from. Making them buffered
cannot be gracefully done on plain fds, so introduce
a new class.

This will also help if (when?) we want to make more
complex changes to images, e.g. store them all in one
file or send them directly to the network.

For now the cr_img just contains one int _fd variable.

This patch changes the prototype of open_image() to
return struct cr_img *, pb_(read|write)* to accept one
and fixes the compilation of the rest of the code :)
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

295090c1
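
A minimal sketch of what such a wrapper could look like, assuming only what the message states (a struct cr_img holding a single int _fd). The helper names img_open(), img_raw_fd(), img_write_buf() and img_close() are hypothetical and are not the actual CRIU API.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* The image handle: for now it only wraps a plain descriptor. */
struct cr_img {
	int _fd;
};

/* Hypothetical accessor, so callers stop touching the int directly. */
static int img_raw_fd(const struct cr_img *img)
{
	return img->_fd;
}

/* Hypothetical open helper: returns a handle, or NULL on failure. */
static struct cr_img *img_open(const char *path, int flags)
{
	struct cr_img *img = malloc(sizeof(*img));

	if (!img)
		return NULL;

	img->_fd = open(path, flags, 0600);
	if (img->_fd < 0) {
		free(img);
		return NULL;
	}
	return img;
}

/* Hypothetical write helper: 0 on a full write, -1 otherwise. */
static int img_write_buf(struct cr_img *img, const void *buf, size_t len)
{
	ssize_t ret = write(img->_fd, buf, len);

	return ret == (ssize_t)len ? 0 : -1;
}

static void img_close(struct cr_img *img)
{
	if (!img)
		return;
	close(img->_fd);
	free(img);
}
```

Once every caller goes through the handle, buffering can later be added inside the struct without touching the callers again.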

Subject: [PATCH 07/14] pstree: Subblock for ids read on task restore · 0c5dc93b

Pavel Emelyanov authored Sep 29, 2014

Ugly, but it's for easier further patching.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

0c5dc93b

img: Don't return fd, return -1 instead · 35be2ee2

Pavel Emelyanov authored Sep 29, 2014

The same -- int-fd will soon go away, so return the
explicit int -1 instead of it.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

35be2ee2

img: Use errno when checking optional images open fail · 42821edc

Pavel Emelyanov authored Sep 29, 2014

There will be no int-fd soon, so this is one more
preparation for that.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

42821edc

img: Rename fdset -> imgset · 5f2a7ac2

Pavel Emelyanov authored Sep 29, 2014

Since we're going to switch from int-fd-s to class-image
soon the fdset name will not fit into the new terminology.

This patch is

 sed -e 's/fdset/imgset/g' -i *
 sed -e 's/imgset_fd/img_from_set/g' -i *
 git mv include/fdset.h include/imgset.h
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

5f2a7ac2

img: Move images IO helpers into .c file · 1cb690dd

Pavel Emelyanov authored Sep 29, 2014

This is to simplify the change from int fd to more
generic image class data-type.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

1cb690dd

rst: Don't use write_img_buf for setting last_pid sysctl · 9d9ac53c

Pavel Emelyanov authored Sep 29, 2014

The write_img_buf will be used only for images writing, while
in this place we just have a raw file descriptor.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

9d9ac53c

img: Keep the copy of flags value in open_image_at · 03482f69

Pavel Emelyanov authored Sep 29, 2014

We drop the O_OPT from flags and will drop one more. So
instead of a set of bools let's keep a copy of the flags
at hand.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

03482f69

files-reg: Simplify have_seen_dead_pid · 78bbb0a1

Cyrill Gorcunov authored Sep 23, 2014

We've a special helper xrealloc_safe for reallocs.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

78bbb0a1

cgroup: Use xmalloc in rewrite_cgsets · 1ef50607

Cyrill Gorcunov authored Sep 23, 2014

We prefer x* helpers because they print an error
in case of allocation failures.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

1ef50607
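
A rough sketch of the idea behind the x* helpers mentioned in the two commits above: an allocator wrapper that reports failures, and a realloc wrapper that does not lose the old pointer on failure. The names carry a _sketch suffix because the signatures and the error message are assumptions, not CRIU's actual macros.

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumed behaviour: like malloc(), but prints an error on failure. */
static void *xmalloc_sketch(size_t size)
{
	void *p = malloc(size);

	if (!p)
		fprintf(stderr, "Error: can't allocate %zu bytes\n", size);
	return p;
}

/*
 * Assumed behaviour: resize *pptr in place. On failure the old
 * block stays valid and reachable through *pptr, and -1 is returned.
 */
static int xrealloc_safe_sketch(void **pptr, size_t size)
{
	void *n = realloc(*pptr, size);

	if (!n) {
		fprintf(stderr, "Error: can't reallocate to %zu bytes\n", size);
		return -1;
	}
	*pptr = n;
	return 0;
}
```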

bfd: timerfd -- Fix parsing typo · c01efda8

Cyrill Gorcunov authored Sep 30, 2014

While converting the data stream reading to bfd,
the @buf member was left untouched, which led to
incorrect data being read. Fix it by setting up the
proper one, i.e. @str itself; otherwise dumping
of timerfd files fails.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

c01efda8

29 Sep, 2014 8 commits

bfd: Multiple buffers management (v2) · 5eb39aad

Pavel Emelyanov authored Sep 29, 2014

I plan to re-use the bfd engine for images buffering. Right
now this engine uses one buffer that gets reused by all
bfdopen()-s. This works for current usage (one-by-one proc
files access), but for images we'll need more buffers.

So this patch just puts buffers in a list and organizes a
stupid R-R with refill on it.

v2:
  Check for buffer allocation errors
  Print buffer mem pointer in debug
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>

5eb39aad
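
One possible reading of "buffers in a list with round-robin and refill", sketched under assumptions: a singly linked free list, buffers taken from the head and returned to the tail, and a batch allocated when the list runs dry. The names (struct pbuf, pool_get, pool_put, pool_refill) and the batch and buffer sizes are hypothetical; the real bfd code may differ.

```c
#include <stdlib.h>

#define POOL_BATCH	4	/* assumed refill batch size */
#define POOL_BUF_SIZE	4096	/* assumed buffer size */

struct pbuf {
	struct pbuf *next;
	char mem[POOL_BUF_SIZE];
};

static struct pbuf *pool_head, *pool_tail;

/* Return a buffer to the tail of the free list. */
static void pool_put(struct pbuf *b)
{
	b->next = NULL;
	if (pool_tail)
		pool_tail->next = b;
	else
		pool_head = b;
	pool_tail = b;
}

/* Allocate a batch of buffers; fails only if none could be allocated. */
static int pool_refill(void)
{
	int i;

	for (i = 0; i < POOL_BATCH; i++) {
		struct pbuf *b = malloc(sizeof(*b));

		if (!b)
			return i ? 0 : -1;
		pool_put(b);
	}
	return 0;
}

/* Take the buffer at the head, refilling the list when it is empty. */
static struct pbuf *pool_get(void)
{
	struct pbuf *b;

	if (!pool_head && pool_refill() < 0)
		return NULL;

	b = pool_head;
	pool_head = b->next;
	if (!pool_head)
		pool_tail = NULL;
	return b;
}
```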

dump: Don't close pid-proc in vain · 1a2e6cbd

Pavel Emelyanov authored Sep 22, 2014

The open_pid_proc engine knows how to cache per-pid
descriptors by itself. There is no need to close it by hand.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

1a2e6cbd

proc: Keep /proc/self cached separately from /proc/pid · abeae267

Pavel Emelyanov authored Sep 23, 2014

When dumping tasks we do a lot of open_proc()-s and to
speed this up the /proc/pid directory is opened first
and the fd is kept cached. So next open_proc()-s do just
openat(cached_fd, name).

The thing is that we sometimes call open_proc(PROC_SELF)
in between and proc helpers cache the /proc/self too. As
a result we have a bunch of

  open(/proc/pid)
  close()
  open(/proc/self)
  close()

see-saw-s in the middle of dumping tasks.

To fix this we may cache the /proc/self separately from
the /proc/pid descriptor. This eliminates quite a lot
of pointless open-s and close-s.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

abeae267
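
A minimal sketch of the two-slot caching described above, assuming a Linux /proc. The PROC_SELF value, the function names and the nr_dir_opens counter are illustrative only and do not mirror CRIU's actual helpers.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PROC_SELF 0		/* hypothetical marker meaning /proc/self */

static int self_fd = -1;	/* cached /proc/self directory */
static int pid_fd = -1;		/* cached /proc/<pid> directory */
static pid_t pid_cached = -1;
static int nr_dir_opens;	/* counts real directory open()s */

/* Return a cached directory descriptor, opening it only on a miss. */
static int open_pid_dir(pid_t pid)
{
	char path[64];

	if (pid == PROC_SELF) {
		if (self_fd < 0) {
			self_fd = open("/proc/self", O_RDONLY);
			if (self_fd >= 0)
				nr_dir_opens++;
		}
		return self_fd;
	}

	if (pid_fd >= 0 && pid_cached == pid)
		return pid_fd;

	/* A different pid: drop the old per-pid slot only. */
	if (pid_fd >= 0)
		close(pid_fd);

	snprintf(path, sizeof(path), "/proc/%d", (int)pid);
	pid_fd = open(path, O_RDONLY);
	if (pid_fd >= 0) {
		pid_cached = pid;
		nr_dir_opens++;
	} else {
		pid_cached = -1;
	}
	return pid_fd;
}

/* Open /proc/<pid>/<name> relative to the cached directory. */
static int open_proc_file(pid_t pid, const char *name)
{
	int dir = open_pid_dir(pid);

	return dir < 0 ? -1 : openat(dir, name, O_RDONLY);
}
```

With a single shared slot, alternating between a pid and PROC_SELF would reopen a directory on every call; with two slots each directory is opened once.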

fd: Close caches proc-pid stuff before restoring files · 829d4332

Pavel Emelyanov authored Sep 23, 2014

We have a bug. If someone opens proc with open_pid_proc or alike
with PROC_SELF or a real PID before going to restore fds, then the
fd cached by proc helpers would sit in fd 0 (we close all
fds beforehand) and it may clash with restored fds.

We don't hit this right now simply due to being too lucky -- we
call open_proc(PROC_GEN) on "locks" which first closes the cached
per-pid descriptor and then reports back just the /proc one
which sits in service area.

But once we change this (next patch) things would get broken.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

829d4332

proc: Sanitate empty lines · 1c8ab40e

Pavel Emelyanov authored Sep 23, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

1c8ab40e

filemap: Get vma mnt_id early · e651a6eb

Pavel Emelyanov authored Sep 23, 2014

We have a, well, issue with how we calculate the vma's mnt_id.

Right now we get one via the criu-side file descriptor obtained by
opening the /proc/pid/map_files/ link. The problem is that these
descriptors are 'merged' or 'borrowed' by adjacent vmas from
previous ones. Thus, getting the mnt_id value for each of them
makes no sense -- these files are the same.

So move this mnt_id getting earlier into vma parsing code. This
brings a potential problem -- if we have two adjacent vmas
mapping the same inode (dev:ino pair) but living in different
mount namespaces -- this check would produce wrong result.
"Wrong" from the perspective that on restore correct file would
be opened from wrong namespace.

I propose to live with it, since this is not worse than the
--evasive-devices option, it's _very_ unlikely, but saves a lot
of openings.

Note that if an app switched mount namespace and then mapped
some new library (with dlopen), things would work correctly -- new
vmas will likely be non-adjacent and have a different dev:ino.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

e651a6eb

vma: Add comments about some dump fields of vma_area · f84d19e0

Pavel Emelyanov authored Sep 23, 2014

We have non-obvious handling of vm_file_fd/vm_socket_id
pair and the vma->file_borrowed.

Comment these two in the structure.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

f84d19e0

vma: Reshuffle the struct vma_area · cf8c9ae8

Pavel Emelyanov authored Sep 23, 2014

We have some fields, that are dump-only and some that
are restore only (quite a lot of them actually).

Reshuffle them on the vma_area to explicitly show which
one is which. And rename some of them for easier grep.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

cf8c9ae8

24 Sep, 2014 2 commits

mntns: don't dump criu's namespace · 92ee1233

Andrey Vagin authored Sep 24, 2014

Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

92ee1233

bfd: move the optimization in a proper place · 606bc93a

Andrey Vagin authored Sep 24, 2014

Currently this optimization skips unscanned data
and doesn't work. Let's skip scanned data only.

Reported-by: Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

606bc93a

23 Sep, 2014 10 commits

proc_parse: Rework timers parser to use bfd · cfce460b

Pavel Emelyanov authored Sep 19, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

cfce460b

proc_parse: Rework smaps parser to use bfd · cc4a67b3

Pavel Emelyanov authored Sep 19, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

cc4a67b3

proc_parse: Rework fdinfo parser to use bfd · 2c8af6b8

Pavel Emelyanov authored Sep 19, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

2c8af6b8

bfd: File-descriptors based buffered read · 53771adc

Pavel Emelyanov authored Sep 19, 2014

This sounds strange, but we kinda need one. Here's the
justification for that.

We heavily open /proc/pid/foo files. To speed things up we
do pid_dir = open("/proc/pid") then openat(pid_dir, foo).
This really saves time on big trees, up to 10%.

Sometimes we need line-by-line scan of these files, and for
that we currently use the fdopen() call. It takes a file
descriptor (obtained with openat from above) and wraps one
into a FILE*.

The problem with the latter is that fdopen _always_ mmap()s
a buffer for reads and this buffer always (!) gets unmapped
back on fclose(). This pair of mmap() + munmap() eats time
on big trees, up to 10% in my experiments with p.haul tests.

The situation is made even worse by the fact that each fgets
on the file results in a new page allocated in the kernel
(since the mapping is new). And also this fgets copies data,
which is not a big deal, but for e.g. the smaps file this results
in ~8K bytes being just copied around.

Having said that, here's a small but fast way of reading a
descriptor line-by-line using a big buffer to reduce the
number of read()s.

After all per-task fopen_proc()-s get reworked on this engine
(next 4 patches) the results on p.haul test would be

        Syscall     Calls      Time (% of time)
Now:
           mmap:      463  0.012033 (3.2%)
         munmap:      447  0.014473 (3.9%)
Patched:
         munmap:       57  0.002106 (0.6%)
           mmap:       74  0.002286 (0.7%)

The amount of read()s and open()s doesn't change since FILE*
also uses page-sized buffer for reading.

Also this eliminates some amount of lseek()s and fstat()s
the fdopen() does every time to catch up with file position
and to determine what sort of buffering it should use (for
terminals it's \n-driven, for files it's not).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

53771adc
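
A simplified sketch of the technique this commit describes: reading a descriptor line by line through one large buffer, handing out pointers into the buffer instead of copying. It is not the bfd implementation; struct lbuf, lbuf_init(), lbuf_readline() and LBUF_SIZE are hypothetical names, the buffer lives inside the struct rather than in a shared mapping, and EINTR handling is left out.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define LBUF_SIZE 4096	/* assumed buffer size */

struct lbuf {
	int fd;
	char mem[LBUF_SIZE + 1];	/* +1 leaves room for a terminator */
	char *data;			/* start of the unread bytes */
	size_t sz;			/* number of unread bytes */
	int eof;
};

static void lbuf_init(struct lbuf *b, int fd)
{
	b->fd = fd;
	b->data = b->mem;
	b->sz = 0;
	b->eof = 0;
}

/*
 * Return the next line, NUL-terminated and without the '\n'.
 * The pointer refers to the internal buffer and is valid until the
 * next call. Returns NULL at end of file, on a read error, or when
 * a line does not fit into the buffer.
 */
static char *lbuf_readline(struct lbuf *b)
{
	for (;;) {
		char *line = b->data;
		char *nl = memchr(b->data, '\n', b->sz);
		ssize_t n;

		if (nl) {
			*nl = '\0';
			b->sz -= (size_t)(nl + 1 - line);
			b->data = nl + 1;
			return line;
		}

		if (b->eof) {
			if (!b->sz)
				return NULL;
			/* Last line without a trailing newline. */
			line[b->sz] = '\0';
			b->data += b->sz;
			b->sz = 0;
			return line;
		}

		if (b->sz == LBUF_SIZE)
			return NULL;	/* line too long for the buffer */

		/* Move the partial line to the front and refill the rest. */
		memmove(b->mem, b->data, b->sz);
		b->data = b->mem;
		n = read(b->fd, b->mem + b->sz, LBUF_SIZE - b->sz);
		if (n < 0)
			return NULL;
		if (n == 0)
			b->eof = 1;
		else
			b->sz += (size_t)n;
	}
}
```

Because the caller gets a pointer into the buffer, no per-line copy is made, and one read() can serve many lines.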

ns: Dump namespaces in parallel · b30f0f01

Pavel authored Sep 22, 2014

The main reason for this is that dumping a namespace has a lot of
points where the process just waits for something. At the same
time the criu process waits for the ns dumper and doesn't dump
others.

A great example of waiting for something is the setns syscall.
Very often it calls synchronize_rcu(), which can take quite long.
Let other processes do something useful in the meantime.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>

b30f0f01

remap: don't add remaps for a dead pid more than once · bbe3f941

Tycho Andersen authored Sep 19, 2014

Unless we seek and re-read the PB images, the only way I can see to do this is
to keep a list of the previously seen dead pids and check if a new remap is in
that list.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

bbe3f941
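
The "list of previously seen dead pids" can be sketched as a growable array with a linear lookup. This is an illustration under assumptions: the function name, the return convention and the use of plain realloc() are not taken from the CRIU sources.

```c
#include <stdlib.h>
#include <sys/types.h>

static pid_t *dead_pids;	/* pids already handled */
static size_t nr_dead_pids;

/*
 * Returns 1 if the pid was seen before, 0 if it was just recorded,
 * -1 if the list could not be grown.
 */
static int have_seen_dead_pid_sketch(pid_t pid)
{
	pid_t *tmp;
	size_t i;

	for (i = 0; i < nr_dead_pids; i++)
		if (dead_pids[i] == pid)
			return 1;

	tmp = realloc(dead_pids, (nr_dead_pids + 1) * sizeof(*dead_pids));
	if (!tmp)
		return -1;

	dead_pids = tmp;
	dead_pids[nr_dead_pids++] = pid;
	return 0;
}
```

A caller would add a remap only when the function returns 0, so each dead pid gets at most one.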

remap: don't try to remap other files in /proc · 80c4e86e

Tycho Andersen authored Sep 22, 2014

We can't remap these files correctly anyway, so we should just return success
if we find one of these files to remap.

v2: don't try to remap accessible files in /proc
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

80c4e86e

mnt: Shorten the mntns dumping loop · 867bcd21

Pavel authored Sep 22, 2014

We currently have all mountinfo-s from all mnt namespaces collected
in one big list. On dump we scan through it to find the namespaces
we need to dump.

This can be optimized by walking the list of namespaces instead.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>

867bcd21

x86: don't call wait4 as waitpid · 6382ed43

Andrey Vagin authored Sep 22, 2014

Fix compilation on ARM:
pie/restorer.c: In function ‘wait_helpers’:
pie/restorer.c:728:3: error: implicit declaration of function ‘sys_waitpid’ [-Werror=implicit-function-declaration]
cc1: all warnings being treated as errors

Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

6382ed43

ptrace: Factor out pie stopping code · ab50f6ac

Pavel Emelyanov authored Sep 22, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrey Vagin <avagin@parallels.com>

ab50f6ac

22 Sep, 2014 7 commits

ptrace: flush breakpoints · 48fcc799

Andrey Vagin authored Sep 19, 2014

Unfortunately the kernel doesn't flush hw breakpoints on
detaching ptrace. If a breakpoint is triggered without ptrace, the
task will be killed by SIGTRAP.

Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

48fcc799

test: rpc: test page-server · 3b2ab35b

Ruslan Kuprieiev authored Sep 18, 2014

Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

3b2ab35b

service: page-server: allow requesting page-server without setting any ps_info · a483cbda

Ruslan Kuprieiev authored Sep 18, 2014

Since we can now return the port to the user in the autobind case, it's ok
to request page-server without setting ps_info.
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

a483cbda

service: page-server: return port back to user, v2 · 6b631faa

Ruslan Kuprieiev authored Sep 21, 2014

Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

6b631faa

page-server: assign opts.ps_port to sin_port in autobind case · 45fe2c9d

Ruslan Kuprieiev authored Sep 18, 2014

Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

45fe2c9d

ptrace: say to parasite_stop_on_syscall where is we now · 13fc78b9

Andrew Vagin authored Sep 19, 2014

On restore, parasite_stop_on_syscall() can be called after PTRACE_SYSCALL
and after a breakpoint. parasite_stop_on_syscall() must be called only
after PTRACE_SYSCALL, so all tests with a single process get stuck.

Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

13fc78b9

zdtm: don't call mount_cgroups a few times concurrently · eda6b3d0

Andrey Vagin authored Sep 19, 2014

Here is a race now:
./zdtm.sh --ct -d -C -x static/cgroup02 ns/static/pipe02 &> ns_static_pipe02.log || \
{ flock Makefile cat ns_static_pipe02.log; exit 1; }
./zdtm.sh --ct -d -C -x static/cgroup02 ns/static/busyloop00 &> ns_static_busyloop00.log || \
{ flock Makefile cat ns_static_busyloop00.log; exit 1; }
make[3]: `zdtm_ct' is up to date.
mkdir: cannot create directory ‘zdtm.GgIjUS/holder’: File exists

Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

eda6b3d0

19 Sep, 2014 2 commits

restore: use breakpoints instead of tracing syscalls · 248fc315

Andrey Vagin authored Sep 17, 2014

Currently CRIU traces syscalls to catch the moment when sigreturn() is
called. Now we trace recv(cmd), close(logfd), close(cmdfd), sigreturn().

We can reduce the number of steps by using hw breakpoints. A breakpoint is
set before sigreturn, so we will need to trace only it.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

248fc315

dump: use breakpoints instead of tracing syscalls (v2) · 0b1b8151

Andrey Vagin authored Sep 17, 2014

Currently CRIU traces syscalls to catch the moment when sigreturn() is
called. Now we trace recv(cmd), close(logfd), close(cmdfd), sigreturn().

We can reduce the number of steps by using hw breakpoints. A breakpoint is
set before sigreturn, so we will need to trace only it.

v2: In the first version the breakpoint was set after sigreturn. In this
case we have a problem with signals. If a process has pending signals,
it will start to process them after exiting from sigreturn(), but before
returning to userspace. So the breakpoint will not be triggered.

Finally, here are a few numbers on how we catch sigreturn.
Before this patch criu executes 36 syscalls and gets 12 signals.
With this patch criu executes 18 syscalls and gets 5 signals.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

0b1b8151