Commits · 3c7d01f6a7eb2ef4870d4fe5eaa9acb453bc8b57 · zhul / criu

01 Oct, 2014 5 commits

Pavel Emelyanov authored Sep 29, 2014

The setns() syscall (called by switch_ns()) can be extremely
slow. If we call it two or more times from the same task, the
kernel will synchronously go into a very slow routine called
synchronize_rcu() while trying to put a reference on the old namespaces.

To avoid doing this more than once, I propose to create all
per-ns sockets in one place with one setns call. In this
patch only the nl diag socket used to collect other sockets
is created this way.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

3c7d01f6

net: Do walk net namespaces to collect · 4f9acb6a

Pavel Emelyanov authored Sep 29, 2014

Right now we don't support multiple net namespaces,
but some day we will. Other than this, we have logic
to distinguish cases with no namespaces vs one namespace,
so this walking already makes sense.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

4f9acb6a

ns: Introduce collect_net_namespaces · 7327ffe6

Pavel Emelyanov authored Sep 29, 2014

And move sockets collection there.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

7327ffe6

ns: Introduce collect_namespaces routine · 01f6f890

Pavel Emelyanov authored Sep 29, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

01f6f890

irmap: Get root mntfd before releasing tasks on predump · b4768792

Pavel Emelyanov authored Sep 30, 2014

We have a use-after-free in predump code:

First, free_pstree() is called in pre_dump_tasks(); then we
go to irmap_predump_run(), which may call lookup_irmap(),
which, in turn, dereferences root_item to get the root
mount ns fd.

But the problem is bigger than that. After we've released the
tasks (done before freeing pstree on predump) we can no longer
access them by PIDs, so keeping the root-item after irmap
scan is not a fix.

The fix is to get the root fd before releasing the tasks and use
it in the irmap scanner.

Caught recently on iterative inotify_irmap test.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>

b4768792

30 Sep, 2014 21 commits

zdtm/tempfs: set mode for O_CREAT · 66944032

Andrey Vagin authored Sep 22, 2014

man 2 open:
"""
mode specifies the permissions to use in case a new file is cre‐
ated.  This argument must be supplied when O_CREAT or O_TMPFILE
is specified in flags;
"""

Cc: Konstantin Neumoin <kneumoin@parallels.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

66944032
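
As a minimal sketch of the rule quoted from the man page above (the helper name `create_file` is an illustration, not code from the zdtm test), passing the mode explicitly whenever O_CREAT is set could look like this:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) a file with an explicit permission mode.
 * With O_CREAT, open() takes a third argument; leaving it out makes
 * the permissions of the new file unpredictable. */
static int create_file(const char *path, mode_t mode)
{
	assert(path != NULL);
	return open(path, O_CREAT | O_WRONLY | O_TRUNC, mode);
}
```

The resulting permissions are still filtered by the process umask, so the mode passed here is an upper bound.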

zdtm/cwd01: don't forget to set '\0' after readlink() · c7390d2d

Andrey Vagin authored Sep 22, 2014

Reported-by: Konstantin Neumoin <kneumoin@parallels.com>
Cc: Konstantin Neumoin <kneumoin@parallels.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

c7390d2d
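
The pitfall named in the commit title is that readlink() does not append a terminating '\0'. A small sketch of a wrapper that does so (the name `readlink_str` is hypothetical and not taken from the test):

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a symlink target into buf and NUL-terminate it.
 * readlink() returns the number of bytes stored and never writes a
 * terminator, so one byte of the buffer is reserved for it. */
static ssize_t readlink_str(const char *path, char *buf, size_t size)
{
	ssize_t len;

	assert(buf != NULL && size > 0);
	len = readlink(path, buf, size - 1);
	if (len < 0)
		return -1;
	buf[len] = '\0';
	return len;
}
```

A target longer than `size - 1` bytes is silently truncated here; a caller that cares can treat `len == size - 1` as a possible truncation.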

ns: Factor out namespace switching call · 8ac80915

Pavel Emelyanov authored Sep 22, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

8ac80915

criu: add .travis.yml (v3) · 3bc0936a

Andrey Vagin authored Sep 30, 2014

"""
Travis CI is configured by adding a file named .travis.yml, which is a
YAML format text file, to the root directory of the GitHub
repository.

Travis CI automatically detects when a commit has been made and pushed
to a GitHub repository that is using Travis CI, and each time this
happens, it will try to build the project and run tests.
""" https://en.wikipedia.org/wiki/Travis_CI

Currently Travis CI builds criu for x86_64 and ARM.

v2: move travis-ci.sh in scripts
v3: fix path to the script in the script
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

3bc0936a

bfd: Implement buffered reads · 8651f43b

Pavel Emelyanov authored Sep 29, 2014

The restore times look like

Before patch:
	  futex:      370  3.554482 (84.2%)
	 umount:       41  0.234796 (5.6%)
	   read:     4737  0.113987 (2.7%)
	recvmsg:       43  0.100083 (2.4%)
	  wait4:       10  0.033344 (0.8%)

After patch:
	  futex:      187  1.547642 (72.9%)
	 umount:       41  0.234595 (11.0%)
	recvmsg:       43  0.075738 (3.6%)
	  flock:       42  0.038696 (1.8%)
	  clone:       35  0.037699 (1.8%)

Most of the time we wait for other processes to restore,
but that's OK (would only affect parallel restore). And
we see that the read-s really do go away (down to 7th position).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

8651f43b

bfd: Implement buffered writes · b4640934

Pavel Emelyanov authored Sep 29, 2014

Dump times (top-5) look like

Before patch:
	writev:     1595  0.048337 (15.1%)
	openat:     1326  0.041976 (13.1%)
	 close:     1434  0.034661 (10.8%)
	  read:      988  0.028760 (9.0%)
	 wait4:      170  0.028271 (8.8%)

After patch:
	openat:     1326  0.040010 (16.4%)
	 close:     1434  0.030039 (12.3%)
	  read:      988  0.025827 (10.6%)
	 wait4:      170  0.025549 (10.5%)
	ptrace:      834  0.021624 (8.9%)

So the write-s drop out of the top list (down to 8th position).

Funny thing is that all object writes get merged with the
magic writes, so the total amount of write()-s (not writev-s)
in the strace remains intact :)
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

b4640934
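
The effect described above, where small writes are merged and reach the kernel as one syscall, can be sketched as follows. This is an illustration under assumptions: `struct write_buf`, `wb_write` and `wb_flush` are hypothetical names and do not reproduce the real bfd code.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define WBUF_SIZE 4096

/* Hypothetical write buffer; not the real bfd structure. */
struct write_buf {
	int fd;
	size_t used;		/* bytes queued in data[] */
	char data[WBUF_SIZE];
};

/* Write the whole range, retrying on short writes. */
static int write_all(int fd, const char *p, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0)
			return -1;
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Push everything queued so far to the descriptor. */
static int wb_flush(struct write_buf *w)
{
	int ret = write_all(w->fd, w->data, w->used);

	if (ret == 0)
		w->used = 0;
	return ret;
}

/* Queue bytes; the kernel is only involved when the buffer fills up. */
static int wb_write(struct write_buf *w, const void *src, size_t len)
{
	assert(w != NULL && src != NULL);
	if (len > WBUF_SIZE - w->used && wb_flush(w) < 0)
		return -1;
	if (len > WBUF_SIZE)	/* too big to buffer: write it through */
		return write_all(w->fd, src, len);
	memcpy(w->data + w->used, src, len);
	w->used += len;
	return 0;
}
```

With this shape, a magic value and the first object written right after it sit in the same buffer and go out in a single write() on flush.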

img: Prepare to use bfd engine · b90ae65c

Pavel Emelyanov authored Sep 29, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

b90ae65c

bfd: Rename fields · 67bbc7ea

Pavel Emelyanov authored Sep 29, 2014

For reads and writes the names pos and bleft would
have a strange meaning, so rename them into something more
appropriate.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

67bbc7ea

img: Mark unbufferred images · 166c58d5

Pavel Emelyanov authored Sep 29, 2014

We have some images that store raw data together with
the pb objects (and one that just stores raw data) and
use custom access to this. E.g. pipe-data images splice
data into them and sk-queue one lseeks the image for
queue packets.

For those, mixing buffered mode with raw access may lead
to trouble. Explicitly mark such images, so that the
buffering (next patches) handles them carefully.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

166c58d5

pb: Pass cr_img into image_name() · cf64851b

Pavel Emelyanov authored Sep 29, 2014

The pb_(read|write)-s will stop using plain fd soon.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

cf64851b

img: Introduce the struct cr_img · 295090c1

Pavel Emelyanov authored Sep 29, 2014

We want to have buffered images to speed up dump and,
slightly, restore. Right now we use plain file descriptors
to write and read images to/from. Making them buffered
cannot be gracefully done on plain fds, so introduce
a new class.

This will also help if (when?) we want to do more
complex changes with images, e.g. store them all in one
file or send them directly to the network.

For now the cr_img just contains one int _fd variable.

This patch changes the prototype of open_image() to
return struct cr_img *, pb_(read|write)* to accept one
and fixes the compilation of the rest of the code :)
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

295090c1
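
A rough sketch of the idea, assuming nothing beyond what the message states (a `struct cr_img` holding one `int _fd`, and an open routine that returns a pointer to it). The `ex_img_*` helpers are hypothetical names for illustration, not the real open_image() family:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Image handle; for now it only wraps a plain descriptor. */
struct cr_img {
	int _fd;
};

/* Hypothetical open helper: returns a handle, or NULL on failure. */
static struct cr_img *ex_img_open(const char *path, int flags, mode_t mode)
{
	struct cr_img *img = malloc(sizeof(*img));

	if (!img)
		return NULL;
	img->_fd = open(path, flags, mode);
	if (img->_fd < 0) {
		free(img);
		return NULL;
	}
	return img;
}

/* Hypothetical accessor for code that still needs the raw fd. */
static int ex_img_fd(const struct cr_img *img)
{
	assert(img != NULL);
	return img->_fd;
}

/* Hypothetical close helper: releases the descriptor and the handle. */
static void ex_img_close(struct cr_img *img)
{
	if (!img)
		return;
	close(img->_fd);
	free(img);
}
```

Hiding the descriptor behind a handle means buffering, or a different backing store, can later be added inside the structure without touching every caller again.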

Subject: [PATCH 07/14] pstree: Subblock for ids read on task restore · 0c5dc93b

Pavel Emelyanov authored Sep 29, 2014

Ugly, but it's for easier further patching.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

0c5dc93b

img: Don't return fd, return -1 instead · 35be2ee2

Pavel Emelyanov authored Sep 29, 2014

The same -- int-fd will soon go away, so return the
explicit int -1 instead of it.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

35be2ee2

img: Use errno when checking optional images open fail · 42821edc

Pavel Emelyanov authored Sep 29, 2014

There will be no int-fd soon, so this is one more
preparation for that.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

42821edc

img: Rename fdset -> imgset · 5f2a7ac2

Pavel Emelyanov authored Sep 29, 2014

Since we're going to switch from int-fd-s to class-image
soon the fdset name will not fit into the new terminology.

This patch is

 sed -e 's/fdset/imgset/g' -i *
 sed -e 's/imgset_fd/img_from_set/g' -i *
 git mv include/fdset.h include/imgset.h
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

5f2a7ac2

img: Move images IO helpers into .c file · 1cb690dd

Pavel Emelyanov authored Sep 29, 2014

This is to simplify the change from int fd to more
generic image class data-type.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

1cb690dd

rst: Don't use write_img_buf for setting last_pid sysctl · 9d9ac53c

Pavel Emelyanov authored Sep 29, 2014

The write_img_buf will be used only for images writing, while
in this place we just have a raw file descriptor.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

9d9ac53c

img: Keep the copy of flags value in open_image_at · 03482f69

Pavel Emelyanov authored Sep 29, 2014

We drop the O_OPT from flags and will drop one more. So
instead of a set of bools let's keep a copy of the flags
at hand.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

03482f69

files-reg: Simplify have_seen_dead_pid · 78bbb0a1

Cyrill Gorcunov authored Sep 23, 2014

We have a special helper xrealloc_safe for reallocs.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

78bbb0a1

cgroup: Use xmalloc in rewrite_cgsets · 1ef50607

Cyrill Gorcunov authored Sep 23, 2014

We prefer x* helpers because they print an error
in case of allocation failures.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

1ef50607
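
To illustrate what an allocation helper that reports failures might look like, here is a hypothetical stand-in. `ex_xmalloc` and `EX_XMALLOC` are invented names and the message format is an assumption; this is not the project's actual xmalloc:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an x* helper: same contract as malloc(),
 * plus an error message naming the call site when allocation fails. */
static void *ex_xmalloc(size_t size, const char *file, int line)
{
	void *p;

	assert(file != NULL);
	p = malloc(size);
	if (!p)
		fprintf(stderr, "%s:%d: can't allocate %zu bytes\n",
			file, line, size);
	return p;
}

/* Capture the caller's location automatically. */
#define EX_XMALLOC(size) ex_xmalloc((size), __FILE__, __LINE__)
```

Callers still have to check for NULL; the helper only guarantees that a failure does not go unreported.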

bfd: timerfd -- Fix parsing typo · c01efda8

Cyrill Gorcunov authored Sep 30, 2014

While converting the reading of the data stream
to bfd, the @buf member was left untouched, which led
to incorrect data being read. Fix it by setting up
the proper one, i.e. @str itself; otherwise dumping
of timerfd files fails.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

c01efda8

29 Sep, 2014 8 commits

bfd: Multiple buffers management (v2) · 5eb39aad

Pavel Emelyanov authored Sep 29, 2014

I plan to re-use the bfd engine for images buffering. Right
now this engine uses one buffer that gets reused by all
bfdopen()-s. This works for current usage (one-by-one proc
files access), but for images we'll need more buffers.

So this patch just puts buffers in a list and organizes a
stupid R-R with refill on it.

v2:
  Check for buffer allocation errors
  Print buffer mem pointer in debug
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>

5eb39aad

dump: Don't close pid-proc in vain · 1a2e6cbd

Pavel Emelyanov authored Sep 22, 2014

The open_pid_proc engine itself knows how to cache
per-pid descriptors. No need to close it by hand.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

1a2e6cbd

proc: Keep /proc/self cached separately from /proc/pid · abeae267

Pavel Emelyanov authored Sep 23, 2014

When dumping tasks we do a lot of open_proc()-s and to
speed this up the /proc/pid directory is opened first
and the fd is kept cached. So next open_proc()-s do just
openat(cached_fd, name).

The thing is that we sometimes call open_proc(PROC_SELF)
in between and proc helpers cache the /proc/self too. As
a result we have a bunch of

  open(/proc/pid)
  close()
  open(/proc/self)
  close()

see-saw-s in the middle of dumping tasks.

To fix this we may cache the /proc/self separately from
the /proc/pid descriptor. This eliminates quite a lot
of pointless open-s and close-s.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

abeae267

fd: Close caches proc-pid stuff before restoring files · 829d4332

Pavel Emelyanov authored Sep 23, 2014

We have a bug. If someone opens proc with open_pid_proc or the like
with PROC_SELF or a real PID before going to restore fds, then the
descriptor cached by proc helpers would land in fd 0 (we close all
fds beforehand) and it may clash with restored fds.

We don't hit this right now simply due to being lucky -- we
call open_proc(PROC_GEN) on "locks", which first closes the cached
per-pid descriptor and then reports back just the /proc one,
which sits in the service area.

But once we change this (next patch) things would get broken.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

829d4332

proc: Sanitate empty lines · 1c8ab40e

Pavel Emelyanov authored Sep 23, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

1c8ab40e

filemap: Get vma mnt_id early · e651a6eb

Pavel Emelyanov authored Sep 23, 2014

We have a, well, issue with how we calculate the vma's mnt_id.

Right now we get one via the criu-side file descriptor that it got by
opening the /proc/pid/map_files/ link. The problem is that these
descriptors are 'merged' or 'borrowed' by adjacent vmas from
previous ones. Thus, getting the mnt_id value for each of them
makes no sense -- these files are the same.

So move this mnt_id getting earlier into vma parsing code. This
brings a potential problem -- if we have two adjacent vmas
mapping the same inode (dev:ino pair) but living in different
mount namespaces -- this check would produce a wrong result.
"Wrong" from the perspective that on restore the correct file would
be opened from the wrong namespace.

I propose to live with it, since this is not worse than the
--evasive-devices option, it's _very_ unlikely, but saves a lot
of openings.

Note that in case the app switched mount namespace and then mapped
some new library (with dlopen) things would work correctly -- new
vmas will likely be not adjacent and for different dev:ino.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

e651a6eb

vma: Add comments about some dump fields of vma_area · f84d19e0

Pavel Emelyanov authored Sep 23, 2014

We have non-obvious handling of vm_file_fd/vm_socket_id
pair and the vma->file_borrowed.

Comment these two in the structure.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

f84d19e0

vma: Reshuffle the struct vma_area · cf8c9ae8

Pavel Emelyanov authored Sep 23, 2014

We have some fields, that are dump-only and some that
are restore only (quite a lot of them actually).

Reshuffle them on the vma_area to explicitly show which
one is which. And rename some of them for easier grep.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

cf8c9ae8

24 Sep, 2014 2 commits

mntns: don't dump criu's namespace · 92ee1233

Andrey Vagin authored Sep 24, 2014

Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

92ee1233

bfd: move the optimization in a proper place · 606bc93a

Andrey Vagin authored Sep 24, 2014

Currently this optimization skips unscanned data
and doesn't work. Let's skip scanned data only.

Reported-by: Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

606bc93a

23 Sep, 2014 4 commits

proc_parse: Rework timers parser to use bfd · cfce460b

Pavel Emelyanov authored Sep 19, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

cfce460b

proc_parse: Rework smaps parser to use bfd · cc4a67b3

Pavel Emelyanov authored Sep 19, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

cc4a67b3

proc_parse: Rework fdinfo parser to use bfd · 2c8af6b8

Pavel Emelyanov authored Sep 19, 2014

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

2c8af6b8

bfd: File-descriptors based buffered read · 53771adc

Pavel Emelyanov authored Sep 19, 2014

This sounds strange, but we kinda need one. Here's the
justification for that.

We heavily open /proc/pid/foo files. To speed things up we
do pid_dir = open("/proc/pid") then openat(pid_dir, foo).
This really saves time on big trees, up to 10%.

Sometimes we need line-by-line scan of these files, and for
that we currently use the fdopen() call. It takes a file
descriptor (obtained with openat from above) and wraps one
into a FILE*.

The problem with the latter is that fdopen _always_ mmap()s
a buffer for reads and this buffer always (!) gets unmapped
back on fclose(). This pair of mmap() + munmap() eats time
on big trees, up to 10% in my experiments with p.haul tests.

The situation is made even worse by the fact that each fgets
on the file results in a new page allocated in the kernel
(since the mapping is new). And also this fgets copies data,
which is not a big deal, but for e.g. the smaps file this results
in ~8K bytes being just copied around.

Having said that, here's a small but fast way of reading a
descriptor line-by-line using a big buffer to reduce the
amount of read()s.

After all per-task fopen_proc()-s get reworked on this engine
(next 4 patches) the results on p.haul test would be

        Syscall     Calls      Time (% of time)
Now:
           mmap:      463  0.012033 (3.2%)
         munmap:      447  0.014473 (3.9%)
Patched:
         munmap:       57  0.002106 (0.6%)
           mmap:       74  0.002286 (0.7%)

The amount of read()s and open()s doesn't change since FILE*
also uses page-sized buffer for reading.

Also this eliminates some amount of lseek()s and fstat()s
the fdopen() does every time to catch up with file position
and to determine what sort of buffering it should use (for
terminals it's \n-driven, for files it's not).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

53771adc
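
The line-by-line reading over a descriptor described in this commit could be sketched roughly as below. This is a simplified illustration, not the actual bfd code: `struct line_reader`, `lr_init` and `lr_readline` are hypothetical names, and the buffer lives inside the structure instead of being a reusable mapping.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define LBUF_SIZE 4096

/* Hypothetical buffered reader state; not the real bfd structure. */
struct line_reader {
	int fd;
	char buf[LBUF_SIZE + 1];	/* +1 leaves room for a terminator at EOF */
	size_t start;			/* offset of the first unconsumed byte */
	size_t len;			/* number of unconsumed bytes */
};

static void lr_init(struct line_reader *r, int fd)
{
	assert(r != NULL);
	r->fd = fd;
	r->start = 0;
	r->len = 0;
}

/* Return the next line, NUL-terminated with the newline stripped, or
 * NULL on EOF, read error, or a line that does not fit the buffer.
 * The pointer refers to r->buf and is valid until the next call. */
static char *lr_readline(struct line_reader *r)
{
	for (;;) {
		char *line = r->buf + r->start;
		char *nl = memchr(line, '\n', r->len);
		ssize_t n;

		if (nl) {
			size_t used = (size_t)(nl - line) + 1;

			*nl = '\0';
			r->start += used;
			r->len -= used;
			return line;	/* no copy: points into the buffer */
		}

		/* No complete line: move the tail to the front and refill. */
		memmove(r->buf, line, r->len);
		r->start = 0;
		if (r->len == LBUF_SIZE)
			return NULL;

		n = read(r->fd, r->buf + r->len, LBUF_SIZE - r->len);
		if (n < 0)
			return NULL;
		if (n == 0) {
			if (r->len == 0)
				return NULL;	/* clean EOF */
			/* Last line without a trailing newline. */
			r->buf[r->len] = '\0';
			r->len = 0;
			return r->buf;
		}
		r->len += (size_t)n;
	}
}
```

Returning pointers into the buffer avoids the extra copy that fgets() makes, and reusing one buffer across files avoids the per-file mmap()/munmap() pair the message measures.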