Try to include userfaultfd with criu (part 2)

This is a first attempt to include userfaultfd in criu. Right now it
still requires a "normal" checkpoint. After checkpointing, the
application can be restored with the help of userfaultfd.

All restored pages with MAP_ANONYMOUS and MAP_PRIVATE set are marked as
being handled by userfaultfd.

As soon as the process is restored it blocks on the first memory access
and waits for pages to be transferred via userfaultfd.

To handle the required pages a new criu command has been added. For a
userfaultfd-supported restore the first step is to start the
'lazy-pages' server:

  criu lazy-pages -v4 -D /tmp/3/ --address /tmp/userfault.socket

This waits on a unix domain socket (defined using the --address option)
to receive a userfaultfd file descriptor from a '--lazy-pages' enabled
'criu restore':

  criu restore -D /tmp/3 -j -v4 --lazy-pages \
    --address /tmp/userfault.socket

In the first step the VDSO pages are pushed from the lazy-pages server
into the restored process. After that the lazy-pages server waits on the
UFFD FD for a UFFD requested page. If no requests are received for
5 seconds, the lazy-pages server switches into a mode where the
remaining, non-transferred pages are copied into the destination
process. After all remaining pages have been copied the lazy-pages
server exits.

The first page that usually is requested is a VDSO page. The process
currently used for restoring has two VDSO pages, but only one is
requested via userfaultfd. In the second part, where the remaining pages
are copied into the process, the second VDSO page is also copied into
the process as it has not been requested previously. Unfortunately, even
though this page has not been requested before, it is not accepted by
userfaultfd: EINVAL is returned. The reason for EINVAL is not
understood, and therefore the VDSO pages are copied into the process
first; the server then switches to request mode and copies the pages
which are requested via userfaultfd.
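The wait-for-request step with its 5-second idle timeout can be sketched
with poll(). This is only an illustration, not the code from this patch:
the names wait_for_request() and enum wait_result are made up for the
sketch, and any readable file descriptor stands in for the UFFD FD.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch (not CRIU names). */
enum wait_result { WAIT_ERROR = -1, WAIT_IDLE = 0, WAIT_REQUEST = 1 };

/*
 * Wait until fd becomes readable or timeout_ms elapses.
 * In the scenario described above, fd would be the userfaultfd
 * descriptor and timeout_ms would be 5000.
 */
static enum wait_result wait_for_request(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc;

	assert(fd >= 0);

	/* Retry if a signal interrupts poll(); this restarts the timeout. */
	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc < 0)
		return WAIT_ERROR;
	if (rc == 0)
		return WAIT_IDLE;	/* idle: caller switches to copy mode */

	return (pfd.revents & POLLIN) ? WAIT_REQUEST : WAIT_ERROR;
}
```

A caller would loop on wait_for_request(uffd, 5000), serve one page per
WAIT_REQUEST, and start copying the remaining pages on WAIT_IDLE.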
To decide at which point the VDSO pages can be copied into the process,
the lazy-pages server is currently waiting for the first page requested
via userfaultfd. This is one of the VDSO pages. To avoid copying a page
a second time, which is unnecessary and not possible, there is now a
check to see if the page has been transferred previously.

The use case of userfaultfd with a checkpointed process on a remote
machine will probably benefit from the current work related to
image-cache and image-proxy.

For the final implementation it would be nice to have a restore running
in uffd mode on one system which requests the memory pages over the
network from another system which is running 'criu checkpoint' also in
uffd mode. This way the pages need to be copied only 'once' from the
checkpoint process to the uffd restore process.

TODO:
  * Still contains many debug outputs which need to be cleaned up.
  * Maybe transfer the dump directory FD also via unix domain sockets so
    that the 'uffd'/'lazy-pages' server can keep running without the
    need to specify the dump directory with '-D'
  * Keep the lazy-pages server running after all pages have been
    transferred and start waiting for new connections to serve.
  * Resurrect the non-cooperative patch set, as once the restored task
    calls fork() or mremap() the whole thing becomes broken.
  * Figure out if current VDSO handling is correct.
  * Figure out when and how zero pages need to be inserted via uffd.
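The "has this page been transferred previously" check mentioned above
could be kept in a per-range bitmap. The sketch below is an assumption
about one possible bookkeeping scheme, not the data structure used in
this patch; struct xfer_map, its helpers and the fixed 4 KiB page size
are all hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical bookkeeping: one bit per page of a tracked range. */
struct xfer_map {
	unsigned long base;	/* first tracked address, page aligned */
	unsigned long nr_pages;
	unsigned long *bits;
};

static int xfer_map_init(struct xfer_map *m, unsigned long base,
			 unsigned long nr_pages)
{
	m->base = base;
	m->nr_pages = nr_pages;
	m->bits = calloc((nr_pages + BITS_PER_WORD - 1) / BITS_PER_WORD,
			 sizeof(unsigned long));
	return m->bits ? 0 : -1;
}

/*
 * Mark the page containing addr as transferred.
 * Returns true if it was newly marked, false if it had already been
 * transferred (the caller then skips the copy).
 */
static bool xfer_map_mark(struct xfer_map *m, unsigned long addr)
{
	unsigned long idx = (addr - m->base) / SKETCH_PAGE_SIZE;
	unsigned long mask = 1UL << (idx % BITS_PER_WORD);
	unsigned long *word;

	assert(addr >= m->base && idx < m->nr_pages);

	word = &m->bits[idx / BITS_PER_WORD];
	if (*word & mask)
		return false;
	*word |= mask;
	return true;
}

static void xfer_map_free(struct xfer_map *m)
{
	free(m->bits);
	m->bits = NULL;
}
```

Both the request path and the final copy-everything pass would call
xfer_map_mark() before copying, so a page served on demand is never
copied again.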
v2:
  * provide option '--lazy-pages' to enable uffd style restore
  * use send_fd()/recv_fd() provided by criu (instead of own
    implementation)
  * do not install the uffd as service_fd
  * use named constants for MAP_ANONYMOUS
  * do not restore memory pages and then later mark them as uffd handled
  * remove function find_pages() to search in pages-<id>.img;
    now using criu functions to find the necessary pages;
    for each new page search the pages-<id>.img file is opened
  * only check the UFFDIO_API once
  * trying to protect uffd code by CONFIG_UFFD;
    use make UFFD=1 to compile criu with this patch

v3:
  * renamed the server mode from 'uffd' -> 'lazy-pages'
  * switched client and server roles transferring the UFFD FD
    * the criu part running in lazy-pages server mode is now
      waiting for connections
    * the criu restore process connects to the lazy-pages server
      to pass the UFFD FD
  * before UFFD copying anything else the VDSO pages are copied
    as it fails to copy unused VDSO pages once the process is running.
    this was necessary to be able to copy all pages.
  * if there are no more UFFD messages for 5 seconds the lazy-pages
    server switches in copy mode to copy all remaining pages, which
    have not been requested yet, into the restored process
  * check the UFFDIO_API at the correct place
  * close UFFD FD in the restorer to remove open UFFD FD in the
    restored process

v4:
  * removed unnecessary madvise() calls; it seemed necessary when
    first running tests with uffd; it actually is not necessary
  * auto-detect if build-system provides linux/userfaultfd.h
    header.
  * simplify unix domain socket setup and communication.
  * use --address to specify the location of the used
    unix domain socket.
v5:
  * split the userfaultfd patch in multiple smaller patches
  * introduced vma_can_be_lazy() function to check if a page can
    be handled by uffd
  * moved uffd related code from cr-restore.c to uffd.c
  * handle failure to register a memory page of the restored process
    with userfaultfd

v6:
  * get PID of to be restored process from the 'criu restore' process;
    first the PID is transferred and then the UFFD

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
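The v6 transfer order (first the PID as an int, then the UFFD file
descriptor) can be illustrated with a standalone sketch. The patch
itself uses criu's send_fd()/recv_fd(); the sketch_* helpers and the
*_pid_and_fd() wrappers below are hypothetical stand-ins built directly
on SCM_RIGHTS so the example compiles without the criu tree.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Hypothetical stand-in for criu's send_fd(): pass fd via SCM_RIGHTS. */
static int sketch_send_fd(int sock, int fd)
{
	char byte = 0;	/* one data byte carries the ancillary data */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical stand-in for criu's recv_fd(): returns the fd or -1. */
static int sketch_recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Restore side: PID first, then the descriptor. */
static int send_pid_and_fd(int sock, int pid, int fd)
{
	if (send(sock, &pid, sizeof(pid), 0) != (ssize_t)sizeof(pid))
		return -1;
	return sketch_send_fd(sock, fd);
}

/* Server side: read the PID, then receive the descriptor. */
static int recv_pid_and_fd(int sock, int *pid)
{
	assert(pid != NULL);
	if (recv(sock, pid, sizeof(*pid), MSG_WAITALL) != (ssize_t)sizeof(*pid))
		return -1;
	return sketch_recv_fd(sock);
}
```

Reading exactly sizeof(int) bytes for the PID leaves the following
one-byte message, which carries the descriptor, untouched for the
subsequent recvmsg() call.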

Commit 57891afc authored Mar 15, 2016 by Adrian Reber Committed by Andrei Vagin Sep 16, 2017
9 changed files
--- a/criu/cr-restore.c
+++ b/criu/cr-restore.c
@@ -49,6 +49,7 @@
 #include "proc_parse.h"
 #include "pie/restorer-blob.h"
 #include "crtools.h"
+#include "uffd.h"
 #include "namespaces.h"
 #include "mem.h"
 #include "mount.h"
@@ -3204,6 +3205,11 @@ static int sigreturn_restore(pid_t pid, struct task_restore_args *task_args, uns

 	strncpy(task_args->comm, core->tc->comm, sizeof(task_args->comm));

+	if (!opts.lazy_pages)
+		task_args->uffd = -1;
+	else
+		if (setup_uffd(task_args, pid) != 0)
+			goto err;

 	/*
 	 * Fill up per-thread data.

--- a/criu/crtools.c
+++ b/criu/crtools.c
@@ -279,6 +279,9 @@ int main(int argc, char *argv[], char *envp[])
 		{ "timeout",			required_argument,	0, 1072 },
 		{ "external",			required_argument,	0, 1073	},
 		{ "empty-ns",			required_argument,	0, 1074	},
+#ifdef CONFIG_HAS_UFFD
+		{ "lazy-pages",			no_argument,		0, 1076 },
+#endif
 		BOOL_OPT("extra", &opts.check_extra_features),
 		BOOL_OPT("experimental", &opts.check_experimental_features),
 		{ "all",			no_argument,		0, 1079	},
@@ -519,6 +522,11 @@ int main(int argc, char *argv[], char *envp[])
 		case 1072:
 			opts.timeout = atoi(optarg);
 			break;
+#ifdef CONFIG_HAS_UFFD
+		case 1076:
+			opts.lazy_pages = true;
+			break;
+#endif
 		case 'M':
 			{
 				char *aux;
@@ -815,6 +823,12 @@ usage:
 "                        restore making it the parent of the restored process\n"
 "  --freeze-cgroup       use cgroup freezer to collect processes\n"
 "  --weak-sysctls        skip restoring sysctls that are not available\n"
+#ifdef CONFIG_HAS_UFFD
+"  --lazy-pages          restore pages on demand\n"
+"                        this requires running a second instance of criu\n"
+"                        in lazy-pages mode: 'criu lazy-pages -D DIR'\n"
+"                        --lazy-pages and lazy-pages mode require userfaultfd\n"
+#endif
 "\n"
 "* External resources support:\n"
 "  --external RES        dump objects from this list as external resources:\n"

--- a/criu/include/cr_options.h
+++ b/criu/include/cr_options.h
@@ -107,6 +107,7 @@ struct cr_options {
 	unsigned int		timeout;
 	unsigned int		empty_ns;
 	int			tcp_skip_in_flight;
+	bool			lazy_pages;
 	char			*work_dir;

 	/*

--- a/criu/include/restorer.h
+++ b/criu/include/restorer.h
@@ -117,6 +117,8 @@ struct task_restore_args {
 	unsigned int			loglevel;
 	struct timeval			logstart;

+	int				uffd;
+
 	/* threads restoration */
 	int				nr_threads;		/* number of threads */
 	thread_restore_fcall_t		clone_restore_fn;	/* helper address for clone() call */

--- a/criu/include/uffd.h
+++ b/criu/include/uffd.h
@@ -2,6 +2,7 @@
 #define __CR_UFFD_H_

 #include "config.h"
+#include "restorer.h"

 #ifdef CONFIG_HAS_UFFD

@@ -11,6 +12,11 @@
 #ifndef __NR_userfaultfd
 #error "missing __NR_userfaultfd definition"
 #endif
+
+extern int setup_uffd(struct task_restore_args *task_args, int pid);
+#else
+static inline int setup_uffd(struct task_restore_args *task_args, int pid) { return 0; }
+
 #endif /* CONFIG_HAS_UFFD */

 #endif /* __CR_UFFD_H_ */
--- a/criu/include/vma.h
+++ b/criu/include/vma.h
@@ -6,6 +6,8 @@

 #include "images/vma.pb-c.h"

+#include <sys/mman.h>
+
 struct vm_area_list {
 	struct list_head	h;
 	unsigned		nr;
@@ -123,4 +125,11 @@ static inline struct vma_area *vma_next(struct vma_area *vma)
 	return list_entry(vma->list.next, struct vma_area, list);
 }

+static inline bool vma_entry_can_be_lazy(VmaEntry *e)
+{
+	return ((e->flags & MAP_ANONYMOUS) &&
+		(e->flags & MAP_PRIVATE) &&
+		!(vma_entry_is(e, VMA_AREA_VSYSCALL)));
+}
+
 #endif /* __CR_VMA_H__ */
--- a/criu/mem.c
+++ b/criu/mem.c
@@ -3,6 +3,7 @@
 #include <sys/mman.h>
 #include <errno.h>
 #include <fcntl.h>
+#include <sys/syscall.h>

 #include "types.h"
 #include "cr_options.h"
@@ -17,6 +18,7 @@
 #include "stats.h"
 #include "vma.h"
 #include "shmem.h"
+#include "uffd.h"
 #include "pstree.h"
 #include "restorer.h"
 #include "rst-malloc.h"
@@ -821,6 +823,7 @@ static int restore_priv_vma_content(struct pstree_item *t, struct page_read *pr)
 	unsigned int nr_shared = 0;
 	unsigned int nr_droped = 0;
 	unsigned int nr_compared = 0;
+	unsigned int nr_lazy = 0;
 	unsigned long va;

 	vma = list_first_entry(vmas, struct vma_area, list);
@@ -898,6 +901,17 @@ static int restore_priv_vma_content(struct pstree_item *t, struct page_read *pr)
 			p = decode_pointer((off) * PAGE_SIZE +
 					vma->premmaped_addr);

+			/*
+			 * This means that userfaultfd is used to load the pages
+			 * on demand.
+			 */
+			if (opts.lazy_pages && vma_entry_can_be_lazy(vma->e)) {
+				pr_debug("Lazy restore skips %#016"PRIx64"\n", vma->e->start);
+				pr.skip_pages(&pr, PAGE_SIZE);
+				nr_lazy++;
+				continue;
+			}
+
 			set_bit(off, vma->page_bitmap);
 			if (vma_inherited(vma)) {
 				clear_bit(off, vma->pvma->page_bitmap);
@@ -986,6 +1000,7 @@ err_read:
 	pr_info("nr_restored_pages: %d\n", nr_restored);
 	pr_info("nr_shared_pages:   %d\n", nr_shared);
 	pr_info("nr_droped_pages:   %d\n", nr_droped);
+	pr_info("nr_lazy:           %d\n", nr_lazy);

 	return 0;


--- a/criu/pie/restorer.c
+++ b/criu/pie/restorer.c
@@ -31,6 +31,7 @@
 #include "image.h"
 #include "sk-inet.h"
 #include "vma.h"
+#include "uffd.h"

 #include "common/lock.h"
 #include "restorer.h"
@@ -783,8 +784,50 @@ static void rst_tcp_socks_all(struct task_restore_args *ta)
 		rst_tcp_repair_off(&ta->tcp_socks[i]);
 }

-static int vma_remap(unsigned long src, unsigned long dst, unsigned long len)
+
+
+
+static int enable_uffd(int uffd, unsigned long addr, unsigned long len)
 {
+	/*
+	 * If uffd == -1, this means that userfaultfd is not enabled
+	 * or it is not available.
+	 */
+	if (uffd == -1)
+		return 0;
+#ifdef CONFIG_HAS_UFFD
+	int rc;
+	struct uffdio_register uffdio_register;
+	unsigned long expected_ioctls;
+
+	uffdio_register.range.start = addr;
+	uffdio_register.range.len = len;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	pr_info("lazy-pages: uffdio_register.range.start 0x%lx\n", (unsigned long) uffdio_register.range.start);
+	pr_info("lazy-pages: uffdio_register.len 0x%llx\n", uffdio_register.range.len);
+	rc = sys_ioctl(uffd, UFFDIO_REGISTER, &uffdio_register);
+	pr_info("lazy-pages: ioctl UFFDIO_REGISTER rc %d\n", rc);
+	pr_info("lazy-pages: uffdio_register.range.start 0x%lx\n", (unsigned long) uffdio_register.range.start);
+	pr_info("lazy-pages: uffdio_register.len 0x%llx\n", uffdio_register.range.len);
+	if (rc != 0)
+		return -1;
+
+	expected_ioctls = (1 << _UFFDIO_WAKE) | (1 << _UFFDIO_COPY) | (1 << _UFFDIO_ZEROPAGE);
+
+	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) {
+		pr_err("lazy-pages: unexpected missing uffd ioctl for anon memory\n");
+	}
+
+#endif
+	return 0;
+}
+
+
+static int vma_remap(VmaEntry *vma_entry, int uffd)
+{
+	unsigned long src = vma_premmaped_start(vma_entry);
+	unsigned long dst = vma_entry->start;
+	unsigned long len = vma_entry_len(vma_entry);
 	unsigned long guard = 0, tmp;

 	pr_info("Remap %lx->%lx len %lx\n", src, dst, len);
@@ -856,6 +899,18 @@ static int vma_remap(unsigned long src, unsigned long dst, unsigned long len)
 		return -1;
 	}

+	/*
+	 * If running in userfaultfd/lazy-pages mode pages with
+	 * MAP_ANONYMOUS and MAP_PRIVATE are remapped but without the
+	 * real content.
+	 * The function enable_uffd() marks the page(s) as userfaultfd
+	 * pages, so that the processes will hang until the memory is
+	 * injected via userfaultfd.
+	 */
+	if (vma_entry_can_be_lazy(vma_entry))
+		if (enable_uffd(uffd, dst, len) != 0)
+			return -1;
+
 	return 0;
 }

@@ -1134,6 +1189,10 @@ long __export_restore_task(struct task_restore_args *args)

 	pr_info("Switched to the restorer %d\n", my_pid);

+	if (args->uffd > -1) {
+		pr_debug("lazy-pages: uffd %d\n", args->uffd);
+	}
+
 	if (!args->compatible_mode) {
 		/* Compatible vDSO will be mapped, not moved */
 		if (vdso_do_park(&args->vdso_sym_rt,
@@ -1162,8 +1221,7 @@ long __export_restore_task(struct task_restore_args *args)
 		if (vma_entry->start > vma_entry->shmid)
 			break;

-		if (vma_remap(vma_premmaped_start(vma_entry),
-				vma_entry->start, vma_entry_len(vma_entry)))
+		if (vma_remap(vma_entry, args->uffd))
 			goto core_restore_end;
 	}

@@ -1180,11 +1238,20 @@ long __export_restore_task(struct task_restore_args *args)
 		if (vma_entry->start < vma_entry->shmid)
 			break;

-		if (vma_remap(vma_premmaped_start(vma_entry),
-				vma_entry->start, vma_entry_len(vma_entry)))
+		if (vma_remap(vma_entry, args->uffd))
 			goto core_restore_end;
 	}

+	if (args->uffd > -1) {
+		pr_debug("lazy-pages: closing uffd %d\n", args->uffd);
+		/*
+		 * All userfaultfd configuration has finished at this point.
+		 * Let's close the UFFD file descriptor, so that the restored
+		 * process does not keep an open UFFD FD forever.
+		 */
+		sys_close(args->uffd);
+	}
+
 	/*
 	 * OK, lets try to map new one.
 	 */

--- a/criu/uffd.c
+++ b/criu/uffd.c
@@ -32,6 +32,90 @@
 #undef  LOG_PREFIX
 #define LOG_PREFIX "lazy-pages: "

+static int send_uffd(int sendfd, int pid)
+{
+	int fd;
+	int len;
+	int ret = -1;
+	struct sockaddr_un sun;
+
+	if (!opts.addr) {
+		pr_info("Please specify a file name for the unix domain socket\n");
+		pr_info("used to communicate between the lazy-pages server\n");
+		pr_info("and the restore process. Use the --address option like\n");
+		pr_info("criu restore --lazy-pages --address /tmp/userfault.socket\n");
+		return -1;
+	}
+
+	if (sendfd < 0)
+		return -1;
+
+	if (strlen(opts.addr) >= sizeof(sun.sun_path)) {
+		return -1;
+	}
+
+	if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0)
+		return -1;
+
+	memset(&sun, 0, sizeof(sun));
+	sun.sun_family = AF_UNIX;
+	strcpy(sun.sun_path, opts.addr);
+	len = offsetof(struct sockaddr_un, sun_path) + strlen(opts.addr);
+	if (connect(fd, (struct sockaddr *) &sun, len) < 0) {
+		pr_perror("connect to %s failed", opts.addr);
+		goto out;
+	}
+
+	/* The "transfer protocol" is first the pid as int and then
+	 * the FD for UFFD */
+	pr_debug("Sending PID %d\n", pid);
+	if (send(fd, &pid, sizeof(pid), 0) < 0) {
+		pr_perror("PID sending error");
+		goto out;
+	}
+
+	if (send_fd(fd, NULL, 0, sendfd) < 0) {
+		pr_perror("send_fd error");
+		goto out;
+	}
+	ret = 0;
+out:
+	close(fd);
+	return ret;
+}
+
+/* This function is used by 'criu restore --lazy-pages' */
+int setup_uffd(struct task_restore_args *task_args, int pid)
+{
+	struct uffdio_api uffdio_api;
+	/*
+	 * Open userfaultfd FD which is passed to the restorer blob and
+	 * to a second process handling the userfaultfd page faults.
+	 */
+	task_args->uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+
+	/*
+	 * Check if the UFFD_API is the one which is expected
+	 */
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = 0;
+	if (ioctl(task_args->uffd, UFFDIO_API, &uffdio_api)) {
+		pr_err("Checking for UFFDIO_API failed.\n");
+		return -1;
+	}
+	if (uffdio_api.api != UFFD_API) {
+		pr_err("Result of looking up UFFDIO_API does not match: %Lu\n", uffdio_api.api);
+		return -1;
+	}
+
+	if (send_uffd(task_args->uffd, pid) < 0) {
+		close(task_args->uffd);
+		return -1;
+	}
+
+	return 0;
+}
+
 static int server_listen(struct sockaddr_un *saddr)
 {
 	int fd;
@@ -234,9 +318,7 @@ static int collect_uffd_pages(struct page_read *pr, struct list_head *uffd_list,
 			 * in the VMA list.
 			 */
 			if (base >= vma->e->start && base < vma->e->end) {
-				if ((vma->e->flags & MAP_ANONYMOUS) &&
-				    (vma->e->flags & MAP_PRIVATE) &&
-				    !(vma_area_is(vma, VMA_AREA_VSYSCALL))) {
+				if (vma_entry_can_be_lazy(vma->e)) {
 					uffd_page = true;
 					if (vma_area_is(vma, VMA_AREA_VDSO))
 						uffd_vdso = true;