= Userfaultfd =

== Objective ==

Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

For example, userfaults allow a proper and more efficient
implementation of the PROT_NONE+SIGSEGV trick.
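For comparison, the PROT_NONE+SIGSEGV trick mentioned above can be
sketched as follows. This is only an illustration of the older
technique, not code taken from any project: the names segv_handler
and sigsegv_paging_demo and the 0x5a fill byte are made up for the
example, and calling mprotect() from a signal handler is common
practice rather than something POSIX guarantees to be safe.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

/* SIGSEGV handler: make the faulting page accessible, then fill it. */
static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
	uintptr_t mask = ~((uintptr_t)page_size - 1);
	char *page = (char *)((uintptr_t)si->si_addr & mask);

	(void)sig;
	(void)ctx;
	if (mprotect(page, (size_t)page_size, PROT_READ | PROT_WRITE) != 0)
		_exit(1);	/* not one of our pages: give up */
	memset(page, 0x5a, (size_t)page_size);	/* "page in" the data */
}

/*
 * Map one inaccessible page, touch it, and let the handler supply the
 * contents. Returns the byte read, or -1 if the setup failed.
 */
int sigsegv_paging_demo(void)
{
	struct sigaction sa, old;
	unsigned char *area;
	int value;

	page_size = sysconf(_SC_PAGESIZE);
	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old) != 0)
		return -1;

	area = mmap(NULL, (size_t)page_size, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED) {
		sigaction(SIGSEGV, &old, NULL);
		return -1;
	}

	/* This read faults; the handler resolves it and the read restarts. */
	value = *(volatile unsigned char *)area;

	munmap(area, (size_t)page_size);
	sigaction(SIGSEGV, &old, NULL);
	return value;
}
```

Every fault resolved this way goes through signal delivery and an
mprotect() call on part of a mapping, which is the kind of overhead
userfaults are meant to avoid.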

== Design ==

Userfaults are delivered and resolved through the userfaultfd syscall.

The userfaultfd (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:

1) a read/POLLIN protocol to notify a userland thread of the faults
   happening

2) various UFFDIO_* ioctls that can manage the virtual memory regions
   registered in the userfaultfd, allowing userland to efficiently
   resolve the userfaults it receives via 1) or to manage the virtual
   memory in the background

The real advantage of userfaults compared to regular virtual memory
management with mremap/mprotect is that userfaults never involve
heavyweight structures like vmas in any of their operations (in fact
the userfaultfd runtime load never takes the mmap_sem for writing).

Vmas are not suitable for page- (or hugepage-) granular fault tracking
when dealing with virtual address spaces that could span terabytes.
Too many vmas would be needed for that.

Once opened by invoking the syscall, the userfaultfd can also be
passed over unix domain sockets to a manager process, so the same
manager process can handle the userfaults of a multitude of different
processes without them being aware of what is going on (unless, of
course, they later try to use the userfaultfd themselves on the same
region the manager is already tracking, which is a corner case that
currently returns -EBUSY).
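Passing a file descriptor to a manager process over a unix domain
socket is done with an SCM_RIGHTS control message. The sketch below
shows the generic mechanism; send_fd and recv_fd are hypothetical
helper names, and nothing in it is specific to userfaultfd (any open
descriptor is transferred the same way).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send descriptor fd over the connected unix socket sock. 0 on success. */
int send_fd(int sock, int fd)
{
	char dummy = 'F';	/* at least one byte of real data is sent */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock. Returns the new fd, or -1. */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In this arrangement the tracked process would call send_fd(sock, uffd)
once, and the manager would call recv_fd(sock) and then poll the
received descriptor alongside those of its other clients.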

== API ==

When first opened, the userfaultfd must be enabled by invoking the
UFFDIO_API ioctl with uffdio_api.api set to UFFD_API (or a later API
version), which specifies the read/POLLIN protocol userland intends to
speak on the UFFD, and with uffdio_api.features set to the features
userland requires. If successful (i.e. if the requested uffdio_api.api
is also spoken by the running kernel and the requested features are
going to be enabled), the UFFDIO_API ioctl returns two 64bit bitmasks
in uffdio_api.features and uffdio_api.ioctls: respectively, all the
available features of the read(2) protocol and the generic ioctls
available.
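A minimal sketch of the open + UFFDIO_API handshake, assuming a Linux
system that ships <linux/userfaultfd.h> and a libc that defines
SYS_userfaultfd. The helper name uffd_open and the choice of
O_CLOEXEC | O_NONBLOCK are assumptions of this example, not
requirements of the API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Open a userfaultfd and enable it with UFFDIO_API.
 * On success returns the fd and stores the two bitmasks reported by
 * the kernel. On failure returns -1 with errno set (the syscall may
 * be missing or not permitted on a given system).
 */
int uffd_open(uint64_t wanted_features, uint64_t *features,
	      uint64_t *ioctls)
{
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = wanted_features,
	};
	int saved;
	int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0)
		return -1;

	if (ioctl(uffd, UFFDIO_API, &api) != 0) {
		saved = errno;
		close(uffd);
		errno = saved;
		return -1;
	}

	/* UFFDIO_REGISTER must be advertised before it can be used. */
	if (!(api.ioctls & ((uint64_t)1 << _UFFDIO_REGISTER))) {
		close(uffd);
		errno = ENOTSUP;
		return -1;
	}

	*features = api.features;
	*ioctls = api.ioctls;
	return uffd;
}
```

Passing 0 as wanted_features asks for no optional features; the
returned features bitmask can then be inspected to see what the
running kernel offers.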

Once the userfaultfd has been enabled, the UFFDIO_REGISTER ioctl
should be invoked (if present in the returned uffdio_api.ioctls
bitmask) to register a memory range in the userfaultfd by setting the
uffdio_register structure accordingly. The uffdio_register.mode
bitmask specifies to the kernel which kind of faults to track for the
range (UFFDIO_REGISTER_MODE_MISSING tracks missing pages). The
UFFDIO_REGISTER ioctl returns the uffdio_register.ioctls bitmask of
ioctls that are suitable to resolve userfaults on the registered
range. Not all ioctls will necessarily be supported for all memory
types; it depends on the underlying virtual memory backend (anonymous
memory vs tmpfs vs real filebacked mappings).
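Registering a range for missing-page tracking could look roughly like
this. uffd_register_missing and uffd_range_supports are hypothetical
helper names; the structure fields and constants come from
<linux/userfaultfd.h>. The range is assumed to be page aligned.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Register [addr, addr + len) for missing-page faults on uffd.
 * Returns 0 and stores the per-range ioctl bitmask on success,
 * -1 with errno set on failure.
 */
int uffd_register_missing(int uffd, void *addr, size_t len,
			  uint64_t *ioctls)
{
	struct uffdio_register reg;

	memset(&reg, 0, sizeof(reg));
	reg.range.start = (uint64_t)(uintptr_t)addr;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;

	if (ioctl(uffd, UFFDIO_REGISTER, &reg) != 0)
		return -1;

	if (ioctls)
		*ioctls = reg.ioctls;
	return 0;
}

/* Test one ioctl number (e.g. _UFFDIO_COPY) in a returned bitmask. */
int uffd_range_supports(uint64_t ioctls, unsigned int nr)
{
	return (int)((ioctls >> nr) & 1);
}
```

A caller would typically check
uffd_range_supports(ioctls, _UFFDIO_COPY) before relying on
UFFDIO_COPY for that range, since the supported set depends on the
memory backend.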

Userland can use the uffdio_register.ioctls to manage the virtual
address space in the background (to add or potentially also remove
memory from the userfaultfd registered range). This means a userfault
could trigger just before userland maps the faulting page in the
background.

The primary ioctl to resolve userfaults is UFFDIO_COPY. It atomically
copies a page into the userfault registered range and wakes up the
blocked userfaults (unless UFFDIO_COPY_MODE_DONTWAKE is set in
uffdio_copy.mode). The other ioctls work similarly to UFFDIO_COPY.
They are atomic in the sense that nothing can see a half-copied page:
an access keeps userfaulting until the copy has finished.
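A hedged sketch of resolving a fault with UFFDIO_COPY. The wrapper
name uffd_copy and its dontwake parameter are inventions of this
example; dst and len are assumed to be page aligned and to lie inside
a registered range.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Atomically copy len bytes from src into the registered range at dst.
 * If dontwake is non-zero the blocked faulting threads are not woken.
 * Returns the ioctl result (0 on success, -1 with errno set on error)
 * and, when copied is non-NULL, stores the kernel's uffdio_copy.copy
 * output field there.
 */
int uffd_copy(int uffd, void *dst, const void *src, size_t len,
	      int dontwake, int64_t *copied)
{
	struct uffdio_copy c;
	int ret;

	memset(&c, 0, sizeof(c));
	c.dst = (uint64_t)(uintptr_t)dst;
	c.src = (uint64_t)(uintptr_t)src;
	c.len = len;
	c.mode = dontwake ? UFFDIO_COPY_MODE_DONTWAKE : 0;

	ret = ioctl(uffd, UFFDIO_COPY, &c);
	if (copied)
		*copied = c.copy;
	return ret;
}
```

The source buffer is ordinary memory owned by the caller (for example
a receive buffer); only the destination has to be inside the range
registered in the userfaultfd.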

== QEMU/KVM ==

QEMU/KVM uses the userfaultfd syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
userfaultfd abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.

Guest async page faults, FOLL_NOWAIT and all other GUP features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (e.g. network bound ones) can keep
running in the guest vcpus.

It is generally beneficial to run one pass of precopy live migration
just before starting postcopy live migration, in order to avoid
generating userfaults for readonly guest regions.

The implementation of postcopy live migration currently uses one
single bidirectional socket but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).

The QEMU in the source node writes all pages that it knows are missing
in the destination node into the socket, and the migration thread of
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
ioctls on the userfaultfd in order to map the received pages into the
guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
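The zero page case can be sketched with the UFFDIO_ZEROPAGE ioctl as
below. This is not QEMU's code: page_is_zero and uffd_zeropage are
hypothetical helpers, and scanning the page for zeroes is just one
way to detect a zero page (a real migration stream may instead carry
a flag saying so).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Return 1 if all len bytes at page are zero, 0 otherwise. */
int page_is_zero(const void *page, size_t len)
{
	const unsigned char *p = page;
	size_t i;

	for (i = 0; i < len; i++)
		if (p[i])
			return 0;
	return 1;
}

/*
 * Map zero-filled memory at [dst, dst + len) in a registered range,
 * without needing a source buffer. Returns the ioctl result.
 */
int uffd_zeropage(int uffd, void *dst, size_t len)
{
	struct uffdio_zeropage z;

	memset(&z, 0, sizeof(z));
	z.range.start = (uint64_t)(uintptr_t)dst;
	z.range.len = len;
	z.mode = 0;	/* wake the blocked faulting threads */
	return ioctl(uffd, UFFDIO_ZEROPAGE, &z);
}
```

A receiver built this way would call uffd_zeropage() for pages known
to be zero and fall back to UFFDIO_COPY for everything else.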

A different postcopy thread in the destination node listens with
poll() to the userfaultfd in parallel. When a POLLIN event is
generated after a userfault triggers, the postcopy thread read()s from
the userfaultfd and receives the fault address (or -EAGAIN in case the
userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
by the parallel QEMU migration thread).
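One iteration of such a poll()/read() loop might look like the
following sketch. uffd_next_fault is a hypothetical helper, and the
EAGAIN handling assumes the userfaultfd was opened with O_NONBLOCK.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for a userfault on uffd.
 * Returns 1 and stores the faulting address if a page fault message
 * was read, 0 if there is nothing to handle (timeout, or the fault
 * was already resolved and read() reported EAGAIN), -1 on error.
 */
int uffd_next_fault(int uffd, int timeout_ms, uint64_t *addr)
{
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;
	ssize_t r;
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0)
		return -1;
	if (n == 0)
		return 0;
	if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
		return -1;

	r = read(uffd, &msg, sizeof(msg));
	if (r < 0)
		return errno == EAGAIN ? 0 : -1;
	if (r != (ssize_t)sizeof(msg)) {
		errno = EIO;
		return -1;
	}
	if (msg.event != UFFD_EVENT_PAGEFAULT)
		return 0;	/* other event types are ignored here */

	*addr = msg.arg.pagefault.address;
	return 1;
}
```

The address returned is the faulting address, not necessarily page
aligned, so a caller would round it down to the page size before
requesting or copying the page.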

After the QEMU postcopy thread (running in the destination node) gets
the userfault address, it writes the information about the missing
page into the socket. The QEMU source node receives the information,
roughly "seeks" to that page address and continues sending all
remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and it'll map it as
usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
was spontaneously sent by the source or if it was an urgent page
requested through a userfault).

By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap relative to the live
migration around; a single per-page bitmap has to be maintained in
the QEMU running in the source node to know which pages are still
missing in the destination node. The bitmap in the source node is
checked to find which missing pages to send in round robin and we seek
over it when receiving incoming userfaults. After sending each page
the bitmap is of course updated accordingly. The bitmap is also useful
to avoid sending the same page twice (in case the userfault is read by
the postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the
migration thread).
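The source-side bookkeeping described above can be illustrated with a
small bitmap and a cursor. This is a toy model, not QEMU's data
structure: struct missing_map and both functions are hypothetical
names introduced for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct missing_map {
	unsigned long *bits;	/* bit set = page still missing remotely */
	size_t npages;
	size_t cursor;		/* next page index to consider */
};

/*
 * Pick the next missing page at or after the cursor, wrapping around
 * once (round robin), mark it as sent and advance the cursor.
 * Returns the page index, or SIZE_MAX when no page is missing.
 */
size_t missing_map_send_next(struct missing_map *m)
{
	size_t i;

	for (i = 0; i < m->npages; i++) {
		size_t page = (m->cursor + i) % m->npages;
		unsigned long bit = 1UL << (page % BITS_PER_WORD);
		unsigned long *word = &m->bits[page / BITS_PER_WORD];

		if (*word & bit) {
			*word &= ~bit;
			m->cursor = (page + 1) % m->npages;
			return page;
		}
	}
	return SIZE_MAX;
}

/* An incoming userfault "seeks" the cursor to the requested page. */
void missing_map_seek(struct missing_map *m, size_t page)
{
	assert(page < m->npages);
	m->cursor = page;
}
```

If a userfault arrives for a page whose bit is already clear, the
seek simply moves the cursor and the next call skips forward to the
next page that is still missing, so the page is not sent twice.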