188 lines
5.8 KiB
Plaintext
188 lines
5.8 KiB
Plaintext
|
What is hwpoison?
|
||
|
|
||
|
Upcoming Intel CPUs have support for recovering from some memory errors
|
||
|
(``MCA recovery''). This requires the OS to declare a page "poisoned",
|
||
|
kill the processes associated with it and avoid using it in the future.
|
||
|
|
||
|
This patchkit implements the necessary infrastructure in the VM.
|
||
|
|
||
|
To quote the overview comment:
|
||
|
|
||
|
* High level machine check handler. Handles pages reported by the
|
||
|
* hardware as being corrupted usually due to a 2bit ECC memory or cache
|
||
|
* failure.
|
||
|
*
|
||
|
* This focusses on pages detected as corrupted in the background.
|
||
|
* When the current CPU tries to consume corruption the currently
|
||
|
* running process can just be killed directly instead. This implies
|
||
|
* that if the error cannot be handled for some reason it's safe to
|
||
|
* just ignore it because no corruption has been consumed yet. Instead
|
||
|
* when that happens another machine check will happen.
|
||
|
*
|
||
|
* Handles page cache pages in various states. The tricky part
|
||
|
* here is that we can access any page asynchronous to other VM
|
||
|
* users, because memory failures could happen anytime and anywhere,
|
||
|
* possibly violating some of their assumptions. This is why this code
|
||
|
* has to be extremely careful. Generally it tries to use normal locking
|
||
|
* rules, as in get the standard locks, even if that means the
|
||
|
* error handling takes potentially a long time.
|
||
|
*
|
||
|
* Some of the operations here are somewhat inefficient and have non
|
||
|
* linear algorithmic complexity, because the data structures have not
|
||
|
* been optimized for this case. This is in particular the case
|
||
|
* for the mapping from a vma to a process. Since this case is expected
|
||
|
* to be rare we hope we can get away with this.
|
||
|
|
||
|
The code consists of a the high level handler in mm/memory-failure.c,
|
||
|
a new page poison bit and various checks in the VM to handle poisoned
|
||
|
pages.
|
||
|
|
||
|
The main target right now is KVM guests, but it works for all kinds
|
||
|
of applications. KVM support requires a recent qemu-kvm release.
|
||
|
|
||
|
For the KVM use there was need for a new signal type so that
|
||
|
KVM can inject the machine check into the guest with the proper
|
||
|
address. This in theory allows other applications to handle
|
||
|
memory failures too. The expection is that near all applications
|
||
|
won't do that, but some very specialized ones might.
|
||
|
|
||
|
---
|
||
|
|
||
|
There are two (actually three) modi memory failure recovery can be in:
|
||
|
|
||
|
vm.memory_failure_recovery sysctl set to zero:
|
||
|
All memory failures cause a panic. Do not attempt recovery.
|
||
|
(on x86 this can be also affected by the tolerant level of the
|
||
|
MCE subsystem)
|
||
|
|
||
|
early kill
|
||
|
(can be controlled globally and per process)
|
||
|
Send SIGBUS to the application as soon as the error is detected
|
||
|
This allows applications who can process memory errors in a gentle
|
||
|
way (e.g. drop affected object)
|
||
|
This is the mode used by KVM qemu.
|
||
|
|
||
|
late kill
|
||
|
Send SIGBUS when the application runs into the corrupted page.
|
||
|
This is best for memory error unaware applications and default
|
||
|
Note some pages are always handled as late kill.
|
||
|
|
||
|
---
|
||
|
|
||
|
User control:
|
||
|
|
||
|
vm.memory_failure_recovery
|
||
|
See sysctl.txt
|
||
|
|
||
|
vm.memory_failure_early_kill
|
||
|
Enable early kill mode globally
|
||
|
|
||
|
PR_MCE_KILL
|
||
|
Set early/late kill mode/revert to system default
|
||
|
arg1: PR_MCE_KILL_CLEAR: Revert to system default
|
||
|
arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode
|
||
|
PR_MCE_KILL_EARLY: Early kill
|
||
|
PR_MCE_KILL_LATE: Late kill
|
||
|
PR_MCE_KILL_DEFAULT: Use system global default
|
||
|
Note that if you want to have a dedicated thread which handles
|
||
|
the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
|
||
|
call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
|
||
|
the SIGBUS is sent to the main thread.
|
||
|
|
||
|
PR_MCE_KILL_GET
|
||
|
return current mode
|
||
|
|
||
|
|
||
|
---
|
||
|
|
||
|
Testing:
|
||
|
|
||
|
madvise(MADV_HWPOISON, ....)
|
||
|
(as root)
|
||
|
Poison a page in the process for testing
|
||
|
|
||
|
|
||
|
hwpoison-inject module through debugfs
|
||
|
|
||
|
/sys/debug/hwpoison/
|
||
|
|
||
|
corrupt-pfn
|
||
|
|
||
|
Inject hwpoison fault at PFN echoed into this file. This does
|
||
|
some early filtering to avoid corrupted unintended pages in test suites.
|
||
|
|
||
|
unpoison-pfn
|
||
|
|
||
|
Software-unpoison page at PFN echoed into this file. This
|
||
|
way a page can be reused again.
|
||
|
This only works for Linux injected failures, not for real
|
||
|
memory failures.
|
||
|
|
||
|
Note these injection interfaces are not stable and might change between
|
||
|
kernel versions
|
||
|
|
||
|
corrupt-filter-dev-major
|
||
|
corrupt-filter-dev-minor
|
||
|
|
||
|
Only handle memory failures to pages associated with the file system defined
|
||
|
by block device major/minor. -1U is the wildcard value.
|
||
|
This should be only used for testing with artificial injection.
|
||
|
|
||
|
corrupt-filter-memcg
|
||
|
|
||
|
Limit injection to pages owned by memgroup. Specified by inode number
|
||
|
of the memcg.
|
||
|
|
||
|
Example:
|
||
|
mkdir /sys/fs/cgroup/mem/hwpoison
|
||
|
|
||
|
usemem -m 100 -s 1000 &
|
||
|
echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
|
||
|
|
||
|
memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
|
||
|
echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
|
||
|
|
||
|
page-types -p `pidof init` --hwpoison # shall do nothing
|
||
|
page-types -p `pidof usemem` --hwpoison # poison its pages
|
||
|
|
||
|
corrupt-filter-flags-mask
|
||
|
corrupt-filter-flags-value
|
||
|
|
||
|
When specified, only poison pages if ((page_flags & mask) == value).
|
||
|
This allows stress testing of many kinds of pages. The page_flags
|
||
|
are the same as in /proc/kpageflags. The flag bits are defined in
|
||
|
include/linux/kernel-page-flags.h and documented in
|
||
|
Documentation/vm/pagemap.txt
|
||
|
|
||
|
Architecture specific MCE injector
|
||
|
|
||
|
x86 has mce-inject, mce-test
|
||
|
|
||
|
Some portable hwpoison test programs in mce-test, see blow.
|
||
|
|
||
|
---
|
||
|
|
||
|
References:
|
||
|
|
||
|
http://halobates.de/mce-lc09-2.pdf
|
||
|
Overview presentation from LinuxCon 09
|
||
|
|
||
|
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
|
||
|
Test suite (hwpoison specific portable tests in tsrc)
|
||
|
|
||
|
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
|
||
|
x86 specific injector
|
||
|
|
||
|
|
||
|
---
|
||
|
|
||
|
Limitations:
|
||
|
|
||
|
- Not all page types are supported and never will. Most kernel internal
|
||
|
objects cannot be recovered, only LRU pages for now.
|
||
|
- Right now hugepage support is missing.
|
||
|
|
||
|
---
|
||
|
Andi Kleen, Oct 2009
|
||
|
|