hypervision tips and tricks: April 2020

Computers are complicated machines. It's all too common to be browsing the internet, playing the newest Call of Duty, or just watching a movie when suddenly your audio cuts out and your screen turns blue. It's easy to assume that it was just a fluke, or that some overworked intern at Microsoft would be to blame - yet, when faced with repeated but inconsistent crashes, one can't help but find out why.

Analyzing MEMORY.DMP

On opening the crash dump with WinDbg Preview, it is quickly obvious that:

A page fault occurred;
It was right in the middle of my Modern Warfare multiplayer session, and;
It likely occurred somewhere inside of win32kbase!EnterCrit.

Given the existence of previously disclosed exploits surrounding win32k, it is indeed possible that someone at Microsoft is to blame - my specific version of Windows combined with my exact hardware configuration may have caused the crashes.

Let's analyze this.

Immediately, it is clear that someone at Microsoft needs to update their documentation for Arg2; it's described as only being capable of containing either 1 or 0, yet here, it obviously contains 10 (hex).

It's also clear that, because the address referenced (Arg1) and the address of the instruction which referenced the memory (Arg3) are exactly the same, we likely have a page fault caused by an instruction fetch. The bugcheck code PAGE_FAULT_IN_NONPAGED_AREA also tells us that this memory is supposedly not present as well as being nonpaged; it is a non-existent address in the sense of both the OS memory manager as well as the CPU's paging structures.

So why would an address within 100 bytes of a valid location (win32kbase!EnterCrit) cause a page fault? Perhaps knowing of the exact context would give us a clue.

Wait, that-

So it seems that, despite causing a PAGE_FAULT_IN_NONPAGED_AREA, this is a valid address, as the debugger has no issue reading its contents.

Furthermore, the instruction can not possibly have caused any side-effects, as the instruction, test rax, rax, can not reference other memory, implicitly or otherwise. It is also a correct disassembly and not in-between another instruction, which could potentially confuse the debugger, as the instruction pointer from the register context matches up with its apparent address.

A valid and present memory address that causes a page fault complaining of exactly the opposite. Who could debug that?

Bug Hunting: Beyond the Dump

Perhaps it was myself, and not Microsoft, who was to blame. It was around this point when I realized that I also had a constantly-running-from-boot type 1 hypervisor; I also realized that this was a relatively new issue. I then recollected of a recent commit I had made in response to reading a certain section of the Intel SDM:

Of particular interest is the clause dictating that microcode updates being performed under VMX non-root operation (which is to say that the VMM does not exit on writes to IA32_BIOS_UPDT_TRIG) can cause "unpredictable" system behavior. Rereading this was quite alarming as I had previously understood it as if it implied that the processor would simply ignore microcode updates performed in non-root operation.

I then set out to write a microcode update loader. Realizing that microcode updates occurred in the form of a linear address reference and that they could be variable-size, however, I figured that the best course of action for then would be to just ignore any microcode update that the OS would try to load; the motherboard would attempt to load its own update anyways.

Or did it? My CPU, the Core i7-4790k, is part of Intel's 4th-generation Haswell (technically Devil's Canyon) series of processors. Sometime soon after the release of my CPU around August of 2014, Intel identified an erratum relating to the TSX extensions, which was fixed via a microcode update that simply disabled the feature.

Knowing this, and knowing that my motherboard's manufacturer last released an update for their product sometime in 2016, I figured that if tools such as CPU-Z did not report the presence of TSX, the motherboard would be loading some form of microcode update in the place of Windows - and that my issue likely wouldn't be related to outdated microcode.

I was in awe. Did ASUS really expect Windows to load the update for them, or did they never bother? I knew now that my microcode was horribly out of date, if an update had been loaded at all. If I then searched for a list of errata for 4th generation Intel processors, I would know that most if not all of them would apply for my case. Perhaps the source of this problem could be found here.

HSD132: Unexpected Page Fault

Bingo! After finding the correct documentation, various unpatched errata appeared to apply. Arguably the most relevant one, however, was HSD132:

Which most definitely matches with my page fault scenario; my hypervisor, using EPT, caused the errata to manifest itself randomly and therefore generate an unexpected page fault. Other errata which caught my attention (some of which may have also contributed to this specific crash or others) include:

HSD12. CR0.CD Is Ignored in VMX Operation
HSD88. Event Injection by VM Entry May Use an Incorrect B Flag for SS
HSD147. Unpredictable Operation at Turbo Frequencies Above 4.0 GHz (Affects only the i7-4790k due to its out-of-box turbo frequency; can cause machine checks, system hangs, or other issues)

Loading Microcode Updates

It seemed that now the microcode update loader I previously held off on writing had to be written, unless I didn't want to use my hypervisor. To understand a simple microcode update loader, I first had to understand how they work.

Microcode updates involve two MSRs and one CPUID leaf - IA32_BIOS_SIGN_ID (index 8Bh), IA32_BIOS_UPDT_TRIG (index 79h), and CPUID.01h. All hypervisors meant to support correct processor operation must exit on all interactions with these, with the exception of reads to IA32_BIOS_UPDT_TRIG. The function of those fields is as follows:

IA32_BIOS_SIGN_ID: A read/write buffer containing the current microcode update revision.
IA32_BIOS_UPDT_TRIG: A write-only buffer. When written to, the processor will attempt to load the microcode update specified in the eax and edx registers as a linear address.
CPUID.01h: When executed, the higher 32 bits of IA32_BIOS_SIGN_ID will contain the current microcode update revision. This has a potentially unexpected implication that hypervisors which completely emulate the CPUID instruction (and do not fill in the leaf information by simply executing the instruction) must also set the microcode update revision inside of IA32_BIOS_SIGN_ID for executions when EAX = 1.

In order to read the microcode update revision, one must clear IA32_BIOS_SIGN_ID with zero, execute CPUID.01h, and then read the higher 32 bits of the aforementioned MSR. Note that microcode update revisions are signed integers.

Implementing a microcode update loader appears to be quite a daunting task due to the fact that the MSR references a linear address which must occur inside of VMX root mode; this is to say that the hypervisor must have the guest-loaded microcode update somewhere inside of its own address space.

This can be a problem for hypervisors which are not able to allocate memory at runtime (such as when they are launched from EFI boot into runtime operation); they will have to allocate a fixed buffer which they can copy the microcode update into. Allocating this buffer, however, can be quite expensive, as microcode updates are variable length; Haswell rev. 40 (0x28), for example, is 23kb large. To avoid reserving laughable amounts of memory from the OS, the hypervisor can instead reserve a portion of its address space, where each page would have its physical address aligned with that of the guest pages.

Before the hypervisor can figure out how large the microcode update is, however, they must first parse the microcode update header. The microcode update header can be found 48 bytes prior to the address given to IA32_BIOS_UPDT_TRIG, and contains various fields relating to its verification and length. A hypervisor may or may not wish to manually verify a microcode update; such will always be done by the CPU, regardless. It will be assumed, here, that the microcode update is perfectly valid.

Two fields are used to indicate the size of a microcode update, being the "total size" and "data size" fields. In order to determine the size of a microcode update, the following algorithm can be used:

If the "data size" field is zero, the size of the microcode update, in bytes, is 2000 plus the size of the microcode update header (48 bytes); this means that the size is fixed to 2048 bytes.
Otherwise, the complete size of the microcode update is specified in the "total size" field.

The rest of the microcode updater is relatively straightforward.

Map the physical (not guest-physical!) address of every affected page from the guest's address space into the hypervisor's address space by changing the physical address of each PTE and then executing invlpg on the PTE's corresponding address. A PTE must be used as it represents the lowest possible granularity the guest can use, in the event that each page in the guest's address space refers to non-contiguous chunks of physical memory.
Make sure that no page faults can occur while the processor attempts to load the microcode update.
If Hyperthreading is used or can be used, make sure that no other logical processor on the current physical core can load a microcode update. This can be done with a mutex.
Alternatively, use an IPI to force all other processors to wait for the update to finish so that, afterwards, they may re-initialize buffers which could be modified by a microcode update (an example would be the availability of extensions such as TSX, as mentioned previously in this article). Note that this may occur during early system initialization where these processors could be in the wait-for-SIPI state.
Ideally under an exception handler, write to the IA32_BIOS_UPDT_TRIG MSR with the linear address of the microcode update data. Note that the address given to the MSR must point to the actual microcode update data itself, which is 48 bytes after the microcode update header.
If an exception was generated, reflect it to the guest or treat it as an internal error.
Use some method to signal processors to update relevant cached information; this may be with an IPI or with an identifier that is atomically incremented per every microcode update.
If the microcode update revision does not change between microcode updates and the hypervisor treats invalid microcode update loads as an error, assume that the microcode update was not loaded.

Bring-Your-Own Microcode Update

Alternatively, a hypervisor can also simply load its own microcode update during early initialization; a hypervisor can also load its own microcode update along with loading guest microcode updates. Various microcode updates for select Intel processors can be found at an official repository hosted on Intel's GitHub.

Alternative locations include a series of "mcupdate_************.dll" (where each asterisk is a character from the CPU's vendor name, such as GenuineIntel) files located inside of any recent version of Windows, which one can extract with a simple binary analysis. These files contain a treasure trove of microcode updates from as early as 2010, and more recent versions may contain microcode updates one or more revisions newer than on Intel's repo.

Both solutions will contain the microcode update header, as-is; this is to say that the construction of the microcode update header is to be done by Intel instead of the microcode update loader. The extent of the checks which must be done on these microcode updates are numerous and beyond the scope of this article; they can be located in section 9.11 of Volume 3 of the Intel SDM.

Fruits of our Labour

Hopefully that wasn't too head-spinning. CPU-Z now reports the loss of TSX:

And I've yet to experience any system crashes, inexplicable or otherwise.

Conclusion

I hope you enjoyed this article. After over a year of inactivity I recently felt motivated to write an article, and I thought that one talking about my thought process when implementing a feature to my hypervisor as well as one that is hopefully informative to anybody having a similar issue would be interesting to read. An issue with this type of blog is that anything I post must be specific to hypervisor development, which can usually seem obvious or not interesting to write about.

In the meantime, you can hopefully read more articles by myself or other smart people over at the increasingly popular "secret club", which will be less focused on hypervisor development. You can also follow my twitter account.

hypervision tips and tricks

Sunday, 12 April 2020

Unexplained crashes, microcode updates, and CPU errata