hypervision tips and tricks

Sunday, 12 April 2020

Unexplained crashes, microcode updates, and CPU errata

Computers are complicated machines. It's all too common to be browsing the internet, playing the newest Call of Duty, or just watching a movie when suddenly your audio cuts out and your screen turns blue. It's easy to assume that it was just a fluke, or that some overworked intern at Microsoft is to blame; yet, when faced with repeated but inconsistent crashes, one can't help but try to find out why.

Analyzing MEMORY.DMP

On opening the crash dump with WinDbg Preview, it is quickly obvious that:

A page fault occurred;
It was right in the middle of my Modern Warfare multiplayer session, and;
It likely occurred somewhere inside of win32kbase!EnterCrit.

Given the existence of previously disclosed exploits surrounding win32k, it is indeed possible that someone at Microsoft is to blame - my specific version of Windows combined with my exact hardware configuration may have caused the crashes.

Let's analyze this.

Immediately, it is clear that someone at Microsoft needs to update their documentation for Arg2; it's described as only being capable of containing either 1 or 0, yet here, it obviously contains 10 (hex).

It's also clear that, because the address referenced (Arg1) and the address of the instruction which referenced the memory (Arg3) are exactly the same, we likely have a page fault caused by an instruction fetch. The bugcheck code PAGE_FAULT_IN_NONPAGED_AREA also tells us that this memory is supposedly not present as well as being nonpaged; it is a non-existent address in the sense of both the OS memory manager as well as the CPU's paging structures.

So why would an address within 100 bytes of a valid location (win32kbase!EnterCrit) cause a page fault? Perhaps knowing the exact context would give us a clue.

Wait, that-

So it seems that, despite causing a PAGE_FAULT_IN_NONPAGED_AREA, this is a valid address, as the debugger has no issue reading its contents.

Furthermore, the instruction cannot possibly have caused any side effects, as the instruction, test rax, rax, cannot reference other memory, implicitly or otherwise. The disassembly is also correct and does not start in the middle of another instruction (which could confuse the debugger), as the instruction pointer from the register context matches its apparent address.

A valid and present memory address that causes a page fault complaining of exactly the opposite. Who could debug that?

Bug Hunting: Beyond the Dump

Perhaps it was I, and not Microsoft, who was to blame. It was around this point that I remembered I also had a type 1 hypervisor running constantly from boot; I also realized that this was a relatively new issue. I then recalled a recent commit I had made in response to reading a certain section of the Intel SDM:

Of particular interest is the clause dictating that microcode updates performed under VMX non-root operation (which is to say that the VMM does not exit on writes to IA32_BIOS_UPDT_TRIG) can cause "unpredictable" system behavior. Rereading this was quite alarming, as I had previously understood it to mean that the processor would simply ignore microcode updates performed in non-root operation.

I then set out to write a microcode update loader. Realizing that microcode updates are passed as a linear address reference and that they can be variable-size, however, I figured that the best course of action for the time being would be to just ignore any microcode update that the OS tried to load; the motherboard would attempt to load its own update anyway.

Or did it? My CPU, the Core i7-4790k, is part of Intel's 4th-generation Haswell (technically Devil's Canyon) series of processors. Sometime soon after the release of my CPU around August of 2014, Intel identified an erratum relating to the TSX extensions, which was fixed via a microcode update that simply disabled the feature.

Knowing this, and knowing that my motherboard's manufacturer last released an update for their product sometime in 2016, I figured that if tools such as CPU-Z did not report the presence of TSX, the motherboard would be loading some form of microcode update in the place of Windows - and that my issue likely wouldn't be related to outdated microcode.

I was in awe. Did ASUS really expect Windows to load the update for them, or did they never bother? I now knew that my microcode was horribly out of date, if an update had been loaded at all. That meant that most, if not all, of the published errata for 4th-generation Intel processors would apply to my case. Perhaps the source of this problem could be found there.

HSD132: Unexpected Page Fault

Bingo! After finding the correct documentation, various unpatched errata appeared to apply. Arguably the most relevant one, however, was HSD132:

This most definitely matches my page fault scenario; my hypervisor, using EPT, caused the erratum to manifest itself randomly and therefore generate an unexpected page fault. Other errata which caught my attention (some of which may have also contributed to this specific crash or others) include:

HSD12. CR0.CD Is Ignored in VMX Operation
HSD88. Event Injection by VM Entry May Use an Incorrect B Flag for SS
HSD147. Unpredictable Operation at Turbo Frequencies Above 4.0 GHz (Affects only the i7-4790k due to its out-of-box turbo frequency; can cause machine checks, system hangs, or other issues)

Loading Microcode Updates

It seemed that the microcode update loader I had previously held off on writing now had to be written, unless I didn't want to use my hypervisor. To write even a simple loader, I first had to understand how microcode updates work.

Microcode updates involve two MSRs and one CPUID leaf - IA32_BIOS_SIGN_ID (index 8Bh), IA32_BIOS_UPDT_TRIG (index 79h), and CPUID.01h. All hypervisors meant to support correct processor operation must exit on all interactions with these, with the exception of reads to IA32_BIOS_UPDT_TRIG. The function of those fields is as follows:

IA32_BIOS_SIGN_ID: A read/write buffer containing the current microcode update revision.
IA32_BIOS_UPDT_TRIG: A write-only buffer. When written to, the processor will attempt to load the microcode update specified in the eax and edx registers as a linear address.
CPUID.01h: When executed, the higher 32 bits of IA32_BIOS_SIGN_ID will contain the current microcode update revision. This has a potentially unexpected implication that hypervisors which completely emulate the CPUID instruction (and do not fill in the leaf information by simply executing the instruction) must also set the microcode update revision inside of IA32_BIOS_SIGN_ID for executions when EAX = 1.

In order to read the microcode update revision, one must clear IA32_BIOS_SIGN_ID with zero, execute CPUID.01h, and then read the higher 32 bits of the aforementioned MSR. Note that microcode update revisions are signed integers.
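The read sequence above can be sketched roughly as follows. This is a minimal illustration, not the author's code: `struct cpu_ops` and its callbacks are hypothetical stand-ins for whatever privileged RDMSR/WRMSR/CPUID wrappers a hypervisor already has, and the two conversion helpers simply move the signed revision into and out of the upper 32 bits of the MSR value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IA32_BIOS_SIGN_ID 0x8Bu /* MSR index, as given in the article */

/* Hypothetical wrappers around the privileged instructions. */
struct cpu_ops {
    uint64_t (*rdmsr)(uint32_t index);
    void (*wrmsr)(uint32_t index, uint64_t value);
    void (*cpuid)(uint32_t leaf);
};

/*
 * The revision lives in the upper 32 bits and is a signed integer.
 * The unsigned-to-signed cast relies on two's complement wrapping,
 * which mainstream compilers provide.
 */
static inline int32_t ucode_revision_from_sign_id(uint64_t sign_id)
{
    return (int32_t)(uint32_t)(sign_id >> 32);
}

/* Inverse: the MSR value a CPUID-emulating hypervisor would expose. */
static inline uint64_t sign_id_from_ucode_revision(int32_t revision)
{
    return (uint64_t)(uint32_t)revision << 32;
}

/* Clear the MSR, execute CPUID.01h, then read the revision back. */
static inline int32_t read_ucode_revision(const struct cpu_ops *ops)
{
    assert(ops != NULL);
    ops->wrmsr(IA32_BIOS_SIGN_ID, 0);
    ops->cpuid(1);
    return ucode_revision_from_sign_id(ops->rdmsr(IA32_BIOS_SIGN_ID));
}
```

A hypervisor that fully emulates CPUID could use `sign_id_from_ucode_revision` to populate the value it returns for guest reads of IA32_BIOS_SIGN_ID after an emulated CPUID.01h.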

Implementing a microcode update loader appears to be quite a daunting task due to the fact that the MSR references a linear address which must occur inside of VMX root mode; this is to say that the hypervisor must have the guest-loaded microcode update somewhere inside of its own address space.

This can be a problem for hypervisors which are not able to allocate memory at runtime (such as when they are launched from EFI boot into runtime operation); they will have to allocate a fixed buffer which they can copy the microcode update into. Allocating this buffer, however, can be quite expensive, as microcode updates are variable length; Haswell rev. 40 (0x28), for example, is 23 KB in size. To avoid reserving laughable amounts of memory from the OS, the hypervisor can instead reserve a portion of its address space, where each page would have its physical address aligned with that of the guest pages.

Before the hypervisor can figure out how large the microcode update is, however, it must first parse the microcode update header. The microcode update header can be found 48 bytes prior to the address given to IA32_BIOS_UPDT_TRIG, and contains various fields relating to its verification and length. A hypervisor may or may not wish to manually verify a microcode update; verification will always be done by the CPU regardless. It will be assumed, here, that the microcode update is perfectly valid.

Two fields are used to indicate the size of a microcode update, being the "total size" and "data size" fields. In order to determine the size of a microcode update, the following algorithm can be used:

If the "data size" field is zero, the size of the microcode update, in bytes, is 2000 plus the size of the microcode update header (48 bytes); this means that the size is fixed to 2048 bytes.
Otherwise, the complete size of the microcode update is specified in the "total size" field.
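As a rough sketch of the size rule above: the struct below follows the 48-byte header layout as I understand it from the Intel SDM, but the field names are my own and should be checked against the manual before use. No validation is performed, matching the article's assumption that the update is valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UCODE_HEADER_SIZE       48u
#define UCODE_DEFAULT_DATA_SIZE 2000u

/* Assumed header layout; field names are illustrative. */
struct ucode_header {
    uint32_t header_version;
    int32_t  update_revision;
    uint32_t date;
    uint32_t processor_signature;
    uint32_t checksum;
    uint32_t loader_revision;
    uint32_t processor_flags;
    uint32_t data_size;   /* 0 means the default of 2000 bytes */
    uint32_t total_size;  /* only meaningful when data_size != 0 */
    uint32_t reserved[3];
};

static_assert(sizeof(struct ucode_header) == UCODE_HEADER_SIZE,
              "header must be exactly 48 bytes");

/* The header sits 48 bytes before the address written to the MSR. */
static inline uint64_t ucode_header_address(uint64_t data_address)
{
    assert(data_address >= UCODE_HEADER_SIZE);
    return data_address - UCODE_HEADER_SIZE;
}

/* Total size in bytes, header included. */
static inline uint32_t ucode_total_size(const struct ucode_header *hdr)
{
    assert(hdr != NULL);
    if (hdr->data_size == 0)
        return UCODE_DEFAULT_DATA_SIZE + UCODE_HEADER_SIZE; /* 2048 */
    return hdr->total_size;
}
```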

The rest of the microcode updater is relatively straightforward.

Map the physical (not guest-physical!) address of every affected page from the guest's address space into the hypervisor's address space by changing the physical address of each PTE and then executing invlpg on the PTE's corresponding address. A PTE must be used as it represents the lowest possible granularity the guest can use, in the event that each page in the guest's address space refers to non-contiguous chunks of physical memory.
Make sure that no page faults can occur while the processor attempts to load the microcode update.
If Hyperthreading is used or can be used, make sure that no other logical processor on the current physical core can load a microcode update. This can be done with a mutex.
Alternatively, use an IPI to force all other processors to wait for the update to finish so that, afterwards, they may re-initialize buffers which could be modified by a microcode update (an example would be the availability of extensions such as TSX, as mentioned previously in this article). Note that this may occur during early system initialization where these processors could be in the wait-for-SIPI state.
Ideally under an exception handler, write to the IA32_BIOS_UPDT_TRIG MSR with the linear address of the microcode update data. Note that the address given to the MSR must point to the actual microcode update data itself, which is 48 bytes after the microcode update header.
If an exception was generated, reflect it to the guest or treat it as an internal error.
Use some method to signal processors to update relevant cached information; this may be with an IPI or with an identifier that is atomically incremented per every microcode update.
If the microcode update revision does not change between microcode updates and the hypervisor treats invalid microcode update loads as an error, assume that the microcode update was not loaded.
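For the mapping step, the hypervisor needs to know which guest pages the update touches. A minimal sketch, assuming 4 KiB pages and assuming the span to map runs from the start of the header to the last byte of the update (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_4K      4096ull
#define UCODE_HEADER_SIZE 48ull

static inline uint64_t page_base(uint64_t address)
{
    return address & ~(PAGE_SIZE_4K - 1);
}

/* First guest page to map: the one containing the update header. */
static inline uint64_t ucode_first_page(uint64_t data_address)
{
    assert(data_address >= UCODE_HEADER_SIZE);
    return page_base(data_address - UCODE_HEADER_SIZE);
}

/* Number of 4 KiB pages covering header plus data. */
static inline uint64_t ucode_page_count(uint64_t data_address,
                                        uint32_t total_size)
{
    assert(total_size > 0);
    uint64_t first = ucode_first_page(data_address);
    uint64_t last_byte = data_address - UCODE_HEADER_SIZE + total_size - 1;
    return (page_base(last_byte) - first) / PAGE_SIZE_4K + 1;
}
```

Each of the resulting pages would then be translated through the guest's paging structures and mapped one PTE at a time, since consecutive guest pages need not be physically contiguous.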

Bring-Your-Own Microcode Update

Alternatively, a hypervisor can also simply load its own microcode update during early initialization; a hypervisor can also load its own microcode update along with loading guest microcode updates. Various microcode updates for select Intel processors can be found at an official repository hosted on Intel's GitHub.

Alternative locations include a series of "mcupdate_************.dll" (where each asterisk is a character from the CPU's vendor name, such as GenuineIntel) files located inside of any recent version of Windows, which one can extract with a simple binary analysis. These files contain a treasure trove of microcode updates from as early as 2010, and more recent versions may contain microcode updates one or more revisions newer than on Intel's repo.

Both solutions will contain the microcode update header, as-is; this is to say that the construction of the microcode update header is to be done by Intel instead of the microcode update loader. The extent of the checks which must be done on these microcode updates are numerous and beyond the scope of this article; they can be located in section 9.11 of Volume 3 of the Intel SDM.

Fruits of our Labour

Hopefully that wasn't too head-spinning. CPU-Z now reports the loss of TSX:

And I've yet to experience any system crashes, inexplicable or otherwise.

Conclusion

I hope you enjoyed this article. After over a year of inactivity I recently felt motivated to write an article, and I thought that one talking about my thought process when implementing a feature to my hypervisor as well as one that is hopefully informative to anybody having a similar issue would be interesting to read. An issue with this type of blog is that anything I post must be specific to hypervisor development, which can usually seem obvious or not interesting to write about.

In the meantime, you can hopefully read more articles by myself or other smart people over at the increasingly popular "secret club", which will be less focused on hypervisor development. You can also follow my twitter account.

Monday, 14 January 2019

A Common Missight in Most Hypervisors

Generally, when a hypervisor encounters a VM exit, it is because it needs to emulate the effects of an instruction, be it CPUID, RDMSR, or ENCLS. Due to ineffective wording in Intel's Software Developer's Manual, most assume that the only necessary action to be taken is to update the guest's registers or inject an exception. There are many times, however, where this is not the case. Hidden all the way in section 32.2.1 of volume 3 is this sentence:
"If a VMM emulates a guest instruction that would encounter a debug trap (single step or data or I/O breakpoint), it should cause that trap to be delivered."

After reading the subsequent paragraph going on about injecting single-step debug exceptions, I created a new project in Visual Studio, confirming the suspicions in the back of my head:

The issue here is that, by not injecting a debug exception after the emulated CPUID instruction, the trap is essentially delayed until the NOP can recognize it.

This brings us to the main point, which is that the phrase "none of the events normally associated take place," albeit used to describe exiting for startup-IPIs, describes every fault-like VM exit.

Single-Step Debug Exceptions

It's not very hard to handle single-step debug exceptions. I don't know how to make flowcharts, so here's a simple step-by-step explanation:

1. After incrementing the guest instruction pointer (meaning that this can only be done if the instruction has been emulated), check if the trap flag (bit 8) of the guest's RFLAGS is set, and that bit 1 of the guest's IA32_DEBUGCTL isn't set (more on that later).

2. Inject a single-step debug exception using the guest pending debug exception VMCS field (rather than using traditional event injection, due to the way that the processor handles pending debug exceptions) by setting the caused by single step bit to 1.

The format of the pending debug exceptions VMCS field. As said in the manual, even though the structure's format matches that of DR6, the RTM bit has an inverse effect (where if the bit in DR6 would be set, it would not be set in the pending debug exceptions field).

Of course, there are hoops one has to jump through. If the loading and saving of debug controls are enabled in the entry and exit controls respectively (which should be supported on every processor), IA32_DEBUGCTL will be stored inside of a VMCS field with the encoding 0x2802. Bit 1 of IA32_DEBUGCTL has to be checked as, if that bit is set, debug exceptions will only happen on branch instructions, and thus should only be injected by a hypervisor if it is emulating a branch instruction. As a rule of thumb, a hypervisor should get the value, set the relevant bits, and then write to the pending debug exceptions field, as some bits can be set by the processor, or even by previous writes to the field.
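The two steps and the branch-trap caveat can be condensed into one helper. This is only a sketch: the struct and function names are made up for illustration, and the caller is assumed to have already read the three guest values from the VMCS and to write the result back to the pending debug exceptions field. Bit 14 is the single-step (BS) bit, as in DR6.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RFLAGS_TF      (1ull << 8)   /* trap flag */
#define DEBUGCTL_BTF   (1ull << 1)   /* single-step on branches only */
#define PENDING_DBG_BS (1ull << 14)  /* caused by single step */

/* Guest values previously read from the VMCS (names are illustrative). */
struct guest_debug_state {
    uint64_t rflags;
    uint64_t debugctl;
    uint64_t pending_dbg;
};

/*
 * New value for the pending debug exceptions field after an instruction
 * has been emulated and the guest instruction pointer advanced.
 * Existing bits are preserved; only BS is ever added.
 */
static inline uint64_t
pending_dbg_after_emulation(const struct guest_debug_state *g,
                            bool emulated_branch)
{
    assert(g != NULL);
    bool tf  = (g->rflags & RFLAGS_TF) != 0;
    bool btf = (g->debugctl & DEBUGCTL_BTF) != 0;
    uint64_t pending = g->pending_dbg;

    /* With BTF set, only emulated branches raise the trap. */
    if (tf && (!btf || emulated_branch))
        pending |= PENDING_DBG_BS;
    return pending;
}
```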

But Wait, There's More

This isn't perfect, as shown here:

In this GIF, the effects of the MOV SS blocking of debug exceptions are extended to the NOP instruction, when they should have ended after the CPUID instruction.

This is caused by the fact that the hypervisor also needs to clear bits from the guest interruptibility state in the VMCS, or specifically, the blocked by mov ss/blocked by sti/blocked by smi bits. All of these bits are similar in that they are only meant to last for the duration of the instruction after the one that caused them to be set; the hypervisor, when emulating an instruction, must also clear these bits as the CPU would during normal operation.

The bit format of the guest interruptibility state VMCS field. The blocked by nmi and enclave interruption fields aren't touched as those are generally set and cleared by the processor and are only relevant for specific VM exits.

Similar to the checks done on the trap flag, these bits should only be cleared if the instruction has been emulated, leading to the correct behaviour.
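A minimal sketch of that clearing step, using bit 0 (blocking by STI) and bit 1 (blocking by MOV SS) of the guest interruptibility state. The helper name is hypothetical, and as a simplifying assumption this sketch leaves the SMI bit alone; add it to the mask if your design calls for it.

```c
#include <assert.h>
#include <stdint.h>

#define INT_STATE_BLOCK_STI    (1u << 0)
#define INT_STATE_BLOCK_MOV_SS (1u << 1)

/*
 * Value to write back to the guest interruptibility state field once
 * an instruction has been emulated. Other bits (NMI blocking, enclave
 * interruption) are passed through unchanged.
 */
static inline uint32_t interruptibility_after_emulation(uint32_t state)
{
    const uint32_t one_shot = INT_STATE_BLOCK_STI | INT_STATE_BLOCK_MOV_SS;

    /* Both blocking bits set at once is not a valid guest state. */
    assert((state & one_shot) != one_shot);
    return state & ~one_shot;
}
```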

Everything now works as it should.

Closing Notes

Along with the fields mentioned above, I've found that the following should be taken into consideration:

Preserving cr2; depending on the design of the hypervisor, a page fault could be possible.

Preserving SSE register state, including every xmm* register and the mxcsr, as well as setting up the mxcsr so that #XM faults cannot happen, by setting it to a value of 0x1F80. Many compilers, such as MSVC, will optimize certain portions of code to use the SSE instruction set due to its speed.

If the hypervisor interacts with guest memory or VM exits on any I/O port interactions, debug exceptions (as well as alignment checks) should be injected for both, using the aforementioned pending debug exceptions VMCS field. Do note that a debug exception being encountered will not stop the instruction from executing, and I/O port debug exceptions are checked using a sign-extended I/O port address (the last point may be wrong - make sure to consult the manual).

VM exits encountered while the processor is injecting an event into the guest will cause it to fill out the IDT vectoring information field, which contains the event that was being injected. Ignoring this field could cause the event to be lost in the infinite void. Note that this can be filled out even if the hypervisor did not inject the event. More information can be found in sections 24.9.2, 27.2, and 31.7.1.2 of volume 3.

Of course, not everything is covered here, such as the handling of the trap flag if a write to IA32_DEBUGCTL causes a VM exit (which is covered in 32.2.1). Always consult the Intel manual before listening to a stranger on the internet.

Monday, 22 October 2018

Control Register Access Exiting and Crashing VMware

(Updated on November 4 2018 to correct a minor error)

Coinciding with my previous two posts, here's how you can crash or at least detect VMware and many other hypervisors:
https://gist.github.com/drew-gpf/d31840bebbbb1ff1d112a6f46e162c05

Backstory:
When I was writing a simple SEH emulator (following the documentation on msdn as well as this excellent blogpost) for my hypervisor, I was testing under VMware.

When trying to execute that, I found that VMware would instantly close without any message. After being stumped for a while, I tried on my PC only to find that my SEH emulator did, in fact, work.

When I talked about it with my friend daax (whose blog can be found over at https://revers.engineering/), he recalled experiencing the exact same issue (albeit with different motivations): when he tried to clear CR0.PE to cause a #GP(0), VMware would just close on him. This eventually drove me to the conclusion that VMware was improperly handling the CR access VM exit for CR0; more specifically, it doesn't check whether CR0.PG is already set when CR0.PE is cleared, a combination which would normally cause a #GP(0). Since it writes the invalid value into the guest CR0 VMCS field, the processor objects upon VM entry in the form of a VM entry failure, which VMware responds to by just closing itself.

Additional checks in that gist linked above rely on hypervisors not properly emulating CPU behaviour, which includes:

Injecting an exception upon updating bits of CR0 required to be a certain value by the CPU for VMX operation or just not updating them at all
Not injecting a fault when the guest attempts to set a reserved bit of CR4, which can result in either VM entry failure or a triple fault due to repeated #GP(0)s
Not forcing reserved bits of CR0 to 0 (despite me previously stating otherwise, I forgot that reserved bits of CR0 won't actually cause a #GP(0) and will be forced to 0)
Updating the state of CR0 even though the write caused a #GP(0)

Fixes for such include:

Always checking if a change is valid before changing any CPU state
For control register bits that are documented, if they are changed, the hypervisor should check that the processor supports the bit and whether the processor would raise an exception if the bit was changed (e.g. with the VMware example, they should check if CR0.PG is set, and if it is, declare the change as invalid)
Control register bits that are forced to be a fixed value should be host-owned bits whose values only change in the read shadow
Control register bits which don't exist at the time of writing the hypervisor should never be allowed to change - this also means that the hypervisor *must* control CPUID responses to remove reserved bits from responses, as well as reserved leaves
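The first fix ("check before changing state") could look something like the sketch below. It is not exhaustive: it covers only the CR0 combinations I am confident raise #GP(0) on real hardware (PG without PE, NW without CD, and any of bits 63:32 set in 64-bit mode), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CR0_PE (1ull << 0)
#define CR0_NW (1ull << 29)
#define CR0_CD (1ull << 30)
#define CR0_PG (1ull << 31)

/* True if hardware would answer this CR0 value with #GP(0). */
static inline bool cr0_write_raises_gp(uint64_t new_cr0)
{
    if ((new_cr0 & CR0_PG) && !(new_cr0 & CR0_PE))
        return true;              /* paging without protected mode */
    if ((new_cr0 & CR0_NW) && !(new_cr0 & CR0_CD))
        return true;              /* NW set while CD is clear */
    if ((new_cr0 >> 32) != 0)
        return true;              /* upper half is reserved */
    return false;
}

/*
 * Validate first, commit second. Returns false when the caller should
 * inject #GP(0); in that case *guest_cr0 is left untouched.
 */
static inline bool emulate_cr0_write(uint64_t *guest_cr0, uint64_t new_cr0)
{
    assert(guest_cr0 != NULL);
    if (cr0_write_raises_gp(new_cr0))
        return false;
    *guest_cr0 = new_cr0;
    return true;
}
```

The same validate-then-commit shape applies to CR4, with the set of reserved bits derived from what the hypervisor chooses to expose through CPUID.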

Now to shamelessly plug my first post. An easy mechanism to implement what I described is found right here!

Note that, at least for my CPU, reserved bits of CR4 will be marked as bits which must be 0 by the architecture; using my system will just make those bits change in the read shadow, so those bits must be set inside of the "no set" bitmask.

Comments and suggestions are appreciated.

Wednesday, 26 September 2018

Detecting VMware by reading an invalid MSR

This isn't exactly related to hypervisor development; I just thought it was a neat find.
Basic background:
An MSR, or Model-Specific Register, is a 64-bit register (accessed through EDX:EAX with RDMSR and WRMSR) that contains data which either affects processor behaviour or stores processor-specific data that doesn't fit into a CPUID leaf. Trying to interact with an invalid MSR index will result in a General-Protection Fault, or #GP(0) (the (0) meaning an error code of 0).

MSRs are virtualized or emulated by the Virtual Machine Monitor. Generally speaking, an MSR read or write will trap into the VMM/hypervisor so it can give the correct response (newer processors contain features such as MSR bitmaps for Intel's Virtual Machine eXtensions which specify the MSRs that do this). Therefore, if a hypervisor decides to not implement the proper response for an MSR interaction, it can be used as a way to detect it.

So, back to invalid MSR indices. Intel defines a range of MSRs which, unless subverted by a hypervisor, will always be invalid, no matter what CPU you're using (provided it's Intel). These are MSRs 40000000h - 400000FFh.

As said before, reading or writing to these MSRs will always cause a #GP(0). Unless, of course, you're running under VMware.

One day, while messing around with my hypervisor inside of VMware and reading random MSRs, well, let the image speak for itself:

I was able to read a normally invalid MSR! This means that one could detect VMware by simply doing:

For context, with my program, the normal output is:

Tuesday, 18 September 2018

"How does control register exiting work?"

For my first magic trick, I'll explain:
-What control register guest/host masks and read shadows are
-How to determine bits of CR0 and CR4 which are required to be a certain state during VMX operation, and how they relate to the Guest/Host masks and read shadows
-Handling CR access VM exits

CR0 and CR4 Guest/Host Masks and Read Shadows

First, an explanation from Volume 3 of the Intel Manual:

Put simply, the VMCS has a guest/host mask and read shadow for control registers CR0 and CR4. If a bit in the mask is 1, it is "host-owned", meaning that:
-Reads from the control register will return host-owned bits from their value in the read shadow (i.e. if bit 3 of cr0 is set up to be host-owned, the value the guest is given will be from bit 3 of the cr0 read shadow; other bits, such as possibly bit 0, that aren't host-owned will return their actual values).
-Writes that change host-owned bits from their value in the read shadow will cause a CR Access VM exit. Such a VM exit is fault-like, meaning that the instruction is not executed and the guest instruction pointer will point to the instruction that caused the VM exit.
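Both rules reduce to a couple of bitwise operations. The following is a small model of what the processor does, with hypothetical names, useful mainly for reasoning about which mask bits to set:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One virtualized control register (CR0 or CR4). */
struct cr_virt {
    uint64_t real;    /* value in the guest CRx VMCS field */
    uint64_t mask;    /* guest/host mask: 1 = host-owned */
    uint64_t shadow;  /* read shadow */
};

/* Value the guest observes when it reads the register. */
static inline uint64_t cr_guest_read(const struct cr_virt *cr)
{
    assert(cr != NULL);
    return (cr->shadow & cr->mask) | (cr->real & ~cr->mask);
}

/* True if a guest write of new_value causes a CR access VM exit. */
static inline bool cr_write_exits(const struct cr_virt *cr,
                                  uint64_t new_value)
{
    assert(cr != NULL);
    return ((new_value ^ cr->shadow) & cr->mask) != 0;
}
```

For example, with only bit 3 host-owned and set in the shadow, a guest write that leaves bit 3 set does not exit, while one that clears it does.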

Such a mechanism is needed because of bits the VMM may want to emulate the effect of or not want the guest to set; there are also bits which are not updated by the processor upon VMX transitions, which will be covered later. A typical VMM will make bits required by the architecture to be set to a certain value unable to be changed by the guest.

Determining Bits That Are Required By the Architecture

To determine required bits, the architecture defines 4 Model-Specific Registers. They are:

-IA32_VMX_CR0_FIXED0 (index 486h)

-IA32_VMX_CR0_FIXED1 (index 487h)

-IA32_VMX_CR4_FIXED0 (index 488h)

-IA32_VMX_CR4_FIXED1 (index 489h)

These MSRs are formatted in a way which may be seen as unorthodox. Indeed, if one looks for an explanation in the Intel Manual, they'll see a wall of text which takes ages to decrypt:

(repeat for cr4)
Which means, in English, that if a bit in both 'fixed' MSRs is set to the same value, that bit in the control register must be that value; if the bit in both MSRs is set to two different values, that bit is 'flexible' and can be any value.

From this, you can create a mask of bits which must have a certain value, as well as the values of those bits.

You should obviously replace 'crx' with cr0 and cr4
How does this work? To get flexible bits, both MSRs are XORed together: a flexible bit is 0 in one MSR and 1 in the other, so it is always 1 after the XOR, while required bits are 0, since 0 xor 0 and 1 xor 1 are both 0.

To get which bits are required (which are going to be part of our host mask), a bitwise NOT is performed on the flexible bits.

To get the values of required bits, I get what both MSRs tell me, minus the flexible bits.
I can then write the required bits, along with the already existing bits of the control register (only including flexible bits) to the control register, allowing VMX operation.
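A sketch of the computation just described, written from the prose rather than from the original snippet (the names are my own). The inputs are the raw values of the FIXED0 and FIXED1 MSRs for one control register.

```c
#include <assert.h>
#include <stdint.h>

struct cr_fixed {
    uint64_t required_mask;    /* 1 = bit has a mandatory value */
    uint64_t required_values;  /* mandatory value of each such bit */
};

static inline struct cr_fixed cr_fixed_bits(uint64_t fixed0, uint64_t fixed1)
{
    /* A bit reported as 1 in FIXED0 is also 1 in FIXED1. */
    assert((fixed0 & ~fixed1) == 0);

    uint64_t flexible = fixed0 ^ fixed1;  /* the two MSRs disagree */
    struct cr_fixed result = {
        .required_mask   = ~flexible,
        .required_values = fixed0 & fixed1,
    };
    return result;
}

/* Keep the flexible bits of cr, force the required ones. */
static inline uint64_t cr_apply_fixed(uint64_t cr, struct cr_fixed fixed)
{
    return (cr & ~fixed.required_mask) | fixed.required_values;
}
```

With illustrative MSR values of 0x80000021 and 0xFFFFFFFF for CR0 (assumed for the example, not read from any particular CPU), PE, NE and PG come out as required-to-be-1 and the upper 32 bits as required-to-be-0.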

If you're using the VM-execution control called unrestricted guest, which allows the guest to run in non-paged protected mode and 16-bit real-address mode, it's recommended that you remove bits 0 and 31 (the protected mode enable and paging bits, respectively) from the cr0 required bitmask and states, as the MSRs may still report them as required even though the control allows those bits to be cleared.

The required bitmask and states can then be used with the CRx guest/host mask and read shadow, as well as the relevant VM exits.

You can now use these values when setting up the VMCS fields:

The values and purposes of "crx_host_mask_trap/no_set" are explained when handling CR access vm exits.

Handling Cr Access VM Exits

Note that this section does not cover exits caused by cr3 or cr8 reads or writes (as they aren't relevant for this, they aren't required, and they're very simple).

These VM exits have 3 parts: Getting the value of the control register that the guest wants to set, setting the read shadow/actual control register and injecting a #GP if needed, and handling how the guest sets the bits of each control register.

Getting the control register value

Getting the actual control register can be simple or a bit complicated. If the exit qualification indicates that the control register number is 4, it always means a MOV to CR4, and you can get the value the guest wants to set from the corresponding register.

CR0 has 3 ways: A MOV to CR0, an execution of CLTS, and the LMSW instruction. CLTS is really simple; get the value of the control register that the guest would read and remove bit 3, the 'task switched' bit. If LMSW causes a VM exit, you need to either read from the exit qualification or from memory. Since this is implementation specific (if the guest uses paging you need to translate one or more linear addresses), I won't go too in-depth.
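For the CLTS and register-form LMSW cases, the value the guest wants can be derived as sketched below. This assumes the exit qualification layout I recall from the SDM (access type in bits 5:4, LMSW source data in bits 31:16) and that LMSW only affects the low four bits of CR0 and can set but never clear PE; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_PE           (1ull << 0)
#define CR0_TS           (1ull << 3)
#define CR0_LMSW_MASK    0xFull   /* PE, MP, EM, TS */
#define CR_ACCESS_LMSW   3u       /* access type in bits 5:4 */

/* CLTS: the guest-visible CR0 with the task-switched bit removed. */
static inline uint64_t cr0_after_clts(uint64_t guest_cr0)
{
    return guest_cr0 & ~CR0_TS;
}

/* LMSW source operand as reported in the exit qualification. */
static inline uint16_t lmsw_source_from_qualification(uint64_t qualification)
{
    assert(((qualification >> 4) & 3) == CR_ACCESS_LMSW);
    return (uint16_t)(qualification >> 16);
}

/* LMSW: replace the low four bits, but never clear PE. */
static inline uint64_t cr0_after_lmsw(uint64_t guest_cr0, uint16_t source)
{
    uint64_t low = (source & CR0_LMSW_MASK) | (guest_cr0 & CR0_PE);
    return (guest_cr0 & ~CR0_LMSW_MASK) | low;
}
```

The resulting value is then treated like any other attempted CR0 write: validated first, then applied to the read shadow and guest CR0.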

You'd then need to determine if the setting is valid, and set the read shadow and control register.

Setting the read shadow and control register

Because of the way this site handles code formatting, I posted an example in a gist over at https://gist.github.com/drew-gpf/7c2d4c0f03aa3c06ed9c94ec435355c1

By giving the function the value the guest wants to set, as well as bits which can't be set, which will be changed but have no effect, and bits which are changed and only exist to VM exit on, you can properly handle a CR access VM exit.

Remember when I said that some bits aren't changed on VMX transitions, and that CR access VM exits don't execute the instruction? Those bits are bits 30 (cache disable) and 29 (not write through). If a CR0 write changes these bits and some host owned bits, these bits won't actually change, thus meaning that you should explicitly handle them (you can either directly change them or emulate their changes by changing the EPT structures so that all entries are uncacheable and ignore the PAT memory type).

Handling different bits set by the guest

Helpfully, handle_cr_change will fill out a buffer that receives the value of every bit that is different. With these different bits, you can figure out if a bit was changed, and if it's relevant, perform the needed operation. For example, if you wanted to do something when the guest set cr4.smep, you'd set bit 20 in the 'trap' host mask (or the 'dont care' host mask if you don't want it to change cpu operation) and handle bit 20 being set in the different bit field.

Caveats

Earlier, I made a case about needing to handle cr0.cd and cr0.nw since they aren't changed by the architecture if a VM exit happens. There are also more bits which you need to handle that otherwise won't perform the correct operation. For CR0, you'd want to handle the guest setting cr0.pg (bit 31), since when entering long mode it otherwise won't set important structures. This issue actually affected me when trying to virtualize a system before the Windows kernel initialized, as the guest was also modifying a VMX required bit, which caused a VM-entry failure. To handle the changing of this bit, you should look at how to enter long mode, as well as structures the processor verifies when transitioning to the guest.

CR4 also has bits such as pge, pcide, and smep which flush the TLB, with PCIDE also being a reserved bit while operating in protected mode. It's also a good idea to verify that the pae bit isn't being unset if EFER.LMA is set, as that causes a VM entry failure.

While it is not required, it's also a good idea to reserve control register bits which don't exist at the time of writing the hypervisor.