Monday 22 October 2018

Control Register Access Exiting and Crashing VMware

(Updated on November 4 2018 to correct a minor error)

Coinciding with my previous two posts, here's how you can crash or at least detect VMware and many other hypervisors:
https://gist.github.com/drew-gpf/d31840bebbbb1ff1d112a6f46e162c05

Backstory:
When I was writing a simple SEH emulator (following the documentation on MSDN as well as this excellent blog post) for my hypervisor, I was testing under VMware.
When I tried to run it, VMware would instantly close without any message. After being stumped for a while, I tried it directly on my PC, only to find that my SEH emulator did, in fact, work.

When I talked about it with my friend daax (whose blog can be found over at https://revers.engineering/), he recalled experiencing the exact same issue (albeit with different motivations): when he tried to clear CR0.PE to cause a #GP(0), VMware would just close on him. This eventually drove me to the conclusion that VMware was improperly handling the CR access VM exit for CR0. More specifically, when the guest clears CR0.PE, VMware doesn't check whether CR0.PG is still set, a combination that would normally cause a #GP(0). Since it writes the invalid value into the guest CR0 VMCS field, the processor rejects the guest state on the next VM entry, and VMware responds to that VM entry failure by just closing itself.

Additional checks in that gist linked above rely on hypervisors not properly emulating CPU behaviour, which includes:
  • Injecting an exception when the guest updates bits of CR0 that the CPU requires to hold a certain value for VMX operation, or silently not updating them at all
  • Not injecting a fault when the guest attempts to set a reserved bit of CR4, which can result in either VM entry failure or a triple fault due to repeated #GP(0)s
  • Not forcing reserved bits of CR0 to 0 (despite me previously stating otherwise, I forgot that reserved bits of CR0 won't actually cause a #GP(0) and will be forced to 0)
  • Updating the state of CR0 even though the write caused a #GP(0)

Fixes for such include:
  • Always checking if a change is valid before changing any CPU state
  • For control register bits that are documented, if they are changed, the hypervisor should check that the processor supports the bit and whether the processor would raise an exception for the change (e.g. with the VMware example, they should check if CR0.PG is set when CR0.PE is being cleared, and if it is, declare the change invalid)
  • Control register bits that are forced to a fixed value should be host-owned bits whose values only change in the read shadow
  • Control register bits which don't exist at the time of writing the hypervisor should never be allowed to change. This also means that the hypervisor *must* control CPUID responses to remove reserved bits, as well as reserved leaves, from them
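
The first two fixes can be sketched for CR0 as a check that runs before any guest state is touched. This is only a minimal sketch, not the code from the gist: `cr0_write_faults` and the `CR0_*` names are hypothetical, and it covers only the invalid combinations the Intel manual names for MOV to CR0 (PG without PE, NW without CD, and any of bits 63:32 set).

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_PE (1ull << 0)   /* protection enable */
#define CR0_NW (1ull << 29)  /* not write-through */
#define CR0_CD (1ull << 30)  /* cache disable */
#define CR0_PG (1ull << 31)  /* paging */

/* Hypothetical helper: returns true if writing `value` to CR0 should
 * raise #GP(0) instead of reaching the guest CR0 VMCS field. */
bool cr0_write_faults(uint64_t value)
{
    if ((value & CR0_PG) && !(value & CR0_PE))
        return true;    /* paging without protected mode */
    if ((value & CR0_NW) && !(value & CR0_CD))
        return true;    /* NW=1 with CD=0 is an invalid combination */
    if (value >> 32)
        return true;    /* bits 63:32 must be zero */
    return false;
}
```

In the VMware case, the guest's value has PG set and PE clear, so a check like this would return true and the hypervisor would inject #GP(0) rather than attempt the VM entry.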


Now to shamelessly plug my first post. An easy mechanism to implement what I described is found right here!
Note that, at least for my CPU, reserved bits of CR4 will be marked as bits which must be 0 by the architecture; using my system will just make those bits change in the read shadow, so those bits must be set inside of the "no set" bitmask.

Comments and suggestions are appreciated.

Wednesday 26 September 2018

Detecting VMware by reading an invalid MSR

This isn't exactly related to hypervisor development; I just thought it was a neat find.
Basic background:
An MSR, or Model-Specific Register, is a 64-bit register that either affects processor behaviour or stores processor-specific data that doesn't belong in a CPUID leaf. Trying to read or write an invalid MSR index will result in a General-Protection Fault, or a #GP(0) (the (0) meaning an error code of 0).

MSRs are virtualized or emulated by the Virtual Machine Monitor. Generally speaking, an MSR read or write will trap into the VMM/hypervisor so it can give the correct response (newer processors contain features such as MSR bitmaps for Intel's Virtual Machine eXtensions which specify the MSRs that do this). Therefore, if a hypervisor decides to not implement the proper response for an MSR interaction, it can be used as a way to detect it.

So, back to invalid MSR indices. Intel defines a range of MSRs which, unless subverted by a hypervisor, will always be invalid, no matter what CPU you're using (provided it's Intel). These are MSRs 40000000h - 400000FFh.
As said before, reading or writing to these MSRs will always cause a #GP(0). Unless, of course, you're running under VMware.

One day, while messing around with my hypervisor inside of VMware and reading random MSRs, well, let the image speak for itself:
I was able to read a normally invalid MSR! This means that one could detect VMware by simply doing:
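
A minimal sketch of such a check follows. RDMSR is a privileged instruction and a faulting read has to be caught somehow, so the sketch assumes a hypothetical callback, `try_rdmsr`, that performs the read in kernel mode and reports whether it faulted; the two stand-in callbacks exist only so the logic can be exercised outside a driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback: executes RDMSR in kernel mode and returns false
 * if the read raised #GP(0) (for example, caught with SEH in a driver). */
typedef bool (*try_rdmsr_fn)(uint32_t index, uint64_t *value);

#define RESERVED_MSR_FIRST 0x40000000u
#define RESERVED_MSR_LAST  0x400000FFu

/* Returns true if any MSR in the range that is invalid on Intel hardware
 * can be read, which suggests a hypervisor is answering the read. */
bool reserved_msr_is_readable(try_rdmsr_fn try_rdmsr)
{
    for (uint32_t index = RESERVED_MSR_FIRST; index <= RESERVED_MSR_LAST; index++) {
        uint64_t value;
        if (try_rdmsr(index, &value))
            return true;
    }
    return false;
}

/* Stand-in for bare metal: every read in the range faults. */
bool rdmsr_always_faults(uint32_t index, uint64_t *value)
{
    (void)index;
    (void)value;
    return false;
}

/* Stand-in for a hypervisor that answers the read instead of faulting. */
bool rdmsr_never_faults(uint32_t index, uint64_t *value)
{
    (void)index;
    *value = 0;
    return true;
}
```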

For context, with my program, the normal output is:

Tuesday 18 September 2018

"How does control register exiting work?"

For my first magic trick, I'll explain:
-What control register guest/host masks and read shadows are
-How to determine bits of CR0 and CR4 which are required to be a certain state during VMX operation, and how they relate to the Guest/Host masks and read shadows
-Handling CR access VM exits

CR0 and CR4 Guest/Host Masks and Read Shadows


First, an explanation from Volume 3 of the Intel Manual:
Put simply, the VMCS has a guest/host mask and read shadow for control registers CR0 and CR4. If a bit in the mask is 1, it is "host-owned", meaning that:
-Reads from the control register will return host-owned bits from their value in the read shadow (i.e. if bit 3 of CR0 is set up to be host-owned, the value the guest is given for that bit will come from bit 3 of the CR0 read shadow; other bits that aren't host-owned, such as possibly bit 0, will return their actual values).
-Writes that change host-owned bits from their value in the read shadow will cause a CR Access VM exit. Such a VM exit is fault-like, meaning that the instruction is not executed and the guest instruction pointer will point to the instruction that caused the VM exit.
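
The two rules above boil down to a couple of bitwise expressions. The sketch below only models what the processor does, so the function names are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Value a guest read of CR0/CR4 returns: host-owned bits (mask bit = 1)
 * come from the read shadow, all other bits from the real register. */
uint64_t guest_visible_cr(uint64_t real_cr, uint64_t host_mask, uint64_t read_shadow)
{
    return (real_cr & ~host_mask) | (read_shadow & host_mask);
}

/* A MOV to CR0/CR4 causes a CR access VM exit if it would give any
 * host-owned bit a value different from the one in the read shadow. */
bool cr_write_exits(uint64_t new_value, uint64_t host_mask, uint64_t read_shadow)
{
    return ((new_value ^ read_shadow) & host_mask) != 0;
}
```

For example, with only bit 3 host-owned and a read shadow of 0, a write that sets bit 3 exits, while a write that only sets bit 0 does not.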

Such a mechanism is needed because there are bits whose effect the VMM may want to emulate, and bits it may not want the guest to set at all; there are also bits which are not updated by the processor upon VMX transitions, which will be covered later. A typical VMM will prevent the guest from changing bits that the architecture requires to hold a certain value.

Determining Bits That Are Required By the Architecture


To determine required bits, the architecture defines 4 Model-Specific Registers. They are:
-IA32_VMX_CR0_FIXED0 (index 486h)
-IA32_VMX_CR0_FIXED1 (index 487h)
-IA32_VMX_CR4_FIXED0 (index 488h)
-IA32_VMX_CR4_FIXED1 (index 489h)

These MSRs are formatted in a way which may be seen as unorthodox. Indeed, if one looks for an explanation in the Intel Manual, they'll see a wall of text which takes ages to decipher:
(repeat for cr4)
Which means, in English, that if a bit in both 'fixed' MSRs is set to the same value, that bit in the control register must be that value; if the bit in both MSRs is set to two different values, that bit is 'flexible' and can be any value.

From this, you can create a mask of bits which must have a certain value, as well as the values of those bits.

You should obviously replace 'crx' with cr0 and cr4
How does this work? To get flexible bits, both MSRs are XORed together: a flexible bit has two different values in the two MSRs (one 0 and one 1), which always XOR to 1, while a required bit XORs to 0, since 0 xor 0 and 1 xor 1 are both 0.

To get which bits are required (which are going to be part of our host mask), a bitwise NOT is performed on the flexible bits.

To get the values of required bits, I get what both MSRs tell me, minus the flexible bits.
I can then write the required bits, along with the already existing bits of the control register (only including flexible bits) to the control register, allowing VMX operation.
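
The derivation above can be sketched as follows. The struct and function names are hypothetical, and the two MSR values are passed in as plain integers on the assumption that they have already been read with RDMSR elsewhere.

```c
#include <stdint.h>

struct cr_requirements {
    uint64_t flexible;        /* bits that may be either 0 or 1 */
    uint64_t required_mask;   /* bits that must hold a fixed value */
    uint64_t required_values; /* the values those bits must hold */
};

/* fixed0/fixed1 are the raw values of IA32_VMX_CRx_FIXED0/FIXED1. */
struct cr_requirements cr_requirements_from_fixed(uint64_t fixed0, uint64_t fixed1)
{
    struct cr_requirements req;
    req.flexible = fixed0 ^ fixed1;         /* MSRs disagree: flexible */
    req.required_mask = ~req.flexible;      /* MSRs agree: required */
    req.required_values = fixed0 & fixed1;  /* AND drops the flexible bits */
    return req;
}

/* Keep the flexible bits of the current register and force the rest. */
uint64_t cr_apply_requirements(uint64_t cr, const struct cr_requirements *req)
{
    return (cr & req->flexible) | req->required_values;
}
```

As an illustration only (the real values depend on the CPU), feeding in FIXED0 = 80000021h and FIXED1 = FFFFFFFFh marks bits 0, 5 and 31 as required to be 1 and bits 63:32 as required to be 0.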

If you're using a VM control called unrestricted guest, which allows the guest to run in non-paged protected mode and real-address mode, it's recommended that you remove bits 0 and 31 (the protected mode enable and paging bits, respectively) from the CR0 required bitmask and states: the fixed MSRs may still report them as required, even though the VM control allows those bits to be clear.
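
A short sketch of that adjustment, assuming the required bitmask and states are kept in two plain 64-bit variables (the function name is hypothetical):

```c
#include <stdint.h>

#define CR0_PE (1ull << 0)   /* protection enable */
#define CR0_PG (1ull << 31)  /* paging */

/* With unrestricted guest enabled, PE and PG stop being required bits. */
void cr0_relax_for_unrestricted_guest(uint64_t *required_mask, uint64_t *required_values)
{
    *required_mask   &= ~(CR0_PE | CR0_PG);
    *required_values &= ~(CR0_PE | CR0_PG);
}
```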

The required bitmask and states can then be used with the CRx guest/host mask and read shadow, as well as the relevant VM exits.

You can now use these values when setting up the VMCS fields:
The values and purposes of "crx_host_mask_trap/no_set" are explained when handling CR access vm exits.

Handling CR Access VM Exits


Note that this section does not cover exits caused by cr3 or cr8 reads or writes (as they aren't relevant for this, they aren't required, and they're very simple).

These VM exits have 3 parts: Getting the value of the control register that the guest wants to set, setting the read shadow/actual control register and injecting a #GP if needed, and handling how the guest sets the bits of each control register.

Getting the control register value


Getting the value the guest wants to set can be simple or a bit complicated. If the exit qualification indicates that the control register number is 4, it always means a MOV to CR4, and you can get the value from the general-purpose register named in the exit qualification.
CR0 exits have three sources: a MOV to CR0, an execution of CLTS, and the LMSW instruction. CLTS is really simple; get the value of the control register that the guest would read and clear bit 3, the 'task switched' bit. If LMSW causes a VM exit, you need to either read the source operand from the exit qualification or from memory. Since this is implementation specific (if the guest uses paging you need to translate one or more linear addresses), I won't go too in-depth.
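
As a rough sketch of the CR0 cases, the exit qualification for a CR access keeps the control register number in bits 3:0, the access type in bits 5:4, the general-purpose register in bits 11:8 and the LMSW source data in bits 31:16. Everything else below is an assumption made for illustration: the function names, and a `gprs` array that holds the saved guest registers in the exit qualification's order (including RSP at index 4, which really lives in the VMCS).

```c
#include <stdint.h>

enum cr_access_type {
    CR_ACCESS_MOV_TO_CR   = 0,
    CR_ACCESS_MOV_FROM_CR = 1,
    CR_ACCESS_CLTS        = 2,
    CR_ACCESS_LMSW        = 3
};

#define CR0_TS (1ull << 3)  /* task switched */

unsigned cr_exit_cr_number(uint64_t qual)   { return (unsigned)(qual & 0xF); }
unsigned cr_exit_access_type(uint64_t qual) { return (unsigned)((qual >> 4) & 0x3); }
unsigned cr_exit_gpr_index(uint64_t qual)   { return (unsigned)((qual >> 8) & 0xF); }
uint16_t cr_exit_lmsw_source(uint64_t qual) { return (uint16_t)(qual >> 16); }

/* guest_cr0 is the CR0 value the guest would currently read. */
uint64_t cr0_value_guest_wants(uint64_t qual, uint64_t guest_cr0,
                               const uint64_t gprs[16])
{
    switch (cr_exit_access_type(qual)) {
    case CR_ACCESS_CLTS:
        return guest_cr0 & ~CR0_TS;
    case CR_ACCESS_LMSW:
        /* LMSW loads only CR0 bits 3:0 and can set PE but never clear it. */
        return (guest_cr0 & ~0xEull) | (cr_exit_lmsw_source(qual) & 0xFull);
    default:
        /* MOV to CR0: the source is a general-purpose register. */
        return gprs[cr_exit_gpr_index(qual)];
    }
}
```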

You'd then need to determine if the setting is valid, and set the read shadow and control register.

Setting the read shadow and control register

Because of the way this site handles code formatting, I posted an example in a gist over at https://gist.github.com/drew-gpf/7c2d4c0f03aa3c06ed9c94ec435355c1
The function takes the value the guest wants to set, along with three sets of bits: bits which can't be set, bits which will be changed but have no effect, and bits which are changed and exist only to cause a VM exit. With those, you can properly handle a CR access VM exit.
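
To give a feel for the shape of such a handler without reproducing the gist, here is a much-simplified sketch. It is not the gist's code: `emulate_cr_write`, its parameters and the result struct are all hypothetical, and it assumes that bits in `shadow_only_mask` (required bits and bits with no effect) change only in the read shadow while every other permitted bit also changes in the real register.

```c
#include <stdbool.h>
#include <stdint.h>

struct cr_write_result {
    bool     inject_gp;   /* guest should receive #GP(0); state is untouched */
    uint64_t real_cr;     /* value for the guest CRx VMCS field */
    uint64_t read_shadow; /* value for the CRx read shadow */
    uint64_t changed;     /* bits that differ from what the guest last saw */
};

struct cr_write_result emulate_cr_write(uint64_t wanted, uint64_t guest_visible,
                                        uint64_t real_cr, uint64_t no_set_mask,
                                        uint64_t shadow_only_mask)
{
    struct cr_write_result r = { false, real_cr, guest_visible, 0 };

    /* Validate first, so a faulting write changes nothing. */
    if (wanted & no_set_mask) {
        r.inject_gp = true;
        return r;
    }

    r.changed = wanted ^ guest_visible;
    r.read_shadow = wanted;  /* the guest always reads back what it wrote */
    r.real_cr = (real_cr & shadow_only_mask) | (wanted & ~shadow_only_mask);
    return r;
}
```

The caller would write `real_cr` and `read_shadow` back to the VMCS, and walk `changed` to react to individual bits.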

Remember when I said that some bits aren't changed on VMX transitions, and that CR access VM exits don't execute the instruction? Those bits are bits 30 (cache disable) and 29 (not write through). If a CR0 write changes these bits along with some host-owned bits, the write exits without being executed, so these bits won't actually change. You should therefore handle them explicitly (you can either directly change them or emulate their changes by changing the EPT structures so that all entries are uncacheable and ignore the PAT memory type).

Handling different bits set by the guest

Helpfully, handle_cr_change will fill out a buffer that receives the value of every bit that is different. With these different bits, you can figure out if a bit was changed, and if it's relevant, perform the needed operation. For example, if you wanted to do something when the guest set cr4.smep, you'd set bit 20 in the 'trap' host mask (or the 'dont care' host mask if you don't want it to change CPU operation) and handle bit 20 being set in the different-bits field.

Caveats

Earlier, I made a case about needing to handle cr0.cd and cr0.nw since they aren't changed by the architecture if a VM exit happens. There are also other bits that you need to handle yourself, as the correct operation otherwise won't be performed. For CR0, you'd want to handle the guest setting cr0.pg (bit 31): since the exiting write is never executed, the processor won't do the work that normally comes with enabling paging, such as activating long mode when EFER.LME is set. This issue actually affected me when trying to virtualize a system before the Windows kernel initialized, as the guest was also modifying a VMX required bit, which caused a VM-entry failure. To handle the changing of this bit, you should look at how to enter long mode, as well as the structures the processor verifies when transitioning to the guest.
CR4 also has bits such as pge, pcide, and smep which flush the TLB, with PCIDE also being a reserved bit while operating in protected mode. It's also a good idea to verify that the pae bit isn't being unset if EFER.LMA is set, as that causes a VM entry failure.
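
The CR4 checks mentioned above can be sketched as one validation function. The name and parameters are hypothetical; `reserved_mask` is assumed to hold every CR4 bit the hypervisor treats as reserved, and the PCIDE rule follows the Intel manual (setting it faults outside long mode or when CR3[11:0] is non-zero).

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_PAE   (1ull << 5)
#define CR4_PCIDE (1ull << 17)

/* Returns true if the guest's MOV to CR4 should raise #GP(0). */
bool cr4_write_faults(uint64_t old_cr4, uint64_t new_cr4, uint64_t cr3,
                      bool efer_lma, uint64_t reserved_mask)
{
    if (new_cr4 & reserved_mask)
        return true;    /* attempt to set a reserved bit */
    if (efer_lma && !(new_cr4 & CR4_PAE))
        return true;    /* PAE can't be cleared while long mode is active */
    if (!(old_cr4 & CR4_PCIDE) && (new_cr4 & CR4_PCIDE) &&
        (!efer_lma || (cr3 & 0xFFF)))
        return true;    /* PCIDE 0 -> 1 needs long mode and CR3[11:0] = 0 */
    return false;
}
```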

While it is not required, it's also a good idea to reserve control register bits which don't exist at the time of writing the hypervisor.

Unexplained crashes, microcode updates, and CPU errata

Computers are complicated machines. It's all too common to be browsing the internet, playing the newest Call of Duty, or just watching ...