Monday, 14 January 2019

A Common Missight in Most Hypervisors

Generally, when a hypervisor encounters a VM exit, it is because it needs to emulate the effects of an instruction, be it CPUID, RDMSR, or ENCLS. Due to ineffective wording in Intel's Software Developer's Manual, most assume that the only necessary action to be taken is to update the guest's registers or inject an exception. There are many times, however, where this is not the case. Hidden all the way in section 32.2.1 of volume 3 is this sentence:
"If a VMM emulates a guest instruction that would encounter a debug trap (single step or data or I/O breakpoint), it should cause that trap to be delivered."

After reading the subsequent paragraph going on about injecting single-step debug exceptions, I created a new project in Visual Studio, confirming the suspicions in the back of my head:

The issue here is that, by not injecting a debug exception after the emulated CPUID instruction, the trap is essentially delayed until the NOP can recognize it.

This brings us to the main point, which is that the phrase "none of the events normally associated take place," albeit used to describe exiting for startup-IPIs, describes every fault-like VM exit.

Single-Step Debug Exceptions

It's not very hard to handle single-step debug exceptions. I don't know how to make flowcharts, so here's a simple step-by-step explanation:

    1. After incrementing the guest instruction pointer (meaning that this can only be done if the instruction has been emulated), check if the trap flag (bit 8) of the guest's RFLAGS is set, and that bit 1 of the guest's IA32_DEBUGCTL isn't set (more on that later).

    2. Inject a single-step debug exception using the guest pending debug exception VMCS field (rather than using traditional event injection, due to the way that the processor handles pending debug exceptions) by setting the caused by single step bit to 1.

The format of the pending debug exceptions VMCS field. As said in the manual, even though the structure's format matches that of DR6, the RTM bit has an inverse effect (where if the bit in DR6 would be set, it would not be set in the pending debug exceptions field).
Of course, there are hoops one has to jump through. If the loading and saving of debug controls are enabled in the entry and exit controls respectively (which should be supported on every processor), IA32_DEBUGCTL will be stored inside of a VMCS field with the encoding 0x2802. Bit 1 of IA32_DEBUGCTL has to be checked as, if that bit is set, debug exceptions will only happen on branch instructions, and thus should only be injected by a hypervisor if it is emulating a branch instruction. As a rule of thumb, a hypervisor should get the value, set the relevant bits, and then write to the pending debug exceptions field, as some bits can be set by the processor, or even by previous writes to the field.

But Wait, There's More

This isn't perfect, as shown here:

In this GIF, the effects of the MOV SS blocking of debug exceptions are extended to the NOP instruction, when they should have ended after the CPUID instruction.
This is caused by the fact that the hypervisor also needs to clear bits from the guest interruptibility state in the VMCS, or specifically, the blocked by mov ss/blocked by sti/blocked by smi bits. All of these bits are similar in that they are only meant to last for the duration of the instruction after the one that caused them to be set; the hypervisor, when emulating an instruction, must also clear these bits as the CPU would during normal operation.

The bit format of the guest interruptibility state VMCS field. The blocked by nmi and enclave interruption fields aren't touched as those are generally set and cleared by the processor and are only relevant for specific VM exits.

Similar to the checks done on the trap flag, these bits should only be cleared if the instruction has been emulated, leading to the correct behaviour.

Everything now works as it should.

Closing Notes

Along with the fields mentioned above, I've found that the following should be taken into consideration:

Preserving cr2; depending on the design of the hypervisor, a page fault could be possible.

Preserving SSE register state, including every xmm* register and the mxcsr, as well as setting up the mxcsr so that #XM faults cannot happen, by setting it to a value of 0x1F80. Many compilers, such as MSVC, will optimize certain portions of code to use the SSE instruction set due to its speed.

If the hypervisor interacts with guest memory or VM exits on any I/O port interactions, debug exceptions (as well as alignment checks) should be injected for both, using the aforementioned pending debug exceptions VMCS field. Do note that a debug exception being encountered will not stop the instruction from executing, and I/O port debug exceptions are checked using a sign-extended I/O port address (the last point may be wrong - make sure to consult the manual).

VM exits encountered while the processor is injecting an event into the guest will cause it to fill out the IDT vectoring information field, which contains the event that was being injected. Ignoring this field could cause the event to be lost in the infinite void. Note that this can be filled out even if the hypervisor did not inject the event. More information can be found in sections 24.9.2, 27.2, and of volume 3.

Of course, not everything is covered here, such as the handling of the trap flag if a write to IA32_DEBUGCTL causes a VM exit (which is covered in 32.2.1). Always consult the Intel manual before listening to a stranger on the internet.

Monday, 22 October 2018

Control Register Access Exiting and Crashing VMware

(Updated on November 4 2018 to correct a minor error)

Coinciding with my previous two posts, here's how you can crash or at least detect VMware and many other hypervisors:

When I was writing a simple SEH emulator (following the documentation on msdn as well as this excellent blogpost) for my hypervisor, I was testing under VMware.
When trying to execute that, I found that VMware would instantly close without any message. After being stumped for a while, I tried on my PC only to find that my SEH emulator did, in fact, work.

When I talked about it with my friend daax (whose blog can be found over at, he recalled experiencing the exact same issue (albeit with different motivations): when he tried to unset to cause a #GP(0), VMware would just close on him. This eventually drove me to the conclusion that VMware was improperly handling the CR access VM exit for CR0, or more specifically, they don't check that is already enabled, which would normally cause a #GP(0). Since they write the invalid value into the guest CR0 VMCS field, the processor objects upon VM entry in the form of a VM entry failure, which VMware responds to by just closing itself.

Additional checks in that gist linked above rely on hypervisors not properly emulating CPU behaviour, which includes:
  • Injecting an exception upon updating bits of CR0 required to be a certain value by the CPU for VMX operation or just not updating them at all
  • Not injecting a fault when the guest attempts to set a reserved bit of CR4, which can result in either VM entry failure or a triple fault due to repeated #GP(0)s
  • Not forcing reserved bits of CR0 to 0 (despite me previously stating otherwise, I forgot that reserved bits of CR0 won't actually cause a #GP(0) and will be forced to 0)
  • Updating the state of CR0 even though the write caused a #GP(0)

Fixes for such include:
  • Always checking if a change is valid before changing any CPU state
  • For control register bits that are documented, if they are changed, the hypervisor should ensure that the processor supports the bit and if the processor would inject an exception if the bit was changed (i.e with the VMware example, they should check if is set, and if it is, declare the change as invalid)
  • Control register bits that are forced to be a fixed value should be host owned bits which values only change in the read shadow
  • Control register bits which don't exist at the time of writing the hypervisor should never be allowed to change - this also means that the hypervisor *must* control CPUID responses to remove reserved bits from responses, as well as reserved leafs

Now to shamelessly plug my first post. An easy mechanism to implement what I described is found right here!
Note that, at least for my CPU, reserved bits of CR4 will be marked as bits which must be 0 by the architecture; using my system will just make those bits change in the read shadow, so those bits must be set inside of the "no set" bitmask.

Comments and suggestions are appreciated.

Wednesday, 26 September 2018

Detecting VMware by reading an invalid MSR

This isn't exactly related to hypervisor development, I just thought it was a neat find.
Basic background:
An MSR, or Model-Specific Register is a natural-width (i.e the size of a pointer) buffer that contains data which either affects processor behaviour or is used to store processor-specific data which is deemed to not go into a CPUID leaf. Trying to interact with an invalid MSR index will result in a General-Protection Fault, or a #GP(0) (the (0) meaning an error code of 0).

MSRs are virtualized or emulated by the Virtual Machine Monitor. Generally speaking, an MSR read or write will trap into the VMM/hypervisor so it can give the correct response (newer processors contain features such as MSR bitmaps for Intel's Virtual Machine eXtensions which specify the MSRs that do this). Therefore, if a hypervisor decides to not implement the proper response for an MSR interaction, it can be used as a way to detect it.

So, back to invalid MSR indices. Intel defines a range of MSRs which, unless subverted by a hypervisor, will always be invalid, no matter what CPU you're using (provided it's Intel). These are MSRs 40000000h - 400000FFh.
As said before, reading or writing to these MSRs will always cause a #GP(0). Unless, of course, you're running under VMware.

One day, while messing around with my hypervisor inside of VMware and reading random MSRs, well, let the image speak for itself:
I was able to read a normally invalid MSR! This means that one could detect VMware by simply doing:

For context, with my program, the normal output is:

Tuesday, 18 September 2018

"How does control register exiting work?"

For my first magic trick, I'll explain:
-What control register guest/host masks and read shadows are
-How to determine bits of CR0 and CR4 which are required to be a certain state during VMX operation, and how they relate to the Guest/Host masks and read shadows
-Handling CR access VM exits

CR0 and CR4 Guest/Host Masks and Read Shadows

First, an explanation from Volume 3 of the Intel Manual:
Put simply, the VMCS has a guest/host mask and read shadow for control registers CR0 and CR4. If a bit in the mask is 1, it is "host-owned", meaning that:
-Reads from the control register will return host-owned bits from their value in the read shadow (i.e. if bit 3 of cr0 is setup to be host owned, the value the guest is given will be from bit 3 of the cr0 read shadow; other bits, such as possibly bit 0, that aren't host owned will return their actual values).
-Writes that change host-owned bits from their value in the read shadow will cause a CR Access VM exit. Such a VM exit is fault-like, meaning that the instruction is not executed and the guest instruction pointer will point to the instruction that caused the VM exit.

Such a mechanism is needed because of bits the VMM may want to emulate the effect of or not want the guest to set; there are also bits which are not updated by the processor upon VMX transitions, which will be covered later. A typical VMM will make bits required by the architecture to be set to a certain value unable to be changed by the guest.

Determining Bits That Are Required By the Architecture

To determine required bits, the architecture defines 4 Model-Specific Registers. They are:
-IA32_VMX_CR0_FIXED0 (index 486h)
-IA32_VMX_CR0_FIXED1 (index 487h)
-IA32_VMX_CR4_FIXED0 (index 488h)
-IA32_VMX_CR4_FIXED1 (index 489h)

These MSRs are formatted in a way which may be seen as unorthodox. Indeed, if one looks for an explanation in the Intel Manual, they'll see a wall of text which takes ages to decrypt:
(repeat for cr4)
Which means, in English, that if a bit in both 'fixed' MSRs is set to the same value, that bit in the control register must be that value; if the bit in both MSRs is set to two different values, that bit is 'flexible' and can be any value.

From this, you can create a mask of bits which must have a certain value, as well as the values of those bits.

You should obviously replace 'crx' with cr0 and cr4
How does this work? To get flexible bits, both MSRs are XORed together, since bits that are two different values can be flexible (they have to be 0 and 1 which will always be 1 after an xor, while required bits will be 0 since 0 xor 0 and 1 xor 1 are both 1).

To get which bits are required, (which are going to be part of our host mask) a bitwise NOT is performed from the bits that are flexible.

To get the values of required bits, I get what both MSRs tell me, minus the flexible bits.
I can then write the required bits, along with the already existing bits of the control register (only including flexible bits) to the control register, allowing VMX operation.

If you're using a vm control called unrestricted guest, which allows the guest to run in non-paged protected mode and 16-bit real address mode, it's recommended that you remove bits 0 and 31 (corresponding to paging and protected mode enable bits) from the cr0 required bitmask and states as they may be set there, even though the vm control allows those bits to be unset.

The required bitmask and states can then be used with the CRx guest/host mask and read shadow, as well as the relevant VM exits.

You can now use these values when setting up the VMCS fields:
The values and purposes of "crx_host_mask_trap/no_set" are explained when handling CR access vm exits.

Handling Cr Access VM Exits

Note that this section does not cover exits caused by cr3 or cr8 reads or writes (as they aren't relevant for this, they aren't required, and they're very simple).

These VM exits have 3 parts: Getting the value of the control register that the guest wants to set, setting the read shadow/actual control register and injecting a #GP if needed, and handling how the guest sets the bits of each control register.

Getting the control register value

Getting the actual control register can be simple or a bit complicated. If the exit qualification indicates that the control register number is 4, it always means a MOV to CR4, and you can get the value the guest wants to set from the corresponding register.
CR0 has 3 ways: A MOV to CR0, an execution of CLTS, and the LMSW instruction. CLTS is really simple; get the value of the control register that the guest would read and remove bit 3, the 'task switched' bit. If LMSW causes a VM exit, you need to either read from the exit qualification or from memory. Since this is implementation specific (if the guest uses paging you need to translate one or more linear addresses), I won't go too in-depth.

You'd then need to determine if the setting is valid, and set the read shadow and control register.

Setting the read shadow and control register

Because of the way this site handles code formatting, I posted an example in a gist over at
By giving the function the value the guest wants to set, as well as bits which can't be set, which will be changed but have no effect, and bits which are changed and only exist to VM exit on, you can properly handle a CR access VM exit.

Remember when I said that some bits aren't changed on VMX transitions, and that CR access VM exits don't execute the instruction? Those bits are bits 30 (cache disable) and 29 (not write through). If a CR0 write changes these bits and some host owned bits, these bits won't actually change, thus meaning that you should explicitly handle them (you can either directly change them or emulate their changes by changing the EPT structures so that all entries are uncacheable and ignore the PAT memory type).

Handling different bits set by the guest

Helpfully, handle_cr_change will fill out a buffer that receives the value of every bit that is different. With these different bits, you can figure out if a bit was changed, and if it's relevant, perform the needed operation. For example, if you wanted to do something when the guest set cr4.smep, you'd set the 20th bit in the 'trap' host mask (or the 'dont care' host mask if you don't want it to change cpu operation) and handle the 20th bit in the different bit field being set.


Earlier, I made a case about needing to handle and cr0.nwt since they aren't changed by the architecture if a VM exit happens. There are also more bits which you need to handle that otherwise won't perform the correct operation. For CR0, you'd want to handle the guest setting (bit 31), since when entering long mode it otherwise won't set important structures. This issue actually affected me when trying to virtualize a system before the Windows kernel initialized, as the guest was also modifying a VMX required bit, which caused a VM-entry failure. To handle the changing of this bit, you should look at how to enter long mode, as well as structures the processor verifies when transitioning to the guest.
CR4 also has bits such as pge, pcide, and smep which flush the TLB, with PCIDE also being a reserved bit while operating in protected mode. It's also a good idea to verify that the pae bit isn't being unset if EFER.LMA is set, as that causes a VM entry failure.

While it is not required, it's also a good idea to reserve control register bits which don't exist at the time of writing the hypervisor.

A Common Missight in Most Hypervisors

Generally, when a hypervisor encounters a VM exit, it is because it needs to emulate the effects of an instruction, be it CPUID, RDMSR, or E...