Monday, 14 January 2019

A Common Oversight in Most Hypervisors

Generally, when a hypervisor encounters a VM exit, it is because it needs to emulate the effects of an instruction, be it CPUID, RDMSR, or ENCLS. Because of unclear wording in Intel's Software Developer's Manual, most authors assume that the only necessary action is to update the guest's registers or inject an exception. In many cases, however, that is not enough. Hidden all the way in section 32.2.1 of volume 3 is this sentence:
"If a VMM emulates a guest instruction that would encounter a debug trap (single step or data or I/O breakpoint), it should cause that trap to be delivered."

After reading the subsequent paragraph going on about injecting single-step debug exceptions, I created a new project in Visual Studio, confirming the suspicions in the back of my head:

The issue here is that, by not injecting a debug exception after the emulated CPUID instruction, the trap is essentially delayed until the NOP can recognize it.


This brings us to the main point, which is that the phrase "none of the events normally associated take place," albeit used to describe exiting for startup-IPIs, describes every fault-like VM exit.

Single-Step Debug Exceptions

It's not very hard to handle single-step debug exceptions. I don't know how to make flowcharts, so here's a simple step-by-step explanation:

    1. After incrementing the guest instruction pointer (meaning that this can only be done if the instruction has been emulated), check if the trap flag (bit 8) of the guest's RFLAGS is set, and that bit 1 of the guest's IA32_DEBUGCTL isn't set (more on that later).

    2. Inject a single-step debug exception through the guest pending debug exceptions VMCS field (rather than through traditional event injection, due to the way that the processor handles pending debug exceptions) by setting the BS bit (bit 14, "caused by single step") to 1.
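The two steps above can be sketched as a pair of pure functions. This is only a minimal sketch: the function and macro names are made up for illustration, and reading or writing the actual VMCS fields (as well as advancing the guest instruction pointer beforehand) is left to the hypervisor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as described in the steps above. Names are hypothetical. */
#define RFLAGS_TF       (1ULL << 8)   /* trap flag in the guest's RFLAGS */
#define DEBUGCTL_BTF    (1ULL << 1)   /* bit 1 of the guest's IA32_DEBUGCTL */
#define PENDING_DBG_BS  (1ULL << 14)  /* "caused by single step" */

/* Step 1: is a single-step trap due after the emulated instruction? */
static bool single_step_trap_due(uint64_t guest_rflags, uint64_t guest_debugctl)
{
    return (guest_rflags & RFLAGS_TF) != 0 &&
           (guest_debugctl & DEBUGCTL_BTF) == 0;
}

/* Step 2: compute the value to write back to the pending debug exceptions
 * field. Bits that are already set are preserved. */
static uint64_t pending_dbg_after_emulation(uint64_t guest_rflags,
                                            uint64_t guest_debugctl,
                                            uint64_t pending_dbg)
{
    if (single_step_trap_due(guest_rflags, guest_debugctl))
        pending_dbg |= PENDING_DBG_BS;
    return pending_dbg;
}
```

A VM-exit handler would call pending_dbg_after_emulation() only on the path where the instruction was actually emulated and the guest RIP was advanced.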

The format of the pending debug exceptions VMCS field. As said in the manual, even though the structure's format matches that of DR6, the RTM bit has an inverse effect (where if the bit in DR6 would be set, it would not be set in the pending debug exceptions field).
 
Of course, there are hoops one has to jump through. If the loading and saving of debug controls are enabled in the entry and exit controls respectively (which should be supported on every processor), the guest's IA32_DEBUGCTL will be stored in the VMCS field with the encoding 0x2802. Bit 1 of IA32_DEBUGCTL (BTF) has to be checked because, when it is set, single-step debug exceptions only happen on branch instructions, and thus should only be injected by a hypervisor if it is emulating a branch instruction. As a rule of thumb, a hypervisor should read the current value of the pending debug exceptions field, set the relevant bits, and then write the result back, as some bits may already have been set by the processor or by previous writes to the field.
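As a rough sketch of that read-modify-write rule, including the branch check, something like the following could work. The vmcs_ops interface and the fake_vmcs backing store are hypothetical stand-ins so the logic can run outside VMX root mode; a real hypervisor would wrap VMREAD and VMWRITE instead. The encoding 0x2802 comes from the text above; 0x6820 (guest RFLAGS) and 0x6822 (pending debug exceptions) are the encodings as I understand them from the manual's field-encoding appendix, so double-check them there.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VMCS_GUEST_IA32_DEBUGCTL  0x2802u
#define VMCS_GUEST_RFLAGS         0x6820u
#define VMCS_GUEST_PENDING_DBG    0x6822u

#define RFLAGS_TF       (1ULL << 8)
#define DEBUGCTL_BTF    (1ULL << 1)
#define PENDING_DBG_BS  (1ULL << 14)

/* Hypothetical accessor interface; not part of any real API. */
typedef struct vmcs_ops {
    uint64_t (*read)(void *ctx, uint32_t field);
    void     (*write)(void *ctx, uint32_t field, uint64_t value);
    void     *ctx;
} vmcs_ops;

/* Call after the instruction was emulated and the guest RIP advanced. */
static void deliver_single_step_if_due(const vmcs_ops *vmcs, bool emulated_branch)
{
    uint64_t rflags   = vmcs->read(vmcs->ctx, VMCS_GUEST_RFLAGS);
    uint64_t debugctl = vmcs->read(vmcs->ctx, VMCS_GUEST_IA32_DEBUGCTL);

    if (!(rflags & RFLAGS_TF))
        return;                       /* guest is not single-stepping */
    if ((debugctl & DEBUGCTL_BTF) && !emulated_branch)
        return;                       /* BTF set: only branches trap */

    /* Read-modify-write so bits that are already pending are kept. */
    uint64_t pending = vmcs->read(vmcs->ctx, VMCS_GUEST_PENDING_DBG);
    vmcs->write(vmcs->ctx, VMCS_GUEST_PENDING_DBG, pending | PENDING_DBG_BS);
}

/* In-memory stand-in for the three fields, for experimentation only. */
typedef struct fake_vmcs {
    uint64_t rflags, debugctl, pending_dbg;
} fake_vmcs;

static uint64_t *fake_slot(fake_vmcs *v, uint32_t field)
{
    switch (field) {
    case VMCS_GUEST_RFLAGS:        return &v->rflags;
    case VMCS_GUEST_IA32_DEBUGCTL: return &v->debugctl;
    case VMCS_GUEST_PENDING_DBG:   return &v->pending_dbg;
    default:                       return NULL;
    }
}

static uint64_t fake_read(void *ctx, uint32_t field)
{
    uint64_t *slot = fake_slot((fake_vmcs *)ctx, field);
    return slot ? *slot : 0;
}

static void fake_write(void *ctx, uint32_t field, uint64_t value)
{
    uint64_t *slot = fake_slot((fake_vmcs *)ctx, field);
    if (slot)
        *slot = value;
}
```

Passing the accessors in as function pointers is just a convenience for this sketch: it keeps the decision logic separate from the privileged instructions, which makes it easy to exercise against the fake store.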

But Wait, There's More

This isn't perfect, as shown here:

 
In this GIF, the effects of the MOV SS blocking of debug exceptions are extended to the NOP instruction, when they should have ended after the CPUID instruction.
 
This happens because the hypervisor also needs to clear bits from the guest interruptibility state in the VMCS, specifically the blocking by MOV SS, blocking by STI, and blocking by SMI bits. These bits are similar in that they are only meant to last for the duration of the instruction after the one that caused them to be set; when emulating an instruction, the hypervisor must clear them just as the CPU would during normal operation.

The bit format of the guest interruptibility state VMCS field. The blocking by NMI and enclave interruption bits aren't touched, as those are generally set and cleared by the processor and are only relevant for specific VM exits.

Similar to the checks done on the trap flag, these bits should only be cleared if the instruction has been emulated, leading to the correct behaviour.
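A minimal sketch of that clearing step, written as a pure function over the 32-bit interruptibility state value. The names are hypothetical, and the bit positions (STI = 0, MOV SS = 1, SMI = 2, NMI = 3, enclave interruption = 4) reflect my reading of the field's layout in the manual; verify them before relying on this.

```c
#include <stdint.h>

/* Guest interruptibility state bits. Names are hypothetical. */
#define INTR_BLOCKING_BY_STI     (1u << 0)
#define INTR_BLOCKING_BY_MOV_SS  (1u << 1)
#define INTR_BLOCKING_BY_SMI     (1u << 2)
#define INTR_BLOCKING_BY_NMI     (1u << 3)  /* left untouched */
#define INTR_ENCLAVE_INTERRUPT   (1u << 4)  /* left untouched */

/* Returns the value to write back after an instruction was emulated:
 * the three bits named in the text are cleared, everything else is kept. */
static uint32_t interruptibility_after_emulation(uint32_t state)
{
    return state & ~(INTR_BLOCKING_BY_STI |
                     INTR_BLOCKING_BY_MOV_SS |
                     INTR_BLOCKING_BY_SMI);
}
```

As with the trap flag check, this belongs on the path where the instruction was emulated and the guest instruction pointer was advanced, not on exits that simply resume the guest at the same instruction.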

Everything now works as it should.

Closing Notes

Along with the fields mentioned above, I've found that the following should be taken into consideration:

Preserving CR2; depending on the design of the hypervisor, a page fault could be possible.

Preserving SSE register state, including every XMM register and MXCSR, as well as setting up MXCSR so that #XM faults cannot happen, by setting it to a value of 0x1F80. Many compilers, such as MSVC, will optimize certain portions of code to use the SSE instruction set due to its speed.
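To see why 0x1F80 does the job, it helps to decode it: bits 7 through 12 of MXCSR are the six exception mask bits, and 0x1F80 sets exactly those and nothing else. The small check below is only an illustration with made-up names; actually saving, loading, and restoring MXCSR and the XMM registers around the exit handler is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* MXCSR layout used here: bits 0-5 are sticky exception flags,
 * bits 7-12 are the corresponding exception masks. */
#define MXCSR_EXCEPTION_FLAGS  0x003Fu
#define MXCSR_EXCEPTION_MASKS  0x1F80u
#define MXCSR_HOST_VALUE       0x1F80u  /* value suggested in the text */

/* True if at least one SIMD floating-point exception is unmasked,
 * i.e. host code running with this MXCSR could raise #XM. */
static bool mxcsr_can_raise_xm(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_EXCEPTION_MASKS) != MXCSR_EXCEPTION_MASKS;
}
```

A guest-controlled MXCSR with any mask bit cleared would make mxcsr_can_raise_xm() return true, which is the situation the host value is meant to avoid.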

If the hypervisor interacts with guest memory or takes VM exits on I/O port accesses, debug exceptions (as well as alignment-check exceptions) should be delivered for both, with the debug exceptions going through the aforementioned pending debug exceptions VMCS field. Do note that encountering a debug exception will not stop the instruction from executing, and that I/O port debug exceptions are checked using a sign-extended I/O port address (the last point may be wrong - make sure to consult the manual).

VM exits encountered while the processor is injecting an event into the guest will cause it to fill out the IDT vectoring information field, which contains the event that was being injected. Ignoring this field could cause the event to be lost in the infinite void. Note that this can be filled out even if the hypervisor did not inject the event. More information can be found in sections 24.9.2, 27.2, and 31.7.1.2 of volume 3.
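One common way to avoid losing the event is to feed it back into the VM-entry interruption-information field on the next entry. The sketch below shows only the bit shuffling, under my understanding of the two formats (vector in bits 7:0, type in bits 10:8, error code valid in bit 11, valid in bit 31). The struct and function names are hypothetical, and cases where the hypervisor wants to inject a different event at the same time are not handled.

```c
#include <stdbool.h>
#include <stdint.h>

#define EVENT_VECTOR_MASK  0x000000FFu
#define EVENT_TYPE_SHIFT   8
#define EVENT_TYPE_MASK    (7u << EVENT_TYPE_SHIFT)
#define EVENT_ERROR_CODE   (1u << 11)
#define EVENT_VALID        (1u << 31)

/* Event types that need an instruction length on re-injection. */
#define EVENT_TYPE_SW_INTERRUPT       4u
#define EVENT_TYPE_PRIV_SW_EXCEPTION  5u
#define EVENT_TYPE_SW_EXCEPTION       6u

/* Values the hypervisor would write to the VM-entry fields. */
typedef struct reinjection {
    uint32_t entry_intr_info;   /* 0 means nothing to re-inject */
    uint32_t entry_error_code;
    uint32_t entry_instr_len;
    bool     write_error_code;
    bool     write_instr_len;
} reinjection;

static reinjection reinject_from_idt_vectoring(uint32_t idt_info,
                                               uint32_t idt_error_code,
                                               uint32_t exit_instr_len)
{
    reinjection r = {0};

    if (!(idt_info & EVENT_VALID))
        return r;               /* no event was being delivered */

    /* Keep only the bits that are defined for VM entry. */
    r.entry_intr_info = idt_info & (EVENT_VALID | EVENT_ERROR_CODE |
                                    EVENT_TYPE_MASK | EVENT_VECTOR_MASK);

    if (idt_info & EVENT_ERROR_CODE) {
        r.write_error_code = true;
        r.entry_error_code = idt_error_code;
    }

    uint32_t type = (idt_info & EVENT_TYPE_MASK) >> EVENT_TYPE_SHIFT;
    if (type == EVENT_TYPE_SW_INTERRUPT ||
        type == EVENT_TYPE_PRIV_SW_EXCEPTION ||
        type == EVENT_TYPE_SW_EXCEPTION) {
        r.write_instr_len = true;
        r.entry_instr_len = exit_instr_len;
    }
    return r;
}
```

The caller would read the IDT vectoring information field, its error code, and the VM-exit instruction length on every exit, and apply the result before resuming the guest.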



Of course, not everything is covered here, such as the handling of the trap flag if a write to IA32_DEBUGCTL causes a VM exit (which is covered in 32.2.1). Always consult the Intel manual before listening to a stranger on the internet.
