Tuesday, 18 September 2018

"How does control register exiting work?"

For my first magic trick, I'll explain:
-What control register guest/host masks and read shadows are
-How to determine bits of CR0 and CR4 which are required to be a certain state during VMX operation, and how they relate to the Guest/Host masks and read shadows
-Handling CR access VM exits

CR0 and CR4 Guest/Host Masks and Read Shadows


First, an explanation from Volume 3 of the Intel Manual:
Put simply, the VMCS has a guest/host mask and read shadow for control registers CR0 and CR4. If a bit in the mask is 1, it is "host-owned", meaning that:
-Reads from the control register will return host-owned bits from their value in the read shadow (i.e. if bit 3 of cr0 is setup to be host owned, the value the guest is given will be from bit 3 of the cr0 read shadow; other bits, such as possibly bit 0, that aren't host owned will return their actual values).
-Writes that change host-owned bits from their value in the read shadow will cause a CR Access VM exit. Such a VM exit is fault-like, meaning that the instruction is not executed and the guest instruction pointer will point to the instruction that caused the VM exit.

Such a mechanism is needed because of bits the VMM may want to emulate the effect of or not want the guest to set; there are also bits which are not updated by the processor upon VMX transitions, which will be covered later. A typical VMM will make bits required by the architecture to be set to a certain value unable to be changed by the guest.

Determining Bits That Are Required By the Architecture


To determine required bits, the architecture defines 4 Model-Specific Registers. They are:
-IA32_VMX_CR0_FIXED0 (index 486h)
-IA32_VMX_CR0_FIXED1 (index 487h)
-IA32_VMX_CR4_FIXED0 (index 488h)
-IA32_VMX_CR4_FIXED1 (index 489h)

These MSRs are formatted in a way which may be seen as unorthodox. Indeed, if one looks for an explanation in the Intel Manual, they'll see a wall of text which takes ages to decrypt:
(repeat for cr4)
Which means, in English, that if a bit in both 'fixed' MSRs is set to the same value, that bit in the control register must be that value; if the bit in both MSRs is set to two different values, that bit is 'flexible' and can be any value.

From this, you can create a mask of bits which must have a certain value, as well as the values of those bits.

You should obviously replace 'crx' with cr0 and cr4
How does this work? To get flexible bits, both MSRs are XORed together, since bits that are two different values can be flexible (they have to be 0 and 1 which will always be 1 after an xor, while required bits will be 0 since 0 xor 0 and 1 xor 1 are both 1).

To get which bits are required, (which are going to be part of our host mask) a bitwise NOT is performed from the bits that are flexible.

To get the values of required bits, I get what both MSRs tell me, minus the flexible bits.
I can then write the required bits, along with the already existing bits of the control register (only including flexible bits) to the control register, allowing VMX operation.

If you're using a vm control called unrestricted guest, which allows the guest to run in non-paged protected mode and 16-bit real address mode, it's recommended that you remove bits 0 and 31 (corresponding to paging and protected mode enable bits) from the cr0 required bitmask and states as they may be set there, even though the vm control allows those bits to be unset.

The required bitmask and states can then be used with the CRx guest/host mask and read shadow, as well as the relevant VM exits.

You can now use these values when setting up the VMCS fields:
The values and purposes of "crx_host_mask_trap/no_set" are explained when handling CR access vm exits.

Handling Cr Access VM Exits


Note that this section does not cover exits caused by cr3 or cr8 reads or writes (as they aren't relevant for this, they aren't required, and they're very simple).

These VM exits have 3 parts: Getting the value of the control register that the guest wants to set, setting the read shadow/actual control register and injecting a #GP if needed, and handling how the guest sets the bits of each control register.

Getting the control register value


Getting the actual control register can be simple or a bit complicated. If the exit qualification indicates that the control register number is 4, it always means a MOV to CR4, and you can get the value the guest wants to set from the corresponding register.
CR0 has 3 ways: A MOV to CR0, an execution of CLTS, and the LMSW instruction. CLTS is really simple; get the value of the control register that the guest would read and remove bit 3, the 'task switched' bit. If LMSW causes a VM exit, you need to either read from the exit qualification or from memory. Since this is implementation specific (if the guest uses paging you need to translate one or more linear addresses), I won't go too in-depth.

You'd then need to determine if the setting is valid, and set the read shadow and control register.

Setting the read shadow and control register

Because of the way this site handles code formatting, I posted an example in a gist over at https://gist.github.com/drew-gpf/7c2d4c0f03aa3c06ed9c94ec435355c1
By giving the function the value the guest wants to set, as well as bits which can't be set, which will be changed but have no effect, and bits which are changed and only exist to VM exit on, you can properly handle a CR access VM exit.

Remember when I said that some bits aren't changed on VMX transitions, and that CR access VM exits don't execute the instruction? Those bits are bits 30 (cache disable) and 29 (not write through). If a CR0 write changes these bits and some host owned bits, these bits won't actually change, thus meaning that you should explicitly handle them (you can either directly change them or emulate their changes by changing the EPT structures so that all entries are uncacheable and ignore the PAT memory type).

Handling different bits set by the guest

Helpfully, handle_cr_change will fill out a buffer that receives the value of every bit that is different. With these different bits, you can figure out if a bit was changed, and if it's relevant, perform the needed operation. For example, if you wanted to do something when the guest set cr4.smep, you'd set the 20th bit in the 'trap' host mask (or the 'dont care' host mask if you don't want it to change cpu operation) and handle the 20th bit in the different bit field being set.

Caveats

Earlier, I made a case about needing to handle cr0.cd and cr0.nwt since they aren't changed by the architecture if a VM exit happens. There are also more bits which you need to handle that otherwise won't perform the correct operation. For CR0, you'd want to handle the guest setting cr0.pg (bit 31), since when entering long mode it otherwise won't set important structures. This issue actually affected me when trying to virtualize a system before the Windows kernel initialized, as the guest was also modifying a VMX required bit, which caused a VM-entry failure. To handle the changing of this bit, you should look at how to enter long mode, as well as structures the processor verifies when transitioning to the guest.
CR4 also has bits such as pge, pcide, and smep which flush the TLB, with PCIDE also being a reserved bit while operating in protected mode. It's also a good idea to verify that the pae bit isn't being unset if EFER.LMA is set, as that causes a VM entry failure.

While it is not required, it's also a good idea to reserve control register bits which don't exist at the time of writing the hypervisor.

No comments:

Post a Comment

A Common Missight in Most Hypervisors

Generally, when a hypervisor encounters a VM exit, it is because it needs to emulate the effects of an instruction, be it CPUID, RDMSR, or E...