aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/kvm/vmx.c
AgeCommit message (Collapse)Author
2010-05-13KVM: VMX: blocked-by-sti must not defer NMI injectionsJan Kiszka
As the processor may not consider GUEST_INTR_STATE_STI as a reason for blocking NMI, it could return immediately with EXIT_REASON_NMI_WINDOW when we asked for it. But as we consider this state as NMI-blocking, we can run into an endless loop. Resolve this by allowing NMI injection if just GUEST_INTR_STATE_STI is active (originally suggested by Gleb). Intel confirmed that this is safe, the processor will never complain about NMI injection in this state. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> KVM-Stable-Tag Acked-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-04-20KVM: VMX: Save/restore rflags.vm correctly in real modeAvi Kivity
Currently we set eflags.vm unconditionally when entering real mode emulation through virtual-8086 mode, and clear it unconditionally when we enter protected mode. The means that the following sequence KVM_SET_REGS (rflags.vm=1) KVM_SET_SREGS (cr0.pe=1) Ends up with rflags.vm clear due to KVM_SET_SREGS triggering enter_pmode(). Fix by shadowing rflags.vm (and rflags.iopl) correctly while in real mode: reads and writes to those bits access a shadow register instead of the actual register. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-01KVM: VMX: Update instruction length on intercepted BPJan Kiszka
We intercept #BP while in guest debugging mode. As VM exits due to intercepted exceptions do not necessarily come with valid idt_vectoring, we have to update event_exit_inst_len explicitly in such cases. At least in the absence of migration, this ensures that re-injections of #BP will find and use the correct instruction length. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: stable@kernel.org (2.6.32, 2.6.33) Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: Rename VMX_EPT_IGMT_BIT to VMX_EPT_IPAT_BITSheng Yang
Following the new SDM. Now the bit is named "Ignore PAT memory type". Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: Remove redundant test in vmx_set_efer()Julia Lawall
msr was tested above, so the second test is not needed. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r@ expression *x; expression e; identifier l; @@ if (x == NULL || ...) { ... when forall return ...; } ... when != goto l; when != x = e when != &x *x == NULL // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: Wire up .fpu_activate() callbackAvi Kivity
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: Remove redundant check in vm_need_virtualize_apic_accesses()Gui Jianfeng
flexpriority_enabled implies cpu_has_vmx_virtualize_apic_accesses() returning true, so we don't need this check here. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: Trace failed msr reads and writesAvi Kivity
Record failed msrs reads and writes, and the fact that they failed as well. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: VMX: Pass cr0.mp through to the guest when the fpu is activeAvi Kivity
When cr0.mp is clear, the guest doesn't expect a #NM in response to a WAIT instruction. Because we always keep cr0.mp set, it will get a #NM, and potentially be confused. Fix by keeping cr0.mp set only when the fpu is inactive, and passing it through when inactive. Reported-by: Lorenzo Martignoni <martignlo@gmail.com> Analyzed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: Rename vcpu->shadow_efer to eferAvi Kivity
None of the other registers have the shadow_ prefix. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: Add a helper for checking if the guest is in protected modeAvi Kivity
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: Activate fpu on cltsAvi Kivity
Assume that if the guest executes clts, it knows what it's doing, and load the guest fpu to prevent an #NM exception. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: VMX: Clean up DR6 emulationJan Kiszka
As we trap all debug register accesses, we do not need to switch real DR6 at all. Clean up update_exception_bitmap at this chance, too. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: VMX: Fix emulation of DR4 and DR5Jan Kiszka
Make sure DR4 and DR5 are aliased to DR6 and DR7, respectively, if CR4.DE is not set. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: VMX: Fix exceptions of mov to drJan Kiszka
Injecting GP without an error code is a bad idea (causes unhandled guest exits). Moreover, we must not skip the instruction if we injected an exception. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: VMX: Remove emulation failure reportSheng Yang
As Avi noted: >There are two problems with the kernel failure report. First, it >doesn't report enough data - registers, surrounding instructions, etc. >that are needed to explain what is going on. Second, it can flood >dmesg, which is a pretty bad thing to do. So we remove the emulation failure report in handle_invalid_guest_state(), and would inspected the guest using userspace tool in the future. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: VMX: Give the guest ownership of cr0.ts when the fpu is activeAvi Kivity
If the guest fpu is loaded, there is nothing interesing about cr0.ts; let the guest play with it as it will. This makes context switches between fpu intensive guest processes faster, as we won't trap the clts and cr0 write instructions. [marcelo: fix cr0 read shadow update on fpu deactivation; kills F8 install] Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: Lazify fpu activation and deactivationAvi Kivity
Defer fpu deactivation as much as possible - if the guest fpu is loaded, keep it loaded until the next heavyweight exit (where we are forced to unload it). This reduces unnecessary exits. We also defer fpu activation on clts; while clts signals the intent to use the fpu, we can't be sure the guest will actually use it. Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: Allow the guest to own some cr0 bitsAvi Kivity
We will use this later to give the guest ownership of cr0.ts. Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: Replace read accesses of vcpu->arch.cr0 by an accessorAvi Kivity
Since we'd like to allow the guest to own a few bits of cr0 at times, we need to know when we access those bits. Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: trace clts and lmsw instructions as cr accessesAvi Kivity
clts writes cr0.ts; lmsw writes cr0[0:15] - record that in ftrace. Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: Enable EPT 1GB page supportSheng Yang
Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: x86: Rename gb_page_enable() to get_lpage_level() in kvm_x86_opsSheng Yang
Then the callback can provide the maximum supported large page level, which is more flexible. Also move the gb page support into x86_64 specific. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: Fill out ftrace exit reason stringsAvi Kivity
Some exit reasons missed their strings; fill out the table. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: convert slots_lock to a mutexMarcelo Tosatti
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: switch vcpu context to use SRCUMarcelo Tosatti
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: introduce kvm->srcu and convert kvm_set_memory_region to SRCU updateMarcelo Tosatti
Use two steps for memslot deletion: mark the slot invalid (which stops instantiation of new shadow pages for that slot, but allows destruction), then instantiate the new empty slot. Also simplifies kvm_handle_hva locking. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: modify memslots layout in struct kvmMarcelo Tosatti
Have a pointer to an allocated region inside struct kvm. [alex: fix ppc book 3s] Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01KVM: VMX: Add instruction rdtscp support for guestSheng Yang
Before enabling, execution of "rdtscp" in guest would result in #UD. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: Add cpuid_update() callback to kvm_x86_opsSheng Yang
Sometime, we need to adjust some state in order to reflect guest CPUID setting, e.g. if we don't expose rdtscp to guest, we won't want to enable it on hardware. cpuid_update() is introduced for this purpose. Also export kvm_find_cpuid_entry() for later use. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: Remove redundant variableSheng Yang
It's no longer necessary. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: Fold ept_update_paging_mode_cr4() into its callerAvi Kivity
ept_update_paging_mode_cr4() accesses vcpu->arch.cr4 directly, which usually needs to be accessed via kvm_read_cr4(). In this case, we can't, since cr4 is in the process of being updated. Instead of adding inane comments, fold the function into its caller (vmx_set_cr4), so it can use the not-yet-committed cr4 directly. Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: When using ept, allow the guest to own cr4.pgeAvi Kivity
We make no use of cr4.pge if ept is enabled, but the guest does (to flush global mappings, as with vmap()), so give the guest ownership of this bit. Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: Make guest cr4 mask more conservativeAvi Kivity
Instead of specifying the bits which we want to trap on, specify the bits which we allow the guest to change transparently. This is safer wrt future changes to cr4. Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: Add accessor for reading cr4 (or some bits of cr4)Avi Kivity
Some bits of cr4 can be owned by the guest on vmx, so when we read them, we copy them to the vcpu structure. In preparation for making the set of guest-owned bits dynamic, use helpers to access these bits so we don't need to know where the bit resides. No changes to svm since all bits are host-owned there. Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: Move some cr[04] related constants to vmx.cAvi Kivity
They have no place in common code. Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01KVM: VMX: Trap and invalid MWAIT/MONITOR instructionSheng Yang
We don't support these instructions, but guest can execute them even if the feature('monitor') haven't been exposed in CPUID. So we would trap and inject a #UD if guest try this way. Cc: stable@kernel.org Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03KVM: VMX: Fix comparison of guest efer with stale host valueAvi Kivity
update_transition_efer() masks out some efer bits when deciding whether to switch the msr during guest entry; for example, NX is emulated using the mmu so we don't need to disable it, and LMA/LME are handled by the hardware. However, with shared msrs, the comparison is made against a stale value; at the time of the guest switch we may be running with another guest's efer. Fix by deferring the mask/compare to the actual point of guest entry. Noted by Marcelo. Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03KVM: VMX: Disable unrestricted guest when EPT disabledSheng Yang
Otherwise would cause VMEntry failure when using ept=0 on unrestricted guest supported processors. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03KVM: x86: Add KVM_GET/SET_VCPU_EVENTSJan Kiszka
This new IOCTL exports all yet user-invisible states related to exceptions, interrupts, and NMIs. Together with appropriate user space changes, this fixes sporadic problems of vmsave/restore, live migration and system reset. [avi: future-proof abi by adding a flags field] Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03KVM: VMX: Report unexpected simultaneous exceptions as internal errorsAvi Kivity
These happen when we trap an exception when another exception is being delivered; we only expect these with MCEs and page faults. If something unexpected happens, things probably went south and we're better off reporting an internal error and freezing. Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03KVM: Allow internal errors reported to userspace to carry extra dataAvi Kivity
Usually userspace will freeze the guest so we can inspect it, but some internal state is not available. Add extra data to internal error reporting so we can expose it to the debugger. Extra data is specific to the suberror. Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03KVM: VMX: Remove vmx->msr_offset_eferAvi Kivity
This variable is used to communicate between a caller and a callee; switch to a function argument instead. Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03KVM: VMX: move CR3/PDPTR update to vmx_set_cr3Marcelo Tosatti
GUEST_CR3 is updated via kvm_set_cr3 whenever CR3 is modified from outside guest context. Similarly pdptrs are updated via load_pdptrs. Let kvm_set_cr3 perform the update, removing it from the vcpu_run fast path. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Acked-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03KVM: VMX: Use shared msr infrastructureAvi Kivity
Instead of reloading syscall MSRs on every preemption, use the new shared msr infrastructure to reload them at the last possible minute (just before exit to userspace). Improves vcpu/idle/vcpu switches by about 2000 cycles (when EFER needs to be reloaded as well). [jan: fix slot index missing indirection] Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03KVM: VMX: Move MSR_KERNEL_GS_BASE out of the vmx autoload msr areaAvi Kivity
Currently MSR_KERNEL_GS_BASE is saved and restored as part of the guest/host msr reloading. Since we wish to lazy-restore all the other msrs, save and reload MSR_KERNEL_GS_BASE explicitly instead of using the common code. Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03KVM: VMX: Use macros instead of hex value on cr0 initializationEduardo Habkost
This should have no effect, it is just to make the code clearer. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03KVM: VMX: fix handle_pause declarationMarcelo Tosatti
There's no kvm_run argument anymore. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-12-03KVM: VMX: Add support for Pause-Loop ExitingZhai, Edwin
New NHM processors will support Pause-Loop Exiting by adding 2 VM-execution control fields: PLE_Gap - upper bound on the amount of time between two successive executions of PAUSE in a loop. PLE_Window - upper bound on the amount of time a guest is allowed to execute in a PAUSE loop If the time, between this execution of PAUSE and previous one, exceeds the PLE_Gap, processor consider this PAUSE belongs to a new loop. Otherwise, processor determins the the total execution time of this loop(since 1st PAUSE in this loop), and triggers a VM exit if total time exceeds the PLE_Window. * Refer SDM volume 3b section 21.6.13 & 22.1.3. Pause-Loop Exiting can be used to detect Lock-Holder Preemption, where one VP is sched-out after hold a spinlock, then other VPs for same lock are sched-in to waste the CPU time. Our tests indicate that most spinlocks are held for less than 212 cycles. Performance tests show that with 2X LP over-commitment we can get +2% perf improvement for kernel build(Even more perf gain with more LPs). Signed-off-by: Zhai Edwin <edwin.zhai@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>