aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2008-08-14x86: hpet: workaround SB700 BIOSThomas Gleixner
AMD SB700 based systems with spread spectrum enabled use a SMM based HPET emulation to provide proper frequency setting. The SMM code is initialized with the first HPET register access and takes some time to complete. During this time the config register reads 0xffffffff. We check for max. 1000 loops whether the config register reads a non 0xffffffff value to make sure that HPET is up and running before we go further. A counting loop is safe, as the HPET access takes thousands of CPU cycles. On non SB700 based machines this check is only done once and has no side effects. Based on a quirk patch from: crane cai <crane.cai@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-08-14x86: check bigsmp in smp_sanity_check instead of cpu_upYinghai Lu
clear bits for cpu nr > 8. This allows us to boot the full range of possible CPUs that the supported APIC model will allow. Previously we'd hang or boot up with less than 8 CPUs. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Tested-by: Jeff Chua <jeff.chua.linux@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-14x86: don't call e820_regiter_active_regions if out of range on nodeYinghai Lu
so we don't get warning on 32bit system with 64g RAM or more Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-14x86: resurrect proper handling of maxcpus= kernel option (v2)Max Krasnyansky
For some reason we had two parsers registered for maxcpus=. One in init/main.c and another in arch/x86/smpboot.c. So I nuked the one in arch/x86. Also 64-bit kernels used to handle maxcpus= as documented in Documentation/cpu-hotplug.txt. CPUs with 'id > maxcpus' are initialized but not booted. 32-bit version for some reason ignored them even though all the infrastructure for booting them later is there. In the current mainline both 64 and 32 bit versions are broken. This patch restores the correct behaviour. I've tested x86_64 version on 4- and 8- way Core2 and 2-way Opteron based machines. Various config combinations SMP, !SMP, CPU_HOTPLUG, !CPU_HOTPLUG. Booted with maxcpus=1 and maxcpus=4, etc. Everything is working as expected. So far we've received two reports from different people confirming that 32-bit version also works fine, both on dual core laptops and 16way server machines. [v2: This version fixes visws breakage pointed out by Ingo.] Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Cc: lizf@cn.fujitsu.com Cc: jeff.chua.linux@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-14Merge branch 'x86/fpu' into x86/urgentIngo Molnar
2008-08-14x86: cleanup for setup code crashes during IST probeH. Peter Anvin
Clean up the code for crashes during SpeedStep probing on older machines. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-13x86: use WARN() in arch/x86/mm/pageattr.cArjan van de Ven
Use WARN() instead of a printk+WARN_ON() pair; this way the message becomes part of the warning section for better reporting/collection. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: akpm@linux-foundation.org Cc: arjan@linux.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-13x86: allow MMCONFIG above 4GB on x86_64John Keller
SGI UV will have MMCFG base addresses that are greater than 4GB (32 bits). v2: Use CONFIG_RESOURCES_64BIT instead of CONFIG_X86_64. v3: Create a flag, that is set by platform specific code, to disable the > 4GB check. Signed-off-by: John Keller <jpk@sgi.com> Cc: jpk@sgi.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-13x86: fix 2 section mismatch warnings - find_and_reserve_crashkernelMarcin Slusarz
WARNING: vmlinux.o(.text+0xcd1f): Section mismatch in reference from the function find_and_reserve_crashkernel() to the function .init.text:find_e820_area() The function find_and_reserve_crashkernel() references the function __init find_e820_area(). This is often because find_and_reserve_crashkernel lacks a __init annotation or the annotation of find_e820_area is wrong. WARNING: vmlinux.o(.text+0xcd38): Section mismatch in reference from the function find_and_reserve_crashkernel() to the function .init.text:reserve_bootmem_generic() The function find_and_reserve_crashkernel() references the function __init reserve_bootmem_generic(). This is often because find_and_reserve_crashkernel lacks a __init annotation or the annotation of reserve_bootmem_generic is wrong. find_and_reserve_crashkernel is called from __init function (reserve_crashkernel) and calls 2 __init functions (find_e820_area, reserve_bootmem_generic), so mark it __init Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-13x86: fix 2 section mismatch warnings - map_high()Marcin Slusarz
WARNING: vmlinux.o(.text+0x14cf8): Section mismatch in reference from the function map_high() to the function .init.text:init_extra_mapping_uc() The function map_high() references the function __init init_extra_mapping_uc(). This is often because map_high lacks a __init annotation or the annotation of init_extra_mapping_uc is wrong. WARNING: vmlinux.o(.text+0x14d05): Section mismatch in reference from the function map_high() to the function .init.text:init_extra_mapping_wb() The function map_high() references the function __init init_extra_mapping_wb(). This is often because map_high lacks a __init annotation or the annotation of init_extra_mapping_wb is wrong. map_high is called only from __init functions (map_*_high) and calls 2 __init_functions (init_extra_mapping_*) Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-13Merge commit 'v2.6.27-rc3' into x86/urgentIngo Molnar
2008-08-13x86: fix setup code crashes on my old 486 boxJoerg Roedel
yesterday I tried to reactivate my old 486 box and wanted to install a current Linux with latest kernel on it. But it turned out that the latest kernel does not boot because the machine crashes early in the setup code. After some debugging it turned out that the problem is the query_ist() function. If this interrupt with that function is called the machine simply locks up. It looks like a BIOS bug. Looking for a workaround for this problem I wrote the attached patch. It checks for the CPUID instruction and if it is not implemented it does not call the speedstep BIOS function. As far as I know speedstep should be available since some Pentium earliest. Alan Cox observed that it's available since the Pentium II, so cpuid levels 4 and 5 can be excluded altogether. H. Peter Anvin cleaned up the code some more: > Right in concept, but I dislike the implementation (duplication of the > CPU detect code we already have). Could you try this patch and see if > it works for you? which, with a small modification to fix a build error with it the resulting kernel boots on my machine. Signed-off-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: "H. Peter Anvin" <hpa@zytor.com> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linusLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: fix spinlock recursion in hvc_console stop_machine: remove unused variable modules: extend initcall_debug functionality to the module loader export virtio_rng.h lguest: use get_user_pages_fast() instead of get_user_pages() mm: Make generic weak get_user_pages_fast and EXPORT_GPL it lguest: don't set MAC address for guest unless specified
2008-08-12mm: Make generic weak get_user_pages_fast and EXPORT_GPL itRusty Russell
Out of line get_user_pages_fast fallback implementation, make it a weak symbol, get rid of CONFIG_HAVE_GET_USER_PAGES_FAST. Export the symbol to modules so lguest can use it. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-08-11Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: fix 2.6.27rc1 cannot boot more than 8CPUs x86: make "apic" an early_param() on 32-bit, NULL check EFI, x86: fix function prototype x86, pci-calgary: fix function declaration x86: work around gcc 3.4.x bug x86: make "apic" an early_param() on 32-bit x86, debug: tone down arch/x86/kernel/mpparse.c debugging printk x86_64: restore the proper NR_IRQS define so larger systems work. x86: Restore proper vector locking during cpu hotplug x86: Fix broken VMI in 2.6.27-rc.. x86: fdiv bug detection fix
2008-08-11x86: fix 2.6.27rc1 cannot boot more than 8CPUsYinghai Lu
Jeff Chua reported that booting a !bigsmp kernel on a 16-way box hangs silently. this is a long-standing issue, smp start AP cpu could check the apic id >=8 etc before trying to start it. achieve this by moving the def_to_bigsmp check later and skip the apicid id > 8 [ mingo@elte.hu: clean up the message that is printed. ] Reported-by: "Jeff Chua" <jeff.chua.linux@gmail.com> Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> arch/x86/kernel/setup.c | 6 ------ arch/x86/kernel/smpboot.c | 10 ++++++++++ 2 files changed, 10 insertions(+), 6 deletions(-)
2008-08-11x86: make "apic" an early_param() on 32-bit, NULL checkRene Herman
Cyrill Gorcunov observed: > you turned it into early_param so now it's NULL injecting vulnerabled. > Could you please add checking for NULL str param? fix that. Also, change the name of 'str' into 'arg', to make it more apparent that this is an optional argument that can be NULL, not a string parameter that is empty when unset. Reported-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Rene Herman <rene.herman@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11x86, pci-calgary: fix function declarationRandy Dunlap
Fix function declaration: linux-next-20080807/arch/x86/kernel/pci-calgary_64.c:1353:36: warning: non-ANSI function declaration of function 'get_tce_space_from_tar' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Acked-by: Muli Ben-Yehuda <muli@il.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11x86: work around gcc 3.4.x bugJeremy Fitzhardinge
Simon Horman reported that gcc-3.4.x crashes when compiling pgd_prepopulate_pmd() when PREALLOCATED_PMDS == 0 and CONFIG_DEBUG_INFO is enabled. Adding an extra check for PREALLOCATED_PMDS == 0 [which is compiled out by gcc] seems to avoid the problem. Reported-by: Simon Horman <horms@verge.net.au> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11x86: make "apic" an early_param() on 32-bitRene Herman
On 32-bit, "apic" is a __setup() param meaning it is parsed rather late in the game. Make it an early_param() for apic_printk() use by arch/x86/kernel/mpparse.c. On 64-bit, it already is an early_param(). Signed-off-by: Rene Herman <rene.herman@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11x86, debug: tone down arch/x86/kernel/mpparse.c debugging printkRene Herman
commit 11a62a056093a7f25f1595fbd8bd5f93559572b6 turns some formerly nopped debugging printks in arch/x86/kernel/mppparse.c into regular ones. The one at the top of smp_scan_config() in particular also prints on !CONFIG_SMP/CONFIG_X86_LOCAL_APIC kernels and UP machines without anything resembling MP tables which makes their lowly UP owners wonder... Turn the former Dprintk()s into apic_printk()s instead meaning that their printing is dependent on passing the apic=verbose (or =debug) command line param. On 32-bit, "apic" is a __setup() param which isn't early enough for this code and therefore needs a followup changing it into an early_param(). On 64-bit, it already is. Signed-off-by: Rene Herman <rene.herman@gmail.com> Cc: Andrew Morton <akpm@osdl.org> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11x86: Restore proper vector locking during cpu hotplugEric W. Biederman
Having cpu_online_map change during assign_irq_vector can result in some really nasty and weird things happening. The one that bit me last time was accessing non existent per cpu memory for non existent cpus. This locking was removed in a sloppy x86_64 and x86_32 merge patch. Guys can we please try and avoid subtly breaking x86 when we are merging files together? Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2008-08-08x86: Fix broken VMI in 2.6.27-rc..Alok Kataria
The lowmem mapping table created by VMI need not depend on max_low_pfn at all. Instead we now create an extra large mapping which covers all possible lowmem instead of the physical ram that is actually available. This allows the vmi initialization to be done before max_low_pfn could be computed. We also move the vmi_init code very early in the boot process so that nobody accidentally breaks the fixmap dependancy. Signed-off-by: Alok N Kataria <akataria@vmware.com> Acked-by: Zachary Amsden <zach@vmware.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2008-08-08[CPUFREQ][2/2] preregister support for powernow-k8Mark Langsdorf
This patch provides support for the _PSD ACPI object in the Powernow-k8 driver. Although it looks like an invasive patch, most of it is simply the consequence of turning the static acpi_performance_data structure into a pointer. AMD has tested it on several machines over the past few days without issue. [trivial checkpatch warnings fixed up by davej] [X86_POWERNOW_K8_ACPI=n buildfix from Randy Dunlap] Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com> Tested-by: Frank Arnold <frank.arnold@amd.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Dave Jones <davej@redhat.com>
2008-08-08[CPUFREQ][1/2] whitespace fix for powernow-k8Mark Langsdorf
Trivial whitespace fix for powernow-k8. Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com> Signed-off-by: Dave Jones <davej@redhat.com>
2008-08-08[CPUFREQ] Fix warning in elanfreqDave Jones
arch/x86/kernel/cpu/cpufreq/elanfreq.c:47:26: warning: symbol 'elan_multiplier' was not declared. Should it be static? Yes, yes it should. Signed-off-by: Dave Jones <davej@redhat.com>
2008-08-08[CPUFREQ] Remove EXPERIMENTAL annotation from VIA C7 powersaver kconfig.Dave Jones
This has been pretty solid, and doesn't see much change at all. Noticed by Harald Welte. Signed-off-by: Dave Jones <davej@redhat.com>
2008-08-01Merge branch 'kvm-updates-2.6.27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm * 'kvm-updates-2.6.27' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: KVM: s390: Fix kvm on IBM System z10 KVM: Advertise synchronized mmu support to userspace KVM: Synchronize guest physical memory map to host virtual memory map KVM: Allow browsing memslots with mmu_lock KVM: Allow reading aliases with mmu_lock
2008-08-01Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: generic, x86: fix add iommu_num_pages helper function x86: remove stray <6> in BogoMIPS printk x86: move dma32_reserve_bootmem() after reserve_crashkernel()
2008-07-31x86: fdiv bug detection fixKrzysztof Helt
The fdiv detection code writes s32 integer into the boot_cpu_data.fdiv_bug. However, the boot_cpu_data.fdiv_bug is only char (s8) field so the detection overwrites already set fields for other bugs, e.g. the f00f bug field. Use local s32 variable to receive result. This is a partial fix to Bugzilla #9928 - fixes wrong information about the f00f bug (tested) and probably for coma bug (I have no cpu to test this). Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-30x86: wrong register was used in align macroVitaly Mayatskikh
New ALIGN_DESTINATION macro has sad typo: r8d register was used instead of ecx in fixup section. This can be considered as a regression. Register ecx was also wrongly loaded with value in r8d in copy_user_nocache routine. Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-30GRU Driver: export is_uv_system(), zap_page_range() & follow_page()Jack Steiner
Exports needed by the GRU driver. Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29generic, x86: fix add iommu_num_pages helper functionFUJITA Tomonori
This IOMMU helper function doesn't work for some architectures: http://marc.info/?l=linux-kernel&m=121699304403202&w=2 It also breaks POWER and SPARC builds: http://marc.info/?l=linux-kernel&m=121730388001890&w=2 Currently, only x86 IOMMUs use this so let's move it to x86 for now. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-29Merge commit 'v2.6.27-rc1' into x86/urgentIngo Molnar
2008-07-29KVM: Advertise synchronized mmu support to userspaceAvi Kivity
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-29KVM: Synchronize guest physical memory map to host virtual memory mapAndrea Arcangeli
Synchronize changes to host virtual addresses which are part of a KVM memory slot to the KVM shadow mmu. This allows pte operations like swapping, page migration, and madvise() to transparently work with KVM. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-29KVM: Allow browsing memslots with mmu_lockAndrea Arcangeli
This allows reading memslots with only the mmu_lock hold for mmu notifiers that runs in atomic context and with mmu_lock held. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-29KVM: Allow reading aliases with mmu_lockAndrea Arcangeli
This allows the mmu notifier code to run unalias_gfn with only the mmu_lock held. Only alias writes need the mmu_lock held. Readers will either take the slots_lock in read mode or the mmu_lock. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-07-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linusLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: lguest: turn Waker into a thread, not a process lguest: Enlarge virtio rings lguest: Use GSO/IFF_VNET_HDR extensions on tun/tap lguest: Remove 'network: no dma buffer!' warning lguest: Adaptive timeout lguest: Tell Guest net not to notify us on every packet xmit lguest: net block unneeded receive queue update notifications lguest: wrap last_avail accesses. lguest: use cpu capability accessors lguest: virtio-rng support lguest: Support assigning a MAC address lguest: Don't leak /dev/zero fd lguest: fix verbose printing of device features. lguest: fix switcher_page leak on unload lguest: Guest int3 fix lguest: set max_pfn_mapped, growl loudly at Yinghai Lu
2008-07-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (21 commits) x86/PCI: use dev_printk when possible PCI: add D3 power state avoidance quirk PCI: fix bogus "'device' may be used uninitialized" warning in pci_slot PCI: add an option to allow ASPM enabled forcibly PCI: disable ASPM on pre-1.1 PCIe devices PCI: disable ASPM per ACPI FADT setting PCI MSI: Don't disable MSIs if the mask bit isn't supported PCI: handle 64-bit resources better on 32-bit machines PCI: rewrite PCI BAR reading code PCI: document pci_target_state PCI hotplug: fix typo in pcie hotplug output x86 gart: replace to_pages macro with iommu_num_pages x86, AMD IOMMU: replace to_pages macro with iommu_num_pages iommu: add iommu_num_pages helper function dma-coherent: add documentation to new interfaces Cris: convert to using generic dma-coherent mem allocator Sh: use generic per-device coherent dma allocator ARM: support generic per-device coherent dma mem Generic dma-coherent: fix DMA_MEMORY_EXCLUSIVE x86: use generic per-device dma coherent allocator ...
2008-07-28Fix 'get_user_pages_fast()' with non-page-aligned start addressLinus Torvalds
Alexey Dobriyan reported trouble with LTP with the new fast-gup code, and Johannes Weiner debugged it to non-page-aligned addresses, where the new get_user_pages_fast() code would do all the wrong things, including just traversing past the end of the requested area due to 'addr' never matching 'end' exactly. This is not a pretty fix, and we may actually want to move the alignment into generic code, leaving just the core code per-arch, but Alexey verified that the vmsplice01 LTP test doesn't crash with this. Reported-and-tested-by: Alexey Dobriyan <adobriyan@gmail.com> Debugged-by: Johannes Weiner <hannes@saeurebad.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29lguest: set max_pfn_mapped, growl loudly at Yinghai LuRusty Russell
6af61a7614a306fe882a0c2b4ddc63b65aa66efc 'x86: clean up max_pfn_mapped usage - 32-bit' makes the following comment: XEN PV and lguest may need to assign max_pfn_mapped too. But no CC. Yinghai, wasting fellow developers' time is a VERY bad habit. If you do it again, I will hunt you down and try to extract the three hours of my life I just lost :) Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Yinghai Lu <yhlu.kernel@gmail.com>
2008-07-28mmu-notifiers: coreAndrea Arcangeli
With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU). The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it avoids the code that teardown the mappings of the secondary MMU, to implement a logic like tlb_gather in zap_page_range that would require many IPI to flush other cpu tlbs, for each fixed number of spte unmapped. To make an example: if what happens on the primary MMU is a protection downgrade (from writeable to wrprotect) the secondary MMU mappings will be invalidated, and the next secondary-mmu-page-fault will call get_user_pages and trigger a do_wp_page through get_user_pages if it called get_user_pages with write=1, and it'll re-establishing an updated spte or secondary-tlb-mapping on the copied page. Or it will setup a readonly spte or readonly tlb mapping if it's a guest-read, if it calls get_user_pages with write=0. This is just an example. This allows to map any page pointed by any pte (and in turn visible in the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an full MMU with both sptes and secondary-tlb like the shadow-pagetable layer with kvm), or a remote DMA in software like XPMEM (hence needing of schedule in XPMEM code to send the invalidate to the remote node, while no need to schedule in kvm/gru as it's an immediate event like invalidating primary-mmu pte). At least for KVM without this patch it's impossible to swap guests reliably. And having this feature and removing the page pin allows several other optimizations that simplify life considerably. Dependencies: 1) mm_take_all_locks() to register the mmu notifier when the whole VM isn't doing anything with "mm". This allows mmu notifier users to keep track if the VM is in the middle of the invalidate_range_begin/end critical section with an atomic counter incraese in range_begin and decreased in range_end. No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken too. 2) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_take_all_locks may be interrupted by a signal and return -EINTR. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. The patch also adds a few needed but missing includes that would prevent kernel to compile after these changes on non-x86 archs (x86 didn't need them by luck). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix mm/filemap_xip.c build] [akpm@linux-foundation.org: fix mm/mmu_notifier.c build] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kanoj Sarcar <kanojsarcar@yahoo.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Izik Eidus <izike@qumranet.com> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-28x86/PCI: use dev_printk when possibleBjorn Helgaas
Convert printks to use dev_printk(). I converted DBG() to dev_dbg(). This DBG() is from arch/x86/pci/pci.h and requires source-code modification to enable, so dev_dbg() seems roughly equivalent. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2008-07-28Merge branch 'core/generic-dma-coherent' of ↵Jesse Barnes
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into for-linus
2008-07-29Merge branch 'linus' into core/generic-dma-coherentIngo Molnar
Conflicts: arch/x86/Kconfig Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-28Merge branch 'x86/iommu' of ↵Jesse Barnes
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into for-linus
2008-07-28cpu masks: optimize and clean up cpumask_of_cpu()Linus Torvalds
Clean up and optimize cpumask_of_cpu(), by sharing all the zero words. Instead of stupidly generating all possible i=0...NR_CPUS 2^i patterns creating a huge array of constant bitmasks, realize that the zero words can be shared. In other words, on a 64-bit architecture, we only ever need 64 of these arrays - with a different bit set in one single world (with enough zero words around it so that we can create any bitmask by just offsetting in that big array). And then we just put enough zeroes around it that we can point every single cpumask to be one of those things. So when we have 4k CPU's, instead of having 4k arrays (of 4k bits each, with one bit set in each array - 2MB memory total), we have exactly 64 arrays instead, each 8k bits in size (64kB total). And then we just point cpumask(n) to the right position (which we can calculate dynamically). Once we have the right arrays, getting "cpumask(n)" ends up being: static inline const cpumask_t *get_cpu_mask(unsigned int cpu) { const unsigned long *p = cpu_bit_bitmap[1 + cpu % BITS_PER_LONG]; p -= cpu / BITS_PER_LONG; return (const cpumask_t *)p; } This brings other advantages and simplifications as well: - we are not wasting memory that is just filled with a single bit in various different places - we don't need all those games to re-create the arrays in some dense format, because they're already going to be dense enough. if we compile a kernel for up to 4k CPU's, "wasting" that 64kB of memory is a non-issue (especially since by doing this "overlapping" trick we probably get better cache behaviour anyway). [ mingo@elte.hu: Converted Linus's mails into a commit. See: http://lkml.org/lkml/2008/7/27/156 http://lkml.org/lkml/2008/7/28/320 Also applied a family filter - which also has the side-effect of leaving out the bits where Linus calls me an idio... Oh, never mind ;-) ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-28Merge branch 'linus' into cpus4096Ingo Molnar
2008-07-28Merge branch 'x86/crashdump' into x86/urgentIngo Molnar