path: root/arch/powerpc/kernel
2009-09-04  Merge branch 'amd-iommu/2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux-2.6-iommu into core/iommu  (Ingo Molnar)
2009-09-03  perf_counter/powerpc: Fix cache event codes for POWER7  (Paul Mackerras)
I had the codes for L1 D-cache load accesses and misses swapped around, and the wrong codes for LL-cache accesses and misses. This corrects them.
Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
LKML-Reference: <19103.8514.709300.585484@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-02  powerpc/fsl-booke: Use HW PTE format if CONFIG_PTE_64BIT  (Kumar Gala)
Switch to using the Power ISA defined PTE format when we have a 64-bit PTE. This makes the TLB fault handling code similar between fsl-booke and book3e-64. Additionally, this lets us take advantage of the page size encodings and full permissions that the HW PTE defines. Also define _PMD_PRESENT, _PMD_PRESENT_MASK, and _PMD_BAD, since the 32-bit ppc arch code expects them.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-09-02  powerpc/pseries: Fix to handle SLB resize across migration  (Brian King)
The SLB can change sizes across a live migration, which was not being handled. This could result in machine crashes during migration when migrating to a machine whose maximum SLB size is smaller than the source machine's. Fix this by first reducing the SLB size to the minimum possible value, which is 32, prior to migration. Then, during the device tree update which occurs after migration, make the call to ensure the SLB gets updated. Also add slb_size to the lparcfg output so that the migration tools can check that the kernel has this capability before allowing migration in scenarios where the SLB size will change.
BenH: Fixed #include <asm/mmu-hash64.h> -> <asm/mmu.h> to avoid breaking the ppc32 build.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
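A minimal sketch of the shape of the fix described above; the helper name and constant follow the description (reduce to the minimum size of 32 before migration) and should be treated as illustrative rather than the exact kernel code:

	#define SLB_MIN_SIZE	32

	/* illustrative: change the number of usable SLB entries */
	void slb_set_size(u16 size)
	{
		if (mmu_slb_size == size)
			return;
		mmu_slb_size = size;
		slb_flush_and_rebolt();	/* re-establish bolted entries at the new size */
	}

	/* before migration:  slb_set_size(SLB_MIN_SIZE);              */
	/* after migration:   restore from the updated device tree's   */
	/*                    slb-size property during the tree update */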
2009-09-02  powerpc/pci: Merge ppc32 and ppc64 versions of phb_scan()  (Grant Likely)
The two versions are doing almost exactly the same thing. No need to maintain them as separate files. This patch also has the side effect of making the PCI device tree scanning code available to 32-bit powerpc machines, but no board ports actually make use of this feature at this point.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-29  Merge branch 'timers/posixtimers' into timers/tracing  (Thomas Gleixner)
Merge reason: timer tracepoint patches depend on both branches
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-28  powerpc: Properly start decrementer on BookE secondary CPUs  (Benjamin Herrenschmidt)
This moves the code to start the decrementer on 40x and BookE into a separate function which is now called from time_init() and secondary_time_init(), before the respective clock sources are registered. We also remove the 85xx specific code for doing it from the platform code.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-28  powerpc/pci: Pull ppc32 PCI features into common  (Kumar Gala)
Some of the PCI features we have in ppc32 will be needed on ppc64 platforms in the future. These include support for:
 * ppc_md.pci_exclude_device
 * indirect config cycles
 * early config cycles
We also simplified the logic in fake_pci_bus() to assume it always gets a valid pci_controller, since all current callers seem to pass it one.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-28  powerpc/pci: move pci_64.c device tree scanning code into pci-common.c  (Grant Likely)
The PCI device tree scanning code in pci_64.c provides some useful functionality. It allows PCI devices to be described in the device tree instead of being probed for, which in turn allows PCI devices to use all of the device tree facilities to describe complex PCI bus architectures like GPIO and IRQ routing (perhaps not a common situation for desktop or server systems, but useful for embedded systems with on-board PCI devices). This patch moves the device tree scanning into pci-common.c so it is available for 32-bit powerpc machines too.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-28  powerpc/pci: Remove dead checks for CONFIG_PPC_OF  (Grant Likely)
PPC_OF is always selected for arch/powerpc, so this patch removes the stale #ifdef checks.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-28  powerpc/book3e-64: Add support to initial_tlb_book3e for non-HES TLB  (Kumar Gala)
We now search through TLBnCFG looking for the first array that has IPROT support (we assume that there is only one). If that TLB has hardware entry select (HES) support, we use the existing code with the proper TLB select (the HES code still needs to clean up bolted entries from firmware). The non-HES code is pretty similar to the 32-bit FSL Book-E code but makes some new assumptions (such as having tlbilx) and simplifies things down a bit.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-28  powerpc/book3e-64: Add helper function to setup IVORs  (Kumar Gala)
Not all 64-bit Book-3E parts will have fixed IVORs, so add a function that CPU setup code can call to set the base IVORs (0..15) to match the fixed offsets. We need to 'or' part of interrupt_base_book3e into the IVORs since, on parts that have them, the IVPR doesn't extend as far down.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-28  powerpc/book3e-64: Wait til generic_calibrate_decr to enable decrementer  (Kumar Gala)
Match what we do on 32-bit Book-E processors and enable the decrementer in generic_calibrate_decr. We need to make sure we disable the decrementer early in boot, since we currently use lazy (soft) interrupt disabling on 64-bit Book-E and could get a decrementer exception before we are ready for it.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
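For context, a hedged sketch of how a Book-E decrementer interrupt is gated: TCR[DIE] is the Book-E decrementer interrupt enable bit, and the placement shown here only illustrates the ordering described above, not the exact patch:

	/* early in boot: keep decrementer exceptions off until we can take them */
	mtspr(SPRN_TCR, mfspr(SPRN_TCR) & ~TCR_DIE);

	/* later, in generic_calibrate_decr(), once the timebase frequency is known */
	mtspr(SPRN_TCR, mfspr(SPRN_TCR) | TCR_DIE);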
2009-08-28  powerpc/book3e-64: Move the default cpu table entry  (Kumar Gala)
Move the default cpu table entry for CONFIG_PPC_BOOK3E_64 to the very end, since we will probably want to support both 32-bit and 64-bit kernels for some processors that are higher up in the list.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-28  powerpc: Add CONFIG_DMA_API_DEBUG support  (FUJITA Tomonori)
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-28  powerpc: Handle SWIOTLB mapping error properly  (FUJITA Tomonori)
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-28  powerpc: use dma_map_ops struct  (FUJITA Tomonori)
This converts powerpc to use the generic dma_map_ops struct (in include/linux/dma-mapping.h) instead of the powerpc homegrown dma_mapping_ops.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-28  powerpc: Remove swiotlb_pci_dma_ops  (FUJITA Tomonori)
Now swiotlb_pci_dma_ops is identical to swiotlb_dma_ops; we can use swiotlb_dma_ops with any devices. This removes swiotlb_pci_dma_ops.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-28  powerpc: Remove addr_needs_map in struct dma_mapping_ops  (FUJITA Tomonori)
This patch adds max_direct_dma_addr to struct dev_archdata to remove addr_needs_map in struct dma_mapping_ops. It also converts dma_capable() to use max_direct_dma_addr. max_direct_dma_addr is initialized in pci_dma_dev_setup_swiotlb(), called via the ppc_md.pci_dma_dev_setup hook.
For further information: http://marc.info/?t=124719060200001&r=1&w=2
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
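A hedged sketch of the dma_capable() logic this describes; the field and helper names come from the text above, but the body is illustrative rather than the exact kernel code:

	static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
	{
		struct dev_archdata *sd = &dev->archdata;

		/* a non-zero max_direct_dma_addr replaces the old
		 * addr_needs_map() hook: anything above it must be
		 * bounced through swiotlb rather than mapped directly */
		if (sd->max_direct_dma_addr && addr + size > sd->max_direct_dma_addr)
			return false;

		return addr + size <= *dev->dma_mask;
	}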
2009-08-28  Merge commit 'tip/iommu-for-powerpc' into next  (Benjamin Herrenschmidt)
2009-08-27  powerpc/pseries: Reduce the polling interval in __cpu_up()  (Gautham R Shenoy)
The time taken for a single cpu online operation on a pseries machine is as follows:
 Dedicated LPAR (POWER6): ~220ms
 Shared LPAR (POWER5):    ~240ms
Of this time, approximately 200ms is taken up by __cpu_up(). This is because we poll every 200ms to check whether the new cpu has announced its presence through the cpu_callin_map. We repeat this operation until the new cpu sets the value in cpu_callin_map or 5 seconds elapse, whichever comes earlier. However, using completion structs instead of polling loops, the time taken by the new processor to indicate its presence was found to be less than 1ms on pseries. That method may not work on all powerpc platforms, though, due to the time-base synchronization code. Keeping this in mind, we can reduce the msleep polling interval from 200ms to 1ms while retaining the 5 second timeout. With this, the time taken for a cpu online operation changes as follows:
 Dedicated LPAR (POWER6): 20-25ms
 Shared LPAR (POWER5):    60-80ms
In both these cases, it was found that the code polls through the loop only once, indicating that 1ms is a reasonable value, at least on pseries. The code needs testing on other powerpc platforms.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Joel Schopp <jschopp@austin.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
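A minimal sketch of the polling loop shape implied above; the constants follow the description (1ms interval, 5 second timeout) while the loop variable and error path are illustrative:

	/* wait for the new cpu to announce itself via cpu_callin_map;
	 * poll every 1ms (was 200ms), give up after ~5 seconds */
	for (c = 5000; c && !cpu_callin_map[cpu]; c--)
		msleep(1);

	if (!cpu_callin_map[cpu]) {
		printk(KERN_ERR "Processor %u is stuck.\n", cpu);
		return -ENOENT;
	}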
2009-08-27  powerpc: Fix __flush_icache_range on 44x  (Josh Boyer)
The ptrace POKETEXT interface allows a process to modify the text pages of a child process being ptraced, usually to insert breakpoints via trap instructions. The kernel eventually calls copy_to_user_page, which in turn calls __flush_icache_range to invalidate the icache lines for the child process. However, this function does not work on 44x due to the icache being virtually indexed. This was noticed by a breakpoint being triggered after it had been cleared by ltrace on a 440EPx board. The convenient solution is to do a flash invalidate of the icache in the __flush_icache_range function.
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
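For reference, a hedged sketch of what a flash invalidate looks like on a 440-class core; on these cores the iccci instruction ignores its operands and invalidates the entire icache. The wrapper name is hypothetical and this is illustrative, not the actual patch (which is in assembly):

	static inline void flash_invalidate_icache_44x(void)
	{
		/* iccci on 440 invalidates the whole icache regardless of
		 * its operands; isync ensures subsequent fetches see it */
		asm volatile("iccci 0, 0; isync" : : : "memory");
	}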
2009-08-27  powerpc/mm: Cleanup handling of execute permission  (Benjamin Herrenschmidt)
This is an attempt at cleaning up a bit the way we handle execute permission on powerpc. _PAGE_HWEXEC is gone, _PAGE_EXEC is now only defined by CPUs that can do something with it, and the myriad of #ifdef's in the I$/D$ coherency code is reduced to 2 cases that hopefully should cover everything.

The logic on BookE is a little bit different than it was, though not by much. Since _PAGE_EXEC will now be set by the generic code for executable pages, we need to filter out the ones that are unclean and recover it. However, I don't expect the code to be more bloated than it already was in that area due to that change.

I could boast that this brings proper enforcement of per-page execute permissions to all BookE and 40x, but in fact, we've had that for some time as a side effect of my previous rework in that area (and I didn't even know it :-). We would only enable execute permission if the page was cache clean, and we would only cache clean it if we took an exec fault. Since we now enforce that the latter only works if VM_EXEC is part of the VMA flags, we de facto already enforce per-page execute permissions... unless I missed something.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-23  timekeeping: Increase granularity of read_persistent_clock(), build fix  (Martin Schwidefsky)
Fix the following build problem on powerpc:
 arch/powerpc/kernel/time.c: In function 'read_persistent_clock':
 arch/powerpc/kernel/time.c:788: error: 'return' with a value, in function returning void
 arch/powerpc/kernel/time.c:791: error: 'return' with a value, in function returning void
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: dwalker@fifo99.com
Cc: johnstul@us.ibm.com
LKML-Reference: <20090822222313.74b9619c@skybase>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-20  Merge commit 'paulus-perf/master' into next  (Benjamin Herrenschmidt)
2009-08-20  powerpc/vmlinux.lds: Move _edata down  (Michael Ellerman)
Currently _edata does not include several data sections. This causes the kernel's report of memory usage at boot to not match reality, and also prevents kmemleak from working, because it scans between _sdata and _edata for pointers to allocated memory. This mirrors a similar change made recently to the x86 linker script.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Enable GCOV  (Michael Ellerman)
Make it possible to enable GCOV code coverage measurement on powerpc. Lightly tested on 64-bit, seems to work as expected.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Use DIV_ROUND_CLOSEST in time init code  (Julia Lawall)
The kernel.h macro DIV_ROUND_CLOSEST performs the same computation as (x + d/2)/d but is more readable. The semantic patch that makes this change is as follows (http://www.emn.fr/x-info/coccinelle/):

	// <smpl>
	@haskernel@
	@@
	#include <linux/kernel.h>

	@depends on haskernel@
	expression x,__divisor;
	@@
	- (((x) + ((__divisor) / 2)) / (__divisor))
	+ DIV_ROUND_CLOSEST(x,__divisor)
	// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
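To make the equivalence concrete, a small standalone sketch; the values are hypothetical and the macro shown is a simplified form of the kernel's definition:

	#include <stdio.h>

	/* simplified form of the kernel.h macro */
	#define DIV_ROUND_CLOSEST(x, divisor)	(((x) + ((divisor) / 2)) / (divisor))

	int main(void)
	{
		unsigned long ticks_per_jiffy = 3333333, hz = 1000; /* hypothetical */

		/* open-coded rounding division, as the time init code had it */
		printf("%lu\n", (ticks_per_jiffy + hz / 2) / hz);        /* 3333 */
		/* same result, but the intent is stated */
		printf("%lu\n", DIV_ROUND_CLOSEST(ticks_per_jiffy, hz)); /* 3333 */
		return 0;
	}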
2009-08-20  powerpc/prom_init: Evaluate mem kernel parameter for early allocation  (Benjamin Krill)
Evaluate the mem= kernel parameter for early memory allocations: if mem= is set, no allocation in the region above the given boundary is allowed. The current code doesn't take care of this and allocates memory above the given boundary.
Signed-off-by: Benjamin Krill <ben@codiert.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Add AMCC 460EX/460GT Rev. B support to cputable.c  (Stefan Roese)
Signed-off-by: Stefan Roese <sr@denx.de>
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Remaining 64-bit Book3E support  (Benjamin Herrenschmidt)
This contains all the bits that didn't fit in previous patches :-) This includes the actual exception handlers assembly, the changes to the kernel entry, other misc bits and wiring it all up in Kconfig.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Add TLB management code for 64-bit Book3E  (Benjamin Herrenschmidt)
This adds the TLB miss handler assembly and the low level TLB flush routines, along with the necessary hooks for dealing with our virtual page tables or indirect TLB entries that need to be flushed when PTE pages are freed. There is currently no support for hugetlbfs.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Add PACA fields specific to 64-bit Book3E processors  (Benjamin Herrenschmidt)
This adds various fields in the PACA that are for use specifically by Book3E processors, such as exception save areas, the current pgd pointer, special exception kernel stacks, etc.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Add memory management headers for new 64-bit BookE  (Benjamin Herrenschmidt)
This adds the PTE and pgtable format definitions, along with changes to the kernel memory map and other definitions related to implementing support for 64-bit Book3E. This also shields some asm-offsets bits that are currently only relevant on 32-bit. We also move the definition of the "linux" page size constants to the common mmu.h file and add a few sizes that are relevant to embedded processors.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Move definitions of secondary CPU spinloop to header file  (Benjamin Herrenschmidt)
Those definitions are currently declared extern in the .c file where they are used; move them to a header file instead.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Clean ifdef usage in copy_thread()  (Benjamin Herrenschmidt)
Currently, a single ifdef covers SLB related bits and more generic ppc64 related bits; split this into two separate ifdefs, since 64-bit BookE will need one but not the other.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc/mm: Call mmu_context_init() from ppc64  (Benjamin Herrenschmidt)
Our 64-bit hash context handling has no init function, but 64-bit Book3E will use the common mmu_context_nohash.c code, which does. So define an empty inline mmu_context_init() for 64-bit server and call it from our 64-bit setup_arch().
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
2009-08-20  powerpc/of: Remove useless register save/restore when calling OF back  (Benjamin Herrenschmidt)
enter_prom() used to save and restore registers such as CTR, XER, etc., which are volatile, or SRR0/1, which we don't care about. This removes a bunch of useless code and, while at it, turns an mtmsrd into an MTMSRD macro, which will be useful to Book3E.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Add compat_sys_truncate  (Benjamin Herrenschmidt)
The truncate syscall has a signed long parameter, so when using a 32-bit userspace with a 64-bit kernel the argument is zero-extended instead of sign-extended. Adding the compat_sys_truncate function fixes the issue. This was noticed during an LSB truncate test failure. The test was checking for the correct error number set when truncate is called with a length of -1. The test can be found at: http://bzr.linuxfoundation.org/lsb/devel/runtime-test?cmd=inventory;rev=stewb%40linux-foundation.org-20090626205411-sfb23cc0tjj7jzgm;path=modules/vsx-pcts/tset/POSIX.os/files/truncate/
BenH: Added compat_sys_ftruncate() as well, same issue.
Signed-off-by: Chase Douglas <cndougla@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
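A hedged sketch of the compat wrapper pattern being described; the exact kernel code may differ, but the point is that compat_off_t is a signed 32-bit type, so passing it through sign-extends a length of -1 correctly:

	/* illustrative: the 32-bit ABI passes length as a signed 32-bit value */
	asmlinkage long compat_sys_truncate(const char __user *path,
					    compat_off_t length)
	{
		/* compat_off_t is s32; the implicit conversion to the native
		 * long parameter of sys_truncate() sign-extends, so a 32-bit
		 * -1 arrives as -1 rather than 0xffffffff */
		return sys_truncate(path, length);
	}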
2009-08-20  powerpc: Remove use of a second scratch SPRG in STAB code  (Benjamin Herrenschmidt)
The STAB code used on Power3 and RS/64 uses a second scratch SPRG to save a GPR in order to decide whether to go to do_stab_bolted_* or to handle a normal data access exception. This gets in the way of our scheme of freeing up SPRG3, which is user visible, for userspace use, since we cannot use SPRG0 instead: on RS/64 it seems to be read-only for supervisor mode (like POWER4). This reworks the STAB exception entry to use the PACA as temporary storage instead.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Use names rather than numbers for SPRGs (v2)  (Benjamin Herrenschmidt)
The kernel uses SPRG registers for various purposes, typically in low level assembly code as scratch registers or to hold per-cpu global information such as the PACA or the current thread_info pointer. We want to be able to easily shuffle the usage of those registers, as some implementations have specific constraints related to some of them (for example, some have userspace readable aliases, etc.) and the current choice isn't always the best.

This patch should not change any code generation. It replaces the usage of SPRN_SPRGn everywhere in the kernel with a named replacement and adds documentation next to the definition of the names as to what they are used for on each processor family. The only parts that still use the original numbers are bits of KVM or suspend/resume code that just blindly need to save/restore all the SPRGs.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
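To illustrate the convention: the actual per-family assignments live in asm/reg.h and differ between processor families, so the mappings below are hypothetical examples rather than the real ones:

	/* hypothetical examples of the named-SPRG indirection: the code
	 * says what the register is for, the per-family header says
	 * which SPRG actually backs it */
	#define SPRN_SPRG_PACA		SPRN_SPRG1	/* per-cpu PACA pointer */
	#define SPRN_SPRG_SCRATCH0	SPRN_SPRG2	/* exception scratch */

	/* low level code then reads:
	 *	mfspr	r13, SPRN_SPRG_PACA
	 */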
2009-08-20  powerpc: Rename exception.h to exception-64s.h  (Benjamin Herrenschmidt)
The file include/asm/exception.h contains definitions that are specific to exception handling on 64-bit server type processors. This renames the file to exception-64s.h to reflect that fact and avoid confusion.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20  powerpc: Move 64bit VDSO to improve context switch performance  (Anton Blanchard)
For 64-bit applications the VDSO is the only thing in segment 0. Since the VDSO is position independent, we can remove the hint and let get_unmapped_area pick an area. This means the VDSO will be near other mmaps and will share an SLB entry:

	10000000-10001000 r-xp 00000000 08:06 5778459    /root/context_switch_64
	10010000-10011000 r--p 00000000 08:06 5778459    /root/context_switch_64
	10011000-10012000 rw-p 00001000 08:06 5778459    /root/context_switch_64
	fffa92ae000-fffa92b0000 rw-p 00000000 00:00 0
	fffa92b0000-fffa9453000 r-xp 00000000 08:06 4334051    /lib64/power6/libc-2.9.so
	fffa9453000-fffa9462000 ---p 001a3000 08:06 4334051    /lib64/power6/libc-2.9.so
	fffa9462000-fffa9466000 r--p 001a2000 08:06 4334051    /lib64/power6/libc-2.9.so
	fffa9466000-fffa947c000 rw-p 001a6000 08:06 4334051    /lib64/power6/libc-2.9.so
	fffa947c000-fffa9480000 rw-p 00000000 00:00 0
	fffa9480000-fffa94a8000 r-xp 00000000 08:06 4333852    /lib64/ld-2.9.so
	fffa94b3000-fffa94b4000 rw-p 00000000 00:00 0
	fffa94b4000-fffa94b7000 r-xp 00000000 00:00 0    [vdso]    <----- here I am
	fffa94b7000-fffa94b8000 r--p 00027000 08:06 4333852    /lib64/ld-2.9.so
	fffa94b8000-fffa94bb000 rw-p 00028000 08:06 4333852    /lib64/ld-2.9.so
	fffa94bb000-fffa94bc000 rw-p 00000000 00:00 0
	fffe4c10000-fffe4c25000 rw-p 00000000 00:00 0    [stack]

On a microbenchmark that bounces a token between two 64-bit processes over pipes and calls gettimeofday each iteration (to access the VDSO), our context switch rate goes from 268k to 277k ctx switches/sec (tested on a 4GHz POWER6).
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-18  perf_counter: powerpc: Add callchain support  (Paul Mackerras)
This adds support for tracing callchains for powerpc, both 32-bit and 64-bit, and both in the kernel and userspace, from PMU interrupt context.

The first three entries stored for each callchain are the NIP (next instruction pointer), LR (link register), and the contents of the LR save area in the second stack frame (the first is ignored because the ABI convention on powerpc is that functions save their return address in their caller's stack frame). Because leaf functions don't have to save their return address (LR value) and don't have to establish a stack frame, it's possible for either or both of LR and the second stack frame's LR save area to have valid return addresses in them. This is basically impossible to disambiguate without either reading the code or looking at auxiliary information such as CFI tables. Since we don't want to do either of those things at interrupt time, we store both LR and the second stack frame's LR save area. Once we get past the second stack frame, there is no ambiguity; all return addresses we get are reliable.

For kernel traces, we check whether they are valid kernel instruction addresses and store zero instead if they are not (rather than omitting them, which would make it impossible for userspace to know which was which). We also store zero instead of the second stack frame's LR save area value if it is the same as LR.

For kernel traces, we check for interrupt frames, and for user traces, we check for signal frames. In each case, since we're starting a new trace, we store a PERF_CONTEXT_KERNEL/USER marker so that userspace knows that the next three entries are NIP, LR and the second stack frame for the interrupted context.

We read user memory with __get_user_inatomic. On 64-bit, if this PMU interrupt occurred while interrupts are soft-disabled, and there is no MMU hash table entry for the page, we will get an -EFAULT return from __get_user_inatomic even if there is a valid Linux PTE for the page, since hash_page isn't reentrant. Thus we have code here to read the Linux PTE and access the page via the kernel linear mapping. Since 64-bit doesn't use (or need) highmem there is no need to do kmap_atomic. On 32-bit, we don't do soft interrupt disabling, so this complication doesn't occur and there is no need to fall back to reading the Linux PTE, since hash_page (or the TLB miss handler) will get called automatically if necessary.

Note that we cannot get PMU interrupts in the interval during context switch between switch_mm (which switches the user address space) and switch_to (which actually changes current to the new process). On 64-bit this is because interrupts are hard-disabled in switch_mm and stay hard-disabled until they are soft-enabled later, after switch_to has returned. So there is no possibility of trying to do a user stack trace when the user address space is not current's address space.
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2009-08-18  powerpc: Allow perf_counters to access user memory at interrupt time  (Paul Mackerras)
This provides a mechanism to allow the perf_counters code to access user memory in a PMU interrupt routine. Such an access can cause various kinds of interrupt: SLB miss, MMU hash table miss, segment table miss, or TLB miss, depending on the processor. This commit only deals with 64-bit classic/server processors, which use an MMU hash table. 32-bit processors are already able to access user memory at interrupt time. Since we don't soft-disable on 32-bit, we avoid the possibility of reentering hash_page or the TLB miss handlers, since they run with interrupts disabled.

On 64-bit processors, an SLB miss interrupt on a user address will update the slb_cache and slb_cache_ptr fields in the paca. This is OK except in the case where a PMU interrupt occurs in switch_slb, which also accesses those fields. To prevent this, we hard-disable interrupts in switch_slb. Interrupts are already soft-disabled at this point, and will get hard-enabled when they get soft-enabled later.

This also reworks slb_flush_and_rebolt, both to avoid hard-disabling twice and to make sure that it clears the slb_cache_ptr when called from callers other than switch_slb: the existing routine is renamed to __slb_flush_and_rebolt, which is called by switch_slb and by the new version of slb_flush_and_rebolt. Similarly, switch_stab (used on POWER3 and RS64 processors) gets a hard_irq_disable() to protect the per-cpu variables used there and in ste_allocate.

If an MMU hash table miss interrupt occurs, normally we would call hash_page to look up the Linux PTE for the address and create a HPTE. However, hash_page is fairly complex and takes some locks, so to avoid the possibility of deadlock, we check the preemption count to see if we are in a (pseudo-)NMI handler, and if so, we don't call hash_page but instead treat it like a bad access that will get reported up through the exception table mechanism. An interrupt whose handler runs even though the interrupt occurred when soft-disabled (such as the PMU interrupt) is considered a pseudo-NMI handler, which should use nmi_enter()/nmi_exit() rather than irq_enter()/irq_exit().
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2009-08-15  timekeeping: Increase granularity of read_persistent_clock()  (Martin Schwidefsky)
The persistent clock of some architectures (e.g. s390) has a granularity better than seconds. To reduce the delta between the host clock and the guest clock in a virtualized system, change the read_persistent_clock function to return a struct timespec.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134811.013873340@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
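The interface change this describes, sketched as before/after signatures; the out-parameter form matches both this description and the powerpc build-fix errors quoted in the 2009-08-23 entry above ('return' with a value, in function returning void):

	/* before: whole seconds only */
	unsigned long read_persistent_clock(void);

	/* after: sub-second granularity via a timespec out-parameter */
	void read_persistent_clock(struct timespec *ts);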
2009-08-14  powerpc64: convert to dynamic percpu allocator  (Tejun Heo)
Now that percpu allows arbitrary embedding of the first chunk, powerpc64 can easily be converted to the dynamic percpu allocator. Convert it. powerpc supports several large page sizes; cap atom_size at 1M, as there isn't much to gain by going above that anyway.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-14  Merge branch 'percpu-for-linus' into percpu-for-next  (Tejun Heo)
Conflicts:
 arch/sparc/kernel/smp_64.c
 arch/x86/kernel/cpu/perf_counter.c
 arch/x86/kernel/setup_percpu.c
 drivers/cpufreq/cpufreq_ondemand.c
 mm/percpu.c
Conflicts in core and arch percpu code are mostly from commit ed78e1e078dd44249f88b1dd8c76dafb39567161, which substituted many num_possible_cpus() calls with nr_cpu_ids. As the for-next branch has moved all the first chunk allocators into mm/percpu.c, the changes are moved from arch code to mm/percpu.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
2009-08-10  Merge branch 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip  (Linus Torvalds)
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits)
 perf_counter: Zero dead bytes from ftrace raw samples size alignment
 perf_counter: Subtract the buffer size field from the event record size
 perf_counter: Require CAP_SYS_ADMIN for raw tracepoint data
 perf_counter: Correct PERF_SAMPLE_RAW output
 perf tools: callchain: Fix bad rounding of minimum rate
 perf_counter tools: Fix libbfd detection for systems with libz dependency
 perf: "Longum est iter per praecepta, breve et efficax per exempla"
 perf_counter: Fix a race on perf_counter_ctx
 perf_counter: Fix tracepoint sampling to be part of generic sampling
 perf_counter: Work around gcc warning by initializing tracepoint record unconditionally
 perf tools: callchain: Fix sum of percentages to be 100% by displaying amount of ignored chains in fractal mode
 perf tools: callchain: Fix 'perf report' display to be callchain by default
 perf tools: callchain: Fix spurious 'perf report' warnings: ignore empty callchains
 perf record: Fix the -A UI for empty or non-existent perf.data
 perf util: Fix do_read() to fail on EOF instead of busy-looping
 perf list: Fix the output to not include tracepoints without an id
 perf_counter/powerpc: Fix oops on cpus without perf_counter hardware support
 perf stat: Fix tool option consistency: rename -S/--scale to -c/--scale
 perf report: Add debug help for the finding of symbol bugs - show the symtab origin (DSO, build-id, kernel, etc)
 perf report: Fix per task mult-counter stat reporting
 ...
2009-08-10  powerpc/dma: pci_set_dma_mask() shouldn't fail if mask fits in RAM  (Benjamin Herrenschmidt)
On an iMac G5, the b43 driver is failing to initialise because trying to set the dma mask to 30-bit fails, even though there's only 512MiB of RAM in the machine anyway: https://bugzilla.redhat.com/show_bug.cgi?id=514787
We should probably let it succeed if the available RAM in the system doesn't exceed the requested limit.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
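A hedged sketch of the idea; lmb_end_of_DRAM() was the then-current powerpc API for the top of system memory (since renamed to memblock_end_of_DRAM()), and the helper name and body are illustrative:

	static int dma_direct_dma_supported(struct device *dev, u64 mask)
	{
	#ifdef CONFIG_PPC64
		/* accept any mask that can address all of RAM, even if it
		 * is narrower than what the device could theoretically
		 * generate */
		return mask >= (lmb_end_of_DRAM() - 1);
	#else
		return 1;
	#endif
	}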