aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2008-03-04alloc_percpu() fails to allocate percpu dataEric Dumazet
Some oprofile results obtained while using tbench on a 2x2 cpu machine were very surprising. For example, loopback_xmit() function was using high number of cpu cycles to perform the statistic updates, supposed to be real cheap since they use percpu data pcpu_lstats = netdev_priv(dev); lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id()); lb_stats->packets++; /* HERE : serious contention */ lb_stats->bytes += skb->len; struct pcpu_lstats is a small structure containing two longs. It appears that on my 32bits platform, alloc_percpu(8) allocates a single cache line, instead of giving to each cpu a separate cache line. Using the following patch gave me impressive boost in various benchmarks ( 6 % in tbench) (all percpu_counters hit this bug too) Long term fix (ie >= 2.6.26) would be to let each CPU allocate their own block of memory, so that we dont need to roudup sizes to L1_CACHE_BYTES, or merging the SGI stuff of course... Note : SLUB vs SLAB is important here to *show* the improvement, since they dont have the same minimum allocation sizes (8 bytes vs 32 bytes). This could very well explain regressions some guys reported when they switched to SLUB. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04zlc_setup(): handle jiffies wraparoundKOSAKI Motohiro
jiffies subtraction may cause an overflow problem. It should be using time_after(). [akpm@linux-foundation.org: include jiffies.h] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Paul Jackson <pj@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-03slub: fix possible NULL pointer dereferenceCyrill Gorcunov
This patch fix possible NULL pointer dereference if kzalloc failed. To be able to return proper error code the function return type is changed to ssize_t (according to callees and sysfs definitions). Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03slub: Add kmalloc_large_node() to support kmalloc_node fallbackChristoph Lameter
Slub is missing some NUMA support for large kmallocs. Provide that. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03slub: look up object from the freelist oncePekka J Enberg
We only need to look up object from c->page->freelist once in __slab_alloc(). Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03slub: Fix up commentsChristoph Lameter
Provide comments and fix up various spelling / style issues. Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03slub: Rearrange #ifdef CONFIG_SLUB_DEBUG in calculate_sizes()Christoph Lameter
Group SLUB_DEBUG code together to reduce the number of #ifdefs. Move some debug checks under the #ifdef. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03slub: Remove BUG_ON() from ksize and omit checks for !SLUB_DEBUGChristoph Lameter
The BUG_ONs are useless since the pointer derefs will lead to NULL deref errors anyways. Some of the checks are not necessary if no debugging is possible. Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03slub: Use the objsize from the kmem_cache_cpu structureChristoph Lameter
No need to access the kmem_cache structure. We have the same value in kmem_cache_cpu. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03slub: Remove useless checks in alloc_debug_processingChristoph Lameter
Alloc debug processing is never called with a NULL object pointer. No reason to check for NULL. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03slub: Remove objsize check in kmem_cache_flags()Christoph Lameter
There is no page->offset anymore and also no associated limit on the number of objects. The page->offset field was removed for 2.6.24. So the check in kmem_cache_flags() is now also obsolete (should have been dropped earlier, somehow a hunk vanished). Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-by: Christoph Lameter <clameter@sgi.com>
2008-03-03slub: rename slab_objects to show_slab_objectsChristoph Lameter
The sysfs callback is better named show_slab_objects since it is always called from the xxx_show callbacks. We need the name for other purposes later. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03Revert "unique end pointer" patchChristoph Lameter
This only made sense for the alternate fastpath which was reverted last week. Mathieu is working on a new version that addresses the fastpath issues but that new code first needs to go through mm and it is not clear if we need the unique end pointers with his new scheme. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03docbook: fix kernel-api source filesRandy Dunlap
Fix docbook problems in kernel-api.tmpl. These cause the generated docbook to be incorrect. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23memcgroup: return negative error code in mem_cgroup_create()Li Zefan
Cgroup requires the subsystem to return negative error code on error in the create method. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23memcgroup: remove a useless VM_BUG_ON()Li Zefan
Remove this VM_BUG_ON(), as Balbir stated: We used to have a for loop with !list_empty() as a termination condition and VM_BUG_ON(!pc) is a spill over. With the new loop, VM_BUG_ON(!pc) does not make sense. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Balbir Singh <balbir@in.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23Solve section mismatch for free_area_init_core.Alexander van Heukelum
WARNING: vmlinux.o(.meminit.text+0x649): Section mismatch in reference from the function free_area_init_core() to the function .init.text:setup_usemap() The function __meminit free_area_init_core() references a function __init setup_usemap(). If free_area_init_core is only used by setup_usemap then annotate free_area_init_core with a matching annotation. The warning is covers this stack of functions in mm/page_alloc.c: alloc_bootmem_node must be marked __init. alloc_bootmem_node is used by setup_usemap, if !SPARSEMEM. (usemap_size is only used by setup_usemap, if !SPARSEMEM.) setup_usemap is only used by free_area_init_core. free_area_init_core is only used by free_area_init_node. free_area_init_node is used by: arch/alpha/mm/numa.c: __init paging_init() arch/arm/mm/init.c: __init bootmem_init_node() arch/avr32/mm/init.c: __init paging_init() arch/cris/arch-v10/mm/init.c: __init paging_init() arch/cris/arch-v32/mm/init.c: __init paging_init() arch/m32r/mm/discontig.c: __init zone_sizes_init() arch/m32r/mm/init.c: __init zone_sizes_init() arch/m68k/mm/motorola.c: __init paging_init() arch/m68k/mm/sun3mmu.c: __init paging_init() arch/mips/sgi-ip27/ip27-memory.c: __init paging_init() arch/parisc/mm/init.c: __init paging_init() arch/sparc/mm/srmmu.c: __init srmmu_paging_init() arch/sparc/mm/sun4c.c: __init sun4c_paging_init() arch/sparc64/mm/init.c: __init paging_init() mm/page_alloc.c: __init free_area_init_nodes() mm/page_alloc.c: __init free_area_init() and mm/memory_hotplug.c: hotadd_new_pgdat() hotadd_new_pgdat can not be an __init function, but: It is compiled for MEMORY_HOTPLUG configurations only MEMORY_HOTPLUG depends on SPARSEMEM || X86_64_ACPI_NUMA X86_64_ACPI_NUMA depends on X86_64 ARCH_FLATMEM_ENABLE depends on X86_32 ARCH_DISCONTIGMEM_ENABLE depends on X86_32 So X86_64_ACPI_NUMA implies SPARSEMEM, right? So we can mark the stack of functions __init for !SPARSEMEM, but we must mark them __meminit for SPARSEMEM configurations. This is ok, because then the calls to alloc_bootmem_node are also avoided. Compile-tested on: silly minimal config defconfig x86_32 defconfig x86_64 defconfig x86_64 -HIBERNATION +MEMORY_HOTPLUG Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm> Reviewed-by: Sam Ravnborg <sam@ravnborg.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23hugetlb: ensure we do not reference a surplus page after handing it to buddyAndy Whitcroft
When we free a page via free_huge_page and we detect that we are in surplus the page will be returned to the buddy. After this we no longer own the page. However at the end free_huge_page we clear out our mapping pointer from page private. Even where the page is not a surplus we free the page to the hugepage pool, drop the pool locks and then clear page private. In either case the page may have been reallocated. BAD. Make sure we clear out page private before we free the page. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Adam Litke <agl@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-19Revert "SLUB: Alternate fast paths using cmpxchg_local"Linus Torvalds
This reverts commit 1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c, which is suspected to be the reason for some very occasional and hard-to-trigger crashes that usually look related to memory allocation (mostly reported in networking, but since that's generally the most common source of shortlived allocations - and allocations in interrupt contexts - that in itself is not a big clue). See for example http://bugzilla.kernel.org/show_bug.cgi?id=9973 http://lkml.org/lkml/2008/2/19/278 etc. One promising suspicion for what the root cause of bug is (which also explains why it's so hard to trigger in practice) came from Eric Dumazet: "I wonder how SLUB_FASTPATH is supposed to work, since it is affected by a classical ABA problem of lockless algo. cmpxchg_local(&c->freelist, object, object[c->offset]) can succeed, while an interrupt came (on this cpu), and several allocations were done, and one free was performed at the end of this interruption, so 'object' was recycled. c->freelist can then contain the previous value (object), but object[c->offset] was changed by IRQ. We then put back in freelist an already allocated object." but another reason for the revert is simply that everybody agrees that this code was the main suspect just by virtue of the pattern of oopses. Cc: Torsten Kaiser <just.for.lkml@googlemail.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Ingo Molnar <mingo@elte.hu> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14Merge branch 'slab-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm * 'slab-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm: slub: Support 4k kmallocs again to compensate for page allocator slowness slub: Fallback to kmalloc_large for failing higher order allocs slub: Determine gfpflags once and not every time a slab is allocated make slub.c:slab_address() static slub: kmalloc page allocator pass-through cleanup slab: avoid double initialization & do initialization in 1 place
2008-02-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86: x86: cpa, fix out of date comment KVM is not seen under X86 config with latest git (32 bit compile) x86: cpa: ensure page alignment x86: include proper prototypes for rodata_test x86: fix gart_iommu_init() x86: EFI set_memory_x()/set_memory_uc() fixes x86: make dump_pagetable() static x86: fix "BUG: sleeping function called from invalid context" in print_vma_addr()
2008-02-14d_path: Make d_path() use a struct pathJan Blunck
d_path() is used on a <dentry,vfsmount> pair. Lets use a struct path to reflect this. [akpm@linux-foundation.org: fix build in mm/memory.c] Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Bryan Wu <bryan.wu@analog.com> Acked-by: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14d_path: Make seq_path() use a struct path argumentJan Blunck
seq_path() is always called with a dentry and a vfsmount from a struct path. Make seq_path() take it directly as an argument. Signed-off-by: Jan Blunck <jblunck@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14slub: Support 4k kmallocs again to compensate for page allocator slownessChristoph Lameter
Currently we hand off PAGE_SIZEd kmallocs to the page allocator in the mistaken belief that the page allocator can handle these allocations effectively. However, measurements indicate a minimum slowdown by the factor of 8 (and that is only SMP, NUMA is much worse) vs the slub fastpath which causes regressions in tbench. Increase the number of kmalloc caches by one so that we again handle 4k kmallocs directly from slub. 4k page buffering for the page allocator will be performed by slub like done by slab. At some point the page allocator fastpath should be fixed. A lot of the kernel would benefit from a faster ability to allocate a single page. If that is done then the 4k allocs may again be forwarded to the page allocator and this patch could be reverted. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14slub: Fallback to kmalloc_large for failing higher order allocsChristoph Lameter
Slub already has two ways of allocating an object. One is via its own logic and the other is via the call to kmalloc_large to hand off object allocation to the page allocator. kmalloc_large is typically used for objects >= PAGE_SIZE. We can use that handoff to avoid failing if a higher order kmalloc slab allocation cannot be satisfied by the page allocator. If we reach the out of memory path then simply try a kmalloc_large(). kfree() can already handle the case of an object that was allocated via the page allocator and so this will work just fine (apart from object accounting...). For any kmalloc slab that already requires higher order allocs (which makes it impossible to use the page allocator fastpath!) we just use PAGE_ALLOC_COSTLY_ORDER to get the largest number of objects in one go from the page allocator slowpath. On a 4k platform this patch will lead to the following use of higher order pages for the following kmalloc slabs: 8 ... 1024 order 0 2048 .. 4096 order 3 (4k slab only after the next patch) We may waste some space if fallback occurs on a 2k slab but we are always able to fallback to an order 0 alloc. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14slub: Determine gfpflags once and not every time a slab is allocatedChristoph Lameter
Currently we determine the gfp flags to pass to the page allocator each time a slab is being allocated. Determine the bits to be set at the time the slab is created. Store in a new allocflags field and add the flags in allocate_slab(). Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14make slub.c:slab_address() staticAdrian Bunk
slab_address() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14slub: kmalloc page allocator pass-through cleanupPekka Enberg
This adds a proper function for kmalloc page allocator pass-through. While it simplifies any code that does slab tracing code a lot, I think it's a worthwhile cleanup in itself. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14slab: avoid double initialization & do initialization in 1 placeMarcin Slusarz
- alloc_slabmgmt: initialize all slab fields in 1 place - slab->nodeid was initialized twice: in alloc_slabmgmt and immediately after it in cache_grow Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> CC: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14x86: fix "BUG: sleeping function called from invalid context" in ↵Ingo Molnar
print_vma_addr() Jiri Kosina reported the following deadlock scenario with show_unhandled_signals enabled: [ 68.379022] gnome-settings-[2941] trap int3 ip:3d2c840f34 sp:7fff36f5d100 error:0<3>BUG: sleeping function called from invalid context at kernel/rwsem.c:21 [ 68.379039] in_atomic():1, irqs_disabled():0 [ 68.379044] no locks held by gnome-settings-/2941. [ 68.379050] Pid: 2941, comm: gnome-settings- Not tainted 2.6.25-rc1 #30 [ 68.379054] [ 68.379056] Call Trace: [ 68.379061] <#DB> [<ffffffff81064883>] ? __debug_show_held_locks+0x13/0x30 [ 68.379109] [<ffffffff81036765>] __might_sleep+0xe5/0x110 [ 68.379123] [<ffffffff812f2240>] down_read+0x20/0x70 [ 68.379137] [<ffffffff8109cdca>] print_vma_addr+0x3a/0x110 [ 68.379152] [<ffffffff8100f435>] do_trap+0xf5/0x170 [ 68.379168] [<ffffffff8100f52b>] do_int3+0x7b/0xe0 [ 68.379180] [<ffffffff812f4a6f>] int3+0x9f/0xd0 [ 68.379203] <<EOE>> [ 68.379229] in libglib-2.0.so.0.1505.0[3d2c800000+dc000] and tracked it down to: commit 03252919b79891063cf99145612360efbdf9500b Author: Andi Kleen <ak@suse.de> Date: Wed Jan 30 13:33:18 2008 +0100 x86: print which shared library/executable faulted in segfault etc. messages the problem is that we call down_read() from an atomic context. Solve this by returning from print_vma_addr() if the preempt count is elevated. Update preempt_conditional_sti / preempt_conditional_cli to unconditionally lift the preempt count even on !CONFIG_PREEMPT. Reported-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13hugetlb: fix overcommit lockingNishanth Aravamudan
proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't hold hugetlb_lock over the call. Use a dummy variable to store the sysctl result, like in hugetlb_sysctl_handler(), then grab the lock to update nr_overcommit_huge_pages. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Reported-by: Miles Lane <miles.lane@gmail.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13remove final fastcall usersHarvey Harrison
fastcall always expands to empty, remove it. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-11mempolicy: silently restrict nodemask to allowed nodesKOSAKI Motohiro
Kosaki Motohito noted that "numactl --interleave=all ..." failed in the presence of memoryless nodes. This patch attempts to fix that problem. Some background: numactl --interleave=all calls set_mempolicy(2) with a fully populated [out to MAXNUMNODES] nodemask. set_mempolicy() [in do_set_mempolicy()] calls contextualize_policy() which requires that the nodemask be a subset of the current task's mems_allowed; else EINVAL will be returned. A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY] i.e., nodes with memory. So, a fully populated nodemask will be declared invalid if it includes memoryless nodes. NOTE: the same thing will occur when running in a cpuset with restricted mem_allowed--for the same reason: node mask contains dis-allowed nodes. mbind(2), on the other hand, just masks off any nodes in the nodemask that are not included in the caller's mems_allowed. In each case [mbind() and set_mempolicy()], mpol_check_policy() will complain [again, resulting in EINVAL] if the nodemask contains any memoryless nodes. This is somewhat redundant as mpol_new() will remove memoryless nodes for interleave policy, as will bind_zonelist()--called by mpol_new() for BIND policy. Proposed fix: 1) modify contextualize_policy logic to: a) remember whether the incoming node mask is empty. b) if not, restrict the nodemask to allowed nodes, as is currently done in-line for mbind(). This guarantees that the resulting mask includes only nodes with memory. NOTE: this is a [benign, IMO] change in behavior for set_mempolicy(). Dis-allowed nodes will be silently ignored, rather than returning an error. c) fold this code into mpol_check_policy(), replace 2 calls to contextualize_policy() to call mpol_check_policy() directly and remove contextualize_policy(). 2) In existing mpol_check_policy() logic, after "contextualization": a) MPOL_DEFAULT: require that in coming mask "was_empty" b) MPOL_{BIND|INTERLEAVE}: require that contextualized nodemask contains at least one node. c) add a case for MPOL_PREFERRED: if in coming was not empty and resulting mask IS empty, user specified invalid nodes. Return EINVAL. c) remove the now redundant check for memoryless nodes 3) remove the now redundant masking of policy nodes for interleave policy from mpol_new(). 4) Now that mpol_check_policy() contextualizes the nodemask, remove the in-line nodes_and() from sys_mbind(). I believe that this restores mbind() to the behavior before the memoryless-nodes patch series. E.g., we'll no longer treat an invalid nodemask with MPOL_PREFERRED as local allocation. [ Patch history: v1 -> v2: - Communicate whether or not incoming node mask was empty to mpol_check_policy() for better error checking. - As suggested by David Rientjes, remove the now unused cpuset_nodes_subset_current_mems_allowed() from cpuset.h v2 -> v3: - As suggested by Kosaki Motohito, fold the "contextualization" of policy nodemask into mpol_check_policy(). Looks a little cleaner. ] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-11Be more robust about bad arguments in get_user_pages()Jonathan Corbet
So I spent a while pounding my head against my monitor trying to figure out the vmsplice() vulnerability - how could a failure to check for *read* access turn into a root exploit? It turns out that it's a buffer overflow problem which is made easy by the way get_user_pages() is coded. In particular, "len" is a signed int, and it is only checked at the *end* of a do {} while() loop. So, if it is passed in as zero, the loop will execute once and decrement len to -1. At that point, the loop will proceed until the next invalid address is found; in the process, it will likely overflow the pages array passed in to get_user_pages(). I think that, if get_user_pages() has been asked to grab zero pages, that's what it should do. Thus this patch; it is, among other things, enough to block the (already fixed) root exploit and any others which might be lurking in similar code. I also think that the number of pages should be unsigned, but changing the prototype of this function probably requires some more careful review. Signed-off-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-09memcontrol: add vm_match_cgroup()David Rientjes
mm_cgroup() is exclusively used to test whether an mm's mem_cgroup pointer is pointing to a specific cgroup. Instead of returning the pointer, we can just do the test itself in a new macro: vm_match_cgroup(mm, cgroup) returns non-zero if the mm's mem_cgroup points to cgroup. Otherwise it returns zero. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08mm: special mapping nopageNick Piggin
Convert special mapping install from nopage to fault. Because the "vm_file" is NULL for the special mapping, the generic VM code has messed up "vm_pgoff" thinking that it's an anonymous mapping and the offset does't matter. For that reason, we need to undo the vm_pgoff offset that got added into vmf->pgoff. [ We _really_ should clean that up - either by making this whole special mapping code just use a real file entry rather than that ugly array of "struct page" pointers, or by just making the VM code realize that even if vm_file is NULL it may not be a regular anonymous mmap. - Linus ] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08CONFIG_HIGHPTE vs. sub-page page tables.Martin Schwidefsky
Background: I've implemented 1K/2K page tables for s390. These sub-page page tables are required to properly support the s390 virtualization instruction with KVM. The SIE instruction requires that the page tables have 256 page table entries (pte) followed by 256 page status table entries (pgste). The pgstes are only required if the process is using the SIE instruction. The pgstes are updated by the hardware and by the hypervisor for a number of reasons, one of them is dirty and reference bit tracking. To avoid wasting memory the standard pte table allocation should return 1K/2K (31/64 bit) and 2K/4K if the process is using SIE. Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means the s390 version for pte_alloc_one cannot return a pointer to a struct page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one cannot return a pointer to a pte either, since that would require more than 32 bit for the return value of pte_alloc_one (and the pte * would not be accessible since its not kmapped). Solution: The only solution I found to this dilemma is a new typedef: a pgtable_t. For s390 pgtable_t will be a (pte *) - to be introduced with a later patch. For everybody else it will be a (struct page *). The additional problem with the initialization of the ptl lock and the NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and a destructor pgtable_page_dtor. The page table allocation and free functions need to call these two whenever a page table page is allocated or freed. pmd_populate will get a pgtable_t instead of a struct page pointer. To get the pgtable_t back from a pmd entry that has been installed with pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page call in free_pte_range and apply_to_pte_range. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08mount-options-fix-tmpfs-fixAndrew Morton
Documentation/SubmitCheckist, please. Cc: Hugh Dickins <hugh@veritas.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08mount options: fix tmpfsakpm@linux-foundation.org
Add .show_options super operation to tmpfs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08kill do_generic_mapping_readChristoph Hellwig
do_generic_mapping_read was used by gfs2 for internals reads, but this use of the interface was rather suboptimal (as was the whole interface) and has been replaced by an internal helper now. This patch kills do_generic_mapping_read and surrounding damage in preparation of additional cleanups for the buffered read path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Use pgoff_t instead of unsigned longJan Kara
Convert variables containing page indexes to pgoff_t. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08misc: removal of final callers using fastcallHarvey Harrison
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08hugetlb: add locking for overcommit sysctlNishanth Aravamudan
When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used proc_doulongvec_minmax() directly. However, hugetlb.c's locking rules require that all counter modifications occur under the hugetlb_lock. Add a callback into the hugetlb code similar to the one for nr_hugepages. Grab the lock around the manipulation of nr_overcommit_hugepages in proc_doulongvec_minmax(). Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07SLUB: fix checkpatch warningsIngo Molnar
fix checkpatch --file mm/slub.c errors and warnings. $ q-code-quality-compare errors lines of code errors/KLOC mm/slub.c [before] 22 4204 5.2 mm/slub.c [after] 0 4210 0 no code changed: text data bss dec hex filename 22195 8634 136 30965 78f5 slub.o.before 22195 8634 136 30965 78f5 slub.o.after md5: 93cdfbec2d6450622163c590e1064358 slub.o.before.asm 93cdfbec2d6450622163c590e1064358 slub.o.after.asm [clameter: rediffed against Pekka's cleanup patch, omitted moves of the name of a function to the start of line] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-07Use non atomic unlockNick Piggin
Slub can use the non-atomic version to unlock because other flags will not get modified with the lock held. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-07SLUB: Support for performance statisticsChristoph Lameter
The statistics provided here allow the monitoring of allocator behavior but at the cost of some (minimal) loss of performance. Counters are placed in SLUB's per cpu data structure. The per cpu structure may be extended by the statistics to grow larger than one cacheline which will increase the cache footprint of SLUB. There is a compile option to enable/disable the inclusion of the runtime statistics and its off by default. The slabinfo tool is enhanced to support these statistics via two options: -D Switches the line of information displayed for a slab from size mode to activity mode. -A Sorts the slabs displayed by activity. This allows the display of the slabs most important to the performance of a certain load. -r Report option will report detailed statistics on Example (tbench load): slabinfo -AD ->Shows the most active slabs Name Objects Alloc Free %Fast skbuff_fclone_cache 33 111953835 111953835 99 99 :0000192 2666 5283688 5281047 99 99 :0001024 849 5247230 5246389 83 83 vm_area_struct 1349 119642 118355 91 22 :0004096 15 66753 66751 98 98 :0000064 2067 25297 23383 98 78 dentry 10259 28635 18464 91 45 :0000080 11004 18950 8089 98 98 :0000096 1703 12358 10784 99 98 :0000128 762 10582 9875 94 18 :0000512 184 9807 9647 95 81 :0002048 479 9669 9195 83 65 anon_vma 777 9461 9002 99 71 kmalloc-8 6492 9981 5624 99 97 :0000768 258 7174 6931 58 15 So the skbuff_fclone_cache is of highest importance for the tbench load. Pretty high load on the 192 sized slab. Look for the aliases slabinfo -a | grep 000192 :0000192 <- xfs_btree_cur filp kmalloc-192 uid_cache tw_sock_TCP request_sock_TCPv6 tw_sock_TCPv6 skbuff_head_cache xfs_ili Likely skbuff_head_cache. Looking into the statistics of the skbuff_fclone_cache is possible through slabinfo skbuff_fclone_cache ->-r option implied if cache name is mentioned .... Usual output ... Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 111953360 111946981 99 99 Slowpath 1044 7423 0 0 Page Alloc 272 264 0 0 Add partial 25 325 0 0 Remove partial 86 264 0 0 RemoteObj/SlabFrozen 350 4832 0 0 Total 111954404 111954404 Flushes 49 Refill 0 Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%) Looks good because the fastpath is overwhelmingly taken. skbuff_head_cache: Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 5297262 5259882 99 99 Slowpath 4477 39586 0 0 Page Alloc 937 824 0 0 Add partial 0 2515 0 0 Remove partial 1691 824 0 0 RemoteObj/SlabFrozen 2621 9684 0 0 Total 5301739 5299468 Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%) Descriptions of the output: Total: The total number of allocation and frees that occurred for a slab Fastpath: The number of allocations/frees that used the fastpath. Slowpath: Other allocations Page Alloc: Number of calls to the page allocator as a result of slowpath processing Add Partial: Number of slabs added to the partial list through free or alloc (occurs during cpuslab flushes) Remove Partial: Number of slabs removed from the partial list as a result of allocations retrieving a partial slab or by a free freeing the last object of a slab. RemoteObj/Froz: How many times were remotely freed object encountered when a slab was about to be deactivated. Frozen: How many times was free able to skip list processing because the slab was in use as the cpuslab of another processor. Flushes: Number of times the cpuslab was flushed on request (kmem_cache_shrink, may result from races in __slab_alloc) Refill: Number of times we were able to refill the cpuslab from remotely freed objects for the same slab. Deactivate: Statistics how slabs were deactivated. Shows how they were put onto the partial list. In general fastpath is very good. Slowpath without partial list processing is also desirable. Any touching of partial list uses node specific locks which may potentially cause list lock contention. Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-07SLUB: Alternate fast paths using cmpxchg_localChristoph Lameter
Provide an alternate implementation of the SLUB fast paths for alloc and free using cmpxchg_local. The cmpxchg_local fast path is selected for arches that have CONFIG_FAST_CMPXCHG_LOCAL set. An arch should only set CONFIG_FAST_CMPXCHG_LOCAL if the cmpxchg_local is faster than an interrupt enable/disable sequence. This is known to be true for both x86 platforms so set FAST_CMPXCHG_LOCAL for both arches. Currently another requirement for the fastpath is that the kernel is compiled without preemption. The restriction will go away with the introduction of a new per cpu allocator and new per cpu operations. The advantages of a cmpxchg_local based fast path are: 1. Potentially lower cycle count (30%-60% faster) 2. There is no need to disable and enable interrupts on the fast path. Currently interrupts have to be disabled and enabled on every slab operation. This is likely avoiding a significant percentage of interrupt off / on sequences in the kernel. 3. The disposal of freed slabs can occur with interrupts enabled. The alternate path is realized using #ifdef's. Several attempts to do the same with macros and inline functions resulted in a mess (in particular due to the strange way that local_interrupt_save() handles its argument and due to the need to define macros/functions that sometimes disable interrupts and sometimes do something else). [clameter: Stripped preempt bits and disabled fastpath if preempt is enabled] Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-07SLUB: Use unique end pointer for each slab page.Christoph Lameter
We use a NULL pointer on freelists to signal that there are no more objects. However the NULL pointers of all slabs match in contrast to the pointers to the real objects which are in different ranges for different slab pages. Change the end pointer to be a pointer to the first object and set bit 0. Every slab will then have a different end pointer. This is necessary to ensure that end markers can be matched to the source slab during cmpxchg_local. Bring back the use of the mapping field by SLUB since we would otherwise have to call a relatively expensive function page_address() in __slab_alloc(). Use of the mapping field allows avoiding a call to page_address() in various other functions as well. There is no need to change the page_mapping() function since bit 0 is set on the mapping as also for anonymous pages. page_mapping(slab_page) will therefore still return NULL although the mapping field is overloaded. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-07SLUB: Deal with annoying gcc warning on kfree()Christoph Lameter
gcc 4.2 spits out an annoying warning if one casts a const void * pointer to a void * pointer. No warning is generated if the conversion is done through an assignment. Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-07Introduce flags for reserve_bootmem()Bernhard Walle
This patchset adds a flags variable to reserve_bootmem() and uses the BOOTMEM_EXCLUSIVE flag in crashkernel reservation code to detect collisions between crashkernel area and already used memory. This patch: Change the reserve_bootmem() function to accept a new flag BOOTMEM_EXCLUSIVE. If that flag is set, the function returns with -EBUSY if the memory already has been reserved in the past. This is to avoid conflicts. Because that code runs before SMP initialisation, there's no race condition inside reserve_bootmem_core(). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix powerpc build] Signed-off-by: Bernhard Walle <bwalle@suse.de> Cc: <linux-arch@vger.kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>