aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2009-06-16oom: move oom_adj value from task_struct to mm_structDavid Rientjes
The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: reuse unused swap entry if necessaryKAMEZAWA Hiroyuki
Presently we can know a swap entry is just used as SwapCache via swap_map, without looking up swap cache. Then, we have a chance to reuse swap-cache-only swap entries in get_swap_pages(). This patch tries to free swap-cache-only swap entries if swap is not enough. Note: We hit following path when swap_cluster code cannot find a free cluster. Then, vm_swap_full() is not only condition to allow the kernel to reclaim unused swap. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: modify swap_map and add SWAP_HAS_CACHE flagKAMEZAWA Hiroyuki
This is a part of the patches for fixing memcg's swap accountinf leak. But, IMHO, not a bad patch even if no memcg. There are 2 kinds of references to swap. - reference from swap entry - reference from swap cache Then, - If there is swap cache && swap's refcnt is 1, there is only swap cache. (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL This counting logic have worked well for a long time. But considering that we cannot know there is a _real_ reference or not by swap_map[], current usage of counter is not very good. This patch adds a flag SWAP_HAS_CACHE and recored information that a swap entry has a cache or not. This will remove -1 magic used in swapfile.c and be a help to avoid unnecessary find_get_page(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: add swap cache interface for swap referenceKAMEZAWA Hiroyuki
In a following patch, the usage of swap cache is recorded into swap_map. This patch is for necessary interface changes to do that. 2 interfaces: - swapcache_prepare() - swapcache_free() are added for allocating/freeing refcnt from swap-cache to existing swap entries. But implementation itself is not changed under this patch. At adding swapcache_free(), memcg's hook code is moved under swapcache_free(). This is better than using scattered hooks. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: remove CONFIG_UNEVICTABLE_LRU config optionKOSAKI Motohiro
Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this configurability is unnecessary. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page-allocator: reset wmark_min and inactive ratio of zone when hotplug happensMinchan Kim
Solve two problems. Whenever memory hotplug sucessfully happens, zone->present_pages have to be changed. 1) Now memory hotplug calls setup_per_zone_wmark_min only when online_pages called, not offline_pages. It breaks balance. 2) If zone->present_pages is changed, we also have to change zone->inactive_ratio. That's because inactive_ratio depends on zone->present_pages. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page-allocator: add inactive ratio calculation function of each zoneMinchan Kim
Factor the per-zone arithemetic inside setup_per_zone_inactive_ratio()'s loop into a a separate function, calculate_zone_inactive_ratio(). This function will be used in a later patch [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page-allocator: clean up functions related to pages_minMinchan Kim
Change the names of two functions. It doesn't affect behavior. Presently, setup_per_zone_pages_min() changes low, high of zone as well as min. So a better name is setup_per_zone_wmarks(). That's because Mel changed zone->pages_[hig/low/min] to zone->watermark array in "page allocator: replace the watermark-related union in struct zone with a watermark[] array". * setup_per_zone_pages_min => setup_per_zone_wmarks Of course, we have to change init_per_zone_pages_min, too. There are not pages_min any more. * init_per_zone_pages_min => init_per_zone_wmark_min [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page-allocator: use integer fields lookup for gfp_zone and check for errors ↵Christoph Lameter
in flags passed to the page allocator This simplifies the code in gfp_zone() and also keeps the ability of the compiler to use constant folding to get rid of gfp_zone processing. The lookup of the zone is done using a bitfield stored in an integer. So the code in gfp_zone is a simple extraction of bits from a constant bitfield. The compiler is generating a load of a constant into a register and then performs a shift and mask operation to get the zone from a gfp_t. No cachelines are touched and no branches have to be predicted by the compiler. We are doing some macro tricks here to convince the compiler to always do the constant folding if possible. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: check the argument of kunmap on architectures without highmemMatthew Wilcox
If you're using a non-highmem architecture, passing an argument with the wrong type to kunmap() doesn't give you a warning because the ifdef doesn't check the type. Using a static inline function solves the problem nicely. Reported-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16vmscan: prevent shrinking of active anon lru list in case of no swap space V3MinChan Kim
shrink_zone() can deactivate active anon pages even if we don't have a swap device. Many embedded products don't have a swap device. So the deactivation of anon pages is unnecessary. This patch prevents unnecessary deactivation of anon lru pages. But, it don't prevent aging of anon pages to swap out. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16migration: only migrate_prep() once per move_pages()Brice Goglin
migrate_prep() is fairly expensive (72us on 16-core barcelona 1.9GHz). Commit 3140a2273009c01c27d316f35ab76a37e105fdd8 improved move_pages() throughput by breaking it into chunks, but it also made migrate_prep() be called once per chunk (every 128pages or so) instead of once per move_pages(). This patch reverts to calling migrate_prep() only once per chunk as we did before 2.6.29. It is also a followup to commit 0aedadf91a70a11c4a3e7c7d99b21e5528af8d5d ("mm: move migrate_prep out from under mmap_sem"). This improves migration throughput on the above machine from 600MB/s to 750MB/s. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm, PM/Freezer: Disable OOM killer when tasks are frozenRafael J. Wysocki
Currently, the following scenario appears to be possible in theory: * Tasks are frozen for hibernation or suspend. * Free pages are almost exhausted. * Certain piece of code in the suspend code path attempts to allocate some memory using GFP_KERNEL and allocation order less than or equal to PAGE_ALLOC_COSTLY_ORDER. * __alloc_pages_internal() cannot find a free page so it invokes the OOM killer. * The OOM killer attempts to kill a task, but the task is frozen, so it doesn't die immediately. * __alloc_pages_internal() jumps to 'restart', unsuccessfully tries to find a free page and invokes the OOM killer. * No progress can be made. Although it is now hard to trigger during hibernation due to the memory shrinking carried out by the hibernation code, it is theoretically possible to trigger during suspend after the memory shrinking has been removed from that code path. Moreover, since memory allocations are going to be used for the hibernation memory shrinking, it will be even more likely to happen during hibernation. To prevent it from happening, introduce the oom_killer_disabled switch that will cause __alloc_pages_internal() to fail in the situations in which the OOM killer would have been called and make the freezer set this switch after tasks have been successfully frozen. [akpm@linux-foundation.org: be nicer to the namespace] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Fengguang Wu <fengguang.wu@gmail.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: madvise(): correct return codeNick Piggin
The posix_madvise() function succeeds (and does nothing) when called with parameters (NULL, 0, -1); according to LSB tests, it should fail with EINVAL because -1 is not a valid flag. When called with a valid address and size, it correctly fails. So perform an initial check for valid flags first. Reported-by: Jiri Dluhos <jdluhos@novell.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-and-Tested-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page-allocator: warn if __GFP_NOFAIL is used for a large allocationAndrew Morton
__GFP_NOFAIL is a bad fiction. Allocations _can_ fail, and callers should detect and suitably handle this (and not by lamely moving the infinite loop up to the caller level either). Attempting to use __GFP_NOFAIL for a higher-order allocation is even worse, so add a once-off runtime check for this to slap people around for even thinking about trying it. Cc: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16videobuf-dma-contig: zero copy USERPTR supportMagnus Damm
Since videobuf-dma-contig is designed to handle physically contiguous memory, this patch modifies the videobuf-dma-contig code to only accept a user space pointer to physically contiguous memory. For now only VM_PFNMAP vmas are supported, so forget hotplug. On SuperH Mobile we use this with our sh_mobile_ceu_camera driver together with various multimedia accelerator blocks that are exported to user space using UIO. The UIO kernel code exports physically contiguous memory to user space and lets the user space application mmap() this memory and pass a pointer using the USERPTR interface for V4L2 zero copy operation. With this approach we support zero copy capture, hardware scaling and various forms of hardware encoding and decoding. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Hans Verkuil <hverkuil@xs4all.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: introduce follow_pfn()Johannes Weiner
Analoguous to follow_phys(), add a helper that looks up the PFN at a user virtual address in an IO mapping or a raw PFN mapping. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Magnus Damm <magnus.damm@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: use generic follow_pte() in follow_phys()Johannes Weiner
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Magnus Damm <magnus.damm@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: introduce follow_pte()Johannes Weiner
A generic readonly page table lookup helper to map an address space and an address from it to a pte. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Magnus Damm <magnus.damm@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: setup_per_zone_inactive_ratio - fix comment and make it __initCyrill Gorcunov
The caller of setup_per_zone_inactive_ratio is an __init function. There is no need to keep the callee after it completed as well. Also fix a comment. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: setup_per_zone_inactive_ratio - do not call for int_sqrt if not neededCyrill Gorcunov
int_sqrt() returns 0 if its argument is zero so call it if only needed. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16vmscan: ZVC updates in shrink_active_list() can be done onceWu Fengguang
This effectively lifts the unit of updates to nr_inactive_* and pgdeactivate from PAGEVEC_SIZE=14 to SWAP_CLUSTER_MAX=32, or MAX_ORDER_NR_PAGES=1024 for reclaim_zone(). Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16vmscan: don't export nr_saved_scan in /proc/zoneinfoWu Fengguang
The lru->nr_saved_scan's are not meaningful counters for even kernel developers. They typically are smaller than 32 and are always 0 for large lists. So remove them from /proc/zoneinfo. Hopefully this interface change won't break too many scripts. /proc/zoneinfo is too unstructured to be script friendly, and I wonder the affected scripts - if there are any - are still bleeding since the not long ago commit "vmscan: split LRU lists into anon & file sets", which also touched the "scanned" line :) If we are to re-export accumulated vmscan counts in the future, they can go to new lines in /proc/zoneinfo instead of the current form, or to /sys/devices/system/node/node0/meminfo? Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16vmscan: cleanup the scan batching codeWu Fengguang
The vmscan batching logic is twisting. Move it into a standalone function nr_scan_try_batch() and document it. No behavior change. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16vmscan: evict use-once pages firstRik van Riel
When the file LRU lists are dominated by streaming IO pages, evict those pages first, before considering evicting other pages. This should be safe from deadlocks or performance problems because only three things can happen to an inactive file page: 1) referenced twice and promoted to the active list 2) evicted by the pageout code 3) under IO, after which it will get evicted or promoted The pages freed in this way can either be reused for streaming IO, or allocated for something else. If the pages are used for streaming IO, this pageout pattern continues. Otherwise, we will fall back to the normal pageout pattern. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Elladan <elladan@eskimo.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16pagemap: add page-types toolWu Fengguang
Add page-types, a handy tool for querying page flags. It will expand some of the overloaded flags: PG_slob_free = PG_private PG_slub_frozen = PG_active PG_slub_debug = PG_error PG_readahead = PG_reclaim and mask out obscure flags except in -raw mode: PG_reserved PG_mlocked PG_mappedtodisk PG_private PG_private_2 PG_owner_priv_1 PG_arch_1 PG_uncached PG_compound* for non hugeTLB pages [akpm@linux-foundation.org: fix warning] Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16pagemap: document 9 more exported page flagsWu Fengguang
Also add short descriptions for all of the 20 exported page flags. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16pagemap: document clarificationsWu Fengguang
Some bit ranges were inclusive and some not. Fix them to be consistently inclusive. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16proc: export more page flags in /proc/kpageflagsWu Fengguang
Export all page flags faithfully in /proc/kpageflags. 11. KPF_MMAP (pseudo flag) memory mapped page 12. KPF_ANON (pseudo flag) memory mapped page (anonymous) 13. KPF_SWAPCACHE page is in swap cache 14. KPF_SWAPBACKED page is swap/RAM backed 15. KPF_COMPOUND_HEAD (*) 16. KPF_COMPOUND_TAIL (*) 17. KPF_HUGE hugeTLB pages 18. KPF_UNEVICTABLE page is in the unevictable LRU list 19. KPF_HWPOISON(TBD) hardware detected corruption 20. KPF_NOPAGE (pseudo flag) no page frame at the address 32-39. more obscure flags for kernel developers (*) For compound pages, exporting _both_ head/tail info enables users to tell where a compound page starts/ends, and its order. The accompanying page-types tool will handle the details like decoupling overloaded flags and hiding obscure flags to normal users. Thanks to KOSAKI and Andi for their valuable recommendations! Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16proc: kpagecount/kpageflags code cleanupWu Fengguang
Move increments of pfn/out to bottom of the loop. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: introduce PageHuge() for testing huge/gigantic pagesWu Fengguang
A series of patches to enhance the /proc/pagemap interface and to add a userspace executable which can be used to present the pagemap data. Export 10 more flags to end users (and more for kernel developers): 11. KPF_MMAP (pseudo flag) memory mapped page 12. KPF_ANON (pseudo flag) memory mapped page (anonymous) 13. KPF_SWAPCACHE page is in swap cache 14. KPF_SWAPBACKED page is swap/RAM backed 15. KPF_COMPOUND_HEAD (*) 16. KPF_COMPOUND_TAIL (*) 17. KPF_HUGE hugeTLB pages 18. KPF_UNEVICTABLE page is in the unevictable LRU list 19. KPF_HWPOISON hardware detected corruption 20. KPF_NOPAGE (pseudo flag) no page frame at the address (*) For compound pages, exporting _both_ head/tail info enables users to tell where a compound page starts/ends, and its order. a simple demo of the page-types tool # ./page-types -h page-types [options] -r|--raw Raw mode, for kernel developers -a|--addr addr-spec Walk a range of pages -b|--bits bits-spec Walk pages with specified bits -l|--list Show page details in ranges -L|--list-each Show page details one by one -N|--no-summary Don't show summay info -h|--help Show this usage message addr-spec: N one page at offset N (unit: pages) N+M pages range from N to N+M-1 N,M pages range from N to M-1 N, pages range from N to end ,M pages range from 0 to M bits-spec: bit1,bit2 (flags & (bit1|bit2)) != 0 bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1 bit1,~bit2 (flags & (bit1|bit2)) == bit1 =bit1,bit2 flags == (bit1|bit2) bit-names: locked error referenced uptodate dirty lru active slab writeback reclaim buddy mmap anonymous swapcache swapbacked compound_head compound_tail huge unevictable hwpoison nopage reserved(r) mlocked(r) mappedtodisk(r) private(r) private_2(r) owner_private(r) arch(r) uncached(r) readahead(o) slob_free(o) slub_frozen(o) slub_debug(o) (r) raw mode bits (o) overloaded bits # ./page-types flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 487369 1903 _________________________________ 0x0000000000000014 5 0 __R_D____________________________ referenced,dirty 0x0000000000000020 1 0 _____l___________________________ lru 0x0000000000000024 34 0 __R__l___________________________ referenced,lru 0x0000000000000028 3838 14 ___U_l___________________________ uptodate,lru 0x0001000000000028 48 0 ___U_l_______________________I___ uptodate,lru,readahead 0x000000000000002c 6478 25 __RU_l___________________________ referenced,uptodate,lru 0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead 0x0000000000000040 8344 32 ______A__________________________ active 0x0000000000000060 1 0 _____lA__________________________ lru,active 0x0000000000000068 348 1 ___U_lA__________________________ uptodate,lru,active 0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead 0x000000000000006c 988 3 __RU_lA__________________________ referenced,uptodate,lru,active 0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead 0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked 0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked 0x0000000000000400 503 1 __________B______________________ buddy 0x0000000000000804 1 0 __R________M_____________________ referenced,mmap 0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap 0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead 0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap 0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead 0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap 0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead 0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap 0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead 0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked 0x0000000000001000 492 1 ____________a____________________ anonymous 0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked 0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 30 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 513968 2007 # ./page-types -r flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 468002 1828 _________________________________ 0x0000000100000000 19102 74 _____________________r___________ reserved 0x0000000000008000 41 0 _______________H_________________ compound_head 0x0000000000010000 188 0 ________________T________________ compound_tail 0x0000000000008014 1 0 __R_D__________H_________________ referenced,dirty,compound_head 0x0000000000010014 4 0 __R_D___________T________________ referenced,dirty,compound_tail 0x0000000000000020 1 0 _____l___________________________ lru 0x0000000800000024 34 0 __R__l__________________P________ referenced,lru,private 0x0000000000000028 3794 14 ___U_l___________________________ uptodate,lru 0x0001000000000028 46 0 ___U_l_______________________I___ uptodate,lru,readahead 0x0000000400000028 44 0 ___U_l_________________d_________ uptodate,lru,mappedtodisk 0x0001000400000028 2 0 ___U_l_________________d_____I___ uptodate,lru,mappedtodisk,readahead 0x000000000000002c 6434 25 __RU_l___________________________ referenced,uptodate,lru 0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead 0x000000040000002c 14 0 __RU_l_________________d_________ referenced,uptodate,lru,mappedtodisk 0x000000080000002c 30 0 __RU_l__________________P________ referenced,uptodate,lru,private 0x0000000800000040 8124 31 ______A_________________P________ active,private 0x0000000000000040 219 0 ______A__________________________ active 0x0000000800000060 1 0 _____lA_________________P________ lru,active,private 0x0000000000000068 322 1 ___U_lA__________________________ uptodate,lru,active 0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead 0x0000000400000068 13 0 ___U_lA________________d_________ uptodate,lru,active,mappedtodisk 0x0000000800000068 12 0 ___U_lA_________________P________ uptodate,lru,active,private 0x000000000000006c 977 3 __RU_lA__________________________ referenced,uptodate,lru,active 0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead 0x000000040000006c 5 0 __RU_lA________________d_________ referenced,uptodate,lru,active,mappedtodisk 0x000000080000006c 3 0 __RU_lA_________________P________ referenced,uptodate,lru,active,private 0x0000000c0000006c 3 0 __RU_lA________________dP________ referenced,uptodate,lru,active,mappedtodisk,private 0x0000000c00000068 1 0 ___U_lA________________dP________ uptodate,lru,active,mappedtodisk,private 0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked 0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked 0x0000000000000400 538 2 __________B______________________ buddy 0x0000000000000804 1 0 __R________M_____________________ referenced,mmap 0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap 0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead 0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap 0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead 0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap 0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead 0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap 0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead 0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked 0x0000000000001000 492 1 ____________a____________________ anonymous 0x0000000000005008 2 0 ___U________a_b__________________ uptodate,anonymous,swapbacked 0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked 0x000000000000580c 1 0 __RU_______Ma_b__________________ referenced,uptodate,mmap,anonymous,swapbacked 0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 29 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 513968 2007 # ./page-types --raw --list --no-summary --bits reserved offset count flags 0 15 _____________________r___________ 31 4 _____________________r___________ 159 97 _____________________r___________ 4096 2067 _____________________r___________ 6752 2390 _____________________r___________ 9355 3 _____________________r___________ 9728 14526 _____________________r___________ This patch: Introduce PageHuge(), which identifies huge/gigantic pages by their dedicated compound destructor functions. Also move prep_compound_gigantic_page() to hugetlb.c and make __free_pages_ok() non-static. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: use alloc_pages_exact() in alloc_large_system_hash() to avoid duplicated ↵Mel Gorman
logic alloc_large_system_hash() has logic for freeing pages at the end of an excessively large power-of-two buffer that is a duplicate of what is in alloc_pages_exact(). This patch converts alloc_large_system_hash() to use alloc_pages_exact(). Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: sanity check order in the page allocator slow pathMel Gorman
Callers may speculatively call different allocators in order of preference trying to allocate a buffer of a given size. The order needed to allocate this may be larger than what the page allocator can normally handle. While the allocator mostly does the right thing, it should not direct reclaim or wakeup kswapd with a bogus order. This patch sanity checks the order in the slow path and returns NULL if it is too large. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: move free_page_mlock() to page_alloc.cKOSAKI Motohiro
Currently, free_page_mlock() is only called from page_alloc.c. Thus, we can move it to page_alloc.c. Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: slab: use nr_online_nodes to check for a NUMA platformMel Gorman
SLAB currently avoids checking a bitmap repeatedly by checking once and storing a flag. When the addition of nr_online_nodes as a cheaper version of num_online_nodes(), this check can be replaced by nr_online_nodes. (Christoph did a patch that this is lifted almost verbatim from) Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: use a pre-calculated value instead of num_online_nodes() in ↵Christoph Lameter
fast paths num_online_nodes() is called in a number of places but most often by the page allocator when deciding whether the zonelist needs to be filtered based on cpusets or the zonelist cache. This is actually a heavy function and touches a number of cache lines. This patch stores the number of online nodes at boot time and updates the value when nodes get onlined and offlined. The value is then used in a number of important paths in place of num_online_nodes(). [rientjes@google.com: do not override definition of node_set_online() with macro] Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: get the pageblock migratetype without disabling interruptsMel Gorman
Local interrupts are disabled when freeing pages to the PCP list. Part of that free checks what the migratetype of the pageblock the page is in but it checks this with interrupts disabled and interupts should never be disabled longer than necessary. This patch checks the pagetype with interrupts enabled with the impact that it is possible a page is freed to the wrong list when a pageblock changes type. As that block is now already considered mixed from an anti-fragmentation perspective, it's not of vital importance. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: update NR_FREE_PAGES only as necessaryMel Gorman
When pages are being freed to the buddy allocator, the zone NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page freeing, it's updated once per page. This retouches cache lines more than necessary. Update the counters one per per-cpu bulk free. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: use allocation flags as an index to the zone watermarkMel Gorman
ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether pages_min, pages_low or pages_high is used as the zone watermark when allocating the pages. Two branches in the allocator hotpath determine which watermark to use. This patch uses the flags as an array index into a watermark array that is indexed with WMARK_* defines accessed via helpers. All call sites that use zone->pages_* are updated to use the helpers for accessing the values and the array offsets for setting. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: do not check for compound pages during the page allocator ↵Nick Piggin
sanity checks A number of sanity checks are made on each page allocation and free including that the page count is zero. page_count() checks for compound pages and checks the count of the head page if true. However, in these paths, we do not care if the page is compound or not as the count of each tail page should also be zero. This patch makes two changes to the use of page_count() in the free path. It converts one check of page_count() to a VM_BUG_ON() as the count should have been unconditionally checked earlier in the free path. It also avoids checking for compound pages. [mel@csn.ul.ie: Wrote changelog] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: do not setup zonelist cache when there is only one nodeMel Gorman
There is a zonelist cache which is used to track zones that are not in the allowed cpuset or found to be recently full. This is to reduce cache footprint on large machines. On smaller machines, it just incurs cost for no gain. This patch only uses the zonelist cache when there are NUMA nodes. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: do not disable interrupts in free_page_mlock()Mel Gorman
free_page_mlock() tests and clears PG_mlocked using locked versions of the bit operations. If set, it disables interrupts to update counters and this happens on every page free even though interrupts are disabled very shortly afterwards a second time. This is wasteful. This patch splits what free_page_mlock() does. The bit check is still made. However, the update of counters is delayed until the interrupts are disabled and the non-lock version for clearing the bit is used. One potential weirdness with this split is that the counters do not get updated if the bad_page() check is triggered but a system showing bad pages is getting screwed already. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: do not call get_pageblock_migratetype() more than necessaryMel Gorman
get_pageblock_migratetype() is potentially called twice for every page free. Once, when being freed to the pcp lists and once when being freed back to buddy. When freeing from the pcp lists, it is known what the pageblock type was at the time of free so use it rather than rechecking. In low memory situations under memory pressure, this might skew anti-fragmentation slightly but the interference is minimal and decisions that are fragmenting memory are being made anyway. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: inline __rmqueue_fallback()Mel Gorman
__rmqueue_fallback() is in the slow path but has only one call site. Because there is only one call-site, this function can then be inlined without causing text bloat. On an x86-based config, it made no difference as the savings were padded out by NOP instructions. Milage varies but text will either decrease in size or remain static. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: inline buffered_rmqueue()Mel Gorman
buffered_rmqueue() is in the fast path so inline it. Because it only has one call site, this function can then be inlined without causing text bloat. On an x86-based config, it made no difference as the savings were padded out by NOP instructions. Milage varies but text will either decrease in size or remain static. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: inline __rmqueue_smallest()Mel Gorman
Inline __rmqueue_smallest by altering flow very slightly so that there is only one call site. Because there is only one call-site, this function can then be inlined without causing text bloat. On an x86-based config, this patch reduces text by 16 bytes. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: remove a branch by assuming __GFP_HIGH == ALLOC_HIGHMel Gorman
Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these flags are equal to each other, we can eliminate a branch. [akpm@linux-foundation.org: Suggested the hack] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: calculate the alloc_flags for allocation only oncePeter Zijlstra
Factor out the mapping between GFP and alloc_flags only once. Once factored out, it only needs to be calculated once but some care must be taken. [neilb@suse.de says] As the test: - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) - && !in_interrupt()) { - if (!(gfp_mask & __GFP_NOMEMALLOC)) { has been replaced with a slightly weaker one: + if (alloc_flags & ALLOC_NO_WATERMARKS) { Without care, this would allow recursion into the allocator via direct reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set but TF_MEMDIE callers are now allowed to directly reclaim where they would have been prevented in the past. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Neil Brown <neilb@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: calculate the migratetype for allocation only onceMel Gorman
GFP mask is converted into a migratetype when deciding which pagelist to take a page from. However, it is happening multiple times per allocation, at least once per zone traversed. Calculate it once. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: calculate the preferred zone for allocation only onceMel Gorman
get_page_from_freelist() can be called multiple times for an allocation. Part of this calculates the preferred_zone which is the first usable zone in the zonelist but the zone depends on the GFP flags specified at the beginning of the allocation call. This patch calculates preferred_zone once. It's safe to do this because if preferred_zone is NULL at the start of the call, no amount of direct reclaim or other actions will change the fact the allocation will fail. [akpm@linux-foundation.org: remove (void) casts] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>