aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2008-06-21Add return value to reserve_bootmem_node()Bernhard Walle
This patch changes the function reserve_bootmem_node() from void to int, returning -ENOMEM if the allocation fails. This fixes a build problem on x86 with CONFIG_KEXEC=y and CONFIG_NEED_MULTIPLE_NODES=y Signed-off-by: Bernhard Walle <bwalle@suse.de> Reported-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-20Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIPLinus Torvalds
KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit 557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed the ZERO_PAGE from the VM mappings, any users of get_user_pages() will generally now populate the VM with real empty pages needlessly. We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but since fault handling no longer uses ZERO_PAGE for new anonymous pages, we now need to handle that special case in follow_page() instead. In particular, the removal of ZERO_PAGE effectively removed the core file writing optimization where we would skip writing pages that had not been populated at all, and increased memory pressure a lot by allocating all those useless newly zeroed pages. This reinstates the optimization by making the unmapped PTE case the same as for a non-existent page table, which already did this correctly. While at it, this also fixes the XIP case for follow_page(), where the caller could not differentiate between the case of a page that simply could not be used (because it had no "struct page" associated with it) and a page that just wasn't mapped. We do that by simply returning an error pointer for pages that could not be turned into a "struct page *". The error is arbitrarily picked to be EFAULT, since that was what get_user_pages() already used for the equivalent IO-mapped page case. [ Also removed an impossible test for pte_offset_map_lock() failing: that's not how that function works ] Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-16Merge branch 'linus' into tracing/ftraceIngo Molnar
2008-06-12pagemap: pass mm into pagewalkersDave Hansen
We need this at least for huge page detection for now, because powerpc needs the vm_area_struct to be able to determine whether a virtual address is referring to a huge page (its pmd_huge() doesn't work). It might also come in handy for some of the other users. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12mm: fix incorrect variable type in do_try_to_free_pages()kosaki.motohiro@jp.fujitsu.com
"Smarter retry of costly-order allocations" patch series change behaver of do_try_to_free_pages(). But unfortunately ret variable type was unchanged. Thus an overflow is possible. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12nommu: Correct kobjsize() page validity checks.Paul Mundt
This implements a few changes on top of the recent kobjsize() refactoring introduced by commit 6cfd53fc03670c7a544a56d441eb1a6cc800d72b. As Christoph points out: virt_to_head_page cannot return NULL. virt_to_page also does not return NULL. pfn_valid() needs to be used to figure out if a page is valid. Otherwise the page struct reference that was returned may have PageReserved() set to indicate that it is not a valid page. As discussed further in the thread, virt_addr_valid() is the preferable way to validate the object pointer in this case. In addition to fixing up the reserved page case, it also has the benefit of encapsulating the hack introduced by commit 4016a1390d07f15b267eecb20e76a48fd5c524ef on the impacted platforms, allowing us to get rid of the extra checking in kobjsize() for the platforms that don't perform this type of bizarre memory_end abuse (every nommu platform that isn't blackfin). If blackfin decides to get in line with every other platform and use PageReserved for the DMA pages in question, kobjsize() will also continue to work fine. It also turns out that compound_order() will give us back 0-order for non-head pages, so we can get rid of the PageCompound check and just use compound_order() directly. Clean that up while we're at it. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Reviewed-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-10mm, x86: shrink_active_range() should check allYinghai Lu
Now we are using register_e820_active_regions() instead of add_active_range() directly. So end_pfn could be different between the value in early_node_map to node_end_pfn. So we need to make shrink_active_range() smarter. shrink_active_range() is a generic MM function in mm/page_alloc.c but it is only used on 32-bit x86. Should we move it back to some file in arch/x86? Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-09mm: Minor clean-up of page flags in mm/page_alloc.cRuss Anderson
Minor source code cleanup of page flags in mm/page_alloc.c. Move the definition of the groups of bits to page-flags.h. The purpose of this clean up is that the next patch will conditionally add a page flag to the groups. Doing that in a header file is cleaner than adding #ifdefs to the C code. Signed-off-by: Russ Anderson <rja@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06nommu: fix kobjsize() for SLOB and SLUBPaul Mundt
kobjsize() has been abusing page->index as a method for sorting out compound order, which blows up both for page cache pages, and SLOB's reuse of the index in struct slob_page. Presently we are not able to accurately size arbitrary pointers that don't come from kmalloc(), so the best we can do is sort out the compound order from the head page if it's a compound page, or default to 0-order if it's impossible to ksize() the object. Obviously this leaves quite a bit to be desired in terms of object sizing accuracy, but the behaviour is unchanged over the existing implementation, while fixing the page->index oopses originally reported here: http://marc.info/?l=linux-mm&m=121127773325245&w=2 Accuracy could also be improved by having SLUB and SLOB both set PG_slab on ksizeable pages, rather than just handling the __GFP_COMP cases irregardless of the PG_slab setting, as made possibly with Pekka's patches: http://marc.info/?l=linux-kernel&m=121139439900534&w=2 http://marc.info/?l=linux-kernel&m=121139440000537&w=2 http://marc.info/?l=linux-kernel&m=121139440000540&w=2 This is primarily a bugfix for nommu systems for 2.6.26, with the aim being to gradually kill off kobjsize() and its particular brand of object abuse entirely. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06brk: make sys_brk() honor COMPAT_BRK when computing lower boundJiri Kosina
Fix a regression introduced by commit 4cc6028d4040f95cdb590a87db478b42b8be0508 Author: Jiri Kosina <jkosina@suse.cz> Date: Wed Feb 6 22:39:44 2008 +0100 brk: check the lower bound properly The check in sys_brk() on minimum value the brk might have must take CONFIG_COMPAT_BRK setting into account. When this option is turned on (i.e. we support ancient legacy binaries, e.g. libc5-linked stuff), the lower bound on brk value is mm->end_code, otherwise the brk start is allowed to be arbitrarily shifted. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06hugetlb: fix lockdep errorNick Piggin
============================================= [ INFO: possible recursive locking detected ] 2.6.26-rc4 #30 --------------------------------------------- heap-overflow/2250 is trying to acquire lock: (&mm->page_table_lock){--..}, at: [<c0000000000cf2e8>] .copy_hugetlb_page_range+0x108/0x280 but task is already holding lock: (&mm->page_table_lock){--..}, at: [<c0000000000cf2dc>] .copy_hugetlb_page_range+0xfc/0x280 other info that might help us debug this: 3 locks held by heap-overflow/2250: #0: (&mm->mmap_sem){----}, at: [<c000000000050e44>] .dup_mm+0x134/0x410 #1: (&mm->mmap_sem/1){--..}, at: [<c000000000050e54>] .dup_mm+0x144/0x410 #2: (&mm->page_table_lock){--..}, at: [<c0000000000cf2dc>] .copy_hugetlb_page_range+0xfc/0x280 stack backtrace: Call Trace: [c00000003b2774e0] [c000000000010ce4] .show_stack+0x74/0x1f0 (unreliable) [c00000003b2775a0] [c0000000003f10e0] .dump_stack+0x20/0x34 [c00000003b277620] [c0000000000889bc] .__lock_acquire+0xaac/0x1080 [c00000003b277740] [c000000000089000] .lock_acquire+0x70/0xb0 [c00000003b2777d0] [c0000000003ee15c] ._spin_lock+0x4c/0x80 [c00000003b277870] [c0000000000cf2e8] .copy_hugetlb_page_range+0x108/0x280 [c00000003b277950] [c0000000000bcaa8] .copy_page_range+0x558/0x790 [c00000003b277ac0] [c000000000050fe0] .dup_mm+0x2d0/0x410 [c00000003b277ba0] [c000000000051d24] .copy_process+0xb94/0x1020 [c00000003b277ca0] [c000000000052244] .do_fork+0x94/0x310 [c00000003b277db0] [c000000000011240] .sys_clone+0x60/0x80 [c00000003b277e30] [c0000000000078c4] .ppc_clone+0x8/0xc Fix is the same way that mm/memory.c copy_page_range does the lockdep annotation. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-03x86, numa, 32-bit: print out debug info on all kvasYinghai Lu
also fix the print out of node_remap_end_vaddr Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: ksize() abuse checks slob: Fix to return wrong pointer
2008-05-24memory hotplug: fix early allocation handlingHeiko Carstens
Trying to add memory via add_memory() from within an initcall function results in bootmem alloc of 163840 bytes failed! Kernel panic - not syncing: Out of memory This is caused by zone_wait_table_init() which uses system_state to decide if it should use the bootmem allocator or not. When initcalls are handled the system_state is still SYSTEM_BOOTING but the bootmem allocator doesn't work anymore. So the allocation will fail. To fix this use slab_is_available() instead as indicator like we do it everywhere else. [akpm@linux-foundation.org: coding-style fix] Reviewed-by: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24zonelists: handle a node zonelist with no applicable entriesAndy Whitcroft
When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing panics when trying node local allocations: BUG: unable to handle kernel NULL pointer dereference at 0000034c IP: [<c1042507>] get_page_from_freelist+0x4a/0x18e *pdpt = 00000000013a7001 *pde = 0000000000000000 Oops: 0000 [#1] SMP Modules linked in: Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82) EIP: 0060:[<c1042507>] EFLAGS: 00010282 CPU: 0 EIP is at get_page_from_freelist+0x4a/0x18e EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000 ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000) Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0 f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000 00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0 Call Trace: [<c10426e4>] __alloc_pages_internal+0x99/0x378 [<c10429ca>] __alloc_pages+0x7/0x9 [<c105e0e8>] kmem_getpages+0x66/0xef [<c105ec55>] cache_grow+0x8f/0x123 [<c105f117>] ____cache_alloc_node+0xb9/0xe4 [<c105f427>] kmem_cache_alloc_node+0x92/0xd2 [<c122118c>] setup_cpu_cache+0xaf/0x177 [<c105e6ca>] kmem_cache_create+0x2c8/0x353 [<c13853af>] kmem_cache_init+0x1ce/0x3ad [<c13755c5>] start_kernel+0x178/0x1ee This occurs when we are scanning the zonelists looking for a ZONE_NORMAL page. In this system there is only ZONE_DMA and ZONE_NORMAL memory on node 0, all other nodes are mapped above 4GB physical. Here is a dump of the zonelists from this system: zonelists pgdat=c1400000 0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0 1: c14006c0:2 c1400360:1 c1400000:0 zonelists pgdat=f7c00000 0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0 1: f7c006c0:2 zonelists pgdat=f7e00000 0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0 1: f7e006c0:2 When performing a node local allocation we call get_page_from_freelist() looking for a page. It in turn calls first_zones_zonelist() which returns a preferred_zone. Where there are no applicable zones this will be NULL. However we use this unconditionally, leading to this panic. Where there are no applicable zones there is no possibility of a successful allocation, so simply fail the allocation. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24mm: fix atomic_t overflow in vmAlan Cox
The atomic_t type is 32bit but a 64bit system can have more than 2^32 pages of virtual address space available. Without this we overflow on ludicrously large mappings Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24mm: don't drop a partial page in a zone's memory map sizeJohannes Weiner
In a zone's present pages number, account for all pages occupied by the memory map, including a partial. Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24mm: allow pfnmap ->fault()sNick Piggin
Take out an assertion to allow ->fault handlers to service PFNMAP regions. This is required to reimplement .nopfn handlers with .fault handlers and subsequently remove nopfn. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Jes Sorensen <jes@sgi.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-23ftrace: limit trace entriesSteven Rostedt
Currently there is no protection from the root user to use up all of memory for trace buffers. If the root user allocates too many entries, the OOM killer might start kill off all tasks. This patch adds an algorith to check the following condition: pages_requested > (freeable_memory + current_trace_buffer_pages) / 4 If the above is met then the allocation fails. The above prevents more than 1/4th of freeable memory from being used by trace buffers. To determine the freeable_memory, I made determine_dirtyable_memory in mm/page-writeback.c global. Special thanks goes to Peter Zijlstra for suggesting the above calculation. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-22slub: ksize() abuse checksPekka Enberg
Add a WARN_ON for pages that don't have PageSlab nor PageCompound set to catch the worst abusers of ksize() in the kernel. Acked-by: Christoph Lameter <clameter@sgi.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-05-20mm: bdi: fix race in bdi_class device creationGreg Kroah-Hartman
There is a race from when a device is created with device_create() and then the drvdata is set with a call to dev_set_drvdata() in which a sysfs file could be open, yet the drvdata will be NULL, causing all sorts of bad things to happen. This patch fixes the problem by using the new function, device_create_vargs(). Many thanks to Arthur Jones <ajones@riverbed.com> for reporting the bug, and testing patches out. Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Arthur Jones <ajones@riverbed.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-05-19slob: Fix to return wrong pointerMinChan Kim
Although slob_alloc return NULL, __kmalloc_node returns NULL + align. Because align always can be changed, it is very hard for debugging problem of no page if it don't return NULL. We have to return NULL in case of no page. [penberg@cs.helsinki.fi: fix formatting as suggested by Matt.] Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: MinChan Kim <minchan.kim@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-05-14memory_hotplug: always initialize pageblock bitmapHeiko Carstens
Trying to online a new memory section that was added via memory hotplug sometimes results in crashes when the new pages are added via __free_page. Reason for that is that the pageblock bitmap isn't initialized and hence contains random stuff. That means that get_pageblock_migratetype() returns also random stuff and therefore list_add(&page->lru, &zone->free_area[order].free_list[migratetype]); in __free_one_page() tries to do a list_add to something that isn't even necessarily a list. This happens since 86051ca5eaf5e560113ec7673462804c54284456 ("mm: fix usemap initialization") which makes sure that the pageblock bitmap gets only initialized for pages present in a zone. Unfortunately for hot-added memory the zones "grow" after the memmap and the pageblock memmap have been initialized. Which means that the new pages have an unitialized bitmap. To solve this the calls to grow_zone_span() and grow_pgdat_span() are moved to __add_zone() just before the initialization happens. The patch also moves the two functions since __add_zone() is the only caller and I didn't want to add a forward declaration. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14mprotect: prevent alteration of the PAT bitsVenki Pallipadi
There is a defect in mprotect, which lets the user change the page cache type bits by-passing the kernel reserve_memtype and free_memtype wrappers. Fix the problem by not letting mprotect change the PAT bits. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14memory_hotplug: check for walk_memory_resource() failure in online_pages()Geoff Levand
Add a check to online_pages() to test for failure of walk_memory_resource(). This fixes a condition where a failure of walk_memory_resource() can lead to online_pages() returning success without the requested pages being onlined. Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Keith Mannthey <kmannth@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14memory hotplug: memmap_init_zone called twiceHeiko Carstens
__add_zone calls memmap_init_zone twice if memory gets attached to an empty zone. Once via init_currently_empty_zone and once explictly right after that call. Looks like this is currently not a bug, however the call is superfluous and might lead to subtle bugs if memmap_init_zone gets changed. So make sure it is called only once. Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14mm: fix infinite loop in filemap_faultMiklos Szeredi
filemap_fault will go into an infinite loop if ->readpage() fails asynchronously. AFAICS the bug was introduced by this commit, which removed the wait after the final readpage: commit d00806b183152af6d24f46f0c33f14162ca1262a Author: Nick Piggin <npiggin@suse.de> Date: Thu Jul 19 01:46:57 2007 -0700 mm: fix fault vs invalidate race for linear mappings Fix by reintroducing the wait_on_page_locked() after ->readpage() to make sure the page is up-to-date before jumping back to the beginning of the function. I've noticed this while testing nfs exporting on fuse. The patch fixes it. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14fix SMP data race in pagetable setup vs walkingNick Piggin
There is a possible data race in the page table walking code. After the split ptlock patches, it actually seems to have been introduced to the core code, but even before that I think it would have impacted some architectures (powerpc and sparc64, at least, walk the page tables without taking locks eg. see find_linux_pte()). The race is as follows: The pte page is allocated, zeroed, and its struct page gets its spinlock initialized. The mm-wide ptl is then taken, and then the pte page is inserted into the pagetables. At this point, the spinlock is not guaranteed to have ordered the previous stores to initialize the pte page with the subsequent store to put it in the page tables. So another Linux page table walker might be walking down (without any locks, because we have split-leaf-ptls), and find that new pte we've inserted. It might try to take the spinlock before the store from the other CPU initializes it. And subsequently it might read a pte_t out before stores from the other CPU have cleared the memory. There are also similar races in higher levels of the page tables. They obviously don't involve the spinlock, but could see uninitialized memory. Arch code and hardware pagetable walkers that walk the pagetables without locks could see similar uninitialized memory problems, regardless of whether split ptes are enabled or not. I prefer to put the barriers in core code, because that's where the higher level logic happens, but the page table accessors are per-arch, and open-coding them everywhere I don't think is an option. I'll put the read-side barriers in alpha arch code for now (other architectures perform data-dependent loads in order). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13mm/pdflush.c: merge the same code in two pathDenis Cheng
Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13make vmstat cpu-unplug safeKOSAKI Motohiro
When accessing cpu_online_map, we should prevent dynamic changing of cpu_online_map by get_online_cpus(). Unfortunately, all_vm_events() doesn't do that. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Christoph Lameter <clameter@sgi.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-08Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: Revert "relay: fix splice problem" docbook: fix bio missing parameter block: use unitialized_var() in bio_alloc_bioset() block: avoid duplicate calls to get_part() in disk stat code cfq-iosched: make io priorities inherit CPU scheduling class as well as nice block: optimize generic_unplug_device() block: get rid of likely/unlikely predictions in merge logic vfs: splice remove_suid() cleanup cfq-iosched: fix RCU race in the cfq io_context destructor handling block: adjust tagging function queue bit locking block: sysfs store function needs to grab queue_lock and use queue_flag_*()
2008-05-08slub: fix atomic usage in any_slab_objects()Benjamin Herrenschmidt
any_slab_objects() does an atomic_read on an atomic_long_t, this fixes it to use atomic_long_read instead. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-07vfs: splice remove_suid() cleanupMiklos Szeredi
generic_file_splice_write() duplicates remove_suid() just because it doesn't hold i_mutex. But it grabs i_mutex inside splice_from_pipe() anyway, so this is rather pointless. Move locking to generic_file_splice_write() and call remove_suid() and __splice_from_pipe() instead. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-06x86: fix PAE pmd_bad bootup warningHugh Dickins
Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32. That came from 9fc34113f6880b215cbea4e7017fc818700384c2 x86: debug pmd_bad(); but we understand now that the typecasting was wrong for PAE in the previous version: pagetable pages above 4GB looked bad and stopped Arjan from booting. And revert that cded932b75ab0a5f9181ee3da34a0a488d1a14fd x86: fix pmd_bad and pud_bad to support huge pages. It was the wrong way round: we shouldn't weaken every pmd_bad and pud_bad check to let huge pages slip through - in part they check that we _don't_ have a huge page where it's not expected. Put the x86 pmd_bad() and pud_bad() definitions back to what they have long been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking junk in the upper word is good; and x86_64 should follow x86_32's stricter comparison, to stop thinking any subset of required bits is good); but that should be a later patch. Fix Hans' good observation that follow_page() will never find pmd_huge() because that would have already failed the pmd_bad test: test pmd_huge in between the pmd_none and pmd_bad tests. Tighten x86's pmd_huge() check? No, once it's a hugepage entry, it can get quite far from a good pmd: for example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits. However... though follow_page() contains this and another test for huge pages, so it's nice to keep it working on them, where does it actually get called on a huge page? get_user_pages() checks is_vm_hugetlb_page(vma) to to call alternative hugetlb processing, as does unmap_vmas() and others. Signed-off-by: Hugh Dickins <hugh@veritas.com> Earlier-version-tested-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jeff Chua <jeff.chua.linux@gmail.com> Cc: Hans Rosenfeld <hans.rosenfeld@amd.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-02slub: #ifdef simplificationChristoph Lameter
If we make SLUB_DEBUG depend on SYSFS then we can simplify some #ifdefs and avoid others. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-05-02slub: Whitespace cleanup and use of strict_strtoulChristoph Lameter
Fix some issues with wrapping and use strict_strtoul to make parameter passing from sysfs safer. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-05-01memcg: simple stats for memory resource controllerBalaji Rao
Implement trivial statistics for the memory resource controller. Signed-off-by: Balaji Rao <balajirrao@gmail.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01docbook: fix vmalloc missing parameter notationRandy Dunlap
Fix vmalloc kernel-doc warning: Warning(linux-2.6.25-git14//mm/vmalloc.c:555): No description found for parameter 'caller' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01remove div_long_long_remRoman Zippel
x86 is the only arch right now, which provides an optimized for div_long_long_rem and it has the downside that one has to be very careful that the divide doesn't overflow. The API is a little akward, as the arguments for the unsigned divide are signed. The signed version also doesn't handle a negative divisor and produces worse code on 64bit archs. There is little incentive to keep this API alive, so this converts the few users to the new API. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30revert "memory hotplug: allocate usemap on the section with pgdat"Andrew Morton
This: commit 86f6dae1377523689bd8468fed2f2dd180fc0560 Author: Yasunori Goto <y-goto@jp.fujitsu.com> Date: Mon Apr 28 02:13:33 2008 -0700 memory hotplug: allocate usemap on the section with pgdat Usemaps are allocated on the section which has pgdat by this. Because usemap size is very small, many other sections usemaps are allocated on only one page. If a section has usemap, it can't be removed until removing other sections. This dependency is not desirable for memory removing. Pgdat has similar feature. When a section has pgdat area, it must be the last section for removing on the node. So, if section A has pgdat and section B has usemap for section A, Both sections can't be removed due to dependency each other. To solve this issue, this patch collects usemap on same section with pgdat. If other sections doesn't have any dependency, this section will be able to be removed finally. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> broke davem's sparc64 bootup. Revert it while we work out what went wrong. Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: fix warning on memory offlineNick Piggin
KAMEZAWA Hiroyuki found a warning message in the buffer dirtying code that is coming from page migration caller. WARNING: at fs/buffer.c:720 __set_page_dirty+0x330/0x360() Call Trace: [<a000000100015220>] show_stack+0x80/0xa0 [<a000000100015270>] dump_stack+0x30/0x60 [<a000000100089ed0>] warn_on_slowpath+0x90/0xe0 [<a0000001001f8b10>] __set_page_dirty+0x330/0x360 [<a0000001001ffb90>] __set_page_dirty_buffers+0xd0/0x280 [<a00000010012fec0>] set_page_dirty+0xc0/0x260 [<a000000100195670>] migrate_page_copy+0x5d0/0x5e0 [<a000000100197840>] buffer_migrate_page+0x2e0/0x3c0 [<a000000100195eb0>] migrate_pages+0x770/0xe00 What was happening is that migrate_page_copy wants to transfer the PG_dirty bit from old page to new page, so what it would do is set_page_dirty(newpage). However set_page_dirty() is used to set the entire page dirty, wheras in this case, only part of the page was dirty, and it also was not uptodate. Marking the whole page dirty with set_page_dirty would lead to corruption or unresolvable conditions -- a dirty && !uptodate page and dirty && !uptodate buffers. Possibly we could just ClearPageDirty(oldpage); SetPageDirty(newpage); however in the interests of keeping the change minimal... Signed-off-by: Nick Piggin <npiggin@suse.de> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: remove remaining __FUNCTION__ occurrencesHarvey Harrison
__FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30infrastructure to debug (dynamic) objectsThomas Gleixner
We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: Add NR_WRITEBACK_TEMP counterMiklos Szeredi
Fuse will use temporary buffers to write back dirty data from memory mappings (normal writes are done synchronously). This is needed, because there cannot be any guarantee about the time in which a write will complete. By using temporary buffers, from the MM's point if view the page is written back immediately. If the writeout was due to memory pressure, this effectively migrates data from a full zone to a less full zone. This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used as temporary buffers. [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: export bdi_writeout_inc()Miklos Szeredi
Fuse needs this for writable mmap support. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: add separate writeback accounting capabilityMiklos Szeredi
Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is set, then don't update the per-bdi writeback stats from test_set_page_writeback() and test_clear_page_writeback(). Misc cleanups: - convert bdi_cap_writeback_dirty() and friends to static inline functions - create a flag that includes all three dirty/writeback related flags, since almst all users will want to have them toghether Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: move statistics to debugfsMiklos Szeredi
Move BDI statistics to debugfs: /sys/kernel/debug/bdi/<bdi>/stats Use postcore_initcall() to initialize the sysfs class and debugfs, because debugfs is initialized in core_initcall(). Update descriptions in ABI documentation. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: allow setting a maximum for the bdi dirty limitPeter Zijlstra
Add "max_ratio" to /sys/class/bdi. This indicates the maximum percentage of the global dirty threshold allocated to this bdi. [mszeredi@suse.cz] - fix parsing in max_ratio_store(). - export bdi_set_max_ratio() to modules - limit bdi_dirty with bdi->max_ratio - document new sysfs attribute Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: allow setting a minimum for the bdi dirty limitPeter Zijlstra
Under normal circumstances each device is given a part of the total write-back cache that relates to its current avg writeout speed in relation to the other devices. min_ratio - allows one to assign a minimum portion of the write-back cache to a particular device. This is useful in situations where you might want to provide a minimum QoS. (One request for this feature came from flash based storage people who wanted to avoid writing out at all costs - they of course needed some pdflush hacks as well) max_ratio - allows one to assign a maximum portion of the dirty limit to a particular device. This is useful in situations where you want to avoid one device taking all or most of the write-back cache. Eg. an NFS mount that is prone to get stuck, or a FUSE mount which you don't trust to play fair. Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of the global dirty threshold allocated to this bdi. [mszeredi@suse.cz] - fix parsing in min_ratio_store() - document new sysfs attribute Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: export BDI attributes in sysfsPeter Zijlstra
Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object. This allows us to see and set the various BDI specific variables. In particular this properly exposes the read-ahead window for all relevant users and /sys/block/<block>/queue/read_ahead_kb should be deprecated. With patient help from Kay Sievers and Greg KH [mszeredi@suse.cz] - split off NFS and FUSE changes into separate patches - document new sysfs attributes under Documentation/ABI - do bdi_class_init as a core_initcall, otherwise the "default" BDI won't be initialized - remove bdi_init_fmt macro, it's not used very much [akpm@linux-foundation.org: fix ia64 warning] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Greg KH <greg@kroah.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>