aboutsummaryrefslogtreecommitdiff
path: root/mm/memory.c
AgeCommit message (Collapse)Author
2007-02-11[PATCH] Numerous fixes to kernel-doc info in source files.Robert P. J. Day
A variety of (mostly) innocuous fixes to the embedded kernel-doc content in source files, including: * make multi-line initial descriptions single line * denote some function names, constants and structs as such * change erroneous opening '/*' to '/**' in a few places * reword some text for clarity Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] do not disturb page referenced state when unmapping memory rangeKen Chen
When kernel unmaps an address range, it needs to transfer PTE state into page struct. Currently, kernel transfer access bit via mark_page_accessed(). The call to mark_page_accessed in the unmap path doesn't look logically correct. At unmap time, calling mark_page_accessed will causes page LRU state to be bumped up one step closer to more recently used state. It is causing quite a bit headache in a scenario when a process creates a shmem segment, touch a whole bunch of pages, then unmaps it. The unmapping takes a long time because mark_page_accessed() will start moving pages from inactive to active list. I'm not too much concerned with moving the page from one list to another in LRU. Sooner or later it might be moved because of multiple mappings from various processes. But it just doesn't look logical that when user asks a range to be unmapped, it's his intention that the process is no longer interested in these pages. Moving those pages to active list (or bumping up a state towards more active) seems to be an over reaction. It also prolongs unmapping latency which is the core issue I'm trying to solve. As suggested by Peter, we should still preserve the info on pte young pages, but not more. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ken Chen <kenchen@google.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] page_mkwrite caller race fixHugh Dickins
After do_wp_page has tested page_mkwrite, it must release old_page after acquiring page table lock, not before: at some stage that ordering got reversed, leaving a (very unlikely) window in which old_page might be truncated, freed, and reused in the same position. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26[PATCH] i386 vDSO: use VM_ALWAYSDUMPRoland McGrath
This patch fixes core dumps to include the vDSO vma, which is left out now. It removes the special-case core writing macros, which were not doing the right thing for the vDSO vma anyway. Instead, it uses VM_ALWAYSDUMP in the vma; there is no need for the fixmap page to be installed. It handles the CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from get_gate_vma after real vmas in the same way the /proc/PID/maps code does. This changes core dumps so they no longer include the non-PT_LOAD phdrs from the vDSO. I made the change to add them in the first place, but in turned out that nothing ever wanted them there since the advent of NT_AUXV. It's cleaner to leave them out, and just let the phdrs inside the vDSO image speak for themselves. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26[PATCH] Fix gate_vma.vm_flagsRoland McGrath
This patch fixes the initialization of gate_vma.vm_flags and gate_vma.vm_page_prot to reflect reality. This makes the "[vdso]" line in /proc/PID/maps correctly show r-xp instead of ---p, when gate_vma is used (CONFIG_COMPAT_VDSO on i386). Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-08[ARM] pass vma for flush_anon_page()Russell King
Since get_user_pages() may be used with processes other than the current process and calls flush_anon_page(), flush_anon_page() has to cope in some way with non-current processes. It may not be appropriate, or even desirable to flush a region of virtual memory cache in the current process when that is different to the process that we want the flush to occur for. Therefore, pass the vma into flush_anon_page() so that the architecture can work out whether the 'vmaddr' is for the current process or not. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-22[PATCH] mm: more rmap debuggingNick Piggin
Add more debugging in the rmap code in an attempt to locate to source of the occasional "mapcount went negative" assertions. Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13[PATCH] Pass vma argument to copy_user_highpage().Atsushi Nemoto
To allow a more effective copy_user_highpage() on certain architectures, a vma argument is added to the function and cow_user_page() allowing the implementation of these functions to check for the VM_EXEC bit. The main part of this patch was originally written by Ralf Baechle; Atushi Nemoto did the the debugging. Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10[PATCH] read_zero_pagealigned() locking fixHugh Dickins
Ramiro Voicu hits the BUG_ON(!pte_none(*pte)) in zeromap_pte_range: kernel bugzilla 7645. Right: read_zero_pagealigned uses down_read of mmap_sem, but another thread's racing read of /dev/zero, or a normal fault, can easily set that pte again, in between zap_page_range and zeromap_page_range getting there. It's been wrong ever since 2.4.3. The simple fix is to use down_write instead, but that would serialize reads of /dev/zero more than at present: perhaps some app would be badly affected. So instead let zeromap_page_range return the error instead of BUG_ON, and read_zero_pagealigned break to the slower clear_user loop in that case - there's no need to optimize for it. Use -EEXIST for when a pte is found: BUG_ON in mmap_zero (the other user of zeromap_page_range), though it really isn't interesting there. And since mmap_zero wants -EAGAIN for out-of-memory, the zeromaps better return that than -ENOMEM. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Ramiro Voicu: <Ramiro.Voicu@cern.ch> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbolsAdrian Bunk
In time for 2.6.20, we can get rid of this junk. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] grab swap token reorderedAshwin Chaugule
Make sure the contention for the token happens _before_ any read-in and kicks the swap-token algo only when the VM is under pressure. Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20[PATCH] mm: D-cache aliasing issue in cow_user_pageDmitriy Monakhov
--=-=-= from mm/memory.c: 1434 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va) 1435 { 1436 /* 1437 * If the source page was a PFN mapping, we don't have 1438 * a "struct page" for it. We do a best-effort copy by 1439 * just copying from the original user address. If that 1440 * fails, we just zero-fill it. Live with it. 1441 */ 1442 if (unlikely(!src)) { 1443 void *kaddr = kmap_atomic(dst, KM_USER0); 1444 void __user *uaddr = (void __user *)(va & PAGE_MASK); 1445 1446 /* 1447 * This really shouldn't fail, because the page is there 1448 * in the page tables. But it might just be unreadable, 1449 * in which case we just give up and fill the result with 1450 * zeroes. 1451 */ 1452 if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) 1453 memset(kaddr, 0, PAGE_SIZE); 1454 kunmap_atomic(kaddr, KM_USER0); #### D-cache have to be flushed here. #### It seems it is just forgotten. 1455 return; 1456 1457 } 1458 copy_user_highpage(dst, src, va); #### Ok here. flush_dcache_page() called from this func if arch need it 1459 } Following is the patch fix this issue: Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-06[PATCH] page fault retry with NOPAGE_REFAULTBenjamin Herrenschmidt
Add a way for a no_page() handler to request a retry of the faulting instruction. It goes back to userland on page faults and just tries again in get_user_pages(). I added a cond_resched() in the loop in that later case. The problem I have with signal and spufs is an actual bug affecting apps and I don't see other ways of fixing it. In addition, we are having issues with infiniband and 64k pages (related to the way the hypervisor deals with some HV cards) that will require us to muck around with the MMU from within the IB driver's no_page() (it's a pSeries specific driver) and return to the caller the same way using NOPAGE_REFAULT. And to add to this, the graphics folks have been following a new approach of memory management that involves transparently swapping objects between video ram and main meory. To do that, they need installing PTEs from a no_page() handler as well and that also requires returning with NOPAGE_REFAULT. (For the later, they are currently using io_remap_pfn_range to install one PTE from no_page() which is a bit racy, we need to add a check for the PTE having already been installed afer taking the lock, but that's ok, they are only at the proof-of-concept stage. I'll send a patch adding a "clean" function to do that, we can use that from spufs too and get rid of the sparsemem hacks we do to create struct page for SPEs. Basically, that provides a generic solution for being able to have no_page() map hardware devices, which is something that I think sound driver folks have been asking for some time too). All of these things depend on having the NOPAGE_REFAULT exit path from no_page() handlers. Signed-off-by: Benjamin Herrenchmidt <benh@kernel.crashing.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01[PATCH] paravirt: lazy mmu mode hooks.patchZachary Amsden
Implement lazy MMU update hooks which are SMP safe for both direct and shadow page tables. The idea is that PTE updates and page invalidations while in lazy mode can be batched into a single hypercall. We use this in VMI for shadow page table synchronization, and it is a win. It also can be used by PPC and for direct page tables on Xen. For SMP, the enter / leave must happen under protection of the page table locks for page tables which are being modified. This is because otherwise, you end up with stale state in the batched hypercall, which other CPUs can race ahead of. Doing this under the protection of the locks guarantees the synchronization is correct, and also means that spurious faults which are generated during this window by remote CPUs are properly handled, as the page fault handler must re-check the PTE under protection of the same lock. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01[PATCH] paravirt: pte clear not presentZachary Amsden
Change pte_clear_full to a more appropriately named pte_clear_not_present, allowing optimizations when not-present mapping changes need not be reflected in the hardware TLB for protected page table modes. There is also another case that can use it in the fremap code. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01[PATCH] paravirt: remove read hazard from cowZachary Amsden
We don't want to read PTEs directly like this after they have been modified, as a lazy MMU implementation of direct page tables may not have written the updated PTE back to memory yet. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29[PATCH] mm: fix a race condition under SMC + COWSiddha, Suresh B
Failing context is a multi threaded process context and the failing sequence is as follows. One thread T0 doing self modifying code on page X on processor P0 and another thread T1 doing COW (breaking the COW setup as part of just happened fork() in another thread T2) on the same page X on processor P1. T0 doing SMC can endup modifying the new page Y (allocated by the T1 doing COW on P1) but because of different I/D TLB's, P0 ITLB will not see the new mapping till the flush TLB IPI from P1 is received. During this interval, if T0 executes the code created by SMC it can result in an app error (as ITLB still points to old page X and endup executing the content in page X rather than using the content in page Y). Fix this issue by first clearing the PTE and flushing it, before updating it with new entry. Hugh sayeth: I was a bit sceptical, in the habit of thinking that Self Modifying Code must look such issues itself: but I guess there's nothing it can do to avoid this one. Fair enough, what you're changing it to is pretty much what powerpc and s390 were already doing, and is a more robust way of proceeding, consistent with how ptes are set everywhere else. The ptep_clear_flush is a bit heavy-handed (it's anxious to return the pte that was atomically cleared), but we'd have to wander through lots of arches to get the right minimal behaviour. It'd also be nice to eliminate ptep_establish completely, now only used to define other macros/inlines: it always seemed obfuscation to me, what you've got there now is clearer. Let's put those cleanups on a TODO list. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27[PATCH] NOMMU: Check that access_process_vm() has a valid targetDavid Howells
Check that access_process_vm() is accessing a valid mapping in the target process. This limits ptrace() accesses and accesses through /proc/<pid>/maps to only those regions actually mapped by a program. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27[PATCH] do_no_pfn()Jes Sorensen
Implement do_no_pfn() for handling mapping of memory without a struct page backing it. This avoids creating fake page table entries for regions which are not backed by real memory. This feature is used by the MSPEC driver and other users, where it is highly undesirable to have a struct page sitting behind the page (for instance if the page is accessed in cached mode via the struct page in parallel to the the driver accessing it uncached, which can result in data corruption on some architectures, such as ia64). This version uses specific NOPFN_{SIGBUS,OOM} return values, rather than expect all negative pfn values would be an error. It also bugs on cow mappings as this would not work with the VM. [akpm@osdl.org: micro-optimise] Signed-off-by: Jes Sorensen <jes@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] Add kerneldocs for some functions in mm/memory.cRolf Eike Beer
These functions are already documented quite well with long comments. Now add kerneldoc style header to make this turn up in everyones favorite doc format. Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] mm: fixup do_wp_page()Peter Zijlstra
Wrt. the recent modifications in do_wp_page() Hugh Dickins pointed out: "I now realize it's right to the first order (normal case) and to the second order (ptrace poke), but not to the third order (ptrace poke anon page here to be COWed - perhaps can't occur without intervening mprotects)." This patch restores the old COW behaviour for anonymous pages. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] mm: balance dirty pagesPeter Zijlstra
Now that we can detect writers of shared mappings, throttle them. Avoids OOM by surprise. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] mm: tracking shared dirty pagesPeter Zijlstra
Tracking of dirty pages in shared writeable mmap()s. The idea is simple: write protect clean shared writeable pages, catch the write-fault, make writeable and set dirty. On page write-back clean all the PTE dirty bits and write protect them once again. The implementation is a tad harder, mainly because the default backing_dev_info capabilities were too loosely maintained. Hence it is not enough to test the backing_dev_info for cap_account_dirty. The current heuristic is as follows, a VMA is eligible when: - its shared writeable (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED) - it is not a 'special' mapping (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0 - the backing_dev_info is cap_account_dirty mapping_cap_account_dirty(vma->vm_file->f_mapping) - f_op->mmap() didn't change the default page protection Page from remap_pfn_range() are explicitly excluded because their COW semantics are already horrid enough (see vm_normal_page() in do_wp_page()) and because they don't have a backing store anyway. mprotect() is taught about the new behaviour as well. However it overrides the last condition. Cleaning the pages on write-back is done with page_mkclean() a new rmap call. It can be called on any page, but is currently only implemented for mapped pages, if the page is found the be of a VMA that accounts dirty pages it will also wrprotect the PTE. Finally, in fs/buffers.c:try_to_free_buffers(); remove clear_page_dirty() from under ->private_lock. This seems to be safe, since ->private_lock is used to serialize access to the buffers, not the page itself. This is needed because clear_page_dirty() will call into page_mkclean() and would thereby violate locking order. [dhowells@redhat.com: Provide a page_mkclean() implementation for NOMMU] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] per-task-delay-accounting: sync block I/O and swapin delay collectionShailabh Nagar
Unlike earlier iterations of the delay accounting patches, now delays are only collected for the actual I/O waits rather than try and cover the delays seen in I/O submission paths. Account separately for block I/O delays incurred as a result of swapin page faults whose frequency can be affected by the task/process' rss limit. Hence swapin delays can act as feedback for rss limit changes independent of I/O priority changes. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Cc: Jes Sorensen <jes@sgi.com> Cc: Peter Chubb <peterc@gelato.unsw.edu.au> Cc: Erich Focht <efocht@ess.nec.de> Cc: Levent Serinol <lserinol@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] ia64: race flushing icache in COW pathAnil Keshavamurthy
There is a race condition that showed up in a threaded JIT environment. The situation is that a process with a JIT code page forks, so the page is marked read-only, then some threads are created in the child. One of the threads attempts to add a new code block to the JIT page, so a copy-on-write fault is taken, and the kernel allocates a new page, copies the data, installs the new pte, and then calls lazy_mmu_prot_update() to flush caches to make sure that the icache and dcache are in sync. Unfortunately, the other thread runs right after the new pte is installed, but before the caches have been flushed. It tries to execute some old JIT code that was already in this page, but it sees some garbage in the i-cache from the previous users of the new physical page. Fix: we must make the caches consistent before installing the pte. This is an ia64 only fix because lazy_mmu_prot_update() is a no-op on all other architectures. Signed-off-by: Anil Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] mm/memory.c: EXPORT_UNUSED_SYMBOLAdrian Bunk
This patch marks an unused export as EXPORT_UNUSED_SYMBOL. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate mmIngo Molnar
Teach special (recursive) locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30[PATCH] Light weight event countersChristoph Lameter
The remaining counters in page_state after the zoned VM counter patches have been applied are all just for show in /proc/vmstat. They have no essential function for the VM. We use a simple increment of per cpu variables. In order to avoid the most severe races we disable preempt. Preempt does not prevent the race between an increment and an interrupt handler incrementing the same statistics counter. However, that race is exceedingly rare, we may only loose one increment or so and there is no requirement (at least not in kernel) that the vm event counters have to be accurate. In the non preempt case this results in a simple increment for each counter. For many architectures this will be reduced by the compiler to a single instruction. This single instruction is atomic for i386 and x86_64. And therefore even the rare race condition in an interrupt is avoided for both architectures in most cases. The patchset also adds an off switch for embedded systems that allows a building of linux kernels without these counters. The implementation of these counters is through inline code that hopefully results in only a single instruction increment instruction being emitted (i386, x86_64) or in the increment being hidden though instruction concurrency (EPIC architectures such as ia64 can get that done). Benefits: - VM event counter operations usually reduce to a single inline instruction on i386 and x86_64. - No interrupt disable, only preempt disable for the preempt case. Preempt disable can also be avoided by moving the counter into a spinlock. - Handling is similar to zoned VM counters. - Simple and easily extendable. - Can be omitted to reduce memory use for embedded use. References: RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2 RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2 local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2 V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2 V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2 V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2 Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30[PATCH] zoned vm counters: conversion of nr_pagetables to per zone counterChristoph Lameter
Conversion of nr_page_table_pages to a per zone counter [akpm@osdl.org: bugfix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] add page_mkwrite() vm_operations methodDavid Howells
Add a new VMA operation to notify a filesystem or other driver about the MMU generating a fault because userspace attempted to write to a page mapped through a read-only PTE. This facility permits the filesystem or driver to: (*) Implement storage allocation/reservation on attempted write, and so to deal with problems such as ENOSPC more gracefully (perhaps by generating SIGBUS). (*) Delay making the page writable until the contents have been written to a backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS. It permits the filesystem to have some guarantee about the state of the cache. (*) Account and limit number of dirty pages. This is one piece of the puzzle needed to make shared writable mapping work safely in FUSE. Needed by cachefs (Or is it cachefiles? Or fscache? <head spins>). At least four other groups have stated an interest in it or a desire to use the functionality it provides: FUSE, OCFS2, NTFS and JFFS2. Also, things like EXT3 really ought to use it to deal with the case of shared-writable mmap encountering ENOSPC before we permit the page to be dirtied. From: Peter Zijlstra <a.p.zijlstra@chello.nl> get_user_pages(.write=1, .force=1) can generate COW hits on read-only shared mappings, this patch traps those as mkpage_write candidates and fails to handle them the old way. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Joel Becker <Joel.Becker@oracle.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] Swapless page migration: add R/W migration entriesChristoph Lameter
Implement read/write migration ptes We take the upper two swapfiles for the two types of migration ptes and define a series of macros in swapops.h. The VM is modified to handle the migration entries. migration entries can only be encountered when the page they are pointing to is locked. This limits the number of places one has to fix. We also check in copy_pte_range and in mprotect_pte_range() for migration ptes. We check for migration ptes in do_swap_cache and call a function that will then wait on the page lock. This allows us to effectively stop all accesses to apge. Migration entries are created by try_to_unmap if called for migration and removed by local functions in migrate.c From: Hugh Dickins <hugh@veritas.com> Several times while testing swapless page migration (I've no NUMA, just hacking it up to migrate recklessly while running load), I've hit the BUG_ON(!PageLocked(p)) in migration_entry_to_page. This comes from an orphaned migration entry, unrelated to the current correctly locked migration, but hit by remove_anon_migration_ptes as it checks an address in each vma of the anon_vma list. Such an orphan may be left behind if an earlier migration raced with fork: copy_one_pte can duplicate a migration entry from parent to child, after remove_anon_migration_ptes has checked the child vma, but before it has removed it from the parent vma. (If the process were later to fault on this orphaned entry, it would hit the same BUG from migration_entry_wait.) This could be fixed by locking anon_vma in copy_one_pte, but we'd rather not. There's no such problem with file pages, because vma_prio_tree_add adds child vma after parent vma, and the page table locking at each end is enough to serialize. Follow that example with anon_vma: add new vmas to the tail instead of the head. (There's no corresponding problem when inserting migration entries, because a missed pte will leave the page count and mapcount high, which is allowed for. And there's no corresponding problem when migrating via swap, because a leftover swap entry will be correctly faulted. But the swapless method has no refcounting of its entries.) From: Ingo Molnar <mingo@elte.hu> pte_unmap_unlock() takes the pte pointer as an argument. From: Hugh Dickins <hugh@veritas.com> Several times while testing swapless page migration, gcc has tried to exec a pointer instead of a string: smells like COW mappings are not being properly write-protected on fork. The protection in copy_one_pte looks very convincing, until at last you realize that the second arg to make_migration_entry is a boolean "write", and SWP_MIGRATION_READ is 30. Anyway, it's better done like in change_pte_range, using is_write_migration_entry and make_migration_entry_read. From: Hugh Dickins <hugh@veritas.com> Remove unnecessary obfuscation from sys_swapon's range check on swap type, which blew up causing memory corruption once swapless migration made MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> From: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] Page Migration: Make do_swap_page redo the faultChristoph Lameter
It is better to redo the complete fault if do_swap_page() finds that the page is not in PageSwapCache() because the page migration code may have replaced the swap pte already with a pte pointing to valid memory. do_swap_page() may interpret an invalid swap entry without this patch because we do not reload the pte if we are looping back. The page migration code may already have reused the swap entry referenced by our local swp_entry. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] Don't pass boot parameters to argv_init[]OGAWA Hirofumi
The boot cmdline is parsed in parse_early_param() and parse_args(,unknown_bootoption). And __setup() is used in obsolete_checksetup(). start_kernel() -> parse_args() -> unknown_bootoption() -> obsolete_checksetup() If __setup()'s callback (->setup_func()) returns 1 in obsolete_checksetup(), obsolete_checksetup() thinks a parameter was handled. If ->setup_func() returns 0, obsolete_checksetup() tries other ->setup_func(). If all ->setup_func() that matched a parameter returns 0, a parameter is seted to argv_init[]. Then, when runing /sbin/init or init=app, argv_init[] is passed to the app. If the app doesn't ignore those arguments, it will warning and exit. This patch fixes a wrong usage of it, however fixes obvious one only. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: drivers/char/ftape/lowlevel/fdc-io.c: Correct a comment Kconfig help: MTD_JEDECPROBE already supports Intel Remove ugly debugging stuff do_mounts.c: Minor ROOT_DEV comment cleanup BUG_ON() Conversion in drivers/s390/block/dasd_devmap.c BUG_ON() Conversion in mm/mempool.c BUG_ON() Conversion in mm/memory.c BUG_ON() Conversion in kernel/fork.c BUG_ON() Conversion in ipc/sem.c BUG_ON() Conversion in fs/ext2/ BUG_ON() Conversion in fs/hfs/ BUG_ON() Conversion in fs/dcache.c BUG_ON() Conversion in fs/buffer.c BUG_ON() Conversion in input/serio/hp_sdc_mlc.c BUG_ON() Conversion in md/dm-table.c BUG_ON() Conversion in md/dm-path-selector.c BUG_ON() Conversion in drivers/isdn BUG_ON() Conversion in drivers/char BUG_ON() Conversion in drivers/mtd/
2006-03-26[PATCH] Add API for flushing Anon pagesJames Bottomley
Currently, get_user_pages() returns fully coherent pages to the kernel for anything other than anonymous pages. This is a problem for things like fuse and the SCSI generic ioctl SG_IO which can potentially wish to do DMA to anonymous pages passed in by users. The fix is to add a new memory management API: flush_anon_page() which is used in get_user_pages() to make anonymous pages coherent. Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26BUG_ON() Conversion in mm/memory.cEric Sesterhenn
this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-25[PATCH] mm: restore vm_normal_page checkNick Piggin
Hugh is rightly concerned that the CONFIG_DEBUG_VM coverage has gone too far in vm_normal_page, considering that we expect production kernels to be shipped with the option turned off, and that the code has been under some large changes recently. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Fix hugepage logic in free_pgtables() harderDavid Gibson
Turns out the hugepage logic in free_pgtables() was doubly broken. The loop coalescing multiple normal page VMAs into one call to free_pgd_range() had an off by one error, which could mean it would coalesce one hugepage VMA into the same bundle (checking 'vma' not 'next' in the loop). I transferred this bug into the new is_vm_hugetlb_page() based version. Here's the fix. This one didn't bite on powerpc previously for the same reason the is_hugepage_only_range() problem didn't: powerpc's hugetlb_free_pgd_range() is identical to free_pgd_range(). It didn't bite on ia64 because the hugepage region is distant enough from any other region that the separated PMD_SIZE distance test would always prevent coalescing the two together. No libhugetlbfs testsuite regressions (ppc64, POWER5). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Fix hugepage logic in free_pgtables()David Gibson
free_pgtables() has special logic to call hugetlb_free_pgd_range() instead of the normal free_pgd_range() on hugepage VMAs. However, the test it uses to do so is incorrect: it calls is_hugepage_only_range on a hugepage sized range at the start of the vma. is_hugepage_only_range() will return true if the given range has any intersection with a hugepage address region, and in this case the given region need not be hugepage aligned. So, for example, this test can return true if called on, say, a 4k VMA immediately preceding a (nicely aligned) hugepage VMA. At present we get away with this because the powerpc version of hugetlb_free_pgd_range() is just a call to free_pgd_range(). On ia64 (the only other arch with a non-trivial is_hugepage_only_range()) we get away with it for a different reason; the hugepage area is not contiguous with the rest of the user address space, and VMAs are not permitted in between, so the test can't return a false positive there. Nonetheless this should be fixed. We do that in the patch below by replacing the is_hugepage_only_range() test with an explicit test of the VMA using is_vm_hugetlb_page(). This in turn changes behaviour for platforms where is_hugepage_only_range() returns false always (everything except powerpc and ia64). We address this by ensuring that hugetlb_free_pgd_range() is defined to be identical to free_pgd_range() (instead of a no-op) on everything except ia64. Even so, it will prevent some otherwise possible coalescing of calls down to free_pgd_range(). Since this only happens for hugepage VMAs, removing this small optimization seems unlikely to cause any trouble. This patch causes no regressions on the libhugetlbfs testsuite - ppc64 POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: more CONFIG_DEBUG_VMNick Piggin
Put a few more checks under CONFIG_DEBUG_VM Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: split highorder pagesNick Piggin
Have an explicit mm call to split higher order pages into individual pages. Should help to avoid bugs and be more explicit about the code's intention. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] fix free swap cache latencyHugh Dickins
Lee Revell reported 28ms latency when process with lots of swapped memory exits. 2.6.15 introduced a latency regression when unmapping: in accounting the zap_work latency breaker, pte_none counted 1, pte_present PAGE_SIZE, but a swap entry counted nothing at all. We think of pages present as the slow case, but Lee's trace shows that free_swap_and_cache's radix tree lookup can make a lot of work - and we could have been doing it many thousands of times without a latency break. Move the zap_work update up to account swap entries like pages present. This does account non-linear pte_file entries, and unmap_mapping_range skipping over swap entries, by the same amount even though they're quick: but neither of those cases deserves complicating the code (and they're treated no worse than they were in 2.6.14). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17[PATCH] x86_64: Add boot option to disable randomized mappings and cleanupAndi Kleen
AMD SimNow!'s JIT doesn't like them at all in the guest. For distribution installation it's easiest if it's a boot time option. Also I moved the variable to a more appropiate place and make it independent from sysctl And marked __read_mostly which it is. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Direct Migration V9: PageSwapCache checksChristoph Lameter
Check for PageSwapCache after looking up and locking a swap page. The page migration code may change a swap pte to point to a different page under lock_page(). If that happens then the vm must retry the lookup operation in the swap space to find the correct page number. There are a couple of locations in the VM where a lock_page() is done on a swap page. In these locations we need to check afterwards if the page was migrated. If the page was migrated then the old page that was looked up before was freed and no longer has the PageSwapCache bit set. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-09[PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_semJes Sorensen
This patch converts the inode semaphore to a mutex. I have tested it on XFS and compiled as much as one can consider on an ia64. Anyway your luck with it might be different. Modified-by: Ingo Molnar <mingo@elte.hu> (finished the conversion) Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2006-01-09[PATCH] spufs: The SPU file system, baseArnd Bergmann
This is the current version of the spu file system, used for driving SPEs on the Cell Broadband Engine. This release is almost identical to the version for the 2.6.14 kernel posted earlier, which is available as part of the Cell BE Linux distribution from http://www.bsc.es/projects/deepcomputing/linuxoncell/. The first patch provides all the interfaces for running spu application, but does not have any support for debugging SPU tasks or for scheduling. Both these functionalities are added in the subsequent patches. See Documentation/filesystems/spufs.txt on how to use spufs. Signed-off-by: Arnd Bergmann <arndb@de.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-01-06[PATCH] mm: pfault optimisationNick Piggin
This atomic operation is superfluous: the pte will be added with the referenced bit set, and the page will be referenced through this mapping after the page fault handler returns anyway. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: rmap optimisationNick Piggin
Optimise rmap functions by minimising atomic operations when we know there will be no concurrent modifications. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing storeBadari Pulavarty
Here is the patch to implement madvise(MADV_REMOVE) - which frees up a given range of pages & its associated backing store. Current implementation supports only shmfs/tmpfs and other filesystems return -ENOSYS. "Some app allocates large tmpfs files, then when some task quits and some client disconnect, some memory can be released. However the only way to release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli Databases want to use this feature to drop a section of their bufferpool (shared memory segments) - without writing back to disk/swap space. This feature is also useful for supporting hot-plug memory on UML. Concerns raised by Andrew Morton: - "We have no plan for holepunching! If we _do_ have such a plan (or might in the future) then what would the API look like? I think sys_holepunch(fd, start, len), so we should start out with that." - Using madvise is very weird, because people will ask "why do I need to mmap my file before I can stick a hole in it?" - None of the other madvise operations call into the filesystem in this manner. A broad question is: is this capability an MM operation or a filesytem operation? truncate, for example, is a filesystem operation which sometimes has MM side-effects. madvise is an mm operation and with this patch, it gains FS side-effects, only they're really, really significant ones." Comments: - Andrea suggested the fs operation too but then it's more efficient to have it as a mm operation with fs side effects, because they don't immediatly know fd and physical offset of the range. It's possible to fixup in userland and to use the fs operation but it's more expensive, the vmas are already in the kernel and we can use them. Short term plan & Future Direction: - We seem to need this interface only for shmfs/tmpfs files in the short term. We have to add hooks into the filesystem for correctness and completeness. This is what this patch does. - In the future, plan is to support both fs and mmap apis also. This also involves (other) filesystem specific functions to be implemented. - Current patch doesn't support VM_NONLINEAR - which can be addressed in the future. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrea Arcangeli <andrea@suse.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-16Make sure we copy pages inserted with "vm_insert_page()" on forkLinus Torvalds
The logic that decides that a fork() might be able to avoid copying a VM area when it can be re-created by page faults didn't know about the new vm_insert_page() case. Also make some things a bit more anal wrt VM_PFNMAP. Pointed out by Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>