Age | Commit message (Collapse) | Author |
|
After do_wp_page has tested page_mkwrite, it must release old_page after
acquiring page table lock, not before: at some stage that ordering got
reversed, leaving a (very unlikely) window in which old_page might be
truncated, freed, and reused in the same position.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This early break prevents us from displaying info for the vm stats thresholds
if the zone doesn't have any pages in its per-cpu pagesets.
So my 800MB i386 box says:
Node 0, zone DMA
pages free 2365
min 16
low 20
high 24
active 0
inactive 0
scanned 0 (a: 0 i: 0)
spanned 4096
present 4044
nr_anon_pages 0
nr_mapped 1
nr_file_pages 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_dirty 0
nr_writeback 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
protection: (0, 868, 868)
pagesets
all_unreclaimable: 0
prev_priority: 12
start_pfn: 0
Node 0, zone Normal
pages free 199713
min 934
low 1167
high 1401
active 10215
inactive 4507
scanned 0 (a: 0 i: 0)
spanned 225280
present 222420
nr_anon_pages 2685
nr_mapped 1110
nr_file_pages 12055
nr_slab_reclaimable 2216
nr_slab_unreclaimable 1527
nr_page_table_pages 213
nr_dirty 0
nr_writeback 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
protection: (0, 0, 0)
pagesets
cpu: 0 pcp: 0
count: 152
high: 186
batch: 31
cpu: 0 pcp: 1
count: 13
high: 62
batch: 15
vm stats threshold: 16
cpu: 1 pcp: 0
count: 34
high: 186
batch: 31
cpu: 1 pcp: 1
count: 10
high: 62
batch: 15
vm stats threshold: 16
all_unreclaimable: 0
prev_priority: 12
start_pfn: 4096
Just nuke all that search-for-the-first-non-empty-pageset code. Dunno why it
was there in the first place..
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
early_node_map[] on every call. This is an excessive amount of sorting and
that can be avoided. This patch always searches the whole early_node_map[]
in find_min_pfn_for_node() instead of returning the first value found. The
map is then only sorted once when required. Successfully boot tested on a
number of machines.
[akpm@osdl.org: cleanup]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
work structure
Use the pointer passed to cache_reap to determine the work pointer and
consolidate exit paths.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Clean up __cache_alloc and __cache_alloc_node functions a bit. We no
longer need to do NUMA_BUILD tricks and the UMA allocation path is much
simpler. No functional changes in this patch.
Note: saves few kernel text bytes on x86 NUMA build due to using gotos in
__cache_alloc_node() and moving __GFP_THISNODE check in to
fallback_alloc().
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Christoph Lameter <christoph@lameter.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The PageSlab debug check in kfree_debugcheck() is broken for compound
pages. It is also redundant as we already do BUG_ON for non-slab pages in
page_get_cache() and page_get_slab() which are always called before we free
any actual objects.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch adds a utility function install_special_mapping, for creating a
special vma using a fixed set of preallocated pages as backing, such as for a
vDSO. This consolidates some nearly identical code used for vDSO mapping
reimplemented for different architectures.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Also split that long line up - people like to send us wordwrapped oom-kill
traces.
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
__unmap_hugepage_range() is buggy that it does not preserve dirty state of
huge_pte when unmapping hugepage range. It causes data corruption in the
event of dop_caches being used by sys admin. For example, an application
creates a hugetlb file, modify pages, then unmap it. While leaving the
hugetlb file alive, comes along sys admin doing a "echo 3 >
/proc/sys/vm/drop_caches".
drop_pagecache_sb() will happily free all pages that aren't marked dirty if
there are no active mapping. Later when application remaps the hugetlb
file back and all data are gone, triggering catastrophic flip over on
application.
Not only that, the internal resv_huge_pages count will also get all messed
up. Fix it up by marking page dirty appropriately.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: "Nish Aravamudan" <nish.aravamudan@gmail.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: <stable@kernel.org>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Remove find_trylock_page as per the removal schedule.
Signed-off-by: Nick Piggin <npiggin@suse.de>
[ Let's see if anybody screams ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit e80ee884ae0e3794ef2b65a18a767d502ad712ee.
Pawel Sikora had a boot-time oops due to it - because the sign change
invalidates the following comparisons, since 'free_pages' can be
negative.
The micro-optimization just isn't worth it.
Bisected-by: Pawel Sikora <pluto@agmk.net>
Acked-by: Andrew Morton <akpm@osdl.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When expanding the stack, we don't currently check if the VMA will cross
into an area of the address space that is reserved for hugetlb pages.
Subsequent faults on the expanded portion of such a VMA will confuse the
low-level MMU code, resulting in an OOPS. Check for this.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs
is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE.
Instead of complicating that move_pte, just forget the minor optimization
when mremapping, and change the one thing which needed it for correctness
- filemap_xip use ZERO_PAGE(0) throughout instead of according to address.
[ "There is no block device driver one could use for XIP on mips
platforms" - Carsten Otte ]
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This makes balance_dirty_page() always base its calculations on the
amount of non-highmem memory in the machine, rather than try to base it
on total memory and then falling back on non-highmem memory if the
mapping it was writing wasn't highmem capable.
This not only fixes a situation where two different writers can have
wildly different notions about what is a "balanced" dirty state, but it
also means that people with highmem machines don't run into an OOM
situation when regular memory fills up with dirty pages.
We used to try to handle the latter case by scaling down the dirty_ratio
if the machine had a lot of highmem pages in page_writeback_init(), but
it wasn't aggressive enough for some situations, and since basing the
dirty ratio on highmem memory was broken in the first place, let's just
stop doing so.
(A variation of this theme fixed Justin Piszcz's OOM problem when
copying an 18GB file on a RAID setup).
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
NFS can handle the case where invalidate_inode_pages2_range() fails, so the
premise behind commit 8258d4a574d3a8c01f0ef68aa26b969398a0e140 is now gone.
Remove the WARN_ON_ONCE() which is causing users grief as we can see from
http://bugzilla.kernel.org/show_bug.cgi?id=7826
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch fixes core dumps to include the vDSO vma, which is left out now.
It removes the special-case core writing macros, which were not doing the
right thing for the vDSO vma anyway. Instead, it uses VM_ALWAYSDUMP in the
vma; there is no need for the fixmap page to be installed. It handles the
CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from
get_gate_vma after real vmas in the same way the /proc/PID/maps code does.
This changes core dumps so they no longer include the non-PT_LOAD phdrs from
the vDSO. I made the change to add them in the first place, but in turned out
that nothing ever wanted them there since the advent of NT_AUXV. It's cleaner
to leave them out, and just let the phdrs inside the vDSO image speak for
themselves.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch fixes the initialization of gate_vma.vm_flags and
gate_vma.vm_page_prot to reflect reality. This makes the "[vdso]" line in
/proc/PID/maps correctly show r-xp instead of ---p, when gate_vma is used
(CONFIG_COMPAT_VDSO on i386).
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
It's not pretty, but it appears that ext3 with data=journal will clean
pages without ever actually telling the VM that they are clean. This,
in turn, will result in the VM (and balance_dirty_pages() in particular)
to never realize that the pages got cleaned, and wait forever for an
event that already happened.
Technically, this seems to be a problem with ext3 itself, but it used to
be hidden by 'try_to_free_buffers()' noticing this situation on its own,
and just working around the filesystem problem.
This commit re-instates that hack, in order to avoid a regression for
the 2.6.20 release. This fixes bugzilla 7844:
http://bugzilla.kernel.org/show_bug.cgi?id=7844
Peter Zijlstra points out that we should probably retain the debugging
code that this removes from cancel_dirty_page(), and I agree, but for
the imminent release we might as well just silence the warning too
(since it's not a new bug: anything that triggers that warning has been
around forever).
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently one can specify an arbitrary node mask to mbind that includes
nodes not allowed. If that is done with an interleave policy then we will
go around all the nodes. Those outside of the currently allowed cpuset
will be redirected to the border nodes. Interleave will then create
imbalances at the borders of the cpuset.
This patch restricts the nodes to the currently allowed cpuset.
The RFC for this patch was discussed at
http://marc.theaimsgroup.com/?t=116793842100004&r=1&w=2
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently we issue a bounce trace when __blk_queue_bounce() is called,
but that merely means that the device has a lower dma mask than the
higher pages in the system. The bio itself may still be lower pages. So
move the bounce trace into __blk_queue_bounce(), when we know there will
actually be page bouncing.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
NFS: Fix race in nfs_release_page()
invalidate_inode_pages2() may find the dirty bit has been set on a page
owing to the fact that the page may still be mapped after it was locked.
Only after the call to unmap_mapping_range() are we sure that the page
can no longer be dirtied.
In order to fix this, NFS has hooked the releasepage() method and tries
to write the page out between the call to unmap_mapping_range() and the
call to remove_mapping(). This, however leads to deadlocks in the page
reclaim code, where the page may be locked without holding a reference
to the inode or dentry.
Fix is to add a new address_space_operation, launder_page(), which will
attempt to write out a dirty page without releasing the page lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also, the bare SetPageDirty() can skew all sort of accounting leading to
other nasties.
[akpm@osdl.org: cleanup]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix an oops experienced on the Cell architecture when init-time functions,
early_*(), are called at runtime. It alters the call paths to make sure
that the callers explicitly say whether the call is being made on behalf of
a hotplug even, or happening at boot-time.
It has been compile tested on ppc64, ia64, s390, i386 and x86_64.
Acked-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
* master.kernel.org:/home/rmk/linux-2.6-arm:
[ARM] Provide basic printk_clock() implementation
[ARM] Resolve fuse and direct-IO failures due to missing cache flushes
[ARM] pass vma for flush_anon_page()
[ARM] Fix potential MMCI bug
[ARM] Fix kernel-mode undefined instruction aborts
[ARM] 4082/1: iop3xx: fix iop33x gpio register offset
[ARM] 4070/1: arch/arm/kernel: fix warnings from missing includes
[ARM] 4079/1: iop: Update MAINTAINERS
|
|
Since get_user_pages() may be used with processes other than the
current process and calls flush_anon_page(), flush_anon_page() has to
cope in some way with non-current processes.
It may not be appropriate, or even desirable to flush a region of
virtual memory cache in the current process when that is different to
the process that we want the flush to occur for.
Therefore, pass the vma into flush_anon_page() so that the architecture
can work out whether the 'vmaddr' is for the current process or not.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
|
|
At the end of shrink_all_memory() we forget to recalculate lru_pages: it can
be zero.
Fix that up, and add a helper function for this operation too.
Also, recalculate lru_pages each time around the inner loop to get the
balancing correct.
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
These days, if you swapoff when there isn't enough memory, OOM killer gives
"BUG: scheduling while atomic" and the machine hangs: badness() needs to do
its PF_SWAPOFF return after the task_unlock (tasklist_lock is also held
here, so p isn't going to be freed: PF_SWAPOFF might get turned off at any
moment, but that doesn't really matter).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Both process_zones() and drain_node_pages() check for populated zones
before touching pagesets. However, __drain_pages does not do so,
This may result in a NULL pointer dereference for pagesets in unpopulated
zones if a NUMA setup is combined with cpu hotplug.
Initially the unpopulated zone has the pcp pointers pointing to the boot
pagesets. Since the zone is not populated the boot pageset pointers will
not be changed during page allocator and slab bootstrap.
If a cpu is later brought down (first call to __drain_pages()) then the pcp
pointers for cpus in unpopulated zones are set to NULL since __drain_pages
does not first check for an unpopulated zone.
If the cpu is then brought up again then we call process_zones() which will
ignore the unpopulated zone. So the pageset pointers will still be NULL.
If the cpu is then again brought down then __drain_pages will attempt to
drain pages by following the NULL pageset pointer for unpopulated zones.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
pdflush hit the BUG_ON(!PageSlab(page)) in kmem_freepages called from
fallback_alloc: cache_grow already freed those pages when alloc_slabmgmt
failed. But it wouldn't have freed them if __GFP_NO_GROW, so make sure
fallback_alloc doesn't waste its time on that case.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
At the moment the inode/dentry cache hash tables (common by way of
alloc_large_system_hash()) are incorrectly sized by their respective
detection logic when we attempt to use large base pages on systems with
little memory.
This results in odd behaviour when using a 64kB PAGE_SIZE, such as:
Dentry cache hash table entries: 8192 (order: -1, 32768 bytes)
Inode-cache hash table entries: 4096 (order: -2, 16384 bytes)
The mount cache hash table is seemingly the only one that gets this right
by directly taking PAGE_SIZE in to account.
The following patch attempts to catch the bogus values and round it up to
at least 0-order.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
In the kernels later than 2.6.19 there is a regression that makes swsusp
fail if the resume device is not explicitly specified.
It can be fixed by adding an additional parameter to
mm/swapfile.c:swap_type_of() allowing us to pass the (struct block_device
*) corresponding to the first available swap back to the caller.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix a rather obvious buglet. Noticed while instrumenting the VM using
/proc/vmstat.
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
(akpm: macros are wonderful)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Recent cleanup of slab.h broke SLOB allocator: the routine kmem_cache_init
has now the __init attribute for both slab.c and slob.c. This routine
cannot be removed after init in the case of slob.c -- it serves as a timer
callback.
Provide a separate timer callback routine, call it once from kmem_cache_init,
keep the __init attribute on the latter.
Signed-off-by: Dimitri Gorokhovik <dimitri.gorokhovik@free.fr>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
take2
constrained_alloc(), which is called to detect where oom is from, checks
passed zone_list(). If zone_list doesn't include all nodes, it thinks oom
is from mempolicy.
But there is memory-less-node. memory-less-node's zones are never included
in zonelist[].
contstrained_alloc() should get memory_less_node into count. Otherwise, it
always thinks 'oom is from mempolicy'. This means that current process
dies at any time. This patch fix it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The VM layer (on the face of it, fairly reasonably) expected that when
it does a ->writepage() call to the filesystem, it would write out the
full page at that point in time. Especially since it had earlier marked
the whole page dirty with "set_page_dirty()".
But that isn't actually the case: ->writepage() does not actually write
a page, it writes the parts of the page that have been explicitly marked
dirty before, *and* that had not got written out for other reasons since
the last time we told it they were dirty.
That last caveat is the important one.
Which _most_ of the time ends up being the whole page (since we had
called "set_page_dirty()" on the page earlier), but if the filesystem
had done any dirty flushing of its own (for example, to honor some
internal write ordering guarantees), it might end up doing only a
partial page IO (or none at all) when ->writepage() is actually called.
That is the correct thing in general (since we actually often _want_
only the known-dirty parts of the page to be written out), but the
shared dirty page handling had implicitly forgotten about these details,
and had a number of cases where it was doing just the "->writepage()"
part, without telling the low-level filesystem that the whole page might
have been re-dirtied as part of being mapped writably into user space.
Since most of the time the FS did actually write out the full page, we
didn't notice this for a loong time, and this needed some really odd
patterns to trigger. But it caused occasional corruption with rtorrent
and with the Debian "apt" database, because both use shared mmaps to
update the end result.
This fixes it. Finally. After way too much hair-pulling.
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Martin J. Bligh <mbligh@google.com>
Acked-by: Martin Michlmayr <tbm@cyrius.com>
Acked-by: Martin Johansson <martin@fatbob.nu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Andrei Popa <andrei.popa@i-neo.ro>
Cc: High Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@osdl.org>,
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Gordon Farquharson <gordonfarquharson@gmail.com>
Cc: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Kenneth Cheng <kenneth.w.chen@intel.com>
Cc: Tobias Diedrich <ranma@tdiedrich.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Make cancel_dirty_page() act more like all the other dirty and writeback
accounting functions: test for "mapping" being NULL, and do the
NR_FILE_DIRY accounting purely based on mapping_cap_account_dirty()).
Also, add it to the exports, so that modular filesystems can use it.
Acked-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
- add flush_cache_page() for all those virtual indexed cache
architectures.
- handle s390.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add more debugging in the rmap code in an attempt to locate to source of
the occasional "mapcount went negative" assertions.
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The version of mm/vmscan.c in Linus' current tree has swapped parameters in
the shrink_all_zones declaration and call, used by the various
suspend-to-disk implementations. This doesn't seem to have any great
adverse effect, but it's clearly wrong.
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix kernel-doc warnings in 2.6.20-rc1.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The declaration of kmem_ptr_validate in slab.h does not match the
one in slab.c. Remove the fastcall attribute (this is the only use in
slab.c).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Ran into BUG() while doing madvise(REMOVE) testing. If we are punching a
hole into shared memory segment using madvise(REMOVE) and the entire hole
is below the indirect blocks, we hit following assert.
BUG_ON(limit <= SHMEM_NR_DIRECT);
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Only (un)account for IO and page-dirtying for devices which have real backing
store (ie: not tmpfs or ramdisks).
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
truncate presently invalidates the dirty page's buffer_heads then shoots down
the page. But try_to_free_buffers() will now bale out because the page is
dirty.
Net effect: the LRU gets filled with dirty pages which have invalidated
buffer_heads attached. They have no ->mapping and hence cannot be cleaned.
The machine leaks memory at an enormous rate.
Fix this by cleaning the page before running try_to_free_buffers(), so
try_to_free_buffers() can do its work.
Also, remember to do dirty-page-acoounting in cancel_dirty_page() so the
machine won't wedge up trying to write non-existent dirty pages.
Probably still wrong, but now less so.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
They were horribly easy to mis-use because of their tempting naming, and
they also did way more than any users of them generally wanted them to
do.
A dirty page can become clean under two circumstances:
(a) when we write it out. We have "clear_page_dirty_for_io()" for
this, and that function remains unchanged.
In the "for IO" case it is not sufficient to just clear the dirty
bit, you also have to mark the page as being under writeback etc.
(b) when we actually remove a page due to it becoming inaccessible to
users, notably because it was truncate()'d away or the file (or
metadata) no longer exists, and we thus want to cancel any
outstanding dirty state.
For the (b) case, we now introduce "cancel_dirty_page()", which only
touches the page state itself, and verifies that the page is not mapped
(since cancelling writes on a mapped page would be actively wrong as it
is still accessible to users).
Some filesystems need to be fixed up for this: CIFS, FUSE, JFS,
ReiserFS, XFS all use the old confusing functions, and will be fixed
separately in subsequent commits (with some of them just removing the
offending logic, and others using clear_page_dirty_for_io()).
This was confirmed by Martin Michlmayr to fix the apt database
corruption on ARM.
Cc: Martin Michlmayr <tbm@cyrius.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrei Popa <andrei.popa@i-neo.ro>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Gordon Farquharson <gordonfarquharson@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
fix a typo, sys_mincore() needs min().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus "I'm a moron" Torvalds <torvalds@osdl.org>
|
|
Hugh Dickins correctly points out that mincore() is actually _supposed_
to fail on an unmapped hole in the user address space, rather than
return valid ("empty") information about the hole. This just simplifies
the problem further (I had been misled by our previous confusing and
complicated way of doing mincore()).
Also, in the unlikely situation that we can't allocate a temporary
kernel buffer, we should actually return EAGAIN, not ENOMEM, to keep the
"unmapped hole" and "allocation failure" error cases separate.
Finally, add a comment about our stupid historical lack of support for
anonymous mappings. I'll fix that if somebody reminds me after 2.6.20
is out.
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Doug Chapman noticed that mincore() will doa "copy_to_user()" of the
result while holding the mmap semaphore for reading, which is a big
no-no. While a recursive read-lock on a semaphore in the case of a page
fault happens to work, we don't actually allow them due to deadlock
schenarios with writers due to fairness issues.
Doug and Marcel sent in a patch to fix it, but I decided to just rewrite
the mess instead - not just fixing the locking problem, but making the
code smaller and (imho) much easier to understand.
Cc: Doug Chapman <dchapman@redhat.com>
Cc: Marcel Holtmann <holtmann@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
To allow a more effective copy_user_highpage() on certain architectures,
a vma argument is added to the function and cow_user_page() allowing
the implementation of these functions to check for the VM_EXEC bit.
The main part of this patch was originally written by Ralf Baechle;
Atushi Nemoto did the the debugging.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
When some objects are allocated by one CPU but freed by another CPU we can
consume lot of cycles doing divides in obj_to_index().
(Typical load on a dual processor machine where network interrupts are
handled by one particular CPU (allocating skbufs), and the other CPU is
running the application (consuming and freeing skbufs))
Here on one production server (dual-core AMD Opteron 285), I noticed this
divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But Opteron are
quite modern cpus and the divide is much more expensive on oldest
architectures :
On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1
cycle for a multiply.
Doing some math, we can use a reciprocal multiplication instead of a divide.
If we want to compute V = (A / B) (A and B being u32 quantities)
we can instead use :
V = ((u64)A * RECIPROCAL(B)) >> 32 ;
where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B
Note :
I wrote pure C code for clarity. gcc output for i386 is not optimal but
acceptable :
mull 0x14(%ebx)
mov %edx,%eax // part of the >> 32
xor %edx,%edx // useless
mov %eax,(%esp) // could be avoided
mov %edx,0x4(%esp) // useless
mov (%esp),%ebx
[akpm@osdl.org: small cleanups]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|