path: root/mm
Age | Commit message | Author
2007-07-16 | slob: initial NUMA support | Paul Mundt
This adds preliminary NUMA support to SLOB, primarily aimed at systems with small nodes (tested all the way down to a 128kB SRAM block), whether asymmetric or otherwise. We follow the same conventions as SLAB/SLUB, preferring current node placement for new pages, or with explicit placement, if a node has been specified. Presently on UP NUMA this has the side-effect of preferring node#0 allocations (since numa_node_id() == 0, though this could be reworked if we could hand off a pfn to determine node placement), so single-CPU NUMA systems will want to place smaller nodes further out in terms of node id. Once a page has been bound to a node (via explicit node id typing), we only do block allocations from partial free pages that have a matching node id in the page flags. The current implementation does have some scalability problems, in that all partial free pages are tracked in the global freelist (with contention due to the single spinlock). However, these are things that are being reworked for SMP scalability first, while things like per-node freelists can easily be built on top of this sort of functionality once it's been added. More background can be found in: http://marc.info/?l=linux-mm&m=118117916022379&w=2 http://marc.info/?l=linux-mm&m=118170446306199&w=2 http://marc.info/?l=linux-mm&m=118187859420048&w=2 and subsequent threads. Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | speed up madvise_need_mmap_write() usage | Jason Baron
In the new madvise_need_mmap_write() call we can avoid an extra case statement and function call as follows. Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | mm/slab.c: start_cpu_timer() should be __cpuinit | Adrian Bunk
start_cpu_timer() should be __cpuinit (which also matches what its callers are). __devinit didn't cause problems, it simply wasted a few bytes of memory for the common CONFIG_HOTPLUG_CPU=n case. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | mm: more __meminit annotations | Paul Mundt
Currently zone_spanned_pages_in_node() and zone_absent_pages_in_node() are non-static for ARCH_POPULATES_NODE_MAP and static otherwise. However, only the non-static versions are __meminit annotated, despite only being called from __meminit functions in either case. zone_init_free_lists() is currently non-static and not __meminit annotated either, despite only being called once in the entire tree by init_currently_empty_zone(), which too is __meminit. So make it static and properly annotated. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | kill vmalloc_earlyreserve | Jan Beulich
This symbol got orphaned quite a while ago. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | mm: fix improper .init-type section references | Jan Beulich
.. which modpost started warning about. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | numa: mempolicy: trivial debug fixes. | Paul Mundt
Enabling debugging fails to build due to the nodemask variable in do_mbind() having changed names, and then oopses on boot due to the assumption that the nodemask can be dereferenced -- which doesn't work out so well when the policy is changed to MPOL_DEFAULT with a NULL nodemask by numa_default_policy(). This fixes it up, and switches from PDprintk() to pr_debug() while we're at it. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | oom: stop allocating user memory if TIF_MEMDIE is set | Ethan Solomita
get_user_pages() can try to allocate a nearly unlimited amount of memory on behalf of a user process, even if that process has been OOM killed. The OOM kill occurs upon return to user space via a SIGKILL, but get_user_pages() will try to allocate all its memory before returning. Change get_user_pages() to check for TIF_MEMDIE, and if set then return immediately. Signed-off-by: Ethan Solomita <solo@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | numa: mempolicy: dynamic interleave map for system init | Paul Mundt
This converts the default system init memory policy to use a dynamically created node map instead of defaulting to all online nodes. Nodes of a certain size (>= 16MB) are judged to be suitable for interleave, and are added to the map. If all nodes are smaller in size, the largest one is automatically selected. Without this, tiny nodes find themselves out of memory before we even make it to userspace. Systems with large nodes will notice no change. Only the system init policy is affected by this change; the regular MPOL_DEFAULT policy is still switched to later on in the boot process as normal. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
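The selection rule in this message can be sketched in plain C. This is a userspace illustration only, not the kernel's code: `pick_interleave_nodes`, `MIN_INTERLEAVE_BYTES`, and the bitmask representation of the node map are names and choices assumed for the example. Only the 16MB threshold and the "largest node" fallback come from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold taken from the commit message: nodes of at least 16MB qualify. */
#define MIN_INTERLEAVE_BYTES (16ULL << 20)

/*
 * Hypothetical helper: returns a bitmask of node ids suitable for interleave.
 * node_bytes[i] is the size of node i; nr_nodes must be between 1 and 64.
 */
static uint64_t pick_interleave_nodes(const uint64_t *node_bytes, size_t nr_nodes)
{
    uint64_t mask = 0;
    size_t largest = 0;

    assert(nr_nodes > 0 && nr_nodes <= 64);

    for (size_t i = 0; i < nr_nodes; i++) {
        if (node_bytes[i] > node_bytes[largest])
            largest = i;
        if (node_bytes[i] >= MIN_INTERLEAVE_BYTES)
            mask |= 1ULL << i;
    }

    /* No node was big enough: fall back to the single largest node. */
    if (mask == 0)
        mask = 1ULL << largest;

    return mask;
}
```

With nodes of 64MB, 128kB, and 32MB the sketch selects nodes 0 and 2. With only sub-16MB nodes it selects the single largest one.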
2007-07-16 | SLUB: support slub_debug on by default | Christoph Lameter
Add a new configuration variable, CONFIG_SLUB_DEBUG_ON. If set, the kernel will be booted by default with slab debugging switched on, similar to CONFIG_SLAB_DEBUG. By default slab debugging is available but must be enabled by specifying "slub_debug" as a kernel parameter. Also add support to switch off slab debugging for a kernel that was built with CONFIG_SLUB_DEBUG_ON. This works by specifying slub_debug=- as a kernel parameter. Dave Jones wanted this feature. http://marc.info/?l=linux-kernel&m=118072189913045&w=2 [akpm@linux-foundation.org: clean up switch statement] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | invalidate_mapping_pages(): add cond_resched | Andrew Morton
invalidate_mapping_pages() can sometimes take a long time (millions of pages to free). Long enough for the softlockup detector to trigger. We used to have a cond_resched() in there but I took it out because the drop_caches code calls invalidate_mapping_pages() under inode_lock. The patch adds a nasty flag and puts the cond_resched() back. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | mm: debug check for the fault vs invalidate race | Nick Piggin
Add a bugcheck for Andrea's pagefault vs invalidate race. This is triggerable for both linear and nonlinear pages with a userspace test harness (using direct IO and truncate, respectively). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | hugetlb: fix race in alloc_fresh_huge_page() | Joe Jin
That static `nid' index needs locking. Without it we can end up calling alloc_pages_node() with an illegal node ID and the kernel crashes. Acked-by: gurudas pai <gurudas.pai@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | vmscan: fix comments related to shrink_list() | Anderson Briglia
Fix the shrink_list name on some files under mm/ directory. Signed-off-by: Anderson Briglia <anderson.briglia@indt.org.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | slob: improved alignment handling | Nick Piggin
Remove the core slob allocator's minimum alignment restrictions, and instead introduce the alignment restrictions at the slab API layer. This lets us heed the ARCH_KMALLOC/SLAB_MINALIGN directives, and also use __alignof__ (unsigned long) for the default alignment (which should allow relaxed alignment architectures to take better advantage of SLOB's small minimum alignment). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | slob: remove bigblock tracking | Nick Piggin
Remove the bigblock lists in favour of using compound pages and going directly to the page allocator. Allocation size is stored in page->private, which also makes ksize more accurate than it previously was. Saves ~.5K of code, and 12-24 bytes overhead per >= PAGE_SIZE allocation. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | slob: rework freelist handling | Nick Piggin
Improve slob by turning the freelist into a list of pages using struct page fields, then each page has a singly linked freelist of slob blocks via a pointer in the struct page.
- The first benefit is that the slob freelists can be indexed by a smaller type (2 bytes, if the PAGE_SIZE is reasonable).
- Next is that freeing is much quicker because it does not have to traverse the entire freelist. Allocation can be slightly faster too, because we can skip almost-full freelist pages completely.
- Slob pages are then freed immediately when they become empty, rather than having a periodic timer try to free them. This gives efficiency and memory consumption improvement.
Then, we don't encode separate size and next fields into each slob block, rather we use the sign bit to distinguish between "size" or "next". Then size 1 blocks contain a "next" offset, and others contain the "size" in the first unit and "next" in the second unit.
- This allows minimum slob allocation alignment to go from 8 bytes to 2 bytes on 32-bit and 12 bytes to 2 bytes on 64-bit. In practice, it is best to align them to word size, however some architectures (e.g. cris) could gain space savings from turning off this extra alignment.
Then, make kmalloc use its own slob_block at the front of the allocation in order to encode allocation size, rather than rely on not overwriting slob's existing header block.
- This reduces kmalloc allocation overhead similarly to alignment reductions.
- Decouples kmalloc layer from the slob allocator.
Then, add a page flag specific to slob pages.
- This means kfree of a page aligned slob block doesn't have to traverse the bigblock list.
I would get benchmarks, but my test box's network doesn't come up with slob before this patch. I think something is timing out. Anyway, things are faster after the patch. Code size goes up about 1K, however dynamic memory usage _should_ be lower even on relatively small memory systems.
Future todo item is to restore the cyclic free list search, rather than to always begin at the start. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
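The sign-based size/next encoding described in this message can be illustrated with a small userspace sketch. The type and function names (`slobidx_t`, `slob_t`, `set_free_block`, `free_block_size`, `free_block_next`) are assumptions made for the example and are not claimed to match mm/slob.c. Offsets are counted in units from the start of the page.

```c
#include <assert.h>
#include <stdint.h>

/* One allocation unit; 2 bytes, matching the "smaller type" in the message. */
typedef int16_t slobidx_t;

typedef struct { slobidx_t units; } slob_t;

/*
 * Record a free block of `size` units whose successor lives at unit offset
 * `next` (illustrative names, not the kernel's).
 */
static void set_free_block(slob_t *s, slobidx_t size, slobidx_t next)
{
    assert(size >= 1 && next >= 0);
    if (size > 1) {
        s[0].units = size;              /* positive: first unit holds the size */
        s[1].units = next;              /* second unit holds the next offset */
    } else {
        s[0].units = (slobidx_t)-next;  /* non-positive: size is 1, value is -next */
    }
}

/* A positive first unit is a size; anything else marks a one-unit block. */
static slobidx_t free_block_size(const slob_t *s)
{
    return s->units > 0 ? s->units : 1;
}

/* Multi-unit blocks keep "next" in the second unit, one-unit blocks negate it. */
static slobidx_t free_block_next(const slob_t *s)
{
    return s[0].units > 0 ? s[1].units : (slobidx_t)-s[0].units;
}
```

Because a one-unit block needs no second field, the smallest free block in this sketch is a single 2-byte unit, which is what lets the minimum alignment drop as the message describes.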
2007-07-16 | MM: alloc_large_system_hash() can free some memory for non power-of-two bucketsize | Eric Dumazet
alloc_large_system_hash() is called at boot time to allocate space for several large hash tables. Lately, TCP hash table was changed and its bucketsize is not a power-of-two anymore. On most setups, alloc_large_system_hash() allocates one big page (order > 0) with __get_free_pages(GFP_ATOMIC, order). This single high_order page has a power-of-two size, bigger than the needed size. We can free all pages that won't be used by the hash table. On a 1GB i386 machine, this patch saves 128 KB of LOWMEM memory. TCP established hash table entries: 32768 (order: 6, 393216 bytes) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
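The 128 KB figure can be checked with a little arithmetic. The sketch below is not the kernel routine: the helper names and the 4096-byte page size are assumptions for the example. It rounds the 393216-byte table from the message up to a power-of-two page count and reports the unused tail.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096UL  /* assumed i386 page size */

/* Smallest power-of-two page count that covers `bytes`. */
static size_t pow2_pages_for(size_t bytes)
{
    size_t needed = (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
    size_t pages = 1;

    assert(bytes > 0);
    while (pages < needed)
        pages <<= 1;
    return pages;
}

/* Bytes at the tail of the power-of-two block that the table never touches. */
static size_t unused_tail_bytes(size_t bytes)
{
    size_t needed = (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;

    return (pow2_pages_for(bytes) - needed) * PAGE_SIZE_BYTES;
}
```

For 393216 bytes the table needs 96 pages, the power-of-two block holds 128 pages, and the remaining 32 pages are the 128 KB the message reports as saved.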
2007-07-16 | Make /proc/slabinfo use seq_list_xxx helpers | Pavel Emelianov
This entry prints a header in .start callback. This is OK, but the more elegant solution would be to move this into the .show callback and use seq_list_start_head() in .start one. I have left it as is in order to make the patch just switch to new API and nothing more. [adobriyan@sw.ru: Wrong pointer was used as kmem_cache pointer] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | MM: use DIV_ROUND_UP() in mm/memory.c | Rolf Eike Beer
Replace a hand coded version of DIV_ROUND_UP(). Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
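For readers unfamiliar with the idiom, here is a minimal sketch of the rounding-up division that such a macro performs. The macro body follows the well-known kernel definition; `pages_needed` is a hypothetical caller added for illustration and is not taken from mm/memory.c.

```c
#include <assert.h>

/* Same shape as the kernel macro: round the quotient n/d up for positive values. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical example: pages needed to hold `bytes` bytes. */
static unsigned long pages_needed(unsigned long bytes, unsigned long page_size)
{
    assert(page_size != 0);
    return DIV_ROUND_UP(bytes, page_size);
}
```

The hand-coded form is typically the expanded expression `(bytes + page_size - 1) / page_size`, so using the macro changes readability rather than behavior.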
2007-07-16 | hugetlb: remove unnecessary nid initialization | Nishanth Aravamudan
nid is initialized to numa_node_id() but will either be overwritten in the loop or not used in the conditional. So remove the initialization. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 | change zonelist order: zonelist order selection logic | KAMEZAWA Hiroyuki
Make zonelist creation policy selectable from sysctl/boot option v6. This patch makes NUMA's zonelist (of pgdat) order selectable. Available orders are Default (automatic), Node-based, and Zone-based.
[Default Order] The kernel selects Node-based or Zone-based order automatically.
[Node-based Order] This policy treats the locality of memory as the most important parameter. Zonelist order is created by each zone's locality. This means lower zones (e.g. ZONE_DMA) can be used before higher zones (e.g. ZONE_NORMAL) are exhausted. In other words, ZONE_DMA will be in the middle of the zonelist. The current 2.6.21 kernel uses this.
Pros.
* A user can expect local memory as much as possible.
Cons.
* Lower zones will be exhausted before higher zones. This may cause OOM_KILL.
Maybe suitable if ZONE_DMA is relatively big, you never see OOM_KILL because of ZONE_DMA exhaustion, and you need the best locality.
(example) Assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
* node(0)'s memory allocation order: node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.
* node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
[Zone-based Order] This policy treats the zone type as the most important parameter. Zonelist order is created by zone-type order. This means a lower zone is never used before the higher zones are exhausted. In other words, ZONE_DMA will always be at the tail of the zonelist.
Pros.
* OOM_KILL (because of a lower zone) occurs only if all zones are exhausted.
Cons.
* Memory locality may not be best.
(example) Assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
* node(0)'s memory allocation order: node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.
* node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
The boot option "numa_zonelist_order=" and proc/sysctl are supported.
command: % echo N > /proc/sys/vm/numa_zonelist_order
Will rebuild zonelist in Node-based order.
command: % echo Z > /proc/sys/vm/numa_zonelist_order
Will rebuild zonelist in Zone-based order.
Thanks to Lee Schermerhorn, who gave me much help and code. [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order] [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
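The two orderings in this message can be reproduced with a short userspace sketch for the 2-node example. Everything here is an illustration under stated assumptions: `build_zonelist`, `emit`, and the static topology table are hypothetical, and "nearest node first" is approximated by simply starting at the local node and wrapping around, which is enough for two nodes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, NR_ZONES };
#define NR_NODES 2

/* Example topology from the message: node 0 has DMA and NORMAL, node 1 only NORMAL. */
static const int zone_present[NR_NODES][NR_ZONES] = {
    { 1, 1 },
    { 0, 1 },
};
static const char *const zone_name[NR_ZONES] = { "DMA", "NORMAL" };

/* Append "node(n)'s ZONE" to out, separated by " -> ". */
static void emit(char *out, size_t len, int node, int zone)
{
    size_t used = strlen(out);

    snprintf(out + used, len - used, "%snode(%d)'s %s",
             used ? " -> " : "", node, zone_name[zone]);
}

/* Build the fallback order seen from `local`; by_node selects Node-based order. */
static void build_zonelist(int local, int by_node, char *out, size_t len)
{
    assert(local >= 0 && local < NR_NODES && len > 0);
    out[0] = '\0';

    if (by_node) {
        /* Locality first: list every zone of a node before moving on. */
        for (int i = 0; i < NR_NODES; i++) {
            int node = (local + i) % NR_NODES;
            for (int z = NR_ZONES - 1; z >= 0; z--)
                if (zone_present[node][z])
                    emit(out, len, node, z);
        }
    } else {
        /* Zone type first: list the highest zone on all nodes before lower ones. */
        for (int z = NR_ZONES - 1; z >= 0; z--)
            for (int i = 0; i < NR_NODES; i++) {
                int node = (local + i) % NR_NODES;
                if (zone_present[node][z])
                    emit(out, len, node, z);
            }
    }
}
```

Calling `build_zonelist(0, 1, buf, sizeof(buf))` yields the Node-based order for node(0) given in the message, and passing 0 for `by_node` yields the Zone-based order with node(0)'s DMA at the tail.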
2007-07-11 | security: Protection for exploiting null dereference using mmap | Eric Paris
Add a new security check on mmap operations to see if the user is attempting to mmap to low area of the address space. The amount of space protected is indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to 0, preserving existing behavior. This patch uses a new SELinux security class "memprotect." Policy already contains a number of allow rules like a_t self:process * (unconfined_t being one of them) which mean that putting this check in the process class (its best current fit) would make it useless as all user processes, which we also want to protect against, would be allowed. By taking the memprotect name of the new class it will also make it possible for us to move some of the other memory protect permissions out of 'process' and into the new class next time we bump the policy version number (which I also think is a good future idea) Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2007-07-11 | Merge branch 'for-2.6.23' into merge | Paul Mackerras
2007-07-10 | xip sendfile removal | Carsten Otte
This patch removes xip_file_sendfile, the sendfile implementation for xip, without replacement. Those customers that use xip on s390 are not using sendfile() as far as we know, and so far s390 is the only platform this could potentially be used on. Having sendfile is not a popular feature for execute in place file systems; however, we have a working implementation of splice_read() based on fs/splice.c if anyone asks for it. At this point in time, it does not seem preferable to merge splice_read() for xip because it causes extra maintenance effort due to code duplication and it requires struct page behind the xip memory segment. We'd like to get rid of that in favor of supporting flash based embedded platforms (Monta Vista work) soon. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 | shmem: convert to using splice instead of sendfile() | Hugh Dickins
Remove shmem_file_sendfile and resurrect shmem_readpage, as used by tmpfs to support loop and sendfile in 2.4 and 2.5. Now tmpfs can support splice, loop and sendfile in the simplest way, using generic_file_splice_read and generic_file_splice_write (with the aid of shmem_prepare_write). We could make some efficiency tweaks later, if there's a real need; but this is stable and works well as is. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 | sendfile: kill generic_file_sendfile() | Jens Axboe
It's no longer used. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-08 | mm: double mark_page_accessed() in read_cache_page_async() | Peter Zijlstra
Fix a post-2.6.21 regression. read_cache_page_async() has two invocations of mark_page_accessed() which will launch pages right onto the active list. Remove the first one, keeping the latter one. This avoids marking unwanted pages active (in the retry loop). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 | slub: remove useless EXPORT_SYMBOL | Christoph Lameter
kmem_cache_open is static. EXPORT_SYMBOL was leftover from some earlier time period where kmem_cache_open was usable outside of slub. (Fixes powerpc build error) Signed-off-by: Chrsitoph Lameter <clameter@sgi.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 | mm: fixup /proc/vmstat output | Peter Zijlstra
Line up the vmstat_text with zone_stat_item: enum zone_stat_item { /* First 128 byte cacheline (assuming 64 bit words) */ NR_FREE_PAGES, NR_INACTIVE, NR_ACTIVE, We currently have nr_active and nr_inactive reversed. [ "OK with patch, though using initializers can be handy to prevent such things in future: static const char * const vmstat_text[] = { [NR_FREE_PAGES] = "nr_free_pages", ..." - Alexey ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
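Alexey's suggestion about initializers refers to C99 designated initializers. A minimal sketch of that idea follows, assuming a trimmed-down enum with only the three items named in the message; `NR_VM_ZONE_STAT_ITEMS` is used here as the array bound and `vmstat_name` is a hypothetical accessor.

```c
#include <assert.h>
#include <string.h>

/* Trimmed-down enum; the real zone_stat_item has more entries. */
enum zone_stat_item {
    NR_FREE_PAGES,
    NR_INACTIVE,
    NR_ACTIVE,
    NR_VM_ZONE_STAT_ITEMS
};

/* Each string is tied to its enum value, so source order cannot misalign them. */
static const char *const vmstat_text[NR_VM_ZONE_STAT_ITEMS] = {
    [NR_FREE_PAGES] = "nr_free_pages",
    [NR_ACTIVE]     = "nr_active",      /* deliberately listed out of enum order */
    [NR_INACTIVE]   = "nr_inactive",
};

static const char *vmstat_name(enum zone_stat_item item)
{
    assert(item < NR_VM_ZONE_STAT_ITEMS);
    return vmstat_text[item];
}
```

With positional initializers, swapping two strings silently mislabels the counters, which is the bug this commit fixes. With designated initializers the index travels with the string.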
2007-07-05 | Fix slab redzone alignment | David Woodhouse
Commit b46b8f19c9cd435ecac4d9d12b39d78c137ecd66 fixed a couple of bugs by switching the redzone to 64 bits. Unfortunately, it neglected to ensure that the _second_ redzone, after the slab object, is aligned correctly. This caused illegal instruction faults on sparc32, which for some reason not entirely clear to me are not trapped and fixed up. Two things need to be done to fix this: - increase the object size, rounding up to alignof(long long) so that the second redzone can be aligned correctly. - If SLAB_STORE_USER is set but alignof(long long)==8, allow a full 64 bits of space for the user word at the end of the buffer, even though we may not _use_ the whole 64 bits. This patch should be a no-op on any 64-bit architecture or any 32-bit architecture where alignof(long long) == 4. Of the others, it's tested on ppc32 by myself and a very similar patch was tested on sparc32 by Mark Fortescue, who reported the new problem. Also, fix the conditions for FORCED_DEBUG, which hadn't been adjusted to the new sizes. Again noticed by Mark. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
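The first of the two fixes, rounding the object size up so the trailing redzone lands on a suitably aligned address, can be sketched as follows. This is not the slab code: `align_up` and `second_redzone_offset` are hypothetical helpers, and the layout (one 64-bit redzone, then the padded object, then the second redzone) is a simplification assumed for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of a, where a is a power of two. */
static size_t align_up(size_t x, size_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/*
 * Hypothetical layout helper: offset of the second redzone when a 64-bit
 * redzone precedes the object and the object size is padded so that the
 * trailing redzone is aligned for long long.
 */
static size_t second_redzone_offset(size_t obj_size)
{
    size_t first = sizeof(unsigned long long);
    size_t padded = align_up(obj_size, _Alignof(unsigned long long));

    return first + padded;
}
```

On a target where `_Alignof(unsigned long long)` is 4 the padding is smaller than where it is 8, in line with the message's note that the patch is a no-op where alignof(long long) == 4.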
2007-07-03 | SLUB: Make lockdep happy by not calling add_partial with interrupts enabled during bootstrap | Christoph Lameter
If we move the local_irq_enable() to the end of the function then add_partial() in early_kmem_cache_node_alloc() will be called with interrupts disabled like during regular operations. This makes lockdep happy. Signed-off-by: Christoph Lameter <clameter@sgi.com> Tested-by: Andre Noll <maan@systemlinux.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-01 | SLAB: remove WARN_ON_ONCE for zero sized objects for 2.6.22 release | Christoph Lameter
We agreed to remove the WARN_ON_ONCE before 2.6.22 is released. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28 | mm: kill validate_anon_vma to avoid mapcount BUG | Hugh Dickins
validate_anon_vma gave a useful check on the integrity of the anon_vma list when Andrea was developing obj rmap; but it was not enabled in SLES9 itself, nor in mainline, until Nick changed commented-out RMAP_DEBUG to configurable CONFIG_DEBUG_VM in 2.6.17. Now Petr Vandrovec reports that its BUG_ON(mapcount > 100000) can easily crash a CONFIG_DEBUG_VM=y system. That limit was just an arbitrary number to protect against an infinite loop. We could raise it to something enormous (depending on sizeof struct vma and size of memory?); but I rather think validate_anon_vma has outlived its usefulness, and is better just removed - which gives a magnificent performance boost to anything like Petr's test program ;) Of course, a very long anon_vma list is bad news for preemption latency, and I believe there has been one recent report of such: let's not forget that, but validate_anon_vma only makes it worse not better. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Petr Vandrovec <petr@vmware.com> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Andrea Arcangeli <andrea@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 | SLUB: fix behavior if the text output of list_locations overflows PAGE_SIZE | Christoph Lameter
If slabs are allocated or freed from a large set of call sites (typical for the kmalloc area) then we may create more output than fits into a single PAGE and sysfs only gives us one page. The output should be truncated. This patch fixes the checks to do the truncation properly. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-21 | [PARISC] Handle wrapping in expand_upwards() | Helge Deller
Function expand_upwards() did not guard against wrapping around to address 0. This fixes the adjtimex02 testcase from the Linux Test Project on a 32bit PARISC kernel. [expand_upwards is only used on parisc and ia64; it looks like it does the right thing on both. --kyle] Signed-off-by: Helge Deller <deller@gmx.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
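The class of bug described here, an upward-growing address that wraps past the top of the address space, can be shown with a generic overflow check. This is a sketch of the idea only and does not reproduce the patch: `page_align_up_checked` and `SKETCH_PAGE_SIZE` are names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumed page size for the example */

/*
 * Hypothetical helper: page-align `addr` upward, returning 0 on success and
 * -1 if the aligned end would wrap past the top of the address space.
 */
static int page_align_up_checked(uintptr_t addr, uintptr_t *aligned)
{
    uintptr_t mask = SKETCH_PAGE_SIZE - 1;
    uintptr_t end = (addr + mask) & ~mask;  /* wraps to a small value on overflow */

    assert(aligned != NULL);
    if (end < addr)
        return -1;  /* wrapped around toward address 0: refuse to grow */
    *aligned = end;
    return 0;
}
```

Unsigned arithmetic wraps silently in C, so the comparison `end < addr` is what turns the wraparound into a detectable error.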
2007-06-16 | SLUB: minimum alignment fixes | Christoph Lameter
If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUB's smallest kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes again) because the object size is padded to reach ARCH_KMALLOC_MINALIGN. Thus the size of the small slabs is all the same. No arch sets ARCH_KMALLOC_MINALIGN larger than 8, though, except mips, which for some reason wants a 128 byte alignment. This patch increases the size of the smallest cache if ARCH_KMALLOC_MINALIGN is greater than 8. In that case more and more of the smallest caches are disabled. If we do that then the count of the active general caches that is displayed on boot is not correct anymore since we may skip elements of the kmalloc array. So count them separately. This approach was tested by Havard yesterday. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 | Rework ptep_set_access_flags and fix sun4c | Benjamin Herrenschmidt
Some changes done a while ago to avoid pounding on ptep_set_access_flags and update_mmu_cache in some race situations break sun4c which requires update_mmu_cache() to always be called on minor faults. This patch reworks ptep_set_access_flags() semantics, implementations and callers so that it's now responsible for returning whether an update is necessary or not (basically whether the PTE actually changed). This allows fixing the sparc implementation to always return 1 on sun4c. [akpm@linux-foundation.org: fixes, cleanups] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: David Miller <davem@davemloft.net> Cc: Mark Fortescue <mark@mtfhpc.demon.co.uk> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 | SLUB slab validation: Alloc while interrupts are disabled must use GFP_ATOMIC | Christoph Lameter
The data structure to manage the information gathered about functions allocating and freeing objects is allocated when the list_lock has already been taken. We need to allocate with GFP_ATOMIC instead of GFP_KERNEL. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-15 | mm: Fix memory/cpu hotplug section mismatch and oops. | Paul Mundt
When building with memory hotplug enabled and cpu hotplug disabled, we end up with the following section mismatch: WARNING: mm/built-in.o(.text+0x4e58): Section mismatch: reference to .init.text: (between 'free_area_init_node' and '__build_all_zonelists') This happens as a result of: -> free_area_init_node() -> free_area_init_core() -> zone_pcp_init() <-- all __meminit up to this point -> zone_batchsize() <-- marked as __cpuinit fo This happens because CONFIG_HOTPLUG_CPU=n sets __cpuinit to __init, but CONFIG_MEMORY_HOTPLUG=y unsets __meminit. Changing zone_batchsize() to __devinit fixes this. __devinit is the only thing that is common between CONFIG_HOTPLUG_CPU=y and CONFIG_MEMORY_HOTPLUG=y. In the long run, perhaps this should be moved to another section identifier completely. Without this, memory hot-add of offline nodes (via hotadd_new_pgdat()) will oops if CPU hotplug is not also enabled. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> -- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
2007-06-14 | [POWERPC] unmap_vm_area becomes unmap_kernel_range for the public | Benjamin Herrenschmidt
This makes unmap_vm_area static and a wrapper around a new exported unmap_kernel_range that takes an explicit range instead of a vm_area struct. This makes it more versatile for code that wants to play with kernel page tables outside of the standard vmalloc area. (One example is some rework of the PowerPC PCI IO space mapping code that depends on that patch and removes some code duplication and horrible abuse of forged struct vm_struct). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-06-08 | Move three functions that are only needed for CONFIG_MEMORY_HOTPLUG | Stephen Rothwell
into the appropriate #ifdef. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 | SLUB: return ZERO_SIZE_PTR for kmalloc(0) | Christoph Lameter
Instead of returning the smallest available object return ZERO_SIZE_PTR. A ZERO_SIZE_PTR can be legitimately used as an object pointer as long as it is not dereferenced. The dereference of ZERO_SIZE_PTR causes a distinctive fault. kfree can handle a ZERO_SIZE_PTR in the same way as NULL. This enables functions to use zero sized objects. e.g. n = number of objects. objects = kmalloc(n * sizeof(object)); for (i = 0; i < n; i++) objects[i].x = y; kfree(objects); Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
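The pattern in this message can be tried in userspace with a stand-in sentinel. The sketch below assumes hypothetical wrappers (`sketch_alloc`, `sketch_free`, `fill_objects`) around malloc/free and a sentinel value chosen for the example. It mirrors the described behaviour but is not the SLUB implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sentinel that is neither NULL nor a valid object; value chosen for the sketch. */
#define SKETCH_ZERO_SIZE_PTR ((void *)16)

/* Hypothetical userspace wrappers mirroring the described behaviour. */
static void *sketch_alloc(size_t size)
{
    if (size == 0)
        return SKETCH_ZERO_SIZE_PTR;  /* success, but nothing to dereference */
    return malloc(size);
}

static void sketch_free(void *p)
{
    if (p == NULL || p == SKETCH_ZERO_SIZE_PTR)
        return;  /* both are no-ops, as the message says of kfree */
    free(p);
}

struct object { int x; };

/* The pattern from the message: works unchanged when n == 0. */
static int fill_objects(size_t n, int y)
{
    struct object *objects;

    assert(n <= SIZE_MAX / sizeof(*objects));  /* guard the size multiplication */
    objects = sketch_alloc(n * sizeof(*objects));
    if (objects == NULL)
        return -1;  /* real allocation failure, distinct from the n == 0 case */
    for (size_t i = 0; i < n; i++)
        objects[i].x = y;
    sketch_free(objects);
    return 0;
}
```

The point of a non-NULL sentinel is that callers can keep treating NULL as "allocation failed" while a zero-length request still succeeds. The loop body never runs for n == 0, so the sentinel is never dereferenced.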
2007-06-08 | slab: fix alien cache handling | Christoph Lameter
cache_free_alien must be called regardless if we use alien caches or not. cache_free_alien() will do the right thing if there are no alien caches available. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 | mount -t tmpfs -o mpol=: check nodes online | Hugh Dickins
Randy Dunlap reports that a tmpfs, mounted with NUMA mpol= specifying an offline node, crashes as soon as data is allocated upon it. Now restrict it to online nodes, where before it restricted to MAX_NUMNODES. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Robin Holt <holt@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Tested-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 | sh: memory hot-add for sparsemem users support. | Paul Mundt
This enables simple hotplug support for sparsemem users. Presently this only permits memory being added in to node 0 on ZONE_NORMAL. Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-06-01 | SLUB: fix locking for hotplug callbacks | Christoph Lameter
Hotplug callbacks are performed with interrupts enabled. Slub requires interrupts to be disabled for flushing caches. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 | memory hotplug: fix unnecessary calling of init_currently_empty_zone() | Yasunori Goto
zone->present_pages is updated in online_pages(). But __add_zone() can be called twice or more before online_pages() is called, so init_currently_empty_zone() can be called more times than necessary. This causes a memory leak of the zone's wait_table. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 | x86_64: allocate sparsemem memmap above 4G | Zou Nan hai
On systems with a huge amount of physical memory, VFS cache and memory memmap may eat all available system memory under 4G, then the system may fail to allocate the swiotlb bounce buffer. There was a fix for this issue in arch/x86_64/mm/numa.c, but that fix does not cover the sparsemem model. This patch adds a fix to the sparsemem model by first trying to allocate memmap above 4G. Signed-off-by: Zou Nan hai <nanhai.zou@intel.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-31 | m68k: discontinuous memory support | Roman Zippel
Fix support for discontinuous memory Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>