aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2007-02-11[PATCH] uml: network driver whitespace and style fixesJeff Dike
Some whitespace and coding style cleanups in the network driver code. Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Acked-by: Jeff Garzik <jeff@garzik.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] uml: add locking to network transport registrationJeff Dike
The registration of host network transports needed some locking. The transport list itself is locked, but calls to the registration routines are not. This is compensated for by checking that a transport structure is not yet on any list. I also took the opportunity to const all fields in the transport structure except the list, which obviously can be modified. Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] uml: lock the irqs_to_free listJeff Dike
Fix (i.e. add some) the locking around the irqs_to_free list. Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] uml: console whitespace and comment tidyingJeff Dike
Some comment and whitespace cleanups in the console and mconsole code. Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] uml: return hotplug errors to hostJeff Dike
I noticed that errors happening while hotplugging devices from the host were never returned back to the mconsole client. In some cases, success was returned instead of even an information-free error. This patch cleans that up by having the low-level configuration code pass back an error string along with an error code. At the top level, which knows whether it is early boot time or responding to an mconsole request, the string is printk'd or returned to the mconsole client. There are also whitespace and trivial code cleanups in the surrounding code. Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] uml: console locking fixesJeff Dike
Clean up the console driver locking. There are various problems here, including sleeping under a spinlock and spinlock recursion, some of which are fixed here. This patch deals with the locking involved with opens and closes. The problem is that an mconsole request to change a console's configuration can race with an open. Changing a configuration should only be done when a console isn't opened. Also, an open must be looking at a stable configuration. In addition, a get configuration request must observe the same locking since it must also see a stable configuration. With the old locking, it was possible for this to hang indefinitely in some cases because open would block for a long time waiting for a connection from the host while holding the lock needed by the mconsole request. As explained in the long comment, this is fixed by adding a spinlock for the use count and configuration and a mutex for the actual open and close. Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] CRIS: TLB handling: turn local_save_flags() + local_irq_disable() ↵Jiri Kosina
into local_irq_save() TLB handling for CRIS contains local_irq_disable() after local_save_flags(). Turn this into local_irq_save(). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Mikael Starvik <starvik@axis.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] CRIS: user ARRAY_SIZE macro when appropriateAhmed S. Darwish
Use ARRAY_SIZE macro already defined in linux/kernel.h Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Cc: Mikael Starvik <starvik@axis.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] CRIS: turn local_save_flags() + local_irq_disable() into ↵Jiri Kosina
local_irq_save() in headers Various headers for CRIS architecture contain local_irq_disable() after local_save_flags(). Turn it into local_irq_save(). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Mikael Starvik <starvik@axis.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] CRIS: local_irq_disable() is redundant after local_irq_save()Jiri Kosina
arch/cris/arch-v10/kernel/time.c::get_ns_in_jiffie() contains local_irq_disable() call after local_irq_save(). This looks redundant. arch/cris/kernel/time.c::do_gettimeofday() contains local_irq_disable() call after local_irq_save(). This looks redundant. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Mikael Starvik <starvik@axis.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] m68k: don't include asm-m68k/page.h in asm-m68k/user.hMike Frysinger
We don't actually use anything from asm-m68k/page.h in asm-m68k/user.h, so don't bother including it Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] M68K: user ARRAY_SIZE macro when appropriateAhmed S. Darwish
Use ARRAY_SIZE macro already defined in linux/kernel.h Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] M68KNOMMU: user ARRAY_SIZE macro when appropriateAhmed S. Darwish
Use ARRAY_SIZE macro already defined in linux/kernel.h Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Signed-off-by: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] kernel/time/clocksource.c needs struct task_struct on m68kMathieu Desnoyers
kernel/time/clocksource.c needs struct task_struct on m68k. Because it uses spin_unlock_irq(), which, on m68k, uses hardirq_count(), which uses preempt_count(), which needs to dereference struct task_struct, we have to include sched.h. Because it would cause a loop inclusion, we cannot include sched.h in any other of asm-m68k/system.h, linux/thread_info.h, linux/hardirq.h, which leaves this ugly include in a C file as the only simple solution. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] m68k: work around binutils tokenizer changeAl Viro
Recent as(1) doesn't think that . terminates a macro name, so getuser.l is _not_ treated as invoking getuser with .l as the first argument. arch/m68k/math-emu relies on old behaviour, so it gets a lot of undefined macros with more or less current binutils. Note that this behaviour remains in all recent versions and is unrelated to another binutils problems we used to have for a while (having (%a0)+ parsed as two arguments). This one is there to stay; it's an intentional and documented change. .irp <identifier> <words> [text] .endr expands to a copy of text per each word, with \<identifier> replaced with corresponding word. Again, what happens depends on whether gas_ident.x is treated as one or as two tokens; in the former case we'll get old_gas incremented once, in the latter - twice. The rest is obvious. Unlike .macro argument list _anything_ is explicitly allowed after .irp <identifier>; here we are on very safe ground. And yes, it does work with all gas variants I've got here (including vanilla 2.15, 2.16, 2.16.1 and 2.17, plus debian and FC binutils). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] m32r: cosmetic updates and trivial fixesHirokazu Takata
Cosmetic updates and trivial fixes of m32r arch-dependent files. - Remove RCS ID strings and trailing white lines - Other misc. cosmetic updates Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] m32r: fix kernel entry address of vmlinuxHirokazu Takata
This patch fixes the kernel entry point address of vmlinux. The m32r kernel entry address is 0x08002000 (physical). But, so far, the ENTRY point written in vmlinux.lds.S was not point the correct kernel entry address. (before fix) $ objdump -x vmlinux vmlinux: file format elf32-m32r-linux vmlinux architecture: m32r2, flags 0x00000112: EXEC_P, HAS_SYMS, D_PAGED start address 0x88002090 /* NG */ : Sections: Idx Name Size VMA LMA File off Algn 0 .empty_zero_page 00001000 88001000 88001000 00001000 2**12 CONTENTS, ALLOC, LOAD, DATA 1 .boot 0000008c 88002000 88002000 00002000 2**2 CONTENTS, ALLOC, LOAD, READONLY, CODE 2 .text 001ab694 88002090 88002090 00002090 2**4 CONTENTS, ALLOC, LOAD, READONLY, CODE : (after fix) $ objdump -x vmlinux vmlinux: file format elf32-m32r-linux vmlinux architecture: m32r2, flags 0x00000112: EXEC_P, HAS_SYMS, D_PAGED start address 0x08002000 /* OK */ : This fix also remedies the following GDB error message (of gdb-6.4 or after) at the first operation of kernel debugging: "Previous frame identical to this frame (corrupt stack?)". Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] m32r: update defconfig files for v2.6.19Hirokazu Takata
This patch upgrades defconfig files for all m32r platforms. Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] m32r: fix do_page_fault and update_mmu_cacheHirokazu Takata
Fix do_page_fault and update_mmu_cache. * Fix do_page_fault (vmalloc_fault:) to pass error_code correctly to update_mmu_cache by using a thread-fault code for all m32r chips. * Fix update_mmu_cache for OPSP chip - #ifdef CONFIG_CHIP_OPSP portion is a workaround of OPSP; Add a notfound-case operation to update_mmu_cache for OPSP like other m32r chip. - Fix pte_data that was not initialized if no entry found. Signed-off-by: Kazuhiro Inaoka <inaoka@linux-m32r.org> Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] m32r: build fix for processors without ISA_DSP_LEVEL2Hirokazu Takata
Additional fixes for processors without ISA_DSP_LEVEL2. sigcontext_t does not have dummy_acc1h, dummy_acc1l members any longer. Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] swsusp: Change pm_ops handling by userland interfaceRafael J. Wysocki
Make the userland interface of swsusp call pm_ops->finish() after enable_nonboot_cpus() and before resume_device(), as indicated by the recent discussion on Linux-PM (cf. http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html). This patch changes the SNAPSHOT_PMOPS ioctl so that its first function, PMOPS_PREPARE, only sets a switch turning the platform suspend mode on, and its last function, PMOPS_FINISH, only checks if the platform mode is enabled. This should allow the older userland tools to work with new kernels without any modifications. The changes here only affect the userland interface of swsusp. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Greg KH <greg@kroah.com> Cc: Nigel Cunningham <nigel@suspend2.net> Cc: Patrick Mochel <mochel@digitalimplant.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] swsusp-change-code-ordering-in-userc-sanityAndrew Morton
The compiler will do that. And if it doesn't, we don't want to either ;) Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Cc: Greg KH <greg@kroah.com> Cc: Nigel Cunningham <nigel@suspend2.net> Cc: Patrick Mochel <mochel@digitalimplant.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] swsusp: Change code ordering in user.cRafael J. Wysocki
Change the ordering of code in kernel/power/user.c so that device_suspend() is called before disable_nonboot_cpus() and device_resume() is called after enable_nonboot_cpus(). This is needed to make the userland suspend call pm_ops->finish() after enable_nonboot_cpus() and before device_resume(), as indicated by the recent discussion on Linux-PM (cf. http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html). The changes here only affect the userland interface of swsusp. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Greg KH <greg@kroah.com> Cc: Nigel Cunningham <nigel@suspend2.net> Cc: Patrick Mochel <mochel@digitalimplant.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] swsusp: Change code ordering in disk.cRafael J. Wysocki
Change the ordering of code in kernel/power/disk.c so that device_suspend() is called before disable_nonboot_cpus() and platform_finish() is called after enable_nonboot_cpus() and before device_resume(), as indicated by the recent discussion on Linux-PM (cf. http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html). The changes here only affect the built-in swsusp. [alexey.y.starikovskiy@linux.intel.com: fix LED blinking during image load] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Greg KH <greg@kroah.com> Cc: Nigel Cunningham <nigel@suspend2.net> Cc: Patrick Mochel <mochel@digitalimplant.org> Cc: Alexey Starikovskiy <alexey.y.starikovskiy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] PM: Change code ordering in main.cRafael J. Wysocki
As indicated in a recent thread on Linux-PM, it's necessary to call pm_ops->finish() before devce_resume(), but enable_nonboot_cpus() has to be called before pm_ops->finish() (cf. http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html). For consistency, it seems reasonable to call disable_nonboot_cpus() after device_suspend(). This way the suspend code will remain symmetrical with respect to the resume code and it may allow us to speed up things in the future by suspending and resuming devices and/or saving the suspend image in many threads. The following series of patches reorders the suspend and resume code so that nonboot CPUs are disabled after devices have been suspended and enabled before the devices are resumed. It also causes pm_ops->finish() to be called after enable_nonboot_cpus() wherever necessary. This patch: Change the ordering of code in kernel/power/main.c so that device_suspend() is called before disable_nonboot_cpus() and pm_ops->finish() is called after enable_nonboot_cpus() and before device_resume(), as indicated by recent discussion on Linux-PM (cf. http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Greg KH <greg@kroah.com> Cc: Nigel Cunningham <nigel@suspend2.net> Cc: Patrick Mochel <mochel@digitalimplant.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] ARM26: Use ARRAY_SIZE macro when appropriateAhmed S. Darwish
Use ARRAY_SIZE macro already defined in linux/kernel.h Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Acked-by: Ian Molton <spyro@f2s.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Alpha: increase PERCPU_ENOUGH_ROOMAneesh Kumar K.V
Module loading on Alpha was failing with error "Could not allocate 8 bytes percpu data". Looking at dmesg we have the below error "No per-cpu room for modules." Increase the PERCPU_ENOUGH_ROOM in a similar way as x86_64 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@gmail.com> Cc: <Jay.Estabrook@hp.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] make reading /proc/sys/kernel/cap-bould not require CAP_SYS_MODULEEric Paris
Reading /proc/sys/kernel/cap-bound requires CAP_SYS_MODULE. (see proc_dointvec_bset in kernel/sysctl.c) sysctl appears to drive all over proc reading everything it can get it's hands on and is complaining when it is being denied access to read cap-bound. Clearly writing to cap-bound should be a sensitive operation but requiring CAP_SYS_MODULE to read cap-bound seems a bit to strong. I believe the information could with reasonable certainty be obtained by looking at a bunch of the output of /proc/pid/status which has very low security protection, so at best we are just getting a little obfuscation of information. Currently SELinux policy has to 'dontaudit' capability checks for CAP_SYS_MODULE for things like sysctl which just want to read cap-bound. In doing so we also as a byproduct have to hide warnings of potential exploits such as if at some time that sysctl actually tried to load a module. I wondered if anyone would have a problem opening cap-bound up to read from anyone? Acked-by: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] do not disturb page referenced state when unmapping memory rangeKen Chen
When kernel unmaps an address range, it needs to transfer PTE state into page struct. Currently, kernel transfer access bit via mark_page_accessed(). The call to mark_page_accessed in the unmap path doesn't look logically correct. At unmap time, calling mark_page_accessed will causes page LRU state to be bumped up one step closer to more recently used state. It is causing quite a bit headache in a scenario when a process creates a shmem segment, touch a whole bunch of pages, then unmaps it. The unmapping takes a long time because mark_page_accessed() will start moving pages from inactive to active list. I'm not too much concerned with moving the page from one list to another in LRU. Sooner or later it might be moved because of multiple mappings from various processes. But it just doesn't look logical that when user asks a range to be unmapped, it's his intention that the process is no longer interested in these pages. Moving those pages to active list (or bumping up a state towards more active) seems to be an over reaction. It also prolongs unmapping latency which is the core issue I'm trying to solve. As suggested by Peter, we should still preserve the info on pte young pages, but not more. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ken Chen <kenchen@google.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] convert ramfs to use __set_page_dirty_no_writebackKen Chen
As pointed out by Hugh, ramfs would also benefit from using the new set_page_dirty aop method for memory backed file systems. Signed-off-by: Ken Chen <kenchen@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] simplify shmem_aops.set_page_dirty() methodKen Chen
shmem backed file does not have page writeback, nor it participates in backing device's dirty or writeback accounting. So using generic __set_page_dirty_nobuffers() for its .set_page_dirty aops method is a bit overkill. It unnecessarily prolongs shm unmap latency. For example, on a densely populated large shm segment (sevearl GBs), the unmapping operation becomes painfully long. Because at unmap, kernel transfers dirty bit in PTE into page struct and to the radix tree tag. The operation of tagging the radix tree is particularly expensive because it has to traverse the tree from the root to the leaf node on every dirty page. What's bothering is that radix tree tag is used for page write back. However, shmem is memory backed and there is no page write back for such file system. And in the end, we spend all that time tagging radix tree and none of that fancy tagging will be used. So let's simplify it by introduce a new aops __set_page_dirty_no_writeback and this will speed up shm unmap. Signed-off-by: Ken Chen <kenchen@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] zoneid: fix up calculations for ZONEID_PGSHIFTAndy Whitcroft
Currently if we have a non-zero ZONES_SHIFT we assume we are able to rely on that as the bottom edge of the ZONEID, if not then we use the NODES_PGOFF as the right end of either NODES _or_ SECTION. This latter is more luck than judgement and would be incorrect if we reordered the SECTION,NODE,ZONE options in the fields space. Really what we want is the lower of the right hand end of the two fields we are using (either NODE,ZONE or SECTION,ZONE). Codify that explicitly. As always allow for there being no bits in either of the fields, such as might be valid in a non-numa machine with only a zone NORMAL. I have checked that the compiler is still able to constant fold all of this away correctly. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Set CONFIG_ZONE_DMA for arches with GENERIC_ISA_DMAChristoph Lameter
As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables the ISA DMA channel management. Other functionality may still expect GFP_DMA to provide memory below 16M. So we need to make sure that CONFIG_ZONE_DMA is set independent of CONFIG_GENERIC_ISA_DMA. Undo the modifications to mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA and set theses explicitly in each arches Kconfig. Reviews must occur for each arch in order to determine if ZONE_DMA can be switched off. It can only be switched off if we know that all devices supported by a platform are capable of performing DMA transfers to all of memory (Some arches already support this: uml, avr32, sh sh64, parisc and IA64/Altix). In order to switch ZONE_DMA off conditionally, one would have to establish a scheme by which one can assure that no drivers are enabled that are only capable of doing I/O to a part of memory, or one needs to provide an alternate means of performing an allocation from a specific range of memory (like provided by alloc_pages_range()) and insure that all drivers use that call. In that case the arches alloc_dma_coherent() may need to be modified to call alloc_pages_range() instead of relying on GFP_DMA. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] optional ZONE_DMA: remove ZONE_DMA remains from sh/sh64Christoph Lameter
sh / sh64: Remove ZONE_DMA remains. Both arches do not need ZONE_DMA Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] optional ZONE_DMA: remove ZONE_DMA remains from pariscChristoph Lameter
Remove ZONE_DMA remains from parisc so that kernels are build without ZONE_DMA. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <willy@debian.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] optional ZONE_DMA: optional ZONE_DMA for ia64Christoph Lameter
ZONE_DMA less operation for IA64 SGI platform Disable ZONE_DMA for SGI SN2. All memory is addressable by all devices and we do not need any special memory pool. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] optional ZONE_DMA: optional ZONE_DMA in the VMChristoph Lameter
Make ZONE_DMA optional in core code. - ifdef all code for ZONE_DMA and related definitions following the example for ZONE_DMA32 and ZONE_HIGHMEM. - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of 0. - Modify the VM statistics to work correctly without a DMA zone. - Modify slab to not create DMA slabs if there is no ZONE_DMA. [akpm@osdl.org: cleanup] [jdike@addtoit.com: build fix] [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <willy@debian.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] optional ZONE_DMA: introduce CONFIG_ZONE_DMAChristoph Lameter
This patch simply defines CONFIG_ZONE_DMA for all arches. We later do special things with CONFIG_ZONE_DMA after the VM and an arch are prepared to work without ZONE_DMA. CONFIG_ZONE_DMA can be defined in two ways depending on how an architecture handles ISA DMA. First if CONFIG_GENERIC_ISA_DMA is set by the arch then we know that the arch needs ZONE_DMA because ISA DMA devices are supported. We can catch this in mm/Kconfig and do not need to modify arch code. Second, arches may use ZONE_DMA in an unknown way. We set CONFIG_ZONE_DMA for all arches that do not set CONFIG_GENERIC_ISA_DMA in order to insure backwards compatibility. The arches may later undefine ZONE_DMA if their arch code has been verified to not depend on ZONE_DMA. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <willy@debian.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] optional ZONE_DMA: deal with cases of ZONE_DMA meaning the first zoneChristoph Lameter
This patchset follows up on the earlier work in Andrew's tree to reduce the number of zones. The patches allow to go to a minimum of 2 zones. This one allows also to make ZONE_DMA optional and therefore the number of zones can be reduced to one. ZONE_DMA is usually used for ISA DMA devices. There are a number of reasons why we would not want to have ZONE_DMA 1. Some arches do not need ZONE_DMA at all. 2. With the advent of IOMMUs DMA zones are no longer needed. The necessity of DMA zones may drastically be reduced in the future. This patchset allows a compilation of a kernel without that overhead. 3. Devices that require ISA DMA get rare these days. All my systems do not have any need for ISA DMA. 4. The presence of an additional zone unecessarily complicates VM operations because it must be scanned and balancing logic must operate on its. 5. With only ZONE_NORMAL one can reach the situation where we have only one zone. This will allow the unrolling of many loops in the VM and allows the optimization of varous code paths in the VM. 6. Having only a single zone in a NUMA system results in a 1-1 correspondence between nodes and zones. Various additional optimizations to critical VM paths become possible. Many systems today can operate just fine with a single zone. If you look at what is in ZONE_DMA then one usually sees that nothing uses it. The DMA slabs are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL will be empty instead). On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty. Why constantly look at an empty zone in /proc/zoneinfo and empty slab in /proc/slabinfo? Non i386 also frequently have no need for ZONE_DMA and zones stay empty. The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA). The RFC posted earlier (see http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots of #ifdefs in them. An effort has been made to minize the number of #ifdefs and make this as compact as possible. The job was made much easier by the ongoing efforts of others to extract common arch specific functionality. I have been running this for awhile now on my desktop and finally Linux is using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched: christoph@pentium940:~$ cat /proc/zoneinfo Node 0, zone Normal pages free 4435 min 1448 low 1810 high 2172 active 241786 inactive 210170 scanned 0 (a: 0 i: 0) spanned 524224 present 524224 nr_anon_pages 61680 nr_mapped 14271 nr_file_pages 390264 nr_slab_reclaimable 27564 nr_slab_unreclaimable 1793 nr_page_table_pages 449 nr_dirty 39 nr_writeback 0 nr_unstable 0 nr_bounce 0 cpu: 0 pcp: 0 count: 156 high: 186 batch: 31 cpu: 0 pcp: 1 count: 9 high: 62 batch: 15 vm stats threshold: 20 cpu: 1 pcp: 0 count: 177 high: 186 batch: 31 cpu: 1 pcp: 1 count: 12 high: 62 batch: 15 vm stats threshold: 20 all_unreclaimable: 0 prev_priority: 12 temp_priority: 12 start_pfn: 0 This patch: In two places in the VM we use ZONE_DMA to refer to the first zone. If ZONE_DMA is optional then other zones may be first. So simply replace ZONE_DMA with zone 0. This also fixes ZONETABLE_PGSHIFT. If we have only a single zone then ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone number related to a pgdat. However, we still need a zonetable to index all the zones for each node if this is a NUMA system. Therefore define ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags. [apw@shadowen.org: fix mismerge] Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <willy@debian.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Drop get_zone_counts()Christoph Lameter
Values are available via ZVC sums. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Drop __get_zone_counts()Christoph Lameter
Values are readily available via ZVC per node and global sums. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Drop nr_free_pages_pgdat()Christoph Lameter
Function is unnecessary now. We can use the summing features of the ZVCs to get the values we need. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Drop free_pages()Christoph Lameter
nr_free_pages is now a simple access to a global variable. Make it a macro instead of a function. The nr_free_pages now requires vmstat.h to be included. There is one occurrence in power management where we need to add the include. Directly refrer to global_page_state() there to clarify why the #include was added. [akpm@osdl.org: arm build fix] [akpm@osdl.org: sparc64 build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Reorder ZVCs according to cachelineChristoph Lameter
The global and per zone counter sums are in arrays of longs. Reorder the ZVCs so that the most frequently used ZVCs are put into the same cacheline. That way calculations of the global, node and per zone vm state touches only a single cacheline. This is mostly important for 64 bit systems were one 128 byte cacheline takes only 8 longs. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Use ZVC for free_pagesChristoph Lameter
This is again simplifies some of the VM counter calculations through the use of the ZVC consolidated counters. [michal.k.k.piotrowski@gmail.com: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Use ZVC for inactive and active countsChristoph Lameter
The determination of the dirty ratio to determine writeback behavior is currently based on the number of total pages on the system. However, not all pages in the system may be dirtied. Thus the ratio is always too low and can never reach 100%. The ratio may be particularly skewed if large hugepage allocations, slab allocations or device driver buffers make large sections of memory not available anymore. In that case we may get into a situation in which f.e. the background writeback ratio of 40% cannot be reached anymore which leads to undesired writeback behavior. This patchset fixes that issue by determining the ratio based on the actual pages that may potentially be dirty. These are the pages on the active and the inactive list plus free pages. The problem with those counts has so far been that it is expensive to calculate these because counts from multiple nodes and multiple zones will have to be summed up. This patchset makes these counters ZVC counters. This means that a current sum per zone, per node and for the whole system is always available via global variables and not expensive anymore to calculate. The patchset results in some other good side effects: - Removal of the various functions that sum up free, active and inactive page counts - Cleanup of the functions that display information via the proc filesystem. This patch: The use of a ZVC for nr_inactive and nr_active allows a simplification of some counter operations. More ZVC functionality is used for sums etc in the following patches. [akpm@osdl.org: UP build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] page_mkwrite caller race fixHugh Dickins
After do_wp_page has tested page_mkwrite, it must release old_page after acquiring page table lock, not before: at some stage that ordering got reversed, leaving a (very unlikely) window in which old_page might be truncated, freed, and reused in the same position. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] typeof __page_to_pfn with SPARSEMEM=yRandy Dunlap
With CONFIG_SPARSEMEM=y: mm/rmap.c:579: warning: format '%lx' expects type 'long unsigned int', but argument 2 has type 'int' Make __page_to_pfn() return unsigned long. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] /proc/zoneinfo: fix vm stats displayAndrew Morton
This early break prevents us from displaying info for the vm stats thresholds if the zone doesn't have any pages in its per-cpu pagesets. So my 800MB i386 box says: Node 0, zone DMA pages free 2365 min 16 low 20 high 24 active 0 inactive 0 scanned 0 (a: 0 i: 0) spanned 4096 present 4044 nr_anon_pages 0 nr_mapped 1 nr_file_pages 0 nr_slab_reclaimable 0 nr_slab_unreclaimable 0 nr_page_table_pages 0 nr_dirty 0 nr_writeback 0 nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 protection: (0, 868, 868) pagesets all_unreclaimable: 0 prev_priority: 12 start_pfn: 0 Node 0, zone Normal pages free 199713 min 934 low 1167 high 1401 active 10215 inactive 4507 scanned 0 (a: 0 i: 0) spanned 225280 present 222420 nr_anon_pages 2685 nr_mapped 1110 nr_file_pages 12055 nr_slab_reclaimable 2216 nr_slab_unreclaimable 1527 nr_page_table_pages 213 nr_dirty 0 nr_writeback 0 nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 protection: (0, 0, 0) pagesets cpu: 0 pcp: 0 count: 152 high: 186 batch: 31 cpu: 0 pcp: 1 count: 13 high: 62 batch: 15 vm stats threshold: 16 cpu: 1 pcp: 0 count: 34 high: 186 batch: 31 cpu: 1 pcp: 1 count: 10 high: 62 batch: 15 vm stats threshold: 16 all_unreclaimable: 0 prev_priority: 12 start_pfn: 4096 Just nuke all that search-for-the-first-non-empty-pageset code. Dunno why it was there in the first place.. Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Avoid excessive sorting of early_node_map[]Mel Gorman
find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort early_node_map[] on every call. This is an excessive amount of sorting and that can be avoided. This patch always searches the whole early_node_map[] in find_min_pfn_for_node() instead of returning the first value found. The map is then only sorted once when required. Successfully boot tested on a number of machines. [akpm@osdl.org: cleanup] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>