aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2008-06-06sched: fix the cpuprio count reallyThomas Gleixner
Peter pointed out that the last version of the "fix" was still one off under certain circumstances. Use BITS_TO_LONG instead to get an accurate result. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06sched: fix cpupri priocountGregory Haskins
A rounding error was pointed out by Peter Zijlstra which would result in the structure holding priorities to be off by one. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06sched: fix cpupri hotplug supportGregory Haskins
The RT folks over at RedHat found an issue w.r.t. hotplug support which was traced to problems with the cpupri infrastructure in the scheduler: https://bugzilla.redhat.com/show_bug.cgi?id=449676 This bug affects 23-rt12+, 24-rtX, 25-rtX, and sched-devel. This patch applies to 25.4-rt4, though it should trivially apply to most cpupri enabled kernels mentioned above. It turned out that the issue was that offline cpus could get inadvertently registered with cpupri so that they were erroneously selected during migration decisions. The end result would be an OOPS as the offline cpu had tasks routed to it. This patch generalizes the old join/leave domain interface into an online/offline interface, and adjusts the root-domain/hotplug code to utilize it. I was able to easily reproduce the issue prior to this patch, and am no longer able to reproduce it after this patch. I can offline cpus indefinately and everything seems to be in working order. Thanks to Arnaldo (acme), Thomas, and Peter for doing the legwork to point me in the right direction. Also thank you to Peter for reviewing the early iterations of this patch. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06sched: print the sd->level in sched_domain_debug codeGautham R Shenoy
While printing out the visual representation of the sched-domains, print the level (MC, SMT, CPU, NODE, ... ) of each of the sched_domains. Credit: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-06sched: add comments for ifdefs in sched.cDhaval Giani
make sched.c easier to read. Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06sched: print module list in the "scheduling while atomic" warningArjan van de Ven
For the normal WARN_ON() etc we added a print-the-modules-list already, which is very useful to figure out candidates for certain types of bugs. This patch adds the same print to the "scheduling while atomic" BUG warning, for the same reason: when we get here it's very useful to see which modules are loaded, to narrow down the candidate code list. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: mingo@elte.hu Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06sched: fix defined-but-unused warningRabin Vincent
Fix this warning, which appears with !CONFIG_SMP: kernel/sched.c:1216: warning: `init_hrtick' defined but not used Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06namespacecheck: fixes in kernel/sched.cThomas Gleixner
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06sched: check for SD_SERIALIZE atomically in rebalance_domains()Dmitry Adamushko
Nothing really serious here, mainly just a matter of nit-picking :-/ From: Dmitry Adamushko <dmitry.adamushko@gmail.com> For CONFIG_SCHED_DEBUG && CONFIG_SYSCT configs, sd->flags can be altered while being manipulated in rebalance_domains(). Let's do an atomic check. We rely here on the atomicity of read/write accesses for aligned words. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06sched: fix SCHED_OTHER balance iterator to include all tasksGregory Haskins
The currently logic inadvertently skips the last task on the run-queue, resulting in missed balance opportunities. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: David Bahi <dbahi@novell.com> CC: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06sched: use a 2-d bitmap for searching lowest-pri CPUGregory Haskins
The current code use a linear algorithm which causes scaling issues on larger SMP machines. This patch replaces that algorithm with a 2-dimensional bitmap to reduce latencies in the wake-up path. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06sched: make !hrtick fasterMike Galbraith
it is safe to ignore timers and flags when the feature is disabled. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06sched: prioritize non-migratable tasks over migratable onesGregory Haskins
Dmitry Adamushko pointed out a known flaw in the rt-balancing algorithm that could allow suboptimal balancing if a non-migratable task gets queued behind a running migratable one. It is discussed in this thread: http://lkml.org/lkml/2008/4/22/296 This issue has been further exacerbated by a recent checkin to sched-devel (git-id 5eee63a5ebc19a870ac40055c0be49457f3a89a3). >From a pure priority standpoint, the run-queue is doing the "right" thing. Using Dmitry's nomenclature, if T0 is on cpu1 first, and T1 wakes up at equal or lower priority (affined only to cpu1) later, it *should* wait for T0 to finish. However, in reality that is likely suboptimal from a system perspective if there are other cores that could allow T0 and T1 to run concurrently. Since T1 can not migrate, the only choice for higher concurrency is to try to move T0. This is not something we addessed in the recent rt-balancing re-work. This patch tries to enhance the balancing algorithm by accomodating this scenario. It accomplishes this by incorporating the migratability of a task into its priority calculation. Within a numerical tsk->prio, a non-migratable task is logically higher than a migratable one. We maintain this by introducing a new per-priority queue (xqueue, or exclusive-queue) for holding non-migratable tasks. The scheduler will draw from the xqueue over the standard shared-queue (squeue) when available. There are several details for utilizing this properly. 1) During task-wake-up, we not only need to check if the priority preempts the current task, but we also need to check for this non-migratable condition. Therefore, if a non-migratable task wakes up and sees an equal priority migratable task already running, it will attempt to preempt it *if* there is a likelyhood that the current task will find an immediate home. 2) Tasks only get this non-migratable "priority boost" on wake-up. Any requeuing will result in the non-migratable task being queued to the end of the shared queue. This is an attempt to prevent the system from being completely unfair to migratable tasks during things like SCHED_RR timeslicing. I am sure this patch introduces potentially "odd" behavior if you concoct a scenario where a bunch of non-migratable threads could starve migratable ones given the right pattern. I am not yet convinced that this is a problem since we are talking about tasks of equal RT priority anyway, and there never is much in the way of guarantees against starvation under that scenario anyway. (e.g. you could come up with a similar scenario with a specific timing environment verses an affinity environment). I can be convinced otherwise, but for now I think this is "ok". Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-04Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdbts: Use HW breakpoints with CONFIG_DEBUG_RODATA kgdb: use common ascii helpers and put_unaligned_be32 helper
2008-05-31capabilities: remain source compatible with 32-bit raw legacy capability ↵Andrew G. Morgan
support. Source code out there hard-codes a notion of what the _LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the raw capability system calls capget() and capset(). Its unfortunate, but true. Since the confusing header file has been in a released kernel, there is software that is erroneously using 64-bit capabilities with the semantics of 32-bit compatibilities. These recently compiled programs may suffer corruption of their memory when sys_getcap() overwrites more memory than they are coded to expect, and the raising of added capabilities when using sys_capset(). As such, this patch does a number of things to clean up the situation for all. It 1. forces the _LINUX_CAPABILITY_VERSION define to always retain its legacy value. 2. adopts a new #define strategy for the kernel's internal implementation of the preferred magic. 3. deprecates v2 capability magic in favor of a new (v3) magic number. The functionality of v3 is entirely equivalent to v2, the only difference being that the v2 magic causes the kernel to log a "deprecated" warning so the admin can find applications that may be using v2 inappropriately. [User space code continues to be encouraged to use the libcap API which protects the application from details like this. libcap-2.10 is the first to support v3 capabilities.] Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518. Thanks to Bojan Smojver for the report. [akpm@linux-foundation.org: s/depreciate/deprecate/g] [akpm@linux-foundation.org: be robust about put_user size] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Bojan Smojver <bojan@rexursive.com> Cc: stable@kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2008-05-29Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: re-tune NUMA topologies sched: stop wake_affine from causing serious imbalance sched: fix sched_clock_cpu() revert ("sched: fair-group: SMP-nice for group scheduling") sched: cleanup show_schedstat(): fix memleak sched: unite unlikely pairs in rt_policy() and schedule_debug() revert ("sched: fair: weight calculations")
2008-05-29Merge commit 'linus/master' into sched-fixes-for-linusIngo Molnar
2008-05-29sched: stop wake_affine from causing serious imbalanceMike Galbraith
Prevent short-running wakers of short-running threads from overloading a single cpu via wakeup affinity, and wire up disconnected debug option. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29sched: fix sched_clock_cpu()Peter Zijlstra
Make sched_clock_cpu() return 0 before it has been initialized and avoid corrupting its state due to doing so. This fixes the weird printk timestamp jump reported. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-05-29revert ("sched: fair-group: SMP-nice for group scheduling")Ingo Molnar
Yanmin Zhang reported: Comparing with 2.6.25, volanoMark has big regression with kernel 2.6.26-rc1. It's about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito. With bisect, I located the following patch: | 18d95a2832c1392a2d63227a7a6d433cb9f2037e is first bad commit | commit 18d95a2832c1392a2d63227a7a6d433cb9f2037e | Author: Peter Zijlstra <a.p.zijlstra@chello.nl> | Date: Sat Apr 19 19:45:00 2008 +0200 | | sched: fair-group: SMP-nice for group scheduling Revert it so that we get v2.6.25 behavior. Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29sched: cleanupIngo Molnar
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29show_schedstat(): fix memleakAdrian Bunk
The Coverity checker spotted a memleak introduced by commit 39106dcf85285e78f3b290022122c76f851379b8 (cpumask: use new cpus_scnprintf function). It seems the kfree() got lost between v2 and v3 of this patch... Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Mike Travis <travis@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29sched: unite unlikely pairs in rt_policy() and schedule_debug()Roel Kluin
Removes obfuscation and may improve assembly. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29revert ("sched: fair: weight calculations")Ingo Molnar
Yanmin Zhang reported: Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has many regressions with 2.6.26-rc1: 1) 8-core stoakley: 28%; 2) 16-core tigerton: 20%; 3) Itanium Montvale: 50%. Bisect located this patch: | 8f1bc385cfbab474db6c27b5af1e439614f3025c is first bad commit | commit 8f1bc385cfbab474db6c27b5af1e439614f3025c | Author: Peter Zijlstra <a.p.zijlstra@chello.nl> | Date: Sat Apr 19 19:45:00 2008 +0200 | | sched: fair: weight calculations Revert it to the 2.6.25 state. Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-28kgdb: use common ascii helpers and put_unaligned_be32 helperHarvey Harrison
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2008-05-28splice: fix sendfile() issue with relayTom Zanussi
Splice isn't always incrementing the ppos correctly, which broke relay splice. Signed-off-by: Tom Zanussi <zanussi@comcast.net> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-26posix timers: discard SI_TIMER signals on execOleg Nesterov
Based on Roland's patch. This approach was suggested by Austin Clements from the very beginning, and then by Linus. As Austin pointed out, the execing task can be killed by SI_TIMER signal because exec flushes the signal handlers, but doesn't discard the pending signals generated by posix timers. Perhaps not a bug, but people find this surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-26posix timers: sigqueue_free: don't free sigqueue if it is queuedOleg Nesterov
Currently sigqueue_free() removes sigqueue from list, but doesn't cancel the pending signal. This is not consistent, the task should either receive the "full" signal along with siginfo_t, or it shouldn't receive the signal at all. Change sigqueue_free() to clear SIGQUEUE_PREALLOC but leave sigqueue on list if it is queued. This is a user-visible change. If the signal is blocked, it stays queued after sys_timer_delete() until unblocked with the "stale" si_code/si_value, and of course it is still counted wrt RLIMIT_SIGPENDING which also limits the number of posix timers. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24cgroups: remove node_ prefix_from ns subsystemCedric Le Goater
This is a slight change in the namespace cgroup subsystem api. The change is that previously when cgroup_clone() was called (currently only from the unshare path in ns_proxy cgroup, you'd get a new group named "node_$pid" whereas now you'll get a group named after just your pid.) The only users who would notice it are those who are using the ns_proxy cgroup subsystem to auto-create cgroups when namespaces are unshared - something of an experimental feature, which I think really needs more complete container/namespace support in order to be useful. I suspect the only users are Cedric and Serge, or maybe a few others on containers@lists.linux-foundation.org. And in fact it would only be noticed by the users who make the assumption about how the name is generated, rather than getting it from the /proc/<pid>/cgroups file for the process in question. Whether the change is actually needed or not I'm fairly agnostic on, but I guess it is more elegant to just use the pid as the new group name rather than adding a fairly arbitrary "node_" prefix on the front. [menage@google.com: provided changelog] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: "Paul Menage" <menage@google.com> Cc: "Serge E. Hallyn" <serue@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24sys_prctl(): fix return of uninitialized valueShi Weihua
If none of the switch cases match, the PR_SET_PDEATHSIG and PR_SET_DUMPABLE cases of the switch statement will never write to local variable `error'. Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com> Cc: Andrew G. Morgan <morgan@kernel.org> Acked-by: "Serge E. Hallyn" <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24signals: fix sigqueue_free() vs __exit_signal() raceOleg Nesterov
__exit_signal() does flush_sigqueue(tsk->pending) outside of ->siglock. This can race with another thread doing sigqueue_free(), we can free the same SIGQUEUE_PREALLOC sigqueue twice or corrupt the pending->list. Note that even sys_exit_group() can trigger this race, not only sys_timer_delete(). Move the callsite of flush_sigqueue(tsk->pending) under ->siglock. This patch doesn't touch flush_sigqueue(->shared_pending) below, it is called when there are no other threads which can play with signals, and sigqueue_free() can't be used outside of our thread group. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-23stop_machine: make stop_machine_run more virtualization friendlyChristian Borntraeger
On kvm I have seen some rare hangs in stop_machine when I used more guest cpus than hosts cpus. e.g. 32 guest cpus on 1 host cpu triggered the hang quite often. I could also reproduce the problem on a 4 way z/VM host with a 64 way guest. It turned out that the guest was consuming all available cpus mostly for spinning on scheduler locks like rq->lock. This is expected as the threads are calling yield all the time. The problem is now, that the host scheduling decisings together with the guest scheduling decisions and spinlocks not being fair managed to create an interesting scenario similar to a live lock. (Sometimes the hang resolved itself after some minutes) Changing stop_machine to yield the cpu to the hypervisor when yielding inside the guest fixed the problem for me. While I am not completely happy with this patch, I think it causes no harm and it really improves the situation for me. I used cpu_relax for yielding to the hypervisor, does that work on all architectures? p.s.: If you want to reproduce the problem, cpu hotplug and kprobes use stop_machine_run and both triggered the problem after some retries. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> CC: Ingo Molnar <mingo@elte.hu> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23modules: proper cleanup of kobject without CONFIG_SYSFSDenis V. Lunev
kobject: '<NULL>' (ffffffffa0104050): is not initialized, yet kobject_put() is being called. ------------[ cut here ]------------ WARNING: at /home/den/src/linux-netns26/lib/kobject.c:583 kobject_put+0x53/0x55() Modules linked in: ipv6 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ide_cd_mod cdrom button [last unloaded: pktgen] comm: rmmod Tainted: G W 2.6.26-rc3 #585 Call Trace: [<ffffffff802359ab>] warn_on_slowpath+0x58/0x7a [<ffffffff80236aca>] ? printk+0x67/0x69 [<ffffffff80236aca>] ? printk+0x67/0x69 [<ffffffff80324289>] kobject_put+0x53/0x55 [<ffffffff8025e2ee>] free_module+0x87/0xfa [<ffffffff8025fee5>] sys_delete_module+0x178/0x1e1 [<ffffffff804b1e70>] ? lockdep_sys_exit_thunk+0x35/0x67 [<ffffffff804b1dff>] ? trace_hardirqs_on_thunk+0x35/0x3a [<ffffffff8020c0bb>] system_call_after_swapgs+0x7b/0x80 ---[ end trace 8f5aafa7f6406cf8 ]--- mod->mkobj.kobj is not initialized without CONFIG_SYSFS. Do not call kobject_put in this case. Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23module loading ELF handling: use SELFMAG instead of numeric constantCyrill Gorcunov
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-19Merge branch 'audit.b51' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: [PATCH] list_for_each_rcu must die: audit [patch 1/1] audit_send_reply(): fix error-path memory leak [PATCH] open sessionid permissions
2008-05-17[PATCH] list_for_each_rcu must die: auditPaul E. McKenney
All uses of list_for_each_rcu() can be profitably replaced by the easier-to-use list_for_each_entry_rcu(). This patch makes this change for the Audit system, in preparation for removing the list_for_each_rcu() API entirely. This time with well-formed SOB. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-17[patch 1/1] audit_send_reply(): fix error-path memory leakAndrew Morton
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10663 Reporter: Daniel Marjamki <danielm77@spray.se> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16[PATCH] avoid multiplication overflows and signedness issues for max_fdsAl Viro
Limit sysctl_nr_open - we don't want ->max_fds to exceed MAX_INT and we don't want size calculation for ->fd[] to overflow. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16[PATCH] dup_fd() fixes, part 1Al Viro
Move the sucker to fs/file.c in preparation to the rest Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-14lib: create common ascii hex arrayHarvey Harrison
Add a common hex array in hexdump.c so everyone can use it. Add a common hi/lo helper to avoid the shifting masking that is done to get the upper and lower nibbles of a byte value. Pull the pack_hex_byte helper from kgdb as it is opencoded many places in the tree that will be consolidated. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14cgroups: fix compile warningMirco Tischler
Return type of cpu_rt_runtime_write() should be int instead of ssize_t. Signed-off-by: Mirco Tischler <mt-ml@gmx.de> Acked-by: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-11Add new 'cond_resched_bkl()' helper functionLinus Torvalds
It acts exactly like a regular 'cond_resched()', but will not get optimized away when CONFIG_PREEMPT is set. Normal kernel code is already preemptable in the presense of CONFIG_PREEMPT, so cond_resched() is optimized away (see commit 02b67cc3ba36bdba351d6c3a00593f4ec550d9d3 "sched: do not do cond_resched() when CONFIG_PREEMPT"). But when wanting to conditionally reschedule while holding a lock, you need to use "cond_sched_lock(lock)", and the new function is the BKL equivalent of that. Also make fs/locks.c use it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-10BKL: revert back to the old spinlock implementationLinus Torvalds
The generic semaphore rewrite had a huge performance regression on AIM7 (and potentially other BKL-heavy benchmarks) because the generic semaphores had been rewritten to be simple to understand and fair. The latter, in particular, turns a semaphore-based BKL implementation into a mess of scheduling. The attempt to fix the performance regression failed miserably (see the previous commit 00b41ec2611dc98f87f30753ee00a53db648d662 'Revert "semaphore: fix"'), and so for now the simple and sane approach is to instead just go back to the old spinlock-based BKL implementation that never had any issues like this. This patch also has the advantage of being reported to fix the regression completely according to Yanmin Zhang, unlike the semaphore hack which still left a couple percentage point regression. As a spinlock, the BKL obviously has the potential to be a latency issue, but it's not really any different from any other spinlock in that respect. We do want to get rid of the BKL asap, but that has been the plan for several years. These days, the biggest users are in the tty layer (open/release in particular) and Alan holds out some hope: "tty release is probably a few months away from getting cured - I'm afraid it will almost certainly be the very last user of the BKL in tty to get fixed as it depends on everything else being sanely locked." so while we're not there yet, we do have a plan of action. Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Alexander Viro <viro@ftp.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-10Revert "semaphore: fix"Linus Torvalds
This reverts commit bf726eab3711cf192405d21688a4b21e07b6188a, as it has been reported to cause a regression with processes stuck in __down(), apparently because some missing wakeup. Quoth Sven Wegener: "I'm currently investigating a regression that has showed up with my last git pull yesterday. Bisecting the commits showed bf726e "semaphore: fix" to be the culprit, reverting it fixed the issue. Symptoms: During heavy filesystem usage (e.g. a kernel compile) I get several compiler processes in uninterruptible sleep, blocking all i/o on the filesystem. System is an Intel Core 2 Quad running a 64bit kernel and userspace. Filesystem is xfs on top of lvm. See below for the output of sysrq-w." See http://lkml.org/lkml/2008/5/10/45 for full report. In the meantime, we can just fix the BKL performance regression by reverting back to the good old BKL spinlock implementation instead, since any sleeping lock will generally perform badly, especially if it tries to be fair. Reported-by: Sven Wegener <sven.wegener@stealer.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-09module: don't ignore vermagic string if module doesn't have modversionsRusty Russell
Linus found a logic bug: we ignore the version number in a module's vermagic string if we have CONFIG_MODVERSIONS set, but modversions also lets through a module with no __versions section for modprobe --force (with tainting, but still). We should only ignore the start of the vermagic string if the module actually *has* crcs to check. Rather than (say) having an entertaining hissy fit and creating a config option to work around the buggy code. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-09module: be more picky about allowing missing module versionsRusty Russell
We allow missing __versions sections, because modprobe --force strips it. It makes less sense to allow sections where there's no version for a specific symbol the module uses, so disallow that. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-08Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes: sched: fix weight calculations semaphore: fix
2008-05-08Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: Revert "relay: fix splice problem" docbook: fix bio missing parameter block: use unitialized_var() in bio_alloc_bioset() block: avoid duplicate calls to get_part() in disk stat code cfq-iosched: make io priorities inherit CPU scheduling class as well as nice block: optimize generic_unplug_device() block: get rid of likely/unlikely predictions in merge logic vfs: splice remove_suid() cleanup cfq-iosched: fix RCU race in the cfq io_context destructor handling block: adjust tagging function queue bit locking block: sysfs store function needs to grab queue_lock and use queue_flag_*()
2008-05-08Fix cpuset sched_relax_domain_level control filePaul Menage
Due to a merge conflict, the sched_relax_domain_level control file was marked as being handled by cpuset_read/write_u64, but the code to handle it was actually in cpuset_common_file_read/write. Since the value being written/read is in fact a signed integer, it should be treated as such; this patch adds cpuset_read/write_s64 functions, and uses them to handle the sched_relax_domain_level file. With this patch, the sched_relax_domain_level can be read and written, and the correct contents seen/updated. Signed-off-by: Paul Menage <menage@google.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-08sched: fix weight calculationsMike Galbraith
The conversion between virtual and real time is as follows: dvt = rw/w * dt <=> dt = w/rw * dvt Since we want the fair sleeper granularity to be in real time, we actually need to do: dvt = - rw/w * l This bug could be related to the regression reported by Yanmin Zhang: | Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots | of regressions with 2.6.26-rc1: | | 1) 8-core stoakley: 28%; | 2) 16-core tigerton: 20%; | 3) Itanium Montvale: 50%. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>