aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2008-06-26Merge commit 'v2.6.26-rc8' into core/rcuIngo Molnar
2008-06-24kgdb: sparse fixJason Wessel
- Fix warning reported by sparse kernel/kgdb.c:1502:6: warning: symbol 'kgdb_console_write' was not declared. Should it be static? Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2008-06-24rcu: make quiescent rcutorture less power-hungryPaul E. McKenney
This patch aligns the rcutorture wakeup times to align with all other multiple-of-a-second wakeups to further decrease power consumption. Suggested-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: josh@freedesktop.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: dino@in.ibm.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Cc: vegard.nossum@gmail.com Cc: adobriyan@gmail.com Cc: oleg@tv-sign.ru Cc: bunk@kernel.org Cc: rjw@sisk.pl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-24rcu, rcutorture: make quiescent rcutorture less power-hungryPaul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: josh@freedesktop.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: dino@in.ibm.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Cc: vegard.nossum@gmail.com Cc: adobriyan@gmail.com Cc: oleg@tv-sign.ru Cc: bunk@kernel.org Cc: rjw@sisk.pl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futexes: fix fault handling in futex_lock_pi
2008-06-23futexes: fix fault handling in futex_lock_piThomas Gleixner
This patch addresses a very sporadic pi-futex related failure in highly threaded java apps on large SMP systems. David Holmes reported that the pi_state consistency check in lookup_pi_state triggered with his test application. This means that the kernel internal pi_state and the user space futex variable are out of sync. First we assumed that this is a user space data corruption, but deeper investigation revieled that the problem happend because the pi-futex code is not handling a fault in the futex_lock_pi path when the user space variable needs to be fixed up. The fault happens when a fork mapped the anon memory which contains the futex readonly for COW or the page got swapped out exactly between the unlock of the futex and the return of either the new futex owner or the task which was the expected owner but failed to acquire the kernel internal rtmutex. The current futex_lock_pi() code drops out with an inconsistent in case it faults and returns -EFAULT to user space. User space has no way to fixup that state. When we wrote this code we thought that we could not drop the hash bucket lock at this point to handle the fault. After analysing the code again it turned out to be wrong because there are only two tasks involved which might modify the pi_state and the user space variable: - the task which acquired the rtmutex - the pending owner of the pi_state which did not get the rtmutex Both tasks drop into the fixup_pi_state() function before returning to user space. The first task which acquired the hash bucket lock faults in the fixup of the user space variable, drops the spinlock and calls futex_handle_fault() to fault in the page. Now the second task could acquire the hash bucket lock and tries to fixup the user space variable as well. It either faults as well or it succeeds because the first task already faulted the page in. One caveat is to avoid a double fixup. After returning from the fault handling we reacquire the hash bucket lock and check whether the pi_state owner has been modified already. Reported-by: David Holmes <david.holmes@sun.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Holmes <david.holmes@sun.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 73 insertions(+), 20 deletions(-)
2008-06-23Merge branch 'linus' into core/rcuIngo Molnar
2008-06-23Merge branch 'linus' into sched/urgentIngo Molnar
2008-06-20Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression rcupreempt: remove export of rcu_batches_completed_bh cpuset: limit the input of cpuset.sched_relax_domain_level
2008-06-20sched: refactor wait_for_completion_timeout()Oleg Nesterov
Simplify the code and fix the boundary condition of wait_for_completion_timeout(,0). We can kill the first __remove_wait_queue() as well. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-06-20sched: fix wait_for_completion_timeout() spurious failure under heavy loadRoland Dreier
It seems that the current implementaton of wait_for_completion_timeout() has a small problem under very high load for the common pattern: if (!wait_for_completion_timeout(&done, timeout)) /* handle failure */ because the implementation very roughly does (lots of code deleted to show the basic flow): static inline long __sched do_wait_for_common(struct completion *x, long timeout, int state) { if (x->done) return timeout; do { timeout = schedule_timeout(timeout); if (!timeout) return timeout; } while (!x->done); return timeout; } so if the system is very busy and x->done is not set when do_wait_for_common() is entered, it is possible that the first call to schedule_timeout() returns 0 because the task doing wait_for_completion doesn't get rescheduled for a long time, even if it is woken up early enough. In this case, wait_for_completion_timeout() returns 0 without even checking x->done again, and the code above falls into its failure case purely for scheduler reasons, even if the hardware event or whatever was being waited for happened early enough. It would make sense to add an extra test to do_wait_for() in the timeout case and return 1 if x->done is actually set. A quick audit (not exhaustive) of wait_for_completion_timeout() callers seems to indicate that no one actually cares about the return value in the success case -- they just test for 0 (timed out) versus non-zero (wait succeeded). Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-20sched: rt: dont stop the period timer when there are tasks wanting to runPeter Zijlstra
So if the group ever gets throttled, it will never wake up again. Reported-by: "Daniel K." <dk@uw.no> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Daniel K. <dk@uw.no> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19sched, delay accounting: fix incorrect delay time when constantly waiting on ↵Bharath Ravi
runqueue This patch corrects the incorrect value of per process run-queue wait time reported by delay statistics. The anomaly was due to the following reason. When a process leaves the CPU and immediately starts waiting for CPU on the runqueue (which means it remains in the TASK_RUNNABLE state), the time of re-entry into the run-queue is never recorded. Due to this, the waiting time on the runqueue from this point of re-entry upto the next time it hits the CPU is not accounted for. This is solved by recording the time of re-entry of a process leaving the CPU in the sched_info_depart() function IF the process will go back to waiting on the run-queue. This IF condition is verified by checking whether the process is still in the TASK_RUNNABLE state. The patch was tested on 2.6.26-rc6 using two simple CPU hog programs. The values noted prior to the fix did not account for the time spent on the runqueue waiting. After the fix, the correct values were reported back to user space. Signed-off-by: Bharath Ravi <bharathravi1@gmail.com> Signed-off-by: Madhava K R <madhavakr@gmail.com> Cc: dhaval@linux.vnet.ibm.com Cc: vatsa@in.ibm.com Cc: balbir@in.ibm.com Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19rcu: make rcutorture more vicious: reinstate boot-time testingPaul E. McKenney
This patch re-institutes the ability to build rcutorture directly into the Linux kernel. The reason that this capability was removed was that this could result in your kernel being pretty much useless, as rcutorture would be running starting from early boot. This problem has been avoided by (1) making rcutorture run only three seconds of every six by default, (2) adding a CONFIG_RCU_TORTURE_TEST_RUNNABLE that permits rcutorture to be quiesced at boot time, and (3) adding a sysctl in /proc named /proc/sys/kernel/rcutorture_runnable that permits rcutorture to be quiesced and unquiesced when built into the kernel. Please note that this /proc file is -not- available when rcutorture is built as a module. Please also note that to get the earlier take-no-prisoners behavior, you must use the boot command line to set rcutorture's "stutter" parameter to zero. The rcutorture quiescing mechanism is currently quite crude: loops in each rcutorture process that poll a global variable once per tick. Suggestions for improvement are welcome. The default action will be to reduce the polling rate to a few times per second. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Suggested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19softlockup: fix NMI hangs due to lock race - 2.6.26-rc regressionJason Wessel
The touch_nmi_watchdog() routine on x86 ultimately calls touch_softlockup_watchdog(). The problem is that to touch the softlockup watchdog, the cpu_clock code has to be called which could involve multiple cpu locks and can lead to a hard hang if one of the locks is held by a processor that is not going to return anytime soon (such as could be the case with kgdb or perhaps even with some other kind of exception). This patch causes the public version of the touch_softlockup_watchdog() to defer the cpu clock access to a later point. The test case for this problem is to use the following kernel config options: CONFIG_KGDB_TESTS=y CONFIG_KGDB_TESTS_ON_BOOT=y CONFIG_KGDB_TESTS_BOOT_STRING="V1F100I100000" It should be noted that kgdb test suite and these options were not available until 2.6.26-rc2, so it was necessary to patch the kgdb test suite during the bisection. I would consider this patch a regression fix because the problem first appeared in commit 27ec4407790d075c325e1f4da0a19c56953cce23 when some logic was added to try to periodically sync the clocks. It was possible to work around this particular problem by simply not performing the sync anytime the system was in a critical context. This was ok until commit 3e51f33fcc7f55e6df25d15b55ed10c8b4da84cd, which added config option CONFIG_HAVE_UNSTABLE_SCHED_CLOCK and some multi-cpu locks to sync the clocks. It became clear that accessing this code from an nmi was the source of the lockups. Avoiding the access to the low level clock code from an code inside the NMI processing also fixed the problem with the 27ec44... commit. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19rcupreempt: remove export of rcu_batches_completed_bhSteven Rostedt
In rcupreempt, rcu_batches_completed_bh is defined as a static inline in the header file. This does not need to be exported, and not only that, this breaks my PPC build. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: paulus@samba.org Cc: linuxppc-dev@ozlabs.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-19cpuset: limit the input of cpuset.sched_relax_domain_levelLi Zefan
We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL for inputs outside this range. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-19sched: CPU hotplug events must not destroy scheduler domains created by the ↵Max Krasnyansky
cpusets First issue is not related to the cpusets. We're simply leaking doms_cur. It's allocated in arch_init_sched_domains() which is called for every hotplug event. So we just keep reallocation doms_cur without freeing it. I introduced free_sched_domains() function that cleans things up. Second issue is that sched domains created by the cpusets are completely destroyed by the CPU hotplug events. For all CPU hotplug events scheduler attaches all CPUs to the NULL domain and then puts them all into the single domain thereby destroying domains created by the cpusets (partition_sched_domains). The solution is simple, when cpusets are enabled scheduler should not create default domain and instead let cpusets do that. Which is exactly what the patch does. Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Cc: pj@sgi.com Cc: menage@google.com Cc: rostedt@goodmis.org Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-19sched: rt-group: fix RR bugletPeter Zijlstra
In tick_task_rt() we first call update_curr_rt() which can dequeue a runqueue due to it running out of runtime, and then we try to requeue it, of it also having exhausted its RR quota. Obviously requeueing something that is no longer on the runqueue will not have the expected result. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Daniel K. <dk@uw.no> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19sched: rt-group: heirarchy aware throttlePeter Zijlstra
The bandwidth throttle code dequeues a group when it runs out of quota, and re-queues it once the period rolls over and the quota gets refreshed. Sadly it failed to take the hierarchy into consideration. Share more of the enqueue/dequeue code with regular task opterations. Also, some operations like sched_setscheduler() can dequeue/enqueue tasks that are in throttled runqueues, we should not inadvertly re-enqueue empty runqueues so check for that. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Daniel K. <dk@uw.no> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19sched: rt-group: fix hierarchyPeter Zijlstra
Don't re-set the entity's runqueue to the wrong rq after we've set it to the right one. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Daniel K. <dk@uw.no> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19sched: NULL pointer dereference while setting sched_rt_period_usDario Faggioli
When CONFIG_RT_GROUP_SCHED and CONFIG_CGROUP_SCHED are enabled, with: echo 10000 > /proc/sys/kernel/sched_rt_period_us We get this: BUG: unable to handle kernel NULL pointer dereference at 0000008c [ 947.682233] IP: [<c0216b72>] __rt_schedulable+0x12/0x160 [ 947.683123] *pde = 00000000=20 [ 947.683782] Oops: 0000 [#1] [ 947.684307] Modules linked in: [ 947.684308] [ 947.684308] Pid: 2359, comm: bash Not tainted (2.6.26-rc6 #8) [ 947.684308] EIP: 0060:[<c0216b72>] EFLAGS: 00000246 CPU: 0 [ 947.684308] EIP is at __rt_schedulable+0x12/0x160 [ 947.684308] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000001 [ 947.684308] ESI: c0521db4 EDI: 00000001 EBP: c6cc9f00 ESP: c6cc9ed0 [ 947.684308] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 947.684308] Process bash (pid: 2359, tiÆcc8000 taskÇa54f00=20 task.tiÆcc8000) [ 947.684308] Stack: c0222790 00000000 080f8c08 c0521db4 c6cc9f00 00000001 00000000 00000000 [ 947.684308] c6cc9f9c 00000000 c0521db4 00000001 c6cc9f28 c0216d40 00000000 00000000 [ 947.684308] c6cc9f9c 000f4240 000e7ef0 ffffffff c0521db4 c79dfb60 c6cc9f58 c02af2cc [ 947.684308] Call Trace: [ 947.684308] [<c0222790>] ? do_proc_dointvec_conv+0x0/0x50 [ 947.684308] [<c0216d40>] ? sched_rt_handler+0x80/0x110 [ 947.684308] [<c02af2cc>] ? proc_sys_call_handler+0x9c/0xb0 [ 947.684308] [<c02af2fa>] ? proc_sys_write+0x1a/0x20 [ 947.684308] [<c0273c36>] ? vfs_write+0x96/0x160 [ 947.684308] [<c02af2e0>] ? proc_sys_write+0x0/0x20 [ 947.684308] [<c027423d>] ? sys_write+0x3d/0x70 [ 947.684308] [<c0202ef5>] ? sysenter_past_esp+0x6a/0x91 [ 947.684308] ======================= [ 947.684308] Code: 24 04 e8 62 b1 0e 00 89 c7 89 f8 8b 5d f4 8b 75 f8 8b 7d fc 89 ec 5d c3 90 55 89 e5 57 56 53 83 ec 24 89 45 ec 89 55 e4 89 4d e8 <8b> b8 8c 00 00 00 85 ff 0f 84 c9 00 00 00 8b 57 24 39 55 e8 8b [ 947.684308] EIP: [<c0216b72>] __rt_schedulable+0x12/0x160 SS:ESP 0068:c6cc9ed0 We think the following patch solves the issue. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-18rcu: make rcutorture more vicious: add stutter featurePaul E. McKenney
This patch takes a step towards making rcutorture more brutal by allowing the test to be automatically periodically paused, with the default being to run the test for five seconds then pause for five seconds and repeat. This behavior can be controlled using a new "stutter" module parameter, so that "stutter=0" gives the old default behavior of running continuously. Starting and stopping rcutorture more heavily stresses RCU's interaction with the scheduler, as well as exercising more paths through the grace-period detection code. Note that the default to "shuffle_interval" has also been adjusted from 5 seconds to 3 seconds to provide varying overlap with the "stutter" interval. I am still unable to provoke the failures that Alexey has been seeing, even with this patch, but will be doing a few additional things to beef up rcutorture. Suggested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-18rcutorture: WARN_ON_ONCE(1) when detecting an errorIngo Molnar
this makes it easier for automated tests to pick up such failures.
2008-06-17sched: fix defined-but-unused warningRabin Vincent
Fix this warning, which appears with !CONFIG_SMP: kernel/sched.c:1216: warning: `init_hrtick' defined but not used Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16Merge branch 'linus' into core/rcuIngo Molnar
2008-06-16rcu: remove unused field struct rcu_data::rcu_taskletLai Jiangshan
Since softirq works for rcu reclaimer, rcu_tasklet is unused now. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12kprobes: fix error checking of batch registrationMasami Hiramatsu
Fix error checking routine to catch an error which occurs in first __register_*probe(). Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: David Miller <davem@davemloft.net> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: 64-bit: fix arithmetics overflow sched: fair group: fix overflow(was: fix divide by zero) sched: fix TASK_WAKEKILL vs SIGKILL race
2008-06-12sched: 64-bit: fix arithmetics overflowLai Jiangshan
(overflow means weight >= 2^32 here, because inv_weigh = 2^32/weight) A weight of a cfs_rq is the sum of weights of which entities are queued on this cfs_rq, so it will overflow when there are too many entities. Although, overflow occurs very rarely, but it break fairness when it occurs. 64-bits systems have more memory than 32-bit systems and 64-bit systems can create more process usually, so overflow may occur more frequently. This patch guarantees fairness when overflow happens on 64-bit systems. Thanks to the optimization of compiler, it changes nothing on 32-bit. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12sched: fair group: fix overflow(was: fix divide by zero)Lai Jiangshan
I found a bug which can be reproduced by this way:(linux-2.6.26-rc5, x86-64) (use 2^32, 2^33, ...., 2^63 as shares value) # mkdir /dev/cpuctl # mount -t cgroup -o cpu cpuctl /dev/cpuctl # cd /dev/cpuctl # mkdir sub # echo 0x8000000000000000 > sub/cpu.shares # echo $$ > sub/tasks oops here! divide by zero. This is because do_div() expects the 2th parameter to be 32 bits, but unsigned long is 64 bits in x86_64. Peter Zijstra pointed it out that the sane thing to do is limit the shares value to something smaller instead of using an even more expensive divide. Also, I found another bug about "the shares value is too large": pid1 and pid2 are set affinity to cpu#0 pid1 is attached to cg1 and pid2 is attached to cg2 if cg1/cpu.shares = 1024 cg2/cpu.shares = 2000000000 then pid2 got 100% usage of cpu, and pid1 0% if cg1/cpu.shares = 1024 cg2/cpu.shares = 20000000000 then pid2 got 0% usage of cpu, and pid1 100% And a weight of a cfs_rq is the sum of weights of which entities are queued on this cfs_rq, so the shares value should be limited to a smaller value. I think that (1UL << 18) is a good limited value: 1) it's not too large, we can create a lot of group before overflow 2) it's several times the weight value for nice=-19 (not too small) Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10sched: fix TASK_WAKEKILL vs SIGKILL raceOleg Nesterov
schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case, this allows us to do current->state = TASK_INTERRUPTIBLE; schedule(); without fear to sleep with pending signal. However, the code like current->state = TASK_KILLABLE; schedule(); is not right, schedule() doesn't take TASK_WAKEKILL into account. This means that mutex_lock_killable(), wait_for_completion_killable(), down_killable(), schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has no effect). Introduce the new helper, signal_pending_state(), and change schedule() to use it. Hopefully it will have more users, that is why the task's state is passed separately. Note this "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state(). This is needed to preserve the current behaviour (ptrace_notify). I hope this check will be removed soon, but this (afaics good) change needs the separate discussion. The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)", basically the same that schedule() does now. However, this patch of course bloats schedule(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6: capabilities: remain source compatible with 32-bit raw legacy capability support. LSM: remove stale web site from MAINTAINERS
2008-06-06cpusets: fix bug when adding nonexistent cpu or memLai Jiangshan
Adding a nonexistent cpu to a cpuset will be omitted quietly. It should return -EINVAL. Example: (real_nr_cpus <= 4 < NR_CPUS or cpu#4 was just offline) # cat cpus 0-1 # /bin/echo 4 > cpus # /bin/echo $? 0 # cat cpus # The same occurs when add a nonexistent mem. This patch will fix this bug. And when *buf == "", the check is unneeded. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Jackson <pj@sgi.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-04Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdbts: Use HW breakpoints with CONFIG_DEBUG_RODATA kgdb: use common ascii helpers and put_unaligned_be32 helper
2008-05-31capabilities: remain source compatible with 32-bit raw legacy capability ↵Andrew G. Morgan
support. Source code out there hard-codes a notion of what the _LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the raw capability system calls capget() and capset(). Its unfortunate, but true. Since the confusing header file has been in a released kernel, there is software that is erroneously using 64-bit capabilities with the semantics of 32-bit compatibilities. These recently compiled programs may suffer corruption of their memory when sys_getcap() overwrites more memory than they are coded to expect, and the raising of added capabilities when using sys_capset(). As such, this patch does a number of things to clean up the situation for all. It 1. forces the _LINUX_CAPABILITY_VERSION define to always retain its legacy value. 2. adopts a new #define strategy for the kernel's internal implementation of the preferred magic. 3. deprecates v2 capability magic in favor of a new (v3) magic number. The functionality of v3 is entirely equivalent to v2, the only difference being that the v2 magic causes the kernel to log a "deprecated" warning so the admin can find applications that may be using v2 inappropriately. [User space code continues to be encouraged to use the libcap API which protects the application from details like this. libcap-2.10 is the first to support v3 capabilities.] Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518. Thanks to Bojan Smojver for the report. [akpm@linux-foundation.org: s/depreciate/deprecate/g] [akpm@linux-foundation.org: be robust about put_user size] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Bojan Smojver <bojan@rexursive.com> Cc: stable@kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2008-05-29Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: re-tune NUMA topologies sched: stop wake_affine from causing serious imbalance sched: fix sched_clock_cpu() revert ("sched: fair-group: SMP-nice for group scheduling") sched: cleanup show_schedstat(): fix memleak sched: unite unlikely pairs in rt_policy() and schedule_debug() revert ("sched: fair: weight calculations")
2008-05-29Merge commit 'linus/master' into sched-fixes-for-linusIngo Molnar
2008-05-29sched: stop wake_affine from causing serious imbalanceMike Galbraith
Prevent short-running wakers of short-running threads from overloading a single cpu via wakeup affinity, and wire up disconnected debug option. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29sched: fix sched_clock_cpu()Peter Zijlstra
Make sched_clock_cpu() return 0 before it has been initialized and avoid corrupting its state due to doing so. This fixes the weird printk timestamp jump reported. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-05-29revert ("sched: fair-group: SMP-nice for group scheduling")Ingo Molnar
Yanmin Zhang reported: Comparing with 2.6.25, volanoMark has big regression with kernel 2.6.26-rc1. It's about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito. With bisect, I located the following patch: | 18d95a2832c1392a2d63227a7a6d433cb9f2037e is first bad commit | commit 18d95a2832c1392a2d63227a7a6d433cb9f2037e | Author: Peter Zijlstra <a.p.zijlstra@chello.nl> | Date: Sat Apr 19 19:45:00 2008 +0200 | | sched: fair-group: SMP-nice for group scheduling Revert it so that we get v2.6.25 behavior. Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29sched: cleanupIngo Molnar
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29show_schedstat(): fix memleakAdrian Bunk
The Coverity checker spotted a memleak introduced by commit 39106dcf85285e78f3b290022122c76f851379b8 (cpumask: use new cpus_scnprintf function). It seems the kfree() got lost between v2 and v3 of this patch... Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Mike Travis <travis@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29sched: unite unlikely pairs in rt_policy() and schedule_debug()Roel Kluin
Removes obfuscation and may improve assembly. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29revert ("sched: fair: weight calculations")Ingo Molnar
Yanmin Zhang reported: Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has many regressions with 2.6.26-rc1: 1) 8-core stoakley: 28%; 2) 16-core tigerton: 20%; 3) Itanium Montvale: 50%. Bisect located this patch: | 8f1bc385cfbab474db6c27b5af1e439614f3025c is first bad commit | commit 8f1bc385cfbab474db6c27b5af1e439614f3025c | Author: Peter Zijlstra <a.p.zijlstra@chello.nl> | Date: Sat Apr 19 19:45:00 2008 +0200 | | sched: fair: weight calculations Revert it to the 2.6.25 state. Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-28kgdb: use common ascii helpers and put_unaligned_be32 helperHarvey Harrison
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2008-05-28splice: fix sendfile() issue with relayTom Zanussi
Splice isn't always incrementing the ppos correctly, which broke relay splice. Signed-off-by: Tom Zanussi <zanussi@comcast.net> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-26posix timers: discard SI_TIMER signals on execOleg Nesterov
Based on Roland's patch. This approach was suggested by Austin Clements from the very beginning, and then by Linus. As Austin pointed out, the execing task can be killed by SI_TIMER signal because exec flushes the signal handlers, but doesn't discard the pending signals generated by posix timers. Perhaps not a bug, but people find this surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-26posix timers: sigqueue_free: don't free sigqueue if it is queuedOleg Nesterov
Currently sigqueue_free() removes sigqueue from list, but doesn't cancel the pending signal. This is not consistent, the task should either receive the "full" signal along with siginfo_t, or it shouldn't receive the signal at all. Change sigqueue_free() to clear SIGQUEUE_PREALLOC but leave sigqueue on list if it is queued. This is a user-visible change. If the signal is blocked, it stays queued after sys_timer_delete() until unblocked with the "stale" si_code/si_value, and of course it is still counted wrt RLIMIT_SIGPENDING which also limits the number of posix timers. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24cgroups: remove node_ prefix_from ns subsystemCedric Le Goater
This is a slight change in the namespace cgroup subsystem api. The change is that previously when cgroup_clone() was called (currently only from the unshare path in ns_proxy cgroup, you'd get a new group named "node_$pid" whereas now you'll get a group named after just your pid.) The only users who would notice it are those who are using the ns_proxy cgroup subsystem to auto-create cgroups when namespaces are unshared - something of an experimental feature, which I think really needs more complete container/namespace support in order to be useful. I suspect the only users are Cedric and Serge, or maybe a few others on containers@lists.linux-foundation.org. And in fact it would only be noticed by the users who make the assumption about how the name is generated, rather than getting it from the /proc/<pid>/cgroups file for the process in question. Whether the change is actually needed or not I'm fairly agnostic on, but I guess it is more elegant to just use the pid as the new group name rather than adding a fairly arbitrary "node_" prefix on the front. [menage@google.com: provided changelog] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: "Paul Menage" <menage@google.com> Cc: "Serge E. Hallyn" <serue@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>