aboutsummaryrefslogtreecommitdiff
path: root/kernel/sched.c
AgeCommit message (Collapse)Author
2008-06-27sched: correct wakeup weight calculationsPeter Zijlstra
rw_i = {2, 4, 1, 0} s_i = {2/7, 4/7, 1/7, 0} wakeup on cpu0, weight=1 rw'_i = {3, 4, 1, 0} s'_i = {3/8, 4/8, 1/8, 0} s_0 = S * rw_0 / \Sum rw_j -> \Sum rw_j = S*rw_0/s_0 = 1*2*7/2 = 7 (correct) s'_0 = S * (rw_0 + 1) / (\Sum rw_j + 1) = 1 * (2+1) / (7+1) = 3/8 (correct so we find that adding 1 to cpu0 gains 5/56 in weight if say the other cpu were, cpu1, we'd also have to calculate its 4/56 loss Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: update shares on wakeupPeter Zijlstra
We found that the affine wakeup code needs rather accurate load figures to be effective. The trouble is that updating the load figures is fairly expensive with group scheduling. Therefore ratelimit the updating. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: fix shares boost logicPeter Zijlstra
In case the domain is empty, pretend there is a single task on each cpu, so that together with the boost logic we end up giving 1/n shares to each cpu. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: disable source/target_load biasPeter Zijlstra
The bias given by source/target_load functions can be very large, disable it by default to get faster convergence. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: remove prio preference from balance decisionsPeter Zijlstra
Priority looses much of its meaning in a hierarchical context. So don't use it in balance decisions. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: hierarchical load vs find_busiest_groupPeter Zijlstra
find_busiest_group() has some assumptions about task weight being in the NICE_0_LOAD range. Hierarchical task groups break this assumption - fix this by replacing it with the average task weight, which will adapt the situation. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: persistent average load per taskPeter Zijlstra
Remove the fall-back to SCHED_LOAD_SCALE by remembering the previous value of cpu_avg_load_per_task() - this is useful because of the hierarchical group model in which task weight can be much smaller. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: fix sched_balance_self() smp group balancingPeter Zijlstra
Finding the least idle cpu is more accurate when done with updated shares. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: fix newidle smp group balancingPeter Zijlstra
Re-compute the shares on newidle - so we can make a decision based on recent data. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: simplify the group load balancerPeter Zijlstra
While thinking about the previous patch - I realized that using per domain aggregate load values in load_balance_fair() is wrong. We should use the load value for that CPU. By not needing per domain hierarchical load values we don't need to store per domain aggregate shares, which greatly simplifies all the math. It basically falls apart in two separate computations: - per domain update of the shares - per CPU update of the hierarchical load Also get rid of the move_group_shares() stuff - just re-compute the shares again after a successful load balance. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: no need to aggregate task_weightPeter Zijlstra
We only need to know the task_weight of the busiest rq - nothing to do if there are no tasks there. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: dont micro manage share lossesPeter Zijlstra
We used to try and contain the loss of 'shares' by playing arithmetic games. Replace that by noticing that at the top sched_domain we'll always have the full weight in shares to distribute. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: update aggregate when holding the RQsPeter Zijlstra
It was observed that in __update_group_shares_cpu() rq_weight > aggregate()->rq_weight This is caused by forks/wakeups in between the initial aggregate pass and locking of the RQs for load balance. To avoid this situation partially re-do the aggregation once we have the RQs locked (which avoids new tasks from appearing). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: fix sched_domain aggregationPeter Zijlstra
Keeping the aggregate on the first cpu of the sched domain has two problems: - it could collide between different sched domains on different cpus - it could slow things down because of the remote accesses Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: fix wakeup granularity and buddy granularityPeter Zijlstra
Uncouple buddy selection from wakeup granularity. The initial idea was that buddies could run ahead as far as a normal task can - do this by measuring a pair 'slice' just as we do for a normal task. This means we can drop the wakeup_granularity back to 5ms. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: sched_clock_cpu() based cpu_clock()Peter Zijlstra
with sched_clock_cpu() being reasonably in sync between cpus (max 1 jiffy difference) use this to provide cpu_clock(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: revert revert of: fair-group: SMP-nice for group schedulingPeter Zijlstra
Try again.. Initial commit: 18d95a2832c1392a2d63227a7a6d433cb9f2037e Revert: 6363ca57c76b7b83639ca8c83fc285fa26a7880e Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27sched: revert the revert of: weight calculationsPeter Zijlstra
Try again.. initial commit: 8f1bc385cfbab474db6c27b5af1e439614f3025c revert: f9305d4a0968201b2818dbed0dc8cb0d4ee7aeb3 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-25Merge branch 'linus' into tracing/ftraceIngo Molnar
2008-06-25Merge branch 'linus' into sched/new-API-sched_setschedulerIngo Molnar
2008-06-25Merge branch 'linus' into sched/develIngo Molnar
Conflicts: kernel/sched_rt.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-25Merge branch 'linus' into core/softirqIngo Molnar
2008-06-23sched: add new API sched_setscheduler_nocheck: add a flag to control access ↵Rusty Russell
checks Hidehiro Kawai noticed that sched_setscheduler() can fail in stop_machine: it calls sched_setscheduler() from insmod, which can have CAP_SYS_MODULE without CAP_SYS_NICE. Two cases could have failed, so are changed to sched_setscheduler_nocheck: kernel/softirq.c:cpu_callback() - CPU hotplug callback kernel/stop_machine.c:__stop_machine_run() - Called from various places, including modprobe() Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: sugita <yumiko.sugita.yf@hitachi.com> Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23Merge branch 'linus' into sched/develIngo Molnar
2008-06-23Merge branch 'linus' into tracing/ftraceIngo Molnar
2008-06-23Merge branch 'linus' into sched/urgentIngo Molnar
2008-06-23Merge branch 'linus' into core/softirqIngo Molnar
2008-06-20Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression rcupreempt: remove export of rcu_batches_completed_bh cpuset: limit the input of cpuset.sched_relax_domain_level
2008-06-20sched: refactor wait_for_completion_timeout()Oleg Nesterov
Simplify the code and fix the boundary condition of wait_for_completion_timeout(,0). We can kill the first __remove_wait_queue() as well. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-06-20sched: fix wait_for_completion_timeout() spurious failure under heavy loadRoland Dreier
It seems that the current implementaton of wait_for_completion_timeout() has a small problem under very high load for the common pattern: if (!wait_for_completion_timeout(&done, timeout)) /* handle failure */ because the implementation very roughly does (lots of code deleted to show the basic flow): static inline long __sched do_wait_for_common(struct completion *x, long timeout, int state) { if (x->done) return timeout; do { timeout = schedule_timeout(timeout); if (!timeout) return timeout; } while (!x->done); return timeout; } so if the system is very busy and x->done is not set when do_wait_for_common() is entered, it is possible that the first call to schedule_timeout() returns 0 because the task doing wait_for_completion doesn't get rescheduled for a long time, even if it is woken up early enough. In this case, wait_for_completion_timeout() returns 0 without even checking x->done again, and the code above falls into its failure case purely for scheduler reasons, even if the hardware event or whatever was being waited for happened early enough. It would make sense to add an extra test to do_wait_for() in the timeout case and return 1 if x->done is actually set. A quick audit (not exhaustive) of wait_for_completion_timeout() callers seems to indicate that no one actually cares about the return value in the success case -- they just test for 0 (timed out) versus non-zero (wait succeeded). Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-20sched: rt: fix the bandwidth contraint computationsPeter Zijlstra
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Daniel K." <dk@uw.no> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19cpuset: limit the input of cpuset.sched_relax_domain_levelLi Zefan
We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL for inputs outside this range. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-19sched: CPU hotplug events must not destroy scheduler domains created by the ↵Max Krasnyansky
cpusets First issue is not related to the cpusets. We're simply leaking doms_cur. It's allocated in arch_init_sched_domains() which is called for every hotplug event. So we just keep reallocation doms_cur without freeing it. I introduced free_sched_domains() function that cleans things up. Second issue is that sched domains created by the cpusets are completely destroyed by the CPU hotplug events. For all CPU hotplug events scheduler attaches all CPUs to the NULL domain and then puts them all into the single domain thereby destroying domains created by the cpusets (partition_sched_domains). The solution is simple, when cpusets are enabled scheduler should not create default domain and instead let cpusets do that. Which is exactly what the patch does. Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Cc: pj@sgi.com Cc: menage@google.com Cc: rostedt@goodmis.org Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-19Merge branch 'sched' into sched-develIngo Molnar
Conflicts: kernel/sched_rt.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19sched: rt-group: fix hierarchyPeter Zijlstra
Don't re-set the entity's runqueue to the wrong rq after we've set it to the right one. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Daniel K. <dk@uw.no> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19sched: NULL pointer dereference while setting sched_rt_period_usDario Faggioli
When CONFIG_RT_GROUP_SCHED and CONFIG_CGROUP_SCHED are enabled, with: echo 10000 > /proc/sys/kernel/sched_rt_period_us We get this: BUG: unable to handle kernel NULL pointer dereference at 0000008c [ 947.682233] IP: [<c0216b72>] __rt_schedulable+0x12/0x160 [ 947.683123] *pde = 00000000=20 [ 947.683782] Oops: 0000 [#1] [ 947.684307] Modules linked in: [ 947.684308] [ 947.684308] Pid: 2359, comm: bash Not tainted (2.6.26-rc6 #8) [ 947.684308] EIP: 0060:[<c0216b72>] EFLAGS: 00000246 CPU: 0 [ 947.684308] EIP is at __rt_schedulable+0x12/0x160 [ 947.684308] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000001 [ 947.684308] ESI: c0521db4 EDI: 00000001 EBP: c6cc9f00 ESP: c6cc9ed0 [ 947.684308] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 947.684308] Process bash (pid: 2359, tiÆcc8000 taskÇa54f00=20 task.tiÆcc8000) [ 947.684308] Stack: c0222790 00000000 080f8c08 c0521db4 c6cc9f00 00000001 00000000 00000000 [ 947.684308] c6cc9f9c 00000000 c0521db4 00000001 c6cc9f28 c0216d40 00000000 00000000 [ 947.684308] c6cc9f9c 000f4240 000e7ef0 ffffffff c0521db4 c79dfb60 c6cc9f58 c02af2cc [ 947.684308] Call Trace: [ 947.684308] [<c0222790>] ? do_proc_dointvec_conv+0x0/0x50 [ 947.684308] [<c0216d40>] ? sched_rt_handler+0x80/0x110 [ 947.684308] [<c02af2cc>] ? proc_sys_call_handler+0x9c/0xb0 [ 947.684308] [<c02af2fa>] ? proc_sys_write+0x1a/0x20 [ 947.684308] [<c0273c36>] ? vfs_write+0x96/0x160 [ 947.684308] [<c02af2e0>] ? proc_sys_write+0x0/0x20 [ 947.684308] [<c027423d>] ? sys_write+0x3d/0x70 [ 947.684308] [<c0202ef5>] ? sysenter_past_esp+0x6a/0x91 [ 947.684308] ======================= [ 947.684308] Code: 24 04 e8 62 b1 0e 00 89 c7 89 f8 8b 5d f4 8b 75 f8 8b 7d fc 89 ec 5d c3 90 55 89 e5 57 56 53 83 ec 24 89 45 ec 89 55 e4 89 4d e8 <8b> b8 8c 00 00 00 85 ff 0f 84 c9 00 00 00 8b 57 24 39 55 e8 8b [ 947.684308] EIP: [<c0216b72>] __rt_schedulable+0x12/0x160 SS:ESP 0068:c6cc9ed0 We think the following patch solves the issue. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-18sched: rework of "prioritize non-migratable tasks over migratable ones"Dmitry Adamushko
regarding this commit: 45c01e824991b2dd0a332e19efc4901acb31209f I think we can do it simpler. Please take a look at the patch below. Instead of having 2 separate arrays (which is + ~800 bytes on x86_32 and twice so on x86_64), let's add "exclusive" (the ones that are bound to this CPU) tasks to the head of the queue and "shared" ones -- to the end. In case of a few newly woken up "exclusive" tasks, they are 'stacked' (not queued as now), meaning that a task {i+1} is being placed in front of the previously woken up task {i}. But I don't think that this behavior may cause any realistic problems. There are a couple of changes on top of this one. (1) in check_preempt_curr_rt() I don't think there is a need for the "pick_next_rt_entity(rq, &rq->rt) != &rq->curr->rt" check. enqueue_task_rt(p) and check_preempt_curr_rt() are always called one after another with rq->lock being held so the following check "p->rt.nr_cpus_allowed == 1 && rq->curr->rt.nr_cpus_allowed != 1" should be enough (well, just its left part) to guarantee that 'p' has been queued in front of the 'curr'. (2) in set_cpus_allowed_rt() I don't thinks there is a need for requeue_task_rt() here. Perhaps, the only case when 'requeue' (+ reschedule) might be useful is as follows: i) weight == 1 && cpu_isset(task_cpu(p), *new_mask) i.e. a task is being bound to this CPU); ii) 'p' != rq->curr but here, 'p' has already been on this CPU for a while and was not migrated. i.e. it's possible that 'rq->curr' would not have high chances to be migrated right at this particular moment (although, has chance in a bit longer term), should we allow it to be preempted. Anyway, I think we should not perhaps make it more complex trying to address some rare corner cases. For instance, that's why a single queue approach would be preferable. Unless I'm missing something obvious, this approach gives us similar functionality at lower cost. Verified only compilation-wise. (Almost)-Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-17sched: fix defined-but-unused warningRabin Vincent
Fix this warning, which appears with !CONFIG_SMP: kernel/sched.c:1216: warning: `init_hrtick' defined but not used Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16Merge branch 'linus' into core/softirqIngo Molnar
2008-06-16Merge branch 'linus' into tracing/ftraceIngo Molnar
2008-06-16Merge branch 'linus' into sched-develIngo Molnar
2008-06-12sched: 64-bit: fix arithmetics overflowLai Jiangshan
(overflow means weight >= 2^32 here, because inv_weigh = 2^32/weight) A weight of a cfs_rq is the sum of weights of which entities are queued on this cfs_rq, so it will overflow when there are too many entities. Although, overflow occurs very rarely, but it break fairness when it occurs. 64-bits systems have more memory than 32-bit systems and 64-bit systems can create more process usually, so overflow may occur more frequently. This patch guarantees fairness when overflow happens on 64-bit systems. Thanks to the optimization of compiler, it changes nothing on 32-bit. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12sched: fair group: fix overflow(was: fix divide by zero)Lai Jiangshan
I found a bug which can be reproduced by this way:(linux-2.6.26-rc5, x86-64) (use 2^32, 2^33, ...., 2^63 as shares value) # mkdir /dev/cpuctl # mount -t cgroup -o cpu cpuctl /dev/cpuctl # cd /dev/cpuctl # mkdir sub # echo 0x8000000000000000 > sub/cpu.shares # echo $$ > sub/tasks oops here! divide by zero. This is because do_div() expects the 2th parameter to be 32 bits, but unsigned long is 64 bits in x86_64. Peter Zijstra pointed it out that the sane thing to do is limit the shares value to something smaller instead of using an even more expensive divide. Also, I found another bug about "the shares value is too large": pid1 and pid2 are set affinity to cpu#0 pid1 is attached to cg1 and pid2 is attached to cg2 if cg1/cpu.shares = 1024 cg2/cpu.shares = 2000000000 then pid2 got 100% usage of cpu, and pid1 0% if cg1/cpu.shares = 1024 cg2/cpu.shares = 20000000000 then pid2 got 0% usage of cpu, and pid1 100% And a weight of a cfs_rq is the sum of weights of which entities are queued on this cfs_rq, so the shares value should be limited to a smaller value. I think that (1UL << 18) is a good limited value: 1) it's not too large, we can create a lot of group before overflow 2) it's several times the weight value for nice=-19 (not too small) Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10sched: sched_clock() lockdep fixIngo Molnar
Sitsofe Wheeler bisected the following commit to cause a lockdep to warn about itself and turn itself off: > commit c6531cce6e6e4b99bcda46b6268d6f2d9e30aea4 > Author: Ingo Molnar <mingo@elte.hu> > Date: Mon May 12 21:21:14 2008 +0200 > > sched: do not trace sched_clock do not use raw irq flags in cpu_clock() as it causes lockdep to lose track of the true state of the IRQ flag. Reported-and-bisected-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-06-10sched: kill off dead cfs_rq_set_shares()Paul Mundt
Building with CONFIG_FAIR_GROUP_SCHED=y on UP results in an unused cfs_rq_set_shares() reference. As nothing is using this dummy function in the first place, just kill it off. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10sched: prevent bound kthreads from changing cpus_allowedDavid Rientjes
Kthreads that have called kthread_bind() are bound to specific cpus, so other tasks should not be able to change their cpus_allowed from under them. Otherwise, it is possible to move kthreads, such as the migration or software watchdog threads, so they are not allowed access to the cpu they work on. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10sched: fix hotplug cpus on ia64Peter Zijlstra
Cliff Wickman wrote: > I built an ia64 kernel from Andrew's tree (2.6.26-rc2-mm1) > and get a very predictable hotplug cpu problem. > billberry1:/tmp/cpw # ./dis > disabled cpu 17 > enabled cpu 17 > billberry1:/tmp/cpw # ./dis > disabled cpu 17 > enabled cpu 17 > billberry1:/tmp/cpw # ./dis > > The script that disables the cpu always hangs (unkillable) > on the 3rd attempt. > > And a bit further: > The kstopmachine thread always sits on the run queue (real time) for about > 30 minutes before running. this fix solves some (but not all) issues between CPU hotplug and RT bandwidth throttling. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10sched: fix TASK_WAKEKILL vs SIGKILL raceOleg Nesterov
schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case, this allows us to do current->state = TASK_INTERRUPTIBLE; schedule(); without fear to sleep with pending signal. However, the code like current->state = TASK_KILLABLE; schedule(); is not right, schedule() doesn't take TASK_WAKEKILL into account. This means that mutex_lock_killable(), wait_for_completion_killable(), down_killable(), schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has no effect). Introduce the new helper, signal_pending_state(), and change schedule() to use it. Hopefully it will have more users, that is why the task's state is passed separately. Note this "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state(). This is needed to preserve the current behaviour (ptrace_notify). I hope this check will be removed soon, but this (afaics good) change needs the separate discussion. The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)", basically the same that schedule() does now. However, this patch of course bloats schedule(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-06sched: move weighted_cpuload into #ifdef CONFIG_SMP sectionThomas Gleixner
weighted_cpuload is only used on SMP. move it into the CONFIG_SMP section. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06sched: Move cpu masks from kernel/sched.c into kernel/cpu.cMax Krasnyansky
kernel/cpu.c seems a more logical place for those maps since they do not really have much to do with the scheduler these days. kernel/cpu.c is now built for the UP kernel too, but it does not affect the size the kernel sections. $ size vmlinux before text data bss dec hex filename 3313797 307060 310352 3931209 3bfc49 vmlinux after text data bss dec hex filename 3313797 307060 310352 3931209 3bfc49 vmlinux Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Cc: pj@sgi.com Cc: menage@google.com Cc: rostedt@goodmis.org Cc: mingo@elte.hu Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>