path: root/kernel
Age  Commit message  Author
2008-06-27  sched: correct wakeup weight calculations  (Peter Zijlstra)
rw_i = {2, 4, 1, 0}
s_i  = {2/7, 4/7, 1/7, 0}

wakeup on cpu0, weight=1

rw'_i = {3, 4, 1, 0}
s'_i  = {3/8, 4/8, 1/8, 0}

s_0  = S * rw_0 / \Sum rw_j -> \Sum rw_j = S*rw_0/s_0 = 1*2*7/2 = 7 (correct)
s'_0 = S * (rw_0 + 1) / (\Sum rw_j + 1) = 1 * (2+1) / (7+1) = 3/8 (correct)

So we find that adding 1 to cpu0 gains 5/56 in weight. If, say, the other cpu were cpu1, we'd also have to calculate its 4/56 loss.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
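As a quick sanity check of the arithmetic above, here is a small user-space C sketch (illustrative only, not the kernel implementation) that recomputes each cpu's share before and after the wakeup; it reproduces the 5/56 gain on cpu0 and the 4/56 loss on cpu1:

#include <stdio.h>

int main(void)
{
	const double S = 1.0;                 /* group weight                */
	const double rw[4] = { 2, 4, 1, 0 };  /* per-cpu runqueue weights    */
	const double w = 1.0;                 /* weight of the waking task   */
	const int wake_cpu = 0;               /* cpu the task is woken up on */
	double sum = 0.0;
	int i;

	for (i = 0; i < 4; i++)
		sum += rw[i];

	for (i = 0; i < 4; i++) {
		double s  = S * rw[i] / sum;                      /* old share */
		double s2 = S * (rw[i] + (i == wake_cpu ? w : 0))
			    / (sum + w);                          /* new share */
		printf("cpu%d: %.4f -> %.4f (delta %+.4f)\n", i, s, s2, s2 - s);
	}
	return 0;	/* cpu0: +5/56 ~ +0.0893, cpu1: -4/56 ~ -0.0714 */
}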
2008-06-27  sched: fix mult overflow  (Srivatsa Vaddagiri)
It was observed these mults can overflow. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: update shares on wakeup  (Peter Zijlstra)
We found that the affine wakeup code needs rather accurate load figures to be effective. The trouble is that updating the load figures is fairly expensive with group scheduling. Therefore ratelimit the updating. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
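The ratelimiting itself reduces to a cheap timestamp check in front of the expensive shares walk. A minimal user-space sketch of that pattern, with made-up names and a made-up period (the real sysctl and update function live in the scheduler code; this only shows the shape of the check):

#include <stdint.h>
#include <stdio.h>

static const uint64_t ratelimit_ns = 250000;   /* assumed period: 250us   */
static uint64_t last_update_ns;

static void update_group_shares(void)          /* stand-in for the walk   */
{
	printf("recomputing shares\n");
}

static void wakeup_update_shares(uint64_t now_ns)
{
	if (now_ns - last_update_ns < ratelimit_ns)
		return;                        /* updated recently, skip   */
	last_update_ns = now_ns;
	update_group_shares();
}

int main(void)
{
	uint64_t t;

	for (t = 0; t <= 1000000; t += 100000) /* a wakeup every 100us     */
		wakeup_update_shares(t);
	return 0;                              /* the walk runs 3 of 11 times */
}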
2008-06-27  sched: fix shares boost logic  (Peter Zijlstra)
In case the domain is empty, pretend there is a single task on each cpu, so that together with the boost logic we end up giving 1/n shares to each cpu. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: disable source/target_load bias  (Peter Zijlstra)
The bias given by source/target_load functions can be very large, disable it by default to get faster convergence. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: optimize effective_load()  (Peter Zijlstra)
s_i = S * rw_i / \Sum_j rw_j

->  \Sum_j rw_j = S * rw_i / s_i

->  s'_i = S * (rw_i + w) / (\Sum_j rw_j + w)

s'      = S * (rw + w) / ((S * rw / s) + w)
delta s = s' - s
        = s * (S * (rw + w) / (S * rw + s * w) - 1)

a = S*(rw+w), b = S*rw + s*w

delta s = s * (a-b) / b

IOW, trade one divide for two multiplies.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
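A user-space C sketch (made-up numbers, plain doubles instead of the kernel's fixed-point arithmetic) showing that the rearranged form yields the same delta while needing only the one final divide:

#include <stdio.h>

int main(void)
{
	/* arbitrary example values: group weight S, this cpu's rq weight rw,
	 * its current share s, and the task weight w being added */
	double S = 1024, rw = 3072, s = 512, w = 256;

	/* straightforward form, inner divide included */
	double naive = S * (rw + w) / ((S * rw / s) + w) - s;

	/* rearranged form from above: delta = s * (a - b) / b */
	double a = S * (rw + w);
	double b = S * rw + s * w;
	double fast = s * (a - b) / b;

	printf("naive=%.2f rearranged=%.2f\n", naive, fast);	/* both 20.48 */
	return 0;
}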
2008-06-27  sched: remove prio preference from balance decisions  (Peter Zijlstra)
Priority loses much of its meaning in a hierarchical context. So don't use it in balance decisions. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: fix task_h_load()  (Peter Zijlstra)
Currently task_h_load() computes the load of a task and uses that to either subtract it from the total, or add to it. However, removing or adding a task need not have any effect on the total load at all. Imagine adding a task to a group that is local to one cpu - in that case the total load of that cpu is unaffected. So properly compute the addition/removal:

  s_i  = S * rw_i / \Sum_j rw_j
  s'_i = S * (rw_i + wl) / (\Sum_j rw_j + wg)

then s'_i - s_i gives the change in load, where s_i is the shares for cpu i, S the group weight, rw_i the runqueue weight for that cpu, wl the weight we add (subtract) and wg the weight contribution to the runqueue.

Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
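A small user-space sketch of the corrected computation (made-up weights; the kernel works with its own fixed-point load values). The first call shows the case from the text above - a group entirely local to one cpu, where adding a task changes that cpu's share not at all:

#include <stdio.h>

/* change in cpu i's share when a task of weight wl is added to its runqueue
 * and the group's total runqueue weight grows by wg */
static double share_delta(double S, double rw_i, double sum_rw,
			  double wl, double wg)
{
	double s  = S * rw_i / sum_rw;
	double s2 = S * (rw_i + wl) / (sum_rw + wg);

	return s2 - s;
}

int main(void)
{
	/* group local to one cpu: rw_i == sum_rw, so the delta is 0 */
	printf("%.2f\n", share_delta(2048, 3072, 3072, 1024, 1024));

	/* group spread over several cpus: the share actually moves */
	printf("%.2f\n", share_delta(2048, 1024, 3072, 1024, 1024));
	return 0;
}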
2008-06-27  sched: fix load scaling in group balancing  (Peter Zijlstra)
Doing the load balance will change cfs_rq->load.weight (that's the whole point), but since that's part of the scale factor, we'll scale back with a different amount. The weight getting smaller would result in an inflated moved_load, which causes it to stop balancing too soon. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: hierarchical load vs find_busiest_group  (Peter Zijlstra)
find_busiest_group() has some assumptions about task weight being in the NICE_0_LOAD range. Hierarchical task groups break this assumption - fix this by replacing it with the average task weight, which adapts to the situation. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: hierarchical load vs affine wakeups  (Peter Zijlstra)
With hierarchical grouping we can't just compare task weight to rq weight - we need to scale the weight appropriately. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: persistent average load per task  (Peter Zijlstra)
Remove the fall-back to SCHED_LOAD_SCALE by remembering the previous value of cpu_avg_load_per_task() - this is useful because of the hierarchical group model in which task weight can be much smaller. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: fix sched_balance_self() smp group balancing  (Peter Zijlstra)
Finding the least idle cpu is more accurate when done with updated shares. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: fix newidle smp group balancing  (Peter Zijlstra)
Re-compute the shares on newidle - so we can make a decision based on recent data. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: simplify the group load balancer  (Peter Zijlstra)
While thinking about the previous patch I realized that using per domain aggregate load values in load_balance_fair() is wrong; we should use the load value for that CPU. By not needing per domain hierarchical load values we don't need to store per domain aggregate shares, which greatly simplifies all the math. It basically falls apart into two separate computations:

 - a per domain update of the shares
 - a per CPU update of the hierarchical load

Also get rid of the move_group_shares() stuff - just re-compute the shares again after a successful load balance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: no need to aggregate task_weight  (Peter Zijlstra)
We only need to know the task_weight of the busiest rq - nothing to do if there are no tasks there. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: dont micro manage share losses  (Peter Zijlstra)
We used to try and contain the loss of 'shares' by playing arithmetic games. Replace that by noticing that at the top sched_domain we'll always have the full weight in shares to distribute. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: kill task_group balancing  (Srivatsa Vaddagiri)
The idea was to balance groups until we've reached the global goal; however, Vatsa rightly pointed out that we might never reach that goal this way - hence take out this logic. [ the initial rationale for this 'feature' was to promote max concurrency within a group - it does not however affect fairness ] Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: update aggregate when holding the RQs  (Peter Zijlstra)
It was observed that in __update_group_shares_cpu()

  rq_weight > aggregate()->rq_weight

This is caused by forks/wakeups in between the initial aggregate pass and locking of the RQs for load balance. To avoid this situation, partially re-do the aggregation once we have the RQs locked (which prevents new tasks from appearing).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: fix sched_domain aggregation  (Peter Zijlstra)
Keeping the aggregate on the first cpu of the sched domain has two problems:

 - it could collide between different sched domains on different cpus
 - it could slow things down because of the remote accesses

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: add full schedstats to /proc/sched_debug  (Peter Zijlstra)
show all the schedstats in /debug/sched_debug as well. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: fix wakeup granularity and buddy granularity  (Peter Zijlstra)
Uncouple buddy selection from wakeup granularity. The initial idea was that buddies could run ahead as far as a normal task can - do this by measuring a pair 'slice' just as we do for a normal task. This means we can drop the wakeup_granularity back to 5ms. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: sched_clock_cpu() based cpu_clock()  (Peter Zijlstra)
With sched_clock_cpu() being reasonably in sync between cpus (max 1 jiffy difference), use this to provide cpu_clock(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: revert revert of: fair-group: SMP-nice for group scheduling  (Peter Zijlstra)
Try again.. Initial commit: 18d95a2832c1392a2d63227a7a6d433cb9f2037e Revert: 6363ca57c76b7b83639ca8c83fc285fa26a7880e Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: fix calc_delta_asym, #2  (Peter Zijlstra)
Ok, so why are we in this mess, it was:

  1/w

but now we mixed that rw in the mix like:

  rw/w

rw being \Sum w suggests: fiddling w, we should also fiddle rw, humm?

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: fix calc_delta_asym()  (Peter Zijlstra)
calc_delta_asym() is supposed to do the same as calc_delta_fair() except linearly shrink the result for negative nice processes - this causes them to have a smaller preemption threshold so that they are more easily preempted. The problem is that for task groups se->load.weight is the per cpu share of the actual task group weight; take that into account. Also provide a debug switch to disable the asymmetry (which I still don't like - but it does greatly benefit some workloads). This would explain the interactivity issues reported against group scheduling. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: revert the revert of: weight calculations  (Peter Zijlstra)
Try again.. initial commit: 8f1bc385cfbab474db6c27b5af1e439614f3025c revert: f9305d4a0968201b2818dbed0dc8cb0d4ee7aeb3 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-27  sched: clean up some unused variables  (Peter Zijlstra)
In file included from /mnt/build/linux-2.6/kernel/sched.c:1496:
/mnt/build/linux-2.6/kernel/sched_rt.c: In function '__enable_runtime':
/mnt/build/linux-2.6/kernel/sched_rt.c:339: warning: unused variable 'rd'
/mnt/build/linux-2.6/kernel/sched_rt.c: In function 'requeue_rt_entity':
/mnt/build/linux-2.6/kernel/sched_rt.c:692: warning: unused variable 'queue'

Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-26  rcu: make rcutorture even more vicious: invoke RCU readers from irq handlers (timers)  (Paul E. McKenney)
This patch allows torturing RCU from irq handlers (timers, in this case). A new module parameter irqreader enables such additional torturing, and is enabled by default. Variants of RCU that do not tolerate readers being called from irq handlers (e.g., SRCU) ignore irqreader. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: josh@freedesktop.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: dino@in.ibm.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Cc: vegard.nossum@gmail.com Cc: adobriyan@gmail.com Cc: oleg@tv-sign.ru Cc: bunk@kernel.org Cc: rjw@sisk.pl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-26  Merge commit 'v2.6.26-rc8' into core/rcu  (Ingo Molnar)
2008-06-25  Merge branch 'linus' into tracing/sysprof  (Ingo Molnar)
2008-06-25  Merge branch 'linus' into tracing/ftrace  (Ingo Molnar)
2008-06-25  Merge branch 'linus' into sched/new-API-sched_setscheduler  (Ingo Molnar)
2008-06-25  Merge branch 'linus' into sched/devel  (Ingo Molnar)
Conflicts: kernel/sched_rt.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-25  Merge branch 'linus' into core/softirq  (Ingo Molnar)
2008-06-25  Merge commit 'v2.6.26-rc8' into x86/xen  (Ingo Molnar)
Conflicts: arch/x86/xen/enlighten.c arch/x86/xen/mmu.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-24  [PATCH] remove useless argument type in audit_filter_user()  (Peng Haitao)
The second argument "type" is not used in audit_filter_user(), so I think that type can be removed. If I'm wrong, please tell me. Signed-off-by: Peng Haitao <penght@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-24  [PATCH] audit: fix kernel-doc parameter notation  (Randy Dunlap)
Fix auditfilter kernel-doc missing parameter description:

  Warning(lin2626-rc3//kernel/auditfilter.c:1551): No description found for parameter 'sessionid'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-24  [PATCH] kernel/audit.c: nlh->nlmsg_type is gotten more than once  (Peng Haitao)
The first argument "nlh->nlmsg_type" of audit_receive_filter() should be modified to "msg_type" in audit_receive_msg(). Signed-off-by: Peng Haitao <penght@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-24  kgdb: sparse fix  (Jason Wessel)
- Fix warning reported by sparse:

  kernel/kgdb.c:1502:6: warning: symbol 'kgdb_console_write' was not declared. Should it be static?

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2008-06-24  rcu: make quiescent rcutorture less power-hungry  (Paul E. McKenney)
This patch aligns the rcutorture wakeup times with all other multiple-of-a-second wakeups to further decrease power consumption. Suggested-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: josh@freedesktop.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: dino@in.ibm.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Cc: vegard.nossum@gmail.com Cc: adobriyan@gmail.com Cc: oleg@tv-sign.ru Cc: bunk@kernel.org Cc: rjw@sisk.pl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-24  rcu, rcutorture: make quiescent rcutorture less power-hungry  (Paul E. McKenney)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: josh@freedesktop.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: dino@in.ibm.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Cc: vegard.nossum@gmail.com Cc: adobriyan@gmail.com Cc: oleg@tv-sign.ru Cc: bunk@kernel.org Cc: rjw@sisk.pl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23  lockdep: add lock_class information to lock_chain and output it  (Huang, Ying)
It is based on the x86/master branch of the git-x86 tree, and has been tested on the x86_64 platform.

ChangeLog v2:
 - Enclose the proc file system related code in CONFIG_PROVE_LOCKING.
 - Fix the nr_chain_hlocks update code.

Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23  sched: add new API sched_setscheduler_nocheck: add a flag to control access checks  (Rusty Russell)
Hidehiro Kawai noticed that sched_setscheduler() can fail in stop_machine: it calls sched_setscheduler() from insmod, which can have CAP_SYS_MODULE without CAP_SYS_NICE. Two cases could have failed, so they are changed to sched_setscheduler_nocheck():

 - kernel/softirq.c:cpu_callback() - CPU hotplug callback
 - kernel/stop_machine.c:__stop_machine_run() - called from various places, including modprobe()

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: sugita <yumiko.sugita.yf@hitachi.com> Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
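In-kernel use then looks roughly like the sketch below (a hedged illustration of the pattern, not the exact call sites listed above): a kernel thread is given RT priority without the CAP_SYS_NICE check that would apply to a request coming from user space.

#include <linux/sched.h>

static void make_kthread_rt(struct task_struct *p)
{
	struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };

	/* the _nocheck variant skips the capability/security checks,
	 * which is what a kernel-internal caller wants */
	sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
}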
2008-06-23  ftrace: avoid modifying kprobe'd records  (Abhishek Sagar)
Avoid modifying the mcount call-site if there is a kprobe installed on it. These records are not marked as failed, however, which allows the filter rules on them to remain up-to-date. Whenever the kprobe on the corresponding record is removed, the record gets updated as normal. Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23  ftrace: freeze kprobe'd records  (Abhishek Sagar)
Let records identified as being kprobe'd be marked as "frozen". The trouble with records which have a kprobe installed on their mcount call-site is that they don't get updated. So if such a function, which is currently being traced, gets its tracing disabled due to a new filter rule (or because it was added to the notrace list), then it won't be updated and will continue being traced. This patch allows scanning of all frozen records during tracing to check if they should be traced. Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23  ftrace: store mcount address in rec->ip  (Abhishek Sagar)
Record the address of the mcount call-site. Currently all archs except sparc64 record the address of the instruction following the mcount call-site. Some general cleanups are entailed. Storing mcount addresses in rec->ip enables looking them up in the kprobe hash table later on to check if they're kprobe'd. Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> Cc: davem@davemloft.net Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23  Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip  (Linus Torvalds)
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futexes: fix fault handling in futex_lock_pi
2008-06-23  futexes: fix fault handling in futex_lock_pi  (Thomas Gleixner)
This patch addresses a very sporadic pi-futex related failure in highly threaded java apps on large SMP systems.

David Holmes reported that the pi_state consistency check in lookup_pi_state triggered with his test application. This means that the kernel internal pi_state and the user space futex variable are out of sync. First we assumed that this is a user space data corruption, but deeper investigation revealed that the problem happened because the pi-futex code is not handling a fault in the futex_lock_pi path when the user space variable needs to be fixed up.

The fault happens when a fork mapped the anon memory which contains the futex readonly for COW, or the page got swapped out exactly between the unlock of the futex and the return of either the new futex owner or the task which was the expected owner but failed to acquire the kernel internal rtmutex. The current futex_lock_pi() code drops out with an inconsistent state in case it faults and returns -EFAULT to user space. User space has no way to fix up that state.

When we wrote this code we thought that we could not drop the hash bucket lock at this point to handle the fault. After analysing the code again it turned out to be wrong because there are only two tasks involved which might modify the pi_state and the user space variable:

 - the task which acquired the rtmutex
 - the pending owner of the pi_state which did not get the rtmutex

Both tasks drop into the fixup_pi_state() function before returning to user space. The first task which acquired the hash bucket lock faults in the fixup of the user space variable, drops the spinlock and calls futex_handle_fault() to fault in the page. Now the second task could acquire the hash bucket lock and tries to fix up the user space variable as well. It either faults as well or it succeeds because the first task already faulted the page in.

One caveat is to avoid a double fixup. After returning from the fault handling we reacquire the hash bucket lock and check whether the pi_state owner has been modified already.

Reported-by: David Holmes <david.holmes@sun.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Holmes <david.holmes@sun.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>

 kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 73 insertions(+), 20 deletions(-)
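The recovery protocol described above roughly takes the following shape. This is a hypothetical sketch, not the actual futex.c code: write_futex_value(), fault_in_user_page() and pi_state_owner_fixed() are made-up placeholders (the real fault path goes through futex_handle_fault(), as the text says), and only the drop-lock / fault-in / reacquire / re-check sequence is the point.

static int fixup_owner_with_fault_handling(u32 __user *uaddr,
					   struct futex_hash_bucket *hb,
					   struct futex_pi_state *pi_state,
					   u32 new_owner_tid)
{
	for (;;) {
		if (!write_futex_value(uaddr, new_owner_tid))
			return 0;               /* user space fixup done */

		/* faulting is not allowed under the hash bucket lock */
		spin_unlock(&hb->lock);
		if (fault_in_user_page(uaddr))
			return -EFAULT;         /* unrecoverable fault */
		spin_lock(&hb->lock);

		/* the other task may have completed the fixup while we
		 * were in the fault path - don't fix up twice */
		if (pi_state_owner_fixed(pi_state))
			return 0;
	}
}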
2008-06-23  Merge branch 'linus' into sched/devel  (Ingo Molnar)