aboutsummaryrefslogtreecommitdiff
path: root/kernel/sched.c
AgeCommit message (Collapse)Author
2005-11-04[PATCH] improve scheduler fairness a bitOleg Nesterov
Do not transfer remaining time slice to another cpu on process exit. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] sched: hardcode non-smp set_cpus_allowedPaul Jackson
Simplify the UP (1 CPU) implementatin of set_cpus_allowed. The one CPU is hardcoded to be cpu 0 - so just test for that bit, and avoid having to pick up the cpu_online_map. Also, unexport cpu_online_map: it was only needed for set_cpus_allowed(). Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: update_hiwaters just in timeHugh Dickins
update_mem_hiwater has attracted various criticisms, in particular from those concerned with mm scalability. Originally it was called whenever rss or total_vm got raised. Then many of those callsites were replaced by a timer tick call from account_system_time. Now Frank van Maarseveen reports that to be found inadequate. How about this? Works for Frank. Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros update_hiwater_rss and update_hiwater_vm. Don't attempt to keep mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually by 1): those are hot paths. Do the opposite, update only when about to lower rss (usually by many), or just before final accounting in do_exit. Handle mm->hiwater_vm in the same way, though it's much less of an issue. Demand that whoever collects these hiwater statistics do the work of taking the maximum with rss or total_vm. And there has been no collector of these hiwater statistics in the tree. The new convention needs an example, so match Frank's usage by adding a VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS (High-Water-Mark or High-Water-Memory). There was a particular anomaly during mremap move, that hiwater_vm might be captured too high. A fleeting such anomaly remains, but it's quickly corrected now, whereas before it would stick. What locking? None: if the app is racy then these statistics will be racy, it's not worth any overhead to make them exact. But whenever it suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under page_table_lock (for now) or with preemption disabled (later on): without going to any trouble, minimize the time between reading current values and updating, to minimize those occasions when a racing thread bumps a count up and back down in between. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26[PATCH] export cpu_online_mapAndrew Morton
With CONFIG_SMP=n: *** Warning: "cpu_online_map" [drivers/firmware/dcdbas.ko] undefined! due to set_cpus_allowed(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-13[PATCH] Fix spinlock owner debuggingIngo Molnar
fix up the runqueue lock owner only if we truly did a context-switch with the runqueue lock held. Impacts ia64, mips, sparc64 and arm. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-12Mark ia64-specific MCA/INIT scheduler hooks as dangerousLinus Torvalds
..and only enable them for ia64. The functions are only valid when the whole system has been totally stopped and no scheduler activity is ongoing on any CPU, and interrupts are globally disabled. In other words, they aren't useful for anything else. So make sure that nobody can use them by mistake. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-11[PATCH] MCA/INIT: scheduler hooksKeith Owens
Scheduler hooks to see/change which process is deemed to be on a cpu. Signed-off-by: Keith Owens <kaos@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-09-10[PATCH] sched: allow the load to grow upto its cpu_powerSiddha, Suresh B
Don't pull tasks from a group if that would cause the group's total load to drop below its total cpu_power (ie. cause the group to start going idle). Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: don't kick ALB in the presence of pinned taskSiddha, Suresh B
Jack Steiner brought this issue at my OLS talk. Take a scenario where two tasks are pinned to two HT threads in a physical package. Idle packages in the system will keep kicking migration_thread on the busy package with out any success. We will run into similar scenarios in the presence of CMP/NUMA. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: use cached variable in sys_sched_yield()Renaud Lienhart
In sys_sched_yield(), we cache current->array in the "array" variable, thus there's no need to dereference "current" again later. Signed-Off-By: Renaud Lienhart <renaud.lienhart@free.fr> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: HT optimisationNick Piggin
If an idle sibling of an HT queue encounters a busy sibling, then make higher level load balancing of the non-idle variety. Performance of multiprocessor HT systems with low numbers of tasks (generally < number of virtual CPUs) can be significantly worse than the exact same workloads when running in non-HT mode. The reason is largely due to poor scheduling behaviour. This patch improves the situation, making the performance gap far less significant on one problematic test case (tbench). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: less lockingNick Piggin
During periodic load balancing, don't hold this runqueue's lock while scanning remote runqueues, which can take a non trivial amount of time especially on very large systems. Holding the runqueue lock will only help to stabilise ->nr_running, however this doesn't do much to help because tasks being woken will simply get held up on the runqueue lock, so ->nr_running would not provide a really accurate picture of runqueue load in that case anyway. What's more, ->nr_running (and possibly the cpu_load averages) of remote runqueues won't be stable anyway, so load balancing is always an inexact operation. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: less newidle lockingNick Piggin
Similarly to the earlier change in load_balance, only lock the runqueue in load_balance_newidle if the busiest queue found has a nr_running > 1. This will reduce frequency of expensive remote runqueue lock aquisitions in the schedule() path on some workloads. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: fix SMT scheduler latency bugIngo Molnar
William Weston reported unusually high scheduling latencies on his x86 HT box, on the -RT kernel. I managed to reproduce it on my HT box and the latency tracer shows the incident in action: _------=> CPU# / _-----=> irqs-off | / _----=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth |||| / ||||| delay cmd pid ||||| time | caller \ / ||||| \ | / du-2803 3Dnh2 0us : __trace_start_sched_wakeup (try_to_wake_up) .............................................................. ... we are running on CPU#3, PID 2778 gets woken to CPU#1: ... .............................................................. du-2803 3Dnh2 0us : __trace_start_sched_wakeup <<...>-2778> (73 1) du-2803 3Dnh2 0us : _raw_spin_unlock (try_to_wake_up) ................................................ ... still on CPU#3, we send an IPI to CPU#1: ... ................................................ du-2803 3Dnh1 0us : resched_task (try_to_wake_up) du-2803 3Dnh1 1us : smp_send_reschedule (try_to_wake_up) du-2803 3Dnh1 1us : send_IPI_mask_bitmask (smp_send_reschedule) du-2803 3Dnh1 2us : _raw_spin_unlock_irqrestore (try_to_wake_up) ............................................... ... 1 usec later, the IPI arrives on CPU#1: ... ............................................... <idle>-0 1Dnh. 2us : smp_reschedule_interrupt (c0100c5a 0 0) So far so good, this is the normal wakeup/preemption mechanism. But here comes the scheduler anomaly on CPU#1: <idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched) <idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched) <idle>-0 1Dnh. 3us : __schedule (preempt_schedule_irq) <idle>-0 1Dnh. 3us : profile_hit (__schedule) <idle>-0 1Dnh1 3us : sched_clock (__schedule) <idle>-0 1Dnh1 4us : _raw_spin_lock_irq (__schedule) <idle>-0 1Dnh1 4us : _raw_spin_lock_irqsave (__schedule) <idle>-0 1Dnh2 5us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh1 5us : preempt_schedule (__schedule) <idle>-0 1Dnh1 6us : _raw_spin_lock (__schedule) <idle>-0 1Dnh2 6us : find_next_bit (__schedule) <idle>-0 1Dnh2 6us : _raw_spin_lock (__schedule) <idle>-0 1Dnh3 7us : find_next_bit (__schedule) <idle>-0 1Dnh3 7us : find_next_bit (__schedule) <idle>-0 1Dnh3 8us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh2 8us : preempt_schedule (__schedule) <idle>-0 1Dnh2 8us : find_next_bit (__schedule) <idle>-0 1Dnh2 9us : trace_stop_sched_switched (__schedule) <idle>-0 1Dnh2 9us : _raw_spin_lock (trace_stop_sched_switched) <idle>-0 1Dnh3 10us : trace_stop_sched_switched <<...>-2778> (73 8c) <idle>-0 1Dnh3 10us : _raw_spin_unlock (trace_stop_sched_switched) <idle>-0 1Dnh1 10us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh. 11us : local_irq_enable_noresched (preempt_schedule_irq) <idle>-0 1Dnh. 11us < (0) we didnt pick up pid 2778! It only gets scheduled much later: <...>-2778 1Dnh2 412us : __switch_to (__schedule) <...>-2778 1Dnh2 413us : __schedule <<idle>-0> (8c 73) <...>-2778 1Dnh2 413us : _raw_spin_unlock (__schedule) <...>-2778 1Dnh1 413us : trace_stop_sched_switched (__schedule) <...>-2778 1Dnh1 414us : _raw_spin_lock (trace_stop_sched_switched) <...>-2778 1Dnh2 414us : trace_stop_sched_switched <<...>-2778> (73 1) <...>-2778 1Dnh2 414us : _raw_spin_unlock (trace_stop_sched_switched) <...>-2778 1Dnh1 415us : trace_stop_sched_switched (__schedule) the reason for this anomaly is the following code in dependent_sleeper(): /* * If a user task with lower static priority than the * running task on the SMT sibling is trying to schedule, * delay it till there is proportionately less timeslice * left of the sibling task to prevent a lower priority * task from using an unfair proportion of the * physical cpu's resources. -ck */ [...] if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / 100) > task_timeslice(p))) ret = 1; Note that in contrast to the comment above, we dont actually do the check based on static priority, we do the check based on timeslices. But timeslices go up and down, and even highprio tasks can randomly have very low timeslices (just before their next refill) and can thus be judged as 'lowprio' by the above piece of code. This condition is clearly buggy. The correct test is to check for static_prio _and_ to check for the preemption priority. Even on different static priority levels, a higher-prio interactive task should not be delayed due to a higher-static-prio CPU hog. There is a symmetric bug in the 'kick SMT sibling' code of this function as well, which can be solved in a similar way. The patch below (against the current scheduler queue in -mm) fixes both bugs. I have build and boot-tested this on x86 SMT, and nice +20 tasks still get properly throttled - so the dependent-sleeper logic is still in action. btw., these bugs pessimised the SMT scheduler because the 'delay wakeup' property was applied too liberally, so this fix is likely a throughput improvement as well. I separated out a smt_slice() function to make the code easier to read. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: TASK_NONINTERACTIVEIngo Molnar
This patch implements a task state bit (TASK_NONINTERACTIVE), which can be used by blocking points to mark the task's wait as "non-interactive". This does not mean the task will be considered a CPU-hog - the wait will simply not have an effect on the waiting task's priority - positive or negative alike. Right now only pipe_wait() will make use of it, because it's a common source of not-so-interactive waits (kernel compilation jobs, etc.). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched cleanupsIngo Molnar
whitespace cleanups. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: make idlest_group/cpu cpus_allowed-awareM.Baris Demiray
Add relevant checks into find_idlest_group() and find_idlest_cpu() to make them return only the groups that have allowed CPUs and allowed CPUs respectively. Signed-off-by: M.Baris Demiray <baris@labristeknoloji.com> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: run SCHED_NORMAL tasks with real time tasks on SMT siblingsCon Kolivas
The hyperthread aware nice handling currently puts to sleep any non real time task when a real time task is running on its sibling cpu. This can lead to prolonged starvation by having the non real time task pegged to the cpu with load balancing not pulling that task away. Currently we force lower priority hyperthread tasks to run a percentage of time difference based on timeslice differences which is meaningless when comparing real time tasks to SCHED_NORMAL tasks. We can allow non real time tasks to run with real time tasks on the sibling up to per_cpu_gain% if we use jiffies as a counter. Cleanups and micro-optimisations to the relevant code section should make it more understandable as well. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] spinlock consolidationIngo Molnar
This patch (written by me and also containing many suggestions of Arjan van de Ven) does a major cleanup of the spinlock code. It does the following things: - consolidates and enhances the spinlock/rwlock debugging code - simplifies the asm/spinlock.h files - encapsulates the raw spinlock type and moves generic spinlock features (such as ->break_lock) into the generic code. - cleans up the spinlock code hierarchy to get rid of the spaghetti. Most notably there's now only a single variant of the debugging code, located in lib/spinlock_debug.c. (previously we had one SMP debugging variant per architecture, plus a separate generic one for UP builds) Also, i've enhanced the rwlock debugging facility, it will now track write-owners. There is new spinlock-owner/CPU-tracking on SMP builds too. All locks have lockup detection now, which will work for both soft and hard spin/rwlock lockups. The arch-level include files now only contain the minimally necessary subset of the spinlock code - all the rest that can be generalized now lives in the generic headers: include/asm-i386/spinlock_types.h | 16 include/asm-x86_64/spinlock_types.h | 16 I have also split up the various spinlock variants into separate files, making it easier to see which does what. The new layout is: SMP | UP ----------------------------|----------------------------------- asm/spinlock_types_smp.h | linux/spinlock_types_up.h linux/spinlock_types.h | linux/spinlock_types.h asm/spinlock_smp.h | linux/spinlock_up.h linux/spinlock_api_smp.h | linux/spinlock_api_up.h linux/spinlock.h | linux/spinlock.h /* * here's the role of the various spinlock/rwlock related include files: * * on SMP builds: * * asm/spinlock_types.h: contains the raw_spinlock_t/raw_rwlock_t and the * initializers * * linux/spinlock_types.h: * defines the generic type and initializers * * asm/spinlock.h: contains the __raw_spin_*()/etc. lowlevel * implementations, mostly inline assembly code * * (also included on UP-debug builds:) * * linux/spinlock_api_smp.h: * contains the prototypes for the _spin_*() APIs. * * linux/spinlock.h: builds the final spin_*() APIs. * * on UP builds: * * linux/spinlock_type_up.h: * contains the generic, simplified UP spinlock type. * (which is an empty structure on non-debug builds) * * linux/spinlock_types.h: * defines the generic type and initializers * * linux/spinlock_up.h: * contains the __raw_spin_*()/etc. version of UP * builds. (which are NOPs on non-debug, non-preempt * builds) * * (included on UP-non-debug builds:) * * linux/spinlock_api_up.h: * builds the _spin_*() APIs. * * linux/spinlock.h: builds the final spin_*() APIs. */ All SMP and UP architectures are converted by this patch. arm, i386, ia64, ppc, ppc64, s390/s390x, x64 was build-tested via crosscompilers. m32r, mips, sh, sparc, have not been tested yet, but should be mostly fine. From: Grant Grundler <grundler@parisc-linux.org> Booted and lightly tested on a500-44 (64-bit, SMP kernel, dual CPU). Builds 32-bit SMP kernel (not booted or tested). I did not try to build non-SMP kernels. That should be trivial to fix up later if necessary. I converted bit ops atomic_hash lock to raw_spinlock_t. Doing so avoids some ugly nesting of linux/*.h and asm/*.h files. Those particular locks are well tested and contained entirely inside arch specific code. I do NOT expect any new issues to arise with them. If someone does ever need to use debug/metrics with them, then they will need to unravel this hairball between spinlocks, atomic ops, and bit ops that exist only because parisc has exactly one atomic instruction: LDCW (load and clear word). From: "Luck, Tony" <tony.luck@intel.com> ia64 fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjanv@infradead.org> Signed-off-by: Grant Grundler <grundler@parisc-linux.org> Cc: Matthew Wilcox <willy@debian.org> Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se> Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] Prefetch kernel stacks to speed up context switchChen, Kenneth W
For architecture like ia64, the switch stack structure is fairly large (currently 528 bytes). For context switch intensive application, we found that significant amount of cache misses occurs in switch_to() function. The following patch adds a hook in the schedule() function to prefetch switch stack structure as soon as 'next' task is determined. This allows maximum overlap in prefetch cache lines for that structure. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07Merge branch 'upstream' of ↵Linus Torvalds
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6
2005-09-07[PATCH] cpusets: fix the "dynamic sched domains" bugJohn Hawkes
For a NUMA system with multiple CPUs per node, declaring a cpu-exclusive cpuset that includes only some, but not all, of the CPUs in a node will mangle the sched domain structures. Signed-off-by: John Hawkes <hawkes@sgi.com> Cc; Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] cpusets: Move the ia64 domain setup code to the generic codeJohn Hawkes
Signed-off-by: John Hawkes <hawkes@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[kernel-doc] fix various DocBook build problems/warningsJeff Garzik
Most serious is fixing include/sound/pcm.h, which breaks the DocBook build. The other stuff is just filling in things that cause warnings.
2005-08-18[PATCH] Make RLIMIT_NICE ranges consistent with getpriority(2)Matt Mackall
As suggested by Michael Kerrisk <mtk-manpages@gmx.net>, make RLIMIT_NICE consistent with getpriority before it becomes available in released glibc. Signed-off-by: Matt Mackall <mpm@selenic.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26[PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIOSteven Rostedt
Here's the patch again to fix the code to handle if the values between MAX_USER_RT_PRIO and MAX_RT_PRIO are different. Without this patch, an SMP system will crash if the values are different. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Dean Nelson <dcn@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26[PATCH] Fix RLIMIT_RTPRIO breakageAndreas Steinmetz
RLIMIT_RTPRIO is supposed to grant non privileged users the right to use SCHED_FIFO/SCHED_RR scheduling policies with priorites bounded by the RLIMIT_RTPRIO value via sched_setscheduler(). This is usually used by audio users. Unfortunately this is broken in 2.6.13rc3 as you can see in the excerpt from sched_setscheduler below: /* * Allow unprivileged RT tasks to decrease priority: */ if (!capable(CAP_SYS_NICE)) { /* can't change policy */ if (policy != p->policy) return -EPERM; After the above unconditional test which causes sched_setscheduler to fail with no regard to the RLIMIT_RTPRIO value the following check is made: /* can't increase priority */ if (policy != SCHED_NORMAL && param->sched_priority > p->rt_priority && param->sched_priority > p->signal->rlim[RLIMIT_RTPRIO].rlim_cur) return -EPERM; Thus I do believe that the RLIMIT_RTPRIO value must be taken into account for the policy check, especially as the RLIMIT_RTPRIO limit is of no use without this change. The attached patch fixes this problem. Signed-off-by: Andreas Steinmetz <ast@domdv.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07[PATCH] cond_resched(): fix bogus might_sleep() warningIngo Molnar
The BKS might be reacquired before we have dropped PREEMPT_ACTIVE, which could trigger a second could trigger a second cond_resched() call. Bug found by Hirofumi Ogawa. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28[PATCH] Tweak idle thread setup semanticsIngo Molnar
This patch tweaks idle thread setup semantics a bit: instead of setting NEED_RESCHED in init_idle(), we do an explicit schedule() before calling into cpu_idle(). This patch, while having no negative side-effects, enables wider use of cond_resched()s. (which might happen in the stock kernel too, but it's particulary important for voluntary-preempt) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27[PATCH] Update cfq io scheduler to time sliced designJens Axboe
This updates the CFQ io scheduler to the new time sliced design (cfq v3). It provides full process fairness, while giving excellent aggregate system throughput even for many competing processes. It supports io priorities, either inherited from the cpu nice value or set directly with the ioprio_get/set syscalls. The latter closely mimic set/getpriority. This import is based on my latest from -mm. Signed-off-by: Jens Axboe <axboe@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25Merge Christoph's freeze cleanup patchLinus Torvalds
2005-06-25[PATCH] Cleanup patch for process freezingChristoph Lameter
1. Establish a simple API for process freezing defined in linux/include/sched.h: frozen(process) Check for frozen process freezing(process) Check if a process is being frozen freeze(process) Tell a process to freeze (go to refrigerator) thaw_process(process) Restart process frozen_process(process) Process is frozen now 2. Remove all references to PF_FREEZE and PF_FROZEN from all kernel sources except sched.h 3. Fix numerous locations where try_to_freeze is manually done by a driver 4. Remove the argument that is no longer necessary from two function calls. 5. Some whitespace cleanup 6. Clear potential race in refrigerator (provides an open window of PF_FREEZE cleared before setting PF_FROZEN, recalc_sigpending does not check PF_FROZEN). This patch does not address the problem of freeze_processes() violating the rule that a task may only modify its own flags by setting PF_FREEZE. This is not clean in an SMP environment. freeze(process) is therefore not SMP safe! Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] Dynamic sched domains: sched changesDinakar Guniguntala
The following patches add dynamic sched domains functionality that was extensively discussed on lkml and lse-tech. I would like to see this added to -mm o The main advantage with this feature is that it ensures that the scheduler load balacing code only balances against the cpus that are in the sched domain as defined by an exclusive cpuset and not all of the cpus in the system. This removes any overhead due to load balancing code trying to pull tasks outside of the cpu exclusive cpuset only to be prevented by the tasks' cpus_allowed mask. o cpu exclusive cpusets are useful for servers running orthogonal workloads such as RT applications requiring low latency and HPC applications that are throughput sensitive o It provides a new API partition_sched_domains in sched.c that makes dynamic sched domains possible. o cpu_exclusive cpusets sets are now associated with a sched domain. Which means that the users can dynamically modify the sched domains through the cpuset file system interface o ia64 sched domain code has been updated to support this feature as well o Currently, this does not support hotplug. (However some of my tests indicate hotplug+preempt is currently broken) o I have tested it extensively on x86. o This should have very minimal impact on performance as none of the fast paths are affected Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] Changing RT priority without CAP_SYS_NICEOlivier Croquette
Presently, a process without the capability CAP_SYS_NICE can not change its own policy, which is OK. But it can also not decrease its RT priority (if scheduled with policy SCHED_RR or SCHED_FIFO), which is what this patch changes. The rationale is the same as for the nice value: a process should be able to require less priority for itself. Increasing the priority is still not allowed. This is for example useful if you give a multithreaded user process a RT priority, and the process would like to organize its internal threads using priorities also. Then you can give the process the highest priority needed N, and the process starts its threads with lower priorities: N-1, N-2... The POSIX norm says that the permissions are implementation specific, so I think we can do that. In a sense, it makes the permissions consistent whatever the policy is: with this patch, process scheduled by SCHED_FIFO, SCHED_RR and SCHED_OTHER can all decrease their priority. From: Ingo Molnar <mingo@elte.hu> cleaned up and merged to -mm. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: micro-optimize task requeueing in schedule()Chen Shang
micro-optimize task requeueing in schedule() & clean up recalc_task_prio(). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: relax pinned balancingNick Piggin
The maximum rebalance interval allowed by the multiprocessor balancing backoff is often not large enough to handle corner cases where there are lots of tasks pinned on a CPU. Suresh reported: I see system livelock's if for example I have 7000 processes pinned onto one cpu (this is on the fastest 8-way system I have access to). After this patch, the machine is reported to go well above this number. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: consolidate sbe sbfNick Piggin
Consolidate balance-on-exec with balance-on-fork. This is made easy by the sched-domains RCU patches. As well as the general goodness of code reduction, this allows the runqueues to be unlocked during balance-on-fork. schedstats is a problem. Maybe just have balance-on-event instead of distinguishing fork and exec? Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: RCU domainsNick Piggin
One of the problems with the multilevel balance-on-fork/exec is that it needs to jump through hoops to satisfy sched-domain's locking semantics (that is, you may traverse your own domain when not preemptable, and you may traverse others' domains when holding their runqueue lock). balance-on-exec had to potentially migrate between more than one CPU before finding a final CPU to migrate to, and balance-on-fork needed to potentially take multiple runqueue locks. So bite the bullet and make sched-domains go completely RCU. This actually simplifies the code quite a bit. From: Ingo Molnar <mingo@elte.hu> schedstats RCU fix, and a nice comment on for_each_domain, from Ingo. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: multilevel sbe sbfNick Piggin
The fundamental problem that Suresh has with balance on exec and fork is that it only tries to balance the top level domain with the flag set. This was worked around by removing degenerate domains, but is still a problem if people want to start using more complex sched-domains, especially multilevel NUMA that ia64 is already using. This patch makes balance on fork and exec try balancing over not just the top most domain with the flag set, but all the way down the domain tree. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: remove degenerate domainsSuresh Siddha
Remove degenerate scheduler domains during the sched-domain init. For example on x86_64, we always have NUMA configured in. On Intel EM64T systems, top most sched domain will be of NUMA and with only one sched_group in it. With fork/exec balances(recent Nick's fixes in -mm tree), we always endup taking wrong decisions because of this topmost domain (as it contains only one group and find_idlest_group always returns NULL). We will endup loading HT package completely first, letting active load balance kickin and correct it. In general, this patch also makes sense with out recent Nick's fixes in -mm. From: Nick Piggin <nickpiggin@yahoo.com.au> Modified to account for more than just sched_groups when scanning for degenerate domains by Nick Piggin. And allow a runqueue's sd to go NULL rather than keep a single degenerate domain around (this happens when you run with maxcpus=1). Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: null domainsNick Piggin
Fix the last 2 places that directly access a runqueue's sched-domain and assume it cannot be NULL. That allows the use of NULL for domain, instead of a dummy domain, to signify no balancing is to happen. No functional changes. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: cleanup context switch lockingNick Piggin
Instead of requiring architecture code to interact with the scheduler's locking implementation, provide a couple of defines that can be used by the architecture to request runqueue unlocked context switches, and ask for interrupts to be enabled over the context switch. Also replaces the "switch_lock" used by these architectures with an oncpu flag (note, not a potentially slow bitflag). This eliminates one bus locked memory operation when context switching, and simplifies the task_running function. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: uninline task_timesliceIngo Molnar
"Chen, Kenneth W" <kenneth.w.chen@intel.com> uninline task_timeslice() - reduces code footprint noticeably, and it's slowpath code. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: schedstats update for balance on forkNick Piggin
Add SCHEDSTAT statistics for sched-balance-fork. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: balance on forkNick Piggin
Reimplement the balance on exec balancing to be sched-domains aware. Use this to also do balance on fork balancing. Make x86_64 do balance on fork over the NUMA domain. The problem that the non sched domains aware blancing became apparent on dual core, multi socket opterons. What we want is for the new tasks to be sent to a different socket, but more often than not, we would first load up our sibling core, or fill two cores of a single remote socket before selecting a new one. This gives large improvements to STREAM on such systems. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: no aggressive idle balancingNick Piggin
Remove the very aggressive idle stuff that has recently gone into 2.6 - it is going against the direction we are trying to go. Hopefully we can regain performance through other methods. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: tweak affine wakeupsNick Piggin
Do less affine wakeups. We're trying to reduce dbt2-pgsql idle time regressions here... make sure we don't don't move tasks the wrong way in an imbalance condition. Also, remove the cache coldness requirement from the calculation - this seems to induce sharp cutoff points where behaviour will suddenly change on some workloads if the load creeps slightly over or under some point. It is good for periodic balancing because in that case have otherwise have no other context to determine what task to move. But also make a minor tweak to "wake balancing" - the imbalance tolerance is now set at half the domain's imbalance, so we get the opportunity to do wake balancing before the more random periodic rebalancing gets preformed. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: balance timersNick Piggin
Do CPU load averaging over a number of different intervals. Allow each interval to be chosen by sending a parameter to source_load and target_load. 0 is instantaneous, idx > 0 returns a decaying average with the most recent sample weighted at 2^(idx-1). To a maximum of 3 (could be easily increased). So generally a higher number will result in more conservative balancing. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: less aggressive idle balancingNick Piggin
Remove the special casing for idle CPU balancing. Things like this are hurting for example on SMT, where are single sibling being idle doesn't really warrant a really aggressive pull over the NUMA domain, for example. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] sched: add debuggingNick Piggin
These conditions should now be impossible, and we need to fix them if they happen. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>