From 41719b03091911028116155deddc5eedf8c45e37 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 14 Jan 2009 15:36:26 +0100 Subject: mutex: preemption fixes The problem is that dropping the spinlock right before schedule is a voluntary preemption point and can cause a schedule, right after which we schedule again. Fix this inefficiency by keeping preemption disabled until we schedule, do this by explicity disabling preemption and providing a schedule() variant that assumes preemption is already disabled. Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- kernel/sched.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index 8be2c13b50d..b001c133c35 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4538,15 +4538,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev) /* * schedule() is the main scheduler function. */ -asmlinkage void __sched schedule(void) +asmlinkage void __sched __schedule(void) { struct task_struct *prev, *next; unsigned long *switch_count; struct rq *rq; int cpu; -need_resched: - preempt_disable(); cpu = smp_processor_id(); rq = cpu_rq(cpu); rcu_qsctr_inc(cpu); @@ -4603,7 +4601,13 @@ need_resched_nonpreemptible: if (unlikely(reacquire_kernel_lock(current) < 0)) goto need_resched_nonpreemptible; +} +asmlinkage void __sched schedule(void) +{ +need_resched: + preempt_disable(); + __schedule(); preempt_enable_no_resched(); if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) goto need_resched; -- cgit v1.2.3 From 0d66bf6d3514b35eb6897629059443132992dbd7 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 12 Jan 2009 14:01:47 +0100 Subject: mutex: implement adaptive spinning Change mutex contention behaviour such that it will sometimes busy wait on acquisition - moving its behaviour closer to that of spinlocks. This concept got ported to mainline from the -rt tree, where it was originally implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins. Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50) gave a 345% boost for VFS scalability on my testbox: # ./test-mutex-shm V 16 10 | grep "^avg ops" avg ops/sec: 296604 # ./test-mutex-shm V 16 10 | grep "^avg ops" avg ops/sec: 85870 The key criteria for the busy wait is that the lock owner has to be running on a (different) cpu. The idea is that as long as the owner is running, there is a fair chance it'll release the lock soon, and thus we'll be better off spinning instead of blocking/scheduling. Since regular mutexes (as opposed to rtmutexes) do not atomically track the owner, we add the owner in a non-atomic fashion and deal with the races in the slowpath. Furthermore, to ease the testing of the performance impact of this new code, there is means to disable this behaviour runtime (without having to reboot the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y), by issuing the following command: # echo NO_OWNER_SPIN > /debug/sched_features This command re-enables spinning again (this is also the default): # echo OWNER_SPIN > /debug/sched_features Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- kernel/sched.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 61 insertions(+) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index b001c133c35..589e7308c61 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4614,6 +4614,67 @@ need_resched: } EXPORT_SYMBOL(schedule); +#ifdef CONFIG_SMP +/* + * Look out! "owner" is an entirely speculative pointer + * access and not reliable. + */ +int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner) +{ + unsigned int cpu; + struct rq *rq; + + if (!sched_feat(OWNER_SPIN)) + return 0; + +#ifdef CONFIG_DEBUG_PAGEALLOC + /* + * Need to access the cpu field knowing that + * DEBUG_PAGEALLOC could have unmapped it if + * the mutex owner just released it and exited. + */ + if (probe_kernel_address(&owner->cpu, cpu)) + goto out; +#else + cpu = owner->cpu; +#endif + + /* + * Even if the access succeeded (likely case), + * the cpu field may no longer be valid. + */ + if (cpu >= nr_cpumask_bits) + goto out; + + /* + * We need to validate that we can do a + * get_cpu() and that we have the percpu area. + */ + if (!cpu_online(cpu)) + goto out; + + rq = cpu_rq(cpu); + + for (;;) { + /* + * Owner changed, break to re-assess state. + */ + if (lock->owner != owner) + break; + + /* + * Is that owner really running on that cpu? + */ + if (task_thread_info(rq->curr) != owner || need_resched()) + return 0; + + cpu_relax(); + } +out: + return 1; +} +#endif + #ifdef CONFIG_PREEMPT /* * this is the entry point to schedule() from in-kernel preemption -- cgit v1.2.3 From 7e49fcce1bdadd723ae6a0b3b324c4daced61563 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Thu, 22 Jan 2009 19:01:40 -0500 Subject: trace, lockdep: manual preempt count adding for local_bh_disable Impact: fix to preempt trace triggering lockdep check_flag failure In local_bh_disable, the use of add_preempt_count causes the preempt tracer to start recording the time preemption is off. But because it already modified the preempt_count to show softirqs disabled, and before it called the lockdep code to handle this, it causes a state that lockdep can not handle. The preempt tracer will reset the ring buffer on start of a trace, and the ring buffer reset code does a spin_lock_irqsave. This calls into lockdep and lockdep will fail when it detects the invalid state of having softirqs disabled but the internal current->softirqs_enabled is still set. The fix is to manually add the SOFTIRQ_OFFSET to preempt count and call the preempt tracer code outside the lockdep critical area. Thanks to Peter Zijlstra for suggesting this solution. Signed-off-by: Steven Rostedt Signed-off-by: Ingo Molnar --- kernel/sched.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index 52bbf1c842a..c154825ae75 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4399,10 +4399,7 @@ void scheduler_tick(void) #endif } -#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \ - defined(CONFIG_PREEMPT_TRACER)) - -static inline unsigned long get_parent_ip(unsigned long addr) +unsigned long get_parent_ip(unsigned long addr) { if (in_lock_functions(addr)) { addr = CALLER_ADDR2; @@ -4412,6 +4409,9 @@ static inline unsigned long get_parent_ip(unsigned long addr) return addr; } +#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \ + defined(CONFIG_PREEMPT_TRACER)) + void __kprobes add_preempt_count(int val) { #ifdef CONFIG_DEBUG_PREEMPT -- cgit v1.2.3 From 6e2756376c706e4da3454a272947983f92e80a7e Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 25 Feb 2009 13:59:48 +0100 Subject: generic-ipi: remove CSD_FLAG_WAIT Oleg noticed that we don't strictly need CSD_FLAG_WAIT, rework the code so that we can use CSD_FLAG_LOCK for both purposes. Signed-off-by: Peter Zijlstra Cc: Oleg Nesterov Cc: Linus Torvalds Cc: Nick Piggin Cc: Jens Axboe Cc: "Paul E. McKenney" Cc: Rusty Russell Signed-off-by: Ingo Molnar --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index 410eec40413..d4c2749a299 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1093,7 +1093,7 @@ static void hrtick_start(struct rq *rq, u64 delay) if (rq == this_rq()) { hrtimer_restart(timer); } else if (!rq->hrtick_csd_pending) { - __smp_call_function_single(cpu_of(rq), &rq->hrtick_csd); + __smp_call_function_single(cpu_of(rq), &rq->hrtick_csd, 0); rq->hrtick_csd_pending = 1; } } -- cgit v1.2.3 From c69fc56de1df5769f2ec69c915c7ad5afe63804c Mon Sep 17 00:00:00 2001 From: Rusty Russell Date: Fri, 13 Mar 2009 14:49:46 +1030 Subject: cpumask: use topology_core_cpumask/topology_thread_cpumask instead of cpu_core_map/cpu_sibling_map Impact: cleanup This is presumably what those definitions are for, and while all archs define cpu_core_map/cpu_sibling map, that's changing (eg. x86 wants to change it to a pointer). Signed-off-by: Rusty Russell --- kernel/sched.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index 0a76d0b6f21..5dabd80c3c1 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -7249,7 +7249,7 @@ cpu_to_core_group(int cpu, const struct cpumask *cpu_map, { int group; - cpumask_and(mask, &per_cpu(cpu_sibling_map, cpu), cpu_map); + cpumask_and(mask, topology_thread_cpumask(cpu), cpu_map); group = cpumask_first(mask); if (sg) *sg = &per_cpu(sched_group_core, group).sg; @@ -7278,7 +7278,7 @@ cpu_to_phys_group(int cpu, const struct cpumask *cpu_map, cpumask_and(mask, cpu_coregroup_mask(cpu), cpu_map); group = cpumask_first(mask); #elif defined(CONFIG_SCHED_SMT) - cpumask_and(mask, &per_cpu(cpu_sibling_map, cpu), cpu_map); + cpumask_and(mask, topology_thread_cpumask(cpu), cpu_map); group = cpumask_first(mask); #else group = cpu; @@ -7621,7 +7621,7 @@ static int __build_sched_domains(const struct cpumask *cpu_map, SD_INIT(sd, SIBLING); set_domain_attribute(sd, attr); cpumask_and(sched_domain_span(sd), - &per_cpu(cpu_sibling_map, i), cpu_map); + topology_thread_cpumask(i), cpu_map); sd->parent = p; p->child = sd; cpu_to_cpu_group(i, cpu_map, &sd->groups, tmpmask); @@ -7632,7 +7632,7 @@ static int __build_sched_domains(const struct cpumask *cpu_map, /* Set up CPU (sibling) groups */ for_each_cpu(i, cpu_map) { cpumask_and(this_sibling_map, - &per_cpu(cpu_sibling_map, i), cpu_map); + topology_thread_cpumask(i), cpu_map); if (i != cpumask_first(this_sibling_map)) continue; -- cgit v1.2.3 From df7c8e845e8e2030e8ae947e0ace56d184d0e9a0 Mon Sep 17 00:00:00 2001 From: Rusty Russell Date: Thu, 19 Mar 2009 15:22:20 +1030 Subject: cpumask: remove cpumask allocation from idle_balance Impact: fix circular locking Steven reports a circular locking from alloc_cpumask_var doing a wakeup. We get rid of this using the tried-and-true technique of using a per-cpu cpumask_var_t rather than doing an alloc every time. Simpler and more robust than a rare, implicit allocation within an atomic codepath. Reported-by: Steven Rostedt Signed-off-by: Rusty Russell LKML-Reference: Signed-off-by: Ingo Molnar --- kernel/sched.c | 35 ++++++++++++++++++----------------- 1 file changed, 18 insertions(+), 17 deletions(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index 5dabd80c3c1..48862d418be 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3448,19 +3448,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle, */ #define MAX_PINNED_INTERVAL 512 +/* Working cpumask for load_balance and load_balance_newidle. */ +static DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask); + /* * Check this_cpu to ensure it is balanced within domain. Attempt to move * tasks if there is an imbalance. */ static int load_balance(int this_cpu, struct rq *this_rq, struct sched_domain *sd, enum cpu_idle_type idle, - int *balance, struct cpumask *cpus) + int *balance) { int ld_moved, all_pinned = 0, active_balance = 0, sd_idle = 0; struct sched_group *group; unsigned long imbalance; struct rq *busiest; unsigned long flags; + struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask); cpumask_setall(cpus); @@ -3615,8 +3619,7 @@ out: * this_rq is locked. */ static int -load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd, - struct cpumask *cpus) +load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd) { struct sched_group *group; struct rq *busiest = NULL; @@ -3624,6 +3627,7 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd, int ld_moved = 0; int sd_idle = 0; int all_pinned = 0; + struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask); cpumask_setall(cpus); @@ -3764,10 +3768,6 @@ static void idle_balance(int this_cpu, struct rq *this_rq) struct sched_domain *sd; int pulled_task = 0; unsigned long next_balance = jiffies + HZ; - cpumask_var_t tmpmask; - - if (!alloc_cpumask_var(&tmpmask, GFP_ATOMIC)) - return; for_each_domain(this_cpu, sd) { unsigned long interval; @@ -3778,7 +3778,7 @@ static void idle_balance(int this_cpu, struct rq *this_rq) if (sd->flags & SD_BALANCE_NEWIDLE) /* If we've pulled tasks over stop searching: */ pulled_task = load_balance_newidle(this_cpu, this_rq, - sd, tmpmask); + sd); interval = msecs_to_jiffies(sd->balance_interval); if (time_after(next_balance, sd->last_balance + interval)) @@ -3793,7 +3793,6 @@ static void idle_balance(int this_cpu, struct rq *this_rq) */ this_rq->next_balance = next_balance; } - free_cpumask_var(tmpmask); } /* @@ -3943,11 +3942,6 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle) unsigned long next_balance = jiffies + 60*HZ; int update_next_balance = 0; int need_serialize; - cpumask_var_t tmp; - - /* Fails alloc? Rebalancing probably not a priority right now. */ - if (!alloc_cpumask_var(&tmp, GFP_ATOMIC)) - return; for_each_domain(cpu, sd) { if (!(sd->flags & SD_LOAD_BALANCE)) @@ -3972,7 +3966,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle) } if (time_after_eq(jiffies, sd->last_balance + interval)) { - if (load_balance(cpu, rq, sd, idle, &balance, tmp)) { + if (load_balance(cpu, rq, sd, idle, &balance)) { /* * We've pulled tasks over so either we're no * longer idle, or one of our SMT siblings is @@ -4006,8 +4000,6 @@ out: */ if (likely(update_next_balance)) rq->next_balance = next_balance; - - free_cpumask_var(tmp); } /* @@ -8303,6 +8295,9 @@ void __init sched_init(void) #endif #ifdef CONFIG_USER_SCHED alloc_size *= 2; +#endif +#ifdef CONFIG_CPUMASK_OFFSTACK + alloc_size *= num_possible_cpus() * cpumask_size(); #endif /* * As sched_init() is called before page_alloc is setup, @@ -8341,6 +8336,12 @@ void __init sched_init(void) ptr += nr_cpu_ids * sizeof(void **); #endif /* CONFIG_USER_SCHED */ #endif /* CONFIG_RT_GROUP_SCHED */ +#ifdef CONFIG_CPUMASK_OFFSTACK + for_each_possible_cpu(i) { + per_cpu(load_balance_tmpmask, i) = (void *)ptr; + ptr += cpumask_size(); + } +#endif /* CONFIG_CPUMASK_OFFSTACK */ } #ifdef CONFIG_SMP -- cgit v1.2.3 From 8c083f081d0014057901c68a0a3e0f8ca7ac8d23 Mon Sep 17 00:00:00 2001 From: Rusty Russell Date: Thu, 19 Mar 2009 15:22:20 +1030 Subject: cpumask: remove cpumask allocation from idle_balance, fix Impact: fix boot crash Fix typo in the size calculation. Reported-by: Ingo Molnar Signed-off-by: Rusty Russell LKML-Reference: Signed-off-by: Ingo Molnar --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index 48862d418be..11dd52780ad 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -8297,7 +8297,7 @@ void __init sched_init(void) alloc_size *= 2; #endif #ifdef CONFIG_CPUMASK_OFFSTACK - alloc_size *= num_possible_cpus() * cpumask_size(); + alloc_size += num_possible_cpus() * cpumask_size(); #endif /* * As sched_init() is called before page_alloc is setup, -- cgit v1.2.3 From d5ac537e5fb6fc12384c9f3ed6a15e912dfbbc2a Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Sat, 28 Mar 2009 21:52:47 -0700 Subject: sched: fix errors in struct & function comments Fix kernel-doc errors in sched.c: the structs don't have kernel-doc notation and the short function description needs to be one line only. Error(kernel/sched.c:3197): cannot understand prototype: 'struct sd_lb_stats ' Error(kernel/sched.c:3228): cannot understand prototype: 'struct sg_lb_stats ' Error(kernel/sched.c:3375): duplicate section name 'Description' Signed-off-by: Randy Dunlap cc: Ingo Molnar Signed-off-by: Linus Torvalds --- kernel/sched.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index f4c413bdd38..5757e03cfac 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3190,7 +3190,7 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest, return 0; } /********** Helpers for find_busiest_group ************************/ -/** +/* * sd_lb_stats - Structure to store the statistics of a sched_domain * during load balancing. */ @@ -3222,7 +3222,7 @@ struct sd_lb_stats { #endif }; -/** +/* * sg_lb_stats - stats of a sched_group required for load_balancing */ struct sg_lb_stats { @@ -3360,16 +3360,17 @@ static inline void update_sd_power_savings_stats(struct sched_group *group, } /** - * check_power_save_busiest_group - Check if we have potential to perform - * some power-savings balance. If yes, set the busiest group to be - * the least loaded group in the sched_domain, so that it's CPUs can - * be put to idle. - * + * check_power_save_busiest_group - see if there is potential for some power-savings balance * @sds: Variable containing the statistics of the sched_domain * under consideration. * @this_cpu: Cpu at which we're currently performing load-balancing. * @imbalance: Variable to store the imbalance. * + * Description: + * Check if we have potential to perform some power-savings balance. + * If yes, set the busiest group to be the least loaded group in the + * sched_domain, so that it's CPUs can be put to idle. + * * Returns 1 if there is potential to perform power-savings balance. * Else returns 0. */ -- cgit v1.2.3 From 7f1e2ca9f04b02794597f60e7b1d43f0a1317939 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 13 Mar 2009 12:21:27 +0100 Subject: hrtimer: fix rq->lock inversion (again) It appears I inadvertly introduced rq->lock recursion to the hrtimer_start() path when I delegated running already expired timers to softirq context. This patch fixes it by introducing a __hrtimer_start_range_ns() method that will not use raise_softirq_irqoff() but __raise_softirq_irqoff() which avoids the wakeup. It then also changes schedule() to check for pending softirqs and do the wakeup then, I'm not quite sure I like this last bit, nor am I convinced its really needed. Signed-off-by: Peter Zijlstra Cc: Peter Zijlstra Cc: paulus@samba.org LKML-Reference: <20090313112301.096138802@chello.nl> Signed-off-by: Ingo Molnar --- kernel/sched.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index 196d48babbe..63256e3ede2 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -231,13 +231,20 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b) spin_lock(&rt_b->rt_runtime_lock); for (;;) { + unsigned long delta; + ktime_t soft, hard; + if (hrtimer_active(&rt_b->rt_period_timer)) break; now = hrtimer_cb_get_time(&rt_b->rt_period_timer); hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period); - hrtimer_start_expires(&rt_b->rt_period_timer, - HRTIMER_MODE_ABS); + + soft = hrtimer_get_softexpires(&rt_b->rt_period_timer); + hard = hrtimer_get_expires(&rt_b->rt_period_timer); + delta = ktime_to_ns(ktime_sub(hard, soft)); + __hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta, + HRTIMER_MODE_ABS, 0); } spin_unlock(&rt_b->rt_runtime_lock); } @@ -1146,7 +1153,8 @@ static __init void init_hrtick(void) */ static void hrtick_start(struct rq *rq, u64 delay) { - hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay), HRTIMER_MODE_REL); + __hrtimer_start_range_ns(&rq->hrtick_timer, ns_to_ktime(delay), 0, + HRTIMER_MODE_REL, 0); } static inline void init_hrtick(void) -- cgit v1.2.3 From a18b83b7ef3c98cd8b4bb885e4a649a8f30fb7b0 Mon Sep 17 00:00:00 2001 From: Bharata B Rao Date: Mon, 23 Mar 2009 10:02:53 +0530 Subject: cpuacct: make cpuacct hierarchy walk in cpuacct_charge() safe when rcupreempt is used -v2 Impact: fix cgroups race under rcu-preempt cpuacct_charge() obtains task's ca and does a hierarchy walk upwards. This can race with the task's movement between cgroups. This race can cause an access to freed ca pointer in cpuacct_charge() or access to invalid cgroups pointer of the task. This will not happen with rcu or tree rcu as cpuacct_charge() is called with preemption disabled. However if rcupreempt is used, the race is seen. Thanks to Li Zefan for explaining this. Fix this race by explicitly protecting ca and the hierarchy walk with rcu_read_lock(). Changes for v2: - Update patch descrition (as per Li Zefan's review comments). - Remove comments in cpuacct_charge() which explained why rcu_read_lock() was needed (as per Peter Zijlstra's review comments). Signed-off-by: Bharata B Rao Cc: Dhaval Giani Cc: Li Zefan Cc: Paul Menage Cc: KAMEZAWA Hiroyuki Acked-by: Peter Zijlstra Acked-by: Balbir Singh Tested-by: Balbir Singh Signed-off-by: Ingo Molnar --- kernel/sched.c | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index 186c6fd08ac..cc397aae5ea 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -9654,12 +9654,17 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime) return; cpu = task_cpu(tsk); + + rcu_read_lock(); + ca = task_ca(tsk); for (; ca; ca = ca->parent) { u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu); *cpuusage += cputime; } + + rcu_read_unlock(); } struct cgroup_subsys cpuacct_subsys = { -- cgit v1.2.3 From c5f8d99585d7b5b7e857fabf8aefd0174903a98c Mon Sep 17 00:00:00 2001 From: Hidetoshi Seto Date: Tue, 31 Mar 2009 16:56:03 +0900 Subject: posixtimers, sched: Fix posix clock monotonicity Impact: Regression fix (against clock_gettime() backwarding bug) This patch re-introduces a couple of functions, task_sched_runtime and thread_group_sched_runtime, which was once removed at the time of 2.6.28-rc1. These functions protect the sampling of thread/process clock with rq lock. This rq lock is required not to update rq->clock during the sampling. i.e. The clock_gettime() may return ((accounted runtime before update) + (delta after update)) that is less than what it should be. v2 -> v3: - Rename static helper function __task_delta_exec() to do_task_delta_exec() since -tip tree already has a __task_delta_exec() of different version. v1 -> v2: - Revises comments of function and patch description. - Add note about accuracy of thread group's runtime. Signed-off-by: Hidetoshi Seto Acked-by: Peter Zijlstra Cc: stable@kernel.org [2.6.28.x][2.6.29.x] LKML-Reference: <49D1CC93.4080401@jp.fujitsu.com> Signed-off-by: Ingo Molnar --- kernel/sched.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 57 insertions(+), 8 deletions(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index cc397aae5ea..c8d7f17bd03 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4139,9 +4139,25 @@ DEFINE_PER_CPU(struct kernel_stat, kstat); EXPORT_PER_CPU_SYMBOL(kstat); /* - * Return any ns on the sched_clock that have not yet been banked in + * Return any ns on the sched_clock that have not yet been accounted in * @p in case that task is currently running. + * + * Called with task_rq_lock() held on @rq. */ +static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq) +{ + u64 ns = 0; + + if (task_current(rq, p)) { + update_rq_clock(rq); + ns = rq->clock - p->se.exec_start; + if ((s64)ns < 0) + ns = 0; + } + + return ns; +} + unsigned long long task_delta_exec(struct task_struct *p) { unsigned long flags; @@ -4149,16 +4165,49 @@ unsigned long long task_delta_exec(struct task_struct *p) u64 ns = 0; rq = task_rq_lock(p, &flags); + ns = do_task_delta_exec(p, rq); + task_rq_unlock(rq, &flags); - if (task_current(rq, p)) { - u64 delta_exec; + return ns; +} - update_rq_clock(rq); - delta_exec = rq->clock - p->se.exec_start; - if ((s64)delta_exec > 0) - ns = delta_exec; - } +/* + * Return accounted runtime for the task. + * In case the task is currently running, return the runtime plus current's + * pending runtime that have not been accounted yet. + */ +unsigned long long task_sched_runtime(struct task_struct *p) +{ + unsigned long flags; + struct rq *rq; + u64 ns = 0; + + rq = task_rq_lock(p, &flags); + ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq); + task_rq_unlock(rq, &flags); + + return ns; +} +/* + * Return sum_exec_runtime for the thread group. + * In case the task is currently running, return the sum plus current's + * pending runtime that have not been accounted yet. + * + * Note that the thread group might have other running tasks as well, + * so the return value not includes other pending runtime that other + * running tasks might have. + */ +unsigned long long thread_group_sched_runtime(struct task_struct *p) +{ + struct task_cputime totals; + unsigned long flags; + struct rq *rq; + u64 ns; + + rq = task_rq_lock(p, &flags); + thread_group_cputime(p, &totals); + ns = totals.sum_exec_runtime + do_task_delta_exec(p, rq); task_rq_unlock(rq, &flags); return ns; -- cgit v1.2.3 From ef12fefabf94b6a902ad3abd3eb124b00560c445 Mon Sep 17 00:00:00 2001 From: Bharata B Rao Date: Tue, 31 Mar 2009 10:02:22 +0530 Subject: cpuacct: add per-cgroup utime/stime statistics Add per-cgroup cpuacct controller statistics like the system and user time consumed by the group of tasks. Changelog: v7 - Changed the name of the statistic from utime to user and from stime to system so that in future we could easily add other statistics like irq, softirq, steal times etc easily. v6 - Fixed a bug in the error path of cpuacct_create() (pointed by Li Zefan). v5 - In cpuacct_stats_show(), use cputime64_to_clock_t() since we are operating on a 64bit variable here. v4 - Remove comments in cpuacct_update_stats() which explained why rcu_read_lock() was needed (as per Peter Zijlstra's review comments). - Don't say that percpu_counter_read() is broken in Documentation/cpuacct.txt as per KAMEZAWA Hiroyuki's review comments. v3 - Fix a small race in the cpuacct hierarchy walk. v2 - stime and utime now exported in clock_t units instead of msecs. - Addressed the code review comments from Balbir and Li Zefan. - Moved to -tip tree. v1 - Moved the stime/utime accounting to cpuacct controller. Earlier versions - http://lkml.org/lkml/2009/2/25/129 Signed-off-by: Bharata B Rao Signed-off-by: Balaji Rao Cc: Dhaval Giani Cc: Paul Menage Cc: Andrew Morton Cc: KAMEZAWA Hiroyuki Reviewed-by: Li Zefan Acked-by: Peter Zijlstra Acked-by: Balbir Singh Tested-by: Balbir Singh LKML-Reference: <20090331043222.GA4093@in.ibm.com> Signed-off-by: Ingo Molnar --- kernel/sched.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 81 insertions(+), 6 deletions(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index c8d7f17bd03..8d1bdbe8aaf 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1393,10 +1393,22 @@ iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest, struct rq_iterator *iterator); #endif +/* Time spent by the tasks of the cpu accounting group executing in ... */ +enum cpuacct_stat_index { + CPUACCT_STAT_USER, /* ... user mode */ + CPUACCT_STAT_SYSTEM, /* ... kernel mode */ + + CPUACCT_STAT_NSTATS, +}; + #ifdef CONFIG_CGROUP_CPUACCT static void cpuacct_charge(struct task_struct *tsk, u64 cputime); +static void cpuacct_update_stats(struct task_struct *tsk, + enum cpuacct_stat_index idx, cputime_t val); #else static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {} +static inline void cpuacct_update_stats(struct task_struct *tsk, + enum cpuacct_stat_index idx, cputime_t val) {} #endif static inline void inc_cpu_load(struct rq *rq, unsigned long load) @@ -4236,6 +4248,8 @@ void account_user_time(struct task_struct *p, cputime_t cputime, cpustat->nice = cputime64_add(cpustat->nice, tmp); else cpustat->user = cputime64_add(cpustat->user, tmp); + + cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime); /* Account for user time used */ acct_update_integrals(p); } @@ -4297,6 +4311,8 @@ void account_system_time(struct task_struct *p, int hardirq_offset, else cpustat->system = cputime64_add(cpustat->system, tmp); + cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime); + /* Account for system time used */ acct_update_integrals(p); } @@ -9539,6 +9555,7 @@ struct cpuacct { struct cgroup_subsys_state css; /* cpuusage holds pointer to a u64-type object on every cpu */ u64 *cpuusage; + struct percpu_counter cpustat[CPUACCT_STAT_NSTATS]; struct cpuacct *parent; }; @@ -9563,20 +9580,32 @@ static struct cgroup_subsys_state *cpuacct_create( struct cgroup_subsys *ss, struct cgroup *cgrp) { struct cpuacct *ca = kzalloc(sizeof(*ca), GFP_KERNEL); + int i; if (!ca) - return ERR_PTR(-ENOMEM); + goto out; ca->cpuusage = alloc_percpu(u64); - if (!ca->cpuusage) { - kfree(ca); - return ERR_PTR(-ENOMEM); - } + if (!ca->cpuusage) + goto out_free_ca; + + for (i = 0; i < CPUACCT_STAT_NSTATS; i++) + if (percpu_counter_init(&ca->cpustat[i], 0)) + goto out_free_counters; if (cgrp->parent) ca->parent = cgroup_ca(cgrp->parent); return &ca->css; + +out_free_counters: + while (--i >= 0) + percpu_counter_destroy(&ca->cpustat[i]); + free_percpu(ca->cpuusage); +out_free_ca: + kfree(ca); +out: + return ERR_PTR(-ENOMEM); } /* destroy an existing cpu accounting group */ @@ -9584,7 +9613,10 @@ static void cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp) { struct cpuacct *ca = cgroup_ca(cgrp); + int i; + for (i = 0; i < CPUACCT_STAT_NSTATS; i++) + percpu_counter_destroy(&ca->cpustat[i]); free_percpu(ca->cpuusage); kfree(ca); } @@ -9671,6 +9703,25 @@ static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft, return 0; } +static const char *cpuacct_stat_desc[] = { + [CPUACCT_STAT_USER] = "user", + [CPUACCT_STAT_SYSTEM] = "system", +}; + +static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft, + struct cgroup_map_cb *cb) +{ + struct cpuacct *ca = cgroup_ca(cgrp); + int i; + + for (i = 0; i < CPUACCT_STAT_NSTATS; i++) { + s64 val = percpu_counter_read(&ca->cpustat[i]); + val = cputime64_to_clock_t(val); + cb->fill(cb, cpuacct_stat_desc[i], val); + } + return 0; +} + static struct cftype files[] = { { .name = "usage", @@ -9681,7 +9732,10 @@ static struct cftype files[] = { .name = "usage_percpu", .read_seq_string = cpuacct_percpu_seq_read, }, - + { + .name = "stat", + .read_map = cpuacct_stats_show, + }, }; static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp) @@ -9716,6 +9770,27 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime) rcu_read_unlock(); } +/* + * Charge the system/user time to the task's accounting group. + */ +static void cpuacct_update_stats(struct task_struct *tsk, + enum cpuacct_stat_index idx, cputime_t val) +{ + struct cpuacct *ca; + + if (unlikely(!cpuacct_subsys.active)) + return; + + rcu_read_lock(); + ca = task_ca(tsk); + + do { + percpu_counter_add(&ca->cpustat[idx], val); + ca = ca->parent; + } while (ca); + rcu_read_unlock(); +} + struct cgroup_subsys cpuacct_subsys = { .name = "cpuacct", .create = cpuacct_create, -- cgit v1.2.3 From 46e0bb9c12f4bab539736f1714cbf16600f681ec Mon Sep 17 00:00:00 2001 From: Gautham R Shenoy Date: Mon, 30 Mar 2009 10:25:20 +0530 Subject: sched: Print sched_group::__cpu_power in sched_domain_debug Impact: extend debug info /proc/sched_debug If the user changes the value of the sched_mc/smt_power_savings sysfs tunable, it'll trigger a rebuilding of the whole sched_domain tree, with the SD_POWERSAVINGS_BALANCE flag set at certain levels. As a result, there would be a change in the __cpu_power of sched_groups in the sched_domain hierarchy. Print the __cpu_power values for each sched_group in sched_domain_debug to help verify this change and correlate it with the change in the load-balancing behavior. Signed-off-by: Gautham R Shenoy Cc: Peter Zijlstra LKML-Reference: <20090330045520.2869.24777.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar --- kernel/sched.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index 8d1bdbe8aaf..6234d10c6a7 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -6963,7 +6963,8 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, cpumask_or(groupmask, groupmask, sched_group_cpus(group)); cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group)); - printk(KERN_CONT " %s", str); + printk(KERN_CONT " %s (__cpu_power = %d)", str, + group->__cpu_power); group = group->next; } while (group != sd->groups); -- cgit v1.2.3 From 4ede816ac36e027db5fe0051ad9c73f76db63772 Mon Sep 17 00:00:00 2001 From: Davide Libenzi Date: Tue, 31 Mar 2009 15:24:20 -0700 Subject: epoll keyed wakeups: add __wake_up_locked_key() and __wake_up_sync_key() This patchset introduces wakeup hints for some of the most popular (from epoll POV) devices, so that epoll code can avoid spurious wakeups on its waiters. The problem with epoll is that the callback-based wakeups do not, ATM, carry any information about the events the wakeup is related to. So the only choice epoll has (not being able to call f_op->poll() from inside the callback), is to add the file* to a ready-list and resolve the real events later on, at epoll_wait() (or its own f_op->poll()) time. This can cause spurious wakeups, since the wake_up() itself might be for an event the caller is not interested into. The rate of these spurious wakeup can be pretty high in case of many network sockets being monitored. By allowing devices to report the events the wakeups refer to (at least the two major classes - POLLIN/POLLOUT), we are able to spare useless wakeups by proper handling inside the epoll's poll callback. Epoll will have in any case to call f_op->poll() on the file* later on, since the change to be done in order to have the full event set sent via wakeup, is too invasive for the way our f_op->poll() system works (the full event set is calculated inside the poll function - there are too many of them to even start thinking the change - also poll/select would need change too). Epoll is changed in a way that both devices which send event hints, and the ones that don't, are correctly handled. The former will gain some efficiency though. As a general rule for devices, would be to add an event mask by using key-aware wakeup macros, when making up poll wait queues. I tested it (together with the epoll's poll fix patch Andrew has in -mm) and wakeups for the supported devices are correctly filtered. Test program available here: http://www.xmailserver.org/epoll_test.c This patch: Nothing revolutionary here. Just using the available "key" that our wakeup core already support. The __wake_up_locked_key() was no brainer, since both __wake_up_locked() and __wake_up_locked_key() are thin wrappers around __wake_up_common(). The __wake_up_sync() function had a body, so the choice was between borrowing the body for __wake_up_sync_key() and calling it from __wake_up_sync(), or make an inline and calling it from both. I chose the former since in most archs it all resolves to "mov $0, REG; jmp ADDR". Signed-off-by: Davide Libenzi Cc: Alan Cox Cc: Ingo Molnar Cc: David Miller Cc: William Lee Irwin III Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index 196d48babbe..73513f4e19d 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -5196,11 +5196,17 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode) __wake_up_common(q, mode, 1, 0, NULL); } +void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key) +{ + __wake_up_common(q, mode, 1, 0, key); +} + /** - * __wake_up_sync - wake up threads blocked on a waitqueue. + * __wake_up_sync_key - wake up threads blocked on a waitqueue. * @q: the waitqueue * @mode: which threads * @nr_exclusive: how many wake-one or wake-many threads to wake up + * @key: opaque value to be passed to wakeup targets * * The sync wakeup differs that the waker knows that it will schedule * away soon, so while the target thread will be woken up, it will not @@ -5209,8 +5215,8 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode) * * On UP it can prevent extra preemption. */ -void -__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive) +void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, + int nr_exclusive, void *key) { unsigned long flags; int sync = 1; @@ -5222,9 +5228,18 @@ __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive) sync = 0; spin_lock_irqsave(&q->lock, flags); - __wake_up_common(q, mode, nr_exclusive, sync, NULL); + __wake_up_common(q, mode, nr_exclusive, sync, key); spin_unlock_irqrestore(&q->lock, flags); } +EXPORT_SYMBOL_GPL(__wake_up_sync_key); + +/* + * __wake_up_sync - see __wake_up_sync_key() + */ +void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive) +{ + __wake_up_sync_key(q, mode, nr_exclusive, NULL); +} EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */ /** -- cgit v1.2.3 From 132380a06b24704fd6c9be55c44d4ef3972cead2 Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Thu, 2 Apr 2009 14:18:25 +0800 Subject: tracing, sched: mark get_parent_ip() notrace Impact: remove overly redundant tracing entries When tracer is "function" or "function_graph", way too much "get_parent_ip" entries are recorded in ring_buffer. Signed-off-by: Lai Jiangshan Acked-by: Frederic Weisbecker Acked-by: Steven Rostedt LKML-Reference: <49D458B1.5000703@cn.fujitsu.com> Signed-off-by: Ingo Molnar --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index 5724508c3b6..e90e70ed36a 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4846,7 +4846,7 @@ void scheduler_tick(void) #endif } -unsigned long get_parent_ip(unsigned long addr) +notrace unsigned long get_parent_ip(unsigned long addr) { if (in_lock_functions(addr)) { addr = CALLER_ADDR2; -- cgit v1.2.3 From 381512cf3d27f63f7a45b1bbe7d2d609c2ea3b74 Mon Sep 17 00:00:00 2001 From: Gautham R Shenoy Date: Tue, 14 Apr 2009 09:09:36 +0530 Subject: sched: Avoid printing sched_group::__cpu_power for default case Commit 46e0bb9c12f4 ("sched: Print sched_group::__cpu_power in sched_domain_debug") produces a messy dmesg output while attempting to print the sched_group::__cpu_power for each group in the sched_domain hierarchy. Fix this by avoid printing the __cpu_power for default cases. (i.e, __cpu_power == SCHED_LOAD_SCALE). [ Impact: reduce syslog clutter ] Reported-by: Tony Luck Signed-off-by: Gautham R Shenoy Fixed-by: Tony Luck Cc: a.p.zijlstra@chello.nl LKML-Reference: <20090414033936.GA534@in.ibm.com> Signed-off-by: Ingo Molnar --- kernel/sched.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index e90e70ed36a..b902e587a3a 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -7367,8 +7367,12 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, cpumask_or(groupmask, groupmask, sched_group_cpus(group)); cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group)); - printk(KERN_CONT " %s (__cpu_power = %d)", str, - group->__cpu_power); + + printk(KERN_CONT " %s", str); + if (group->__cpu_power != SCHED_LOAD_SCALE) { + printk(KERN_CONT " (__cpu_power = %d)", + group->__cpu_power); + } group = group->next; } while (group != sd->groups); -- cgit v1.2.3 From f5f293a4e3d0a0c52cec31de6762c95050156516 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 29 Apr 2009 14:44:49 +0200 Subject: sched: account system time properly Andrew Gallatin reported that IRQ and SOFTIRQ times were sometime not reported correctly on recent kernels, and even bisected to commit 457533a7d3402d1d91fbc125c8bd1bd16dcd3cd4 ([PATCH] fix scaled & unscaled cputime accounting) as the first bad commit. Further analysis pointed that commit 79741dd35713ff4f6fd0eafd59fa94e8a4ba922d ([PATCH] idle cputime accounting) was the real cause of the problem. account_process_tick() was not taking into account timer IRQ interrupting the idle task servicing a hard or soft irq. On mostly idle cpu, irqs were thus not accounted and top or mpstat could tell user/admin that cpu was 100 % idle, 0.00 % irq, 0.00 % softirq, while it was not. [ Impact: fix occasionally incorrect CPU statistics in top/mpstat ] Reported-by: Andrew Gallatin Re-reported-by: Andrew Morton Signed-off-by: Eric Dumazet Acked-by: Martin Schwidefsky Cc: rick.jones2@hp.com Cc: brice@myri.com Cc: Paul Mackerras Cc: Benjamin Herrenschmidt LKML-Reference: <49F84BC1.7080602@cosmosbay.com> Signed-off-by: Ingo Molnar --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel/sched.c') diff --git a/kernel/sched.c b/kernel/sched.c index b902e587a3a..26efa475bdc 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4732,7 +4732,7 @@ void account_process_tick(struct task_struct *p, int user_tick) if (user_tick) account_user_time(p, one_jiffy, one_jiffy_scaled); - else if (p != rq->idle) + else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET)) account_system_time(p, HARDIRQ_OFFSET, one_jiffy, one_jiffy_scaled); else -- cgit v1.2.3