From 6b2d7700266b9402e12824e11e0099ae6a4a6a79 Mon Sep 17 00:00:00 2001 From: Srivatsa Vaddagiri Date: Fri, 25 Jan 2008 21:08:00 +0100 Subject: sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups The current load balancing scheme isn't good enough for precise group fairness. For example: on a 8-cpu system, I created 3 groups as under: a = 8 tasks (cpu.shares = 1024) b = 4 tasks (cpu.shares = 1024) c = 3 tasks (cpu.shares = 1024) a, b and c are task groups that have equal weight. We would expect each of the groups to receive 33.33% of cpu bandwidth under a fair scheduler. This is what I get with the latest scheduler git tree: Signed-off-by: Ingo Molnar -------------------------------------------------------------------------------- Col1 | Col2 | Col3 | Col4 ------|---------|-------|------------------------------------------------------- a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5% b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3% c | 86.326 | 18.0% | 47.5% 47.9% 48.5% -------------------------------------------------------------------------------- Explanation of o/p: Col1 -> Group name Col2 -> Cumulative execution time (in seconds) received by all tasks of that group in a 60sec window across 8 cpus Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in percentage. Col3 data is derived as: Col3 = 100 * Col2 / (NR_CPUS * 60) Col4 -> CPU bandwidth received by each individual task of the group. Col4 = 100 * cpu_time_recd_by_task / 60 [I can share the test case that produces a similar o/p if reqd] The deviation from desired group fairness is as below: a = +24.47% b = -9.13% c = -15.33% which is quite high. After the patch below is applied, here are the results: -------------------------------------------------------------------------------- Col1 | Col2 | Col3 | Col4 ------|---------|-------|------------------------------------------------------- a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3% b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5% c | 160.653 | 33.5% | 85.8% 90.6% 91.4% -------------------------------------------------------------------------------- Deviation from desired group fairness is as below: a = +0.67% b = -0.83% c = +0.17% which is far better IMO. Most of other runs have yielded a deviation within +-2% at the most, which is good. Why do we see bad (group) fairness with current scheuler? ========================================================= Currently cpu's weight is just the summation of individual task weights. This can yield incorrect results. For ex: consider three groups as below on a 2-cpu system: CPU0 CPU1 --------------------------- A (10) B(5) C(5) --------------------------- Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all of which are on CPU1. Each task has the same weight (NICE_0_LOAD = 1024). The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and the load balancer will think both CPUs are perfectly balanced and won't move around any tasks. This, however, would yield this bandwidth: A = 50% B = 25% C = 25% which is not the desired result. What's changing in the patch? ============================= - How cpu weights are calculated when CONFIF_FAIR_GROUP_SCHED is defined (see below) - API Change - Two tunables introduced in sysfs (under SCHED_DEBUG) to control the frequency at which the load balance monitor thread runs. The basic change made in this patch is how cpu weight (rq->load.weight) is calculated. Its now calculated as the summation of group weights on a cpu, rather than summation of task weights. Weight exerted by a group on a cpu is dependent on the shares allocated to it and also the number of tasks the group has on that cpu compared to the total number of (runnable) tasks the group has in the system. Let, W(K,i) = Weight of group K on cpu i T(K,i) = Task load present in group K's cfs_rq on cpu i T(K) = Total task load of group K across various cpus S(K) = Shares allocated to group K NRCPUS = Number of online cpus in the scheduler domain to which group K is assigned. Then, W(K,i) = S(K) * NRCPUS * T(K,i) / T(K) A load balance monitor thread is created at bootup, which periodically runs and adjusts group's weight on each cpu. To avoid its overhead, two min/max tunables are introduced (under SCHED_DEBUG) to control the rate at which it runs. Fixes from: Peter Zijlstra - don't start the load_balance_monitor when there is only a single cpu. - rename the kthread because its currently longer than TASK_COMM_LEN Signed-off-by: Srivatsa Vaddagiri Signed-off-by: Ingo Molnar --- include/linux/sched.h | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index d6eacda765c..288245f83bd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1453,6 +1453,10 @@ extern unsigned int sysctl_sched_child_runs_first; extern unsigned int sysctl_sched_features; extern unsigned int sysctl_sched_migration_cost; extern unsigned int sysctl_sched_nr_migrate; +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP) +extern unsigned int sysctl_sched_min_bal_int_shares; +extern unsigned int sysctl_sched_max_bal_int_shares; +#endif int sched_nr_latency_handler(struct ctl_table *table, int write, struct file *file, void __user *buffer, size_t *length, -- cgit v1.2.3 From d221938c049f4845da13c8593132595a6b9222a8 Mon Sep 17 00:00:00 2001 From: Gautham R Shenoy Date: Fri, 25 Jan 2008 21:08:01 +0100 Subject: cpu-hotplug: refcount based cpu hotplug This patch implements a Refcount + Waitqueue based model for cpu-hotplug. Now, a thread which wants to prevent cpu-hotplug, will bump up a global refcount and the thread which wants to perform a cpu-hotplug operation will block till the global refcount goes to zero. The readers, if any, during an ongoing cpu-hotplug operation are blocked until the cpu-hotplug operation is over. Signed-off-by: Gautham R Shenoy Signed-off-by: Paul Jackson [For !CONFIG_HOTPLUG_CPU ] Signed-off-by: Ingo Molnar --- include/linux/cpu.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/linux') diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 92f2029a34f..a40247e4d46 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -83,6 +83,9 @@ static inline void unregister_cpu_notifier(struct notifier_block *nb) #endif /* CONFIG_SMP */ extern struct sysdev_class cpu_sysdev_class; +extern void cpu_hotplug_init(void); +extern void cpu_maps_update_begin(void); +extern void cpu_maps_update_done(void); #ifdef CONFIG_HOTPLUG_CPU /* Stop CPUs going up and down. */ -- cgit v1.2.3 From 86ef5c9a8edd78e6bf92879f32329d89b2d55b5a Mon Sep 17 00:00:00 2001 From: Gautham R Shenoy Date: Fri, 25 Jan 2008 21:08:02 +0100 Subject: cpu-hotplug: replace lock_cpu_hotplug() with get_online_cpus() Replace all lock_cpu_hotplug/unlock_cpu_hotplug from the kernel and use get_online_cpus and put_online_cpus instead as it highlights the refcount semantics in these operations. The new API guarantees protection against the cpu-hotplug operation, but it doesn't guarantee serialized access to any of the local data structures. Hence the changes needs to be reviewed. In case of pseries_add_processor/pseries_remove_processor, use cpu_maps_update_begin()/cpu_maps_update_done() as we're modifying the cpu_present_map there. Signed-off-by: Gautham R Shenoy Signed-off-by: Ingo Molnar --- include/linux/cpu.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/cpu.h b/include/linux/cpu.h index a40247e4d46..3a3ff1c5cbe 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -100,8 +100,8 @@ static inline void cpuhotplug_mutex_unlock(struct mutex *cpu_hp_mutex) mutex_unlock(cpu_hp_mutex); } -extern void lock_cpu_hotplug(void); -extern void unlock_cpu_hotplug(void); +extern void get_online_cpus(void); +extern void put_online_cpus(void); #define hotcpu_notifier(fn, pri) { \ static struct notifier_block fn##_nb = \ { .notifier_call = fn, .priority = pri }; \ @@ -118,8 +118,8 @@ static inline void cpuhotplug_mutex_lock(struct mutex *cpu_hp_mutex) static inline void cpuhotplug_mutex_unlock(struct mutex *cpu_hp_mutex) { } -#define lock_cpu_hotplug() do { } while (0) -#define unlock_cpu_hotplug() do { } while (0) +#define get_online_cpus() do { } while (0) +#define put_online_cpus() do { } while (0) #define hotcpu_notifier(fn, pri) do { (void)(fn); } while (0) /* These aren't inline functions due to a GCC bug. */ #define register_hotcpu_notifier(nb) ({ (void)(nb); 0; }) -- cgit v1.2.3 From 95402b3829010fe1e208f44e4a158ccade88969a Mon Sep 17 00:00:00 2001 From: Gautham R Shenoy Date: Fri, 25 Jan 2008 21:08:02 +0100 Subject: cpu-hotplug: replace per-subsystem mutexes with get_online_cpus() This patch converts the known per-subsystem mutexes to get_online_cpus put_online_cpus. It also eliminates the CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE hotplug notification events. Signed-off-by: Gautham R Shenoy Signed-off-by: Ingo Molnar --- include/linux/notifier.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/notifier.h b/include/linux/notifier.h index 0c40cc0b4a3..5dfbc684ce7 100644 --- a/include/linux/notifier.h +++ b/include/linux/notifier.h @@ -207,9 +207,7 @@ static inline int notifier_to_errno(int ret) #define CPU_DOWN_PREPARE 0x0005 /* CPU (unsigned)v going down */ #define CPU_DOWN_FAILED 0x0006 /* CPU (unsigned)v NOT going down */ #define CPU_DEAD 0x0007 /* CPU (unsigned)v dead */ -#define CPU_LOCK_ACQUIRE 0x0008 /* Acquire all hotcpu locks */ -#define CPU_LOCK_RELEASE 0x0009 /* Release all hotcpu locks */ -#define CPU_DYING 0x000A /* CPU (unsigned)v not running any task, +#define CPU_DYING 0x0008 /* CPU (unsigned)v not running any task, * not handling interrupts, soon dead */ /* Used for CPU hotplug events occuring while tasks are frozen due to a suspend -- cgit v1.2.3 From d0d23b5432fe61229dd3641c5e94d4130bc4e61b Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Fri, 25 Jan 2008 21:08:02 +0100 Subject: cpu-hotplug: fix build on !CONFIG_SMP fix build on !CONFIG_SMP. Signed-off-by: Ingo Molnar --- include/linux/cpu.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 3a3ff1c5cbe..0be8d65bc3c 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -71,19 +71,25 @@ static inline void unregister_cpu_notifier(struct notifier_block *nb) int cpu_up(unsigned int cpu); +extern void cpu_hotplug_init(void); + #else static inline int register_cpu_notifier(struct notifier_block *nb) { return 0; } + static inline void unregister_cpu_notifier(struct notifier_block *nb) { } +static inline void cpu_hotplug_init(void) +{ +} + #endif /* CONFIG_SMP */ extern struct sysdev_class cpu_sysdev_class; -extern void cpu_hotplug_init(void); extern void cpu_maps_update_begin(void); extern void cpu_maps_update_done(void); -- cgit v1.2.3 From 82a1fcb90287052aabfa235e7ffc693ea003fe69 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Fri, 25 Jan 2008 21:08:02 +0100 Subject: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks this patch extends the soft-lockup detector to automatically detect hung TASK_UNINTERRUPTIBLE tasks. Such hung tasks are printed the following way: ------------------> INFO: task prctl:3042 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message prctl D fd5e3793 0 3042 2997 f6050f38 00000046 00000001 fd5e3793 00000009 c06d8264 c06dae80 00000286 f6050f40 f6050f00 f7d34d90 f7d34fc8 c1e1be80 00000001 f6050000 00000000 f7e92d00 00000286 f6050f18 c0489d1a f6050f40 00006605 00000000 c0133a5b Call Trace: [] schedule_timeout+0x6d/0x8b [] schedule_timeout_uninterruptible+0x15/0x17 [] msleep+0x10/0x16 [] sys_prctl+0x30/0x1e2 [] sysenter_past_esp+0x5f/0xa5 ======================= 2 locks held by prctl/3042: #0: (&sb->s_type->i_mutex_key#5){--..}, at: [] do_fsync+0x38/0x7a #1: (jbd_handle){--..}, at: [] journal_start+0xc7/0xe9 <------------------ the current default timeout is 120 seconds. Such messages are printed up to 10 times per bootup. If the system has crashed already then the messages are not printed. if lockdep is enabled then all held locks are printed as well. this feature is a natural extension to the softlockup-detector (kernel locked up without scheduling) and to the NMI watchdog (kernel locked up with IRQs disabled). [ Gautham R Shenoy : CPU hotplug fixes. ] [ Andrew Morton : build warning fix. ] Signed-off-by: Ingo Molnar Signed-off-by: Arjan van de Ven --- include/linux/debug_locks.h | 5 +++++ include/linux/sched.h | 10 ++++++++++ 2 files changed, 15 insertions(+) (limited to 'include/linux') diff --git a/include/linux/debug_locks.h b/include/linux/debug_locks.h index 1678a5de701..f4a5871767f 100644 --- a/include/linux/debug_locks.h +++ b/include/linux/debug_locks.h @@ -47,6 +47,7 @@ struct task_struct; #ifdef CONFIG_LOCKDEP extern void debug_show_all_locks(void); +extern void __debug_show_held_locks(struct task_struct *task); extern void debug_show_held_locks(struct task_struct *task); extern void debug_check_no_locks_freed(const void *from, unsigned long len); extern void debug_check_no_locks_held(struct task_struct *task); @@ -55,6 +56,10 @@ static inline void debug_show_all_locks(void) { } +static inline void __debug_show_held_locks(struct task_struct *task) +{ +} + static inline void debug_show_held_locks(struct task_struct *task) { } diff --git a/include/linux/sched.h b/include/linux/sched.h index 288245f83bd..0846f1f9e19 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -258,12 +258,17 @@ extern void account_process_tick(struct task_struct *task, int user); extern void update_process_times(int user); extern void scheduler_tick(void); +extern void sched_show_task(struct task_struct *p); + #ifdef CONFIG_DETECT_SOFTLOCKUP extern void softlockup_tick(void); extern void spawn_softlockup_task(void); extern void touch_softlockup_watchdog(void); extern void touch_all_softlockup_watchdogs(void); extern int softlockup_thresh; +extern unsigned long sysctl_hung_task_check_count; +extern unsigned long sysctl_hung_task_timeout_secs; +extern long sysctl_hung_task_warnings; #else static inline void softlockup_tick(void) { @@ -1041,6 +1046,11 @@ struct task_struct { /* ipc stuff */ struct sysv_sem sysvsem; #endif +#ifdef CONFIG_DETECT_SOFTLOCKUP +/* hung task detection */ + unsigned long last_switch_timestamp; + unsigned long last_switch_count; +#endif /* CPU-specific state of this task */ struct thread_struct thread; /* filesystem information */ -- cgit v1.2.3 From 73fe6aae84400e2b475e2a1dc4e8592cd3ed6e69 Mon Sep 17 00:00:00 2001 From: Gregory Haskins Date: Fri, 25 Jan 2008 21:08:07 +0100 Subject: sched: add RT-balance cpu-weight Some RT tasks (particularly kthreads) are bound to one specific CPU. It is fairly common for two or more bound tasks to get queued up at the same time. Consider, for instance, softirq_timer and softirq_sched. A timer goes off in an ISR which schedules softirq_thread to run at RT50. Then the timer handler determines that it's time to smp-rebalance the system so it schedules softirq_sched to run. So we are in a situation where we have two RT50 tasks queued, and the system will go into rt-overload condition to request other CPUs for help. This causes two problems in the current code: 1) If a high-priority bound task and a low-priority unbounded task queue up behind the running task, we will fail to ever relocate the unbounded task because we terminate the search on the first unmovable task. 2) We spend precious futile cycles in the fast-path trying to pull overloaded tasks over. It is therefore optimial to strive to avoid the overhead all together if we can cheaply detect the condition before overload even occurs. This patch tries to achieve this optimization by utilizing the hamming weight of the task->cpus_allowed mask. A weight of 1 indicates that the task cannot be migrated. We will then utilize this information to skip non-migratable tasks and to eliminate uncessary rebalance attempts. We introduce a per-rq variable to count the number of migratable tasks that are currently running. We only go into overload if we have more than one rt task, AND at least one of them is migratable. In addition, we introduce a per-task variable to cache the cpus_allowed weight, since the hamming calculation is probably relatively expensive. We only update the cached value when the mask is updated which should be relatively infrequent, especially compared to scheduling frequency in the fast path. Signed-off-by: Gregory Haskins Signed-off-by: Steven Rostedt Signed-off-by: Ingo Molnar --- include/linux/init_task.h | 1 + include/linux/sched.h | 2 ++ 2 files changed, 3 insertions(+) (limited to 'include/linux') diff --git a/include/linux/init_task.h b/include/linux/init_task.h index cae35b6b9ae..572c65bcc80 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -130,6 +130,7 @@ extern struct group_info init_groups; .normal_prio = MAX_PRIO-20, \ .policy = SCHED_NORMAL, \ .cpus_allowed = CPU_MASK_ALL, \ + .nr_cpus_allowed = NR_CPUS, \ .mm = NULL, \ .active_mm = &init_mm, \ .run_list = LIST_HEAD_INIT(tsk.run_list), \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 0846f1f9e19..b07a2cf7640 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -847,6 +847,7 @@ struct sched_class { void (*set_curr_task) (struct rq *rq); void (*task_tick) (struct rq *rq, struct task_struct *p); void (*task_new) (struct rq *rq, struct task_struct *p); + void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask); }; struct load_weight { @@ -956,6 +957,7 @@ struct task_struct { unsigned int policy; cpumask_t cpus_allowed; + int nr_cpus_allowed; unsigned int time_slice; #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) -- cgit v1.2.3 From e7693a362ec84bb5b6fd441d8a8b4b9d568a7a0c Mon Sep 17 00:00:00 2001 From: Gregory Haskins Date: Fri, 25 Jan 2008 21:08:09 +0100 Subject: sched: de-SCHED_OTHER-ize the RT path The current wake-up code path tries to determine if it can optimize the wake-up to "this_cpu" by computing load calculations. The problem is that these calculations are only relevant to SCHED_OTHER tasks where load is king. For RT tasks, priority is king. So the load calculation is completely wasted bandwidth. Therefore, we create a new sched_class interface to help with pre-wakeup routing decisions and move the load calculation as a function of CFS task's class. Signed-off-by: Gregory Haskins Signed-off-by: Steven Rostedt Signed-off-by: Ingo Molnar --- include/linux/sched.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index b07a2cf7640..6fbbf38555a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -827,6 +827,7 @@ struct sched_class { void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup); void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep); void (*yield_task) (struct rq *rq); + int (*select_task_rq)(struct task_struct *p, int sync); void (*check_preempt_curr) (struct rq *rq, struct task_struct *p); -- cgit v1.2.3 From 57d885fea0da0e9541d7730a9e1dcf734981a173 Mon Sep 17 00:00:00 2001 From: Gregory Haskins Date: Fri, 25 Jan 2008 21:08:18 +0100 Subject: sched: add sched-domain roots We add the notion of a root-domain which will be used later to rescope global variables to per-domain variables. Each exclusive cpuset essentially defines an island domain by fully partitioning the member cpus from any other cpuset. However, we currently still maintain some policy/state as global variables which transcend all cpusets. Consider, for instance, rt-overload state. Whenever a new exclusive cpuset is created, we also create a new root-domain object and move each cpu member to the root-domain's span. By default the system creates a single root-domain with all cpus as members (mimicking the global state we have today). We add some plumbing for storing class specific data in our root-domain. Whenever a RQ is switching root-domains (because of repartitioning) we give each sched_class the opportunity to remove any state from its old domain and add state to the new one. This logic doesn't have any clients yet but it will later in the series. Signed-off-by: Gregory Haskins CC: Christoph Lameter CC: Paul Jackson CC: Simon Derr Signed-off-by: Ingo Molnar --- include/linux/sched.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index 6fbbf38555a..2e69f19369e 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -849,6 +849,9 @@ struct sched_class { void (*task_tick) (struct rq *rq, struct task_struct *p); void (*task_new) (struct rq *rq, struct task_struct *p); void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask); + + void (*join_domain)(struct rq *rq); + void (*leave_domain)(struct rq *rq); }; struct load_weight { -- cgit v1.2.3 From 52d853431e8d9dc17ba94792123a3fe2bc039831 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Fri, 25 Jan 2008 21:08:20 +0100 Subject: sched: reactivate fork balancing reactivate fork balancing. Signed-off-by: Ingo Molnar --- include/linux/topology.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/linux') diff --git a/include/linux/topology.h b/include/linux/topology.h index 47729f18bfd..b6c073d8b70 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -103,6 +103,7 @@ .forkexec_idx = 0, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_NEWIDLE \ + | SD_BALANCE_FORK \ | SD_BALANCE_EXEC \ | SD_WAKE_AFFINE \ | SD_WAKE_IDLE \ @@ -134,6 +135,7 @@ .forkexec_idx = 1, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_NEWIDLE \ + | SD_BALANCE_FORK \ | SD_BALANCE_EXEC \ | SD_WAKE_AFFINE \ | SD_WAKE_IDLE \ @@ -165,6 +167,7 @@ .forkexec_idx = 1, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_NEWIDLE \ + | SD_BALANCE_FORK \ | SD_BALANCE_EXEC \ | SD_WAKE_AFFINE \ | BALANCE_FOR_PKG_POWER,\ -- cgit v1.2.3 From 32525d022ad52a5c14e80e130260431e16e294b6 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Fri, 25 Jan 2008 21:08:20 +0100 Subject: sched: whitespace cleanups in topology.h whitespace cleanups in topology.h. Signed-off-by: Ingo Molnar --- include/linux/topology.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/topology.h b/include/linux/topology.h index b6c073d8b70..2352f46160d 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -5,7 +5,7 @@ * * Copyright (C) 2002, IBM Corp. * - * All rights reserved. + * All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by -- cgit v1.2.3 From 9a897c5a6701bcb6f099f7ca20194999102729fd Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 25 Jan 2008 21:08:22 +0100 Subject: sched: RT-balance, replace hooks with pre/post schedule and wakeup methods To make the main sched.c code more agnostic to the schedule classes. Instead of having specific hooks in the schedule code for the RT class balancing. They are replaced with a pre_schedule, post_schedule and task_wake_up methods. These methods may be used by any of the classes but currently, only the sched_rt class implements them. Signed-off-by: Steven Rostedt Signed-off-by: Ingo Molnar --- include/linux/sched.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index 2e69f19369e..c67d2c2f011 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -843,6 +843,9 @@ struct sched_class { int (*move_one_task) (struct rq *this_rq, int this_cpu, struct rq *busiest, struct sched_domain *sd, enum cpu_idle_type idle); + void (*pre_schedule) (struct rq *this_rq, struct task_struct *task); + void (*post_schedule) (struct rq *this_rq); + void (*task_wake_up) (struct rq *this_rq, struct task_struct *task); #endif void (*set_curr_task) (struct rq *rq); -- cgit v1.2.3 From cb46984504048db946cd551c261df4e70d59a8ea Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 25 Jan 2008 21:08:22 +0100 Subject: sched: RT-balance, add new methods to sched_class Dmitry Adamushko found that the current implementation of the RT balancing code left out changes to the sched_setscheduler and rt_mutex_setprio. This patch addresses this issue by adding methods to the schedule classes to handle being switched out of (switched_from) and being switched into (switched_to) a sched_class. Also a method for changing of priorities is also added (prio_changed). This patch also removes some duplicate logic between rt_mutex_setprio and sched_setscheduler. Signed-off-by: Steven Rostedt Signed-off-by: Ingo Molnar --- include/linux/sched.h | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index c67d2c2f011..f2044e70700 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -855,6 +855,13 @@ struct sched_class { void (*join_domain)(struct rq *rq); void (*leave_domain)(struct rq *rq); + + void (*switched_from) (struct rq *this_rq, struct task_struct *task, + int running); + void (*switched_to) (struct rq *this_rq, struct task_struct *task, + int running); + void (*prio_changed) (struct rq *this_rq, struct task_struct *task, + int oldprio, int running); }; struct load_weight { -- cgit v1.2.3 From c2d727aa2ff17a1c8e5ed1e5e231bb8579b27e82 Mon Sep 17 00:00:00 2001 From: Dipankar Sarma Date: Fri, 25 Jan 2008 21:08:23 +0100 Subject: Preempt-RCU: Use softirq instead of tasklets for This patch makes RCU use softirq instead of tasklets. It also adds a memory barrier after raising the softirq inorder to ensure that the cpu sees the most recently updated value of rcu->cur while processing callbacks. The discussion of the related theoretical race pointed out by James Huang can be found here --> http://lkml.org/lkml/2007/11/20/603 Signed-off-by: Gautham R Shenoy Signed-off-by: Steven Rostedt Signed-off-by: Dipankar Sarma Reviewed-by: Steven Rostedt Signed-off-by: Ingo Molnar --- include/linux/interrupt.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux') diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 2306920fa38..c3db4a00f1f 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -256,6 +256,7 @@ enum #ifdef CONFIG_HIGH_RES_TIMERS HRTIMER_SOFTIRQ, #endif + RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */ }; /* softirq mask and active fields moved to irq_cpustat_t in -- cgit v1.2.3 From 01c1c660f4b8086cad7a62345fd04290f3d82c8f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 25 Jan 2008 21:08:24 +0100 Subject: Preempt-RCU: reorganize RCU code into rcuclassic.c and rcupdate.c This patch re-organizes the RCU code to enable multiple implementations of RCU. Users of RCU continues to include rcupdate.h and the RCU interfaces remain the same. This is in preparation for subsequently merging the preemptible RCU implementation. Signed-off-by: Gautham R Shenoy Signed-off-by: Dipankar Sarma Signed-off-by: Paul E. McKenney Reviewed-by: Steven Rostedt Signed-off-by: Ingo Molnar --- include/linux/rcuclassic.h | 161 +++++++++++++++++++++++++++++++++++++++++++ include/linux/rcupdate.h | 168 +++++++++++++-------------------------------- 2 files changed, 210 insertions(+), 119 deletions(-) create mode 100644 include/linux/rcuclassic.h (limited to 'include/linux') diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h new file mode 100644 index 00000000000..2b8b045a51d --- /dev/null +++ b/include/linux/rcuclassic.h @@ -0,0 +1,161 @@ +/* + * Read-Copy Update mechanism for mutual exclusion (classic version) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright IBM Corporation, 2001 + * + * Author: Dipankar Sarma + * + * Based on the original work by Paul McKenney + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen. + * Papers: + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001) + * + * For detailed explanation of Read-Copy Update mechanism see - + * Documentation/RCU + * + */ + +#ifndef __LINUX_RCUCLASSIC_H +#define __LINUX_RCUCLASSIC_H + +#ifdef __KERNEL__ + +#include +#include +#include +#include +#include +#include + + +/* Global control variables for rcupdate callback mechanism. */ +struct rcu_ctrlblk { + long cur; /* Current batch number. */ + long completed; /* Number of the last completed batch */ + int next_pending; /* Is the next batch already waiting? */ + + int signaled; + + spinlock_t lock ____cacheline_internodealigned_in_smp; + cpumask_t cpumask; /* CPUs that need to switch in order */ + /* for current batch to proceed. */ +} ____cacheline_internodealigned_in_smp; + +/* Is batch a before batch b ? */ +static inline int rcu_batch_before(long a, long b) +{ + return (a - b) < 0; +} + +/* Is batch a after batch b ? */ +static inline int rcu_batch_after(long a, long b) +{ + return (a - b) > 0; +} + +/* + * Per-CPU data for Read-Copy UPdate. + * nxtlist - new callbacks are added here + * curlist - current batch for which quiescent cycle started if any + */ +struct rcu_data { + /* 1) quiescent state handling : */ + long quiescbatch; /* Batch # for grace period */ + int passed_quiesc; /* User-mode/idle loop etc. */ + int qs_pending; /* core waits for quiesc state */ + + /* 2) batch handling */ + long batch; /* Batch # for current RCU batch */ + struct rcu_head *nxtlist; + struct rcu_head **nxttail; + long qlen; /* # of queued callbacks */ + struct rcu_head *curlist; + struct rcu_head **curtail; + struct rcu_head *donelist; + struct rcu_head **donetail; + long blimit; /* Upper limit on a processed batch */ + int cpu; + struct rcu_head barrier; +}; + +DECLARE_PER_CPU(struct rcu_data, rcu_data); +DECLARE_PER_CPU(struct rcu_data, rcu_bh_data); + +/* + * Increment the quiescent state counter. + * The counter is a bit degenerated: We do not need to know + * how many quiescent states passed, just if there was at least + * one since the start of the grace period. Thus just a flag. + */ +static inline void rcu_qsctr_inc(int cpu) +{ + struct rcu_data *rdp = &per_cpu(rcu_data, cpu); + rdp->passed_quiesc = 1; +} +static inline void rcu_bh_qsctr_inc(int cpu) +{ + struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu); + rdp->passed_quiesc = 1; +} + +extern int rcu_pending(int cpu); +extern int rcu_needs_cpu(int cpu); + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +extern struct lockdep_map rcu_lock_map; +# define rcu_read_acquire() \ + lock_acquire(&rcu_lock_map, 0, 0, 2, 1, _THIS_IP_) +# define rcu_read_release() lock_release(&rcu_lock_map, 1, _THIS_IP_) +#else +# define rcu_read_acquire() do { } while (0) +# define rcu_read_release() do { } while (0) +#endif + +#define __rcu_read_lock() \ + do { \ + preempt_disable(); \ + __acquire(RCU); \ + rcu_read_acquire(); \ + } while (0) +#define __rcu_read_unlock() \ + do { \ + rcu_read_release(); \ + __release(RCU); \ + preempt_enable(); \ + } while (0) +#define __rcu_read_lock_bh() \ + do { \ + local_bh_disable(); \ + __acquire(RCU_BH); \ + rcu_read_acquire(); \ + } while (0) +#define __rcu_read_unlock_bh() \ + do { \ + rcu_read_release(); \ + __release(RCU_BH); \ + local_bh_enable(); \ + } while (0) + +#define __synchronize_sched() synchronize_rcu() + +extern void __rcu_init(void); +extern void rcu_check_callbacks(int cpu, int user); +extern void rcu_restart_cpu(int cpu); + +#endif /* __KERNEL__ */ +#endif /* __LINUX_RCUCLASSIC_H */ diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index cc24a01df94..12aa13e1315 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -15,7 +15,7 @@ * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. * - * Copyright (C) IBM Corporation, 2001 + * Copyright IBM Corporation, 2001 * * Author: Dipankar Sarma * @@ -53,96 +53,14 @@ struct rcu_head { void (*func)(struct rcu_head *head); }; +#include + #define RCU_HEAD_INIT { .next = NULL, .func = NULL } #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT #define INIT_RCU_HEAD(ptr) do { \ (ptr)->next = NULL; (ptr)->func = NULL; \ } while (0) - - -/* Global control variables for rcupdate callback mechanism. */ -struct rcu_ctrlblk { - long cur; /* Current batch number. */ - long completed; /* Number of the last completed batch */ - int next_pending; /* Is the next batch already waiting? */ - - int signaled; - - spinlock_t lock ____cacheline_internodealigned_in_smp; - cpumask_t cpumask; /* CPUs that need to switch in order */ - /* for current batch to proceed. */ -} ____cacheline_internodealigned_in_smp; - -/* Is batch a before batch b ? */ -static inline int rcu_batch_before(long a, long b) -{ - return (a - b) < 0; -} - -/* Is batch a after batch b ? */ -static inline int rcu_batch_after(long a, long b) -{ - return (a - b) > 0; -} - -/* - * Per-CPU data for Read-Copy UPdate. - * nxtlist - new callbacks are added here - * curlist - current batch for which quiescent cycle started if any - */ -struct rcu_data { - /* 1) quiescent state handling : */ - long quiescbatch; /* Batch # for grace period */ - int passed_quiesc; /* User-mode/idle loop etc. */ - int qs_pending; /* core waits for quiesc state */ - - /* 2) batch handling */ - long batch; /* Batch # for current RCU batch */ - struct rcu_head *nxtlist; - struct rcu_head **nxttail; - long qlen; /* # of queued callbacks */ - struct rcu_head *curlist; - struct rcu_head **curtail; - struct rcu_head *donelist; - struct rcu_head **donetail; - long blimit; /* Upper limit on a processed batch */ - int cpu; - struct rcu_head barrier; -}; - -DECLARE_PER_CPU(struct rcu_data, rcu_data); -DECLARE_PER_CPU(struct rcu_data, rcu_bh_data); - -/* - * Increment the quiescent state counter. - * The counter is a bit degenerated: We do not need to know - * how many quiescent states passed, just if there was at least - * one since the start of the grace period. Thus just a flag. - */ -static inline void rcu_qsctr_inc(int cpu) -{ - struct rcu_data *rdp = &per_cpu(rcu_data, cpu); - rdp->passed_quiesc = 1; -} -static inline void rcu_bh_qsctr_inc(int cpu) -{ - struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu); - rdp->passed_quiesc = 1; -} - -extern int rcu_pending(int cpu); -extern int rcu_needs_cpu(int cpu); - -#ifdef CONFIG_DEBUG_LOCK_ALLOC -extern struct lockdep_map rcu_lock_map; -# define rcu_read_acquire() lock_acquire(&rcu_lock_map, 0, 0, 2, 1, _THIS_IP_) -# define rcu_read_release() lock_release(&rcu_lock_map, 1, _THIS_IP_) -#else -# define rcu_read_acquire() do { } while (0) -# define rcu_read_release() do { } while (0) -#endif - /** * rcu_read_lock - mark the beginning of an RCU read-side critical section. * @@ -172,24 +90,13 @@ extern struct lockdep_map rcu_lock_map; * * It is illegal to block while in an RCU read-side critical section. */ -#define rcu_read_lock() \ - do { \ - preempt_disable(); \ - __acquire(RCU); \ - rcu_read_acquire(); \ - } while(0) +#define rcu_read_lock() __rcu_read_lock() /** * rcu_read_unlock - marks the end of an RCU read-side critical section. * * See rcu_read_lock() for more information. */ -#define rcu_read_unlock() \ - do { \ - rcu_read_release(); \ - __release(RCU); \ - preempt_enable(); \ - } while(0) /* * So where is rcu_write_lock()? It does not exist, as there is no @@ -200,6 +107,7 @@ extern struct lockdep_map rcu_lock_map; * used as well. RCU does not care how the writers keep out of each * others' way, as long as they do so. */ +#define rcu_read_unlock() __rcu_read_unlock() /** * rcu_read_lock_bh - mark the beginning of a softirq-only RCU critical section @@ -212,24 +120,14 @@ extern struct lockdep_map rcu_lock_map; * can use just rcu_read_lock(). * */ -#define rcu_read_lock_bh() \ - do { \ - local_bh_disable(); \ - __acquire(RCU_BH); \ - rcu_read_acquire(); \ - } while(0) +#define rcu_read_lock_bh() __rcu_read_lock_bh() /* * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section * * See rcu_read_lock_bh() for more information. */ -#define rcu_read_unlock_bh() \ - do { \ - rcu_read_release(); \ - __release(RCU_BH); \ - local_bh_enable(); \ - } while(0) +#define rcu_read_unlock_bh() __rcu_read_unlock_bh() /* * Prevent the compiler from merging or refetching accesses. The compiler @@ -293,21 +191,53 @@ extern struct lockdep_map rcu_lock_map; * In "classic RCU", these two guarantees happen to be one and * the same, but can differ in realtime RCU implementations. */ -#define synchronize_sched() synchronize_rcu() +#define synchronize_sched() __synchronize_sched() + +/** + * call_rcu - Queue an RCU callback for invocation after a grace period. + * @head: structure to be used for queueing the RCU updates. + * @func: actual update function to be invoked after the grace period + * + * The update function will be invoked some time after a full grace + * period elapses, in other words after all currently executing RCU + * read-side critical sections have completed. RCU read-side critical + * sections are delimited by rcu_read_lock() and rcu_read_unlock(), + * and may be nested. + */ +extern void call_rcu(struct rcu_head *head, + void (*func)(struct rcu_head *head)); + +/** + * call_rcu_bh - Queue an RCU for invocation after a quicker grace period. + * @head: structure to be used for queueing the RCU updates. + * @func: actual update function to be invoked after the grace period + * + * The update function will be invoked some time after a full grace + * period elapses, in other words after all currently executing RCU + * read-side critical sections have completed. call_rcu_bh() assumes + * that the read-side critical sections end on completion of a softirq + * handler. This means that read-side critical sections in process + * context must not be interrupted by softirqs. This interface is to be + * used when most of the read-side critical sections are in softirq context. + * RCU read-side critical sections are delimited by : + * - rcu_read_lock() and rcu_read_unlock(), if in interrupt context. + * OR + * - rcu_read_lock_bh() and rcu_read_unlock_bh(), if in process context. + * These may be nested. + */ +extern void call_rcu_bh(struct rcu_head *head, + void (*func)(struct rcu_head *head)); + +/* Exported common interfaces */ +extern void synchronize_rcu(void); +extern void rcu_barrier(void); +/* Internal to kernel */ extern void rcu_init(void); extern void rcu_check_callbacks(int cpu, int user); -extern void rcu_restart_cpu(int cpu); + extern long rcu_batches_completed(void); extern long rcu_batches_completed_bh(void); -/* Exported interfaces */ -extern void FASTCALL(call_rcu(struct rcu_head *head, - void (*func)(struct rcu_head *head))); -extern void FASTCALL(call_rcu_bh(struct rcu_head *head, - void (*func)(struct rcu_head *head))); -extern void synchronize_rcu(void); -extern void rcu_barrier(void); - #endif /* __KERNEL__ */ #endif /* __LINUX_RCUPDATE_H */ -- cgit v1.2.3 From e260be673a15b6125068270e0216a3bfbfc12f87 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 25 Jan 2008 21:08:24 +0100 Subject: Preempt-RCU: implementation This patch implements a new version of RCU which allows its read-side critical sections to be preempted. It uses a set of counter pairs to keep track of the read-side critical sections and flips them when all tasks exit read-side critical section. The details of this implementation can be found in this paper - http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf and the article- http://lwn.net/Articles/253651/ This patch was developed as a part of the -rt kernel development and meant to provide better latencies when read-side critical sections of RCU don't disable preemption. As a consequence of keeping track of RCU readers, the readers have a slight overhead (optimizations in the paper). This implementation co-exists with the "classic" RCU implementations and can be switched to at compiler. Also includes RCU tracing summarized in debugfs. [ akpm@linux-foundation.org: build fixes on non-preempt architectures ] Signed-off-by: Gautham R Shenoy Signed-off-by: Dipankar Sarma Signed-off-by: Paul E. McKenney Reviewed-by: Steven Rostedt Signed-off-by: Ingo Molnar --- include/linux/rcuclassic.h | 3 ++ include/linux/rcupdate.h | 11 +++-- include/linux/rcupreempt.h | 86 ++++++++++++++++++++++++++++++++++ include/linux/rcupreempt_trace.h | 99 ++++++++++++++++++++++++++++++++++++++++ include/linux/sched.h | 5 ++ 5 files changed, 200 insertions(+), 4 deletions(-) create mode 100644 include/linux/rcupreempt.h create mode 100644 include/linux/rcupreempt_trace.h (limited to 'include/linux') diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h index 2b8b045a51d..4d6624260b4 100644 --- a/include/linux/rcuclassic.h +++ b/include/linux/rcuclassic.h @@ -157,5 +157,8 @@ extern void __rcu_init(void); extern void rcu_check_callbacks(int cpu, int user); extern void rcu_restart_cpu(int cpu); +extern long rcu_batches_completed(void); +extern long rcu_batches_completed_bh(void); + #endif /* __KERNEL__ */ #endif /* __LINUX_RCUCLASSIC_H */ diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 12aa13e1315..d32c14de270 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -53,7 +53,11 @@ struct rcu_head { void (*func)(struct rcu_head *head); }; +#ifdef CONFIG_CLASSIC_RCU #include +#else /* #ifdef CONFIG_CLASSIC_RCU */ +#include +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */ #define RCU_HEAD_INIT { .next = NULL, .func = NULL } #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT @@ -231,13 +235,12 @@ extern void call_rcu_bh(struct rcu_head *head, /* Exported common interfaces */ extern void synchronize_rcu(void); extern void rcu_barrier(void); +extern long rcu_batches_completed(void); +extern long rcu_batches_completed_bh(void); /* Internal to kernel */ extern void rcu_init(void); -extern void rcu_check_callbacks(int cpu, int user); - -extern long rcu_batches_completed(void); -extern long rcu_batches_completed_bh(void); +extern int rcu_needs_cpu(int cpu); #endif /* __KERNEL__ */ #endif /* __LINUX_RCUPDATE_H */ diff --git a/include/linux/rcupreempt.h b/include/linux/rcupreempt.h new file mode 100644 index 00000000000..ece8eb3e415 --- /dev/null +++ b/include/linux/rcupreempt.h @@ -0,0 +1,86 @@ +/* + * Read-Copy Update mechanism for mutual exclusion (RT implementation) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright (C) IBM Corporation, 2006 + * + * Author: Paul McKenney + * + * Based on the original work by Paul McKenney + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen. + * Papers: + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001) + * + * For detailed explanation of Read-Copy Update mechanism see - + * Documentation/RCU + * + */ + +#ifndef __LINUX_RCUPREEMPT_H +#define __LINUX_RCUPREEMPT_H + +#ifdef __KERNEL__ + +#include +#include +#include +#include +#include +#include + +#define rcu_qsctr_inc(cpu) +#define rcu_bh_qsctr_inc(cpu) +#define call_rcu_bh(head, rcu) call_rcu(head, rcu) + +extern void __rcu_read_lock(void); +extern void __rcu_read_unlock(void); +extern int rcu_pending(int cpu); +extern int rcu_needs_cpu(int cpu); + +#define __rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); } +#define __rcu_read_unlock_bh() { local_bh_enable(); rcu_read_unlock(); } + +extern void __synchronize_sched(void); + +extern void __rcu_init(void); +extern void rcu_check_callbacks(int cpu, int user); +extern void rcu_restart_cpu(int cpu); +extern long rcu_batches_completed(void); + +/* + * Return the number of RCU batches processed thus far. Useful for debug + * and statistic. The _bh variant is identifcal to straight RCU + */ +static inline long rcu_batches_completed_bh(void) +{ + return rcu_batches_completed(); +} + +#ifdef CONFIG_RCU_TRACE +struct rcupreempt_trace; +extern long *rcupreempt_flipctr(int cpu); +extern long rcupreempt_data_completed(void); +extern int rcupreempt_flip_flag(int cpu); +extern int rcupreempt_mb_flag(int cpu); +extern char *rcupreempt_try_flip_state_name(void); +extern struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu); +#endif + +struct softirq_action; + +#endif /* __KERNEL__ */ +#endif /* __LINUX_RCUPREEMPT_H */ diff --git a/include/linux/rcupreempt_trace.h b/include/linux/rcupreempt_trace.h new file mode 100644 index 00000000000..21cd6b2a5c4 --- /dev/null +++ b/include/linux/rcupreempt_trace.h @@ -0,0 +1,99 @@ +/* + * Read-Copy Update mechanism for mutual exclusion (RT implementation) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright (C) IBM Corporation, 2006 + * + * Author: Paul McKenney + * + * Based on the original work by Paul McKenney + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen. + * Papers: + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001) + * + * For detailed explanation of the Preemptible Read-Copy Update mechanism see - + * http://lwn.net/Articles/253651/ + */ + +#ifndef __LINUX_RCUPREEMPT_TRACE_H +#define __LINUX_RCUPREEMPT_TRACE_H + +#ifdef __KERNEL__ +#include +#include + +#include + +/* + * PREEMPT_RCU data structures. + */ + +struct rcupreempt_trace { + long next_length; + long next_add; + long wait_length; + long wait_add; + long done_length; + long done_add; + long done_remove; + atomic_t done_invoked; + long rcu_check_callbacks; + atomic_t rcu_try_flip_1; + atomic_t rcu_try_flip_e1; + long rcu_try_flip_i1; + long rcu_try_flip_ie1; + long rcu_try_flip_g1; + long rcu_try_flip_a1; + long rcu_try_flip_ae1; + long rcu_try_flip_a2; + long rcu_try_flip_z1; + long rcu_try_flip_ze1; + long rcu_try_flip_z2; + long rcu_try_flip_m1; + long rcu_try_flip_me1; + long rcu_try_flip_m2; +}; + +#ifdef CONFIG_RCU_TRACE +#define RCU_TRACE(fn, arg) fn(arg); +#else +#define RCU_TRACE(fn, arg) +#endif + +extern void rcupreempt_trace_move2done(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_move2wait(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_e1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_i1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_ie1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_g1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_a1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_ae1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_a2(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_z1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_ze1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_z2(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_m1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_me1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_m2(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_check_callbacks(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_done_remove(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_invoke(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_next_add(struct rcupreempt_trace *trace); + +#endif /* __KERNEL__ */ +#endif /* __LINUX_RCUPREEMPT_TRACE_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index f2044e70700..72e1b8ecfbe 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -974,6 +974,11 @@ struct task_struct { int nr_cpus_allowed; unsigned int time_slice; +#ifdef CONFIG_PREEMPT_RCU + int rcu_read_lock_nesting; + int rcu_flipctr_idx; +#endif /* #ifdef CONFIG_PREEMPT_RCU */ + #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info sched_info; #endif -- cgit v1.2.3 From fa717060f1ab7eb6570f2fb49136f838fc9195a9 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 25 Jan 2008 21:08:27 +0100 Subject: sched: sched_rt_entity Move the task_struct members specific to rt scheduling together. A future optimization could be to put sched_entity and sched_rt_entity into a union. Signed-off-by: Peter Zijlstra CC: Srivatsa Vaddagiri Signed-off-by: Ingo Molnar --- include/linux/init_task.h | 5 +++-- include/linux/sched.h | 8 ++++++-- 2 files changed, 9 insertions(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 572c65bcc80..ee65d87bedb 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -133,9 +133,10 @@ extern struct group_info init_groups; .nr_cpus_allowed = NR_CPUS, \ .mm = NULL, \ .active_mm = &init_mm, \ - .run_list = LIST_HEAD_INIT(tsk.run_list), \ + .rt = { \ + .run_list = LIST_HEAD_INIT(tsk.rt.run_list), \ + .time_slice = HZ, }, \ .ioprio = 0, \ - .time_slice = HZ, \ .tasks = LIST_HEAD_INIT(tsk.tasks), \ .ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children), \ .ptrace_list = LIST_HEAD_INIT(tsk.ptrace_list), \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 72e1b8ecfbe..a06d09ebd5c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -929,6 +929,11 @@ struct sched_entity { #endif }; +struct sched_rt_entity { + struct list_head run_list; + unsigned int time_slice; +}; + struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ void *stack; @@ -945,9 +950,9 @@ struct task_struct { #endif int prio, static_prio, normal_prio; - struct list_head run_list; const struct sched_class *sched_class; struct sched_entity se; + struct sched_rt_entity rt; #ifdef CONFIG_PREEMPT_NOTIFIERS /* list of struct preempt_notifier: */ @@ -972,7 +977,6 @@ struct task_struct { unsigned int policy; cpumask_t cpus_allowed; int nr_cpus_allowed; - unsigned int time_slice; #ifdef CONFIG_PREEMPT_RCU int rcu_read_lock_nesting; -- cgit v1.2.3 From 78f2c7db6068fd6ef75b8c120f04a388848eacb5 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 25 Jan 2008 21:08:27 +0100 Subject: sched: SCHED_FIFO/SCHED_RR watchdog timer Introduce a new rlimit that allows the user to set a runtime timeout on real-time tasks their slice. Once this limit is exceeded the task will receive SIGXCPU. So it measures runtime since the last sleep. Input and ideas by Thomas Gleixner and Lennart Poettering. Signed-off-by: Peter Zijlstra CC: Lennart Poettering CC: Michael Kerrisk CC: Ulrich Drepper Signed-off-by: Ingo Molnar --- include/linux/sched.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index a06d09ebd5c..fe3f8fbc614 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -932,6 +932,7 @@ struct sched_entity { struct sched_rt_entity { struct list_head run_list; unsigned int time_slice; + unsigned long timeout; }; struct task_struct { -- cgit v1.2.3 From 02b67cc3ba36bdba351d6c3a00593f4ec550d9d3 Mon Sep 17 00:00:00 2001 From: Herbert Xu Date: Fri, 25 Jan 2008 21:08:28 +0100 Subject: sched: do not do cond_resched() when CONFIG_PREEMPT Why do we even have cond_resched when real preemption is on? It seems to be a waste of space and time. remove cond_resched with CONFIG_PREEMPT on. Signed-off-by: Ingo Molnar --- include/linux/kernel.h | 4 ++-- include/linux/sched.h | 13 ++++++++++++- 2 files changed, 14 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 94bc9965696..a7283c9bead 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -105,8 +105,8 @@ struct user; * supposed to. */ #ifdef CONFIG_PREEMPT_VOLUNTARY -extern int cond_resched(void); -# define might_resched() cond_resched() +extern int _cond_resched(void); +# define might_resched() _cond_resched() #else # define might_resched() do { } while (0) #endif diff --git a/include/linux/sched.h b/include/linux/sched.h index fe3f8fbc614..7907845c234 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1885,7 +1885,18 @@ static inline int need_resched(void) * cond_resched_lock() will drop the spinlock before scheduling, * cond_resched_softirq() will enable bhs before scheduling. */ -extern int cond_resched(void); +#ifdef CONFIG_PREEMPT +static inline int cond_resched(void) +{ + return 0; +} +#else +extern int _cond_resched(void); +static inline int cond_resched(void) +{ + return _cond_resched(); +} +#endif extern int cond_resched_lock(spinlock_t * lock); extern int cond_resched_softirq(void); -- cgit v1.2.3 From 8f4d37ec073c17e2d4aa8851df5837d798606d6f Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 25 Jan 2008 21:08:29 +0100 Subject: sched: high-res preemption tick Use HR-timers (when available) to deliver an accurate preemption tick. The regular scheduler tick that runs at 1/HZ can be too coarse when nice level are used. The fairness system will still keep the cpu utilisation 'fair' by then delaying the task that got an excessive amount of CPU time but try to minimize this by delivering preemption points spot-on. The average frequency of this extra interrupt is sched_latency / nr_latency. Which need not be higher than 1/HZ, its just that the distribution within the sched_latency period is important. Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- include/linux/hrtimer.h | 9 +++++++++ include/linux/sched.h | 3 ++- 2 files changed, 11 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h index 7a9398e1970..ecc8e2685e2 100644 --- a/include/linux/hrtimer.h +++ b/include/linux/hrtimer.h @@ -217,6 +217,11 @@ static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer) return timer->base->get_time(); } +static inline int hrtimer_is_hres_active(struct hrtimer *timer) +{ + return timer->base->cpu_base->hres_active; +} + /* * The resolution of the clocks. The resolution value is returned in * the clock_getres() system call to give application programmers an @@ -248,6 +253,10 @@ static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer) return timer->base->softirq_time; } +static inline int hrtimer_is_hres_active(struct hrtimer *timer) +{ + return 0; +} #endif extern ktime_t ktime_get(void); diff --git a/include/linux/sched.h b/include/linux/sched.h index 7907845c234..43e0339d65f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -257,6 +257,7 @@ extern void trap_init(void); extern void account_process_tick(struct task_struct *task, int user); extern void update_process_times(int user); extern void scheduler_tick(void); +extern void hrtick_resched(void); extern void sched_show_task(struct task_struct *p); @@ -849,7 +850,7 @@ struct sched_class { #endif void (*set_curr_task) (struct rq *rq); - void (*task_tick) (struct rq *rq, struct task_struct *p); + void (*task_tick) (struct rq *rq, struct task_struct *p, int queued); void (*task_new) (struct rq *rq, struct task_struct *p); void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask); -- cgit v1.2.3 From fa85ae2418e6843953107cd6a06f645752829bc0 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 25 Jan 2008 21:08:29 +0100 Subject: sched: rt time limit Very simple time limit on the realtime scheduling classes. Allow the rq's realtime class to consume sched_rt_ratio of every sched_rt_period slice. If the class exceeds this quota the fair class will preempt the realtime class. Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- include/linux/sched.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index 43e0339d65f..d5ea144df83 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1490,6 +1490,8 @@ extern unsigned int sysctl_sched_child_runs_first; extern unsigned int sysctl_sched_features; extern unsigned int sysctl_sched_migration_cost; extern unsigned int sysctl_sched_nr_migrate; +extern unsigned int sysctl_sched_rt_period; +extern unsigned int sysctl_sched_rt_ratio; #if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP) extern unsigned int sysctl_sched_min_bal_int_shares; extern unsigned int sysctl_sched_max_bal_int_shares; -- cgit v1.2.3 From 6f505b16425a51270058e4a93441fe64de3dd435 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 25 Jan 2008 21:08:30 +0100 Subject: sched: rt group scheduling Extend group scheduling to also cover the realtime classes. It uses the time limiting introduced by the previous patch to allow multiple realtime groups. The hard time limit is required to keep behaviour deterministic. The algorithms used make the realtime scheduler O(tg), linear scaling wrt the number of task groups. This is the worst case behaviour I can't seem to get out of, the avg. case of the algorithms can be improved, I focused on correctness and worst case. [ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ] Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- include/linux/init_task.h | 5 +++-- include/linux/sched.h | 10 +++++++++- 2 files changed, 12 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/init_task.h b/include/linux/init_task.h index ee65d87bedb..796019b22b6 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -130,12 +130,13 @@ extern struct group_info init_groups; .normal_prio = MAX_PRIO-20, \ .policy = SCHED_NORMAL, \ .cpus_allowed = CPU_MASK_ALL, \ - .nr_cpus_allowed = NR_CPUS, \ .mm = NULL, \ .active_mm = &init_mm, \ .rt = { \ .run_list = LIST_HEAD_INIT(tsk.rt.run_list), \ - .time_slice = HZ, }, \ + .time_slice = HZ, \ + .nr_cpus_allowed = NR_CPUS, \ + }, \ .ioprio = 0, \ .tasks = LIST_HEAD_INIT(tsk.tasks), \ .ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children), \ diff --git a/include/linux/sched.h b/include/linux/sched.h index d5ea144df83..04eecbf0241 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -934,6 +934,15 @@ struct sched_rt_entity { struct list_head run_list; unsigned int time_slice; unsigned long timeout; + int nr_cpus_allowed; + +#ifdef CONFIG_FAIR_GROUP_SCHED + struct sched_rt_entity *parent; + /* rq on which this entity is (to be) queued: */ + struct rt_rq *rt_rq; + /* rq "owned" by this entity/group: */ + struct rt_rq *my_q; +#endif }; struct task_struct { @@ -978,7 +987,6 @@ struct task_struct { unsigned int policy; cpumask_t cpus_allowed; - int nr_cpus_allowed; #ifdef CONFIG_PREEMPT_RCU int rcu_read_lock_nesting; -- cgit v1.2.3 From 48d5e258216f1c7713633439beb98a38c7290649 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 25 Jan 2008 21:08:31 +0100 Subject: sched: rt throttling vs no_hz We need to teach no_hz about the rt throttling because its tick driven. Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- include/linux/sched.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index 04eecbf0241..acadcab89ef 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -230,6 +230,8 @@ static inline int select_nohz_load_balancer(int cpu) } #endif +extern unsigned long rt_needs_cpu(int cpu); + /* * Only dump TASK_* tasks. (0 for all tasks) */ -- cgit v1.2.3 From d3d74453c34f8fd87674a8cf5b8a327c68f22e99 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 25 Jan 2008 21:08:31 +0100 Subject: hrtimer: fixup the HRTIMER_CB_IRQSAFE_NO_SOFTIRQ fallback Currently all highres=off timers are run from softirq context, but HRTIMER_CB_IRQSAFE_NO_SOFTIRQ timers expect to run from irq context. Fix this up by splitting it similar to the highres=on case. Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- include/linux/hrtimer.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h index ecc8e2685e2..49067f14fac 100644 --- a/include/linux/hrtimer.h +++ b/include/linux/hrtimer.h @@ -115,10 +115,8 @@ struct hrtimer { enum hrtimer_restart (*function)(struct hrtimer *); struct hrtimer_clock_base *base; unsigned long state; -#ifdef CONFIG_HIGH_RES_TIMERS enum hrtimer_cb_mode cb_mode; struct list_head cb_entry; -#endif #ifdef CONFIG_TIMER_STATS void *start_site; char start_comm[16]; @@ -194,10 +192,10 @@ struct hrtimer_cpu_base { spinlock_t lock; struct lock_class_key lock_key; struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES]; + struct list_head cb_pending; #ifdef CONFIG_HIGH_RES_TIMERS ktime_t expires_next; int hres_active; - struct list_head cb_pending; unsigned long nr_events; #endif }; @@ -319,6 +317,7 @@ extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl, /* Soft interrupt function to run the hrtimer queues: */ extern void hrtimer_run_queues(void); +extern void hrtimer_run_pending(void); /* Bootup initialization: */ extern void __init hrtimers_init(void); -- cgit v1.2.3 From 6478d8800b75253b2a934ddcb734e13ade023ad0 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Fri, 25 Jan 2008 21:08:33 +0100 Subject: sched: remove the !PREEMPT_BKL code remove the !PREEMPT_BKL code. this removes 160 lines of legacy code. Signed-off-by: Ingo Molnar --- include/linux/hardirq.h | 6 +----- include/linux/smp_lock.h | 14 +------------- 2 files changed, 2 insertions(+), 18 deletions(-) (limited to 'include/linux') diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index 8d302298a16..2961ec78804 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -72,11 +72,7 @@ #define in_softirq() (softirq_count()) #define in_interrupt() (irq_count()) -#if defined(CONFIG_PREEMPT) && !defined(CONFIG_PREEMPT_BKL) -# define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != kernel_locked()) -#else -# define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0) -#endif +#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0) #ifdef CONFIG_PREEMPT # define PREEMPT_CHECK_OFFSET 1 diff --git a/include/linux/smp_lock.h b/include/linux/smp_lock.h index 58962c51dee..aab3a4cff4e 100644 --- a/include/linux/smp_lock.h +++ b/include/linux/smp_lock.h @@ -17,22 +17,10 @@ extern void __lockfunc __release_kernel_lock(void); __release_kernel_lock(); \ } while (0) -/* - * Non-SMP kernels will never block on the kernel lock, - * so we are better off returning a constant zero from - * reacquire_kernel_lock() so that the compiler can see - * it at compile-time. - */ -#if defined(CONFIG_SMP) && !defined(CONFIG_PREEMPT_BKL) -# define return_value_on_smp return -#else -# define return_value_on_smp -#endif - static inline int reacquire_kernel_lock(struct task_struct *task) { if (unlikely(task->lock_depth >= 0)) - return_value_on_smp __reacquire_kernel_lock(); + return __reacquire_kernel_lock(); return 0; } -- cgit v1.2.3 From e118adef232e637a8f091c1ded2fbf44fcf3ecc8 Mon Sep 17 00:00:00 2001 From: Pavel Machek Date: Fri, 25 Jan 2008 21:08:34 +0100 Subject: timers: don't #error on higher HZ values For some crazy reason (trying to work around hw problem in i810) I wanted to use HZ around 4000. Signed-off-by: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Ingo Molnar --- include/linux/jiffies.h | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'include/linux') diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h index 8b080024bbc..7ba9e47bf06 100644 --- a/include/linux/jiffies.h +++ b/include/linux/jiffies.h @@ -29,6 +29,12 @@ # define SHIFT_HZ 9 #elif HZ >= 768 && HZ < 1536 # define SHIFT_HZ 10 +#elif HZ >= 1536 && HZ < 3072 +# define SHIFT_HZ 11 +#elif HZ >= 3072 && HZ < 6144 +# define SHIFT_HZ 12 +#elif HZ >= 6144 && HZ < 12288 +# define SHIFT_HZ 13 #else # error You lose. #endif -- cgit v1.2.3 From 9745512ce79de686df354dc70a8d1a74d801892d Mon Sep 17 00:00:00 2001 From: Arjan van de Ven Date: Fri, 25 Jan 2008 21:08:34 +0100 Subject: sched: latencytop support LatencyTOP kernel infrastructure; it measures latencies in the scheduler and tracks it system wide and per process. Signed-off-by: Arjan van de Ven Signed-off-by: Ingo Molnar --- include/linux/latencytop.h | 44 ++++++++++++++++++++++++++++++++++++++++++++ include/linux/sched.h | 5 +++++ include/linux/stacktrace.h | 3 +++ 3 files changed, 52 insertions(+) create mode 100644 include/linux/latencytop.h (limited to 'include/linux') diff --git a/include/linux/latencytop.h b/include/linux/latencytop.h new file mode 100644 index 00000000000..901c2d6377a --- /dev/null +++ b/include/linux/latencytop.h @@ -0,0 +1,44 @@ +/* + * latencytop.h: Infrastructure for displaying latency + * + * (C) Copyright 2008 Intel Corporation + * Author: Arjan van de Ven + * + */ + +#ifndef _INCLUDE_GUARD_LATENCYTOP_H_ +#define _INCLUDE_GUARD_LATENCYTOP_H_ + +#ifdef CONFIG_LATENCYTOP + +#define LT_SAVECOUNT 32 +#define LT_BACKTRACEDEPTH 12 + +struct latency_record { + unsigned long backtrace[LT_BACKTRACEDEPTH]; + unsigned int count; + unsigned long time; + unsigned long max; +}; + + +struct task_struct; + +void account_scheduler_latency(struct task_struct *task, int usecs, int inter); + +void clear_all_latency_tracing(struct task_struct *p); + +#else + +static inline void +account_scheduler_latency(struct task_struct *task, int usecs, int inter) +{ +} + +static inline void clear_all_latency_tracing(struct task_struct *p) +{ +} + +#endif + +#endif diff --git a/include/linux/sched.h b/include/linux/sched.h index acadcab89ef..dfc76e172f3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -88,6 +88,7 @@ struct sched_param { #include #include #include +#include #include @@ -1220,6 +1221,10 @@ struct task_struct { int make_it_fail; #endif struct prop_local_single dirties; +#ifdef CONFIG_LATENCYTOP + int latency_record_count; + struct latency_record latency_record[LT_SAVECOUNT]; +#endif }; /* diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h index e7fa657d0c4..5da9794b2d7 100644 --- a/include/linux/stacktrace.h +++ b/include/linux/stacktrace.h @@ -9,10 +9,13 @@ struct stack_trace { }; extern void save_stack_trace(struct stack_trace *trace); +extern void save_stack_trace_tsk(struct task_struct *tsk, + struct stack_trace *trace); extern void print_stack_trace(struct stack_trace *trace, int spaces); #else # define save_stack_trace(trace) do { } while (0) +# define save_stack_trace_tsk(tsk, trace) do { } while (0) # define print_stack_trace(trace, spaces) do { } while (0) #endif -- cgit v1.2.3 From 90739081ef8d5495d50abba9c5d333be9acd872a Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Fri, 25 Jan 2008 21:08:34 +0100 Subject: softlockup: fix signedness fix softlockup tunables signedness. mark tunables read-mostly. Signed-off-by: Ingo Molnar --- include/linux/sched.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index dfc76e172f3..53534f90a96 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -269,10 +269,10 @@ extern void softlockup_tick(void); extern void spawn_softlockup_task(void); extern void touch_softlockup_watchdog(void); extern void touch_all_softlockup_watchdogs(void); -extern int softlockup_thresh; +extern unsigned long softlockup_thresh; extern unsigned long sysctl_hung_task_check_count; extern unsigned long sysctl_hung_task_timeout_secs; -extern long sysctl_hung_task_warnings; +extern unsigned long sysctl_hung_task_warnings; #else static inline void softlockup_tick(void) { -- cgit v1.2.3 From 286100a6cf1c1f692e5f81d14b364ff12b7662f5 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Fri, 25 Jan 2008 21:08:34 +0100 Subject: sched, futex: detach sched.h and futex.h Signed-off-by: Alexey Dobriyan Signed-off-by: Ingo Molnar --- include/linux/futex.h | 6 +++++- include/linux/sched.h | 2 +- 2 files changed, 6 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/futex.h b/include/linux/futex.h index 92d420fe03f..1a15f8e237a 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -1,8 +1,12 @@ #ifndef _LINUX_FUTEX_H #define _LINUX_FUTEX_H -#include +#include +#include +struct inode; +struct mm_struct; +struct task_struct; union ktime; /* Second argument to futex syscall */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 53534f90a96..734f6d8f6ed 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -78,7 +78,6 @@ struct sched_param { #include #include #include -#include #include #include @@ -94,6 +93,7 @@ struct sched_param { struct exec_domain; struct futex_pi_state; +struct robust_list_head; struct bio; /* -- cgit v1.2.3 From 6d082592b62689fb91578d0338d04a9f50991990 Mon Sep 17 00:00:00 2001 From: Arjan van de Ven Date: Fri, 25 Jan 2008 21:08:35 +0100 Subject: sched: keep total / count stats in addition to the max for Right now, the linux kernel (with scheduler statistics enabled) keeps track of the maximum time a process is waiting to be scheduled. While the maximum is a very useful metric, tracking average and total is equally useful (at least for latencytop) to figure out the accumulated effect of scheduler delays. The accumulated effect is important to judge the performance impact of scheduler tuning/behavior. Signed-off-by: Arjan van de Ven Signed-off-by: Ingo Molnar --- include/linux/sched.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index 734f6d8f6ed..df5b24ee80b 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -895,6 +895,8 @@ struct sched_entity { #ifdef CONFIG_SCHEDSTATS u64 wait_start; u64 wait_max; + u64 wait_count; + u64 wait_sum; u64 sleep_start; u64 sleep_max; -- cgit v1.2.3