From b9b158fe1ca2c166ff918de30cb098eafcae487a Mon Sep 17 00:00:00 2001 From: Viktor Radnai Date: Sat, 19 Apr 2008 19:45:01 +0200 Subject: sched: better rt-group documentation Viktor was nice enough to enhance the document based on my replies to his questions on the subject. Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- Documentation/scheduler/sched-rt-group.txt | 188 +++++++++++++++++++++++------ 1 file changed, 153 insertions(+), 35 deletions(-) (limited to 'Documentation/scheduler') diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt index 1c6332f4543..14f901f639e 100644 --- a/Documentation/scheduler/sched-rt-group.txt +++ b/Documentation/scheduler/sched-rt-group.txt @@ -1,59 +1,177 @@ + Real-Time group scheduling + -------------------------- +CONTENTS +======== -Real-Time group scheduling. +1. Overview + 1.1 The problem + 1.2 The solution +2. The interface + 2.1 System-wide settings + 2.2 Default behaviour + 2.3 Basis for grouping tasks +3. Future plans -The problem space: -In order to schedule multiple groups of realtime tasks each group must -be assigned a fixed portion of the CPU time available. Without a minimum -guarantee a realtime group can obviously fall short. A fuzzy upper limit -is of no use since it cannot be relied upon. Which leaves us with just -the single fixed portion. +1. Overview +=========== -CPU time is divided by means of specifying how much time can be spent -running in a given period. Say a frame fixed realtime renderer must -deliver 25 frames a second, which yields a period of 0.04s. Now say -it will also have to play some music and respond to input, leaving it -with around 80% for the graphics. We can then give this group a runtime -of 0.8 * 0.04s = 0.032s. -This way the graphics group will have a 0.04s period with a 0.032s runtime -limit. +1.1 The problem +--------------- -Now if the audio thread needs to refill the DMA buffer every 0.005s, but -needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s -= 0.00015s. +Realtime scheduling is all about determinism, a group has to be able to rely on +the amount of bandwidth (eg. CPU time) being constant. In order to schedule +multiple groups of realtime tasks, each group must be assigned a fixed portion +of the CPU time available. Without a minimum guarantee a realtime group can +obviously fall short. A fuzzy upper limit is of no use since it cannot be +relied upon. Which leaves us with just the single fixed portion. +1.2 The solution +---------------- -The Interface: +CPU time is divided by means of specifying how much time can be spent running +in a given period. We allocate this "run time" for each realtime group which +the other realtime groups will not be permitted to use. -system wide: +Any time not allocated to a realtime group will be used to run normal priority +tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by +SCHED_OTHER. -/proc/sys/kernel/sched_rt_period_ms -/proc/sys/kernel/sched_rt_runtime_us +Let's consider an example: a frame fixed realtime renderer must deliver 25 +frames a second, which yields a period of 0.04s per frame. Now say it will also +have to play some music and respond to input, leaving it with around 80% CPU +time dedicated for the graphics. We can then give this group a run time of 0.8 +* 0.04s = 0.032s. -CONFIG_FAIR_USER_SCHED +This way the graphics group will have a 0.04s period with a 0.032s run time +limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but +needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s = +0.00015s. So this group can be scheduled with a period of 0.005s and a run time +of 0.00015s. -/sys/kernel/uids//cpu_rt_runtime_us +The remaining CPU time will be used for user input and other tass. Because +realtime tasks have explicitly allocated the CPU time they need to perform +their tasks, buffer underruns in the graphocs or audio can be eliminated. -or +NOTE: the above example is not fully implemented as of yet (2.6.25). We still +lack an EDF scheduler to make non-uniform periods usable. -CONFIG_FAIR_CGROUP_SCHED -/cgroup//cpu.rt_runtime_us +2. The Interface +================ -[ time is specified in us because the interface is s32; this gives an - operating range of ~35m to 1us ] -The period takes values in [ 1, INT_MAX ], runtime in [ -1, INT_MAX - 1 ]. +2.1 System wide settings +------------------------ -A runtime of -1 specifies runtime == period, ie. no limit. +The system wide settings are configured under the /proc virtual file system: -New groups get the period from /proc/sys/kernel/sched_rt_period_us and -a runtime of 0. +/proc/sys/kernel/sched_rt_period_us: + The scheduling period that is equivalent to 100% CPU bandwidth -Settings are constrained to: +/proc/sys/kernel/sched_rt_runtime_us: + A global limit on how much time realtime scheduling may use. Even without + CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime + processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth + available to all realtime groups. + + * Time is specified in us because the interface is s32. This gives an + operating range from 1us to about 35 minutes. + * sched_rt_period_us takes values from 1 to INT_MAX. + * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1). + * A run time of -1 specifies runtime == period, ie. no limit. + + +2.2 Default behaviour +--------------------- + +The default values for sched_rt_period_us (1000000 or 1s) and +sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by +SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away +realtime tasks will not lock up the machine but leave a little time to recover +it. By setting runtime to -1 you'd get the old behaviour back. + +By default all bandwidth is assigned to the root group and new groups get the +period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you +want to assign bandwidth to another group, reduce the root group's bandwidth +and assign some or all of the difference to another group. + +Realtime group scheduling means you have to assign a portion of total CPU +bandwidth to the group before it will accept realtime tasks. Therefore you will +not be able to run realtime tasks as any user other than root until you have +done that, even if the user has the rights to run processes with realtime +priority! + + +2.3 Basis for grouping tasks +---------------------------- + +There are two compile-time settings for allocating CPU bandwidth. These are +configured using the "Basis for grouping tasks" multiple choice menu under +General setup > Group CPU Scheduler: + +a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id") + +This lets you use the virtual files under +"/sys/kernel/uids//cpu_rt_runtime_us" to control he CPU time reserved for +each user . + +The other option is: + +.o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups") + +This uses the /cgroup virtual file system and "/cgroup//cpu.rt_runtime_us" +to control the CPU time reserved for each control group instead. + +For more information on working with control groups, you should read +Documentation/cgroups.txt as well. + +Group settings are checked against the following limits in order to keep the configuration +schedulable: \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period -in order to keep the configuration schedulable. +For now, this can be simplified to just the following (but see Future plans): + + \Sum_{i} runtime_{i} <= global_runtime + + +3. Future plans +=============== + +There is work in progress to make the scheduling period for each group +("/sys/kernel/uids//cpu_rt_period_us" or +"/cgroup//cpu.rt_period_us" respectively) configurable as well. + +The constraint on the period is that a subgroup must have a smaller or +equal period to its parent. But realistically its not very useful _yet_ +as its prone to starvation without deadline scheduling. + +Consider two sibling groups A and B; both have 50% bandwidth, but A's +period is twice the length of B's. + +* group A: period=100000us, runtime=10000us + - this runs for 0.01s once every 0.1s + +* group B: period= 50000us, runtime=10000us + - this runs for 0.01s twice every 0.1s (or once every 0.05 sec). + +This means that currently a while (1) loop in A will run for the full period of +B and can starve B's tasks (assuming they are of lower priority) for a whole +period. + +The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring +full deadline scheduling to the linux kernel. Deadline scheduling the above +groups and treating end of the period as a deadline will ensure that they both +get their allocated time. + +Implementing SCHED_EDF might take a while to complete. Priority Inheritance is +the biggest challenge as the current linux PI infrastructure is geared towards +the limited static priority levels 0-139. With deadline scheduling you need to +do deadline inheritance (since priority is inversely proportional to the +deadline delta (deadline - now). + +This means the whole PI machinery will have to be reworked - and that is one of +the most complex pieces of code we have. -- cgit v1.2.3