diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/DocBook/kernel-api.tmpl | 2 | ||||
-rw-r--r-- | Documentation/accounting/cgroupstats.txt | 27 | ||||
-rw-r--r-- | Documentation/cachetlb.txt | 27 | ||||
-rw-r--r-- | Documentation/cgroups.txt | 545 | ||||
-rw-r--r-- | Documentation/cpu-hotplug.txt | 4 | ||||
-rw-r--r-- | Documentation/cpusets.txt | 226 | ||||
-rw-r--r-- | Documentation/input/input-programming.txt | 15 | ||||
-rw-r--r-- | Documentation/kdump/kdump.txt | 26 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 7 | ||||
-rw-r--r-- | Documentation/markers.txt | 81 |
10 files changed, 878 insertions, 82 deletions
diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index d3290c46af5..aa38cc5692a 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -46,7 +46,7 @@ <sect1><title>Atomic and pointer manipulation</title> !Iinclude/asm-x86/atomic_32.h -!Iinclude/asm-x86/unaligned_32.h +!Iinclude/asm-x86/unaligned.h </sect1> <sect1><title>Delaying, scheduling, and timer routines</title> diff --git a/Documentation/accounting/cgroupstats.txt b/Documentation/accounting/cgroupstats.txt new file mode 100644 index 00000000000..eda40fd39ca --- /dev/null +++ b/Documentation/accounting/cgroupstats.txt @@ -0,0 +1,27 @@ +Control Groupstats is inspired by the discussion at +http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as +suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. + +Per cgroup statistics infrastructure re-uses code from the taskstats +interface. A new set of cgroup operations are registered with commands +and attributes specific to cgroups. It should be very easy to +extend per cgroup statistics, by adding members to the cgroupstats +structure. + +The current model for cgroupstats is a pull, a push model (to post +statistics on interesting events), should be very easy to add. Currently +user space requests for statistics by passing the cgroup path. +Statistics about the state of all the tasks in the cgroup is returned to +user space. + +NOTE: We currently rely on delay accounting for extracting information +about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this +information will not be available. + +To extract cgroup statistics a utility very similar to getdelays.c +has been developed, the sample output of the utility is shown below + +~/balbir/cgroupstats # ./getdelays -C "/cgroup/a" +sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0 +~/balbir/cgroupstats # ./getdelays -C "/cgroup" +sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2 diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt index 552cabac060..da42ab414c4 100644 --- a/Documentation/cachetlb.txt +++ b/Documentation/cachetlb.txt @@ -87,30 +87,7 @@ changes occur: This is used primarily during fault processing. -5) void flush_tlb_pgtables(struct mm_struct *mm, - unsigned long start, unsigned long end) - - The software page tables for address space 'mm' for virtual - addresses in the range 'start' to 'end-1' are being torn down. - - Some platforms cache the lowest level of the software page tables - in a linear virtually mapped array, to make TLB miss processing - more efficient. On such platforms, since the TLB is caching the - software page table structure, it needs to be flushed when parts - of the software page table tree are unlinked/freed. - - Sparc64 is one example of a platform which does this. - - Usually, when munmap()'ing an area of user virtual address - space, the kernel leaves the page table parts around and just - marks the individual pte's as invalid. However, if very large - portions of the address space are unmapped, the kernel frees up - those portions of the software page tables to prevent potential - excessive kernel memory usage caused by erratic mmap/mmunmap - sequences. It is at these times that flush_tlb_pgtables will - be invoked. - -6) void update_mmu_cache(struct vm_area_struct *vma, +5) void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t pte) At the end of every page fault, this routine is invoked to @@ -123,7 +100,7 @@ changes occur: translations for software managed TLB configurations. The sparc64 port currently does this. -7) void tlb_migrate_finish(struct mm_struct *mm) +6) void tlb_migrate_finish(struct mm_struct *mm) This interface is called at the end of an explicit process migration. This interface provides a hook diff --git a/Documentation/cgroups.txt b/Documentation/cgroups.txt new file mode 100644 index 00000000000..98a26f81fa7 --- /dev/null +++ b/Documentation/cgroups.txt @@ -0,0 +1,545 @@ + CGROUPS + ------- + +Written by Paul Menage <menage@google.com> based on Documentation/cpusets.txt + +Original copyright statements from cpusets.txt: +Portions Copyright (C) 2004 BULL SA. +Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. +Modified by Paul Jackson <pj@sgi.com> +Modified by Christoph Lameter <clameter@sgi.com> + +CONTENTS: +========= + +1. Control Groups + 1.1 What are cgroups ? + 1.2 Why are cgroups needed ? + 1.3 How are cgroups implemented ? + 1.4 What does notify_on_release do ? + 1.5 How do I use cgroups ? +2. Usage Examples and Syntax + 2.1 Basic Usage + 2.2 Attaching processes +3. Kernel API + 3.1 Overview + 3.2 Synchronization + 3.3 Subsystem API +4. Questions + +1. Control Groups +========== + +1.1 What are cgroups ? +---------------------- + +Control Groups provide a mechanism for aggregating/partitioning sets of +tasks, and all their future children, into hierarchical groups with +specialized behaviour. + +Definitions: + +A *cgroup* associates a set of tasks with a set of parameters for one +or more subsystems. + +A *subsystem* is a module that makes use of the task grouping +facilities provided by cgroups to treat groups of tasks in +particular ways. A subsystem is typically a "resource controller" that +schedules a resource or applies per-cgroup limits, but it may be +anything that wants to act on a group of processes, e.g. a +virtualization subsystem. + +A *hierarchy* is a set of cgroups arranged in a tree, such that +every task in the system is in exactly one of the cgroups in the +hierarchy, and a set of subsystems; each subsystem has system-specific +state attached to each cgroup in the hierarchy. Each hierarchy has +an instance of the cgroup virtual filesystem associated with it. + +At any one time there may be multiple active hierachies of task +cgroups. Each hierarchy is a partition of all tasks in the system. + +User level code may create and destroy cgroups by name in an +instance of the cgroup virtual file system, specify and query to +which cgroup a task is assigned, and list the task pids assigned to +a cgroup. Those creations and assignments only affect the hierarchy +associated with that instance of the cgroup file system. + +On their own, the only use for cgroups is for simple job +tracking. The intention is that other subsystems hook into the generic +cgroup support to provide new attributes for cgroups, such as +accounting/limiting the resources which processes in a cgroup can +access. For example, cpusets (see Documentation/cpusets.txt) allows +you to associate a set of CPUs and a set of memory nodes with the +tasks in each cgroup. + +1.2 Why are cgroups needed ? +---------------------------- + +There are multiple efforts to provide process aggregations in the +Linux kernel, mainly for resource tracking purposes. Such efforts +include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server +namespaces. These all require the basic notion of a +grouping/partitioning of processes, with newly forked processes ending +in the same group (cgroup) as their parent process. + +The kernel cgroup patch provides the minimum essential kernel +mechanisms required to efficiently implement such groups. It has +minimal impact on the system fast paths, and provides hooks for +specific subsystems such as cpusets to provide additional behaviour as +desired. + +Multiple hierarchy support is provided to allow for situations where +the division of tasks into cgroups is distinctly different for +different subsystems - having parallel hierarchies allows each +hierarchy to be a natural division of tasks, without having to handle +complex combinations of tasks that would be present if several +unrelated subsystems needed to be forced into the same tree of +cgroups. + +At one extreme, each resource controller or subsystem could be in a +separate hierarchy; at the other extreme, all subsystems +would be attached to the same hierarchy. + +As an example of a scenario (originally proposed by vatsa@in.ibm.com) +that can benefit from multiple hierarchies, consider a large +university server with various users - students, professors, system +tasks etc. The resource planning for this server could be along the +following lines: + + CPU : Top cpuset + / \ + CPUSet1 CPUSet2 + | | + (Profs) (Students) + + In addition (system tasks) are attached to topcpuset (so + that they can run anywhere) with a limit of 20% + + Memory : Professors (50%), students (30%), system (20%) + + Disk : Prof (50%), students (30%), system (20%) + + Network : WWW browsing (20%), Network File System (60%), others (20%) + / \ + Prof (15%) students (5%) + +Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go +into NFS network class. + +At the same time firefox/lynx will share an appropriate CPU/Memory class +depending on who launched it (prof/student). + +With the ability to classify tasks differently for different resources +(by putting those resource subsystems in different hierarchies) then +the admin can easily set up a script which receives exec notifications +and depending on who is launching the browser he can + + # echo browser_pid > /mnt/<restype>/<userclass>/tasks + +With only a single hierarchy, he now would potentially have to create +a separate cgroup for every browser launched and associate it with +approp network and other resource class. This may lead to +proliferation of such cgroups. + +Also lets say that the administrator would like to give enhanced network +access temporarily to a student's browser (since it is night and the user +wants to do online gaming :) OR give one of the students simulation +apps enhanced CPU power, + +With ability to write pids directly to resource classes, its just a +matter of : + + # echo pid > /mnt/network/<new_class>/tasks + (after some time) + # echo pid > /mnt/network/<orig_class>/tasks + +Without this ability, he would have to split the cgroup into +multiple separate ones and then associate the new cgroups with the +new resource classes. + + + +1.3 How are cgroups implemented ? +--------------------------------- + +Control Groups extends the kernel as follows: + + - Each task in the system has a reference-counted pointer to a + css_set. + + - A css_set contains a set of reference-counted pointers to + cgroup_subsys_state objects, one for each cgroup subsystem + registered in the system. There is no direct link from a task to + the cgroup of which it's a member in each hierarchy, but this + can be determined by following pointers through the + cgroup_subsys_state objects. This is because accessing the + subsystem state is something that's expected to happen frequently + and in performance-critical code, whereas operations that require a + task's actual cgroup assignments (in particular, moving between + cgroups) are less common. A linked list runs through the cg_list + field of each task_struct using the css_set, anchored at + css_set->tasks. + + - A cgroup hierarchy filesystem can be mounted for browsing and + manipulation from user space. + + - You can list all the tasks (by pid) attached to any cgroup. + +The implementation of cgroups requires a few, simple hooks +into the rest of the kernel, none in performance critical paths: + + - in init/main.c, to initialize the root cgroups and initial + css_set at system boot. + + - in fork and exit, to attach and detach a task from its css_set. + +In addition a new file system, of type "cgroup" may be mounted, to +enable browsing and modifying the cgroups presently known to the +kernel. When mounting a cgroup hierarchy, you may specify a +comma-separated list of subsystems to mount as the filesystem mount +options. By default, mounting the cgroup filesystem attempts to +mount a hierarchy containing all registered subsystems. + +If an active hierarchy with exactly the same set of subsystems already +exists, it will be reused for the new mount. If no existing hierarchy +matches, and any of the requested subsystems are in use in an existing +hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy +is activated, associated with the requested subsystems. + +It's not currently possible to bind a new subsystem to an active +cgroup hierarchy, or to unbind a subsystem from an active cgroup +hierarchy. This may be possible in future, but is fraught with nasty +error-recovery issues. + +When a cgroup filesystem is unmounted, if there are any +child cgroups created below the top-level cgroup, that hierarchy +will remain active even though unmounted; if there are no +child cgroups then the hierarchy will be deactivated. + +No new system calls are added for cgroups - all support for +querying and modifying cgroups is via this cgroup file system. + +Each task under /proc has an added file named 'cgroup' displaying, +for each active hierarchy, the subsystem names and the cgroup name +as the path relative to the root of the cgroup file system. + +Each cgroup is represented by a directory in the cgroup file system +containing the following files describing that cgroup: + + - tasks: list of tasks (by pid) attached to that cgroup + - notify_on_release flag: run /sbin/cgroup_release_agent on exit? + +Other subsystems such as cpusets may add additional files in each +cgroup dir + +New cgroups are created using the mkdir system call or shell +command. The properties of a cgroup, such as its flags, are +modified by writing to the appropriate file in that cgroups +directory, as listed above. + +The named hierarchical structure of nested cgroups allows partitioning +a large system into nested, dynamically changeable, "soft-partitions". + +The attachment of each task, automatically inherited at fork by any +children of that task, to a cgroup allows organizing the work load +on a system into related sets of tasks. A task may be re-attached to +any other cgroup, if allowed by the permissions on the necessary +cgroup file system directories. + +When a task is moved from one cgroup to another, it gets a new +css_set pointer - if there's an already existing css_set with the +desired collection of cgroups then that group is reused, else a new +css_set is allocated. Note that the current implementation uses a +linear search to locate an appropriate existing css_set, so isn't +very efficient. A future version will use a hash table for better +performance. + +To allow access from a cgroup to the css_sets (and hence tasks) +that comprise it, a set of cg_cgroup_link objects form a lattice; +each cg_cgroup_link is linked into a list of cg_cgroup_links for +a single cgroup on its cont_link_list field, and a list of +cg_cgroup_links for a single css_set on its cg_link_list. + +Thus the set of tasks in a cgroup can be listed by iterating over +each css_set that references the cgroup, and sub-iterating over +each css_set's task set. + +The use of a Linux virtual file system (vfs) to represent the +cgroup hierarchy provides for a familiar permission and name space +for cgroups, with a minimum of additional kernel code. + +1.4 What does notify_on_release do ? +------------------------------------ + +*** notify_on_release is disabled in the current patch set. It will be +*** reactivated in a future patch in a less-intrusive manner + +If the notify_on_release flag is enabled (1) in a cgroup, then +whenever the last task in the cgroup leaves (exits or attaches to +some other cgroup) and the last child cgroup of that cgroup +is removed, then the kernel runs the command specified by the contents +of the "release_agent" file in that hierarchy's root directory, +supplying the pathname (relative to the mount point of the cgroup +file system) of the abandoned cgroup. This enables automatic +removal of abandoned cgroups. The default value of +notify_on_release in the root cgroup at system boot is disabled +(0). The default value of other cgroups at creation is the current +value of their parents notify_on_release setting. The default value of +a cgroup hierarchy's release_agent path is empty. + +1.5 How do I use cgroups ? +-------------------------- + +To start a new job that is to be contained within a cgroup, using +the "cpuset" cgroup subsystem, the steps are something like: + + 1) mkdir /dev/cgroup + 2) mount -t cgroup -ocpuset cpuset /dev/cgroup + 3) Create the new cgroup by doing mkdir's and write's (or echo's) in + the /dev/cgroup virtual file system. + 4) Start a task that will be the "founding father" of the new job. + 5) Attach that task to the new cgroup by writing its pid to the + /dev/cgroup tasks file for that cgroup. + 6) fork, exec or clone the job tasks from this founding father task. + +For example, the following sequence of commands will setup a cgroup +named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, +and then start a subshell 'sh' in that cgroup: + + mount -t cgroup cpuset -ocpuset /dev/cgroup + cd /dev/cgroup + mkdir Charlie + cd Charlie + /bin/echo 2-3 > cpus + /bin/echo 1 > mems + /bin/echo $$ > tasks + sh + # The subshell 'sh' is now running in cgroup Charlie + # The next line should display '/Charlie' + cat /proc/self/cgroup + +2. Usage Examples and Syntax +============================ + +2.1 Basic Usage +--------------- + +Creating, modifying, using the cgroups can be done through the cgroup +virtual filesystem. + +To mount a cgroup hierarchy will all available subsystems, type: +# mount -t cgroup xxx /dev/cgroup + +The "xxx" is not interpreted by the cgroup code, but will appear in +/proc/mounts so may be any useful identifying string that you like. + +To mount a cgroup hierarchy with just the cpuset and numtasks +subsystems, type: +# mount -t cgroup -o cpuset,numtasks hier1 /dev/cgroup + +To change the set of subsystems bound to a mounted hierarchy, just +remount with different options: + +# mount -o remount,cpuset,ns /dev/cgroup + +Note that changing the set of subsystems is currently only supported +when the hierarchy consists of a single (root) cgroup. Supporting +the ability to arbitrarily bind/unbind subsystems from an existing +cgroup hierarchy is intended to be implemented in the future. + +Then under /dev/cgroup you can find a tree that corresponds to the +tree of the cgroups in the system. For instance, /dev/cgroup +is the cgroup that holds the whole system. + +If you want to create a new cgroup under /dev/cgroup: +# cd /dev/cgroup +# mkdir my_cgroup + +Now you want to do something with this cgroup. +# cd my_cgroup + +In this directory you can find several files: +# ls +notify_on_release release_agent tasks +(plus whatever files are added by the attached subsystems) + +Now attach your shell to this cgroup: +# /bin/echo $$ > tasks + +You can also create cgroups inside your cgroup by using mkdir in this +directory. +# mkdir my_sub_cs + +To remove a cgroup, just use rmdir: +# rmdir my_sub_cs + +This will fail if the cgroup is in use (has cgroups inside, or +has processes attached, or is held alive by other subsystem-specific +reference). + +2.2 Attaching processes +----------------------- + +# /bin/echo PID > tasks + +Note that it is PID, not PIDs. You can only attach ONE task at a time. +If you have several tasks to attach, you have to do it one after another: + +# /bin/echo PID1 > tasks +# /bin/echo PID2 > tasks + ... +# /bin/echo PIDn > tasks + +3. Kernel API +============= + +3.1 Overview +------------ + +Each kernel subsystem that wants to hook into the generic cgroup +system needs to create a cgroup_subsys object. This contains +various methods, which are callbacks from the cgroup system, along +with a subsystem id which will be assigned by the cgroup system. + +Other fields in the cgroup_subsys object include: + +- subsys_id: a unique array index for the subsystem, indicating which + entry in cgroup->subsys[] this subsystem should be + managing. Initialized by cgroup_register_subsys(); prior to this + it should be initialized to -1 + +- hierarchy: an index indicating which hierarchy, if any, this + subsystem is currently attached to. If this is -1, then the + subsystem is not attached to any hierarchy, and all tasks should be + considered to be members of the subsystem's top_cgroup. It should + be initialized to -1. + +- name: should be initialized to a unique subsystem name prior to + calling cgroup_register_subsystem. Should be no longer than + MAX_CGROUP_TYPE_NAMELEN + +Each cgroup object created by the system has an array of pointers, +indexed by subsystem id; this pointer is entirely managed by the +subsystem; the generic cgroup code will never touch this pointer. + +3.2 Synchronization +------------------- + +There is a global mutex, cgroup_mutex, used by the cgroup +system. This should be taken by anything that wants to modify a +cgroup. It may also be taken to prevent cgroups from being +modified, but more specific locks may be more appropriate in that +situation. + +See kernel/cgroup.c for more details. + +Subsystems can take/release the cgroup_mutex via the functions +cgroup_lock()/cgroup_unlock(), and can +take/release the callback_mutex via the functions +cgroup_lock()/cgroup_unlock(). + +Accessing a task's cgroup pointer may be done in the following ways: +- while holding cgroup_mutex +- while holding the task's alloc_lock (via task_lock()) +- inside an rcu_read_lock() section via rcu_dereference() + +3.3 Subsystem API +-------------------------- + +Each subsystem should: + +- add an entry in linux/cgroup_subsys.h +- define a cgroup_subsys object called <name>_subsys + +Each subsystem may export the following methods. The only mandatory +methods are create/destroy. Any others that are null are presumed to +be successful no-ops. + +struct cgroup_subsys_state *create(struct cgroup *cont) +LL=cgroup_mutex + +Called to create a subsystem state object for a cgroup. The +subsystem should allocate its subsystem state object for the passed +cgroup, returning a pointer to the new object on success or a +negative error code. On success, the subsystem pointer should point to +a structure of type cgroup_subsys_state (typically embedded in a +larger subsystem-specific object), which will be initialized by the +cgroup system. Note that this will be called at initialization to +create the root subsystem state for this subsystem; this case can be +identified by the passed cgroup object having a NULL parent (since +it's the root of the hierarchy) and may be an appropriate place for +initialization code. + +void destroy(struct cgroup *cont) +LL=cgroup_mutex + +The cgroup system is about to destroy the passed cgroup; the +subsystem should do any necessary cleanup + +int can_attach(struct cgroup_subsys *ss, struct cgroup *cont, + struct task_struct *task) +LL=cgroup_mutex + +Called prior to moving a task into a cgroup; if the subsystem +returns an error, this will abort the attach operation. If a NULL +task is passed, then a successful result indicates that *any* +unspecified task can be moved into the cgroup. Note that this isn't +called on a fork. If this method returns 0 (success) then this should +remain valid while the caller holds cgroup_mutex. + +void attach(struct cgroup_subsys *ss, struct cgroup *cont, + struct cgroup *old_cont, struct task_struct *task) +LL=cgroup_mutex + + +Called after the task has been attached to the cgroup, to allow any +post-attachment activity that requires memory allocations or blocking. + +void fork(struct cgroup_subsy *ss, struct task_struct *task) +LL=callback_mutex, maybe read_lock(tasklist_lock) + +Called when a task is forked into a cgroup. Also called during +registration for all existing tasks. + +void exit(struct cgroup_subsys *ss, struct task_struct *task) +LL=callback_mutex + +Called during task exit + +int populate(struct cgroup_subsys *ss, struct cgroup *cont) +LL=none + +Called after creation of a cgroup to allow a subsystem to populate +the cgroup directory with file entries. The subsystem should make +calls to cgroup_add_file() with objects of type cftype (see +include/linux/cgroup.h for details). Note that although this +method can return an error code, the error code is currently not +always handled well. + +void post_clone(struct cgroup_subsys *ss, struct cgroup *cont) + +Called at the end of cgroup_clone() to do any paramater +initialization which might be required before a task could attach. For +example in cpusets, no task may attach before 'cpus' and 'mems' are set +up. + +void bind(struct cgroup_subsys *ss, struct cgroup *root) +LL=callback_mutex + +Called when a cgroup subsystem is rebound to a different hierarchy +and root cgroup. Currently this will only involve movement between +the default hierarchy (which never has sub-cgroups) and a hierarchy +that is being created/destroyed (and hence has no sub-cgroups). + +4. Questions +============ + +Q: what's up with this '/bin/echo' ? +A: bash's builtin 'echo' command does not check calls to write() against + errors. If you use it in the cgroup file system, you won't be + able to tell whether a command succeeded or failed. + +Q: When I attach processes, only the first of the line gets really attached ! +A: We can only return one error code per call to write(). So you should also + put only ONE pid. + diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt index b6d24c22274..a741f658a3c 100644 --- a/Documentation/cpu-hotplug.txt +++ b/Documentation/cpu-hotplug.txt @@ -220,7 +220,9 @@ A: The following happen, listed in no particular order :-) CPU_DOWN_PREPARE or CPU_DOWN_PREPARE_FROZEN, depending on whether or not the CPU is being offlined while tasks are frozen due to a suspend operation in progress -- All process is migrated away from this outgoing CPU to a new CPU +- All processes are migrated away from this outgoing CPU to new CPUs. + The new CPU is chosen from each process' current cpuset, which may be + a subset of all online CPUs. - All interrupts targeted to this CPU is migrated to a new CPU - timers/bottom half/task lets are also migrated to a new CPU - Once all services are migrated, kernel calls an arch specific routine diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index ec9de6917f0..141bef1c859 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt @@ -7,6 +7,7 @@ Written by Simon.Derr@bull.net Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. Modified by Paul Jackson <pj@sgi.com> Modified by Christoph Lameter <clameter@sgi.com> +Modified by Paul Menage <menage@google.com> CONTENTS: ========= @@ -16,9 +17,9 @@ CONTENTS: 1.2 Why are cpusets needed ? 1.3 How are cpusets implemented ? 1.4 What are exclusive cpusets ? - 1.5 What does notify_on_release do ? - 1.6 What is memory_pressure ? - 1.7 What is memory spread ? + 1.5 What is memory_pressure ? + 1.6 What is memory spread ? + 1.7 What is sched_load_balance ? 1.8 How do I use cpusets ? 2. Usage Examples and Syntax 2.1 Basic Usage @@ -44,18 +45,19 @@ hierarchy visible in a virtual file system. These are the essential hooks, beyond what is already present, required to manage dynamic job placement on large systems. -Each task has a pointer to a cpuset. Multiple tasks may reference -the same cpuset. Requests by a task, using the sched_setaffinity(2) -system call to include CPUs in its CPU affinity mask, and using the -mbind(2) and set_mempolicy(2) system calls to include Memory Nodes -in its memory policy, are both filtered through that tasks cpuset, -filtering out any CPUs or Memory Nodes not in that cpuset. The -scheduler will not schedule a task on a CPU that is not allowed in -its cpus_allowed vector, and the kernel page allocator will not -allocate a page on a node that is not allowed in the requesting tasks -mems_allowed vector. - -User level code may create and destroy cpusets by name in the cpuset +Cpusets use the generic cgroup subsystem described in +Documentation/cgroup.txt. + +Requests by a task, using the sched_setaffinity(2) system call to +include CPUs in its CPU affinity mask, and using the mbind(2) and +set_mempolicy(2) system calls to include Memory Nodes in its memory +policy, are both filtered through that tasks cpuset, filtering out any +CPUs or Memory Nodes not in that cpuset. The scheduler will not +schedule a task on a CPU that is not allowed in its cpus_allowed +vector, and the kernel page allocator will not allocate a page on a +node that is not allowed in the requesting tasks mems_allowed vector. + +User level code may create and destroy cpusets by name in the cgroup virtual file system, manage the attributes and permissions of these cpusets and which CPUs and Memory Nodes are assigned to each cpuset, specify and query to which cpuset a task is assigned, and list the @@ -115,7 +117,7 @@ Cpusets extends these two mechanisms as follows: - Cpusets are sets of allowed CPUs and Memory Nodes, known to the kernel. - Each task in the system is attached to a cpuset, via a pointer - in the task structure to a reference counted cpuset structure. + in the task structure to a reference counted cgroup structure. - Calls to sched_setaffinity are filtered to just those CPUs allowed in that tasks cpuset. - Calls to mbind and set_mempolicy are filtered to just @@ -145,15 +147,10 @@ into the rest of the kernel, none in performance critical paths: - in page_alloc.c, to restrict memory to allowed nodes. - in vmscan.c, to restrict page recovery to the current cpuset. -In addition a new file system, of type "cpuset" may be mounted, -typically at /dev/cpuset, to enable browsing and modifying the cpusets -presently known to the kernel. No new system calls are added for -cpusets - all support for querying and modifying cpusets is via -this cpuset file system. - -Each task under /proc has an added file named 'cpuset', displaying -the cpuset name, as the path relative to the root of the cpuset file -system. +You should mount the "cgroup" filesystem type in order to enable +browsing and modifying the cpusets presently known to the kernel. No +new system calls are added for cpusets - all support for querying and +modifying cpusets is via this cpuset file system. The /proc/<pid>/status file for each task has two added lines, displaying the tasks cpus_allowed (on which CPUs it may be scheduled) @@ -163,16 +160,15 @@ in the format seen in the following example: Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff Mems_allowed: ffffffff,ffffffff -Each cpuset is represented by a directory in the cpuset file system -containing the following files describing that cpuset: +Each cpuset is represented by a directory in the cgroup file system +containing (on top of the standard cgroup files) the following +files describing that cpuset: - cpus: list of CPUs in that cpuset - mems: list of Memory Nodes in that cpuset - memory_migrate flag: if set, move pages to cpusets nodes - cpu_exclusive flag: is cpu placement exclusive? - mem_exclusive flag: is memory placement exclusive? - - tasks: list of tasks (by pid) attached to that cpuset - - notify_on_release flag: run /sbin/cpuset_release_agent on exit? - memory_pressure: measure of how much paging pressure in cpuset In addition, the root cpuset only has the following file: @@ -237,21 +233,7 @@ such as requests from interrupt handlers, is allowed to be taken outside even a mem_exclusive cpuset. -1.5 What does notify_on_release do ? ------------------------------------- - -If the notify_on_release flag is enabled (1) in a cpuset, then whenever -the last task in the cpuset leaves (exits or attaches to some other -cpuset) and the last child cpuset of that cpuset is removed, then -the kernel runs the command /sbin/cpuset_release_agent, supplying the -pathname (relative to the mount point of the cpuset file system) of the -abandoned cpuset. This enables automatic removal of abandoned cpusets. -The default value of notify_on_release in the root cpuset at system -boot is disabled (0). The default value of other cpusets at creation -is the current value of their parents notify_on_release setting. - - -1.6 What is memory_pressure ? +1.5 What is memory_pressure ? ----------------------------- The memory_pressure of a cpuset provides a simple per-cpuset metric of the rate that the tasks in a cpuset are attempting to free up in @@ -308,7 +290,7 @@ the tasks in the cpuset, in units of reclaims attempted per second, times 1000. -1.7 What is memory spread ? +1.6 What is memory spread ? --------------------------- There are two boolean flag files per cpuset that control where the kernel allocates pages for the file system buffers and related in @@ -378,6 +360,142 @@ policy, especially for jobs that might have one thread reading in the data set, the memory allocation across the nodes in the jobs cpuset can become very uneven. +1.7 What is sched_load_balance ? +-------------------------------- + +The kernel scheduler (kernel/sched.c) automatically load balances +tasks. If one CPU is underutilized, kernel code running on that +CPU will look for tasks on other more overloaded CPUs and move those +tasks to itself, within the constraints of such placement mechanisms +as cpusets and sched_setaffinity. + +The algorithmic cost of load balancing and its impact on key shared +kernel data structures such as the task list increases more than +linearly with the number of CPUs being balanced. So the scheduler +has support to partition the systems CPUs into a number of sched +domains such that it only load balances within each sched domain. +Each sched domain covers some subset of the CPUs in the system; +no two sched domains overlap; some CPUs might not be in any sched +domain and hence won't be load balanced. + +Put simply, it costs less to balance between two smaller sched domains +than one big one, but doing so means that overloads in one of the +two domains won't be load balanced to the other one. + +By default, there is one sched domain covering all CPUs, except those +marked isolated using the kernel boot time "isolcpus=" argument. + +This default load balancing across all CPUs is not well suited for +the following two situations: + 1) On large systems, load balancing across many CPUs is expensive. + If the system is managed using cpusets to place independent jobs + on separate sets of CPUs, full load balancing is unnecessary. + 2) Systems supporting realtime on some CPUs need to minimize + system overhead on those CPUs, including avoiding task load + balancing if that is not needed. + +When the per-cpuset flag "sched_load_balance" is enabled (the default +setting), it requests that all the CPUs in that cpusets allowed 'cpus' +be contained in a single sched domain, ensuring that load balancing +can move a task (not otherwised pinned, as by sched_setaffinity) +from any CPU in that cpuset to any other. + +When the per-cpuset flag "sched_load_balance" is disabled, then the +scheduler will avoid load balancing across the CPUs in that cpuset, +--except-- in so far as is necessary because some overlapping cpuset +has "sched_load_balance" enabled. + +So, for example, if the top cpuset has the flag "sched_load_balance" +enabled, then the scheduler will have one sched domain covering all +CPUs, and the setting of the "sched_load_balance" flag in any other +cpusets won't matter, as we're already fully load balancing. + +Therefore in the above two situations, the top cpuset flag +"sched_load_balance" should be disabled, and only some of the smaller, +child cpusets have this flag enabled. + +When doing this, you don't usually want to leave any unpinned tasks in +the top cpuset that might use non-trivial amounts of CPU, as such tasks +may be artificially constrained to some subset of CPUs, depending on +the particulars of this flag setting in descendent cpusets. Even if +such a task could use spare CPU cycles in some other CPUs, the kernel +scheduler might not consider the possibility of load balancing that +task to that underused CPU. + +Of course, tasks pinned to a particular CPU can be left in a cpuset +that disables "sched_load_balance" as those tasks aren't going anywhere +else anyway. + +There is an impedance mismatch here, between cpusets and sched domains. +Cpusets are hierarchical and nest. Sched domains are flat; they don't +overlap and each CPU is in at most one sched domain. + +It is necessary for sched domains to be flat because load balancing +across partially overlapping sets of CPUs would risk unstable dynamics +that would be beyond our understanding. So if each of two partially +overlapping cpusets enables the flag 'sched_load_balance', then we +form a single sched domain that is a superset of both. We won't move +a task to a CPU outside it cpuset, but the scheduler load balancing +code might waste some compute cycles considering that possibility. + +This mismatch is why there is not a simple one-to-one relation +between which cpusets have the flag "sched_load_balance" enabled, +and the sched domain configuration. If a cpuset enables the flag, it +will get balancing across all its CPUs, but if it disables the flag, +it will only be assured of no load balancing if no other overlapping +cpuset enables the flag. + +If two cpusets have partially overlapping 'cpus' allowed, and only +one of them has this flag enabled, then the other may find its +tasks only partially load balanced, just on the overlapping CPUs. +This is just the general case of the top_cpuset example given a few +paragraphs above. In the general case, as in the top cpuset case, +don't leave tasks that might use non-trivial amounts of CPU in +such partially load balanced cpusets, as they may be artificially +constrained to some subset of the CPUs allowed to them, for lack of +load balancing to the other CPUs. + +1.7.1 sched_load_balance implementation details. +------------------------------------------------ + +The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary +to most cpuset flags.) When enabled for a cpuset, the kernel will +ensure that it can load balance across all the CPUs in that cpuset +(makes sure that all the CPUs in the cpus_allowed of that cpuset are +in the same sched domain.) + +If two overlapping cpusets both have 'sched_load_balance' enabled, +then they will be (must be) both in the same sched domain. + +If, as is the default, the top cpuset has 'sched_load_balance' enabled, +then by the above that means there is a single sched domain covering +the whole system, regardless of any other cpuset settings. + +The kernel commits to user space that it will avoid load balancing +where it can. It will pick as fine a granularity partition of sched +domains as it can while still providing load balancing for any set +of CPUs allowed to a cpuset having 'sched_load_balance' enabled. + +The internal kernel cpuset to scheduler interface passes from the +cpuset code to the scheduler code a partition of the load balanced +CPUs in the system. This partition is a set of subsets (represented +as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all +the CPUs that must be load balanced. + +Whenever the 'sched_load_balance' flag changes, or CPUs come or go +from a cpuset with this flag enabled, or a cpuset with this flag +enabled is removed, the cpuset code builds a new such partition and +passes it to the scheduler sched domain setup code, to have the sched +domains rebuilt as necessary. + +This partition exactly defines what sched domains the scheduler should +setup - one sched domain for each element (cpumask_t) in the partition. + +The scheduler remembers the currently active sched domain partitions. +When the scheduler routine partition_sched_domains() is invoked from +the cpuset code to update these sched domains, it compares the new +partition requested with the current, and updates its sched domains, +removing the old and adding the new, for each change. 1.8 How do I use cpusets ? -------------------------- @@ -469,7 +587,7 @@ than stress the kernel. To start a new job that is to be contained within a cpuset, the steps are: 1) mkdir /dev/cpuset - 2) mount -t cpuset none /dev/cpuset + 2) mount -t cgroup -ocpuset cpuset /dev/cpuset 3) Create the new cpuset by doing mkdir's and write's (or echo's) in the /dev/cpuset virtual file system. 4) Start a task that will be the "founding father" of the new job. @@ -481,7 +599,7 @@ For example, the following sequence of commands will setup a cpuset named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, and then start a subshell 'sh' in that cpuset: - mount -t cpuset none /dev/cpuset + mount -t cgroup -ocpuset cpuset /dev/cpuset cd /dev/cpuset mkdir Charlie cd Charlie @@ -513,7 +631,7 @@ Creating, modifying, using the cpusets can be done through the cpuset virtual filesystem. To mount it, type: -# mount -t cpuset none /dev/cpuset +# mount -t cgroup -o cpuset cpuset /dev/cpuset Then under /dev/cpuset you can find a tree that corresponds to the tree of the cpusets in the system. For instance, /dev/cpuset @@ -556,6 +674,18 @@ To remove a cpuset, just use rmdir: This will fail if the cpuset is in use (has cpusets inside, or has processes attached). +Note that for legacy reasons, the "cpuset" filesystem exists as a +wrapper around the cgroup filesystem. + +The command + +mount -t cpuset X /dev/cpuset + +is equivalent to + +mount -t cgroup -ocpuset X /dev/cpuset +echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent + 2.2 Adding/removing cpus ------------------------ diff --git a/Documentation/input/input-programming.txt b/Documentation/input/input-programming.txt index d9d523099bb..4d932dc6609 100644 --- a/Documentation/input/input-programming.txt +++ b/Documentation/input/input-programming.txt @@ -42,8 +42,8 @@ static int __init button_init(void) goto err_free_irq; } - button_dev->evbit[0] = BIT(EV_KEY); - button_dev->keybit[LONG(BTN_0)] = BIT(BTN_0); + button_dev->evbit[0] = BIT_MASK(EV_KEY); + button_dev->keybit[BIT_WORD(BTN_0)] = BIT_MASK(BTN_0); error = input_register_device(button_dev); if (error) { @@ -217,14 +217,15 @@ If you don't need absfuzz and absflat, you can set them to zero, which mean that the thing is precise and always returns to exactly the center position (if it has any). -1.4 NBITS(), LONG(), BIT() +1.4 BITS_TO_LONGS(), BIT_WORD(), BIT_MASK() ~~~~~~~~~~~~~~~~~~~~~~~~~~ -These three macros from input.h help some bitfield computations: +These three macros from bitops.h help some bitfield computations: - NBITS(x) - returns the length of a bitfield array in longs for x bits - LONG(x) - returns the index in the array in longs for bit x - BIT(x) - returns the index in a long for bit x + BITS_TO_LONGS(x) - returns the length of a bitfield array in longs for + x bits + BIT_WORD(x) - returns the index in the array in longs for bit x + BIT_MASK(x) - returns the index in a long for bit x 1.5 The id* and name fields ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt index 1b37b28cc23..d0ac72cc19f 100644 --- a/Documentation/kdump/kdump.txt +++ b/Documentation/kdump/kdump.txt @@ -231,6 +231,32 @@ Dump-capture kernel config options (Arch Dependent, ia64) any space below the alignment point will be wasted. +Extended crashkernel syntax +=========================== + +While the "crashkernel=size[@offset]" syntax is sufficient for most +configurations, sometimes it's handy to have the reserved memory dependent +on the value of System RAM -- that's mostly for distributors that pre-setup +the kernel command line to avoid a unbootable system after some memory has +been removed from the machine. + +The syntax is: + + crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset] + range=start-[end] + +For example: + + crashkernel=512M-2G:64M,2G-:128M + +This would mean: + + 1) if the RAM is smaller than 512M, then don't reserve anything + (this is the "rescue" case) + 2) if the RAM size is between 512M and 2G, then reserve 64M + 3) if the RAM size is larger than 2G, then reserve 128M + + Boot into System Kernel ======================= diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 189df0bcab9..0a3fed44524 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -497,6 +497,13 @@ and is between 256 and 4096 characters. It is defined in the file [KNL] Reserve a chunk of physical memory to hold a kernel to switch to with kexec on panic. + crashkernel=range1:size1[,range2:size2,...][@offset] + [KNL] Same as above, but depends on the memory + in the running system. The syntax of range is + start-[end] where start and end are both + a memory unit (amount[KMG]). See also + Documentation/kdump/kdump.txt for a example. + cs4232= [HW,OSS] Format: <io>,<irq>,<dma>,<dma2>,<mpuio>,<mpuirq> diff --git a/Documentation/markers.txt b/Documentation/markers.txt new file mode 100644 index 00000000000..295a71bc301 --- /dev/null +++ b/Documentation/markers.txt @@ -0,0 +1,81 @@ + Using the Linux Kernel Markers + + Mathieu Desnoyers + + +This document introduces Linux Kernel Markers and their use. It provides +examples of how to insert markers in the kernel and connect probe functions to +them and provides some examples of probe functions. + + +* Purpose of markers + +A marker placed in code provides a hook to call a function (probe) that you can +provide at runtime. A marker can be "on" (a probe is connected to it) or "off" +(no probe is attached). When a marker is "off" it has no effect, except for +adding a tiny time penalty (checking a condition for a branch) and space +penalty (adding a few bytes for the function call at the end of the +instrumented function and adds a data structure in a separate section). When a +marker is "on", the function you provide is called each time the marker is +executed, in the execution context of the caller. When the function provided +ends its execution, it returns to the caller (continuing from the marker site). + +You can put markers at important locations in the code. Markers are +lightweight hooks that can pass an arbitrary number of parameters, +described in a printk-like format string, to the attached probe function. + +They can be used for tracing and performance accounting. + + +* Usage + +In order to use the macro trace_mark, you should include linux/marker.h. + +#include <linux/marker.h> + +And, + +trace_mark(subsystem_event, "%d %s", someint, somestring); +Where : +- subsystem_event is an identifier unique to your event + - subsystem is the name of your subsystem. + - event is the name of the event to mark. +- "%d %s" is the formatted string for the serializer. +- someint is an integer. +- somestring is a char pointer. + +Connecting a function (probe) to a marker is done by providing a probe (function +to call) for the specific marker through marker_probe_register() and can be +activated by calling marker_arm(). Marker deactivation can be done by calling +marker_disarm() as many times as marker_arm() has been called. Removing a probe +is done through marker_probe_unregister(); it will disarm the probe and make +sure there is no caller left using the probe when it returns. Probe removal is +preempt-safe because preemption is disabled around the probe call. See the +"Probe example" section below for a sample probe module. + +The marker mechanism supports inserting multiple instances of the same marker. +Markers can be put in inline functions, inlined static functions, and +unrolled loops as well as regular functions. + +The naming scheme "subsystem_event" is suggested here as a convention intended +to limit collisions. Marker names are global to the kernel: they are considered +as being the same whether they are in the core kernel image or in modules. +Conflicting format strings for markers with the same name will cause the markers +to be detected to have a different format string not to be armed and will output +a printk warning which identifies the inconsistency: + +"Format mismatch for probe probe_name (format), marker (format)" + + +* Probe / marker example + +See the example provided in samples/markers/src + +Compile them with your kernel. + +Run, as root : +modprobe marker-example (insmod order is not important) +modprobe probe-example +cat /proc/marker-example (returns an expected error) +rmmod marker-example probe-example +dmesg |