aboutsummaryrefslogtreecommitdiff
path: root/Documentation/controllers
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/controllers')
-rw-r--r--Documentation/controllers/devices.txt48
-rw-r--r--Documentation/controllers/resource_counter.txt181
2 files changed, 229 insertions, 0 deletions
diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt
new file mode 100644
index 00000000000..4dcea42432c
--- /dev/null
+++ b/Documentation/controllers/devices.txt
@@ -0,0 +1,48 @@
+Device Whitelist Controller
+
+1. Description:
+
+Implement a cgroup to track and enforce open and mknod restrictions
+on device files. A device cgroup associates a device access
+whitelist with each cgroup. A whitelist entry has 4 fields.
+'type' is a (all), c (char), or b (block). 'all' means it applies
+to all types and all major and minor numbers. Major and minor are
+either an integer or * for all. Access is a composition of r
+(read), w (write), and m (mknod).
+
+The root device cgroup starts with rwm to 'all'. A child device
+cgroup gets a copy of the parent. Administrators can then remove
+devices from the whitelist or add new entries. A child cgroup can
+never receive a device access which is denied its parent. However
+when a device access is removed from a parent it will not also be
+removed from the child(ren).
+
+2. User Interface
+
+An entry is added using devices.allow, and removed using
+devices.deny. For instance
+
+ echo 'c 1:3 mr' > /cgroups/1/devices.allow
+
+allows cgroup 1 to read and mknod the device usually known as
+/dev/null. Doing
+
+ echo a > /cgroups/1/devices.deny
+
+will remove the default 'a *:* mrw' entry.
+
+3. Security
+
+Any task can move itself between cgroups. This clearly won't
+suffice, but we can decide the best way to adequately restrict
+movement as people get some experience with this. We may just want
+to require CAP_SYS_ADMIN, which at least is a separate bit from
+CAP_MKNOD. We may want to just refuse moving to a cgroup which
+isn't a descendent of the current one. Or we may want to use
+CAP_MAC_ADMIN, since we really are trying to lock down root.
+
+CAP_SYS_ADMIN is needed to modify the whitelist or move another
+task to a new cgroup. (Again we'll probably want to change that).
+
+A cgroup may not be granted more permissions than the cgroup's
+parent has.
diff --git a/Documentation/controllers/resource_counter.txt b/Documentation/controllers/resource_counter.txt
new file mode 100644
index 00000000000..f196ac1d7d2
--- /dev/null
+++ b/Documentation/controllers/resource_counter.txt
@@ -0,0 +1,181 @@
+
+ The Resource Counter
+
+The resource counter, declared at include/linux/res_counter.h,
+is supposed to facilitate the resource management by controllers
+by providing common stuff for accounting.
+
+This "stuff" includes the res_counter structure and routines
+to work with it.
+
+
+
+1. Crucial parts of the res_counter structure
+
+ a. unsigned long long usage
+
+ The usage value shows the amount of a resource that is consumed
+ by a group at a given time. The units of measurement should be
+ determined by the controller that uses this counter. E.g. it can
+ be bytes, items or any other unit the controller operates on.
+
+ b. unsigned long long max_usage
+
+ The maximal value of the usage over time.
+
+ This value is useful when gathering statistical information about
+ the particular group, as it shows the actual resource requirements
+ for a particular group, not just some usage snapshot.
+
+ c. unsigned long long limit
+
+ The maximal allowed amount of resource to consume by the group. In
+ case the group requests for more resources, so that the usage value
+ would exceed the limit, the resource allocation is rejected (see
+ the next section).
+
+ d. unsigned long long failcnt
+
+ The failcnt stands for "failures counter". This is the number of
+ resource allocation attempts that failed.
+
+ c. spinlock_t lock
+
+ Protects changes of the above values.
+
+
+
+2. Basic accounting routines
+
+ a. void res_counter_init(struct res_counter *rc)
+
+ Initializes the resource counter. As usual, should be the first
+ routine called for a new counter.
+
+ b. int res_counter_charge[_locked]
+ (struct res_counter *rc, unsigned long val)
+
+ When a resource is about to be allocated it has to be accounted
+ with the appropriate resource counter (controller should determine
+ which one to use on its own). This operation is called "charging".
+
+ This is not very important which operation - resource allocation
+ or charging - is performed first, but
+ * if the allocation is performed first, this may create a
+ temporary resource over-usage by the time resource counter is
+ charged;
+ * if the charging is performed first, then it should be uncharged
+ on error path (if the one is called).
+
+ c. void res_counter_uncharge[_locked]
+ (struct res_counter *rc, unsigned long val)
+
+ When a resource is released (freed) it should be de-accounted
+ from the resource counter it was accounted to. This is called
+ "uncharging".
+
+ The _locked routines imply that the res_counter->lock is taken.
+
+
+ 2.1 Other accounting routines
+
+ There are more routines that may help you with common needs, like
+ checking whether the limit is reached or resetting the max_usage
+ value. They are all declared in include/linux/res_counter.h.
+
+
+
+3. Analyzing the resource counter registrations
+
+ a. If the failcnt value constantly grows, this means that the counter's
+ limit is too tight. Either the group is misbehaving and consumes too
+ many resources, or the configuration is not suitable for the group
+ and the limit should be increased.
+
+ b. The max_usage value can be used to quickly tune the group. One may
+ set the limits to maximal values and either load the container with
+ a common pattern or leave one for a while. After this the max_usage
+ value shows the amount of memory the container would require during
+ its common activity.
+
+ Setting the limit a bit above this value gives a pretty good
+ configuration that works in most of the cases.
+
+ c. If the max_usage is much less than the limit, but the failcnt value
+ is growing, then the group tries to allocate a big chunk of resource
+ at once.
+
+ d. If the max_usage is much less than the limit, but the failcnt value
+ is 0, then this group is given too high limit, that it does not
+ require. It is better to lower the limit a bit leaving more resource
+ for other groups.
+
+
+
+4. Communication with the control groups subsystem (cgroups)
+
+All the resource controllers that are using cgroups and resource counters
+should provide files (in the cgroup filesystem) to work with the resource
+counter fields. They are recommended to adhere to the following rules:
+
+ a. File names
+
+ Field name File name
+ ---------------------------------------------------
+ usage usage_in_<unit_of_measurement>
+ max_usage max_usage_in_<unit_of_measurement>
+ limit limit_in_<unit_of_measurement>
+ failcnt failcnt
+ lock no file :)
+
+ b. Reading from file should show the corresponding field value in the
+ appropriate format.
+
+ c. Writing to file
+
+ Field Expected behavior
+ ----------------------------------
+ usage prohibited
+ max_usage reset to usage
+ limit set the limit
+ failcnt reset to zero
+
+
+
+5. Usage example
+
+ a. Declare a task group (take a look at cgroups subsystem for this) and
+ fold a res_counter into it
+
+ struct my_group {
+ struct res_counter res;
+
+ <other fields>
+ }
+
+ b. Put hooks in resource allocation/release paths
+
+ int alloc_something(...)
+ {
+ if (res_counter_charge(res_counter_ptr, amount) < 0)
+ return -ENOMEM;
+
+ <allocate the resource and return to the caller>
+ }
+
+ void release_something(...)
+ {
+ res_counter_uncharge(res_counter_ptr, amount);
+
+ <release the resource>
+ }
+
+ In order to keep the usage value self-consistent, both the
+ "res_counter_ptr" and the "amount" in release_something() should be
+ the same as they were in the alloc_something() when the releasing
+ resource was allocated.
+
+ c. Provide the way to read res_counter values and set them (the cgroups
+ still can help with it).
+
+ c. Compile and run :)