aboutsummaryrefslogtreecommitdiff
path: root/kernel/fork.c
AgeCommit message (Collapse)Author
2007-10-19pid namespaces: move alloc_pid() to copy_process()Sukadev Bhattiprolu
Move alloc_pid() into copy_process(). This will keep all pid and pid namespace code together and simplify error handling when we support multiple pid namespaces. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: round up the APIPavel Emelianov
The set of functions process_session, task_session, process_group and task_pgrp is confusing, as the names can be mixed with each other when looking at the code for a long time. The proposals are to * equip the functions that return the integer with _nr suffix to represent that fact, * and to make all functions work with task (not process) by making the common prefix of the same name. For monotony the routines signal_session() and set_signal_session() are replaced with task_session_nr() and set_task_session(), especially since they are only used with the explicit task->signal dereference. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Task Control Groups: make cpusets a client of cgroupsPaul Menage
Remove the filesystem support logic from the cpusets system and makes cpusets a cgroup subsystem The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get passed through to the cgroup filesystem with the appropriate options to emulate the old cpuset filesystem behaviour. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Task Control Groups: shared cgroup subsystem group arraysPaul Menage
Replace the struct css_set embedded in task_struct with a pointer; all tasks that have the same set of memberships across all hierarchies will share a css_set object, and will be linked via their css_sets field to the "tasks" list_head in the css_set. Assuming that many tasks share the same cgroup assignments, this reduces overall space usage and keeps the size of the task_struct down (three pointers added to task_struct compared to a non-cgroups kernel, no matter how many subsystems are registered). [akpm@linux-foundation.org: fix a printk] [akpm@linux-foundation.org: build fix] Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Task Control Groups: add fork()/exit() hooksPaul Menage
This adds the necessary hooks to the fork() and exit() paths to ensure that new children inherit their parent's cgroup assignments, and that exiting processes release reference counts on their cgroups. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18Add scaled time to taskstats based process accountingMichael Neuling
This adds items to the taststats struct to account for user and system time based on scaling the CPU frequency and instruction issue rates. Adds account_(user|system)_time_scaled callbacks which architectures can use to account for time using this mechanism. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18whitespace fixes: forkDaniel Walker
Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18Remove struct task_struct::io_waitAlexey Dobriyan
Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b during 2.6.9 development. Commit introduced io_wait field which remained write-only than and still remains write-only. Also garbage collect macros which "use" io_wait. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18freezer: prevent new tasks from inheriting TIF_FREEZE setRafael J. Wysocki
Tasks should go to the refrigerator only if explicitly requested to do that by the freezer and not as a result of inheriting the TIF_FREEZE flag set from the parent. Make it happen. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17ifdef struct task_struct::securityAlexey Dobriyan
For those who don't care about CONFIG_SECURITY. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17Shrink task_struct if CONFIG_FUTEX=nAlexey Dobriyan
robust_list, compat_robust_list, pi_state_list, pi_state_cache are really used if futexes are on. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17Slab API: remove useless ctor parameter and reorder parametersChristoph Lameter
Slab constructors currently have a flags parameter that is never used. And the order of the arguments is opposite to other slab functions. The object pointer is placed before the kmem_cache pointer. Convert ctor(void *object, struct kmem_cache *s, unsigned long flags) to ctor(struct kmem_cache *s, void *object) throughout the kernel [akpm@linux-foundation.org: coupla fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17mm: dirty balancing for tasksPeter Zijlstra
Based on ideas of Andrew: http://marc.info/?l=linux-kernel&m=102912915020543&w=2 Scale the bdi dirty limit inversly with the tasks dirty rate. This makes heavy writers have a lower dirty limit than the occasional writer. Andrea proposed something similar: http://lwn.net/Articles/152277/ The main disadvantage to his patch is that he uses an unrelated quantity to measure time, which leaves him with a workload dependant tunable. Other than that the two approaches appear quite similar. [akpm@linux-foundation.org: fix warning] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-15sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fieldsLaurent Vivier
like for cpustat, introduce the "gtime" (guest time of the task) and "cgtime" (guest time of the task children) fields for the tasks. Modify signal_struct and task_struct. Modify /proc/<pid>/stat to display these new fields. Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Acked-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-10[NET]: Add network namespace clone & unshare support.Eric W. Biederman
This patch allows you to create a new network namespace using sys_clone, or sys_unshare. As the network namespace is still experimental and under development clone and unshare support is only made available when CONFIG_NET_NS is selected at compile time. As this patch introduces network namespace support into code paths that exist when the CONFIG_NET is not selected there are a few additions made to net_namespace.h to allow a few more functions to be used when the networking stack is not compiled in. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-09-20signalfd simplificationDavide Libenzi
This simplifies signalfd code, by avoiding it to remain attached to the sighand during its lifetime. In this way, the signalfd remain attached to the sighand only during poll(2) (and select and epoll) and read(2). This also allows to remove all the custom "tsk == current" checks in kernel/signal.c, since dequeue_signal() will only be called by "current". I think this is also what Ben was suggesting time ago. The external effect of this, is that a thread can extract only its own private signals and the group ones. I think this is an acceptable behaviour, in that those are the signals the thread would be able to fetch w/out signalfd. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-20mm: Remove slab destructors from kmem_cache_create().Paul Mundt
Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-19lguest: the host codeRusty Russell
This is the code for the "lg.ko" module, which allows lguest guests to be launched. [akpm@linux-foundation.org: update for futex-new-private-futexes] [akpm@linux-foundation.org: build fix] [jmorris@namei.org: lguest: use hrtimers] [akpm@linux-foundation.org: x86_64 build fix] Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19lguest: export symbols for lguest as a moduleRusty Russell
lguest does some fairly lowlevel things to support a host, which normal modules don't need: math_state_restore: When the guest triggers a Device Not Available fault, we need to be able to restore the FPU __put_task_struct: We need to hold a reference to another task for inter-guest I/O, and put_task_struct() is an inline function which calls __put_task_struct. access_process_vm: We need to access another task for inter-guest I/O. map_vm_area & __get_vm_area: We need to map the switcher shim (ie. monitor) at 0xFFC01000. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19coredump masking: add an interface for core dump filterKawai, Hidehiro
This patch adds an interface to set/reset flags which determines each memory segment should be dumped or not when a core file is generated. /proc/<pid>/coredump_filter file is provided to access the flags. You can change the flag status for a particular process by writing to or reading from the file. The flag status is inherited to the child process when it is created. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17Freezer: make kernel threads nonfreezable by defaultRafael J. Wysocki
Currently, the freezer treats all tasks as freezable, except for the kernel threads that explicitly set the PF_NOFREEZE flag for themselves. This approach is problematic, since it requires every kernel thread to either set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't care for the freezing of tasks at all. It seems better to only require the kernel threads that want to or need to be frozen to use some freezer-related code and to remove any freezer-related code from the other (nonfreezable) kernel threads, which is done in this patch. The patch causes all kernel threads to be nonfreezable by default (ie. to have PF_NOFREEZE set by default) and introduces the set_freezable() function that should be called by the freezable kernel threads in order to unset PF_NOFREEZE. It also makes all of the currently freezable kernel threads call set_freezable(), so it shouldn't cause any (intentional) change of behaviour to appear. Additionally, it updates documentation to describe the freezing of tasks more accurately. [akpm@linux-foundation.org: build fixes] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16user namespace: add unshareSerge E. Hallyn
This patch enables the unshare of user namespaces. It adds a new clone flag CLONE_NEWUSER and implements copy_user_ns() which resets the current user_struct and adds a new root user (uid == 0) For now, unsharing the user namespace allows a process to reset its user_struct accounting and uid 0 in the new user namespace should be contained using appropriate means, for instance selinux The plan, when the full support is complete (all uid checks covered), is to keep the original user's rights in the original namespace, and let a process become uid 0 in the new namespace, with full capabilities to the new namespace. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Andrew Morgan <agm@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16user namespace: add the frameworkCedric Le Goater
Basically, it will allow a process to unshare its user_struct table, resetting at the same time its own user_struct and all the associated accounting. A new root user (uid == 0) is added to the user namespace upon creation. Such root users have full privileges and it seems that theses privileges should be controlled through some means (process capabilities ?) The unshare is not included in this patch. Changes since [try #4]: - Updated get_user_ns and put_user_ns to accept NULL, and get_user_ns to return the namespace. Changes since [try #3]: - moved struct user_namespace to files user_namespace.{c,h} Changes since [try #2]: - removed struct user_namespace* argument from find_user() Changes since [try #1]: - removed struct user_namespace* argument from find_user() - added a root_user per user namespace Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Andrew Morgan <agm@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16Audit: add TTY input auditingMiloslav Trmac
Add TTY input auditing, used to audit system administrator's actions. This is required by various security standards such as DCID 6/3 and PCI to provide non-repudiation of administrator's actions and to allow a review of past actions if the administrator seems to overstep their duties or if the system becomes misconfigured for unknown reasons. These requirements do not make it necessary to audit TTY output as well. Compared to an user-space keylogger, this approach records TTY input using the audit subsystem, correlated with other audit events, and it is completely transparent to the user-space application (e.g. the console ioctls still work). TTY input auditing works on a higher level than auditing all system calls within the session, which would produce an overwhelming amount of mostly useless audit events. Add an "audit_tty" attribute, inherited across fork (). Data read from TTYs by process with the attribute is sent to the audit subsystem by the kernel. The audit netlink interface is extended to allow modifying the audit_tty attribute, and to allow sending explanatory audit events from user-space (for example, a shell might send an event containing the final command, after the interactive command-line editing and history expansion is performed, which might be difficult to decipher from the TTY input alone). Because the "audit_tty" attribute is inherited across fork (), it would be set e.g. for sshd restarted within an audited session. To prevent this, the audit_tty attribute is cleared when a process with no open TTY file descriptors (e.g. after daemon startup) opens a TTY. See https://www.redhat.com/archives/linux-audit/2007-June/msg00000.html for a more detailed rationale document for an older version of this patch. [akpm@linux-foundation.org: build fix] Signed-off-by: Miloslav Trmac <mitr@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Paul Fulghum <paulkf@microgate.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16Use boot based time for process start time and boot time in /procTomas Janousek
Commit 411187fb05cd11676b0979d9fbf3291db69dbce2 caused boot time to move and process start times to become invalid after suspend. Using boot based time for those restores the old behaviour and fixes the issue. [akpm@linux-foundation.org: little cleanup] Signed-off-by: Tomas Janousek <tjanouse@redhat.com> Cc: Tomas Smetana <tsmetana@redhat.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-09sched: update delay-accounting to use CFS's precise statsBalbir Singh
update delay-accounting to use CFS's precise stats. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-05-23freezer: fix vfork problemRafael J. Wysocki
Currently try_to_freeze_tasks() has to wait until all of the vforked processes exit and for this reason every user can make it fail. To fix this problem we can introduce the additional process flag PF_FREEZER_SKIP to be used by tasks that do not want to be counted as freezable by the freezer and want to have TIF_FREEZE set nevertheless. Then, this flag can be set by tasks using sys_vfork() before they call wait_for_completion(&vfork) and cleared after they have woken up. After clearing it, the tasks should call try_to_freeze() as soon as possible. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17Remove SLAB_CTOR_CONSTRUCTORChristoph Lameter
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: David Howells <dhowells@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@ucw.cz> Cc: David Chinner <dgc@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11signal/timer/event: signalfd coreDavide Libenzi
This patch series implements the new signalfd() system call. I took part of the original Linus code (and you know how badly it can be broken :), and I added even more breakage ;) Signals are fetched from the same signal queue used by the process, so signalfd will compete with standard kernel delivery in dequeue_signal(). If you want to reliably fetch signals on the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This seems to be working fine on my Dual Opteron machine. I made a quick test program for it: http://www.xmailserver.org/signafd-test.c The signalfd() system call implements signal delivery into a file descriptor receiver. The signalfd file descriptor if created with the following API: int signalfd(int ufd, const sigset_t *mask, size_t masksize); The "ufd" parameter allows to change an existing signalfd sigmask, w/out going to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new signalfd file. The "mask" allows to specify the signal mask of signals that we are interested in. The "masksize" parameter is the size of "mask". The signalfd fd supports the poll(2) and read(2) system calls. The poll(2) will return POLLIN when signals are available to be dequeued. As a direct consequence of supporting the Linux poll subsystem, the signalfd fd can use used together with epoll(2) too. The read(2) system call will return a "struct signalfd_siginfo" structure in the userspace supplied buffer. The return value is the number of bytes copied in the supplied buffer, or -1 in case of error. The read(2) call can also return 0, in case the sighand structure to which the signalfd was attached, has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will return -EAGAIN in case no signal is available. If the size of the buffer passed to read(2) is lower than sizeof(struct signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also return -ERESTARTSYS in case a signal hits the process. The format of the struct signalfd_siginfo is, and the valid fields depends of the (->code & __SI_MASK) value, in the same way a struct siginfo would: struct signalfd_siginfo { __u32 signo; /* si_signo */ __s32 err; /* si_errno */ __s32 code; /* si_code */ __u32 pid; /* si_pid */ __u32 uid; /* si_uid */ __s32 fd; /* si_fd */ __u32 tid; /* si_fd */ __u32 band; /* si_band */ __u32 overrun; /* si_overrun */ __u32 trapno; /* si_trapno */ __s32 status; /* si_status */ __s32 svint; /* si_int */ __u64 svptr; /* si_ptr */ __u64 utime; /* si_utime */ __u64 stime; /* si_stime */ __u64 addr; /* si_addr */ }; [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386] Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11Use task_pgrp() task_session() in copy_process()Sukadev Bhattiprolu
Use task_pgrp() and task_session() in copy_process(), and avoid find_pid() call when attaching the task to its process group and session. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: <containers@lists.osdl.org> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11Use struct pid parameter in copy_process()Sukadev Bhattiprolu
Modify copy_process() to take a struct pid * parameter instead of a pid_t. This simplifies the code a bit and also avoids having to call find_pid() to convert the pid_t to a struct pid. Changelog: - Fixed Badari Pulavarty's comments and passed in &init_struct_pid from fork_idle(). - Fixed Eric Biederman's comments and simplified this patch and used a new patch to remove the likely(pid) check. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: <containers@lists.osdl.org> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11attach_pid() with struct pid parameterSukadev Bhattiprolu
attach_pid() currently takes a pid_t and then uses find_pid() to find the corresponding struct pid. Sometimes we already have the struct pid. We can then skip find_pid() if attach_pid() were to take a struct pid parameter. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: <containers@lists.osdl.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11getrusage(): fill ru_inblock and ru_oublock fields if possibleEric Dumazet
If CONFIG_TASK_IO_ACCOUNTING is defined, we update io accounting counters for each task. This patch permits reporting of values using the well known getrusage() syscall, filling ru_inblock and ru_oublock instead of null values. As TASK_IO_ACCOUNTING currently counts bytes counts, we approximate blocks count doing : nr_blocks = nr_bytes / 512 Example of use : ---------------------- After patch is applied, /usr/bin/time command can now give a good approximation of IO that the process had to do. $ /usr/bin/time grep tototo /usr/include/* Command exited with non-zero status 1 0.00user 0.02system 0:02.11elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k 24288inputs+0outputs (0major+259minor)pagefaults 0swaps $ /usr/bin/time dd if=/dev/zero of=/tmp/testfile count=1000 1000+0 enregistrements lus 1000+0 enregistrements écrits 512000 octets (512 kB) copiés, 0,00326601 seconde, 157 MB/s 0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+3000outputs (0major+299minor)pagefaults 0swaps Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09rename thread_info to stackRoman Zippel
This finally renames the thread_info field in task structure to stack, so that the assumptions about this field are gone and archs have more freedom about placing the thread_info structure. Nonbroken archs which have a proper thread pointer can do the access to both current thread and task structure via a single pointer. It'll allow for a few more cleanups of the fork code, from which e.g. ia64 could benefit. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> [akpm@linux-foundation.org: build fix] Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Andi Kleen <ak@muc.de> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08header cleaning: don't include smp_lock.h when not usedRandy Dunlap
Remove includes of <linux/smp_lock.h> where it is not used/needed. Suggested by Al Viro. Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc, sparc64, and arm (all 59 defconfigs). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08Merge sys_clone()/sys_unshare() nsproxy and namespace handlingBadari Pulavarty
sys_clone() and sys_unshare() both makes copies of nsproxy and its associated namespaces. But they have different code paths. This patch merges all the nsproxy and its associated namespace copy/clone handling (as much as possible). Posted on container list earlier for feedback. - Create a new nsproxy and its associated namespaces and pass it back to caller to attach it to right process. - Changed all copy_*_ns() routines to return a new copy of namespace instead of attaching it to task->nsproxy. - Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines. - Removed unnessary !ns checks from copy_*_ns() and added BUG_ON() just incase. - Get rid of all individual unshare_*_ns() routines and make use of copy_*_ns() instead. [akpm@osdl.org: cleanups, warning fix] [clg@fr.ibm.com: remove dup_namespaces() declaration] [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval] [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n] Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <containers@lists.osdl.org> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07slab allocators: Remove SLAB_DEBUG_INITIAL flagChristoph Lameter
I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by SLAB. I think its purpose was to have a callback after an object has been freed to verify that the state is the constructor state again? The callback is performed before each freeing of an object. I would think that it is much easier to check the object state manually before the free. That also places the check near the code object manipulation of the object. Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was compiled with SLAB debugging on. If there would be code in a constructor handling SLAB_DEBUG_INITIAL then it would have to be conditional on SLAB_DEBUG otherwise it would just be dead code. But there is no such code in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real use of, difficult to understand and there are easier ways to accomplish the same effect (i.e. add debug code before kfree). There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be clear in fs inode caches. Remove the pointless checks (they would even be pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors. This is the last slab flag that SLUB did not support. Remove the check for unimplemented flags from SLUB. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-02[PATCH] x86: PARAVIRT: add hooks to intercept mm creation and destructionJeremy Fitzhardinge
Add hooks to allow a paravirt implementation to track the lifetime of an mm. Paravirtualization requires three hooks, but only two are needed in common code. They are: arch_dup_mmap, which is called when a new mmap is created at fork arch_exit_mmap, which is called when the last process reference to an mm is dropped, which typically happens on exit and exec. The third hook is activate_mm, which is called from the arch-specific activate_mm() macro/function, and so doesn't need stub versions for other architectures. It's called when an mm is first used. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: linux-arch@vger.kernel.org Cc: James Bottomley <James.Bottomley@SteelEye.com> Acked-by: Ingo Molnar <mingo@elte.hu>
2007-03-16[PATCH] initialise pi_lock if CONFIG_RT_MUTEXES=NZilvinas Valinskas
Fixes a bogus lockdep warning which causes lockdep to disable itself. Acked-by: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16[PATCH] hrtimers: namespace and enum cleanupThomas Gleixner
- hrtimers did not use the hrtimer_restart enum and relied on the implict int representation. Fix the prototypes and the functions using the enums. - Use seperate name spaces for the enumerations - Convert hrtimer_restart macro to inline function - Add comments No functional changes. [akpm@osdl.org: fix input driver] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Dmitry Torokhov <dtor@mail.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12[PATCH] tty: update the tty layer to work with struct pidEric W. Biederman
Of kernel subsystems that work with pids the tty layer is probably the largest consumer. But it has the nice virtue that the assiation with a session only lasts until the session leader exits. Which means that no reference counting is required. So using struct pid winds up being a simple optimization to avoid hash table lookups. In the long term the use of pid_nr also ensures that when we have multiple pid spaces mixed everything will work correctly. Signed-off-by: Eric W. Biederman <eric@maxwell.lnxi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] ifdef ->rchar, ->wchar, ->syscr, ->syscw from task_structAlexey Dobriyan
They are fat: 4x8 bytes in task_struct. They are uncoditionally updated in every fork, read, write and sendfile. They are used only if you have some "extended acct fields feature". And please, please, please, read(2) knows about bytes, not characters, why it is called "rchar"? Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-01[PATCH] fork_idle() should be __cpuinit, not __devinitAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30Revert "[PATCH] namespaces: fix exit race by splitting exit"Linus Torvalds
This reverts commit 7a238fcba0629b6f2edbcd37458bae56fcf36be5 in preparation for a better and simpler fix proposed by Eric Biederman (and fixed up by Serge Hallyn) Acked-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30[PATCH] namespaces: fix exit race by splitting exitSerge E. Hallyn
Fix exit race by splitting the nsproxy putting into two pieces. First piece reduces the nsproxy refcount. If we dropped the last reference, then it puts the mnt_ns, and returns the nsproxy as a hint to the caller. Else it returns NULL. The second piece of exiting task namespaces sets tsk->nsproxy to NULL, and drops the references to other namespaces and frees the nsproxy only if an nsproxy was passed in. A little awkward and should probably be reworked, but hopefully it fixes the NFS oops. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Daniel Hokka Zakrisson <daniel@hozac.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-12-13[PATCH] Optimize D-cache alias handling on forkRalf Baechle
Virtually index, physically tagged cache architectures can get away without cache flushing when forking. This patch adds a new cache flushing function flush_cache_dup_mm(struct mm_struct *) which for the moment I've implemented to do the same thing on all architectures except on MIPS where it's a no-op. Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10[PATCH] fdtable: Remove the free_files fieldVadim Lobanov
An fdtable can either be embedded inside a files_struct or standalone (after being expanded). When an fdtable is being discarded after all RCU references to it have expired, we must either free it directly, in the standalone case, or free the files_struct it is contained within, in the embedded case. Currently the free_files field controls this behavior, but we can get rid of it entirely, as all the necessary information is already recorded. We can distinguish embedded and standalone fdtables using max_fds, and if it is embedded we can divine the relevant files_struct using container_of(). Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10[PATCH] fdtable: Make fdarray and fdsets equal in sizeVadim Lobanov
Currently, each fdtable supports three dynamically-sized arrays of data: the fdarray and two fdsets. The code allows the number of fds supported by the fdarray (fdtable->max_fds) to differ from the number of fds supported by each of the fdsets (fdtable->max_fdset). In practice, it is wasteful for these two sizes to differ: whenever we hit a limit on the smaller-capacity structure, we will reallocate the entire fdtable and all the dynamic arrays within it, so any delta in the memory used by the larger-capacity structure will never be touched at all. Rather than hogging this excess, we shouldn't even allocate it in the first place, and keep the capacities of the fdarray and the fdsets equal. This patch removes fdtable->max_fdset. As an added bonus, most of the supporting code becomes simpler. Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10[PATCH] fdtable: Delete pointless code in dup_fd()Vadim Lobanov
The dup_fd() function creates a new files_struct and fdtable embedded inside that files_struct, and then possibly expands the fdtable using expand_files(). The out_release error path is invoked when expand_files() returns an error code. However, when this attempt to expand fails, the fdtable is left in its original embedded form, so it is pointless to try to free the associated fdarray and fdsets. Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10[PATCH] io-accounting: core statisticsAndrew Morton
The present per-task IO accounting isn't very useful. It simply counts the number of bytes passed into read() and write(). So if a process reads 1MB from an already-cached file, it is accused of having performed 1MB of I/O, which is wrong. (David Wright had some comments on the applicability of the present logical IO accounting: For billing purposes it is useless but for workload analysis it is very useful read_bytes/read_calls average read request size write_bytes/write_calls average write request size read_bytes/read_blocks ie logical/physical can indicate hit rate or thrashing write_bytes/write_blocks ie logical/physical guess since pdflush writes can be missed I often look for logical larger than physical to see filesystem cache problems. And the bytes/cpusec can help find applications that are dominating the cache and causing slow interactive response from page cache contention. I want to find the IO intensive applications and make sure they are doing efficient IO. Thus the acctcms(sysV) or csacms command would give the high IO commands). This patchset adds new accounting which tries to be more accurate. We account for three things: reads: attempt to count the number of bytes which this process really did cause to be fetched from the storage layer. Done at the submit_bio() level, so it is accurate for block-backed filesystems. I also attempt to wire up NFS and CIFS. writes: attempt to count the number of bytes which this process caused to be sent to the storage layer. This is done at page-dirtying time. The big inaccuracy here is truncate. If a process writes 1MB to a file and then deletes the file, it will in fact perform no writeout. But it will have been accounted as having caused 1MB of write. So... cancelled_writes: account the number of bytes which this process caused to not happen, by truncating pagecache. We _could_ just subtract this from the process's `write' accounting. But that means that some processes would be reported to have done negative amounts of write IO, which is silly. So we just report the raw number and punt this decision up to userspace. Now, we _could_ account for writes at the physical I/O level. But - This would require that we track memory-dirtying tasks at the per-page level (would require a new pointer in struct page). - It would mean that IO statistics for a process are usually only available long after that process has exitted. Which means that we probably cannot communicate this info via taskstats. This patch: Wire up the kernel-private data structures and the accessor functions to manipulate them. Cc: Jay Lan <jlan@sgi.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Chris Sturtivant <csturtiv@sgi.com> Cc: Tony Ernst <tee@sgi.com> Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net> Cc: David Wright <daw@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>