aboutsummaryrefslogtreecommitdiff
path: root/fs/proc
AgeCommit message (Collapse)Author
2008-02-05maps4: use pagewalker in clear_refs and smapsMatt Mackall
Use the generic pagewalker for smaps and clear_refs Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: add proportional set size accounting in smapsFengguang Wu
The "proportional set size" (PSS) of a process is the count of pages it has in memory, where each page is divided by the number of processes sharing it. So if a process has 1000 pages all to itself, and 1000 shared with one other process, its PSS will be 1500. - lwn.net: "ELC: How much memory are applications really using?" The PSS proposed by Matt Mackall is a very nice metic for measuring an process's memory footprint. So collect and export it via /proc/<pid>/smaps. Matt Mackall's pagemap/kpagemap and John Berthels's exmap can also do the job. They are comprehensive tools. But for PSS, let's do it in the simple way. Cc: John Berthels <jjberthels@gmail.com> Cc: Bernardo Innocenti <bernie@codewiz.org> Cc: Padraig Brady <P@draigBrady.com> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Hugh Dickins <hugh@veritas.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05is_vmalloc_addr(): Check if an address is within the vmalloc boundariesChristoph Lameter
Checking if an address is a vmalloc address is done in a couple of places. Define a common version in mm.h and replace the other checks. Again the include structures suck. The definition of VMALLOC_START and VMALLOC_END is not available in vmalloc.h since highmem.c cannot be included there. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-01[PATCH] switch audit_get_loginuid() to task_struct *Al Viro
all callers pass something->audit_context Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-02-01Merge branch 'task_killable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc * 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits) Remove commented-out code copied from NFS NFS: Switch from intr mount option to TASK_KILLABLE Add wait_for_completion_killable Add wait_event_killable Add schedule_timeout_killable Use mutex_lock_killable in vfs_readdir Add mutex_lock_killable Use lock_page_killable Add lock_page_killable Add fatal_signal_pending Add TASK_WAKEKILL exit: Use task_is_* signal: Use task_is_* sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL ptrace: Use task_is_* power: Use task_is_* wait: Use TASK_NORMAL proc/base.c: Use task_is_* proc/array.c: Use TASK_REPORT perfmon: Use task_is_* ... Fixed up conflicts in NFS/sunrpc manually..
2008-01-28[ATM]: Oops reading net/atm/arpDenis V. Lunev
cat /proc/net/atm/arp causes the NULL pointer dereference in the get_proc_net+0xc/0x3a. This happens as proc_get_net believes that the parent proc dir entry contains struct net. Fix this assumption for "net/atm" case. The problem is introduced by the commit c0097b07abf5f92ab135d024dd41bd2aada1512f from Eric W. Biederman/Daniel Lezcano. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: Consolidate net namespace related proc files creation.Denis V. Lunev
Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-25sched: latencytop supportArjan van de Ven
LatencyTOP kernel infrastructure; it measures latencies in the scheduler and tracks it system wide and per process. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-14fix the "remove task_ppid_nr_ns" commitOleg Nesterov
Commit 84427eaef1fb91704c7112bdb598c810003b99f3 (remove task_ppid_nr_ns) moved the task_tgid_nr_ns(task->real_parent) outside of lock_task_sighand(). This is wrong, ->real_parent could be freed/reused. Both ->parent/real_parent point to nothing after __exit_signal() because we remove the child from ->children list, and thus the child can't be reparented when its parent exits. rcu_read_lock() protects ->parent/real_parent, but _only_ if we know it was valid before we take rcu lock. Revert this part of the patch. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-13remove task_ppid_nr_nsRoland McGrath
task_ppid_nr_ns is called in three places. One of these should never have called it. In the other two, using it broke the existing semantics. This was presumably accidental. If the function had not been there, it would have been much more obvious to the eye that those patches were changing the behavior. We don't need this function. In task_state, the pid of the ptracer is not the ppid of the ptracer. In do_task_stat, ppid is the tgid of the real_parent, not its pid. I also moved the call outside of lock_task_sighand, since it doesn't need it. In sys_getppid, ppid is the tgid of the real_parent, not its pid. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-02restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace pidAl Viro
Contents of /proc/*/maps is sensitive and may become sensitive after open() (e.g. if target originally shares our ->mm and later does exec on suid-root binary). Check at read() (actually, ->start() of iterator) time that mm_struct we'd grabbed and locked is - still the ->mm of target - equal to reader's ->mm or the target is ptracable by reader. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-02Unify /proc/slabinfo configurationLinus Torvalds
Both SLUB and SLAB really did almost exactly the same thing for /proc/slabinfo setup, using duplicate code and per-allocator #ifdef's. This just creates a common CONFIG_SLABINFO that is enabled by both SLUB and SLAB, and shares all the setup code. Maybe SLOB will want this some day too. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-02slub: register slabinfo to procfsPekka Enberg
We need to register slabinfo to procfs when CONFIG_SLUB is enabled to make the file actually visible to user-space. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-10proc: remove/Fix proc generic d_revalidateEric W. Biederman
Ultimately to implement /proc perfectly we need an implementation of d_revalidate because files and directories can be removed behind the back of the VFS, and d_revalidate is the only way we can let the VFS know that this has happened. Unfortunately the linux VFS can not cope with anything in the path to a mount point going away. So a proper d_revalidate method that calls d_drop also needs to call have_submounts which is moderately expensive, so you really don't want a d_revalidate method that unconditionally calls it, but instead only calls it when the backing object has really gone away. proc generic entries only disappear on module_unload (when not counting the fledgling network namespace) so it is quite rare that we actually encounter that case and has not actually caused us real world trouble yet. So until we get a proper test for keeping dentries in the dcache fix the current d_revalidate method by completely removing it. This returns us to the current status quo. So with CONFIG_NETNS=n things should look as they have always looked. For CONFIG_NETNS=y things work most of the time but there are a few rare corner cases that don't behave properly. As the network namespace is barely present in 2.6.24 this should not be a problem. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "Denis V. Lunev" <den@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-06proc/base.c: Use task_is_*Matthew Wilcox
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2007-12-06proc/array.c: Use TASK_REPORTMatthew Wilcox
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2007-12-05proc: fix proc_dir_entry refcountingAlexey Dobriyan
Creating PDEs with refcount 0 and "deleted" flag has problems (see below). Switch to usual scheme: * PDE is created with refcount 1 * every de_get does +1 * every de_put() and remove_proc_entry() do -1 * once refcount reaches 0, PDE is freed. This elegantly fixes at least two following races (both observed) without introducing new locks, without abusing old locks, without spreading lock_kernel(): 1) PDE leak remove_proc_entry de_put ----------------- ------ [refcnt = 1] if (atomic_read(&de->count) == 0) if (atomic_dec_and_test(&de->count)) if (de->deleted) /* also not taken! */ free_proc_entry(de); else de->deleted = 1; [refcount=0, deleted=1] 2) use after free remove_proc_entry de_put ----------------- ------ [refcnt = 1] if (atomic_dec_and_test(&de->count)) if (atomic_read(&de->count) == 0) free_proc_entry(de); /* boom! */ if (de->deleted) free_proc_entry(de); BUG: unable to handle kernel paging request at virtual address 6b6b6b6b printing eip: c10acdda *pdpt = 00000000338f8001 *pde = 0000000000000000 Oops: 0000 [#1] PREEMPT SMP Modules linked in: af_packet ipv6 cpufreq_ondemand loop serio_raw psmouse k8temp hwmon sr_mod cdrom Pid: 23161, comm: cat Not tainted (2.6.24-rc2-8c0863403f109a43d7000b4646da4818220d501f #4) EIP: 0060:[<c10acdda>] EFLAGS: 00210097 CPU: 1 EIP is at strnlen+0x6/0x18 EAX: 6b6b6b6b EBX: 6b6b6b6b ECX: 6b6b6b6b EDX: fffffffe ESI: c128fa3b EDI: f380bf34 EBP: ffffffff ESP: f380be44 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process cat (pid: 23161, ti=f380b000 task=f38f2570 task.ti=f380b000) Stack: c10ac4f0 00000278 c12ce000 f43cd2a8 00000163 00000000 7da86067 00000400 c128fa20 00896b18 f38325a8 c128fe20 ffffffff 00000000 c11f291e 00000400 f75be300 c128fa20 f769c9a0 c10ac779 f380bf34 f7bfee70 c1018e6b f380bf34 Call Trace: [<c10ac4f0>] vsnprintf+0x2ad/0x49b [<c10ac779>] vscnprintf+0x14/0x1f [<c1018e6b>] vprintk+0xc5/0x2f9 [<c10379f1>] handle_fasteoi_irq+0x0/0xab [<c1004f44>] do_IRQ+0x9f/0xb7 [<c117db3b>] preempt_schedule_irq+0x3f/0x5b [<c100264e>] need_resched+0x1f/0x21 [<c10190ba>] printk+0x1b/0x1f [<c107c8ad>] de_put+0x3d/0x50 [<c107c8f8>] proc_delete_inode+0x38/0x41 [<c107c8c0>] proc_delete_inode+0x0/0x41 [<c1066298>] generic_delete_inode+0x5e/0xc6 [<c1065aa9>] iput+0x60/0x62 [<c1063c8e>] d_kill+0x2d/0x46 [<c1063fa9>] dput+0xdc/0xe4 [<c10571a1>] __fput+0xb0/0xcd [<c1054e49>] filp_close+0x48/0x4f [<c1055ee9>] sys_close+0x67/0xa5 [<c10026b6>] sysenter_past_esp+0x5f/0x85 ======================= Code: c9 74 0c f2 ae 74 05 bf 01 00 00 00 4f 89 fa 5f 89 d0 c3 85 c9 57 89 c7 89 d0 74 05 f2 ae 75 01 4f 89 f8 5f c3 89 c1 89 c8 eb 06 <80> 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 c3 90 90 90 57 83 c9 EIP: [<c10acdda>] strnlen+0x6/0x18 SS:ESP 0068:f380be44 Also, remove broken usage of ->deleted from reiserfs: if sget() succeeds, module is already pinned and remove_proc_entry() can't happen => nobody can mark PDE deleted. Dummy proc root in netns code is not marked with refcount 1. AFAICS, we never get it, it's just for proper /proc/net removal. I double checked CLONE_NETNS continues to work. Patch survives many hours of modprobe/rmmod/cat loops without new bugs which can be attributed to refcounting. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/net-2.6: (27 commits) [INET]: Fix inet_diag dead-lock regression [NETNS]: Fix /proc/net breakage [TEXTSEARCH]: Do not allow zero length patterns in the textsearch infrastructure [NETFILTER]: fix forgotten module release in xt_CONNMARK and xt_CONNSECMARK [NETFILTER]: xt_TCPMSS: remove network triggerable WARN_ON [DECNET]: dn_nl_deladdr() almost always returns no error [IPV6]: Restore IPv6 when MTU is big enough [RXRPC]: Add missing select on CRYPTO mac80211: rate limit wep decrypt failed messages rfkill: fix double-mutex-locking mac80211: drop unencrypted frames if encryption is expected mac80211: Fix behavior of ieee80211_open and ieee80211_close ieee80211: fix unaligned access in ieee80211_copy_snap mac80211: free ifsta->extra_ie and clear IEEE80211_STA_PRIVACY_INVOKED SCTP: Fix build issues with SCTP AUTH. SCTP: Fix chunk acceptance when no authenticated chunks were listed. SCTP: Fix the supported extensions paramter SCTP: Fix SCTP-AUTH to correctly add HMACS paramter. SCTP: Fix the number of HB transmissions. [TCP] illinois: Incorrect beta usage ...
2007-12-02[NETNS]: Fix /proc/net breakageEric W. Biederman
Well I clearly goofed when I added the initial network namespace support for /proc/net. Currently things work but there are odd details visible to user space, even when we have a single network namespace. Since we do not cache proc_dir_entry dentries at the moment we can just modify ->lookup to return a different directory inode depending on the network namespace of the process looking at /proc/net, replacing the current technique of using a magic and fragile follow_link method. To accomplish that this patch: - introduces a shadow_proc method to allow different dentries to be returned from proc_lookup. - Removes the old /proc/net follow_link magic - Fixes a weakness in our not caching of proc generic dentries. As shadow_proc uses a task struct to decided which dentry to return we can go back later and fix the proc generic caching without modifying any code that uses the shadow_proc method. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-29proc: remove races from proc_id_readdir()Eric W. Biederman
Oleg noticed that the call of task_pid_nr_ns() in proc_pid_readdir is racy with respect to tasks exiting. After a bit of examination it also appears that the call itself is completely unnecessary. So to fix the problem this patch modifies next_tgid() to return both a tgid and the task struct in question. A structure is introduced to return these values because it is slightly cleaner and easier to optimize, and the resulting code is a little shorter. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-29proc: fix NULL ->i_fop oopsAlexey Dobriyan
proc_kill_inodes() can clear ->i_fop in the middle of vfs_readdir resulting in NULL dereference during "file->f_op->readdir(file, buf, filler)". The solution is to remove proc_kill_inodes() completely: a) we don't have tricky modules implementing their tricky readdir hooks which could keeping this revoke from hell. b) In a situation when module is gone but PDE still alive, standard readdir will return only "." and "..", because pde->next was cleared by remove_proc_entry(). c) the race proc_kill_inode() destined to prevent is not completely fixed, just race window made smaller, because vfs_readdir() is run without sb_lock held and without file_list_lock held. Effectively, ->i_fop is cleared at random moment, which can't fix properly anything. BUG: unable to handle kernel NULL pointer dereference at virtual address 00000018 printing eip: c1061205 *pdpt = 0000000005b22001 *pde = 0000000000000000 Oops: 0000 [#1] PREEMPT SMP Modules linked in: foo af_packet ipv6 cpufreq_ondemand loop serio_raw sr_mod k8temp cdrom hwmon amd_rng Pid: 2033, comm: find Not tainted (2.6.24-rc1-b1d08ac064268d0ae2281e98bf5e82627e0f0c56 #2) EIP: 0060:[<c1061205>] EFLAGS: 00010246 CPU: 0 EIP is at vfs_readdir+0x47/0x74 EAX: c6b6a780 EBX: 00000000 ECX: c1061040 EDX: c5decf94 ESI: c6b6a780 EDI: fffffffe EBP: c9797c54 ESP: c5decf78 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process find (pid: 2033, ti=c5dec000 task=c64bba90 task.ti=c5dec000) Stack: c5decf94 c1061040 fffffff7 0805ffbc 00000000 c6b6a780 c1061295 0805ffbc 00000000 00000400 00000000 00000004 0805ffbc 4588eff4 c5dec000 c10026ba 00000004 0805ffbc 00000400 0805ffbc 4588eff4 bfdc6c70 000000dc 0000007b Call Trace: [<c1061040>] filldir64+0x0/0xc5 [<c1061295>] sys_getdents64+0x63/0xa5 [<c10026ba>] sysenter_past_esp+0x5f/0x85 ======================= Code: 49 83 78 18 00 74 43 8d 6b 74 bf fe ff ff ff 89 e8 e8 b8 c0 12 00 f6 83 2c 01 00 00 10 75 22 8b 5e 10 8b 4c 24 04 89 f0 8b 14 24 <ff> 53 18 f6 46 1a 04 89 c7 75 0b 8b 56 0c 8b 46 08 e8 c8 66 00 EIP: [<c1061205>] vfs_readdir+0x47/0x74 SS:ESP 0068:c5decf78 hch: "Nice, getting rid of this is a very good step formwards. Unfortunately we have another copy of this junk in security/selinux/selinuxfs.c:sel_remove_entries() which would need the same treatment." Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-26sched: fix prev_stime calculationIngo Molnar
Srivatsa Vaddagiri noticed occasionally incorrect CPU usage values in top and tracked it down to stime going below 0 in task_stime(). Negative values are possible there due to the sampled nature of stime/utime. Fix suggested by Balbir Singh. Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
2007-11-14proc: simplify and correct proc_flush_taskEric W. Biederman
Currently we special case when we have only the initial pid namespace. Unfortunately in doing so the copied case for the other namespaces was broken so we don't properly flush the thread directories :( So this patch removes the unnecessary special case (removing a usage of proc_mnt) and corrects the flushing of the thread directories. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14proc: fix proc_kill_inodes to kill dentries on all proc superblocksEric W. Biederman
It appears we overlooked support for removing generic proc files when we added support for multiple proc super blocks. Handle that now. [akpm@linux-foundation.org: coding-style cleanups] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Cc: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-13[NET]: Move unneeded data to initdata section.Denis V. Lunev
This patch reverts Eric's commit 2b008b0a8e96b726c603c5e1a5a7a509b5f61e35 It diets .text & .data section of the kernel if CONFIG_NET_NS is not set. This is safe after list operations cleanup. Signed-of-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07[NET]: Kill proc_net_create()David S. Miller
There are no more users. Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-30sched: fix /proc/<PID>/stat stime/utime monotonicity, part 2Balbir Singh
Extend Peter's patch to fix accounting issues, by keeping stime monotonic too. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Frans Pop <elendil@planet.nl>
2007-10-29sched: keep utime/stime monotonicPeter Zijlstra
keep utime/stime monotonic. cpustats use utime/stime as a ratio against sum_exec_runtime, as a consequence it can happen - when the ratio changes faster than time accumulates - that either can be appear to go backwards. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-26[NET]: Marking struct pernet_operations __net_initdata was inappropriateEric W. Biederman
It is not safe to to place struct pernet_operations in a special section. We need struct pernet_operations to last until we call unregister_pernet_subsys. Which doesn't happen until module unload. So marking struct pernet_operations is a disaster for modules in two ways. - We discard it before we call the exit method it points to. - Because I keep struct pernet_operations on a linked list discarding it for compiled in code removes elements in the middle of a linked list and does horrible things for linked insert. So this looks safe assuming __exit_refok is not discarded for modules. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26[NET] fs/proc/proc_net.c: make a struct staticAdrian Bunk
Struct proc_net_ns_ops can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-25Fix pointer mismatches in proc_sysctl.cDavid Howells
Fix pointer mismatches in proc_sysctl.c. The proc_handler() method returns a size_t through an arg pointer, but is given a pointer to a ssize_t to return into. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-22procfs: fix kernel-doc param warningsRandy Dunlap
Fix mnt_flush_task() misplaced kernel-doc. Fix typos in some of the doc text. Warning(linux-2.6.23-git17//fs/proc/base.c:2280): No description found for parameter 'mnt' Warning(linux-2.6.23-git17//fs/proc/base.c:2280): No description found for parameter 'pid' Warning(linux-2.6.23-git17//fs/proc/base.c:2280): No description found for parameter 'tgid' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-schedLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: fix guest time accounting going faster than user time accounting
2007-10-19Remove unused variables from fs/proc/base.cPavel Emelyanov
When removing the explicit task_struct->pid usage I found that proc_readfd_common() and proc_pident_readdir() get this field, but do not use it at all. So this cleanup is a cheap help with the task_struct->pid isolation. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Fix tsk->exit_state usageEugene Teo
tsk->exit_state can only be 0, EXIT_ZOMBIE, or EXIT_DEAD. A non-zero test is the same as tsk->exit_state & (EXIT_ZOMBIE | EXIT_DEAD), so just testing tsk->exit_state is sufficient. Signed-off-by: Eugene Teo <eugeneteo@kernel.sg> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19proc: export a processes resource limits via /proc/pidNeil Horman
Currently, there exists no method for a process to query the resource limits of another process. They can be inferred via some mechanisms but they cannot be explicitly determined. Given that this information can be usefull to know during the debugging of an application, I've written this patch which exports all of a processes limits via /proc/<pid>/limits. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Isolate some explicit usage of task->tgidPavel Emelyanov
With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that in case of posix cpu timers this is the same, and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: changes to show virtual ids to userPavel Emelyanov
This is the largest patch in the set. Make all (I hope) the places where the pid is shown to or get from user operate on the virtual pids. The idea is: - all in-kernel data structures must store either struct pid itself or the pid's global nr, obtained with pid_nr() call; - when seeking the task from kernel code with the stored id one should use find_task_by_pid() call that works with global pids; - when showing pid's numerical value to the user the virtual one should be used, but however when one shows task's pid outside this task's namespace the global one is to be used; - when getting the pid from userspace one need to consider this as the virtual one and use appropriate task/pid-searching functions. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: nuther build fix] [akpm@linux-foundation.org: yet nuther build fix] [akpm@linux-foundation.org: remove unneeded casts] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: initialize the namespace's proc_mntPavel Emelyanov
The namespace's proc_mnt must be kern_mount-ed to make this pointer always valid, independently of whether the user space mounted the proc or not. This solves raced in proc_flush_task, etc. with the proc_mnt switching from NULL to not-NULL. The initialization is done after the init's pid is created and hashed to make proc_get_sb() finr it and get for root inode. Sice the namespace holds the vfsmnt, vfsmnt holds the superblock and the superblock holds the namespace we must explicitly break this circle to destroy all the stuff. This is done after the init of the namespace dies. Running a few steps forward - when init exits it will kill all its children, so no proc_mnt will be needed after its death. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: make proc_flush_task() actually from entries from multiple ↵Pavel Emelyanov
namespaces This means that proc_flush_task_mnt() is to be called for many proc mounts and with different ids, depending on the namespace this pid is to be flushed from. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: make proc have multiple superblocks - one for each namespacePavel Emelyanov
Each pid namespace have to be visible through its own proc mount. Thus we need to have per-namespace proc trees with their own superblocks. We cannot easily show different pid namespace via one global proc tree, since each pid refers to different tasks in different namespaces. E.g. pid 1 refers to the init task in the initial namespace and to some other task when seeing from another namespace. Moreover - pid, exisintg in one namespace may not exist in the other. This approach has one move advantage is that the tasks from the init namespace can see what tasks live in another namespace by reading entries from another proc tree. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: helpers to find the task by its numerical idsPavel Emelyanov
When searching the task by numerical id on may need to find it using global pid (as it is done now in kernel) or by its virtual id, e.g. when sending a signal to a task from one namespace the sender will specify the task's virtual id and we should find the task by this value. [akpm@linux-foundation.org: fix gfs2 linkage] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: prepare proc_flust_task() to flush entries from multiple ↵Pavel Emelyanov
proc trees The first part is trivial - we just make the proc_flush_task() to operate on arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to it. The other change is more tricky: I moved the proc_flush_task() call in release_task() higher to address the following problem. When flushing task from many proc trees we need to know the set of ids (not just one pid) to find the dentries' names to flush. Thus we need to pass the task's pid to proc_flush_task() as struct pid is the only object that can provide all the pid numbers. But after __exit_signal() task has detached all his pids and this information is lost. This creates a tiny gap for proc_pid_lookup() to bring some dentries back to tree and keep them in hash (since pids are still alive before __exit_signal()) till the next shrink, but since proc_flush_task() does not provide a 100% guarantee that the dentries will be flushed, this is OK to do so. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Make access to task's nsproxy lighterPavel Emelyanov
When someone wants to deal with some other taks's namespaces it has to lock the task and then to get the desired namespace if the one exists. This is slow on read-only paths and may be impossible in some cases. E.g. Oleg recently noticed a race between unshare() and the (sent for review in cgroups) pid namespaces - when the task notifies the parent it has to know the parent's namespace, but taking the task_lock() is impossible there - the code is under write locked tasklist lock. On the other hand switching the namespace on task (daemonize) and releasing the namespace (after the last task exit) is rather rare operation and we can sacrifice its speed to solve the issues above. The access to other task namespaces is proposed to be performed like this: rcu_read_lock(); nsproxy = task_nsproxy(tsk); if (nsproxy != NULL) { / * * work with the namespaces here * e.g. get the reference on one of them * / } / * * NULL task_nsproxy() means that this task is * almost dead (zombie) * / rcu_read_unlock(); This patch has passed the review by Eric and Oleg :) and, of course, tested. [clg@fr.ibm.com: fix unshare()] [ebiederm@xmission.com: Update get_net_ns_by_pid] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: define and use task_active_pid_ns() wrapperSukadev Bhattiprolu
With multiple pid namespaces, a process is known by some pid_t in every ancestor pid namespace. Every time the process forks, the child process also gets a pid_t in every ancestor pid namespace. While a process is visible in >=1 pid namespaces, it can see pid_t's in only one pid namespace. We call this pid namespace it's "active pid namespace", and it is always the youngest pid namespace in which the process is known. This patch defines and uses a wrapper to find the active pid namespace of a process. The implementation of the wrapper will be changed in when support for multiple pid namespaces are added. Changelog: 2.6.22-rc4-mm2-pidns1: - [Pavel Emelianov, Alexey Dobriyan] Back out the change to use task_active_pid_ns() in child_reaper() since task->nsproxy can be NULL during task exit (so child_reaper() continues to use init_pid_ns). to implement child_reaper() since init_pid_ns.child_reaper to implement child_reaper() since tsk->nsproxy can be NULL during exit. 2.6.21-rc6-mm1: - Rename task_pid_ns() to task_active_pid_ns() to reflect that a process can have multiple pid namespaces. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: round up the APIPavel Emelianov
The set of functions process_session, task_session, process_group and task_pgrp is confusing, as the names can be mixed with each other when looking at the code for a long time. The proposals are to * equip the functions that return the integer with _nr suffix to represent that fact, * and to make all functions work with task (not process) by making the common prefix of the same name. For monotony the routines signal_session() and set_signal_session() are replaced with task_session_nr() and set_task_session(), especially since they are only used with the explicit task->signal dereference. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Task Control Groups: make cpusets a client of cgroupsPaul Menage
Remove the filesystem support logic from the cpusets system and makes cpusets a cgroup subsystem The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get passed through to the cgroup filesystem with the appropriate options to emulate the old cpuset filesystem behaviour. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Task Control Groups: add procfs interfacePaul Menage
Add: /proc/cgroups - general system info /proc/*/cgroup - per-task cgroup membership info [a.p.zijlstra@chello.nl: cgroups: bdi init hooks] Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19sched: fix guest time accounting going faster than user time accountingChristian Borntraeger
cputime_add already adds, dont do it twice. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-17Don't truncate /proc/PID/environ at 4096 charactersJames Pearson
/proc/PID/environ currently truncates at 4096 characters, patch based on the /proc/PID/mem code. Signed-off-by: James Pearson <james-p@moving-picture.com> Cc: Anton Arapov <aarapov@redhat.com> Cc: Jan Engelhardt <jengelh@computergmbh.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>