aboutsummaryrefslogtreecommitdiff
path: root/fs/proc
AgeCommit message (Collapse)Author
2008-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (1232 commits) iucv: Fix bad merging. net_sched: Add size table for qdiscs net_sched: Add accessor function for packet length for qdiscs net_sched: Add qdisc_enqueue wrapper highmem: Export totalhigh_pages. ipv6 mcast: Omit redundant address family checks in ip6_mc_source(). net: Use standard structures for generic socket address structures. ipv6 netns: Make several "global" sysctl variables namespace aware. netns: Use net_eq() to compare net-namespaces for optimization. ipv6: remove unused macros from net/ipv6.h ipv6: remove unused parameter from ip6_ra_control tcp: fix kernel panic with listening_get_next tcp: Remove redundant checks when setting eff_sacks tcp: options clean up tcp: Fix MD5 signatures for non-linear skbs sctp: Update sctp global memory limit allocations. sctp: remove unnecessary byteshifting, calculate directly in big-endian sctp: Allow only 1 listening socket with SO_REUSEADDR sctp: Do not leak memory on multiple listen() calls sctp: Support ipv6only AF_INET6 sockets. ...
2008-07-20tty: Ldisc revampAlan Cox
Move the line disciplines towards a conventional ->ops arrangement. For the moment the actual 'tty_ldisc' struct in the tty is kept as part of the tty struct but this can then be changed if it turns out that when it all settles down we want to refcount ldiscs separately to the tty. Pull the ldisc code out of /proc and put it with our ldisc code. Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-18proc: consolidate per-net single-release callersPavel Emelyanov
They are symmetrical to single_open ones :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18proc: consolidate per-net single_open callersPavel Emelyanov
There are already 7 of them - time to kill some duplicate code. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14Merge branch 'x86/for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (821 commits) x86: make 64bit hpet_set_mapping to use ioremap too, v2 x86: get x86_phys_bits early x86: max_low_pfn_mapped fix #4 x86: change _node_to_cpumask_ptr to return const ptr x86: I/O APIC: remove an IRQ2-mask hack x86: fix numaq_tsc_disable calling x86, e820: remove end_user_pfn x86: max_low_pfn_mapped fix, #3 x86: max_low_pfn_mapped fix, #2 x86: max_low_pfn_mapped fix, #1 x86_64: fix delayed signals x86: remove conflicting nx6325 and nx6125 quirks x86: Recover timer_ack lost in the merge of the NMI watchdog x86: I/O APIC: Never configure IRQ2 x86: L-APIC: Always fully configure IRQ0 x86: L-APIC: Set IRQ0 as edge-triggered x86: merge dwarf2 headers x86: use AS_CFI instead of UNWIND_INFO x86: use ignore macro instead of hash comment x86: use matching CFI_ENDPROC ...
2008-07-14Security: split proc ptrace checking into read vs. attachStephen Smalley
Enable security modules to distinguish reading of process state via proc from full ptrace access by renaming ptrace_may_attach to ptrace_may_access and adding a mode argument indicating whether only read access or full attach access is requested. This allows security modules to permit access to reading process state without granting full ptrace access. The base DAC/capability checking remains unchanged. Read access to /proc/pid/mem continues to apply a full ptrace attach check since check_mem_permission() already requires the current task to already be ptracing the target. The other ptrace checks within proc for elements like environ, maps, and fds are changed to pass the read mode instead of attach. In the SELinux case, we model such reading of process state as a reading of a proc file labeled with the target process' label. This enables SELinux policy to permit such reading of process state without permitting control or manipulation of the target process, as there are a number of cases where programs probe for such information via proc but do not need to be able to control the target (e.g. procps, lsof, PolicyKit, ConsoleKit). At present we have to choose between allowing full ptrace in policy (more permissive than required/desired) or breaking functionality (or in some cases just silencing the denials via dontaudit rules but this can hide genuine attacks). This version of the patch incorporates comments from Casey Schaufler (change/replace existing ptrace_may_attach interface, pass access mode), and Chris Wright (provide greater consistency in the checking). Note that like their predecessors __ptrace_may_attach and ptrace_may_attach, the __ptrace_may_access and ptrace_may_access interfaces use different return value conventions from each other (0 or -errno vs. 1 or 0). I retained this difference to avoid any changes to the caller logic but made the difference clearer by changing the latter interface to return a bool rather than an int and by adding a comment about it to ptrace.h for any future callers. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-07-08Merge branches 'x86/numa-fixes', 'x86/apic', 'x86/apm', 'x86/bitops', ↵Ingo Molnar
'x86/build', 'x86/cleanups', 'x86/cpa', 'x86/cpu', 'x86/defconfig', 'x86/gart', 'x86/i8259', 'x86/intel', 'x86/irqstats', 'x86/kconfig', 'x86/ldt', 'x86/mce', 'x86/memtest', 'x86/pat', 'x86/ptemask', 'x86/resumetrace', 'x86/threadinfo', 'x86/timers', 'x86/vdso' and 'x86/xen' into x86/devel
2008-07-08x86, generic: CPA add statistics about state of direct mapping v4Andi Kleen
Add information about the mapping state of the direct mapping to /proc/meminfo. I chose /proc/meminfo because that is where all the other memory statistics are too and it is a generally useful metric even outside debugging situations. A lot of split kernel pages means the kernel will run slower. This way we can see how many large pages are really used for it and how many are split. Useful for general insight into the kernel. v2: Add hotplug locking to 64bit to plug a very obscure theoretical race. 32bit doesn't need it because it doesn't support hotadd for lowmem. Fix some typos v3: Rename dpages_cnt Add CONFIG ifdef for count update as requested by tglx Expand description v4: Fix stupid bugs added in v3 Move update_page_count to pageattr.c Signed-off-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-05Fix pagemap_read() use of struct mm_walkAndrew Morton
Fix some issues in pagemap_read noted by Alexey: - initialize pagemap_walk.mm to "mm" , so the code starts working as advertised - initialize ->private to "&pm" so it wouldn't immediately oops in pagemap_pte_hole() - unstatic struct pagemap_walk, so two threads won't fsckup each other (including those started by root, including flipping ->mm when you don't have permissions) - pagemap_read() contains two calls to ptrace_may_attach(), second one looks unneeded. - avoid possible kmalloc(0) and integer wraparound. Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Personally, I'd just remove the functionality entirely - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-05Fix clear_refs_write() use of struct mm_walkAndrew Morton
Don't use a static entry, so as to prevent races during concurrent use of this function. Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-16Merge branch 'linus' into x86/irqstatsIngo Molnar
2008-06-12pagemap: fix large pages in pagemapDave Hansen
We were walking right into huge page areas in the pagemap walker, and calling the pmds pmd_bad() and clearing them. That leaked huge pages. Bad. This patch at least works around that for now. It ignores huge pages in the pagemap walker for the time being, and won't leak those pages. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12pagemap: pass mm into pagewalkersDave Hansen
We need this at least for huge page detection for now, because powerpc needs the vm_area_struct to be able to determine whether a virtual address is referring to a huge page (its pmd_huge() doesn't work). It might also come in handy for some of the other users. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6: capabilities: remain source compatible with 32-bit raw legacy capability support. LSM: remove stale web site from MAINTAINERS
2008-06-06pagemap: return EINVAL, not EIO, for unaligned reads of kpagecount or kpageflagsThomas Tuttle
If the user tries to read from a position that is not a multiple of 8, or read a number of bytes that is not a multiple of 8, they have passed an invalid argument to read, for the purpose of reading these files. It's not an IO error because we didn't encounter any trouble finding the data they asked for. Signed-off-by: Thomas Tuttle <ttuttle@google.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06pagemap: return map count, not reference count, in /proc/kpagecountThomas Tuttle
Since pagemap is all about examining pages mapped into processes' memory spaces, it makes sense for kpagecount to return the map counts, not the reference counts. Signed-off-by: Thomas Tuttle <ttuttle@google.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06proc: calculate the correct /proc/<pid> link countVegard Nossum
This patch: commit e9720acd728a46cb40daa52c99a979f7c4ff195c Author: Pavel Emelyanov <xemul@openvz.org> Date: Fri Mar 7 11:08:40 2008 -0800 [NET]: Make /proc/net a symlink on /proc/self/net (v3) introduced a /proc/self/net directory without bumping the corresponding link count for /proc/self. This patch replaces the static link count initializations with a call that counts the number of directory entries in the given pid_entry table whenever it is instantiated, and thus relieves the burden of manually keeping the two in sync. [akpm@linux-foundation.org: cleanup] Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06pagemap: fix bug in add_to_pagemap, require aligned-length reads of ↵Thomas Tuttle
/proc/pid/pagemap Fix a bug in add_to_pagemap. Previously, since pm->out was a char *, put_user was only copying 1 byte of every PFN, resulting in the top 7 bytes of each PFN not being copied. By requiring that reads be a multiple of 8 bytes, I can make pm->out and pm->end u64*s instead of char*s, which makes put_user work properly, and also simplifies the logic in add_to_pagemap a bit. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Thomas Tuttle <ttuttle@google.com> Cc: Matt Mackall <mpm@selenic.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-31capabilities: remain source compatible with 32-bit raw legacy capability ↵Andrew G. Morgan
support. Source code out there hard-codes a notion of what the _LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the raw capability system calls capget() and capset(). Its unfortunate, but true. Since the confusing header file has been in a released kernel, there is software that is erroneously using 64-bit capabilities with the semantics of 32-bit compatibilities. These recently compiled programs may suffer corruption of their memory when sys_getcap() overwrites more memory than they are coded to expect, and the raising of added capabilities when using sys_capset(). As such, this patch does a number of things to clean up the situation for all. It 1. forces the _LINUX_CAPABILITY_VERSION define to always retain its legacy value. 2. adopts a new #define strategy for the kernel's internal implementation of the preferred magic. 3. deprecates v2 capability magic in favor of a new (v3) magic number. The functionality of v3 is entirely equivalent to v2, the only difference being that the v2 magic causes the kernel to log a "deprecated" warning so the admin can find applications that may be using v2 inappropriately. [User space code continues to be encouraged to use the libcap API which protects the application from details like this. libcap-2.10 is the first to support v3 capabilities.] Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518. Thanks to Bojan Smojver for the report. [akpm@linux-foundation.org: s/depreciate/deprecate/g] [akpm@linux-foundation.org: be robust about put_user size] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Bojan Smojver <bojan@rexursive.com> Cc: stable@kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2008-05-25x86: make /proc/stat account for all interruptsJan Beulich
LAPIC interrupts, which don't go through the generic interrupt handling code, aren't accounted for in /proc/stat. Hence this patch adds a mechanism architectures can use to accordingly adjust the statistics. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-24proc: proc_get_inode() should get module only onceDenis V. Lunev
Any file under /proc/net opened more than once leaked the refcounter on the module it belongs to. The problem is that module_get is called for each file opening while module_put is called only when /proc inode is destroyed. So, lets put module counter if we are dealing with already initialised inode. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10737 Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: David Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Acked-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Robert Olsson <robert.olsson@its.uu.se> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Reported-by: Roland Kletzing <devzero@web.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24mm: fix atomic_t overflow in vmAlan Cox
The atomic_t type is 32bit but a 64bit system can have more than 2^32 pages of virtual address space available. Without this we overflow on ludicrously large mappings Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-17[PATCH] open sessionid permissionsSteve Grubb
The current permissions on sessionid are a little too restrictive. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-13capabilities: add bounding set to /proc/self/statusSerge E. Hallyn
There is currently no way to query the bounding set of another task. As there appears to be no security reason not to, and as Michael Kerrisk points out the following valid reasons to do so exist: * consistency (I can see all of the other per-thread/process sets in /proc/.../status) * debugging -- I could imagine that it would make the job of debugging an application that uses capabilities a little simpler. this patch adds the bounding set to /proc/self/status right after the effective set. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-08fs/proc/task_mmu.c: remove duplicated include filesHuang Weiyi
Removed duplicated include files <linux/ptrace.h> and <linux/seq_file.h> in fs/proc/task_mmu.c. Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-04task_nommu: fix compile failing bug because of spilt file.hBryan Wu
CC fs/proc/task_nommu.o fs/proc/task_nommu.c: In function ‘task_mem’: fs/proc/task_nommu.c:55: error: dereferencing pointer to incomplete type make[2]: *** [fs/proc/task_nommu.o] Error 1 make[1]: *** [fs/proc] Error 2 make: *** [fs] Error 2 Signed-off-by: Bryan Wu <cooloney@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits) rose: Wrong list_lock argument in rose_node seqops netns: Fix reassembly timer to use the right namespace netns: Fix device renaming for sysfs bnx2: Update version to 1.7.5. bnx2: Update RV2P firmware for 5709. bnx2: Zero out context memory for 5709. bnx2: Fix register test on 5709. bnx2: Fix remote PHY initial link state. bnx2: Refine remote PHY locking. bridge: forwarding table information for >256 devices tg3: Update version to 3.92 tg3: Add link state reporting to UMP firmware tg3: Fix ethtool loopback test for 5761 BX devices tg3: Fix 5761 NVRAM sizes tg3: Use constant 500KHz MI clock on adapters with a CPMU hci_usb.h: fix hard-to-trigger race dccp: ccid2.c, ccid3.c use clamp(), clamp_t() net: remove NR_CPUS arrays in net/core/dev.c net: use get/put_unaligned_* helpers bluetooth: use get/put_unaligned_* helpers ...
2008-05-02netns: assign PDE->data before gluing entry into /proc treeDenis V. Lunev
In this unfortunate case, proc_mkdir_mode wrapper can't be used anymore and this is no way to reuse proc_create_data due to nlinks assignment. So, copy the code from proc_mkdir and assign PDE->data at the appropriate moment. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-01[PATCH] split linux/file.hAl Viro
Initial splitoff of the low-level stuff; taken to fdtable.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-30mm: Add NR_WRITEBACK_TEMP counterMiklos Szeredi
Fuse will use temporary buffers to write back dirty data from memory mappings (normal writes are done synchronously). This is needed, because there cannot be any guarantee about the time in which a write will complete. By using temporary buffers, from the MM's point if view the page is written back immediately. If the writeout was due to memory pressure, this effectively migrates data from a full zone to a less full zone. This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used as temporary buffers. [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30tty: The big operations reworkAlan Cox
- Operations are now a shared const function block as with most other Linux objects - Introduce wrappers for some optional functions to get consistent behaviour - Wrap put_char which used to be patched by the tty layer - Document which functions are needed/optional - Make put_char report success/fail - Cache the driver->ops pointer in the tty as tty->ops - Remove various surplus lock calls we no longer need - Remove proc_write method as noted by Alexey Dobriyan - Introduce some missing sanity checks where certain driver/ldisc combinations would oops as they didn't check needed methods were present [akpm@linux-foundation.org: fix fs/compat_ioctl.c build] [akpm@linux-foundation.org: fix isicom] [akpm@linux-foundation.org: fix arch/ia64/hp/sim/simserial.c build] [akpm@linux-foundation.org: fix kgdb] Signed-off-by: Alan Cox <alan@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Cc: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30tty_io: fix remaining pid struct lockingAlan Cox
This fixes the last couple of pid struct locking failures I know about. [oleg@tv-sign.ru: clean up do_task_stat()] Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30do_task_stat: don't take rcu_read_lock()Oleg Nesterov
lock_task_sighand() was changed, and do_task_stat() doesn't need rcu_read_lock any longer. sighand->siglock protects all "interesting" fields. Except: it doesn't protect ->tty->pgrp, but neither does rcu_read_lock(), this should be fixed. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Pavel Emelyanov <xemul@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29sysctl: add the ->permissions callback on the ctl_table_rootPavel Emelyanov
When reading from/writing to some table, a root, which this table came from, may affect this table's permissions, depending on who is working with the table. The core hunk is at the bottom of this patch. All the rest is just pushing the ctl_table_root argument up to the sysctl_perm() function. This will be mostly (only?) used in the net sysctls. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Denis V. Lunev <den@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29sysctl: merge equal proc_sys_read and proc_sys_writePavel Emelyanov
Many (most of) sysctls do not have a per-container sense. E.g. kernel.print_fatal_signals, vm.panic_on_oom, net.core.netdev_budget and so on and so forth. Besides, tuning then from inside a container is not even secure. On the other hand, hiding them completely from the container's tasks sometimes causes user-space to stop working. When developing net sysctl, the common practice was to duplicate a table and drop the write bits in table->mode, but this approach was not very elegant, lead to excessive memory consumption and was not suitable in general. Here's the alternative solution. To facilitate the per-container sysctls ctl_table_root-s were introduced. Each root contains a list of ctl_table_header-s that are visible to different namespaces. The idea of this set is to add the permissions() callback on the ctl_table_root to allow ctl root limit permissions to the same ctl_table-s. The main user of this functionality is the net-namespaces code, but later this will (should) be used by more and more namespaces, containers and control groups. Actually, this idea's core is in a single hunk in the third patch. First two patches are cleanups for sysctl code, while the third one mostly extends the arguments set of some sysctl functions. This patch: These ->read and ->write callbacks act in a very similar way, so merge these paths to reduce the number of places to patch later and shrink the .text size (a bit). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: "David S. Miller" <davem@davemloft.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Denis V. Lunev <den@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: introduce proc_create_data to setup de->dataDenis V. Lunev
This set of patches fixes an proc ->open'less usage due to ->proc_fops flip in the most part of the kernel code. The original OOPS is described in the commit 2d3a4e3666325a9709cc8ea2e88151394e8f20fc: Typical PDE creation code looks like: pde = create_proc_entry("foo", 0, NULL); if (pde) pde->proc_fops = &foo_proc_fops; Notice that PDE is first created, only then ->proc_fops is set up to final value. This is a problem because right after creation a) PDE is fully visible in /proc , and b) ->proc_fops are proc_file_operations which do not have ->open callback. So, it's possible to ->read without ->open (see one class of oopses below). The fix is new API called proc_create() which makes sure ->proc_fops are set up before gluing PDE to main tree. Typical new code looks like: pde = proc_create("foo", 0, NULL, &foo_proc_fops); if (!pde) return -ENOMEM; Fix most networking users for a start. In the long run, create_proc_entry() for regular files will go. In addition to this, proc_create_data is introduced to fix reading from proc without PDE->data. The race is basically the same as above. create_proc_entries is replaced in the entire kernel code as new method is also simply better. This patch: The problem is the same as for de->proc_fops. Right now PDE becomes visible without data set. So, the entry could be looked up without data. This, in most cases, will simply OOPS. proc_create_data call is created to address this issue. proc_create now becomes a wrapper around it. Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Chris Mason <chris.mason@oracle.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jaroslav Kysela <perex@suse.cz> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Karsten Keil <kkeil@suse.de> Cc: Kyle McMartin <kyle@parisc-linux.org> Cc: Len Brown <lenb@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Mikael Starvik <starvik@axis.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Neil Brown <neilb@suse.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Osterlund <petero2@telia.com> Cc: Pierre Peiffer <peifferp@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Takashi Iwai <tiwai@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: convert /proc/tty/ldiscs to seq_file interfaceAlexey Dobriyan
Note: THIS_MODULE and header addition aren't technically needed because this code is not modular, but let's keep it anyway because people can copy this code into modular code. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: remove ->get_info infrastructureAlexey Dobriyan
Now that last dozen or so users of ->get_info were removed, ditch it too. Everyone sane shouldd have switched to seq_file interface long ago. P.S.: Co-existing 3 interfaces (->get_info/->read_proc/->proc_fops) for proc is long-standing crap, BTW, thus a) put ->read_proc/->write_proc/read_proc_entry() users on death row, b) new such users should be rejected, c) everyone is encouraged to convert his favourite ->read_proc user or I'll do it, lazy bastards. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: remove proc_root from driversAlexey Dobriyan
Remove proc_root export. Creation and removal works well if parent PDE is supplied as NULL -- it worked always that way. So, one useless export removed and consistency added, some drivers created PDEs with &proc_root as parent but removed them as NULL and so on. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: remove proc_root_driverAlexey Dobriyan
Use creation by full path: "driver/foo". Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: remove proc_root_fsAlexey Dobriyan
Use creation by full path instead: "fs/foo". Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: remove proc_busAlexey Dobriyan
Remove proc_bus export and variable itself. Using pathnames works fine and is slightly more understandable and greppable. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: drop several "PDE valid/invalid" checksAlexey Dobriyan
proc-misc code is noticeably full of "if (de)" checks when PDE passed is always valid. Remove them. Addition of such check in proc_lookup_de() is for failed lookup case. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: less special case in xlate codeAlexey Dobriyan
If valid "parent" is passed to proc_create/remove_proc_entry(), then name of PDE should consist of only one path component, otherwise creation or or removal will fail. However, if NULL is passed as parent then create/remove accept full path as a argument. This is arbitrary restriction -- all infrastructure is in place. So, patch allows the following to succeed: create_proc_entry("foo/bar", 0, pde_baz); remove_proc_entry("baz/foo/bar", &proc_root); Also makes the following to behave identically: create_proc_entry("foo/bar", 0, NULL); create_proc_entry("foo/bar", 0, &proc_root); Discrepancy noticed by Den Lunev (IIRC). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: simplify locking in remove_proc_entry()Alexey Dobriyan
proc_subdir_lock protects only modifying and walking through PDE lists, so after we've found PDE to remove and actually removed it from lists, there is no need to hold proc_subdir_lock for the rest of operation. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29procfs: mem permission cleanupRoland McGrath
This cleans up the permission checks done for /proc/PID/mem i/o calls. It puts all the logic in a new function, check_mem_permission(). The old code repeated the (!MAY_PTRACE(task) || !ptrace_may_attach(task)) magical expression multiple times. The new function does all that work in one place, with clear comments. The old code called security_ptrace() twice on successful checks, once in MAY_PTRACE() and once in __ptrace_may_attach(). Now it's only called once, and only if all other checks have succeeded. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: switch to proc_create()Alexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29procfs task exe symlinkMatt Helsley
The kernel implements readlink of /proc/pid/exe by getting the file from the first executable VMA. Then the path to the file is reconstructed and reported as the result. Because of the VMA walk the code is slightly different on nommu systems. This patch avoids separate /proc/pid/exe code on nommu systems. Instead of walking the VMAs to find the first executable file-backed VMA we store a reference to the exec'd file in the mm_struct. That reference would prevent the filesystem holding the executable file from being unmounted even after unmapping the VMAs. So we track the number of VM_EXECUTABLE VMAs and drop the new reference when the last one is unmapped. This avoids pinning the mounted filesystem. [akpm@linux-foundation.org: improve comments] [yamamoto@valinux.co.jp: fix dup_mmap] Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Howells <dhowells@redhat.com> Cc:"Eric W. Biederman" <ebiederm@xmission.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29proc: print more information when removing non-empty directoriesAlexey Dobriyan
This usually saves one recompile to insert similar printk like below. :) Sample nastygram: remove_proc_entry: removing non-empty directory '/proc/foo', leaking at least 'bar' ------------[ cut here ]------------ WARNING: at fs/proc/generic.c:776 remove_proc_entry+0x18a/0x200() Modules linked in: foo(-) container fan battery dock sbs ac sbshc backlight ipv6 loop af_packet amd_rng sr_mod i2c_amd8111 i2c_amd756 cdrom i2c_core button thermal processor Pid: 3034, comm: rmmod Tainted: G M 2.6.25-rc1 #5 Call Trace: [<ffffffff80231974>] warn_on_slowpath+0x64/0x90 [<ffffffff80232a6e>] printk+0x4e/0x60 [<ffffffff802d6c8a>] remove_proc_entry+0x18a/0x200 [<ffffffff8045cd88>] mutex_lock_nested+0x1c8/0x2d0 [<ffffffff8025f0f0>] __try_stop_module+0x0/0x40 [<ffffffff8025effd>] sys_delete_module+0x14d/0x200 [<ffffffff8045df3d>] lockdep_sys_exit_thunk+0x35/0x67 [<ffffffff8031c307>] __up_read+0x27/0xa0 [<ffffffff8045decc>] trace_hardirqs_on_thunk+0x35/0x3a [<ffffffff8020b6ab>] system_call_after_swapgs+0x7b/0x80 ---[ end trace 10ef850597e89c54 ]--- Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28smaps: account swap entriesPeter Zijlstra
Show the amount of swap for each vma. This can be used to see where all the swap goes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Matt Mackall <mpm@selenic.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>