aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2009-06-25futex: request only one page from get_user_pages()Thomas Gleixner
Yanmin noticed that fault_in_user_writeable() requests 4 pages instead of one. That's the result of blindly trusting Linus' proposal :) I even looked up the prototype to verify the correctness: the argument in question is confusingly enough named "len" while in reality it means number of pages. Pointed-out-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-06-24Merge branches 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/{vfs-2.6,audit-current} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: another race fix in jfs_check_acl() Get "no acls for this inode" right, fix shmem breakage inline functions left without protection of ifdef (acl) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: audit: inode watches depend on CONFIG_AUDIT not CONFIG_AUDIT_SYSCALL
2009-06-24audit: inode watches depend on CONFIG_AUDIT not CONFIG_AUDIT_SYSCALLEric Paris
Even though one cannot make use of the audit watch code without CONFIG_AUDIT_SYSCALL the spaghetti nature of the audit code means that the audit rule filtering requires that it at least be compiled. Thus build the audit_watch code when we build auditfilter like it was before cfcad62c74abfef83762dc05a556d21bdf3980a2 Clearly this is a point of potential future cleanup.. Reported-by: Frans Pop <elendil@planet.nl> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24futex: Fix the write access fault problem for realThomas Gleixner
commit 64d1304a64 (futex: setup writeable mapping for futex ops which modify user space data) did address only half of the problem of write access faults. The patch was made on two wrong assumptions: 1) access_ok(VERIFY_WRITE,...) would actually check write access. On x86 it does _NOT_. It's a pure address range check. 2) a RW mapped region can not go away under us. That's wrong as well. Nobody can prevent another thread to call mprotect(PROT_READ) on that region where the futex resides. If that call hits between the get_user_pages_fast() verification and the actual write access in the atomic region we are toast again. The solution is to not rely on access_ok and get_user() for any write access related fault on private and shared futexes. Instead we need to fault it in with verification of write access. There is no generic non destructive write mechanism which would fault the user page in trough a #PF, but as we already know that we will fault we can as well call get_user_pages() directly and avoid the #PF overhead. If get_user_pages() returns -EFAULT we know that we can not fix it anymore and need to bail out to user space. Remove a bunch of confusing comments on this issue as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org
2009-06-24Fix rule eviction order for AUDIT_DIRAl Viro
If syscall removes the root of subtree being watched, we definitely do not want the rules refering that subtree to be destroyed without the syscall in question having a chance to match them. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24Audit: clean up all op= output to include string quotingEric Paris
A number of places in the audit system we send an op= followed by a string that includes spaces. Somehow this works but it's just wrong. This patch moves all of those that I could find to be quoted. Example: Change From: type=CONFIG_CHANGE msg=audit(1244666690.117:31): auid=0 ses=1 subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 op=remove rule key="number2" list=4 res=0 Change To: type=CONFIG_CHANGE msg=audit(1244666690.117:31): auid=0 ses=1 subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 op="remove rule" key="number2" list=4 res=0 Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-23Audit: move audit_get_nd completely into audit_watchEric Paris
audit_get_nd() is only used by audit_watch and could be more cleanly implemented by having the audit watch functions call it when needed rather than making the generic audit rule parsing code deal with those objects. Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-23audit: seperate audit inode watches into a subfileEric Paris
In preparation for converting audit to use fsnotify instead of inotify we seperate the inode watching code into it's own file. This is similar to how the audit tree watching code is already seperated into audit_tree.c Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-23Audit: clean up audit_receive_skbEric Paris
audit_receive_skb is hard to clearly parse what it is doing to the netlink message. Clean the function up so it is easy and clear to see what is going on. Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-23Audit: cleanup netlink mesg handlingEric Paris
The audit handling of netlink messages is all over the place. Clean things up, use predetermined macros, generally make it more readable. Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-23Audit: unify the printk of an skb when auditd not aroundEric Paris
Remove code duplication of skb printk when auditd is not around in userspace to deal with this message. Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-23Audit: dereferencing krule as if it were an audit_watchEric Paris
audit_update_watch() runs all of the rules for a given watch and duplicates them, attaches a new watch to them, and then when it finishes that process and has called free on all of the old rules (ok maybe still inside the rcu grace period) it proceeds to use the last element from list_for_each_entry_safe() as if it were a krule rather than being the audit_watch which was anchoring the list to output a message about audit rules changing. This patch unfies the audit message from two different places into a helper function and calls it from the correct location in audit_update_rules(). We will now get an audit message about the config changing for each rule (with each rules filterkey) rather than the previous garbage. Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-23Audit: better estimation of execve record lengthEric Paris
The audit execve record splitting code estimates the length of the message generated. But it forgot to include the "" that wrap each string in its estimation. This means that execve messages with lots of tiny (1-2 byte) arguments could still cause records greater than 8k to be emitted. Simply fix the estimate. Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-23Audit: fix audit watch use after freeEric Paris
When an audit watch is added to a parent the temporary watch inside the original krule from userspace is freed. Yet the original watch is used after the real watch was created in audit_add_rules() Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-22mm/init: cpu_hotplug_init() must be initialized before SLABLinus Torvalds
SLAB uses get/put_online_cpus() which use a mutex which is itself only initialized when cpu_hotplug_init() is called. Currently we hang suring boot in SLAB due to doing that too late. Reported by James Bottomley and Sachin Sant (and possibly others). Debugged by Benjamin Herrenschmidt. This just removes the dynamic initialization of the data structures, and replaces it with a static one, avoiding this dependency entirely, and removing one unnecessary special initcall. Tested-by: Sachin Sant <sachinp@in.ibm.com> Tested-by: James Bottomley <James.Bottomley@HansenPartnership.com> Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-20Merge branch 'irq-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq, irq.h: Fix kernel-doc warnings genirq: fix comment to say IRQ_WAKE_THREAD
2009-06-20Merge branch 'perfcounters-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits) perfcounter: Handle some IO return values perf_counter: Push perf_sample_data through the swcounter code perf_counter tools: Define and use our own u64, s64 etc. definitions perf_counter: Close race in perf_lock_task_context() perf_counter, x86: Improve interactions with fast-gup perf_counter: Simplify and fix task migration counting perf_counter tools: Add a data file header perf_counter: Update userspace callchain sampling uses perf_counter: Make callchain samples extensible perf report: Filter to parent set by default perf_counter tools: Handle lost events perf_counter: Add event overlow handling fs: Provide empty .set_page_dirty() aop for anon inodes perf_counter: tools: Makefile tweaks for 64-bit powerpc perf_counter: powerpc: Add processor back-end for MPC7450 family perf_counter: powerpc: Make powerpc perf_counter code safe for 32-bit kernels perf_counter: powerpc: Change how processor-specific back-ends get selected perf_counter: powerpc: Use unsigned long for register and constraint values perf_counter: powerpc: Enable use of software counters on 32-bit powerpc perf_counter tools: Add and use isprint() ...
2009-06-20Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix out of scope variable access in sched_slice() sched: Hide runqueues from direct refer at source code level sched: Remove unneeded __ref tag sched, x86: Fix cpufreq + sched_clock() TSC scaling
2009-06-20Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits) tracing/urgent: warn in case of ftrace_start_up inbalance tracing/urgent: fix unbalanced ftrace_start_up function-graph: add stack frame test function-graph: disable when both x86_32 and optimize for size are configured ring-buffer: have benchmark test print to trace buffer ring-buffer: do not grab locks in nmi ring-buffer: add locks around rb_per_cpu_empty ring-buffer: check for less than two in size allocation ring-buffer: remove useless compile check for buffer_page size ring-buffer: remove useless warn on check ring-buffer: use BUF_PAGE_HDR_SIZE in calculating index tracing: update sample event documentation tracing/filters: fix race between filter setting and module unload tracing/filters: free filter_string in destroy_preds() ring-buffer: use commit counters for commit pointer accounting ring-buffer: remove unused variable ring-buffer: have benchmark test handle discarded events ring-buffer: prevent adding write in discarded area tracing/filters: strloc should be unsigned short tracing/filters: operand can be negative ... Fix up kmemcheck-induced conflict in kernel/trace/ring_buffer.c manually
2009-06-20Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: NOHZ: Properly feed cpufreq ondemand governor
2009-06-20Merge branch 'tip/tracing/urgent-1' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent
2009-06-20Merge branch 'tip/tracing/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent
2009-06-20perf_counter: Push perf_sample_data through the swcounter codePeter Zijlstra
Push the perf_sample_data further outwards to the swcounter interface, to abstract it away some more. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-20tracing/urgent: warn in case of ftrace_start_up inbalanceFrederic Weisbecker
Prevent from further ftrace_start_up inbalances so that we avoid future nop patching omissions with dynamic ftrace. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2009-06-20tracing/urgent: fix unbalanced ftrace_start_upFrederic Weisbecker
Perfcounter reports the following stats for a wide system profiling: # # (2364 samples) # # Overhead Symbol # ........ ...... # 15.40% [k] mwait_idle_with_hints 8.29% [k] read_hpet 5.75% [k] ftrace_caller 3.60% [k] ftrace_call [...] This snapshot has been taken while neither the function tracer nor the function graph tracer was running. With dynamic ftrace, such results show a wrong ftrace behaviour because all calls to ftrace_caller or ftrace_graph_caller (the patched calls to mcount) are supposed to be patched into nop if none of those tracers are running. The problem occurs after the first run of the function tracer. Once we launch it a second time, the callsites will never be nopped back, unless you set custom filters. For example it happens during the self tests at boot time. The function tracer selftest runs, and then the dynamic tracing is tested too. After that, the callsites are left un-nopped. This is because the reset callback of the function tracer tries to unregister two ftrace callbacks in once: the common function tracer and the function tracer with stack backtrace, regardless of which one is currently in use. It then creates an unbalance on ftrace_start_up value which is expected to be zero when the last ftrace callback is unregistered. When it reaches zero, the FTRACE_DISABLE_CALLS is set on the next ftrace command, triggering the patching into nop. But since it becomes unbalanced, ie becomes lower than zero, if the kernel functions are patched again (as in every further function tracer runs), they won't ever be nopped back. Note that ftrace_call and ftrace_graph_call are still patched back to ftrace_stub in the off case, but not the callers of ftrace_call and ftrace_graph_caller. It means that the tracing is well deactivated but we waste a useless call into every kernel function. This patch just unregisters the right ftrace_ops for the function tracer on its reset callback and ignores the other one which is not registered, fixing the unbalance. The problem also happens is .30 Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: stable@kernel.org
2009-06-19ptrace: wait_task_zombie: do not account traced sub-threadsOleg Nesterov
The bug is ancient. If we trace the sub-thread of our natural child and this sub-thread exits, we update parent->signal->cxxx fields. But we should not do this until the whole thread-group exits, otherwise we account this thread (and all other live threads) twice. Add the task_detached() check. No need to check thread_group_empty(), wait_consider_task()->delay_group_leader() already did this. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Roland McGrath <roland@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-19perf_counter: Close race in perf_lock_task_context()Peter Zijlstra
perf_lock_task_context() is buggy because it can return a dead context. the RCU read lock in perf_lock_task_context() only guarantees the memory won't get freed, it doesn't guarantee the object is valid (in our case refcount > 0). Therefore we can return a locked object that can get freed the moment we release the rcu read lock. perf_pin_task_context() then increases the refcount and does an unlock on freed memory. That increased refcount will cause a double free, in case it started out with 0. Ammend this by including the get_ctx() functionality in perf_lock_task_context() (all users already did this later anyway), and return a NULL context when the found one is already dead. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-19perf_counter: Simplify and fix task migration countingPeter Zijlstra
The task migrations counter was causing rare and hard to decypher memory corruptions under load. After a day of debugging and bisection we found that the problem was introduced with: 3f731ca: perf_counter: Fix cpu migration counter Turning them off fixes the crashes. Incidentally, the whole perf_counter_task_migration() logic can be done simpler as well, by injecting a proper sw-counter event. This cleanup also fixed the crashes. The precise failure mode is not completely clear yet, but we are clearly not unhappy about having a fix ;-) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-18function-graph: add stack frame testSteven Rostedt
In case gcc does something funny with the stack frames, or the return from function code, we would like to detect that. An arch may implement passing of a variable that is unique to the function and can be saved on entering a function and can be tested when exiting the function. Usually the frame pointer can be used for this purpose. This patch also implements this for x86. Where it passes in the stack frame of the parent function, and will test that frame on exit. There was a case in x86_32 with optimize for size (-Os) where, for a few functions, gcc would align the stack frame and place a copy of the return address into it. The function graph tracer modified the copy and not the actual return address. On return from the funtion, it did not go to the tracer hook, but returned to the parent. This broke the function graph tracer, because the return of the parent (where gcc did not do this funky manipulation) returned to the location that the child function was suppose to. This caused strange kernel crashes. This test detected the problem and pointed out where the issue was. This modifies the parameters of one of the functions that the arch specific code calls, so it includes changes to arch code to accommodate the new prototype. Note, I notice that the parsic arch implements its own push_return_trace. This is now a generic function and the ftrace_push_return_trace should be used instead. This patch does not touch that code. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Helge Deller <deller@gmx.de> Cc: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-18function-graph: disable when both x86_32 and optimize for size are configuredSteven Rostedt
On x86_32, when optimize for size is set, gcc may align the frame pointer and make a copy of the the return address inside the stack frame. The return address that is located in the stack frame may not be the one used to return to the calling function. This will break the function graph tracer. The function graph tracer replaces the return address with a jump to a hook function that can trace the exit of the function. If it only replaces a copy, then the hook will not be called when the function returns. Worse yet, when the parent function returns, the function graph tracer will return back to the location of the child function which will easily crash the kernel with weird results. To see the problem, when i386 is compiled with -Os we get: c106be03: 57 push %edi c106be04: 8d 7c 24 08 lea 0x8(%esp),%edi c106be08: 83 e4 e0 and $0xffffffe0,%esp c106be0b: ff 77 fc pushl 0xfffffffc(%edi) c106be0e: 55 push %ebp c106be0f: 89 e5 mov %esp,%ebp c106be11: 57 push %edi c106be12: 56 push %esi c106be13: 53 push %ebx c106be14: 81 ec 8c 00 00 00 sub $0x8c,%esp c106be1a: e8 f5 57 fb ff call c1021614 <mcount> When it is compiled with -O2 instead we get: c10896f0: 55 push %ebp c10896f1: 89 e5 mov %esp,%ebp c10896f3: 83 ec 28 sub $0x28,%esp c10896f6: 89 5d f4 mov %ebx,0xfffffff4(%ebp) c10896f9: 89 75 f8 mov %esi,0xfffffff8(%ebp) c10896fc: 89 7d fc mov %edi,0xfffffffc(%ebp) c10896ff: e8 d0 08 fa ff call c1029fd4 <mcount> The compile with -Os will align the stack pointer then set up the frame pointer (%ebp), and it copies the return address back into the stack frame. The change to the return address in mcount is done to the copy and not the real place holder of the return address. Then compile with -O2 sets up the frame pointer first, this makes the change to the return address by mcount affect where the function will jump on exit. Reported-by: Jake Edge <jake@lwn.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-18gcov: enable GCOV_PROFILE_ALL for x86_64Peter Oberparleiter
Enable gcov profiling of the entire kernel on x86_64. Required changes include disabling profiling for: * arch/kernel/acpi/realmode and arch/kernel/boot/compressed: not linked to main kernel * arch/vdso, arch/kernel/vsyscall_64 and arch/kernel/hpet: profiling causes segfaults during boot (incompatible context) Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Li Wei <W.Li@Sun.COM> Cc: Michael Ellerman <michaele@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Heiko Carstens <heicars2@linux.vnet.ibm.com> Cc: Martin Schwidefsky <mschwid2@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18gcov: add gcov profiling infrastructurePeter Oberparleiter
Enable the use of GCC's coverage testing tool gcov [1] with the Linux kernel. gcov may be useful for: * debugging (has this code been reached at all?) * test improvement (how do I change my test to cover these lines?) * minimizing kernel configurations (do I need this option if the associated code is never run?) The profiling patch incorporates the following changes: * change kbuild to include profiling flags * provide functions needed by profiling code * present profiling data as files in debugfs Note that on some architectures, enabling gcc's profiling option "-fprofile-arcs" for the entire kernel may trigger compile/link/ run-time problems, some of which are caused by toolchain bugs and others which require adjustment of architecture code. For this reason profiling the entire kernel is initially restricted to those architectures for which it is known to work without changes. This restriction can be lifted once an architecture has been tested and found compatible with gcc's profiling. Profiling of single files or directories is still available on all platforms (see config help text). [1] http://gcc.gnu.org/onlinedocs/gcc/Gcov.html Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Li Wei <W.Li@Sun.COM> Cc: Michael Ellerman <michaele@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Heiko Carstens <heicars2@linux.vnet.ibm.com> Cc: Martin Schwidefsky <mschwid2@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18kernel: constructor supportPeter Oberparleiter
Call constructors (gcc-generated initcall-like functions) during kernel start and module load. Constructors are e.g. used for gcov data initialization. Disable constructor support for usermode Linux to prevent conflicts with host glibc. Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Li Wei <W.Li@Sun.COM> Cc: Michael Ellerman <michaele@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Heiko Carstens <heicars2@linux.vnet.ibm.com> Cc: Martin Schwidefsky <mschwid2@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18nsproxy: extract create_nsproxy()Alexey Dobriyan
clone_nsproxy() does useless copying of old nsproxy -- every pointer will be rewritten to new ns or to old ns. Remove copying, rename clone_nsproxy(), create_nsproxy() will be used by C/R code to create fresh nsproxy on restart. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18utsns: extract creeate_uts_ns()Alexey Dobriyan
create_uts_ns() will be used by C/R to create fresh uts_ns. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18pidns: rewrite copy_pid_ns()Alexey Dobriyan
copy_pid_ns() is a perfect example of a case where unwinding leads to more code and makes it less clear. Watch the diffstat. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18pidns: make create_pid_namespace() accept parent pidnsAlexey Dobriyan
create_pid_namespace() creates everything, but caller has to assign parent pidns by hand, which is unnatural. At the moment of call new ->level has to be taken from somewhere and parent pidns is already available. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18pids: clean up find_task_by_pid variantsChristoph Hellwig
find_task_by_pid_type_ns is only used to implement find_task_by_vpid and find_task_by_pid_ns, but both of them pass PIDTYPE_PID as first argument. So just fold find_task_by_pid_type_ns into find_task_by_pid_ns and use find_task_by_pid_ns to implement find_task_by_vpid. While we're at it also remove the exports for find_task_by_pid_ns and find_task_by_vpid - we don't have any modular callers left as the only modular caller of he old pre pid namespace find_task_by_pid (gfs2) was switched to pid_task which operates on a struct pid pointer instead of a pid_t. Given the confusion about pid_t values vs namespace that's generally the better option anyway and I think we're better of restricting modules to do it that way. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18sysctl.c: remove unused variableSukanto Ghosh
Remoce the unused variable 'val' from __do_proc_dointvec() The integer has been declared and used as 'val = -val' and there is no reference to it anywhere. Signed-off-by: Sukanto Ghosh <sukanto.cse.iitb@gmail.com> Cc: Jaswinder Singh Rajput <jaswinder@kernel.org> Cc: Sukanto Ghosh <sukanto.cse.iitb@gmail.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18kthreads: simplify migration_thread() exit pathOleg Nesterov
Now that kthread_stop() can be used even if the task has already exited, we can kill the "wait_to_die:" loop in migration_thread(). But we must pin rq->migration_thread after creation. Actually, I don't think CPU_UP_CANCELED or CPU_DEAD should wait for ->migration_thread exit. Perhaps we can simplify this code a bit more. migration_call() can set ->should_stop and forget about this thread. But we need a new helper in kthred.c for that. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Vitaliy Gusev <vgusev@openvz.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18kthreads: rework kthread_stop()Oleg Nesterov
Based on Eric's patch which in turn was based on my patch. kthread_stop() has the nasty problems: - it runs unpredictably long with the global semaphore held. - it deadlocks if kthread itself does kthread_stop() before it obeys the kthread_should_stop() request. - it is not useable if kthread exits on its own, see for example the ugly "wait_to_die:" hack in migration_thread() - it is not possible to just tell kthread it should stop, we must always wait for its exit. With this patch kthread() allocates all neccesary data (struct kthread) on its own stack, globals kthread_stop_xxx are deleted. ->vfork_done is used as a pointer into "struct kthread", this means kthread_stop() can easily wait for kthread's exit. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Vitaliy Gusev <vgusev@openvz.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18kthreads: simplify the startup synchronizationOleg Nesterov
We use two completions two create the kernel thread, this is a bit ugly. kthread() wakes up create_kthread() via ->started, then create_kthread() wakes up the caller kthread_create() via ->done. But kthread() does not need to wait for kthread(), it can just return. Instead kthread() itself can wake up the caller of kthread_create(). Kill kthread_create_info->started, ->done is enough. This improves the scalability a bit and sijmplifies the code. The only problem if kernel_thread() fails, in that case create_kthread() must do complete(&create->done). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Vitaliy Gusev <vgusev@openvz.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18mm: exit.c reorder wait_opts to remove padding on 64 bit buildsRichard Kennedy
Reorder struct wait_opts to remove 8 bytes of alignment padding on 64 bit builds. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18do_wait: fix the theoretical race with stop/trace/contOleg Nesterov
do_wait: current->state = TASK_INTERRUPTIBLE; read_lock(&tasklist_lock); ... search for the task to reap ... In theory, the ->state changing can leak into the critical section. Since the child can change its status under read_lock(tasklist) in parallel (finish_stop/ptrace_stop), we can miss the wakeup if __wake_up_parent() sees us in TASK_RUNNING state. Add the barrier. Also, use __set_current_state() to set TASK_RUNNING. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18do_wait: kill the old BUG_ON, use while_each_thread()Oleg Nesterov
do_wait() does BUG_ON(tsk->signal != current->signal), this looks like a raher obsolete check. At least, I don't think do_wait() is the best place to verify that all threads have the same ->signal. Remove it. Also, change the code to use while_each_thread(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18do_wait: simplify retval/tsk_result/notask_error messOleg Nesterov
Now that we don't pass &retval down to other helpers we can simplify the code more. - kill tsk_result, just use retval - add the "notask" label right after the main loop, and s/got end/goto notask/ after the fastpath pid check. This way we don't need to initialize retval before this check and the code becomes a bit more clean, if this pid has no attached tasks we should just skip the list search. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18introduce "struct wait_opts" to simplify do_wait() patchesOleg Nesterov
Introduce "struct wait_opts" which holds the parameters for misc helpers in do_wait() pathes. This adds 13 lines to kernel/exit.c, but saves 256 bytes from .o and imho makes the code much more readable. This patch temporary uglifies rusage/siginfo code a little bit, will be addressed by further cleanups. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18shift "ptrace implies WUNTRACED" from ptrace_do_wait() to wait_task_stopped()Oleg Nesterov
No functional changes, preparation for the next patch. ptrace_do_wait() adds WUNTRACED to options for wait_task_stopped() which should always accept the stopped tracee, even if do_wait() was called without WUNTRACED. Change wait_task_stopped() to check "ptrace || WUNTRACED" instead. This makes the code more explicit, and "int options" argument becomes const in do_wait() pathes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18copy_process(): remove the unneeded clear_tsk_thread_flag(TIF_SIGPENDING)Oleg Nesterov
The forked child can have TIF_SIGPENDING if it was copied from parent's ti->flags. But this is harmless and actually almost never happens, because copy_process() can't succeed if signal_pending() == T. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18wait_task_zombie: do not use thread_group_cputime()Oleg Nesterov
There is no reason for thread_group_cputime() in wait_task_zombie(), there must be no other threads. This call was previously needed to collect the per-cpu data which we do not have any longer. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Roland McGrath <roland@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>