aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2009-11-05Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c sched: Disable SD_PREFER_LOCAL at node level sched: Fix boot crash by zalloc()ing most of the cpu masks sched: Strengthen buddies and mitigate buddy induced latencies
2009-11-05Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ftrace: Fix unmatched locking in ftrace_regex_write() ring-buffer: Synchronize resizing buffer with reader lock
2009-11-04ftrace: Fix unmatched locking in ftrace_regex_write()Li Zefan
When a command is passed to the set_ftrace_filter, then the ftrace_regex_lock is still held going back to user space. # echo 'do_open : foo' > set_ftrace_filter (still holding ftrace_regex_lock when returning to user space!) Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4AEF7F8A.3080300@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-11-04ring-buffer: Synchronize resizing buffer with reader lockLai Jiangshan
We got a sudden panic when we reduced the size of the ringbuffer. We can reproduce the panic by the following steps: echo 1 > events/sched/enable cat trace_pipe > /dev/null & while ((1)) do echo 12000 > buffer_size_kb echo 512 > buffer_size_kb done (not more than 5 seconds, panic ...) Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4AF01735.9060409@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-11-03Merge branch 'pm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: Remove some debug messages producing too much noise PM: Fix warning on suspend errors PM / Hibernate: Add newline to load_image() fail path PM / Hibernate: Fix error handling in save_image() PM / Hibernate: Fix blkdev refleaks PM / yenta: Split resume into early and late parts (rev. 4)
2009-11-03Correct nr_processes() when CPUs have been unpluggedIan Campbell
nr_processes() returns the sum of the per cpu counter process_counts for all online CPUs. This counter is incremented for the current CPU on fork() and decremented for the current CPU on exit(). Since a process does not necessarily fork and exit on the same CPU the process_count for an individual CPU can be either positive or negative and effectively has no meaning in isolation. Therefore calculating the sum of process_counts over only the online CPUs omits the processes which were started or stopped on any CPU which has since been unplugged. Only the sum of process_counts across all possible CPUs has meaning. The only caller of nr_processes() is proc_root_getattr() which calculates the number of links to /proc as stat->nlink = proc_root.nlink + nr_processes(); You don't have to be all that unlucky for the nr_processes() to return a negative value leading to a negative number of links (or rather, an apparently enormous number of links). If this happens then you can get failures where things like "ls /proc" start to fail because they got an -EOVERFLOW from some stat() call. Example with some debugging inserted to show what goes on: # ps haux|wc -l nr_processes: CPU0: 90 nr_processes: CPU1: 1030 nr_processes: CPU2: -900 nr_processes: CPU3: -136 nr_processes: TOTAL: 84 proc_root_getattr. nlink 12 + nr_processes() 84 = 96 84 # echo 0 >/sys/devices/system/cpu/cpu1/online # ps haux|wc -l nr_processes: CPU0: 85 nr_processes: CPU2: -901 nr_processes: CPU3: -137 nr_processes: TOTAL: -953 proc_root_getattr. nlink 12 + nr_processes() -953 = -941 75 # stat /proc/ nr_processes: CPU0: 84 nr_processes: CPU2: -901 nr_processes: CPU3: -137 nr_processes: TOTAL: -954 proc_root_getattr. nlink 12 + nr_processes() -954 = -942 File: `/proc/' Size: 0 Blocks: 0 IO Block: 1024 directory Device: 3h/3d Inode: 1 Links: 4294966354 Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2009-11-03 09:06:55.000000000 +0000 Modify: 2009-11-03 09:06:55.000000000 +0000 Change: 2009-11-03 09:06:55.000000000 +0000 I'm not 100% convinced that the per_cpu regions remain valid for offline CPUs, although my testing suggests that they do. If not then I think the correct solution would be to aggregate the process_count for a given CPU into a global base value in cpu_down(). This bug appears to pre-date the transition to git and it looks like it may even have been present in linux-2.6.0-test7-bk3 since it looks like the code Rusty patched in http://lwn.net/Articles/64773/ was already wrong. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-03PM / Hibernate: Add newline to load_image() fail pathJiri Slaby
Finish a line by \n when load_image fails in the middle of loading. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-11-03PM / Hibernate: Fix error handling in save_image()Jiri Slaby
There are too many retval variables in save_image(). Thus error return value from snapshot_read_next() may be ignored and only part of the snapshot (successfully) written. Remove 'error' variable, invert the condition in the do-while loop and convert the loop to use only 'ret' variable. Switch the rest of the function to consider only 'ret'. Also make sure we end printed line by \n if an error occurs. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-11-03PM / Hibernate: Fix blkdev refleaksJiri Slaby
While cruising through the swsusp code I found few blkdev reference leaks of resume_bdev. swsusp_read: remove blkdev_put altogether. Some fail paths do not do that. swsusp_check: make sure we always put a reference on fail paths software_resume: all fail paths between swsusp_check and swsusp_read omit swsusp_close. Add it in those cases. And since swsusp_read doesn't drop the reference anymore, do it here unconditionally. [rjw: Fixed a small coding style issue.] Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-11-03sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.cMike Galbraith
Eric Paris reported that commit f685ceacab07d3f6c236f04803e2f2f0dbcc5afb causes boot time PREEMPT_DEBUG complaints. [ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314 [ 4.593043] caller is task_hot+0x86/0xd0 Since kthread_bind() messes with scheduler internals, move the body to sched.c, and lock the runqueue. Reported-by: Eric Paris <eparis@redhat.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Tested-by: Eric Paris <eparis@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1256813310.7574.3.camel@marge.simson.net> [ v2: fix !SMP build and clean up ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-02Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Fix spurious wakeup for requeue_pi really
2009-11-02Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Remove -Wcast-align perf tools: Fix compatibility with libelf 0.8 and autodetect perf events: Don't generate events for the idle task when exclude_idle is set perf events: Fix swevent hrtimer sampling by keeping track of remaining time when enabling/disabling swevent hrtimers
2009-11-02Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Remove cpu arg from the rb_time_stamp() function tracing: Fix comment typo and documentation example tracing: Fix trace_seq_printf() return value tracing: Update *ppos instead of filp->f_pos
2009-11-02sched: Fix boot crash by zalloc()ing most of the cpu masksRusty Russell
I got a boot crash when forcing cpumasks offstack on 32 bit, because find_new_ilb() returned 3 on my UP system (nohz.cpu_mask wasn't zeroed). AFAICT the others need to be zeroed too: only nohz.ilb_grp_nohz_mask is initialized before use. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <200911022037.21282.rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-29Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: sched: move rq_weight data array out of .percpu percpu: allow pcpu_alloc() to be called with IRQs off
2009-10-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-param-fixesLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-param-fixes: param: fix setting arrays of bool param: fix NULL comparison on oom param: fix lots of bugs with writing charp params from sysfs, by leaking mem.
2009-10-29Merge branch 'hwpoison-2.6.32' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: HWPOISON: fix invalid page count in printk output HWPOISON: Allow schedule_on_each_cpu() from keventd HWPOISON: fix/proc/meminfo alignment HWPOISON: fix oops on ksm pages HWPOISON: Fix page count leak in hwpoison late kill in do_swap_page HWPOISON: return early on non-LRU pages HWPOISON: Add brief hwpoison description to Documentation HWPOISON: Clean up PR_MCE_KILL interface
2009-10-29Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Move drop_futex_key_refs out of spinlock'ed region rcu: Fix TREE_PREEMPT_RCU CPU_HOTPLUG bad-luck hang rcu: Stopgap fix for synchronize_rcu_expedited() for TREE_PREEMPT_RCU rcu: Prevent RCU IPI storms in presence of high call_rcu() load futex: Check for NULL keys in match_futex futex: Handle spurious wake up
2009-10-29Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf timechart: Improve the visual appearance of scheduler delays perf timechart: Fix the wakeup-arrows that point to non-visible processes perf top: Fix --delay_secs 0 division by zero perf tools: Bump version to 0.0.2 perf_event: Adjust frequency and unthrottle for non-group-leader events
2009-10-29Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Do less agressive buddy clearing sched: Disable SD_PREFER_LOCAL for MC/CPU domains
2009-10-29sysctl: fix false positives when PROC_SYSCTL=nAlexey Dobriyan
Having ->procname but not ->proc_handler is valid when PROC_SYSCTL=n, people use such combination to reduce ifdefs with non-standard handlers. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14408 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: Peter Teoh <htmldeveloper@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29cgroup: fix strstrip() misuseKOSAKI Motohiro
cgroup_write_X64() and cgroup_write_string() ignore the return value of strstrip(). it makes small inconsistent behavior. example: ========================= # cd /mnt/cgroup/hoge # cat memory.swappiness 60 # echo "59 " > memory.swappiness # cat memory.swappiness 59 # echo " 58" > memory.swappiness bash: echo: write error: Invalid argument This patch fixes it. Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29connector: fix regression introduced by sid connectorChristian Borntraeger
Since commit 02b51df1b07b4e9ca823c89284e704cadb323cd1 (proc connector: add event for process becoming session leader) we have the following warning: Badness at kernel/softirq.c:143 [...] Krnl PSW : 0404c00180000000 00000000001481d4 (local_bh_enable+0xb0/0xe0) [...] Call Trace: ([<000000013fe04100>] 0x13fe04100) [<000000000048a946>] sk_filter+0x9a/0xd0 [<000000000049d938>] netlink_broadcast+0x2c0/0x53c [<00000000003ba9ae>] cn_netlink_send+0x272/0x2b0 [<00000000003baef0>] proc_sid_connector+0xc4/0xd4 [<0000000000142604>] __set_special_pids+0x58/0x90 [<0000000000159938>] sys_setsid+0xb4/0xd8 [<00000000001187fe>] sysc_noemu+0x10/0x16 [<00000041616cb266>] 0x41616cb266 The warning is ---> WARN_ON_ONCE(in_irq() || irqs_disabled()); The network code must not be called with disabled interrupts but sys_setsid holds the tasklist_lock with spinlock_irq while calling the connector. After a discussion we agreed that we can move proc_sid_connector from __set_special_pids to sys_setsid. We also agreed that it is sufficient to change the check from task_session(curr) != pid into err > 0, since if we don't change the session, this means we were already the leader and return -EPERM. One last thing: There is also daemonize(), and some people might want to get a notification in that case. Since daemonize() is only needed if a user space does kernel_thread this does not look important (and there seems to be no consensus if this connector should be called in daemonize). If we really want this, we can add proc_sid_connector to daemonize() in an additional patch (Scott?) Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Scott James Remnant <scott@ubuntu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: David S. Miller <davem@davemloft.net> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29param: fix setting arrays of boolRusty Russell
We create a dummy struct kernel_param on the stack for parsing each array element, but we didn't initialize the flags word. This matters for arrays of type "bool", where the flag indicates if it really is an array of bools or unsigned int (old-style). Reported-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org
2009-10-29param: fix NULL comparison on oomRusty Russell
kp->arg is always true: it's the contents of that pointer we care about. Reported-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org
2009-10-29param: fix lots of bugs with writing charp params from sysfs, by leaking mem.Rusty Russell
e180a6b7759a "param: fix charp parameters set via sysfs" fixed the case where charp parameters written via sysfs were freed, leaving drivers accessing random memory. Unfortunately, storing a flag in the kparam struct was a bad idea: it's rodata so setting it causes an oops on some archs. But that's not all: 1) module_param_array() on charp doesn't work reliably, since we use an uninitialized temporary struct kernel_param. 2) there's a fundamental race if a module uses this parameter and then it's changed: they will still access the old, freed, memory. The simplest fix (ie. for 2.6.32) is to never free the memory. This prevents all these problems, at cost of a memory leak. In practice, there are only 18 places where a charp is writable via sysfs, and all are root-only writable. Reported-by: Takashi Iwai <tiwai@suse.de> Cc: Sitsofe Wheeler <sitsofe@yahoo.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christof Schmitt <christof.schmitt@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org
2009-10-28futex: Fix spurious wakeup for requeue_pi reallyThomas Gleixner
The requeue_pi path doesn't use unqueue_me() (and the racy lock_ptr == NULL test) nor does it use the wake_list of futex_wake() which where the reason for commit 41890f2 (futex: Handle spurious wake up) See debugging discussing on LKML Message-ID: <4AD4080C.20703@us.ibm.com> The changes in this fix to the wait_requeue_pi path were considered to be a likely unecessary, but harmless safety net. But it turns out that due to the fact that for unknown $@#!*( reasons EWOULDBLOCK is defined as EAGAIN we built an endless loop in the code path which returns correctly EWOULDBLOCK. Spurious wakeups in wait_requeue_pi code path are unlikely so we do the easy solution and return EWOULDBLOCK^WEAGAIN to user space and let it deal with the spurious wakeup. Cc: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> LKML-Reference: <4AE23C74.1090502@us.ibm.com> Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-10-29sched: move rq_weight data array out of .percpuJiri Kosina
Commit 34d76c41 introduced percpu array update_shares_data, size of which being proportional to NR_CPUS. Unfortunately this blows up ia64 for large NR_CPUS configuration, as ia64 allows only 64k for .percpu section. Fix this by allocating this array dynamically and keep only pointer to it percpu. The per-cpu handling doesn't impose significant performance penalty on potentially contented path in tg_shares_up(). ... ffffffff8104337c: 65 48 8b 14 25 20 cd mov %gs:0xcd20,%rdx ffffffff81043383: 00 00 ffffffff81043385: 48 c7 c0 00 e1 00 00 mov $0xe100,%rax ffffffff8104338c: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp) ffffffff81043393: 00 ffffffff81043394: 48 c7 45 a8 00 00 00 movq $0x0,-0x58(%rbp) ffffffff8104339b: 00 ffffffff8104339c: 48 01 d0 add %rdx,%rax ffffffff8104339f: 49 8d 94 24 08 01 00 lea 0x108(%r12),%rdx ffffffff810433a6: 00 ffffffff810433a7: b9 ff ff ff ff mov $0xffffffff,%ecx ffffffff810433ac: 48 89 45 b0 mov %rax,-0x50(%rbp) ffffffff810433b0: bb 00 04 00 00 mov $0x400,%ebx ffffffff810433b5: 48 89 55 c0 mov %rdx,-0x40(%rbp) ... After: ... ffffffff8104337c: 65 8b 04 25 28 cd 00 mov %gs:0xcd28,%eax ffffffff81043383: 00 ffffffff81043384: 48 98 cltq ffffffff81043386: 49 8d bc 24 08 01 00 lea 0x108(%r12),%rdi ffffffff8104338d: 00 ffffffff8104338e: 48 8b 15 d3 7f 76 00 mov 0x767fd3(%rip),%rdx # ffffffff817ab368 <update_shares_data> ffffffff81043395: 48 8b 34 c5 00 ee 6d mov -0x7e921200(,%rax,8),%rsi ffffffff8104339c: 81 ffffffff8104339d: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp) ffffffff810433a4: 00 ffffffff810433a5: b9 ff ff ff ff mov $0xffffffff,%ecx ffffffff810433aa: 48 89 7d c0 mov %rdi,-0x40(%rbp) ffffffff810433ae: 48 c7 45 a8 00 00 00 movq $0x0,-0x58(%rbp) ffffffff810433b5: 00 ffffffff810433b6: bb 00 04 00 00 mov $0x400,%ebx ffffffff810433bb: 48 01 f2 add %rsi,%rdx ffffffff810433be: 48 89 55 b0 mov %rdx,-0x50(%rbp) ... Signed-off-by: Jiri Kosina <jkosina@suse.cz> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Tejun Heo <tj@kernel.org>
2009-10-24tracing: Remove cpu arg from the rb_time_stamp() functionJiri Olsa
The cpu argument is not used inside the rb_time_stamp() function. Plus fix a typo. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233647.118547500@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-24tracing: Fix comment typo and documentation exampleJiri Olsa
Trivial patch to fix a documentation example and to fix a comment. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233646.871719877@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-24tracing: Fix trace_seq_printf() return valueJiri Olsa
trace_seq_printf() return value is a little ambiguous. It currently returns the length of the space available in the buffer. printf usually returns the amount written. This is not adequate here, because: trace_seq_printf(s, ""); is perfectly legal, and returning 0 would indicate that it failed. We can always see the amount written by looking at the before and after values of s->len. This is not quite the same use as printf. We only care if the string was successfully written to the buffer or not. Make trace_seq_printf() return 0 if the trace oversizes the buffer's free space, 1 otherwise. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233646.631787612@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-24tracing: Update *ppos instead of filp->f_posJiri Olsa
Instead of directly updating filp->f_pos we should update the *ppos argument. The filp->f_pos gets updated within the file_pos_write() function called from sys_write(). Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233646.399670810@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-23sched: Strengthen buddies and mitigate buddy induced latenciesMike Galbraith
This patch restores the effectiveness of LAST_BUDDY in preventing pgsql+oltp from collapsing due to wakeup preemption. It also switches LAST_BUDDY to exclusively do what it does best, namely mitigate the effects of aggressive wakeup preemption, which improves vmark throughput markedly, and restores mysql+oltp scalability. Since buddies are about scalability, enable them beginning at the point where we begin expanding sched_latency, namely sched_nr_latency. Previously, buddies were cleared aggressively, which seriously reduced their effectiveness. Not clearing aggressively however, produces a small drop in mysql+oltp throughput immediately after peak, indicating that LAST_BUDDY is actually doing some harm. This is right at the point where X on the desktop in competition with another load wants low latency service. Ergo, do not enable until we need to scale. To mitigate latency induced by buddies, or by a task just missing wakeup preemption, check latency at tick time. Last hunk prevents buddies from stymieing BALANCE_NEWIDLE via CACHE_HOT_BUDDY. Supporting performance tests: tip = v2.6.32-rc5-1497-ga525b32 tipx = NO_GENTLE_FAIR_SLEEPERS NEXT_BUDDY granularity knobs = 31 knobs + 31 buddies tip+x = NO_GENTLE_FAIR_SLEEPERS granularity knobs = 31 knobs (Three run averages except where noted.) vmark: ------ tip 108466 messages per second tip+ 125307 messages per second tip+x 125335 messages per second tipx 117781 messages per second 2.6.31.3 122729 messages per second mysql+oltp: ----------- clients 1 2 4 8 16 32 64 128 256 .......................................................................................... tip 9949.89 18690.20 34801.24 34460.04 32682.88 30765.97 28305.27 25059.64 19548.08 tip+ 10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47 tipx 9698.71 18002.70 34477.56 33420.01 32634.30 31657.27 29932.67 26827.52 21487.18 2.6.31.3 8243.11 18784.20 34404.83 33148.38 31900.32 31161.90 29663.81 25995.94 18058.86 pgsql+oltp: ----------- clients 1 2 4 8 16 32 64 128 256 .......................................................................................... tip 13686.37 26609.25 51934.28 51347.81 49479.51 45312.65 36691.91 26851.57 24145.35 tip+ (1x) 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94 tip+x 13906.78 27065.81 52951.19 52542.59 52176.11 51815.94 50838.90 49439.46 46891.00 tipx 13742.46 26769.81 52351.99 51891.73 51320.79 50938.98 50248.65 48908.70 46553.84 2.6.31.3 13815.35 26906.46 52683.34 52061.31 51937.10 51376.80 50474.28 49394.47 47003.25 Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-23perf events: Don't generate events for the idle task when exclude_idle is setSoeren Sandmann
Getting samples for the idle task is often not interesting, so don't generate them when exclude_idle is set for the event in question. Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <ye8pr8fmlq7.fsf@camel16.daimi.au.dk> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-23perf events: Fix swevent hrtimer sampling by keeping track of remaining time ↵Soeren Sandmann
when enabling/disabling swevent hrtimers Make the hrtimer based events work for sysprof. Whenever a swevent is scheduled out, the hrtimer is canceled. When it is scheduled back in, the timer is restarted. This happens every scheduler tick, which means the timer never expired because it was getting repeatedly restarted over and over with the same period. To fix that, save the remaining time when disabling; when reenabling, use that saved time as the period instead of the user-specified sampling period. Also, move the starting and stopping of the hrtimers to helper functions instead of duplicating the code. Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com> LKML-Reference: <ye8vdi7mluz.fsf@camel16.daimi.au.dk> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-22PM: Make warning in suspend_test_finish() less likely to happenRafael J. Wysocki
Increase TEST_SUSPEND_SECONDS to 10 so the warning in suspend_test_finish() doesn't annoy the users of slower systems so much. Also, make the warning print the suspend-resume cycle time, so that we know why the warning actually triggered. Patch prepared during the hacking session at the Kernel Summit in Tokyo. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-19HWPOISON: Allow schedule_on_each_cpu() from keventdAndi Kleen
Right now when calling schedule_on_each_cpu() from keventd there is a deadlock because it tries to schedule a work item on the current CPU too. This happens via lru_add_drain_all() in hwpoison. Just call the function for the current CPU in this case. This is actually faster too. Debugging with Fengguang Wu & Max Asbock Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-10-16futex: Move drop_futex_key_refs out of spinlock'ed regionDarren Hart
When requeuing tasks from one futex to another, the reference held by the requeued task to the original futex location needs to be dropped eventually. Dropping the reference may ultimately lead to a call to "iput_final" and subsequently call into filesystem- specific code - which may be non-atomic. It is therefore safer to defer this drop operation until after the futex_hash_bucket spinlock has been dropped. Originally-From: Helge Bahmann <hcb@chaoticmind.net> Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: <stable@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com> Cc: Sven-Thorsten Dietrich <sdietrich@novell.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <4AD7A298.5040802@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-15Merge the right tty-fixes branchLinus Torvalds
* branch 'tty-fixes' tty: use the new 'flush_delayed_work()' helper to do ldisc flush workqueue: add 'flush_delayed_work()' to run and wait for delayed work tty: Make flush_to_ldisc() locking more robust
2009-10-15rcu: Fix TREE_PREEMPT_RCU CPU_HOTPLUG bad-luck hangPaul E. McKenney
If the following sequence of events occurs, then TREE_PREEMPT_RCU will hang waiting for a grace period to complete, eventually OOMing the system: o A TREE_PREEMPT_RCU build of the kernel is booted on a system with more than 64 physical CPUs present (32 on a 32-bit system). Alternatively, a TREE_PREEMPT_RCU build of the kernel is booted with RCU_FANOUT set to a sufficiently small value that the physical CPUs populate two or more leaf rcu_node structures. o A task is preempted in an RCU read-side critical section while running on a CPU corresponding to a given leaf rcu_node structure. o All CPUs corresponding to this same leaf rcu_node structure record quiescent states for the current grace period. o All of these same CPUs go offline (hence the need for enough physical CPUs to populate more than one leaf rcu_node structure). This causes the preempted task to be moved to the root rcu_node structure. At this point, there is nothing left to cause the quiescent state to be propagated up the rcu_node tree, so the current grace period never completes. The simplest fix, especially after considering the deadlock possibilities, is to detect this situation when the last CPU is offlined, and to set that CPU's ->qsmask bit in its leaf rcu_node structure. This will cause the next invocation of force_quiescent_state() to end the grace period. Without this fix, this hang can be triggered in an hour or so on some machines with rcutorture and random CPU onlining/offlining. With this fix, these same machines pass a full 10 hours of this sort of abuse. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20091015162614.GA19131@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-15rcu: Stopgap fix for synchronize_rcu_expedited() for TREE_PREEMPT_RCUPaul E. McKenney
For the short term, map synchronize_rcu_expedited() to synchronize_rcu() for TREE_PREEMPT_RCU and to synchronize_sched_expedited() for TREE_RCU. Longer term, there needs to be a real expedited grace period for TREE_PREEMPT_RCU, but candidate patches to date are considerably more complex and intrusive. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: npiggin@suse.de Cc: jens.axboe@oracle.com LKML-Reference: <12555405592331-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-15rcu: Prevent RCU IPI storms in presence of high call_rcu() loadPaul E. McKenney
As the number of callbacks on a given CPU rises, invoke force_quiescent_state() only every blimit number of callbacks (defaults to 10,000), and even then only if no other CPU has invoked force_quiescent_state() in the meantime. This should fix the performance regression reported by Nick. Reported-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: jens.axboe@oracle.com LKML-Reference: <12555405592133-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-14Merge branch 'tty-fixes'Linus Torvalds
* branch 'tty-fixes': tty: use the new 'flush_delayed_work()' helper to do ldisc flush workqueue: add 'flush_delayed_work()' to run and wait for delayed work Make flush_to_ldisc properly handle parallel calls
2009-10-14Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: oprofile: warn on freeing event buffer too early oprofile: fix race condition in event_buffer free lockdep: Use cpu_clock() for lockstat
2009-10-14Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix missing kernel-doc notation Revert "x86, timers: Check for pending timers after (device) interrupts" sched: Update the clock of runqueue select_task_rq() selected
2009-10-14Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing/filters: Fix memory leak when setting a filter tracing: fix trace_vprintk call
2009-10-14workqueue: add 'flush_delayed_work()' to run and wait for delayed workLinus Torvalds
It basically turns a delayed work into an immediate work, and then waits for it to finish, thus allowing you to force (and wait for) an immediate flush of a delayed work. We'll want to use this in the tty layer to clean up tty_flush_to_ldisc(). Acked-by: Oleg Nesterov <oleg@redhat.com> [ Fixed to use 'del_timer_sync()' as noted by Oleg ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-14futex: Check for NULL keys in match_futexDarren Hart
If userspace tries to perform a requeue_pi on a non-requeue_pi waiter, it will find the futex_q->requeue_pi_key to be NULL and OOPS. Check for NULL in match_futex() instead of doing explicit NULL pointer checks on all call sites. While match_futex(NULL, NULL) returning false is a little odd, it's still correct as we expect valid key references. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Dinakar Guniguntala <dino@in.ibm.com> CC: John Stultz <johnstul@us.ibm.com> Cc: stable@kernel.org LKML-Reference: <4AD60687.10306@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-10-14workqueue: add 'flush_delayed_work()' to run and wait for delayed workLinus Torvalds
It basically turns a delayed work into an immediate work, and then waits for it to finish.
2009-10-14sched: Do less agressive buddy clearingPeter Zijlstra
Yanmin reported a hackbench regression due to: > commit de69a80be32445b0a71e8e3b757e584d7beb90f7 > Author: Peter Zijlstra <a.p.zijlstra@chello.nl> > Date: Thu Sep 17 09:01:20 2009 +0200 > > sched: Stop buddies from hogging the system I really liked de69a80b, and it affecting hackbench shows I wasn't crazy ;-) So hackbench is a multi-cast, with one sender spraying multiple receivers, who in their turn don't spray back. This would be exactly the scenario that patch 'cures'. Previously we would not clear the last buddy after running the next task, allowing the sender to get back to work sooner than it otherwise ought to have been, increasing latencies for other tasks. Now, since those receivers don't poke back, they don't enforce the buddy relation, which means there's nothing to re-elect the sender. Cure this by less agressively clearing the buddy stats. Only clear buddies when they were not chosen. It should still avoid a buddy sticking around long after its served its time. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Mike Galbraith <efault@gmx.de> LKML-Reference: <1255084986.8802.46.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>