aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2008-08-18lockdep: fix spurious 'inconsistent lock state' warningDmitry Baryshkov
Since f82b217e3513fe3af342c0f3ee1494e86250c21c lockdep can output spurious warnings related to hwirqs due to hardirq_off shrinkage from int to bit-sized flag. Guard it with double negation to fix the warning. Signed-off-by: Dmitry Baryshkov <dbaryshkov@gmail.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-16Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: fix build if CONFIG_PROVE_LOCKING not defined lockdep: use WARN() in kernel/lockdep.c lockdep: spin_lock_nest_lock(), checkpatch fixes lockdep: build fix
2008-08-16Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: scale sysctl_sched_shares_ratelimit with nr_cpus sched: fix rt-bandwidth hotplug race sched: fix the race between walk_tg_tree and sched_create_group
2008-08-15Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: security: Fix setting of PF_SUPERPRIV by __capable()
2008-08-15lockdep: fix build if CONFIG_PROVE_LOCKING not definedStephen Hemminger
If CONFIG_PROVE_LOCKING not defined, then no dependency information is available. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-15sched: scale sysctl_sched_shares_ratelimit with nr_cpusPeter Zijlstra
David reported that his Niagra spend a little too much time in tg_shares_up(), which considering he has a large cpu count makes sense. So scale the ratelimit value with the number of cpus like we do for other controls as well. Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-15completions: uninline try_wait_for_completion and completion_doneDave Chinner
m68k fails to build with these functions inlined in completion.h. Move them out of line into sched.c and export them to avoid this problem. Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-15kexec: use a mutex for locking rather than xchg()Andrew Morton
Functionally the same, but more conventional. Cc: Huang Ying <ying.huang@intel.com> Tested-by: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-15kexec jump: fix for ftraceHuang Ying
Ftrace depends on some processor state that we destroyed during kexec and restored by restore_processor_state(). So save_processor_state() and restore_processor_state() are moved into machine_kexec() and ftrace is restored after restore_processor_state(). Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-15kexec jump: in sync with hibernation implementationHuang Ying
Add device_pm_lock() and device_pm_unlock() in kernel_kexec() in sync with current hibernation implementation. Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-15kexec jump: remove duplication of kexec_restart_prepare()Huang Ying
Call kernel_restart_prepare() in kernel_kexec() instead of duplicating the code. Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Pavel Machek <pavel@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-15kexec jump: rename KEXEC_CONTROL_CODE_SIZE to KEXEC_CONTROL_PAGE_SIZEHuang Ying
Rename KEXEC_CONTROL_CODE_SIZE to KEXEC_CONTROL_PAGE_SIZE, because control page is used for not only code on some platform. For example in kexec jump, it is used for data and stack too. [akpm@linux-foundation.org: unbreak powerpc and arm, finish conversion] Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-15kexec jump: clean up #ifdef and commentsHuang Ying
Move if (kexec_image->preserve_context) { ... } into #ifdef CONFIG_KEXEC_JUMP to make code looks cleaner. Fix no longer correct comments of kernel_kexec(). Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-15kexec: fix compilation warning on xchg(&kexec_lock, 0) in kernel_kexec()Huang Ying
kernel/kexec.c: In function 'kernel_kexec': kernel/kexec.c:1506: warning: value computed is not used Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-14sched: fix rt-bandwidth hotplug racePeter Zijlstra
When we hot-unplug a cpu and rebuild the sched-domain, all cpus will be detatched. Alex observed the case where a runqueue was stealing bandwidth from an already disabled runqueue to satisfy its own needs. Stop this by skipping over already disabled runqueues. Reported-by: Alex Nixon <alex.nixon@citrix.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Alex Nixon <alex.nixon@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-14security: Fix setting of PF_SUPERPRIV by __capable()David Howells
Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>
2008-08-14sched: fix the race between walk_tg_tree and sched_create_groupZhang, Yanmin
With 2.6.27-rc3, I hit a kernel panic when running volanoMark on my new x86_64 machine. I also hit it with other 2.6.27-rc kernels. See below log. Basically, function walk_tg_tree and sched_create_group have a race between accessing and initiating tg->children. Below patch fixes it by moving tg->children initiation to the front of linking tg->siblings to parent->children. {----------------panic log------------} BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: [<ffffffff802292ab>] walk_tg_tree+0x45/0x7f PGD 1be1c4067 PUD 1bdd8d067 PMD 0 Oops: 0000 [1] SMP CPU 11 Modules linked in: igb Pid: 22979, comm: java Not tainted 2.6.27-rc3 #1 RIP: 0010:[<ffffffff802292ab>] [<ffffffff802292ab>] walk_tg_tree+0x45/0x7f RSP: 0018:ffff8801bfbbbd18 EFLAGS: 00010083 RAX: 0000000000000000 RBX: ffff8800be0dce40 RCX: ffffffffffffffc0 RDX: ffff880102c43740 RSI: 0000000000000000 RDI: ffff8800be0dce40 RBP: ffff8801bfbbbd48 R08: ffff8800ba437bc8 R09: 0000000000001f40 R10: ffff8801be812100 R11: ffffffff805fdf44 R12: ffff880102c43740 R13: 0000000000000000 R14: ffffffff8022cf0f R15: ffffffff8022749f FS: 00000000568ac950(0063) GS:ffff8801bfa26d00(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000001bd848000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process java (pid: 22979, threadinfo ffff8801b145a000, task ffff8801bf18e450) Stack: 0000000000000001 ffff8800ba5c8d60 0000000000000001 0000000000000001 ffff8800bad1ccb8 0000000000000000 ffff8801bfbbbd98 ffffffff8022ed37 0000000000000001 0000000000000286 ffff8801bd5ee180 ffff8800ba437bc8 Call Trace: <IRQ> [<ffffffff8022ed37>] try_to_wake_up+0x71/0x24c [<ffffffff80247177>] autoremove_wake_function+0x9/0x2e [<ffffffff80228039>] ? __wake_up_common+0x46/0x76 [<ffffffff802296d5>] __wake_up+0x38/0x4f [<ffffffff806169cc>] tcp_v4_rcv+0x380/0x62e Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-13lockdep: use WARN() in kernel/lockdep.cArjan van de Ven
Use WARN() instead of a printk+WARN_ON() pair; this way the message becomes part of the warning section for better reporting/collection. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-08-13lockdep: spin_lock_nest_lock(), checkpatch fixesAndrew Morton
fix: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable #46: FILE: kernel/spinlock.c:326: +EXPORT_SYMBOL(_spin_lock_nest_lock); total: 0 errors, 1 warnings, 26 lines checked Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-13Merge commit 'v2.6.27-rc3' into core/urgentIngo Molnar
2008-08-13lockdep: build fixIngo Molnar
fix: kernel/built-in.o: In function `lockdep_stats_show': lockdep_proc.c:(.text+0x3cb2f): undefined reference to `lockdep_count_forward_deps' kernel/built-in.o: In function `l_show': lockdep_proc.c:(.text+0x3d02b): undefined reference to `lockdep_count_forward_deps' lockdep_proc.c:(.text+0x3d047): undefined reference to `lockdep_count_backward_deps' Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-12genirq: switch /proc/irq/*/smp_affinity et al to seqfilesAlexey Dobriyan
Switch /proc/irq/*/smp_affinity , /proc/irq/default_smp_affinity to seq_files. cat(1) reads with 1024 chunks by default, with high enough NR_CPUS, there will be -EINVAL. As side effect, there are now two less users of the ->read_proc interface. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Paul Jackson <pj@sgi.com> Cc: Mike Travis <travis@sgi.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-12cpu hotplug: s390 doesn't support additional_cpus anymore.Heiko Carstens
s390 doesn't support the additional_cpus kernel parameter anymore since a long time. So we better update the code and documentation to reflect that. Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-12Merge branch 'core-fixes-for-linus-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask(), fix
2008-08-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linusLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: fix spinlock recursion in hvc_console stop_machine: remove unused variable modules: extend initcall_debug functionality to the module loader export virtio_rng.h lguest: use get_user_pages_fast() instead of get_user_pages() mm: Make generic weak get_user_pages_fast and EXPORT_GPL it lguest: don't set MAC address for guest unless specified
2008-08-12generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask(), fixNick Piggin
> > Nick Piggin (1): > > generic-ipi: fix stack and rcu interaction bug in > > smp_call_function_mask() > > I'm still not 100% sure that I have this patch right... I might have seen > a lockup trace implicating the smp call function path... which may have > been due to some other problem or a different bug in the new call function > code, but if some more people can take a look at it before merging? OK indeed it did have a couple of bugs. Firstly, I wasn't freeing the data properly in the alloc && wait case. Secondly, I wasn't resetting CSD_FLAG_WAIT in the for each cpu loop (so only the first CPU would wait). After those fixes, the patch boots and runs with the kmalloc commented out (so it always executes the slowpath). Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-12stop_machine: remove unused variableLi Zefan
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-08-12modules: extend initcall_debug functionality to the module loaderArjan van de Ven
The kernel has this really nice facility where if you put "initcall_debug" on the kernel commandline, it'll print which function it's going to execute just before calling an initcall, and then after the call completes it will 1) print if it had an error code 2) checks for a few simple bugs (like leaving irqs off) and 3) print how long the init call took in milliseconds. While trying to optimize the boot speed of my laptop, I have been loving number 3 to figure out what to optimize... ... and then I wished that the same thing was done for module loading. This patch makes the module loader use this exact same functionality; it's a logical extension in my view (since modules are just sort of late binding initcalls anyway) and so far I've found it quite useful in finding where things are too slow in my boot. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-08-11Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched, cpu hotplug: fix set_cpus_allowed() use in hotplug callbacks sched: fix mysql+oltp regression sched_clock: delay using sched_clock() sched clock: couple local and remote clocks sched clock: simplify __update_sched_clock() sched: eliminate scd->prev_raw sched clock: clean up sched_clock_cpu() sched clock: revert various sched_clock() changes sched: move sched_clock before first use sched: test runtime rather than period in global_rt_runtime() sched: fix SCHED_HRTICK dependency sched: fix warning in hrtick_start_fair()
2008-08-11Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix-timers: fix posix_timer_event() vs dequeue_signal() race posix-timers: do_schedule_next_timer: fix the setting of ->si_overrun
2008-08-11Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: fix debug_lock_alloc lockdep: increase MAX_LOCKDEP_KEYS generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask() lockdep: fix overflow in the hlock shrinkage code lockdep: rename map_[acquire|release]() => lock_map_[acquire|release]() lockdep: handle chains involving classes defined in modules mm: fix mm_take_all_locks() locking order lockdep: annotate mm_take_all_locks() lockdep: spin_lock_nest_lock() lockdep: lock protection locks lockdep: map_acquire lockdep: shrink held_lock structure lockdep: re-annotate scheduler runqueues lockdep: lock_set_subclass - reset a held lock's subclass lockdep: change scheduler annotation debug_locks: set oops_in_progress if we will log messages. lockdep: fix combinatorial explosion in lock subgraph traversal
2008-08-12Merge branch 'core/locking' into core/urgentIngo Molnar
2008-08-12Merge branch 'sched/clock' into sched/urgentIngo Molnar
2008-08-11lockdep: fix debug_lock_allocPeter Zijlstra
When we enable DEBUG_LOCK_ALLOC but do not enable PROVE_LOCKING and or LOCK_STAT, lock_alloc() and lock_release() turn into nops, even though we should be doing hlock checking (check=1). This causes a false warning and a lockdep self-disable. Rectify this. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11sched, cpu hotplug: fix set_cpus_allowed() use in hotplug callbacksDmitry Adamushko
Mark Langsdorf reported: > One of my co-workers noticed that the powernow-k8 > driver no longer restarts when a CPU core is > hot-disabled and then hot-enabled on AMD quad-core > systems. > > The following comands work fine on 2.6.26 and fail > on 2.6.27-rc1: > > echo 0 > /sys/devices/system/cpu/cpu3/online > echo 1 > /sys/devices/system/cpu/cpu3/online > find /sys -name cpufreq > > For 2.6.26, the find will return a cpufreq > directory for each processor. In 2.6.27-rc1, > the cpu3 directory is missing. > > After digging through the code, the following > logic is failing when the core is hot-enabled > at runtime. The code works during the boot > sequence. > > cpumask_t = current->cpus_allowed; > set_cpus_allowed_ptr(current, &cpumask_of_cpu(cpu)); > if (smp_processor_id() != cpu) > return -ENODEV; So set the CPU active before calling the CPU_ONLINE notifier chain, there are a handful of notifiers that use set_cpus_allowed(). This fix also solves the problem with x86-microcode. I've sent alternative patches for microcode, but as this "rely on set_cpus_allowed_ptr() being workable in cpu-hotplug(CPU_ONLINE, ...)" assumption seems to be more broad than what we thought, perhaps this fix should be applied. With this patch we define that by the moment CPU_ONLINE is being sent, a 'cpu' is online and ready for tasks to be migrated onto it. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Reported-by: Mark Langsdorf <mark.langsdorf@amd.com> Tested-by: Mark Langsdorf <mark.langsdorf@amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask()Nick Piggin
* Venki Pallipadi <venkatesh.pallipadi@intel.com> wrote: > Found a OOPS on a big SMP box during an overnight reboot test with > upstream git. > > Suresh and I looked at the oops and looks like the root cause is in > generic_smp_call_function_interrupt() and smp_call_function_mask() with > wait parameter. > > The actual oops looked like > > [ 11.277260] BUG: unable to handle kernel paging request at ffff8802ffffffff > [ 11.277815] IP: [<ffff8802ffffffff>] 0xffff8802ffffffff > [ 11.278155] PGD 202063 PUD 0 > [ 11.278576] Oops: 0010 [1] SMP > [ 11.279006] CPU 5 > [ 11.279336] Modules linked in: > [ 11.279752] Pid: 0, comm: swapper Not tainted 2.6.27-rc2-00020-g685d87f #290 > [ 11.280039] RIP: 0010:[<ffff8802ffffffff>] [<ffff8802ffffffff>] 0xffff8802ffffffff > [ 11.280692] RSP: 0018:ffff88027f1f7f70 EFLAGS: 00010086 > [ 11.280976] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000000 > [ 11.281264] RDX: 0000000000004f4e RSI: 0000000000000001 RDI: 0000000000000000 > [ 11.281624] RBP: ffff88027f1f7f98 R08: 0000000000000001 R09: ffffffff802509af > [ 11.281925] R10: ffff8800280c2780 R11: 0000000000000000 R12: ffff88027f097d48 > [ 11.282214] R13: ffff88027f097d70 R14: 0000000000000005 R15: ffff88027e571000 > [ 11.282502] FS: 0000000000000000(0000) GS:ffff88027f1c3340(0000) knlGS:0000000000000000 > [ 11.283096] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > [ 11.283382] CR2: ffff8802ffffffff CR3: 0000000000201000 CR4: 00000000000006e0 > [ 11.283760] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 11.284048] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 11.284337] Process swapper (pid: 0, threadinfo ffff88027f1f2000, task ffff88027f1f0640) > [ 11.284936] Stack: ffffffff80250963 0000000000000212 0000000000ee8c78 0000000000ee8a66 > [ 11.285802] ffff88027e571550 ffff88027f1f7fa8 ffffffff8021adb5 ffff88027f1f3e40 > [ 11.286599] ffffffff8020bdd6 ffff88027f1f3e40 <EOI> ffff88027f1f3ef8 0000000000000000 > [ 11.287120] Call Trace: > [ 11.287768] <IRQ> [<ffffffff80250963>] ? generic_smp_call_function_interrupt+0x61/0x12c > [ 11.288354] [<ffffffff8021adb5>] smp_call_function_interrupt+0x17/0x27 > [ 11.288744] [<ffffffff8020bdd6>] call_function_interrupt+0x66/0x70 > [ 11.289030] <EOI> [<ffffffff8024ab3b>] ? clockevents_notify+0x19/0x73 > [ 11.289380] [<ffffffff803b9b75>] ? acpi_idle_enter_simple+0x18b/0x1fa > [ 11.289760] [<ffffffff803b9b6b>] ? acpi_idle_enter_simple+0x181/0x1fa > [ 11.290051] [<ffffffff8053aeca>] ? cpuidle_idle_call+0x70/0xa2 > [ 11.290338] [<ffffffff80209f61>] ? cpu_idle+0x5f/0x7d > [ 11.290723] [<ffffffff8060224a>] ? start_secondary+0x14d/0x152 > [ 11.291010] > [ 11.291287] > [ 11.291654] Code: Bad RIP value. > [ 11.292041] RIP [<ffff8802ffffffff>] 0xffff8802ffffffff > [ 11.292380] RSP <ffff88027f1f7f70> > [ 11.292741] CR2: ffff8802ffffffff > [ 11.310951] ---[ end trace 137c54d525305f1c ]--- > > The problem is with the following sequence of events: > > - CPU A calls smp_call_function_mask() for CPU B with wait parameter > - CPU A sets up the call_function_data on the stack and does an rcu add to > call_function_queue > - CPU A waits until the WAIT flag is cleared > - CPU B gets the call function interrupt and starts going through the > call_function_queue > - CPU C also gets some other call function interrupt and starts going through > the call_function_queue > - CPU C, which is also going through the call_function_queue, starts referencing > CPU A's stack, as that element is still in call_function_queue > - CPU B finishes the function call that CPU A set up and as there are no other > references to it, rcu deletes the call_function_data (which was from CPU A > stack) > - CPU B sees the wait flag and just clears the flag (no call_rcu to free) > - CPU A which was waiting on the flag continues executing and the stack > contents change > > - CPU C is still in rcu_read section accessing the CPU A's stack sees > inconsistent call_funation_data and can try to execute > function with some random pointer, causing stack corruption for A > (by clearing the bits in mask field) and oops. Nice debugging work. I'd suggest something like the attached (boot tested) patch as the simple fix for now. I expect the benefits from the less synchronized, multiple-in-flight-data global queue will still outweigh the costs of dynamic allocations. But if worst comes to worst then we just go back to a globally synchronous one-at-a-time implementation, but that would be pretty sad! Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11sched: fix mysql+oltp regressionMike Galbraith
Defer commit 6d299f1b53b84e2665f402d9bcc494800aba6386 to the next release. Testing of the tip/sched/clock tree revealed a mysql+oltp regression which bisection eventually traced back to this commit in mainline. Pertinent test results: Three run sysbench averages, throughput units in read/write requests/sec. clients 1 2 4 8 16 32 64 6e0534f 9646 17876 34774 33868 32230 30767 29441 2.6.26.1 9112 17936 34652 33383 31929 30665 29232 6d299f1 9112 14637 28370 33339 32038 30762 29204 Note: subsequent commits hide the majority of this regression until you apply the clock fixes, at which time it reemerges at full magnitude. We cannot see anything bad about the change itself so we defer it to the next release until this problem is fully analysed. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11Merge branch 'linus' into sched/urgentIngo Molnar
2008-08-11lockdep: rename map_[acquire|release]() => lock_map_[acquire|release]()Ingo Molnar
the names were too generic: drivers/uio/uio.c:87: error: expected identifier or '(' before 'do' drivers/uio/uio.c:87: error: expected identifier or '(' before 'while' drivers/uio/uio.c:113: error: 'map_release' undeclared here (not in a function) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11lockdep: handle chains involving classes defined in modulesRabin Vincent
Solve this by marking the classes as unused and not printing information about the unused classes. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11lockdep: spin_lock_nest_lock()Peter Zijlstra
Expose the new lock protection lock. This can be used to annotate places where we take multiple locks of the same class and avoid deadlocks by always taking another (top-level) lock first. NOTE: we're still bound to the MAX_LOCK_DEPTH (48) limit. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11lockdep: lock protection locksPeter Zijlstra
On Fri, 2008-08-01 at 16:26 -0700, Linus Torvalds wrote: > On Fri, 1 Aug 2008, David Miller wrote: > > > > Taking more than a few locks of the same class at once is bad > > news and it's better to find an alternative method. > > It's not always wrong. > > If you can guarantee that anybody that takes more than one lock of a > particular class will always take a single top-level lock _first_, then > that's all good. You can obviously screw up and take the same lock _twice_ > (which will deadlock), but at least you cannot get into ABBA situations. > > So maybe the right thing to do is to just teach lockdep about "lock > protection locks". That would have solved the multi-queue issues for > networking too - all the actual network drivers would still have taken > just their single queue lock, but the one case that needs to take all of > them would have taken a separate top-level lock first. > > Never mind that the multi-queue locks were always taken in the same order: > it's never wrong to just have some top-level serialization, and anybody > who needs to take <n> locks might as well do <n+1>, because they sure as > hell aren't going to be on _any_ fastpaths. > > So the simplest solution really sounds like just teaching lockdep about > that one special case. It's not "nesting" exactly, although it's obviously > related to it. Do as Linus suggested. The lock protection lock is called nest_lock. Note that we still have the MAX_LOCK_DEPTH (48) limit to consider, so anything that spills that it still up shit creek. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11lockdep: map_acquirePeter Zijlstra
Most the free-standing lock_acquire() usages look remarkably similar, sweep them into a new helper. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11lockdep: shrink held_lock structureDave Jones
struct held_lock { u64 prev_chain_key; /* 0 8 */ struct lock_class * class; /* 8 8 */ long unsigned int acquire_ip; /* 16 8 */ struct lockdep_map * instance; /* 24 8 */ int irq_context; /* 32 4 */ int trylock; /* 36 4 */ int read; /* 40 4 */ int check; /* 44 4 */ int hardirqs_off; /* 48 4 */ /* size: 56, cachelines: 1 */ /* padding: 4 */ /* last cacheline: 56 bytes */ }; struct held_lock { u64 prev_chain_key; /* 0 8 */ long unsigned int acquire_ip; /* 8 8 */ struct lockdep_map * instance; /* 16 8 */ unsigned int class_idx:11; /* 24:21 4 */ unsigned int irq_context:2; /* 24:19 4 */ unsigned int trylock:1; /* 24:18 4 */ unsigned int read:2; /* 24:16 4 */ unsigned int check:2; /* 24:14 4 */ unsigned int hardirqs_off:1; /* 24:13 4 */ /* size: 32, cachelines: 1 */ /* padding: 4 */ /* bit_padding: 13 bits */ /* last cacheline: 32 bytes */ }; [mingo@elte.hu: shrunk hlock->class too] [peterz@infradead.org: fixup bit sizes] Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-08-11lockdep: re-annotate scheduler runqueuesPeter Zijlstra
Instead of using a per-rq lock class, use the regular nesting operations. However, take extra care with double_lock_balance() as it can release the already held rq->lock (and therefore change its nesting class). So what can happen is: spin_lock(rq->lock); // this rq subclass 0 double_lock_balance(rq, other_rq); // release rq // acquire other_rq->lock subclass 0 // acquire rq->lock subclass 1 spin_unlock(other_rq->lock); leaving you with rq->lock in subclass 1 So a subsequent double_lock_balance() call can try to nest a subclass 1 lock while already holding a subclass 1 lock. Fix this by introducing double_unlock_balance() which releases the other rq's lock, but also re-sets the subclass for this rq's lock to 0. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11lockdep: lock_set_subclass - reset a held lock's subclassPeter Zijlstra
this can be used to reset a held lock's subclass, for arbitrary-depth iterated data structures such as trees or lists which have per-node locks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-11Merge branch 'linus' into sched/clockIngo Molnar
2008-08-11sched_clock: delay using sched_clock()Peter Zijlstra
Some arch's can't handle sched_clock() being called too early - delay this until sched_clock_init() has been called. Reported-by: Bill Gatliff <bgat@billgatliff.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Nishanth Aravamudan <nacc@us.ibm.com> CC: Russell King - ARM Linux <linux@arm.linux.org.uk> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-07DMA: make dma-coherent.c documentation kdoc-friendlyDmitry Baryshkov
Spotted by Randy. Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Dmitry Baryshkov <dbaryshkov@gmail.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2008-08-05pm_qos: spelling fixesRichard Hughes
A documentation cleanup patch. With a minor tweak to clarify units for kbs. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: mark gross <mgross@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>