aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2009-05-17mlx4_en: Fix not deleted napi structuresYevgeny Petrilin
Napi structures are being created each time we open a port, but when the port is closed the napi structure is only disabled but not removed. This bug caused hang while removing the driver. Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-17ipconfig: handle case of delayed DHCP serverChris Friesen
If a DHCP server is delayed, it's possible for the client to receive the DHCPOFFER after it has already sent out a new DHCPDISCOVER message from a second interface. The client then sends out a DHCPREQUEST from the second interface, but the server doesn't recognize the device and rejects the request. This patch simply tracks the current device being configured and throws away the OFFER if it is not intended for the current device. A more sophisticated approach would be to put the OFFER information into the struct ic_device rather than storing it globally. Signed-off-by: Chris Friesen <cfriesen@nortel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-17netpoll: don't dereference NULL dev from npPavel Emelyanov
It looks like the dev in netpoll_poll can be NULL - at lease it's checked at the function beginning. Thus the dev->netde_ops dereference looks dangerous. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-17page-writeback: fix the calculation of the oldest_jif in wb_kupdate()Toshiyuki Okajima
wb_kupdate() function has a bug on linux-2.6.30-rc5. This bug causes generic_sync_sb_inodes() to start to write inodes back much earlier than our expectations because it miscalculates oldest_jif in wb_kupdate(). This bug was introduced in 704503d836042d4a4c7685b7036e7de0418fbc0f ('mm: fix proc_dointvec_userhz_jiffies "breakage"'). Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: padlock - Revert aes-all alias to aes crypto: api - Fix algorithm module auto-loading crypto: eseqiv - Fix IV generation for sync algorithms crypto: ixp4xx - check firmware for crypto support
2009-05-17Merge branch 'master' of /home/davem/src/GIT/linux-2.6/David S. Miller
2009-05-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: check sysdev_suspend(PMSG_FREEZE) return value
2009-05-17reiserfs: fixup perms when xattrs are disabledJeff Mahoney
This adds CONFIG_REISERFS_FS_XATTR protection from reiserfs_permission. This is needed to avoid warnings during file deletions and chowns with xattrs disabled. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17reiserfs: deal with NULL xattr root w/ xattrs disabledJeff Mahoney
This avoids an Oops in open_xa_root that can occur when deleting a file with xattrs disabled. It assumes that the xattr root will be there, and that is not guaranteed. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17reiserfs: clean up ifdefsJeff Mahoney
With xattr cleanup even with xattrs disabled, much of the initial setup is still performed. Some #ifdefs are just not needed since the options they protect wouldn't be available anyway. This cleans those up. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: mm: SLOB fix reclaim_state mm: SLUB fix reclaim_state slub: add Documentation/ABI/testing/sysfs-kernel-slab slub: enforce MAX_ORDER
2009-05-17Merge branch 'smp-fix'Russell King
2009-05-17[ARM] realview: fix broadcast tick supportRussell King
Having discussed broadcast tick support with Thomas Glexiner, the broadcast tick devices should be registered with a higher rating than the global tick device, and it should have the ONESHOT and PERIODIC feature flags set. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: Thomas Glexiner <tglx@linutronix.de>
2009-05-17[ARM] realview: remove useless smp_cross_call_done()Russell King
smp_cross_call_done() is a no-op for MPCore, and since it's only used by platform code, there's no point in having it unless it's doing something. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2009-05-17[ARM] smp: fix cpumask usage in ARM SMP codeRussell King
The ARM SMP code wasn't properly updated for the cpumask changes, which results in smp_timer_broadcast() broadcasting ticks to non-online CPUs. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2009-05-16Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6
2009-05-16Fix caller information for warn_slowpath_nullLinus Torvalds
Ian Campbell noticed that since "Eliminate thousands of warnings with gcc 3.2 build" (commit 57adc4d2dbf968fdbe516359688094eef4d46581) all WARN_ON()'s currently appear to come from warn_slowpath_null(), eg: WARNING: at kernel/softirq.c:143 warn_slowpath_null+0x1c/0x20() because now that warn_slowpath_null() is in the call path, the __builtin_return_address(0) returns that, rather than the place that caused the warning. Fix this by splitting up the warn_slowpath_null/fmt cases differently, using a common helper function, and getting the return address in the right place. This also happens to avoid the unnecessary stack usage for the non-stdargs case, and just generally cleans things up. Make the function name printout use %pS while at it. Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-16Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: piix: The Sony TZ90 needs the cable type hardcoding icside: register second channel of version 6 PCB ide-tape: remove back-to-back REQUEST_SENSE detection
2009-05-16[ARM] 5513/1: Eurotech VIPER SBC: fix compilation errorRicardo Martins
Compilation for this board yields the following errors: arch/arm/mach-pxa/viper.c:511: error: 'FFUART' undeclared here (not in a function) arch/arm/mach-pxa/viper.c:520: error: 'BTUART' undeclared here (not in a function) arch/arm/mach-pxa/viper.c:529: error: 'STUART' undeclared here (not in a function) Fix them by including the necessary header. Signed-off-by: Ricardo Martins <rasm@fe.up.pt> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2009-05-16[ARM] 5509/1: ep93xx: clkdev enable UARTSHartley Sweeten
Fix the clkdev API support for the ep93xx uart clocks. The uarts available in the ep93xx have individual clock controls. The current implementation assumes that the bootloader has enabled the clocks before the kernel has booted. It also assumes that the bootloader has set the UARTBAUD bit indicating that the uarts are running off the 14.7456MHz external crystal. This fixes both issues. It also allows the uart clocks to be stopped when there are no users. Tested-by: Matthias Kaehlcke <matthias@kaehlcke.net> Cc: Ryan Mallon <ryan@bluewatersys.com> Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2009-05-16Merge branch 'omap-fixes' of ↵Russell King
git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6
2009-05-16Merge branch 'release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: ACPI: Idle C-states disabled by max_cstate should not disable the TSC ACPI: idle: fix init-time TSC check regression ACPI processor: reset the throttling state once it's invalid ACPI processor: introduce module parameter processor.ignore_tpc ACPI, i915: build fix ACPI: suspend: restore BM_RLD on resume ACPI: resume: re-enable SCI-enable workaround thermal: fix off-by-1 error in trip point trigger condition eeepc-laptop: unregister_rfkill_notifier on failure asus-laptop: fix input keycode eeepc-laptop: support for super hybrid engine (SHE) eeepc-laptop: Work around rfkill firmware bug eeepc-laptop: report brightness control events via the input layer eeepc-laptop: fix wlan rfkill state change during init ACPI: suspend: don't let device _PS3 failure prevent suspend ACPI: power: update error message ACPI: video: DMI workaround another broken Acer BIOS enabling display brightness ACPICA: use acpi.* modparam namespace ACPI video: dmi check for broken _BQC on Acer Aspire 5720
2009-05-16piix: The Sony TZ90 needs the cable type hardcodingAlan Cox
The Sony TZ90 needs the cable type hardcoding. See bug #12734 Signed-off-by: Alan Cox <alan@linux.intel.com> Reported-by: Jonathan E. Snow <jesnow@uh.edu> [bart: port it from ata_piix to piix and give reporter the proper credit] Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-05-16icside: register second channel of version 6 PCBSergei Shtylyov
The second IDE channel of version 6 PCB is not being registered anymore since the commit 48c3c1072651922ed153bcf0a33ea82cf20df390 (ide: add struct ide_host (take 3)). Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-05-16ide-tape: remove back-to-back REQUEST_SENSE detectionTejun Heo
Impact: fix an oops which always triggers ide_tape_issue_pc() assumed drive->pc isn't NULL on invocation when checking for back-to-back request sense issues but drive->pc can be NULL and even when it's not NULL, it's not safe to dereference it once the previous command is complete because pc could have been freed or was on stack. Kill back-to-back REQUEST_SENSE detection. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-05-16Merge branch 'fixes-rc5' of git://aeryn.fluff.org.uk/bjdooks/linuxRussell King
2009-05-16ARM: OMAP2/3: Change omapfb to use clkdev for dispc and rfbi, v2Tony Lindgren
This makes the framebuffer work on omap3. Also fix the clk_get usage for checkpatch.pl "ERROR: do not use assignment in if condition". Cc: Imre Deak <imre.deak@nokia.com> Cc: linux-fbdev-devel@lists.sourceforge.net Acked-by: Krzysztof Helt <krzysztof.h1@wp.pl> Signed-off-by: Tony Lindgren <tony@atomide.com>
2009-05-16ARM: OMAP3: Fix HW SAVEANDRESTORE shift defineKalle Jokiniemi
The OMAP3430ES2_SAVEANDRESTORE_SHIFT macro is used by powerdomain code in "1 << OMAP3430ES2_SAVEANDRESTORE_SHIFT" manner, but the definition was also (1 << 4), meaning we actually modified bit 16. So the definition needs to be 4. This fixes also a cold reset HW bug in OMAP3430 ES3.x where some of the efuse bits are not isolated during wake-up from off mode. This can cause randomish cold resets with off mode. Enabling the USBTLL hardware SAVEANDRESTORE causes the core power up assert to be delayed in a way that we will not get faulty values when boot ROM is reading the unisolated registers. Signed-off-by: Kalle Jokiniemi <kalle.jokiniemi@digia.com> Acked-by: Kevin Hilman <khilman@deeprootsystems.com> Acked-by: Paul Walmsley <paul@pwsan.com> Signed-off-by: Tony Lindgren <tony@atomide.com>
2009-05-16ARM: OMAP3: Fix number of GPIO lines for 34xxVikram Pandita
As per 3430 TRM, there are 6 banks [0 to 191] Signed-off-by: Tom Rix <Tom.Rix@windriver.com> Signed-off-by: Vikram Pandita <vikram.pandita@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com>
2009-05-16Merge branches 'release', 'bugzilla-13032', 'bugzilla-13041+', ↵Len Brown
'bugzilla-13121', 'bugzilla-13165', 'bugzilla-13243', 'bugzilla-13259', 'resume-sci-en-regression', 'thermal-regression', 'tsc-regression' and 'asus-2.6.30' into release
2009-05-16ACPI: Idle C-states disabled by max_cstate should not disable the TSCLen Brown
Processor idle power states C2 and C3 stop the TSC on many machines. Linux recognizes this situation and marks the TSC as unstable: Marking TSC unstable due to TSC halts in idle But if those same machines are booted with "processor.max_cstate=1", then there is no need to validate C2 and C3, and no need to disable the TSC, which can be reliably used as a clocksource. Signed-off-by: Len Brown <len.brown@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-16ACPI: idle: fix init-time TSC check regressionLen Brown
A previous 2.6.30 patch, a71e4917dc0ebbcb5a0ecb7ca3486643c1c9a6e2, (ACPI: idle: mark_tsc_unstable() at init-time, not run-time) erroneously disabled the TSC on systems that did not actually have valid deep C-states. Move the check after the deep-C-states are validated, via new helper, tsc_check_state(), hich replaces tsc_halts_in_c(). Signed-off-by: Len Brown <len.brown@intel.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Frans Pop <elendil@planet.nl>
2009-05-15Linux 2.6.30-rc6Linus Torvalds
2009-05-15ACPI processor: reset the throttling state once it's invalidZhang Rui
If the BIOS hands us an invalid throttling state, write a valid state. http://bugzilla.kernel.org/show_bug.cgi?id=13259 Signed-off-by: Zhang Rui <rui.zhang@intel.com> Tested-by: James Ettle <theholyettlz@googlemail.com> Signed-off-by: Len Brown <len.brown@intel.com>
2009-05-15ACPI processor: introduce module parameter processor.ignore_tpcZhang Rui
Introduce module parameter processor.ignore_tpc. Some laptops are shipped with buggy _TPC, this module parameter is used to to disable the buggy support. http://bugzilla.kernel.org/show_bug.cgi?id=13259 Signed-off-by: Zhang Rui <rui.zhang@intel.com> Tested-by: James Ettle <theholyettlz@googlemail.com> Signed-off-by: Len Brown <len.brown@intel.com>
2009-05-15ACPI, i915: build fixLen Brown
drivers/built-in.o: In function `intel_opregion_init': (.text+0x9d540): undefined reference to `acpi_video_register' http://bugzilla.kernel.org/show_bug.cgi?id=13165 Signed-off-by: Len Brown <len.brown@intel.com>
2009-05-15ACPI: suspend: restore BM_RLD on resumeLen Brown
In 2.6.29, 31878dd86b7df9a147f5e6cc6e07092b4308782b "ACPI: remove BM_RLD access from idle entry path" moved BM_RLD initialization to init-time from run time. But we discovered that some BIOS do not restore BM_RLD after suspend, causing device errors on C3 and C4 after resume. So now the kernel restores BM_RLD. http://bugzilla.kernel.org/show_bug.cgi?id=13032 Signed-off-by: Len Brown <len.brown@intel.com>
2009-05-15ACPI: resume: re-enable SCI-enable workaroundLin Ming
The BIOS bug workaround mistakenly got disabled when we followed the ACPI specification more closely by ignoring OS updates to that bit. (The BIOS is supposed to update SCI_EN, not the OS) http://bugzilla.kernel.org/show_bug.cgi?id=13289 Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2009-05-15Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: PCI MSI: Fix MSI-X with NIU cards PCI: Fix pci-e port driver slot_reset bad default return value
2009-05-15PM: check sysdev_suspend(PMSG_FREEZE) return valueBjorn Helgaas
Check the return value of sysdev_suspend(). I think this was a typo. Without this change, the following "if" check is always false. I also changed the error message so it's distinguishable from the similar message a few lines above. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-05-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6: Bluetooth: Don't trigger disconnect timeout for security mode 3 pairing Bluetooth: Don't use hci_acl_connect_cancel() for incoming connections Bluetooth: Fix wrong module refcount when connection setup fails Another case of me handling the fallout from Davem's unfortunate addiction to shuffleboard. Won't anybody think of the children? Join the anti-shuffleboard league today!
2009-05-15Merge branch 'drm-intel-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel * 'drm-intel-next' of git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel: drm/i915: Add new GET_PIPE_FROM_CRTC_ID ioctl. drm/i915: Set HDMI hot plug interrupt enable for only the output in question. drm/i915: Include 965GME pci ID in IS_I965GM(dev) to match UMS. drm/i915: Use the GM45 VGA hotplug workaround on G45 as well. drm/i915: ignore LVDS on intel graphics systems that lie about having it drm/i915: sanity check IER at wait_request time drm/i915: workaround IGD i2c bus issue in kernel side (v2) drm/i915: Don't allow binding objects into the last page of the aperture. drm/i915: save/restore fence registers across suspend/resume drm/i915: x86 always has writeq. Add I915_READ64 for symmetry.
2009-05-15Merge branch 'upstream-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev * 'upstream-linus' of ssh://master.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev: libata: Media rotation rate and form factor heuristics libata: Report disk alignment and physical block size sata_fsl: Fix the command description of FSL SATA controller sata_fsl: Fix compile warnings [libata] sata_sx4: fixup interrupt handling [libata] sata_sx4: convert to new exception handling methods
2009-05-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6: iwlwifi: fix device id registration for 6000 series 2x2 devices ath5k: update channel in sw state after stopping RX and TX rtl8187: use DMA-aware buffers with usb_control_msg mac80211: avoid NULL ptr deref when finding max_rates in PID and minstrel airo: airo_get_encode{,ext} potential buffer overflow Pulled directly by Linus because Davem is off playing shuffle-board at some Alaskan cruise, and the NULL ptr deref issue hits people and should get merged sooner rather than later. David - make us proud on the shuffle-board tournament!
2009-05-15libata: Media rotation rate and form factor heuristicsMartin K. Petersen
This patch provides new heuristics for parsing both the form factor and media rotation rate ATA IDENFITY words. The reported ATA version must be 7 or greater and the device must return values defined as valid in the standard. Only then are the characteristics reported to SCSI via the VPD B1 page. This seems like a reasonable compromise to me considering that we have been shipping several kernel releases that key off the rotation rate bit without any version checking whatsoever. With no complaints so far. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2009-05-15libata: Report disk alignment and physical block sizeMartin K. Petersen
For disks with 4KB sectors, report the correct block size and alignment when filling out the READ CAPACITY(16) response. This patch is based upon code from Matthew Wilcox' 4KB ATA tree. I fixed the bug I reported a while back caused by ATA and SCSI using different approaches to describing the alignment. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2009-05-15sata_fsl: Fix the command description of FSL SATA controllerDave Liu
The bit 11 of command description is reserved bit in Freescale SATA controller and needs to be set to '1'. This is needed to make sure the last write from the controller to the buffer descriptor is seen before an interrupt is raised. Signed-off-by: Dave Liu <daveliu@freescale.com> Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2009-05-15sata_fsl: Fix compile warningsKumar Gala
We we build with dma_addr_t as a 64-bit quantity we get: drivers/ata/sata_fsl.c: In function 'sata_fsl_fill_sg': drivers/ata/sata_fsl.c:340: warning: format '%x' expects type 'unsigned int', but argument 4 has type 'dma_addr_t' Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2009-05-15[libata] sata_sx4: fixup interrupt handlingDavid Milburn
Issuing ATA_CMD_SET_FEATURES (0xef) times out because pdc20621_interrupt ignores command completion since ATA_TFLAG_POLLING flag is set. This has already been fixed for sata_promise: commit 51b94d2a5a90d4800e74d7348bcde098a28f4fb3 Author: Tejun Heo <htejun@gmail.com> Date: Fri Jun 8 13:46:55 2007 -0700 sata_promise: use TF interface for polling NODATA commands Also, this patch includes Mikael's original patches: http://marc.info/?l=linux-ide&m=121135828227724&w=2 http://marc.info/?l=linux-ide&m=121144512109826&w=2 Signed-off-by: Mikael Pettersson <mikpe@it.uu.se> Signed-off-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2009-05-15x86: Fix performance regression caused by paravirt_ops on native kernelsJeremy Fitzhardinge
Xiaohui Xin and some other folks at Intel have been looking into what's behind the performance hit of paravirt_ops when running native. It appears that the hit is entirely due to the paravirtualized spinlocks introduced by: | commit 8efcbab674de2bee45a2e4cdf97de16b8e609ac8 | Date: Mon Jul 7 12:07:51 2008 -0700 | | paravirt: introduce a "lock-byte" spinlock implementation The extra call/return in the spinlock path is somehow causing an increase in the cycles/instruction of somewhere around 2-7% (seems to vary quite a lot from test to test). The working theory is that the CPU's pipeline is getting upset about the call->call->locked-op->return->return, and seems to be failing to speculate (though I haven't seen anything definitive about the precise reasons). This doesn't entirely make sense, because the performance hit is also visible on unlock and other operations which don't involve locked instructions. But spinlock operations clearly swamp all the other pvops operations, even though I can't imagine that they're nearly as common (there's only a .05% increase in instructions executed). If I disable just the pv-spinlock calls, my tests show that pvops is identical to non-pvops performance on native (my measurements show that it is actually about .1% faster, but Xiaohui shows a .05% slowdown). Summary of results, averaging 10 runs of the "mmperf" test, using a no-pvops build as baseline: nopv Pv-nospin Pv-spin CPU cycles 100.00% 99.89% 102.18% instructions 100.00% 100.10% 100.15% CPI 100.00% 99.79% 102.03% cache ref 100.00% 100.84% 100.28% cache miss 100.00% 90.47% 88.56% cache miss rate 100.00% 89.72% 88.31% branches 100.00% 99.93% 100.04% branch miss 100.00% 103.66% 107.72% branch miss rt 100.00% 103.73% 107.67% wallclock 100.00% 99.90% 102.20% The clear effect here is that the 2% increase in CPI is directly reflected in the final wallclock time. (The other interesting effect is that the more ops are out of line calls via pvops, the lower the cache access and miss rates. Not too surprising, but it suggests that the non-pvops kernel is over-inlined. On the flipside, the branch misses go up correspondingly...) So, what's the fix? Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler generated code for paravirtualized _spin_lock is: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq *0xffffffff805a5b30 <_spin_lock+22>: retq The indirect call will get patched to: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq <__ticket_spin_lock> <_spin_lock+20>: nop; nop /* or whatever 2-byte nop */ <_spin_lock+22>: retq One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides arel 1) it will replicate the preempt_disable/enable code at eack lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial. The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (10% saving of overall system CPU when guests block rather than spin). Still it is a reasonable short-term workaround. [ Impact: fix pvops performance regression when running native ] Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com> Analysed-by: "Li Xin" <xin.li@intel.com> Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Xen-devel <xen-devel@lists.xensource.com> LKML-Reference: <4A0B62F7.5030802@goop.org> [ fixed the help text ] Signed-off-by: Ingo Molnar <mingo@elte.hu>