diff options
Diffstat (limited to 'Documentation')
148 files changed, 10640 insertions, 2424 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index d05737aaa84..06b982affe7 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -82,6 +82,8 @@ block/ - info on the Block I/O (BIO) layer. blockdev/ - info on block devices & drivers +btmrvl.txt + - info on Marvell Bluetooth driver usage. cachetlb.txt - describes the cache/TLB flushing interfaces Linux uses. cdrom/ diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block index cbbd3e06994..5f3bedaf8e3 100644 --- a/Documentation/ABI/testing/sysfs-block +++ b/Documentation/ABI/testing/sysfs-block @@ -94,28 +94,37 @@ What: /sys/block/<disk>/queue/physical_block_size Date: May 2009 Contact: Martin K. Petersen <martin.petersen@oracle.com> Description: - This is the smallest unit the storage device can write - without resorting to read-modify-write operation. It is - usually the same as the logical block size but may be - bigger. One example is SATA drives with 4KB sectors - that expose a 512-byte logical block size to the - operating system. + This is the smallest unit a physical storage device can + write atomically. It is usually the same as the logical + block size but may be bigger. One example is SATA + drives with 4KB sectors that expose a 512-byte logical + block size to the operating system. For stacked block + devices the physical_block_size variable contains the + maximum physical_block_size of the component devices. What: /sys/block/<disk>/queue/minimum_io_size Date: April 2009 Contact: Martin K. Petersen <martin.petersen@oracle.com> Description: - Storage devices may report a preferred minimum I/O size, - which is the smallest request the device can perform - without incurring a read-modify-write penalty. For disk - drives this is often the physical block size. For RAID - arrays it is often the stripe chunk size. + Storage devices may report a granularity or preferred + minimum I/O size which is the smallest request the + device can perform without incurring a performance + penalty. For disk drives this is often the physical + block size. For RAID arrays it is often the stripe + chunk size. A properly aligned multiple of + minimum_io_size is the preferred request size for + workloads where a high number of I/O operations is + desired. What: /sys/block/<disk>/queue/optimal_io_size Date: April 2009 Contact: Martin K. Petersen <martin.petersen@oracle.com> Description: Storage devices may report an optimal I/O size, which is - the device's preferred unit of receiving I/O. This is - rarely reported for disk drives. For RAID devices it is - usually the stripe width or the internal block size. + the device's preferred unit for sustained I/O. This is + rarely reported for disk drives. For RAID arrays it is + usually the stripe width or the internal track size. A + properly aligned multiple of optimal_io_size is the + preferred request size for workloads where sustained + throughput is desired. If no optimal I/O size is + reported this file contains 0. diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index 97ad190e13a..25be3250f7d 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -84,6 +84,16 @@ Description: from this part of the device tree. Depends on CONFIG_HOTPLUG. +What: /sys/bus/pci/devices/.../reset +Date: July 2009 +Contact: Michael S. Tsirkin <mst@redhat.com> +Description: + Some devices allow an individual function to be reset + without affecting other functions in the same device. + For devices that have this support, a file named reset + will be present in sysfs. Writing 1 to this file + will perform reset. + What: /sys/bus/pci/devices/.../vpd Date: February 2008 Contact: Ben Hutchings <bhutchings@solarflare.com> @@ -122,3 +132,10 @@ Description: This symbolic link appears when a device is a Virtual Function. The symbolic link points to the PCI device sysfs entry of the Physical Function this device associates with. + +What: /sys/bus/pci/slots/.../module +Date: June 2009 +Contact: linux-pci@vger.kernel.org +Description: + This symbolic link points to the PCI hotplug controller driver + module that manages the hotplug slot. diff --git a/Documentation/ABI/testing/sysfs-class-mtd b/Documentation/ABI/testing/sysfs-class-mtd new file mode 100644 index 00000000000..4d55a188898 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-mtd @@ -0,0 +1,125 @@ +What: /sys/class/mtd/ +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + The mtd/ class subdirectory belongs to the MTD subsystem + (MTD core). + +What: /sys/class/mtd/mtdX/ +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + The /sys/class/mtd/mtd{0,1,2,3,...} directories correspond + to each /dev/mtdX character device. These may represent + physical/simulated flash devices, partitions on a flash + device, or concatenated flash devices. They exist regardless + of whether CONFIG_MTD_CHAR is actually enabled. + +What: /sys/class/mtd/mtdXro/ +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + These directories provide the corresponding read-only device + nodes for /sys/class/mtd/mtdX/ . They are only created + (for the benefit of udev) if CONFIG_MTD_CHAR is enabled. + +What: /sys/class/mtd/mtdX/dev +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + Major and minor numbers of the character device corresponding + to this MTD device (in <major>:<minor> format). This is the + read-write device so <minor> will be even. + +What: /sys/class/mtd/mtdXro/dev +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + Major and minor numbers of the character device corresponding + to the read-only variant of thie MTD device (in + <major>:<minor> format). In this case <minor> will be odd. + +What: /sys/class/mtd/mtdX/erasesize +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + "Major" erase size for the device. If numeraseregions is + zero, this is the eraseblock size for the entire device. + Otherwise, the MEMGETREGIONCOUNT/MEMGETREGIONINFO ioctls + can be used to determine the actual eraseblock layout. + +What: /sys/class/mtd/mtdX/flags +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + A hexadecimal value representing the device flags, ORed + together: + + 0x0400: MTD_WRITEABLE - device is writable + 0x0800: MTD_BIT_WRITEABLE - single bits can be flipped + 0x1000: MTD_NO_ERASE - no erase necessary + 0x2000: MTD_POWERUP_LOCK - always locked after reset + +What: /sys/class/mtd/mtdX/name +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + A human-readable ASCII name for the device or partition. + This will match the name in /proc/mtd . + +What: /sys/class/mtd/mtdX/numeraseregions +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + For devices that have variable eraseblock sizes, this + provides the total number of erase regions. Otherwise, + it will read back as zero. + +What: /sys/class/mtd/mtdX/oobsize +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + Number of OOB bytes per page. + +What: /sys/class/mtd/mtdX/size +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + Total size of the device/partition, in bytes. + +What: /sys/class/mtd/mtdX/type +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + One of the following ASCII strings, representing the device + type: + + absent, ram, rom, nor, nand, dataflash, ubi, unknown + +What: /sys/class/mtd/mtdX/writesize +Date: April 2009 +KernelVersion: 2.6.29 +Contact: linux-mtd@lists.infradead.org +Description: + Minimal writable flash unit size. This will always be + a positive integer. + + In the case of NOR flash it is 1 (even though individual + bits can be cleared). + + In the case of NAND flash it is one NAND page (or a + half page, or a quarter page). + + In the case of ECC NOR, it is the ECC block size. diff --git a/Documentation/ABI/testing/sysfs-fs-ext4 b/Documentation/ABI/testing/sysfs-fs-ext4 index 4e79074de28..5fb709997d9 100644 --- a/Documentation/ABI/testing/sysfs-fs-ext4 +++ b/Documentation/ABI/testing/sysfs-fs-ext4 @@ -79,3 +79,13 @@ Description: This file is read-only and shows the number of kilobytes of data that have been written to this filesystem since it was mounted. + +What: /sys/fs/ext4/<disk>/inode_goal +Date: June 2008 +Contact: "Theodore Ts'o" <tytso@mit.edu> +Description: + Tuning parameter which (if non-zero) controls the goal + inode used by the inode allocator in p0reference to + all other allocation hueristics. This is intended for + debugging use only, and should be 0 on production + systems. diff --git a/Documentation/ABI/testing/sysfs-pps b/Documentation/ABI/testing/sysfs-pps new file mode 100644 index 00000000000..25028c7bc37 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-pps @@ -0,0 +1,73 @@ +What: /sys/class/pps/ +Date: February 2008 +Contact: Rodolfo Giometti <giometti@linux.it> +Description: + The /sys/class/pps/ directory will contain files and + directories that will provide a unified interface to + the PPS sources. + +What: /sys/class/pps/ppsX/ +Date: February 2008 +Contact: Rodolfo Giometti <giometti@linux.it> +Description: + The /sys/class/pps/ppsX/ directory is related to X-th + PPS source into the system. Each directory will + contain files to manage and control its PPS source. + +What: /sys/class/pps/ppsX/assert +Date: February 2008 +Contact: Rodolfo Giometti <giometti@linux.it> +Description: + The /sys/class/pps/ppsX/assert file reports the assert events + and the assert sequence number of the X-th source in the form: + + <secs>.<nsec>#<sequence> + + If the source has no assert events the content of this file + is empty. + +What: /sys/class/pps/ppsX/clear +Date: February 2008 +Contact: Rodolfo Giometti <giometti@linux.it> +Description: + The /sys/class/pps/ppsX/clear file reports the clear events + and the clear sequence number of the X-th source in the form: + + <secs>.<nsec>#<sequence> + + If the source has no clear events the content of this file + is empty. + +What: /sys/class/pps/ppsX/mode +Date: February 2008 +Contact: Rodolfo Giometti <giometti@linux.it> +Description: + The /sys/class/pps/ppsX/mode file reports the functioning + mode of the X-th source in hexadecimal encoding. + + Please, refer to linux/include/linux/pps.h for further + info. + +What: /sys/class/pps/ppsX/echo +Date: February 2008 +Contact: Rodolfo Giometti <giometti@linux.it> +Description: + The /sys/class/pps/ppsX/echo file reports if the X-th does + or does not support an "echo" function. + +What: /sys/class/pps/ppsX/name +Date: February 2008 +Contact: Rodolfo Giometti <giometti@linux.it> +Description: + The /sys/class/pps/ppsX/name file reports the name of the + X-th source. + +What: /sys/class/pps/ppsX/path +Date: February 2008 +Contact: Rodolfo Giometti <giometti@linux.it> +Description: + The /sys/class/pps/ppsX/path file reports the path name of + the device connected with the X-th source. + + If the source is not connected with any device the content + of this file is empty. diff --git a/Documentation/Changes b/Documentation/Changes index 664392481c8..6d0f1efc5bf 100644 --- a/Documentation/Changes +++ b/Documentation/Changes @@ -72,6 +72,13 @@ assembling the 16-bit boot code, removing the need for as86 to compile your kernel. This change does, however, mean that you need a recent release of binutils. +Perl +---- + +You will need perl 5 and the following modules: Getopt::Long, Getopt::Std, +File::Basename, and File::Find to build the kernel. + + System utilities ================ diff --git a/Documentation/DocBook/debugobjects.tmpl b/Documentation/DocBook/debugobjects.tmpl index 7f5f218015f..08ff908aa7a 100644 --- a/Documentation/DocBook/debugobjects.tmpl +++ b/Documentation/DocBook/debugobjects.tmpl @@ -106,7 +106,7 @@ number of errors are printk'ed including a full stack trace. </para> <para> - The statistics are available via debugfs/debug_objects/stats. + The statistics are available via /sys/kernel/debug/debug_objects/stats. They provide information about the number of warnings and the number of successful fixups along with information about the usage of the internal tracking objects and the state of the diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl index a50d6cd5857..992e67e6be7 100644 --- a/Documentation/DocBook/kernel-hacking.tmpl +++ b/Documentation/DocBook/kernel-hacking.tmpl @@ -449,8 +449,8 @@ printk(KERN_INFO "i = %u\n", i); </para> <programlisting> -__u32 ipaddress; -printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); +__be32 ipaddress; +printk(KERN_INFO "my ip: %pI4\n", &ipaddress); </programlisting> <para> diff --git a/Documentation/DocBook/mac80211.tmpl b/Documentation/DocBook/mac80211.tmpl index e3698666357..f3f37f141db 100644 --- a/Documentation/DocBook/mac80211.tmpl +++ b/Documentation/DocBook/mac80211.tmpl @@ -184,8 +184,6 @@ usage should require reading the full document. !Finclude/net/mac80211.h ieee80211_ctstoself_get !Finclude/net/mac80211.h ieee80211_ctstoself_duration !Finclude/net/mac80211.h ieee80211_generic_frame_duration -!Finclude/net/mac80211.h ieee80211_get_hdrlen_from_skb -!Finclude/net/mac80211.h ieee80211_hdrlen !Finclude/net/mac80211.h ieee80211_wake_queue !Finclude/net/mac80211.h ieee80211_stop_queue !Finclude/net/mac80211.h ieee80211_wake_queues diff --git a/Documentation/DocBook/uio-howto.tmpl b/Documentation/DocBook/uio-howto.tmpl index 8f6e3b2403c..4d4ce0e61e4 100644 --- a/Documentation/DocBook/uio-howto.tmpl +++ b/Documentation/DocBook/uio-howto.tmpl @@ -25,6 +25,10 @@ <year>2006-2008</year> <holder>Hans-Jürgen Koch.</holder> </copyright> +<copyright> + <year>2009</year> + <holder>Red Hat Inc, Michael S. Tsirkin (mst@redhat.com)</holder> +</copyright> <legalnotice> <para> @@ -42,6 +46,13 @@ GPL version 2. <revhistory> <revision> + <revnumber>0.9</revnumber> + <date>2009-07-16</date> + <authorinitials>mst</authorinitials> + <revremark>Added generic pci driver + </revremark> + </revision> + <revision> <revnumber>0.8</revnumber> <date>2008-12-24</date> <authorinitials>hjk</authorinitials> @@ -809,6 +820,158 @@ framework to set up sysfs files for this region. Simply leave it alone. </chapter> +<chapter id="uio_pci_generic" xreflabel="Using Generic driver for PCI cards"> +<?dbhtml filename="uio_pci_generic.html"?> +<title>Generic PCI UIO driver</title> + <para> + The generic driver is a kernel module named uio_pci_generic. + It can work with any device compliant to PCI 2.3 (circa 2002) and + any compliant PCI Express device. Using this, you only need to + write the userspace driver, removing the need to write + a hardware-specific kernel module. + </para> + +<sect1 id="uio_pci_generic_binding"> +<title>Making the driver recognize the device</title> + <para> +Since the driver does not declare any device ids, it will not get loaded +automatically and will not automatically bind to any devices, you must load it +and allocate id to the driver yourself. For example: + <programlisting> + modprobe uio_pci_generic + echo "8086 10f5" > /sys/bus/pci/drivers/uio_pci_generic/new_id + </programlisting> + </para> + <para> +If there already is a hardware specific kernel driver for your device, the +generic driver still won't bind to it, in this case if you want to use the +generic driver (why would you?) you'll have to manually unbind the hardware +specific driver and bind the generic driver, like this: + <programlisting> + echo -n 0000:00:19.0 > /sys/bus/pci/drivers/e1000e/unbind + echo -n 0000:00:19.0 > /sys/bus/pci/drivers/uio_pci_generic/bind + </programlisting> + </para> + <para> +You can verify that the device has been bound to the driver +by looking for it in sysfs, for example like the following: + <programlisting> + ls -l /sys/bus/pci/devices/0000:00:19.0/driver + </programlisting> +Which if successful should print + <programlisting> + .../0000:00:19.0/driver -> ../../../bus/pci/drivers/uio_pci_generic + </programlisting> +Note that the generic driver will not bind to old PCI 2.2 devices. +If binding the device failed, run the following command: + <programlisting> + dmesg + </programlisting> +and look in the output for failure reasons + </para> +</sect1> + +<sect1 id="uio_pci_generic_internals"> +<title>Things to know about uio_pci_generic</title> + <para> +Interrupts are handled using the Interrupt Disable bit in the PCI command +register and Interrupt Status bit in the PCI status register. All devices +compliant to PCI 2.3 (circa 2002) and all compliant PCI Express devices should +support these bits. uio_pci_generic detects this support, and won't bind to +devices which do not support the Interrupt Disable Bit in the command register. + </para> + <para> +On each interrupt, uio_pci_generic sets the Interrupt Disable bit. +This prevents the device from generating further interrupts +until the bit is cleared. The userspace driver should clear this +bit before blocking and waiting for more interrupts. + </para> +</sect1> +<sect1 id="uio_pci_generic_userspace"> +<title>Writing userspace driver using uio_pci_generic</title> + <para> +Userspace driver can use pci sysfs interface, or the +libpci libray that wraps it, to talk to the device and to +re-enable interrupts by writing to the command register. + </para> +</sect1> +<sect1 id="uio_pci_generic_example"> +<title>Example code using uio_pci_generic</title> + <para> +Here is some sample userspace driver code using uio_pci_generic: +<programlisting> +#include <stdlib.h> +#include <stdio.h> +#include <unistd.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <errno.h> + +int main() +{ + int uiofd; + int configfd; + int err; + int i; + unsigned icount; + unsigned char command_high; + + uiofd = open("/dev/uio0", O_RDONLY); + if (uiofd < 0) { + perror("uio open:"); + return errno; + } + configfd = open("/sys/class/uio/uio0/device/config", O_RDWR); + if (uiofd < 0) { + perror("config open:"); + return errno; + } + + /* Read and cache command value */ + err = pread(configfd, &command_high, 1, 5); + if (err != 1) { + perror("command config read:"); + return errno; + } + command_high &= ~0x4; + + for(i = 0;; ++i) { + /* Print out a message, for debugging. */ + if (i == 0) + fprintf(stderr, "Started uio test driver.\n"); + else + fprintf(stderr, "Interrupts: %d\n", icount); + + /****************************************/ + /* Here we got an interrupt from the + device. Do something to it. */ + /****************************************/ + + /* Re-enable interrupts. */ + err = pwrite(configfd, &command_high, 1, 5); + if (err != 1) { + perror("config write:"); + break; + } + + /* Wait for next interrupt. */ + err = read(uiofd, &icount, 4); + if (err != 4) { + perror("uio read:"); + break; + } + + } + return errno; +} + +</programlisting> + </para> +</sect1> + +</chapter> + <appendix id="app1"> <title>Further information</title> <itemizedlist> diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt index 6650af43252..e83f2ea7641 100644 --- a/Documentation/PCI/pci-error-recovery.txt +++ b/Documentation/PCI/pci-error-recovery.txt @@ -4,15 +4,17 @@ February 2, 2006 Current document maintainer: - Linas Vepstas <linas@austin.ibm.com> + Linas Vepstas <linasvepstas@gmail.com> + updated by Richard Lary <rlary@us.ibm.com> + and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009 Many PCI bus controllers are able to detect a variety of hardware PCI errors on the bus, such as parity errors on the data and address busses, as well as SERR and PERR errors. Some of the more advanced chipsets are able to deal with these errors; these include PCI-E chipsets, -and the PCI-host bridges found on IBM Power4 and Power5-based pSeries -boxes. A typical action taken is to disconnect the affected device, +and the PCI-host bridges found on IBM Power4, Power5 and Power6-based +pSeries boxes. A typical action taken is to disconnect the affected device, halting all I/O to it. The goal of a disconnection is to avoid system corruption; for example, to halt system memory corruption due to DMA's to "wild" addresses. Typically, a reconnection mechanism is also @@ -37,10 +39,11 @@ is forced by the need to handle multi-function devices, that is, devices that have multiple device drivers associated with them. In the first stage, each driver is allowed to indicate what type of reset it desires, the choices being a simple re-enabling of I/O -or requesting a hard reset (a full electrical #RST of the PCI card). -If any driver requests a full reset, that is what will be done. +or requesting a slot reset. -After a full reset and/or a re-enabling of I/O, all drivers are +If any driver requests a slot reset, that is what will be done. + +After a reset and/or a re-enabling of I/O, all drivers are again notified, so that they may then perform any device setup/config that may be required. After these have all completed, a final "resume normal operations" event is sent out. @@ -101,7 +104,7 @@ if it implements any, it must implement error_detected(). If a callback is not implemented, the corresponding feature is considered unsupported. For example, if mmio_enabled() and resume() aren't there, then it is assumed that the driver is not doing any direct recovery and requires -a reset. If link_reset() is not implemented, the card is assumed as +a slot reset. If link_reset() is not implemented, the card is assumed to not care about link resets. Typically a driver will want to know about a slot_reset(). @@ -111,7 +114,7 @@ sequence described below. STEP 0: Error Event ------------------- -PCI bus error is detect by the PCI hardware. On powerpc, the slot +A PCI bus error is detected by the PCI hardware. On powerpc, the slot is isolated, in that all I/O is blocked: all reads return 0xffffffff, all writes are ignored. @@ -139,7 +142,7 @@ The driver must return one of the following result codes: a chance to extract some diagnostic information (see mmio_enable, below). - PCI_ERS_RESULT_NEED_RESET: - Driver returns this if it can't recover without a hard + Driver returns this if it can't recover without a slot reset. - PCI_ERS_RESULT_DISCONNECT: Driver returns this if it doesn't want to recover at all. @@ -169,11 +172,11 @@ is STEP 6 (Permanent Failure). >>> The current powerpc implementation doesn't much care if the device >>> attempts I/O at this point, or not. I/O's will fail, returning ->>> a value of 0xff on read, and writes will be dropped. If the device ->>> driver attempts more than 10K I/O's to a frozen adapter, it will ->>> assume that the device driver has gone into an infinite loop, and ->>> it will panic the kernel. There doesn't seem to be any other ->>> way of stopping a device driver that insists on spinning on I/O. +>>> a value of 0xff on read, and writes will be dropped. If more than +>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH +>>> assumes that the device driver has gone into an infinite loop +>>> and prints an error to syslog. A reboot is then required to +>>> get the device working again. STEP 2: MMIO Enabled ------------------- @@ -182,15 +185,14 @@ DMA), and then calls the mmio_enabled() callback on all affected device drivers. This is the "early recovery" call. IOs are allowed again, but DMA is -not (hrm... to be discussed, I prefer not), with some restrictions. This -is NOT a callback for the driver to start operations again, only to -peek/poke at the device, extract diagnostic information, if any, and -eventually do things like trigger a device local reset or some such, -but not restart operations. This is callback is made if all drivers on -a segment agree that they can try to recover and if no automatic link reset -was performed by the HW. If the platform can't just re-enable IOs without -a slot reset or a link reset, it wont call this callback, and instead -will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) +not, with some restrictions. This is NOT a callback for the driver to +start operations again, only to peek/poke at the device, extract diagnostic +information, if any, and eventually do things like trigger a device local +reset or some such, but not restart operations. This callback is made if +all drivers on a segment agree that they can try to recover and if no automatic +link reset was performed by the HW. If the platform can't just re-enable IOs +without a slot reset or a link reset, it will not call this callback, and +instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) >>> The following is proposed; no platform implements this yet: >>> Proposal: All I/O's should be done _synchronously_ from within @@ -228,9 +230,6 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations). If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform proceeds to STEP 4 (Slot Reset) ->>> The current powerpc implementation does not implement this callback. - - STEP 3: Link Reset ------------------ The platform resets the link, and then calls the link_reset() callback @@ -253,16 +252,33 @@ The platform then proceeds to either STEP 4 (Slot Reset) or STEP 5 >>> The current powerpc implementation does not implement this callback. - STEP 4: Slot Reset ------------------ -The platform performs a soft or hard reset of the device, and then -calls the slot_reset() callback. -A soft reset consists of asserting the adapter #RST line and then +In response to a return value of PCI_ERS_RESULT_NEED_RESET, the +the platform will peform a slot reset on the requesting PCI device(s). +The actual steps taken by a platform to perform a slot reset +will be platform-dependent. Upon completion of slot reset, the +platform will call the device slot_reset() callback. + +Powerpc platforms implement two levels of slot reset: +soft reset(default) and fundamental(optional) reset. + +Powerpc soft reset consists of asserting the adapter #RST line and then restoring the PCI BAR's and PCI configuration header to a state that is equivalent to what it would be after a fresh system power-on followed by power-on BIOS/system firmware initialization. +Soft reset is also known as hot-reset. + +Powerpc fundamental reset is supported by PCI Express cards only +and results in device's state machines, hardware logic, port states and +configuration registers to initialize to their default conditions. + +For most PCI devices, a soft reset will be sufficient for recovery. +Optional fundamental reset is provided to support a limited number +of PCI Express PCI devices for which a soft reset is not sufficient +for recovery. + If the platform supports PCI hotplug, then the reset might be performed by toggling the slot electrical power off/on. @@ -274,10 +290,12 @@ may result in hung devices, kernel panics, or silent data corruption. This call gives drivers the chance to re-initialize the hardware (re-download firmware, etc.). At this point, the driver may assume -that he card is in a fresh state and is fully functional. In -particular, interrupt generation should work normally. +that the card is in a fresh state and is fully functional. The slot +is unfrozen and the driver has full access to PCI config space, +memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X) +will also be available. -Drivers should not yet restart normal I/O processing operations +Drivers should not restart normal I/O processing operations at this point. If all device drivers report success on this callback, the platform will call resume() to complete the sequence, and let the driver restart normal I/O processing. @@ -302,11 +320,21 @@ driver performs device init only from PCI function 0: - PCI_ERS_RESULT_DISCONNECT Same as above. +Drivers for PCI Express cards that require a fundamental reset must +set the needs_freset bit in the pci_dev structure in their probe function. +For example, the QLogic qla2xxx driver sets the needs_freset bit for certain +PCI card types: + ++ /* Set EEH reset type to fundamental if required by hba */ ++ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha)) ++ pdev->needs_freset = 1; ++ + Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent Failure). ->>> The current powerpc implementation does not currently try a ->>> power-cycle reset if the driver returned PCI_ERS_RESULT_DISCONNECT. +>>> The current powerpc implementation does not try a power-cycle +>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT. >>> However, it probably should. @@ -348,7 +376,7 @@ software errors. Conclusion; General Remarks --------------------------- -The way those callbacks are called is platform policy. A platform with +The way the callbacks are called is platform policy. A platform with no slot reset capability may want to just "ignore" drivers that can't recover (disconnect them) and try to let other cards on the same segment recover. Keep in mind that in most real life cases, though, there will @@ -361,8 +389,8 @@ That is, the recovery API only requires that: - There is no guarantee that interrupt delivery can proceed from any device on the segment starting from the error detection and until the -resume callback is sent, at which point interrupts are expected to be -fully operational. +slot_reset callback is called, at which point interrupts are expected +to be fully operational. - There is no guarantee that interrupt delivery is stopped, that is, a driver that gets an interrupt after detecting an error, or that detects @@ -381,16 +409,23 @@ anyway :) >>> Implementation details for the powerpc platform are discussed in >>> the file Documentation/powerpc/eeh-pci-error-recovery.txt ->>> As of this writing, there are six device drivers with patches ->>> implementing error recovery. Not all of these patches are in +>>> As of this writing, there is a growing list of device drivers with +>>> patches implementing error recovery. Not all of these patches are in >>> mainline yet. These may be used as "examples": >>> ->>> drivers/scsi/ipr.c ->>> drivers/scsi/sym53cxx_2 +>>> drivers/scsi/ipr +>>> drivers/scsi/sym53c8xx_2 +>>> drivers/scsi/qla2xxx +>>> drivers/scsi/lpfc +>>> drivers/next/bnx2.c >>> drivers/next/e100.c >>> drivers/net/e1000 +>>> drivers/net/e1000e >>> drivers/net/ixgb +>>> drivers/net/ixgbe +>>> drivers/net/cxgb3 >>> drivers/net/s2io.c +>>> drivers/net/qlge The End ------- diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt index ddeb14beacc..be21001ab14 100644 --- a/Documentation/PCI/pcieaer-howto.txt +++ b/Documentation/PCI/pcieaer-howto.txt @@ -61,6 +61,10 @@ be initiated although firmwares have no _OSC support. To enable the walkaround, pls. add aerdriver.forceload=y to kernel boot parameter line when booting kernel. Note that forceload=n by default. +nosourceid, another parameter of type bool, can be used when broken +hardware (mostly chipsets) has root ports that cannot obtain the reporting +source ID. nosourceid=n by default. + 2.3 AER error output When a PCI-E AER error is captured, an error message will be outputed to console. If it's a correctable error, it is outputed as a warning. @@ -246,3 +250,24 @@ with the PCI Express AER Root driver? A: It could call the helper functions to enable AER in devices and cleanup uncorrectable status register. Pls. refer to section 3.3. + +4. Software error injection + +Debugging PCIE AER error recovery code is quite difficult because it +is hard to trigger real hardware errors. Software based error +injection can be used to fake various kinds of PCIE errors. + +First you should enable PCIE AER software error injection in kernel +configuration, that is, following item should be in your .config. + +CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m + +After reboot with new kernel or insert the module, a device file named +/dev/aer_inject should be created. + +Then, you need a user space tool named aer-inject, which can be gotten +from: + http://www.kernel.org/pub/linux/utils/pci/aer-inject/ + +More information about aer-inject can be found in the document comes +with its source code. diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt index 9f711d2df91..d2b85237c76 100644 --- a/Documentation/RCU/RTFP.txt +++ b/Documentation/RCU/RTFP.txt @@ -743,3 +743,80 @@ Revised: RCU, realtime RCU, sleepable RCU, performance. " } + +@article{PaulEMcKenney2008RCUOSR +,author="Paul E. McKenney and Jonathan Walpole" +,title="Introducing technology into the {Linux} kernel: a case study" +,Year="2008" +,journal="SIGOPS Oper. Syst. Rev." +,volume="42" +,number="5" +,pages="4--17" +,issn="0163-5980" +,doi={http://doi.acm.org/10.1145/1400097.1400099} +,publisher="ACM" +,address="New York, NY, USA" +,annotation={ + Linux changed RCU to a far greater degree than RCU has changed Linux. +} +} + +@unpublished{PaulEMcKenney2008HierarchicalRCU +,Author="Paul E. McKenney" +,Title="Hierarchical {RCU}" +,month="November" +,day="3" +,year="2008" +,note="Available: +\url{http://lwn.net/Articles/305782/} +[Viewed November 6, 2008]" +,annotation=" + RCU with combining-tree-based grace-period detection, + permitting it to handle thousands of CPUs. +" +} + +@conference{PaulEMcKenney2009MaliciousURCU +,Author="Paul E. McKenney" +,Title="Using a Malicious User-Level {RCU} to Torture {RCU}-Based Algorithms" +,Booktitle="linux.conf.au 2009" +,month="January" +,year="2009" +,address="Hobart, Australia" +,note="Available: +\url{http://www.rdrop.com/users/paulmck/RCU/urcutorture.2009.01.22a.pdf} +[Viewed February 2, 2009]" +,annotation=" + Realtime RCU and torture-testing RCU uses. +" +} + +@unpublished{MathieuDesnoyers2009URCU +,Author="Mathieu Desnoyers" +,Title="[{RFC} git tree] Userspace {RCU} (urcu) for {Linux}" +,month="February" +,day="5" +,year="2009" +,note="Available: +\url{http://lkml.org/lkml/2009/2/5/572} +\url{git://lttng.org/userspace-rcu.git} +[Viewed February 20, 2009]" +,annotation=" + Mathieu Desnoyers's user-space RCU implementation. + git://lttng.org/userspace-rcu.git +" +} + +@unpublished{PaulEMcKenney2009BloatWatchRCU +,Author="Paul E. McKenney" +,Title="{RCU}: The {Bloatwatch} Edition" +,month="March" +,day="17" +,year="2009" +,note="Available: +\url{http://lwn.net/Articles/323929/} +[Viewed March 20, 2009]" +,annotation=" + Uniprocessor assumptions allow simplified RCU implementation. +" +} diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt index aab4a9ec393..90ec5341ee9 100644 --- a/Documentation/RCU/UP.txt +++ b/Documentation/RCU/UP.txt @@ -2,14 +2,13 @@ RCU on Uniprocessor Systems A common misconception is that, on UP systems, the call_rcu() primitive -may immediately invoke its function, and that the synchronize_rcu() -primitive may return immediately. The basis of this misconception +may immediately invoke its function. The basis of this misconception is that since there is only one CPU, it should not be necessary to wait for anything else to get done, since there are no other CPUs for anything else to be happening on. Although this approach will -sort- -of- work a surprising amount of the time, it is a very bad idea in general. -This document presents three examples that demonstrate exactly how bad an -idea this is. +This document presents three examples that demonstrate exactly how bad +an idea this is. Example 1: softirq Suicide @@ -82,11 +81,18 @@ Quick Quiz #2: What locking restriction must RCU callbacks respect? Summary -Permitting call_rcu() to immediately invoke its arguments or permitting -synchronize_rcu() to immediately return breaks RCU, even on a UP system. -So do not do it! Even on a UP system, the RCU infrastructure -must- -respect grace periods, and -must- invoke callbacks from a known environment -in which no locks are held. +Permitting call_rcu() to immediately invoke its arguments breaks RCU, +even on a UP system. So do not do it! Even on a UP system, the RCU +infrastructure -must- respect grace periods, and -must- invoke callbacks +from a known environment in which no locks are held. + +It -is- safe for synchronize_sched() and synchronize_rcu_bh() to return +immediately on an UP system. It is also safe for synchronize_rcu() +to return immediately on UP systems, except when running preemptable +RCU. + +Quick Quiz #3: Why can't synchronize_rcu() return immediately on + UP systems running preemptable RCU? Answer to Quick Quiz #1: @@ -117,3 +123,13 @@ Answer to Quick Quiz #2: callbacks acquire locks directly. However, a great many RCU callbacks do acquire locks -indirectly-, for example, via the kfree() primitive. + +Answer to Quick Quiz #3: + Why can't synchronize_rcu() return immediately on UP systems + running preemptable RCU? + + Because some other task might have been preempted in the middle + of an RCU read-side critical section. If synchronize_rcu() + simply immediately returned, it would prematurely signal the + end of the grace period, which would come as a nasty shock to + that other thread when it started running again. diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index accfe2f5247..51525a30e8b 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -11,7 +11,10 @@ over a rather long period of time, but improvements are always welcome! structure is updated more than about 10% of the time, then you should strongly consider some other approach, unless detailed performance measurements show that RCU is nonetheless - the right tool for the job. + the right tool for the job. Yes, you might think of RCU + as simply cutting overhead off of the readers and imposing it + on the writers. That is exactly why normal uses of RCU will + do much more reading than updating. Another exception is where performance is not an issue, and RCU provides a simpler implementation. An example of this situation @@ -240,10 +243,11 @@ over a rather long period of time, but improvements are always welcome! instead need to use synchronize_irq() or synchronize_sched(). 12. Any lock acquired by an RCU callback must be acquired elsewhere - with irq disabled, e.g., via spin_lock_irqsave(). Failing to - disable irq on a given acquisition of that lock will result in - deadlock as soon as the RCU callback happens to interrupt that - acquisition's critical section. + with softirq disabled, e.g., via spin_lock_irqsave(), + spin_lock_bh(), etc. Failing to disable irq on a given + acquisition of that lock will result in deadlock as soon as the + RCU callback happens to interrupt that acquisition's critical + section. 13. RCU callbacks can be and are executed in parallel. In many cases, the callback code simply wrappers around kfree(), so that this @@ -310,3 +314,9 @@ over a rather long period of time, but improvements are always welcome! Because these primitives only wait for pre-existing readers, it is the caller's responsibility to guarantee safety to any subsequent readers. + +16. The various RCU read-side primitives do -not- contain memory + barriers. The CPU (and in some cases, the compiler) is free + to reorder code into and out of RCU read-side critical sections. + It is the responsibility of the RCU update-side primitives to + deal with this. diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt index 7aa2002ade7..2a23523ce47 100644 --- a/Documentation/RCU/rcu.txt +++ b/Documentation/RCU/rcu.txt @@ -36,7 +36,7 @@ o How can the updater tell when a grace period has completed executed in user mode, or executed in the idle loop, we can safely free up that item. - Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the + Preemptible variants of RCU (CONFIG_TREE_PREEMPT_RCU) get the same effect, but require that the readers manipulate CPU-local counters. These counters allow limited types of blocking within RCU read-side critical sections. SRCU also uses @@ -79,10 +79,10 @@ o I hear that RCU is patented? What is with that? o I hear that RCU needs work in order to support realtime kernels? This work is largely completed. Realtime-friendly RCU can be - enabled via the CONFIG_PREEMPT_RCU kernel configuration parameter. - However, work is in progress for enabling priority boosting of - preempted RCU read-side critical sections. This is needed if you - have CPU-bound realtime threads. + enabled via the CONFIG_TREE_PREEMPT_RCU kernel configuration + parameter. However, work is in progress for enabling priority + boosting of preempted RCU read-side critical sections. This is + needed if you have CPU-bound realtime threads. o Where can I find more information on RCU? diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt index 909602d409b..e439a0edee2 100644 --- a/Documentation/RCU/rcubarrier.txt +++ b/Documentation/RCU/rcubarrier.txt @@ -170,6 +170,13 @@ module invokes call_rcu() from timers, you will need to first cancel all the timers, and only then invoke rcu_barrier() to wait for any remaining RCU callbacks to complete. +Of course, if you module uses call_rcu_bh(), you will need to invoke +rcu_barrier_bh() before unloading. Similarly, if your module uses +call_rcu_sched(), you will need to invoke rcu_barrier_sched() before +unloading. If your module uses call_rcu(), call_rcu_bh(), -and- +call_rcu_sched(), then you will need to invoke each of rcu_barrier(), +rcu_barrier_bh(), and rcu_barrier_sched(). + Implementing rcu_barrier() diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt index 93cb28d05dc..18f9651ff23 100644 --- a/Documentation/RCU/rculist_nulls.txt +++ b/Documentation/RCU/rculist_nulls.txt @@ -83,11 +83,12 @@ not detect it missed following items in original chain. obj = kmem_cache_alloc(...); lock_chain(); // typically a spin_lock() obj->key = key; -atomic_inc(&obj->refcnt); /* * we need to make sure obj->key is updated before obj->next + * or obj->refcnt */ smp_wmb(); +atomic_set(&obj->refcnt, 1); hlist_add_head_rcu(&obj->obj_node, list); unlock_chain(); // typically a spin_unlock() @@ -159,6 +160,10 @@ out: obj = kmem_cache_alloc(cachep); lock_chain(); // typically a spin_lock() obj->key = key; +/* + * changes to obj->key must be visible before refcnt one + */ +smp_wmb(); atomic_set(&obj->refcnt, 1); /* * insert obj in RCU way (readers might be traversing chain) diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt index a342b6e1cc1..9dba3bb90e6 100644 --- a/Documentation/RCU/torture.txt +++ b/Documentation/RCU/torture.txt @@ -76,8 +76,10 @@ torture_type The type of RCU to test: "rcu" for the rcu_read_lock() API, "rcu_sync" for rcu_read_lock() with synchronous reclamation, "rcu_bh" for the rcu_read_lock_bh() API, "rcu_bh_sync" for rcu_read_lock_bh() with synchronous reclamation, "srcu" for - the "srcu_read_lock()" API, and "sched" for the use of - preempt_disable() together with synchronize_sched(). + the "srcu_read_lock()" API, "sched" for the use of + preempt_disable() together with synchronize_sched(), + and "sched_expedited" for the use of preempt_disable() + with synchronize_sched_expedited(). verbose Enable debug printk()s. Default is disabled. @@ -162,6 +164,23 @@ of the "old" and "current" counters for the corresponding CPU. The "idx" value maps the "old" and "current" values to the underlying array, and is useful for debugging. +Similarly, sched_expedited RCU provides the following: + + sched_expedited-torture: rtc: d0000000016c1880 ver: 1090796 tfle: 0 rta: 1090796 rtaf: 0 rtf: 1090787 rtmbe: 0 nt: 27713319 + sched_expedited-torture: Reader Pipe: 12660320201 95875 0 0 0 0 0 0 0 0 0 + sched_expedited-torture: Reader Batch: 12660424885 0 0 0 0 0 0 0 0 0 0 + sched_expedited-torture: Free-Block Circulation: 1090795 1090795 1090794 1090793 1090792 1090791 1090790 1090789 1090788 1090787 0 + state: -1 / 0:0 3:0 4:0 + +As before, the first four lines are similar to those for RCU. +The last line shows the task-migration state. The first number is +-1 if synchronize_sched_expedited() is idle, -2 if in the process of +posting wakeups to the migration kthreads, and N when waiting on CPU N. +Each of the colon-separated fields following the "/" is a CPU:state pair. +Valid states are "0" for idle, "1" for waiting for quiescent state, +"2" for passed through quiescent state, and "3" when a race with a +CPU-hotplug event forces use of the synchronize_sched() primitive. + USAGE diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt index 02cced183b2..187bbf10c92 100644 --- a/Documentation/RCU/trace.txt +++ b/Documentation/RCU/trace.txt @@ -191,8 +191,7 @@ rcu/rcuhier (which displays the struct rcu_node hierarchy). The output of "cat rcu/rcudata" looks as follows: -rcu: -rcu: +rcu_sched: 0 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=10951/1 dn=0 df=1101 of=0 ri=36 ql=0 b=10 1 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=16117/1 dn=0 df=1015 of=0 ri=0 ql=0 b=10 2 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1445/1 dn=0 df=1839 of=0 ri=0 ql=0 b=10 @@ -306,7 +305,7 @@ comma-separated-variable spreadsheet format. The output of "cat rcu/rcugp" looks as follows: -rcu: completed=33062 gpnum=33063 +rcu_sched: completed=33062 gpnum=33063 rcu_bh: completed=464 gpnum=464 Again, this output is for both "rcu" and "rcu_bh". The fields are @@ -413,7 +412,7 @@ o Each element of the form "1/1 0:127 ^0" represents one struct The output of "cat rcu/rcu_pending" looks as follows: -rcu: +rcu_sched: 0 np=255892 qsp=53936 cbr=0 cng=14417 gpc=10033 gps=24320 nf=6445 nn=146741 1 np=261224 qsp=54638 cbr=0 cng=25723 gpc=16310 gps=2849 nf=5912 nn=155792 2 np=237496 qsp=49664 cbr=0 cng=2762 gpc=45478 gps=1762 nf=1201 nn=136629 diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 96170824a71..e41a7fecf0d 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -136,10 +136,10 @@ rcu_read_lock() Used by a reader to inform the reclaimer that the reader is entering an RCU read-side critical section. It is illegal to block while in an RCU read-side critical section, though - kernels built with CONFIG_PREEMPT_RCU can preempt RCU read-side - critical sections. Any RCU-protected data structure accessed - during an RCU read-side critical section is guaranteed to remain - unreclaimed for the full duration of that critical section. + kernels built with CONFIG_TREE_PREEMPT_RCU can preempt RCU + read-side critical sections. Any RCU-protected data structure + accessed during an RCU read-side critical section is guaranteed to + remain unreclaimed for the full duration of that critical section. Reference counts may be used in conjunction with RCU to maintain longer-term references to data structures. @@ -785,6 +785,7 @@ RCU pointer/list traversal: rcu_dereference list_for_each_entry_rcu hlist_for_each_entry_rcu + hlist_nulls_for_each_entry_rcu list_for_each_continue_rcu (to be deprecated in favor of new list_for_each_entry_continue_rcu) @@ -807,19 +808,23 @@ RCU: Critical sections Grace period Barrier rcu_read_lock synchronize_net rcu_barrier rcu_read_unlock synchronize_rcu + synchronize_rcu_expedited call_rcu bh: Critical sections Grace period Barrier rcu_read_lock_bh call_rcu_bh rcu_barrier_bh - rcu_read_unlock_bh + rcu_read_unlock_bh synchronize_rcu_bh + synchronize_rcu_bh_expedited sched: Critical sections Grace period Barrier - [preempt_disable] synchronize_sched rcu_barrier_sched - [and friends] call_rcu_sched + rcu_read_lock_sched synchronize_sched rcu_barrier_sched + rcu_read_unlock_sched call_rcu_sched + [preempt_disable] synchronize_sched_expedited + [and friends] SRCU: Critical sections Grace period Barrier @@ -827,6 +832,9 @@ SRCU: Critical sections Grace period Barrier srcu_read_lock synchronize_srcu N/A srcu_read_unlock +SRCU: Initialization/cleanup + init_srcu_struct + cleanup_srcu_struct See the comment headers in the source code (or the docbook generated from them) for more information. diff --git a/Documentation/SubmitChecklist b/Documentation/SubmitChecklist index ac5e0b2f109..78a9168ff37 100644 --- a/Documentation/SubmitChecklist +++ b/Documentation/SubmitChecklist @@ -54,7 +54,7 @@ kernel patches. CONFIG_PREEMPT. 14: If the patch affects IO/Disk, etc: has been tested with and without - CONFIG_LBD. + CONFIG_LBDAF. 15: All codepaths have been exercised with all lockdep features enabled. diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c index 7ea231172c8..aa73e72fd79 100644 --- a/Documentation/accounting/getdelays.c +++ b/Documentation/accounting/getdelays.c @@ -246,7 +246,8 @@ void print_ioacct(struct taskstats *t) int main(int argc, char *argv[]) { - int c, rc, rep_len, aggr_len, len2, cmd_type; + int c, rc, rep_len, aggr_len, len2; + int cmd_type = TASKSTATS_CMD_ATTR_UNSPEC; __u16 id; __u32 mypid; diff --git a/Documentation/arm/SA1100/ADSBitsy b/Documentation/arm/SA1100/ADSBitsy index ab47c383390..7197a9e958e 100644 --- a/Documentation/arm/SA1100/ADSBitsy +++ b/Documentation/arm/SA1100/ADSBitsy @@ -40,4 +40,4 @@ Notes: mode, the timing is off so the image is corrupted. This will be fixed soon. -Any contribution can be sent to nico@cam.org and will be greatly welcome! +Any contribution can be sent to nico@fluxnic.net and will be greatly welcome! diff --git a/Documentation/arm/SA1100/Assabet b/Documentation/arm/SA1100/Assabet index 78bc1c1b04e..91f7ce7ba42 100644 --- a/Documentation/arm/SA1100/Assabet +++ b/Documentation/arm/SA1100/Assabet @@ -240,7 +240,7 @@ Then, rebooting the Assabet is just a matter of waiting for the login prompt. Nicolas Pitre -nico@cam.org +nico@fluxnic.net June 12, 2001 diff --git a/Documentation/arm/SA1100/Brutus b/Documentation/arm/SA1100/Brutus index 2254c8f0b32..b1cfd405dcc 100644 --- a/Documentation/arm/SA1100/Brutus +++ b/Documentation/arm/SA1100/Brutus @@ -60,7 +60,7 @@ little modifications. Any contribution is welcome. -Please send patches to nico@cam.org +Please send patches to nico@fluxnic.net Have Fun ! diff --git a/Documentation/arm/SA1100/GraphicsClient b/Documentation/arm/SA1100/GraphicsClient index 8fa7e8027ff..6c9c4f5a36e 100644 --- a/Documentation/arm/SA1100/GraphicsClient +++ b/Documentation/arm/SA1100/GraphicsClient @@ -4,7 +4,7 @@ For more details, contact Applied Data Systems or see http://www.applieddata.net/products.html The original Linux support for this product has been provided by -Nicolas Pitre <nico@cam.org>. Continued development work by +Nicolas Pitre <nico@fluxnic.net>. Continued development work by Woojung Huh <whuh@applieddata.net> It's currently possible to mount a root filesystem via NFS providing a @@ -94,5 +94,5 @@ Notes: mode, the timing is off so the image is corrupted. This will be fixed soon. -Any contribution can be sent to nico@cam.org and will be greatly welcome! +Any contribution can be sent to nico@fluxnic.net and will be greatly welcome! diff --git a/Documentation/arm/SA1100/GraphicsMaster b/Documentation/arm/SA1100/GraphicsMaster index dd28745ac52..ee7c6595f23 100644 --- a/Documentation/arm/SA1100/GraphicsMaster +++ b/Documentation/arm/SA1100/GraphicsMaster @@ -4,7 +4,7 @@ For more details, contact Applied Data Systems or see http://www.applieddata.net/products.html The original Linux support for this product has been provided by -Nicolas Pitre <nico@cam.org>. Continued development work by +Nicolas Pitre <nico@fluxnic.net>. Continued development work by Woojung Huh <whuh@applieddata.net> Use 'make graphicsmaster_config' before any 'make config'. @@ -50,4 +50,4 @@ Notes: mode, the timing is off so the image is corrupted. This will be fixed soon. -Any contribution can be sent to nico@cam.org and will be greatly welcome! +Any contribution can be sent to nico@fluxnic.net and will be greatly welcome! diff --git a/Documentation/arm/SA1100/Victor b/Documentation/arm/SA1100/Victor index 01e81fc4946..f938a29fdc2 100644 --- a/Documentation/arm/SA1100/Victor +++ b/Documentation/arm/SA1100/Victor @@ -9,7 +9,7 @@ Of course Victor is using Linux as its main operating system. The Victor implementation for Linux is maintained by Nicolas Pitre: nico@visuaide.com - nico@cam.org + nico@fluxnic.net For any comments, please feel free to contact me through the above addresses. diff --git a/Documentation/arm/Samsung-S3C24XX/CPUfreq.txt b/Documentation/arm/Samsung-S3C24XX/CPUfreq.txt new file mode 100644 index 00000000000..76b3a11e90b --- /dev/null +++ b/Documentation/arm/Samsung-S3C24XX/CPUfreq.txt @@ -0,0 +1,75 @@ + S3C24XX CPUfreq support + ======================= + +Introduction +------------ + + The S3C24XX series support a number of power saving systems, such as + the ability to change the core, memory and peripheral operating + frequencies. The core control is exported via the CPUFreq driver + which has a number of different manual or automatic controls over the + rate the core is running at. + + There are two forms of the driver depending on the specific CPU and + how the clocks are arranged. The first implementation used as single + PLL to feed the ARM, memory and peripherals via a series of dividers + and muxes and this is the implementation that is documented here. A + newer version where there is a seperate PLL and clock divider for the + ARM core is available as a seperate driver. + + +Layout +------ + + The code core manages the CPU specific drivers, any data that they + need to register and the interface to the generic drivers/cpufreq + system. Each CPU registers a driver to control the PLL, clock dividers + and anything else associated with it. Any board that wants to use this + framework needs to supply at least basic details of what is required. + + The core registers with drivers/cpufreq at init time if all the data + necessary has been supplied. + + +CPU support +----------- + + The support for each CPU depends on the facilities provided by the + SoC and the driver as each device has different PLL and clock chains + associated with it. + + +Slow Mode +--------- + + The SLOW mode where the PLL is turned off altogether and the + system is fed by the external crystal input is currently not + supported. + + +sysfs +----- + + The core code exports extra information via sysfs in the directory + devices/system/cpu/cpu0/arch-freq. + + +Board Support +------------- + + Each board that wants to use the cpufreq code must register some basic + information with the core driver to provide information about what the + board requires and any restrictions being placed on it. + + The board needs to supply information about whether it needs the IO bank + timings changing, any maximum frequency limits and information about the + SDRAM refresh rate. + + + + +Document Author +--------------- + +Ben Dooks, Copyright 2009 Simtec Electronics +Licensed under GPLv2 diff --git a/Documentation/arm/memory.txt b/Documentation/arm/memory.txt index 43cb1004d35..9d58c7c5edd 100644 --- a/Documentation/arm/memory.txt +++ b/Documentation/arm/memory.txt @@ -21,6 +21,8 @@ ffff8000 ffffffff copy_user_page / clear_user_page use. For SA11xx and Xscale, this is used to setup a minicache mapping. +ffff4000 ffffffff cache aliasing on ARMv6 and later CPUs. + ffff1000 ffff7fff Reserved. Platforms must not use this address range. diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt index 4ef24501045..396bec3b74e 100644 --- a/Documentation/atomic_ops.txt +++ b/Documentation/atomic_ops.txt @@ -229,10 +229,10 @@ kernel. It is the use of atomic counters to implement reference counting, and it works such that once the counter falls to zero it can be guaranteed that no other entity can be accessing the object: -static void obj_list_add(struct obj *obj) +static void obj_list_add(struct obj *obj, struct list_head *head) { obj->active = 1; - list_add(&obj->list); + list_add(&obj->list, head); } static void obj_list_del(struct obj *obj) diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt index e8ca040ba2c..2d735b0ae38 100644 --- a/Documentation/block/data-integrity.txt +++ b/Documentation/block/data-integrity.txt @@ -50,7 +50,7 @@ encouraged them to allow separation of the data and integrity metadata scatter-gather lists. The controller will interleave the buffers on write and split them on -read. This means that the Linux can DMA the data buffers to and from +read. This means that Linux can DMA the data buffers to and from host memory without changes to the page cache. Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs @@ -66,7 +66,7 @@ software RAID5). The IP checksum is weaker than the CRC in terms of detecting bit errors. However, the strength is really in the separation of the data -buffers and the integrity metadata. These two distinct buffers much +buffers and the integrity metadata. These two distinct buffers must match up for an I/O to complete. The separation of the data and integrity metadata buffers as well as diff --git a/Documentation/btmrvl.txt b/Documentation/btmrvl.txt new file mode 100644 index 00000000000..34916a46c09 --- /dev/null +++ b/Documentation/btmrvl.txt @@ -0,0 +1,119 @@ +======================================================================= + README for btmrvl driver +======================================================================= + + +All commands are used via debugfs interface. + +===================== +Set/get driver configurations: + +Path: /debug/btmrvl/config/ + +gpiogap=[n] +hscfgcmd + These commands are used to configure the host sleep parameters. + bit 8:0 -- Gap + bit 16:8 -- GPIO + + where GPIO is the pin number of GPIO used to wake up the host. + It could be any valid GPIO pin# (e.g. 0-7) or 0xff (SDIO interface + wakeup will be used instead). + + where Gap is the gap in milli seconds between wakeup signal and + wakeup event, or 0xff for special host sleep setting. + + Usage: + # Use SDIO interface to wake up the host and set GAP to 0x80: + echo 0xff80 > /debug/btmrvl/config/gpiogap + echo 1 > /debug/btmrvl/config/hscfgcmd + + # Use GPIO pin #3 to wake up the host and set GAP to 0xff: + echo 0x03ff > /debug/btmrvl/config/gpiogap + echo 1 > /debug/btmrvl/config/hscfgcmd + +psmode=[n] +pscmd + These commands are used to enable/disable auto sleep mode + + where the option is: + 1 -- Enable auto sleep mode + 0 -- Disable auto sleep mode + + Usage: + # Enable auto sleep mode + echo 1 > /debug/btmrvl/config/psmode + echo 1 > /debug/btmrvl/config/pscmd + + # Disable auto sleep mode + echo 0 > /debug/btmrvl/config/psmode + echo 1 > /debug/btmrvl/config/pscmd + + +hsmode=[n] +hscmd + These commands are used to enable host sleep or wake up firmware + + where the option is: + 1 -- Enable host sleep + 0 -- Wake up firmware + + Usage: + # Enable host sleep + echo 1 > /debug/btmrvl/config/hsmode + echo 1 > /debug/btmrvl/config/hscmd + + # Wake up firmware + echo 0 > /debug/btmrvl/config/hsmode + echo 1 > /debug/btmrvl/config/hscmd + + +====================== +Get driver status: + +Path: /debug/btmrvl/status/ + +Usage: + cat /debug/btmrvl/status/<args> + +where the args are: + +curpsmode + This command displays current auto sleep status. + +psstate + This command display the power save state. + +hsstate + This command display the host sleep state. + +txdnldrdy + This command displays the value of Tx download ready flag. + + +===================== + +Use hcitool to issue raw hci command, refer to hcitool manual + + Usage: Hcitool cmd <ogf> <ocf> [Parameters] + + Interface Control Command + hcitool cmd 0x3f 0x5b 0xf5 0x01 0x00 --Enable All interface + hcitool cmd 0x3f 0x5b 0xf5 0x01 0x01 --Enable Wlan interface + hcitool cmd 0x3f 0x5b 0xf5 0x01 0x02 --Enable BT interface + hcitool cmd 0x3f 0x5b 0xf5 0x00 0x00 --Disable All interface + hcitool cmd 0x3f 0x5b 0xf5 0x00 0x01 --Disable Wlan interface + hcitool cmd 0x3f 0x5b 0xf5 0x00 0x02 --Disable BT interface + +======================================================================= + + +SD8688 firmware: + +/lib/firmware/sd8688_helper.bin +/lib/firmware/sd8688.bin + + +The images can be downloaded from: + +git.infradead.org/users/dwmw2/linux-firmware.git/libertas/ diff --git a/Documentation/cdrom/packet-writing.txt b/Documentation/cdrom/packet-writing.txt index cf1f8126991..1c407778c8b 100644 --- a/Documentation/cdrom/packet-writing.txt +++ b/Documentation/cdrom/packet-writing.txt @@ -117,7 +117,7 @@ Using the pktcdvd debugfs interface To read pktcdvd device infos in human readable form, do: - # cat /debug/pktcdvd/pktcdvd[0-7]/info + # cat /sys/kernel/debug/pktcdvd/pktcdvd[0-7]/info For a description of the debugfs interface look into the file: diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt index f9ca389dddf..1d7e9784439 100644 --- a/Documentation/cgroups/cpusets.txt +++ b/Documentation/cgroups/cpusets.txt @@ -777,6 +777,18 @@ in cpuset directories: # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 +To add a CPU to a cpuset, write the new list of CPUs including the +CPU to be added. To add 6 to the above cpuset: + +# /bin/echo 1-4,6 > cpus -> set cpus list to cpus 1,2,3,4,6 + +Similarly to remove a CPU from a cpuset, write the new list of CPUs +without the CPU to be removed. + +To remove all the CPUs: + +# /bin/echo "" > cpus -> clear cpus list + 2.3 Setting flags ----------------- diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 1a608877b14..23d1262c077 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -152,14 +152,19 @@ When swap is accounted, following files are added. usage of mem+swap is limited by memsw.limit_in_bytes. -Note: why 'mem+swap' rather than swap. +* why 'mem+swap' rather than swap. The global LRU(kswapd) can swap out arbitrary pages. Swap-out means to move account from memory to swap...there is no change in usage of -mem+swap. +mem+swap. In other words, when we want to limit the usage of swap without +affecting global LRU, mem+swap limit is better than just limiting swap from +OS point of view. -In other words, when we want to limit the usage of swap without affecting -global LRU, mem+swap limit is better than just limiting swap from OS point -of view. +* What happens when a cgroup hits memory.memsw.limit_in_bytes +When a cgroup his memory.memsw.limit_in_bytes, it's useless to do swap-out +in this cgroup. Then, swap-out will not be done by cgroup routine and file +caches are dropped. But as mentioned above, global LRU can do swapout memory +from it for sanity of the system's memory management state. You can't forbid +it by cgroup. 2.5 Reclaim @@ -204,6 +209,7 @@ We can alter the memory limit: NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, mega or gigabytes. +NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited). # cat /cgroups/0/memory.limit_in_bytes 4194304 diff --git a/Documentation/connector/Makefile b/Documentation/connector/Makefile index 8df1a7285a0..d98e4df98e2 100644 --- a/Documentation/connector/Makefile +++ b/Documentation/connector/Makefile @@ -9,3 +9,8 @@ hostprogs-y := ucon always := $(hostprogs-y) HOSTCFLAGS_ucon.o += -I$(objtree)/usr/include + +all: modules + +modules clean: + $(MAKE) -C ../.. SUBDIRS=$(PWD) $@ diff --git a/Documentation/connector/cn_test.c b/Documentation/connector/cn_test.c index 6977c178729..1711adc3337 100644 --- a/Documentation/connector/cn_test.c +++ b/Documentation/connector/cn_test.c @@ -1,7 +1,7 @@ /* * cn_test.c * - * 2004-2005 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * 2004+ Copyright (c) Evgeniy Polyakov <zbr@ioremap.net> * All rights reserved. * * This program is free software; you can redistribute it and/or modify @@ -19,6 +19,8 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#define pr_fmt(fmt) "cn_test: " fmt + #include <linux/kernel.h> #include <linux/module.h> #include <linux/moduleparam.h> @@ -27,20 +29,25 @@ #include <linux/connector.h> -static struct cb_id cn_test_id = { 0x123, 0x456 }; +static struct cb_id cn_test_id = { CN_NETLINK_USERS + 3, 0x456 }; static char cn_test_name[] = "cn_test"; static struct sock *nls; static struct timer_list cn_test_timer; -void cn_test_callback(void *data) +static void cn_test_callback(struct cn_msg *msg) { - struct cn_msg *msg = (struct cn_msg *)data; - - printk("%s: %lu: idx=%x, val=%x, seq=%u, ack=%u, len=%d: %s.\n", - __func__, jiffies, msg->id.idx, msg->id.val, - msg->seq, msg->ack, msg->len, (char *)msg->data); + pr_info("%s: %lu: idx=%x, val=%x, seq=%u, ack=%u, len=%d: %s.\n", + __func__, jiffies, msg->id.idx, msg->id.val, + msg->seq, msg->ack, msg->len, + msg->len ? (char *)msg->data : ""); } +/* + * Do not remove this function even if no one is using it as + * this is an example of how to get notifications about new + * connector user registration + */ +#if 0 static int cn_test_want_notify(void) { struct cn_ctl_msg *ctl; @@ -57,9 +64,7 @@ static int cn_test_want_notify(void) skb = alloc_skb(size, GFP_ATOMIC); if (!skb) { - printk(KERN_ERR "Failed to allocate new skb with size=%u.\n", - size); - + pr_err("failed to allocate new skb with size=%u\n", size); return -ENOMEM; } @@ -108,15 +113,16 @@ static int cn_test_want_notify(void) //netlink_broadcast(nls, skb, 0, ctl->group, GFP_ATOMIC); netlink_unicast(nls, skb, 0, 0); - printk(KERN_INFO "Request was sent. Group=0x%x.\n", ctl->group); + pr_info("request was sent: group=0x%x\n", ctl->group); return 0; nlmsg_failure: - printk(KERN_ERR "Failed to send %u.%u\n", msg->seq, msg->ack); + pr_err("failed to send %u.%u\n", msg->seq, msg->ack); kfree_skb(skb); return -EINVAL; } +#endif static u32 cn_test_timer_counter; static void cn_test_timer_func(unsigned long __data) @@ -124,6 +130,8 @@ static void cn_test_timer_func(unsigned long __data) struct cn_msg *m; char data[32]; + pr_debug("%s: timer fired with data %lu\n", __func__, __data); + m = kzalloc(sizeof(*m) + sizeof(data), GFP_ATOMIC); if (m) { @@ -143,7 +151,7 @@ static void cn_test_timer_func(unsigned long __data) cn_test_timer_counter++; - mod_timer(&cn_test_timer, jiffies + HZ); + mod_timer(&cn_test_timer, jiffies + msecs_to_jiffies(1000)); } static int cn_test_init(void) @@ -161,8 +169,10 @@ static int cn_test_init(void) } setup_timer(&cn_test_timer, cn_test_timer_func, 0); - cn_test_timer.expires = jiffies + HZ; - add_timer(&cn_test_timer); + mod_timer(&cn_test_timer, jiffies + msecs_to_jiffies(1000)); + + pr_info("initialized with id={%u.%u}\n", + cn_test_id.idx, cn_test_id.val); return 0; @@ -187,5 +197,5 @@ module_init(cn_test_init); module_exit(cn_test_fini); MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>"); +MODULE_AUTHOR("Evgeniy Polyakov <zbr@ioremap.net>"); MODULE_DESCRIPTION("Connector's test module"); diff --git a/Documentation/connector/connector.txt b/Documentation/connector/connector.txt index ad6e0ba7b38..81e6bf6ead5 100644 --- a/Documentation/connector/connector.txt +++ b/Documentation/connector/connector.txt @@ -5,10 +5,10 @@ Kernel Connector. Kernel connector - new netlink based userspace <-> kernel space easy to use communication module. -Connector driver adds possibility to connect various agents using -netlink based network. One must register callback and -identifier. When driver receives special netlink message with -appropriate identifier, appropriate callback will be called. +The Connector driver makes it easy to connect various agents using a +netlink based network. One must register a callback and an identifier. +When the driver receives a special netlink message with the appropriate +identifier, the appropriate callback will be called. From the userspace point of view it's quite straightforward: @@ -17,10 +17,10 @@ From the userspace point of view it's quite straightforward: send(); recv(); -But if kernelspace want to use full power of such connections, driver -writer must create special sockets, must know about struct sk_buff -handling... Connector allows any kernelspace agents to use netlink -based networking for inter-process communication in a significantly +But if kernelspace wants to use the full power of such connections, the +driver writer must create special sockets, must know about struct sk_buff +handling, etc... The Connector driver allows any kernelspace agents to use +netlink based networking for inter-process communication in a significantly easier way: int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *)); @@ -32,15 +32,15 @@ struct cb_id __u32 val; }; -idx and val are unique identifiers which must be registered in -connector.h for in-kernel usage. void (*callback) (void *) - is a -callback function which will be called when message with above idx.val -will be received by connector core. Argument for that function must +idx and val are unique identifiers which must be registered in the +connector.h header for in-kernel usage. void (*callback) (void *) is a +callback function which will be called when a message with above idx.val +is received by the connector core. The argument for that function must be dereferenced to struct cn_msg *. struct cn_msg { - struct cb_id id; + struct cb_id id; __u32 seq; __u32 ack; @@ -55,92 +55,95 @@ Connector interfaces. int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *)); -Registers new callback with connector core. + Registers new callback with connector core. -struct cb_id *id - unique connector's user identifier. - It must be registered in connector.h for legal in-kernel users. -char *name - connector's callback symbolic name. -void (*callback) (void *) - connector's callback. + struct cb_id *id - unique connector's user identifier. + It must be registered in connector.h for legal in-kernel users. + char *name - connector's callback symbolic name. + void (*callback) (void *) - connector's callback. Argument must be dereferenced to struct cn_msg *. + void cn_del_callback(struct cb_id *id); -Unregisters new callback with connector core. + Unregisters new callback with connector core. + + struct cb_id *id - unique connector's user identifier. -struct cb_id *id - unique connector's user identifier. int cn_netlink_send(struct cn_msg *msg, u32 __groups, int gfp_mask); -Sends message to the specified groups. It can be safely called from -softirq context, but may silently fail under strong memory pressure. -If there are no listeners for given group -ESRCH can be returned. + Sends message to the specified groups. It can be safely called from + softirq context, but may silently fail under strong memory pressure. + If there are no listeners for given group -ESRCH can be returned. -struct cn_msg * - message header(with attached data). -u32 __group - destination group. + struct cn_msg * - message header(with attached data). + u32 __group - destination group. If __group is zero, then appropriate group will be searched through all registered connector users, and message will be delivered to the group which was created for user with the same ID as in msg. If __group is not zero, then message will be delivered to the specified group. -int gfp_mask - GFP mask. + int gfp_mask - GFP mask. -Note: When registering new callback user, connector core assigns -netlink group to the user which is equal to it's id.idx. + Note: When registering new callback user, connector core assigns + netlink group to the user which is equal to it's id.idx. /*****************************************/ Protocol description. /*****************************************/ -Current offers transport layer with fixed header. Recommended -protocol which uses such header is following: +The current framework offers a transport layer with fixed headers. The +recommended protocol which uses such a header is as following: msg->seq and msg->ack are used to determine message genealogy. When -someone sends message it puts there locally unique sequence and random -acknowledge numbers. Sequence number may be copied into +someone sends a message, they use a locally unique sequence and random +acknowledge number. The sequence number may be copied into nlmsghdr->nlmsg_seq too. -Sequence number is incremented with each message to be sent. +The sequence number is incremented with each message sent. -If we expect reply to our message, then sequence number in received -message MUST be the same as in original message, and acknowledge -number MUST be the same + 1. +If you expect a reply to the message, then the sequence number in the +received message MUST be the same as in the original message, and the +acknowledge number MUST be the same + 1. -If we receive message and it's sequence number is not equal to one we -are expecting, then it is new message. If we receive message and it's -sequence number is the same as one we are expecting, but it's -acknowledge is not equal acknowledge number in original message + 1, -then it is new message. +If we receive a message and its sequence number is not equal to one we +are expecting, then it is a new message. If we receive a message and +its sequence number is the same as one we are expecting, but its +acknowledge is not equal to the acknowledge number in the original +message + 1, then it is a new message. -Obviously, protocol header contains above id. +Obviously, the protocol header contains the above id. -connector allows event notification in the following form: kernel +The connector allows event notification in the following form: kernel driver or userspace process can ask connector to notify it when -selected id's will be turned on or off(registered or unregistered it's -callback). It is done by sending special command to connector -driver(it also registers itself with id={-1, -1}). +selected ids will be turned on or off (registered or unregistered its +callback). It is done by sending a special command to the connector +driver (it also registers itself with id={-1, -1}). -As example of usage Documentation/connector now contains cn_test.c - -testing module which uses connector to request notification and to -send messages. +As example of this usage can be found in the cn_test.c module which +uses the connector to request notification and to send messages. /*****************************************/ Reliability. /*****************************************/ -Netlink itself is not reliable protocol, that means that messages can +Netlink itself is not a reliable protocol. That means that messages can be lost due to memory pressure or process' receiving queue overflowed, -so caller is warned must be prepared. That is why struct cn_msg [main -connector's message header] contains u32 seq and u32 ack fields. +so caller is warned that it must be prepared. That is why the struct +cn_msg [main connector's message header] contains u32 seq and u32 ack +fields. /*****************************************/ Userspace usage. /*****************************************/ + 2.6.14 has a new netlink socket implementation, which by default does not -allow to send data to netlink groups other than 1. -So, if to use netlink socket (for example using connector) -with different group number userspace application must subscribe to -that group. It can be achieved by following pseudocode: +allow people to send data to netlink groups other than 1. +So, if you wish to use a netlink socket (for example using connector) +with a different group number, the userspace application must subscribe to +that group first. It can be achieved by the following pseudocode: s = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR); @@ -160,8 +163,8 @@ if (bind(s, (struct sockaddr *)&l_local, sizeof(struct sockaddr_nl)) == -1) { } Where 270 above is SOL_NETLINK, and 1 is a NETLINK_ADD_MEMBERSHIP socket -option. To drop multicast subscription one should call above socket option -with NETLINK_DROP_MEMBERSHIP parameter which is defined as 0. +option. To drop a multicast subscription, one should call the above socket +option with the NETLINK_DROP_MEMBERSHIP parameter which is defined as 0. 2.6.14 netlink code only allows to select a group which is less or equal to the maximum group number, which is used at netlink_kernel_create() time. diff --git a/Documentation/connector/ucon.c b/Documentation/connector/ucon.c index d738cde2a8d..4848db8c71f 100644 --- a/Documentation/connector/ucon.c +++ b/Documentation/connector/ucon.c @@ -1,7 +1,7 @@ /* * ucon.c * - * Copyright (c) 2004+ Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * Copyright (c) 2004+ Evgeniy Polyakov <zbr@ioremap.net> * * * This program is free software; you can redistribute it and/or modify @@ -30,18 +30,24 @@ #include <arpa/inet.h> +#include <stdbool.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <string.h> #include <errno.h> #include <time.h> +#include <getopt.h> #include <linux/connector.h> #define DEBUG #define NETLINK_CONNECTOR 11 +/* Hopefully your userspace connector.h matches this kernel */ +#define CN_TEST_IDX CN_NETLINK_USERS + 3 +#define CN_TEST_VAL 0x456 + #ifdef DEBUG #define ulog(f, a...) fprintf(stdout, f, ##a) #else @@ -83,6 +89,25 @@ static int netlink_send(int s, struct cn_msg *msg) return err; } +static void usage(void) +{ + printf( + "Usage: ucon [options] [output file]\n" + "\n" + "\t-h\tthis help screen\n" + "\t-s\tsend buffers to the test module\n" + "\n" + "The default behavior of ucon is to subscribe to the test module\n" + "and wait for state messages. Any ones received are dumped to the\n" + "specified output file (or stdout). The test module is assumed to\n" + "have an id of {%u.%u}\n" + "\n" + "If you get no output, then verify the cn_test module id matches\n" + "the expected id above.\n" + , CN_TEST_IDX, CN_TEST_VAL + ); +} + int main(int argc, char *argv[]) { int s; @@ -94,17 +119,34 @@ int main(int argc, char *argv[]) FILE *out; time_t tm; struct pollfd pfd; + bool send_msgs = false; - if (argc < 2) - out = stdout; - else { - out = fopen(argv[1], "a+"); + while ((s = getopt(argc, argv, "hs")) != -1) { + switch (s) { + case 's': + send_msgs = true; + break; + + case 'h': + usage(); + return 0; + + default: + /* getopt() outputs an error for us */ + usage(); + return 1; + } + } + + if (argc != optind) { + out = fopen(argv[optind], "a+"); if (!out) { ulog("Unable to open %s for writing: %s\n", argv[1], strerror(errno)); out = stdout; } - } + } else + out = stdout; memset(buf, 0, sizeof(buf)); @@ -115,9 +157,11 @@ int main(int argc, char *argv[]) } l_local.nl_family = AF_NETLINK; - l_local.nl_groups = 0x123; /* bitmask of requested groups */ + l_local.nl_groups = -1; /* bitmask of requested groups */ l_local.nl_pid = 0; + ulog("subscribing to %u.%u\n", CN_TEST_IDX, CN_TEST_VAL); + if (bind(s, (struct sockaddr *)&l_local, sizeof(struct sockaddr_nl)) == -1) { perror("bind"); close(s); @@ -130,15 +174,15 @@ int main(int argc, char *argv[]) setsockopt(s, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, &on, sizeof(on)); } #endif - if (0) { + if (send_msgs) { int i, j; memset(buf, 0, sizeof(buf)); data = (struct cn_msg *)buf; - data->id.idx = 0x123; - data->id.val = 0x456; + data->id.idx = CN_TEST_IDX; + data->id.val = CN_TEST_VAL; data->seq = seq++; data->ack = 0; data->len = 0; diff --git a/Documentation/cpu-freq/cpu-drivers.txt b/Documentation/cpu-freq/cpu-drivers.txt index 43c743903dd..75a58d14d3c 100644 --- a/Documentation/cpu-freq/cpu-drivers.txt +++ b/Documentation/cpu-freq/cpu-drivers.txt @@ -155,7 +155,7 @@ actual frequency must be determined using the following rules: - if relation==CPUFREQ_REL_H, try to select a new_freq lower than or equal target_freq. ("H for highest, but no higher than") -Here again the frequency table helper might assist you - see section 3 +Here again the frequency table helper might assist you - see section 2 for details. diff --git a/Documentation/cpu-freq/governors.txt b/Documentation/cpu-freq/governors.txt index ce73f3eb5dd..aed082f49d0 100644 --- a/Documentation/cpu-freq/governors.txt +++ b/Documentation/cpu-freq/governors.txt @@ -119,10 +119,6 @@ want the kernel to look at the CPU usage and to make decisions on what to do about the frequency. Typically this is set to values of around '10000' or more. It's default value is (cmp. with users-guide.txt): transition_latency * 1000 -The lowest value you can set is: -transition_latency * 100 or it may get restricted to a value where it -makes not sense for the kernel anymore to poll that often which depends -on your HZ config variable (HZ=1000: max=20000us, HZ=250: max=5000). Be aware that transition latency is in ns and sampling_rate is in us, so you get the same sysfs value by default. Sampling rate should always get adjusted considering the transition latency @@ -131,14 +127,20 @@ in the bash (as said, 1000 is default), do: echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) \ >ondemand/sampling_rate -show_sampling_rate_(min|max): THIS INTERFACE IS DEPRECATED, DON'T USE IT. -You can use wider ranges now and the general -cpuinfo_transition_latency variable (cmp. with user-guide.txt) can be -used to obtain exactly the same info: -show_sampling_rate_min = transtition_latency * 500 / 1000 -show_sampling_rate_max = transtition_latency * 500000 / 1000 -(divided by 1000 is to illustrate that sampling rate is in us and -transition latency is exported ns). +show_sampling_rate_min: +The sampling rate is limited by the HW transition latency: +transition_latency * 100 +Or by kernel restrictions: +If CONFIG_NO_HZ is set, the limit is 10ms fixed. +If CONFIG_NO_HZ is not set or no_hz=off boot parameter is used, the +limits depend on the CONFIG_HZ option: +HZ=1000: min=20000us (20ms) +HZ=250: min=80000us (80ms) +HZ=100: min=200000us (200ms) +The highest value of kernel and HW latency restrictions is shown and +used as the minimum sampling rate. + +show_sampling_rate_max: THIS INTERFACE IS DEPRECATED, DON'T USE IT. up_threshold: defines what the average CPU usage between the samplings of 'sampling_rate' needs to be for the kernel to make a decision on diff --git a/Documentation/cpu-freq/user-guide.txt b/Documentation/cpu-freq/user-guide.txt index 75f41193f3e..5d5f5fadd1c 100644 --- a/Documentation/cpu-freq/user-guide.txt +++ b/Documentation/cpu-freq/user-guide.txt @@ -31,7 +31,6 @@ Contents: 3. How to change the CPU cpufreq policy and/or speed 3.1 Preferred interface: sysfs -3.2 Deprecated interfaces diff --git a/Documentation/device-mapper/dm-log.txt b/Documentation/device-mapper/dm-log.txt new file mode 100644 index 00000000000..994dd75475a --- /dev/null +++ b/Documentation/device-mapper/dm-log.txt @@ -0,0 +1,54 @@ +Device-Mapper Logging +===================== +The device-mapper logging code is used by some of the device-mapper +RAID targets to track regions of the disk that are not consistent. +A region (or portion of the address space) of the disk may be +inconsistent because a RAID stripe is currently being operated on or +a machine died while the region was being altered. In the case of +mirrors, a region would be considered dirty/inconsistent while you +are writing to it because the writes need to be replicated for all +the legs of the mirror and may not reach the legs at the same time. +Once all writes are complete, the region is considered clean again. + +There is a generic logging interface that the device-mapper RAID +implementations use to perform logging operations (see +dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different +logging implementations are available and provide different +capabilities. The list includes: + +Type Files +==== ===== +disk drivers/md/dm-log.c +core drivers/md/dm-log.c +userspace drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h + +The "disk" log type +------------------- +This log implementation commits the log state to disk. This way, the +logging state survives reboots/crashes. + +The "core" log type +------------------- +This log implementation keeps the log state in memory. The log state +will not survive a reboot or crash, but there may be a small boost in +performance. This method can also be used if no storage device is +available for storing log state. + +The "userspace" log type +------------------------ +This log type simply provides a way to export the log API to userspace, +so log implementations can be done there. This is done by forwarding most +logging requests to userspace, where a daemon receives and processes the +request. + +The structure used for communication between kernel and userspace are +located in include/linux/dm-log-userspace.h. Due to the frequency, +diversity, and 2-way communication nature of the exchanges between +kernel and userspace, 'connector' is used as the interface for +communication. + +There are currently two userspace log implementations that leverage this +framework - "clustered_disk" and "clustered_core". These implementations +provide a cluster-coherent log for shared-storage. Device-mapper mirroring +can be used in a shared-storage environment when the cluster log implementations +are employed. diff --git a/Documentation/device-mapper/dm-queue-length.txt b/Documentation/device-mapper/dm-queue-length.txt new file mode 100644 index 00000000000..f4db2562175 --- /dev/null +++ b/Documentation/device-mapper/dm-queue-length.txt @@ -0,0 +1,39 @@ +dm-queue-length +=============== + +dm-queue-length is a path selector module for device-mapper targets, +which selects a path with the least number of in-flight I/Os. +The path selector name is 'queue-length'. + +Table parameters for each path: [<repeat_count>] + <repeat_count>: The number of I/Os to dispatch using the selected + path before switching to the next path. + If not given, internal default is used. To check + the default value, see the activated table. + +Status for each path: <status> <fail-count> <in-flight> + <status>: 'A' if the path is active, 'F' if the path is failed. + <fail-count>: The number of path failures. + <in-flight>: The number of in-flight I/Os on the path. + + +Algorithm +========= + +dm-queue-length increments/decrements 'in-flight' when an I/O is +dispatched/completed respectively. +dm-queue-length selects a path with the minimum 'in-flight'. + + +Examples +======== +In case that 2 paths (sda and sdb) are used with repeat_count == 128. + +# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \ + dmsetup create test +# +# dmsetup table +test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128 +# +# dmsetup status +test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0 diff --git a/Documentation/device-mapper/dm-service-time.txt b/Documentation/device-mapper/dm-service-time.txt new file mode 100644 index 00000000000..7d00668e97b --- /dev/null +++ b/Documentation/device-mapper/dm-service-time.txt @@ -0,0 +1,91 @@ +dm-service-time +=============== + +dm-service-time is a path selector module for device-mapper targets, +which selects a path with the shortest estimated service time for +the incoming I/O. + +The service time for each path is estimated by dividing the total size +of in-flight I/Os on a path with the performance value of the path. +The performance value is a relative throughput value among all paths +in a path-group, and it can be specified as a table argument. + +The path selector name is 'service-time'. + +Table parameters for each path: [<repeat_count> [<relative_throughput>]] + <repeat_count>: The number of I/Os to dispatch using the selected + path before switching to the next path. + If not given, internal default is used. To check + the default value, see the activated table. + <relative_throughput>: The relative throughput value of the path + among all paths in the path-group. + The valid range is 0-100. + If not given, minimum value '1' is used. + If '0' is given, the path isn't selected while + other paths having a positive value are available. + +Status for each path: <status> <fail-count> <in-flight-size> \ + <relative_throughput> + <status>: 'A' if the path is active, 'F' if the path is failed. + <fail-count>: The number of path failures. + <in-flight-size>: The size of in-flight I/Os on the path. + <relative_throughput>: The relative throughput value of the path + among all paths in the path-group. + + +Algorithm +========= + +dm-service-time adds the I/O size to 'in-flight-size' when the I/O is +dispatched and substracts when completed. +Basically, dm-service-time selects a path having minimum service time +which is calculated by: + + ('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput' + +However, some optimizations below are used to reduce the calculation +as much as possible. + + 1. If the paths have the same 'relative_throughput', skip + the division and just compare the 'in-flight-size'. + + 2. If the paths have the same 'in-flight-size', skip the division + and just compare the 'relative_throughput'. + + 3. If some paths have non-zero 'relative_throughput' and others + have zero 'relative_throughput', ignore those paths with zero + 'relative_throughput'. + +If such optimizations can't be applied, calculate service time, and +compare service time. +If calculated service time is equal, the path having maximum +'relative_throughput' may be better. So compare 'relative_throughput' +then. + + +Examples +======== +In case that 2 paths (sda and sdb) are used with repeat_count == 128 +and sda has an average throughput 1GB/s and sdb has 4GB/s, +'relative_throughput' value may be '1' for sda and '4' for sdb. + +# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \ + dmsetup create test +# +# dmsetup table +test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4 +# +# dmsetup status +test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4 + + +Or '2' for sda and '8' for sdb would be also true. + +# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \ + dmsetup create test +# +# dmsetup table +test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8 +# +# dmsetup status +test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8 diff --git a/Documentation/dontdiff b/Documentation/dontdiff index 88519daab6e..e1efc400bed 100644 --- a/Documentation/dontdiff +++ b/Documentation/dontdiff @@ -152,7 +152,6 @@ piggy.gz piggyback pnmtologo ppc_defs.h* -promcon_tbl.c pss_boot.h qconf raid6altivec*.c diff --git a/Documentation/driver-model/device.txt b/Documentation/driver-model/device.txt index a7cbfff40d0..a124f3126b0 100644 --- a/Documentation/driver-model/device.txt +++ b/Documentation/driver-model/device.txt @@ -162,3 +162,35 @@ device_remove_file(dev,&dev_attr_power); The file name will be 'power' with a mode of 0644 (-rw-r--r--). +Word of warning: While the kernel allows device_create_file() and +device_remove_file() to be called on a device at any time, userspace has +strict expectations on when attributes get created. When a new device is +registered in the kernel, a uevent is generated to notify userspace (like +udev) that a new device is available. If attributes are added after the +device is registered, then userspace won't get notified and userspace will +not know about the new attributes. + +This is important for device driver that need to publish additional +attributes for a device at driver probe time. If the device driver simply +calls device_create_file() on the device structure passed to it, then +userspace will never be notified of the new attributes. Instead, it should +probably use class_create() and class->dev_attrs to set up a list of +desired attributes in the modules_init function, and then in the .probe() +hook, and then use device_create() to create a new device as a child +of the probed device. The new device will generate a new uevent and +properly advertise the new attributes to userspace. + +For example, if a driver wanted to add the following attributes: +struct device_attribute mydriver_attribs[] = { + __ATTR(port_count, 0444, port_count_show), + __ATTR(serial_number, 0444, serial_number_show), + NULL +}; + +Then in the module init function is would do: + mydriver_class = class_create(THIS_MODULE, "my_attrs"); + mydriver_class.dev_attr = mydriver_attribs; + +And assuming 'dev' is the struct device passed into the probe hook, the driver +probe function would do something like: + create_device(&mydriver_class, dev, chrdev, &private_data, "my_name"); diff --git a/Documentation/driver-model/driver.txt b/Documentation/driver-model/driver.txt index 82132169d47..60120fb3b96 100644 --- a/Documentation/driver-model/driver.txt +++ b/Documentation/driver-model/driver.txt @@ -207,8 +207,8 @@ Attributes ~~~~~~~~~~ struct driver_attribute { struct attribute attr; - ssize_t (*show)(struct device_driver *, char * buf, size_t count, loff_t off); - ssize_t (*store)(struct device_driver *, const char * buf, size_t count, loff_t off); + ssize_t (*show)(struct device_driver *driver, char *buf); + ssize_t (*store)(struct device_driver *, const char * buf, size_t count); }; Device drivers can export attributes via their sysfs directories. diff --git a/Documentation/dvb/get_dvb_firmware b/Documentation/dvb/get_dvb_firmware index 2f21ecd4c20..3d1b0ab70c8 100644 --- a/Documentation/dvb/get_dvb_firmware +++ b/Documentation/dvb/get_dvb_firmware @@ -25,7 +25,7 @@ use IO::Handle; "tda10046lifeview", "av7110", "dec2000t", "dec2540t", "dec3000s", "vp7041", "dibusb", "nxt2002", "nxt2004", "or51211", "or51132_qam", "or51132_vsb", "bluebird", - "opera1", "cx231xx", "cx18", "cx23885", "pvrusb2" ); + "opera1", "cx231xx", "cx18", "cx23885", "pvrusb2", "mpc718" ); # Check args syntax() if (scalar(@ARGV) != 1); @@ -112,7 +112,7 @@ sub tda10045 { sub tda10046 { my $sourcefile = "TT_PCI_2.19h_28_11_2006.zip"; - my $url = "http://technotrend-online.com/download/software/219/$sourcefile"; + my $url = "http://www.tt-download.com/download/updates/219/$sourcefile"; my $hash = "6a7e1e2f2644b162ff0502367553c72d"; my $outfile = "dvb-fe-tda10046.fw"; my $tmpdir = tempdir(DIR => "/tmp", CLEANUP => 1); @@ -129,8 +129,8 @@ sub tda10046 { } sub tda10046lifeview { - my $sourcefile = "Drv_2.11.02.zip"; - my $url = "http://www.lifeview.com.tw/drivers/pci_card/FlyDVB-T/$sourcefile"; + my $sourcefile = "7%5Cdrv_2.11.02.zip"; + my $url = "http://www.lifeview.hk/dbimages/document/$sourcefile"; my $hash = "1ea24dee4eea8fe971686981f34fd2e0"; my $outfile = "dvb-fe-tda10046.fw"; my $tmpdir = tempdir(DIR => "/tmp", CLEANUP => 1); @@ -317,7 +317,7 @@ sub nxt2002 { sub nxt2004 { my $sourcefile = "AVerTVHD_MCE_A180_Drv_v1.2.2.16.zip"; - my $url = "http://www.aver.com/support/Drivers/$sourcefile"; + my $url = "http://www.avermedia-usa.com/support/Drivers/$sourcefile"; my $hash = "111cb885b1e009188346d72acfed024c"; my $outfile = "dvb-fe-nxt2004.fw"; my $tmpdir = tempdir(DIR => "/tmp", CLEANUP => 1); @@ -381,6 +381,57 @@ sub cx18 { $allfiles; } +sub mpc718 { + my $archive = 'Yuan MPC718 TV Tuner Card 2.13.10.1016.zip'; + my $url = "ftp://ftp.work.acer-euro.com/desktop/aspire_idea510/vista/Drivers/$archive"; + my $fwfile = "dvb-cx18-mpc718-mt352.fw"; + my $tmpdir = tempdir(DIR => "/tmp", CLEANUP => 1); + + checkstandard(); + wgetfile($archive, $url); + unzip($archive, $tmpdir); + + my $sourcefile = "$tmpdir/Yuan MPC718 TV Tuner Card 2.13.10.1016/mpc718_32bit/yuanrap.sys"; + my $found = 0; + + open IN, '<', $sourcefile or die "Couldn't open $sourcefile to extract $fwfile data\n"; + binmode IN; + open OUT, '>', $fwfile; + binmode OUT; + { + # Block scope because we change the line terminator variable $/ + my $prevlen = 0; + my $currlen; + + # Buried in the data segment are 3 runs of almost identical + # register-value pairs that end in 0x5d 0x01 which is a "TUNER GO" + # command for the MT352. + # Pull out the middle run (because it's easy) of register-value + # pairs to make the "firmware" file. + + local $/ = "\x5d\x01"; # MT352 "TUNER GO" + + while (<IN>) { + $currlen = length($_); + if ($prevlen == $currlen && $currlen <= 64) { + chop; chop; # Get rid of "TUNER GO" + s/^\0\0//; # get rid of leading 00 00 if it's there + printf OUT "$_"; + $found = 1; + last; + } + $prevlen = $currlen; + } + } + close OUT; + close IN; + if (!$found) { + unlink $fwfile; + die "Couldn't find valid register-value sequence in $sourcefile for $fwfile\n"; + } + $fwfile; +} + sub cx23885 { my $url = "http://linuxtv.org/downloads/firmware/"; diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt index 4bc374a1434..07930564079 100644 --- a/Documentation/fault-injection/fault-injection.txt +++ b/Documentation/fault-injection/fault-injection.txt @@ -29,16 +29,16 @@ o debugfs entries fault-inject-debugfs kernel module provides some debugfs entries for runtime configuration of fault-injection capabilities. -- /debug/fail*/probability: +- /sys/kernel/debug/fail*/probability: likelihood of failure injection, in percent. Format: <percent> Note that one-failure-per-hundred is a very high error rate for some testcases. Consider setting probability=100 and configure - /debug/fail*/interval for such testcases. + /sys/kernel/debug/fail*/interval for such testcases. -- /debug/fail*/interval: +- /sys/kernel/debug/fail*/interval: specifies the interval between failures, for calls to should_fail() that pass all the other tests. @@ -46,18 +46,18 @@ configuration of fault-injection capabilities. Note that if you enable this, by setting interval>1, you will probably want to set probability=100. -- /debug/fail*/times: +- /sys/kernel/debug/fail*/times: specifies how many times failures may happen at most. A value of -1 means "no limit". -- /debug/fail*/space: +- /sys/kernel/debug/fail*/space: specifies an initial resource "budget", decremented by "size" on each call to should_fail(,size). Failure injection is suppressed until "space" reaches zero. -- /debug/fail*/verbose +- /sys/kernel/debug/fail*/verbose Format: { 0 | 1 | 2 } specifies the verbosity of the messages when failure is @@ -65,17 +65,17 @@ configuration of fault-injection capabilities. log line per failure; '2' will print a call trace too -- useful to debug the problems revealed by fault injection. -- /debug/fail*/task-filter: +- /sys/kernel/debug/fail*/task-filter: Format: { 'Y' | 'N' } A value of 'N' disables filtering by process (default). Any positive value limits failures to only processes indicated by /proc/<pid>/make-it-fail==1. -- /debug/fail*/require-start: -- /debug/fail*/require-end: -- /debug/fail*/reject-start: -- /debug/fail*/reject-end: +- /sys/kernel/debug/fail*/require-start: +- /sys/kernel/debug/fail*/require-end: +- /sys/kernel/debug/fail*/reject-start: +- /sys/kernel/debug/fail*/reject-end: specifies the range of virtual addresses tested during stacktrace walking. Failure is injected only if some caller @@ -84,26 +84,26 @@ configuration of fault-injection capabilities. Default required range is [0,ULONG_MAX) (whole of virtual address space). Default rejected range is [0,0). -- /debug/fail*/stacktrace-depth: +- /sys/kernel/debug/fail*/stacktrace-depth: specifies the maximum stacktrace depth walked during search for a caller within [require-start,require-end) OR [reject-start,reject-end). -- /debug/fail_page_alloc/ignore-gfp-highmem: +- /sys/kernel/debug/fail_page_alloc/ignore-gfp-highmem: Format: { 'Y' | 'N' } default is 'N', setting it to 'Y' won't inject failures into highmem/user allocations. -- /debug/failslab/ignore-gfp-wait: -- /debug/fail_page_alloc/ignore-gfp-wait: +- /sys/kernel/debug/failslab/ignore-gfp-wait: +- /sys/kernel/debug/fail_page_alloc/ignore-gfp-wait: Format: { 'Y' | 'N' } default is 'N', setting it to 'Y' will inject failures only into non-sleep allocations (GFP_ATOMIC allocations). -- /debug/fail_page_alloc/min-order: +- /sys/kernel/debug/fail_page_alloc/min-order: specifies the minimum page allocation order to be injected failures. @@ -166,13 +166,13 @@ o Inject slab allocation failures into module init/exit code #!/bin/bash FAILTYPE=failslab -echo Y > /debug/$FAILTYPE/task-filter -echo 10 > /debug/$FAILTYPE/probability -echo 100 > /debug/$FAILTYPE/interval -echo -1 > /debug/$FAILTYPE/times -echo 0 > /debug/$FAILTYPE/space -echo 2 > /debug/$FAILTYPE/verbose -echo 1 > /debug/$FAILTYPE/ignore-gfp-wait +echo Y > /sys/kernel/debug/$FAILTYPE/task-filter +echo 10 > /sys/kernel/debug/$FAILTYPE/probability +echo 100 > /sys/kernel/debug/$FAILTYPE/interval +echo -1 > /sys/kernel/debug/$FAILTYPE/times +echo 0 > /sys/kernel/debug/$FAILTYPE/space +echo 2 > /sys/kernel/debug/$FAILTYPE/verbose +echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait faulty_system() { @@ -217,20 +217,20 @@ then exit 1 fi -cat /sys/module/$module/sections/.text > /debug/$FAILTYPE/require-start -cat /sys/module/$module/sections/.data > /debug/$FAILTYPE/require-end +cat /sys/module/$module/sections/.text > /sys/kernel/debug/$FAILTYPE/require-start +cat /sys/module/$module/sections/.data > /sys/kernel/debug/$FAILTYPE/require-end -echo N > /debug/$FAILTYPE/task-filter -echo 10 > /debug/$FAILTYPE/probability -echo 100 > /debug/$FAILTYPE/interval -echo -1 > /debug/$FAILTYPE/times -echo 0 > /debug/$FAILTYPE/space -echo 2 > /debug/$FAILTYPE/verbose -echo 1 > /debug/$FAILTYPE/ignore-gfp-wait -echo 1 > /debug/$FAILTYPE/ignore-gfp-highmem -echo 10 > /debug/$FAILTYPE/stacktrace-depth +echo N > /sys/kernel/debug/$FAILTYPE/task-filter +echo 10 > /sys/kernel/debug/$FAILTYPE/probability +echo 100 > /sys/kernel/debug/$FAILTYPE/interval +echo -1 > /sys/kernel/debug/$FAILTYPE/times +echo 0 > /sys/kernel/debug/$FAILTYPE/space +echo 2 > /sys/kernel/debug/$FAILTYPE/verbose +echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait +echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-highmem +echo 10 > /sys/kernel/debug/$FAILTYPE/stacktrace-depth -trap "echo 0 > /debug/$FAILTYPE/probability" SIGINT SIGTERM EXIT +trap "echo 0 > /sys/kernel/debug/$FAILTYPE/probability" SIGINT SIGTERM EXIT echo "Injecting errors into the module $module... (interrupt to stop)" sleep 1000000 diff --git a/Documentation/fb/vesafb.txt b/Documentation/fb/vesafb.txt index ee277dd204b..950d5a658cb 100644 --- a/Documentation/fb/vesafb.txt +++ b/Documentation/fb/vesafb.txt @@ -95,7 +95,7 @@ There is no way to change the vesafb video mode and/or timings after booting linux. If you are not happy with the 60 Hz refresh rate, you have these options: - * configure and load the DOS-Tools for your the graphics board (if + * configure and load the DOS-Tools for the graphics board (if available) and boot linux with loadlin. * use a native driver (matroxfb/atyfb) instead if vesafb. If none is available, write a new one! diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index edb2f0b0761..fa75220f8d3 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -6,6 +6,49 @@ be removed from this file. --------------------------- +What: PRISM54 +When: 2.6.34 + +Why: prism54 FullMAC PCI / Cardbus devices used to be supported only by the + prism54 wireless driver. After Intersil stopped selling these + devices in preference for the newer more flexible SoftMAC devices + a SoftMAC device driver was required and prism54 did not support + them. The p54pci driver now exists and has been present in the kernel for + a while. This driver supports both SoftMAC devices and FullMAC devices. + The main difference between these devices was the amount of memory which + could be used for the firmware. The SoftMAC devices support a smaller + amount of memory. Because of this the SoftMAC firmware fits into FullMAC + devices's memory. p54pci supports not only PCI / Cardbus but also USB + and SPI. Since p54pci supports all devices prism54 supports + you will have a conflict. I'm not quite sure how distributions are + handling this conflict right now. prism54 was kept around due to + claims users may experience issues when using the SoftMAC driver. + Time has passed users have not reported issues. If you use prism54 + and for whatever reason you cannot use p54pci please let us know! + E-mail us at: linux-wireless@vger.kernel.org + + For more information see the p54 wiki page: + + http://wireless.kernel.org/en/users/Drivers/p54 + +Who: Luis R. Rodriguez <lrodriguez@atheros.com> + +--------------------------- + +What: IRQF_SAMPLE_RANDOM +Check: IRQF_SAMPLE_RANDOM +When: July 2009 + +Why: Many of IRQF_SAMPLE_RANDOM users are technically bogus as entropy + sources in the kernel's current entropy model. To resolve this, every + input point to the kernel's entropy pool needs to better document the + type of entropy source it actually is. This will be replaced with + additional add_*_randomness functions in drivers/char/random.c + +Who: Robin Getz <rgetz@blackfin.uclinux.org> & Matt Mackall <mpm@selenic.com> + +--------------------------- + What: The ieee80211_regdom module parameter When: March 2010 / desktop catchup @@ -192,24 +235,6 @@ Who: Len Brown <len.brown@intel.com> --------------------------- -What: libata spindown skipping and warning -When: Dec 2008 -Why: Some halt(8) implementations synchronize caches for and spin - down libata disks because libata didn't use to spin down disk on - system halt (only synchronized caches). - Spin down on system halt is now implemented. sysfs node - /sys/class/scsi_disk/h:c:i:l/manage_start_stop is present if - spin down support is available. - Because issuing spin down command to an already spun down disk - makes some disks spin up just to spin down again, libata tracks - device spindown status to skip the extra spindown command and - warn about it. - This is to give userspace tools the time to get updated and will - be removed after userspace is reasonably updated. -Who: Tejun Heo <htejun@gmail.com> - ---------------------------- - What: i386/x86_64 bzImage symlinks When: April 2010 @@ -221,31 +246,6 @@ Who: Thomas Gleixner <tglx@linutronix.de> --------------------------- What (Why): - - include/linux/netfilter_ipv4/ipt_TOS.h ipt_tos.h header files - (superseded by xt_TOS/xt_tos target & match) - - - "forwarding" header files like ipt_mac.h in - include/linux/netfilter_ipv4/ and include/linux/netfilter_ipv6/ - - - xt_CONNMARK match revision 0 - (superseded by xt_CONNMARK match revision 1) - - - xt_MARK target revisions 0 and 1 - (superseded by xt_MARK match revision 2) - - - xt_connmark match revision 0 - (superseded by xt_connmark match revision 1) - - - xt_conntrack match revision 0 - (superseded by xt_conntrack match revision 1) - - - xt_iprange match revision 0, - include/linux/netfilter_ipv4/ipt_iprange.h - (superseded by xt_iprange match revision 1) - - - xt_mark match revision 0 - (superseded by xt_mark match revision 1) - - xt_recent: the old ipt_recent proc dir (superseded by /proc/net/xt_recent) @@ -354,16 +354,6 @@ Who: Krzysztof Piotr Oledzki <ole@ans.pl> --------------------------- -What: i2c_attach_client(), i2c_detach_client(), i2c_driver->detach_client(), - i2c_adapter->client_register(), i2c_adapter->client_unregister -When: 2.6.30 -Check: i2c_attach_client i2c_detach_client -Why: Deprecated by the new (standard) device driver binding model. Use - i2c_driver->probe() and ->remove() instead. -Who: Jean Delvare <khali@linux-fr.org> - ---------------------------- - What: fscher and fscpos drivers When: June 2009 Why: Deprecated by the new fschmd driver. @@ -390,15 +380,6 @@ Who: Thomas Gleixner <tglx@linutronix.de> ----------------------------- -What: obsolete generic irq defines and typedefs -When: 2.6.30 -Why: The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) - have been kept around for migration reasons. After more than two years - it's time to remove them finally -Who: Thomas Gleixner <tglx@linutronix.de> - ---------------------------- - What: fakephp and associated sysfs files in /sys/bus/pci/slots/ When: 2011 Why: In 2.6.27, the semantics of /sys/bus/pci/slots was redefined to @@ -444,3 +425,37 @@ What: CONFIG_RFKILL_INPUT When: 2.6.33 Why: Should be implemented in userspace, policy daemon. Who: Johannes Berg <johannes@sipsolutions.net> + +---------------------------- + +What: lock_policy_rwsem_* and unlock_policy_rwsem_* will not be + exported interface anymore. +When: 2.6.33 +Why: cpu_policy_rwsem has a new cleaner definition making it local to + cpufreq core and contained inside cpufreq.c. Other dependent + drivers should not use it in order to safely avoid lockdep issues. +Who: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> + +---------------------------- + +What: sound-slot/service-* module aliases and related clutters in + sound/sound_core.c +When: August 2010 +Why: OSS sound_core grabs all legacy minors (0-255) of SOUND_MAJOR + (14) and requests modules using custom sound-slot/service-* + module aliases. The only benefit of doing this is allowing + use of custom module aliases which might as well be considered + a bug at this point. This preemptive claiming prevents + alternative OSS implementations. + + Till the feature is removed, the kernel will be requesting + both sound-slot/service-* and the standard char-major-* module + aliases and allow turning off the pre-claiming selectively via + CONFIG_SOUND_OSS_CORE_PRECLAIM and soundcore.preclaim_oss + kernel parameter. + + After the transition phase is complete, both the custom module + aliases and switches to disable it will go away. This removal + will also allow making ALSA OSS emulation independent of + sound_core. The dependency will be broken then too. +Who: Tejun Heo <tj@kernel.org> diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index 8dd6db76171..f15621ee559 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -66,6 +66,10 @@ mandatory-locking.txt - info on the Linux implementation of Sys V mandatory file locking. ncpfs.txt - info on Novell Netware(tm) filesystem using NCP protocol. +nfs41-server.txt + - info on the Linux server implementation of NFSv4 minor version 1. +nfs-rdma.txt + - how to install and setup the Linux NFS/RDMA client and server software. nfsroot.txt - short guide on setting up a diskless box with NFS root filesystem. nilfs2.txt diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt index bf8080640eb..6208f55c44c 100644 --- a/Documentation/filesystems/9p.txt +++ b/Documentation/filesystems/9p.txt @@ -123,6 +123,9 @@ available from the same CVS repository. There are user and developer mailing lists available through the v9fs project on sourceforge (http://sourceforge.net/projects/v9fs). +A stand-alone version of the module (which should build for any 2.6 kernel) +is available via (http://github.com/ericvh/9p-sac/tree/master) + News and other information is maintained on SWiK (http://swik.net/v9fs). Bug reports may be issued through the kernel.org bugzilla diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 3120f8dd2c3..18b9d0ca063 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -109,27 +109,28 @@ prototypes: locking rules: All may block. - BKL s_lock s_umount -alloc_inode: no no no -destroy_inode: no -dirty_inode: no (must not sleep) -write_inode: no -drop_inode: no !!!inode_lock!!! -delete_inode: no -put_super: yes yes no -write_super: no yes read -sync_fs: no no read -freeze_fs: ? -unfreeze_fs: ? -statfs: no no no -remount_fs: yes yes maybe (see below) -clear_inode: no -umount_begin: yes no no -show_options: no (vfsmount->sem) -quota_read: no no no (see below) -quota_write: no no no (see below) - -->remount_fs() will have the s_umount lock if it's already mounted. + None have BKL + s_umount +alloc_inode: +destroy_inode: +dirty_inode: (must not sleep) +write_inode: +drop_inode: !!!inode_lock!!! +delete_inode: +put_super: write +write_super: read +sync_fs: read +freeze_fs: read +unfreeze_fs: read +statfs: no +remount_fs: maybe (see below) +clear_inode: +umount_begin: no +show_options: no (namespace_sem) +quota_read: no (see below) +quota_write: no (see below) + +->remount_fs() will have the s_umount exclusive lock if it's already mounted. When called from get_sb_single, it does NOT have the s_umount lock. ->quota_read() and ->quota_write() functions are both guaranteed to be the only ones operating on the quota file by the quota code (via @@ -187,7 +188,7 @@ readpages: no write_begin: no locks the page yes write_end: no yes, unlocks yes perform_write: no n/a yes -bmap: yes +bmap: no invalidatepage: no yes releasepage: no yes direct_IO: no diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt index 12ad6c7f4e5..ffef91c4e0d 100644 --- a/Documentation/filesystems/afs.txt +++ b/Documentation/filesystems/afs.txt @@ -23,15 +23,13 @@ it does support include: (*) Security (currently only AFS kaserver and KerberosIV tickets). - (*) File reading. + (*) File reading and writing. (*) Automounting. -It does not yet support the following AFS features: - - (*) Write support. + (*) Local caching (via fscache). - (*) Local caching. +It does not yet support the following AFS features: (*) pioctl() system call. @@ -56,7 +54,7 @@ They permit the debugging messages to be turned on dynamically by manipulating the masks in the following files: /sys/module/af_rxrpc/parameters/debug - /sys/module/afs/parameters/debug + /sys/module/kafs/parameters/debug ===== @@ -66,9 +64,9 @@ USAGE When inserting the driver modules the root cell must be specified along with a list of volume location server IP addresses: - insmod af_rxrpc.o - insmod rxkad.o - insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 + modprobe af_rxrpc + modprobe rxkad + modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 The first module is the AF_RXRPC network protocol driver. This provides the RxRPC remote operation protocol and may also be accessed from userspace. See: @@ -81,7 +79,7 @@ is the actual filesystem driver for the AFS filesystem. Once the module has been loaded, more modules can be added by the following procedure: - echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells Where the parameters to the "add" command are the name of a cell and a list of volume location servers within that cell, with the latter separated by colons. @@ -101,7 +99,7 @@ The name of the volume can be suffixes with ".backup" or ".readonly" to specify connection to only volumes of those types. The name of the cell is optional, and if not given during a mount, then the -named volume will be looked up in the cell specified during insmod. +named volume will be looked up in the cell specified during modprobe. Additional cells can be added through /proc (see later section). @@ -163,14 +161,14 @@ THE CELL DATABASE The filesystem maintains an internal database of all the cells it knows and the IP addresses of the volume location servers for those cells. The cell to which -the system belongs is added to the database when insmod is performed by the +the system belongs is added to the database when modprobe is performed by the "rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on the kernel command line. Further cells can be added by commands similar to the following: echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells - echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells No other cell database operations are available at this time. @@ -233,7 +231,7 @@ insmod /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.91 mount -t afs \%root.afs. /afs mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/ -echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells +echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 > /proc/fs/afs/cells mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/ mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index e055acb6b2d..67639f905f1 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -322,7 +322,7 @@ an upper limit on the block size imposed by the page size of the kernel, so 8kB blocks are only allowed on Alpha systems (and other architectures which support larger pages). -There is an upper limit of 32768 subdirectories in a single directory. +There is an upper limit of 32000 subdirectories in a single directory. There is a "soft" upper limit of about 10-15k files in a single directory with the current linear linked-list directory implementation. This limit diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 608fdba97b7..7be02ac5fa3 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -235,6 +235,10 @@ minixdf Make 'df' act like Minix. debug Extra debugging information is sent to syslog. +abort Simulate the effects of calling ext4_abort() for + debugging purposes. This is normally used while + remounting a filesystem which is already mounted. + errors=remount-ro Remount the filesystem read-only on an error. errors=continue Keep going on a filesystem error. errors=panic Panic and halt the machine if an error occurs. diff --git a/Documentation/filesystems/gfs2-uevents.txt b/Documentation/filesystems/gfs2-uevents.txt new file mode 100644 index 00000000000..fd966dc9979 --- /dev/null +++ b/Documentation/filesystems/gfs2-uevents.txt @@ -0,0 +1,100 @@ + uevents and GFS2 + ================== + +During the lifetime of a GFS2 mount, a number of uevents are generated. +This document explains what the events are and what they are used +for (by gfs_controld in gfs2-utils). + +A list of GFS2 uevents +----------------------- + +1. ADD + +The ADD event occurs at mount time. It will always be the first +uevent generated by the newly created filesystem. If the mount +is successful, an ONLINE uevent will follow. If it is not successful +then a REMOVE uevent will follow. + +The ADD uevent has two environment variables: SPECTATOR=[0|1] +and RDONLY=[0|1] that specify the spectator status (a read-only mount +with no journal assigned), and read-only (with journal assigned) status +of the filesystem respectively. + +2. ONLINE + +The ONLINE uevent is generated after a successful mount or remount. It +has the same environment variables as the ADD uevent. The ONLINE +uevent, along with the two environment variables for spectator and +RDONLY are a relatively recent addition (2.6.32-rc+) and will not +be generated by older kernels. + +3. CHANGE + +The CHANGE uevent is used in two places. One is when reporting the +successful mount of the filesystem by the first node (FIRSTMOUNT=Done). +This is used as a signal by gfs_controld that it is then ok for other +nodes in the cluster to mount the filesystem. + +The other CHANGE uevent is used to inform of the completion +of journal recovery for one of the filesystems journals. It has +two environment variables, JID= which specifies the journal id which +has just been recovered, and RECOVERY=[Done|Failed] to indicate the +success (or otherwise) of the operation. These uevents are generated +for every journal recovered, whether it is during the initial mount +process or as the result of gfs_controld requesting a specific journal +recovery via the /sys/fs/gfs2/<fsname>/lock_module/recovery file. + +Because the CHANGE uevent was used (in early versions of gfs_controld) +without checking the environment variables to discover the state, we +cannot add any more functions to it without running the risk of +someone using an older version of the user tools and breaking their +cluster. For this reason the ONLINE uevent was used when adding a new +uevent for a successful mount or remount. + +4. OFFLINE + +The OFFLINE uevent is only generated due to filesystem errors and is used +as part of the "withdraw" mechanism. Currently this doesn't give any +information about what the error is, which is something that needs to +be fixed. + +5. REMOVE + +The REMOVE uevent is generated at the end of an unsuccessful mount +or at the end of a umount of the filesystem. All REMOVE uevents will +have been preceeded by at least an ADD uevent for the same fileystem, +and unlike the other uevents is generated automatically by the kernel's +kobject subsystem. + + +Information common to all GFS2 uevents (uevent environment variables) +---------------------------------------------------------------------- + +1. LOCKTABLE= + +The LOCKTABLE is a string, as supplied on the mount command +line (locktable=) or via fstab. It is used as a filesystem label +as well as providing the information for a lock_dlm mount to be +able to join the cluster. + +2. LOCKPROTO= + +The LOCKPROTO is a string, and its value depends on what is set +on the mount command line, or via fstab. It will be either +lock_nolock or lock_dlm. In the future other lock managers +may be supported. + +3. JOURNALID= + +If a journal is in use by the filesystem (journals are not +assigned for spectator mounts) then this will give the +numeric journal id in all GFS2 uevents. + +4. UUID= + +With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID +into the filesystem superblock. If it exists, this will +be included in every uevent relating to the filesystem. + + + diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt index 6973b980ca2..3c367c3b360 100644 --- a/Documentation/filesystems/isofs.txt +++ b/Documentation/filesystems/isofs.txt @@ -23,8 +23,13 @@ Mount options unique to the isofs filesystem. map=off Do not map non-Rock Ridge filenames to lower case map=normal Map non-Rock Ridge filenames to lower case map=acorn As map=normal but also apply Acorn extensions if present - mode=xxx Sets the permissions on files to xxx - dmode=xxx Sets the permissions on directories to xxx + mode=xxx Sets the permissions on files to xxx unless Rock Ridge + extensions set the permissions otherwise + dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge + extensions set the permissions otherwise + overriderockperm Set permissions on files and directories according to + 'mode' and 'dmode' even though Rock Ridge extensions are + present. nojoliet Ignore Joliet extensions if they are present. norock Ignore Rock Ridge extensions if they are present. hide Completely strip hidden files from the file system. diff --git a/Documentation/filesystems/nfs.txt b/Documentation/filesystems/nfs.txt new file mode 100644 index 00000000000..f50f26ce6cd --- /dev/null +++ b/Documentation/filesystems/nfs.txt @@ -0,0 +1,98 @@ + +The NFS client +============== + +The NFS version 2 protocol was first documented in RFC1094 (March 1989). +Since then two more major releases of NFS have been published, with NFSv3 +being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April +2003). + +The Linux NFS client currently supports all the above published versions, +and work is in progress on adding support for minor version 1 of the NFSv4 +protocol. + +The purpose of this document is to provide information on some of the +upcall interfaces that are used in order to provide the NFS client with +some of the information that it requires in order to fully comply with +the NFS spec. + +The DNS resolver +================ + +NFSv4 allows for one server to refer the NFS client to data that has been +migrated onto another server by means of the special "fs_locations" +attribute. See + http://tools.ietf.org/html/rfc3530#section-6 +and + http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00 + +The fs_locations information can take the form of either an ip address and +a path, or a DNS hostname and a path. The latter requires the NFS client to +do a DNS lookup in order to mount the new volume, and hence the need for an +upcall to allow userland to provide this service. + +Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual +/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps: + + (1) The process checks the dns_resolve cache to see if it contains a + valid entry. If so, it returns that entry and exits. + + (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent' + (may be changed using the 'nfs.cache_getent' kernel boot parameter) + is run, with two arguments: + - the cache name, "dns_resolve" + - the hostname to resolve + + (3) After looking up the corresponding ip address, the helper script + writes the result into the rpc_pipefs pseudo-file + '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel' + in the following (text) format: + + "<ip address> <hostname> <ttl>\n" + + Where <ip address> is in the usual IPv4 (123.456.78.90) or IPv6 + (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format. + <hostname> is identical to the second argument of the helper + script, and <ttl> is the 'time to live' of this cache entry (in + units of seconds). + + Note: If <ip address> is invalid, say the string "0", then a negative + entry is created, which will cause the kernel to treat the hostname + as having no valid DNS translation. + + + + +A basic sample /sbin/nfs_cache_getent +===================================== + +#!/bin/bash +# +ttl=600 +# +cut=/usr/bin/cut +getent=/usr/bin/getent +rpc_pipefs=/var/lib/nfs/rpc_pipefs +# +die() +{ + echo "Usage: $0 cache_name entry_name" + exit 1 +} + +[ $# -lt 2 ] && die +cachename="$1" +cache_path=${rpc_pipefs}/cache/${cachename}/channel + +case "${cachename}" in + dns_resolve) + name="$2" + result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )" + [ -z "${result}" ] && result="0" + ;; + *) + die + ;; +esac +echo "${result} ${name} ${ttl}" >${cache_path} + diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index cd8717a3627..ffead13f944 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -5,11 +5,12 @@ Bodo Bauer <bb@ricochet.net> 2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000 -move /proc/sys Shen Feng <shen@cn.fujitsu.com> April 1 2009 +move /proc/sys Shen Feng <shen@cn.fujitsu.com> April 1 2009 ------------------------------------------------------------------------------ Version 1.3 Kernel version 2.2.12 Kernel version 2.4.0-test11-pre4 ------------------------------------------------------------------------------ +fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009 Table of Contents ----------------- @@ -116,7 +117,7 @@ The link self points to the process reading the file system. Each process subdirectory has the entries listed in Table 1-1. -Table 1-1: Process specific entries in /proc +Table 1-1: Process specific entries in /proc .............................................................................. File Content clear_refs Clears page referenced bits shown in smaps output @@ -134,46 +135,103 @@ Table 1-1: Process specific entries in /proc status Process status in human readable form wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan stack Report full stack trace, enable via CONFIG_STACKTRACE - smaps Extension based on maps, the rss size for each mapped file + smaps a extension based on maps, showing the memory consumption of + each mapping .............................................................................. For example, to get the status information of a process, all you have to do is read the file /proc/PID/status: - >cat /proc/self/status - Name: cat - State: R (running) - Pid: 5452 - PPid: 743 + >cat /proc/self/status + Name: cat + State: R (running) + Tgid: 5452 + Pid: 5452 + PPid: 743 TracerPid: 0 (2.4) - Uid: 501 501 501 501 - Gid: 100 100 100 100 - Groups: 100 14 16 - VmSize: 1112 kB - VmLck: 0 kB - VmRSS: 348 kB - VmData: 24 kB - VmStk: 12 kB - VmExe: 8 kB - VmLib: 1044 kB - SigPnd: 0000000000000000 - SigBlk: 0000000000000000 - SigIgn: 0000000000000000 - SigCgt: 0000000000000000 - CapInh: 00000000fffffeff - CapPrm: 0000000000000000 - CapEff: 0000000000000000 - + Uid: 501 501 501 501 + Gid: 100 100 100 100 + FDSize: 256 + Groups: 100 14 16 + VmPeak: 5004 kB + VmSize: 5004 kB + VmLck: 0 kB + VmHWM: 476 kB + VmRSS: 476 kB + VmData: 156 kB + VmStk: 88 kB + VmExe: 68 kB + VmLib: 1412 kB + VmPTE: 20 kb + Threads: 1 + SigQ: 0/28578 + SigPnd: 0000000000000000 + ShdPnd: 0000000000000000 + SigBlk: 0000000000000000 + SigIgn: 0000000000000000 + SigCgt: 0000000000000000 + CapInh: 00000000fffffeff + CapPrm: 0000000000000000 + CapEff: 0000000000000000 + CapBnd: ffffffffffffffff + voluntary_ctxt_switches: 0 + nonvoluntary_ctxt_switches: 1 This shows you nearly the same information you would get if you viewed it with the ps command. In fact, ps uses the proc file system to obtain its -information. The statm file contains more detailed information about the -process memory usage. Its seven fields are explained in Table 1-2. The stat -file contains details information about the process itself. Its fields are -explained in Table 1-3. +information. But you get a more detailed view of the process by reading the +file /proc/PID/status. It fields are described in table 1-2. + +The statm file contains more detailed information about the process +memory usage. Its seven fields are explained in Table 1-3. The stat file +contains details information about the process itself. Its fields are +explained in Table 1-4. +Table 1-2: Contents of the statm files (as of 2.6.30-rc7) +.............................................................................. + Field Content + Name filename of the executable + State state (R is running, S is sleeping, D is sleeping + in an uninterruptible wait, Z is zombie, + T is traced or stopped) + Tgid thread group ID + Pid process id + PPid process id of the parent process + TracerPid PID of process tracing this process (0 if not) + Uid Real, effective, saved set, and file system UIDs + Gid Real, effective, saved set, and file system GIDs + FDSize number of file descriptor slots currently allocated + Groups supplementary group list + VmPeak peak virtual memory size + VmSize total program size + VmLck locked memory size + VmHWM peak resident set size ("high water mark") + VmRSS size of memory portions + VmData size of data, stack, and text segments + VmStk size of data, stack, and text segments + VmExe size of text segment + VmLib size of shared library code + VmPTE size of page table entries + Threads number of threads + SigQ number of signals queued/max. number for queue + SigPnd bitmap of pending signals for the thread + ShdPnd bitmap of shared pending signals for the process + SigBlk bitmap of blocked signals + SigIgn bitmap of ignored signals + SigCgt bitmap of catched signals + CapInh bitmap of inheritable capabilities + CapPrm bitmap of permitted capabilities + CapEff bitmap of effective capabilities + CapBnd bitmap of capabilities bounding set + Cpus_allowed mask of CPUs on which this process may run + Cpus_allowed_list Same as previous, but in "list format" + Mems_allowed mask of memory nodes allowed to this process + Mems_allowed_list Same as previous, but in "list format" + voluntary_ctxt_switches number of voluntary context switches + nonvoluntary_ctxt_switches number of non voluntary context switches +.............................................................................. -Table 1-2: Contents of the statm files (as of 2.6.8-rc3) +Table 1-3: Contents of the statm files (as of 2.6.8-rc3) .............................................................................. Field Content size total program size (pages) (same as VmSize in status) @@ -188,7 +246,7 @@ Table 1-2: Contents of the statm files (as of 2.6.8-rc3) .............................................................................. -Table 1-3: Contents of the stat files (as of 2.6.22-rc3) +Table 1-4: Contents of the stat files (as of 2.6.30-rc7) .............................................................................. Field Content pid process id @@ -222,10 +280,10 @@ Table 1-3: Contents of the stat files (as of 2.6.22-rc3) start_stack address of the start of the stack esp current value of ESP eip current value of EIP - pending bitmap of pending signals (obsolete) - blocked bitmap of blocked signals (obsolete) - sigign bitmap of ignored signals (obsolete) - sigcatch bitmap of catched signals (obsolete) + pending bitmap of pending signals + blocked bitmap of blocked signals + sigign bitmap of ignored signals + sigcatch bitmap of catched signals wchan address where process went to sleep 0 (place holder) 0 (place holder) @@ -234,19 +292,99 @@ Table 1-3: Contents of the stat files (as of 2.6.22-rc3) rt_priority realtime priority policy scheduling policy (man sched_setscheduler) blkio_ticks time spent waiting for block IO + gtime guest time of the task in jiffies + cgtime guest time of the task children in jiffies .............................................................................. +The /proc/PID/map file containing the currently mapped memory regions and +their access permissions. + +The format is: + +address perms offset dev inode pathname + +08048000-08049000 r-xp 00000000 03:00 8312 /opt/test +08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test +0804a000-0806b000 rw-p 00000000 00:00 0 [heap] +a7cb1000-a7cb2000 ---p 00000000 00:00 0 +a7cb2000-a7eb2000 rw-p 00000000 00:00 0 +a7eb2000-a7eb3000 ---p 00000000 00:00 0 +a7eb3000-a7ed5000 rw-p 00000000 00:00 0 +a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 +a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 +a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 +a800b000-a800e000 rw-p 00000000 00:00 0 +a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 +a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 +a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 +a8024000-a8027000 rw-p 00000000 00:00 0 +a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 +a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 +a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 +aff35000-aff4a000 rw-p 00000000 00:00 0 [stack] +ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] + +where "address" is the address space in the process that it occupies, "perms" +is a set of permissions: + + r = read + w = write + x = execute + s = shared + p = private (copy on write) + +"offset" is the offset into the mapping, "dev" is the device (major:minor), and +"inode" is the inode on that device. 0 indicates that no inode is associated +with the memory region, as the case would be with BSS (uninitialized data). +The "pathname" shows the name associated file for this mapping. If the mapping +is not associated with a file: + + [heap] = the heap of the program + [stack] = the stack of the main process + [vdso] = the "virtual dynamic shared object", + the kernel system call handler + + or if empty, the mapping is anonymous. + + +The /proc/PID/smaps is an extension based on maps, showing the memory +consumption for each of the process's mappings. For each of mappings there +is a series of lines such as the following: + +08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash +Size: 1084 kB +Rss: 892 kB +Pss: 374 kB +Shared_Clean: 892 kB +Shared_Dirty: 0 kB +Private_Clean: 0 kB +Private_Dirty: 0 kB +Referenced: 892 kB +Swap: 0 kB +KernelPageSize: 4 kB +MMUPageSize: 4 kB + +The first of these lines shows the same information as is displayed for the +mapping in /proc/PID/maps. The remaining lines show the size of the mapping, +the amount of the mapping that is currently resident in RAM, the "proportional +set size†(divide each shared page by the number of processes sharing it), the +number of clean and dirty shared pages in the mapping, and the number of clean +and dirty private pages in the mapping. The "Referenced" indicates the amount +of memory currently marked as referenced or accessed. + +This file is only present if the CONFIG_MMU kernel configuration option is +enabled. 1.2 Kernel data --------------- Similar to the process entries, the kernel data files give information about the running kernel. The files used to obtain this information are contained in -/proc and are listed in Table 1-4. Not all of these will be present in your +/proc and are listed in Table 1-5. Not all of these will be present in your system. It depends on the kernel configuration and the loaded modules, which files are there, and which are missing. -Table 1-4: Kernel info in /proc +Table 1-5: Kernel info in /proc .............................................................................. File Content apm Advanced power management info @@ -283,6 +421,7 @@ Table 1-4: Kernel info in /proc rtc Real time clock scsi SCSI info (see text) slabinfo Slab pool info + softirqs softirq usage stat Overall statistics swaps Swap space utilization sys See chapter 2 @@ -597,6 +736,25 @@ on the kind of area : 0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ... pages=10 vmalloc N0=10 +.............................................................................. + +softirqs: + +Provides counts of softirq handlers serviced since boot time, for each cpu. + +> cat /proc/softirqs + CPU0 CPU1 CPU2 CPU3 + HI: 0 0 0 0 + TIMER: 27166 27120 27097 27034 + NET_TX: 0 0 0 17 + NET_RX: 42 0 0 39 + BLOCK: 0 0 107 1121 + TASKLET: 0 0 0 290 + SCHED: 27035 26983 26971 26746 + HRTIMER: 0 0 0 0 + RCU: 1678 1769 2178 2250 + + 1.3 IDE devices in /proc/ide ---------------------------- @@ -614,10 +772,10 @@ IDE devices: More detailed information can be found in the controller specific subdirectories. These are named ide0, ide1 and so on. Each of these -directories contains the files shown in table 1-5. +directories contains the files shown in table 1-6. -Table 1-5: IDE controller info in /proc/ide/ide? +Table 1-6: IDE controller info in /proc/ide/ide? .............................................................................. File Content channel IDE channel (0 or 1) @@ -627,11 +785,11 @@ Table 1-5: IDE controller info in /proc/ide/ide? .............................................................................. Each device connected to a controller has a separate subdirectory in the -controllers directory. The files listed in table 1-6 are contained in these +controllers directory. The files listed in table 1-7 are contained in these directories. -Table 1-6: IDE device information +Table 1-7: IDE device information .............................................................................. File Content cache The cache @@ -673,12 +831,12 @@ the drive parameters: 1.4 Networking info in /proc/net -------------------------------- -The subdirectory /proc/net follows the usual pattern. Table 1-6 shows the +The subdirectory /proc/net follows the usual pattern. Table 1-8 shows the additional values you get for IP version 6 if you configure the kernel to -support this. Table 1-7 lists the files and their meaning. +support this. Table 1-9 lists the files and their meaning. -Table 1-6: IPv6 info in /proc/net +Table 1-8: IPv6 info in /proc/net .............................................................................. File Content udp6 UDP sockets (IPv6) @@ -693,7 +851,7 @@ Table 1-6: IPv6 info in /proc/net .............................................................................. -Table 1-7: Network info in /proc/net +Table 1-9: Network info in /proc/net .............................................................................. File Content arp Kernel ARP table @@ -817,10 +975,10 @@ The directory /proc/parport contains information about the parallel ports of your system. It has one subdirectory for each port, named after the port number (0,1,2,...). -These directories contain the four files shown in Table 1-8. +These directories contain the four files shown in Table 1-10. -Table 1-8: Files in /proc/parport +Table 1-10: Files in /proc/parport .............................................................................. File Content autoprobe Any IEEE-1284 device ID information that has been acquired. @@ -838,10 +996,10 @@ Table 1-8: Files in /proc/parport Information about the available and actually used tty's can be found in the directory /proc/tty.You'll find entries for drivers and line disciplines in -this directory, as shown in Table 1-9. +this directory, as shown in Table 1-11. -Table 1-9: Files in /proc/tty +Table 1-11: Files in /proc/tty .............................................................................. File Content drivers list of drivers and their usage @@ -883,6 +1041,7 @@ since the system first booted. For a quick look, simply cat the file: processes 2915 procs_running 1 procs_blocked 0 + softirq 183433 0 21755 12 39 1137 231 21459 2263 The very first "cpu" line aggregates the numbers in all of the other "cpuN" lines. These numbers identify the amount of time the CPU has spent performing @@ -918,6 +1077,11 @@ CPUs. The "procs_blocked" line gives the number of processes currently blocked, waiting for I/O to complete. +The "softirq" line gives counts of softirqs serviced since boot time, for each +of the possible system softirqs. The first column is the total of all +softirqs serviced; each subsequent column is the total for that particular +softirq. + 1.9 Ext4 file system parameters ------------------------------ @@ -926,9 +1090,9 @@ Information about mounted ext4 file systems can be found in /proc/fs/ext4. Each mounted filesystem will have a directory in /proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or /proc/fs/ext4/dm-0). The files in each per-device directory are shown -in Table 1-10, below. +in Table 1-12, below. -Table 1-10: Files in /proc/fs/ext4/<devname> +Table 1-12: Files in /proc/fs/ext4/<devname> .............................................................................. File Content mb_groups details of multiblock allocator buddy cache of free blocks diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.txt index b843743aa0b..0d15ebccf5b 100644 --- a/Documentation/filesystems/seq_file.txt +++ b/Documentation/filesystems/seq_file.txt @@ -46,7 +46,7 @@ better to do. The file is seekable, in that one can do something like the following: dd if=/proc/sequence of=out1 count=1 - dd if=/proc/sequence skip=1 out=out2 count=1 + dd if=/proc/sequence skip=1 of=out2 count=1 Then concatenate the output files out1 and out2 and get the right result. Yes, it is a thoroughly useless module, but the point is to show diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt index 7e81e37c0b1..b245d524d56 100644 --- a/Documentation/filesystems/sysfs.txt +++ b/Documentation/filesystems/sysfs.txt @@ -23,7 +23,8 @@ interface. Using sysfs ~~~~~~~~~~~ -sysfs is always compiled in. You can access it by doing: +sysfs is always compiled in if CONFIG_SYSFS is defined. You can access +it by doing: mount -t sysfs sysfs /sys diff --git a/Documentation/firmware_class/README b/Documentation/firmware_class/README index c3480aa66ba..7eceaff63f5 100644 --- a/Documentation/firmware_class/README +++ b/Documentation/firmware_class/README @@ -77,7 +77,8 @@ seconds for the whole load operation. - request_firmware_nowait() is also provided for convenience in - non-user contexts. + user contexts to request firmware asynchronously, but can't be called + in atomic contexts. about in-kernel persistence: diff --git a/Documentation/flexible-arrays.txt b/Documentation/flexible-arrays.txt new file mode 100644 index 00000000000..84eb26808de --- /dev/null +++ b/Documentation/flexible-arrays.txt @@ -0,0 +1,99 @@ +Using flexible arrays in the kernel +Last updated for 2.6.31 +Jonathan Corbet <corbet@lwn.net> + +Large contiguous memory allocations can be unreliable in the Linux kernel. +Kernel programmers will sometimes respond to this problem by allocating +pages with vmalloc(). This solution not ideal, though. On 32-bit systems, +memory from vmalloc() must be mapped into a relatively small address space; +it's easy to run out. On SMP systems, the page table changes required by +vmalloc() allocations can require expensive cross-processor interrupts on +all CPUs. And, on all systems, use of space in the vmalloc() range +increases pressure on the translation lookaside buffer (TLB), reducing the +performance of the system. + +In many cases, the need for memory from vmalloc() can be eliminated by +piecing together an array from smaller parts; the flexible array library +exists to make this task easier. + +A flexible array holds an arbitrary (within limits) number of fixed-sized +objects, accessed via an integer index. Sparse arrays are handled +reasonably well. Only single-page allocations are made, so memory +allocation failures should be relatively rare. The down sides are that the +arrays cannot be indexed directly, individual object size cannot exceed the +system page size, and putting data into a flexible array requires a copy +operation. It's also worth noting that flexible arrays do no internal +locking at all; if concurrent access to an array is possible, then the +caller must arrange for appropriate mutual exclusion. + +The creation of a flexible array is done with: + + #include <linux/flex_array.h> + + struct flex_array *flex_array_alloc(int element_size, + unsigned int total, + gfp_t flags); + +The individual object size is provided by element_size, while total is the +maximum number of objects which can be stored in the array. The flags +argument is passed directly to the internal memory allocation calls. With +the current code, using flags to ask for high memory is likely to lead to +notably unpleasant side effects. + +Storing data into a flexible array is accomplished with a call to: + + int flex_array_put(struct flex_array *array, unsigned int element_nr, + void *src, gfp_t flags); + +This call will copy the data from src into the array, in the position +indicated by element_nr (which must be less than the maximum specified when +the array was created). If any memory allocations must be performed, flags +will be used. The return value is zero on success, a negative error code +otherwise. + +There might possibly be a need to store data into a flexible array while +running in some sort of atomic context; in this situation, sleeping in the +memory allocator would be a bad thing. That can be avoided by using +GFP_ATOMIC for the flags value, but, often, there is a better way. The +trick is to ensure that any needed memory allocations are done before +entering atomic context, using: + + int flex_array_prealloc(struct flex_array *array, unsigned int start, + unsigned int end, gfp_t flags); + +This function will ensure that memory for the elements indexed in the range +defined by start and end has been allocated. Thereafter, a +flex_array_put() call on an element in that range is guaranteed not to +block. + +Getting data back out of the array is done with: + + void *flex_array_get(struct flex_array *fa, unsigned int element_nr); + +The return value is a pointer to the data element, or NULL if that +particular element has never been allocated. + +Note that it is possible to get back a valid pointer for an element which +has never been stored in the array. Memory for array elements is allocated +one page at a time; a single allocation could provide memory for several +adjacent elements. The flexible array code does not know if a specific +element has been written; it only knows if the associated memory is +present. So a flex_array_get() call on an element which was never stored +in the array has the potential to return a pointer to random data. If the +caller does not have a separate way to know which elements were actually +stored, it might be wise, at least, to add GFP_ZERO to the flags argument +to ensure that all elements are zeroed. + +There is no way to remove a single element from the array. It is possible, +though, to remove all elements with a call to: + + void flex_array_free_parts(struct flex_array *array); + +This call frees all elements, but leaves the array itself in place. +Freeing the entire array is done with: + + void flex_array_free(struct flex_array *array); + +As of this writing, there are no users of flexible arrays in the mainline +kernel. The functions described here are also not exported to modules; +that will probably be fixed when somebody comes up with a need for it. diff --git a/Documentation/gcov.txt b/Documentation/gcov.txt new file mode 100644 index 00000000000..40ec6335276 --- /dev/null +++ b/Documentation/gcov.txt @@ -0,0 +1,253 @@ +Using gcov with the Linux kernel +================================ + +1. Introduction +2. Preparation +3. Customization +4. Files +5. Modules +6. Separated build and test machines +7. Troubleshooting +Appendix A: sample script: gather_on_build.sh +Appendix B: sample script: gather_on_test.sh + + +1. Introduction +=============== + +gcov profiling kernel support enables the use of GCC's coverage testing +tool gcov [1] with the Linux kernel. Coverage data of a running kernel +is exported in gcov-compatible format via the "gcov" debugfs directory. +To get coverage data for a specific file, change to the kernel build +directory and use gcov with the -o option as follows (requires root): + +# cd /tmp/linux-out +# gcov -o /sys/kernel/debug/gcov/tmp/linux-out/kernel spinlock.c + +This will create source code files annotated with execution counts +in the current directory. In addition, graphical gcov front-ends such +as lcov [2] can be used to automate the process of collecting data +for the entire kernel and provide coverage overviews in HTML format. + +Possible uses: + +* debugging (has this line been reached at all?) +* test improvement (how do I change my test to cover these lines?) +* minimizing kernel configurations (do I need this option if the + associated code is never run?) + +-- + +[1] http://gcc.gnu.org/onlinedocs/gcc/Gcov.html +[2] http://ltp.sourceforge.net/coverage/lcov.php + + +2. Preparation +============== + +Configure the kernel with: + + CONFIG_DEBUGFS=y + CONFIG_GCOV_KERNEL=y + +and to get coverage data for the entire kernel: + + CONFIG_GCOV_PROFILE_ALL=y + +Note that kernels compiled with profiling flags will be significantly +larger and run slower. Also CONFIG_GCOV_PROFILE_ALL may not be supported +on all architectures. + +Profiling data will only become accessible once debugfs has been +mounted: + + mount -t debugfs none /sys/kernel/debug + + +3. Customization +================ + +To enable profiling for specific files or directories, add a line +similar to the following to the respective kernel Makefile: + + For a single file (e.g. main.o): + GCOV_PROFILE_main.o := y + + For all files in one directory: + GCOV_PROFILE := y + +To exclude files from being profiled even when CONFIG_GCOV_PROFILE_ALL +is specified, use: + + GCOV_PROFILE_main.o := n + and: + GCOV_PROFILE := n + +Only files which are linked to the main kernel image or are compiled as +kernel modules are supported by this mechanism. + + +4. Files +======== + +The gcov kernel support creates the following files in debugfs: + + /sys/kernel/debug/gcov + Parent directory for all gcov-related files. + + /sys/kernel/debug/gcov/reset + Global reset file: resets all coverage data to zero when + written to. + + /sys/kernel/debug/gcov/path/to/compile/dir/file.gcda + The actual gcov data file as understood by the gcov + tool. Resets file coverage data to zero when written to. + + /sys/kernel/debug/gcov/path/to/compile/dir/file.gcno + Symbolic link to a static data file required by the gcov + tool. This file is generated by gcc when compiling with + option -ftest-coverage. + + +5. Modules +========== + +Kernel modules may contain cleanup code which is only run during +module unload time. The gcov mechanism provides a means to collect +coverage data for such code by keeping a copy of the data associated +with the unloaded module. This data remains available through debugfs. +Once the module is loaded again, the associated coverage counters are +initialized with the data from its previous instantiation. + +This behavior can be deactivated by specifying the gcov_persist kernel +parameter: + + gcov_persist=0 + +At run-time, a user can also choose to discard data for an unloaded +module by writing to its data file or the global reset file. + + +6. Separated build and test machines +==================================== + +The gcov kernel profiling infrastructure is designed to work out-of-the +box for setups where kernels are built and run on the same machine. In +cases where the kernel runs on a separate machine, special preparations +must be made, depending on where the gcov tool is used: + +a) gcov is run on the TEST machine + +The gcov tool version on the test machine must be compatible with the +gcc version used for kernel build. Also the following files need to be +copied from build to test machine: + +from the source tree: + - all C source files + headers + +from the build tree: + - all C source files + headers + - all .gcda and .gcno files + - all links to directories + +It is important to note that these files need to be placed into the +exact same file system location on the test machine as on the build +machine. If any of the path components is symbolic link, the actual +directory needs to be used instead (due to make's CURDIR handling). + +b) gcov is run on the BUILD machine + +The following files need to be copied after each test case from test +to build machine: + +from the gcov directory in sysfs: + - all .gcda files + - all links to .gcno files + +These files can be copied to any location on the build machine. gcov +must then be called with the -o option pointing to that directory. + +Example directory setup on the build machine: + + /tmp/linux: kernel source tree + /tmp/out: kernel build directory as specified by make O= + /tmp/coverage: location of the files copied from the test machine + + [user@build] cd /tmp/out + [user@build] gcov -o /tmp/coverage/tmp/out/init main.c + + +7. Troubleshooting +================== + +Problem: Compilation aborts during linker step. +Cause: Profiling flags are specified for source files which are not + linked to the main kernel or which are linked by a custom + linker procedure. +Solution: Exclude affected source files from profiling by specifying + GCOV_PROFILE := n or GCOV_PROFILE_basename.o := n in the + corresponding Makefile. + +Problem: Files copied from sysfs appear empty or incomplete. +Cause: Due to the way seq_file works, some tools such as cp or tar + may not correctly copy files from sysfs. +Solution: Use 'cat' to read .gcda files and 'cp -d' to copy links. + Alternatively use the mechanism shown in Appendix B. + + +Appendix A: gather_on_build.sh +============================== + +Sample script to gather coverage meta files on the build machine +(see 6a): +#!/bin/bash + +KSRC=$1 +KOBJ=$2 +DEST=$3 + +if [ -z "$KSRC" ] || [ -z "$KOBJ" ] || [ -z "$DEST" ]; then + echo "Usage: $0 <ksrc directory> <kobj directory> <output.tar.gz>" >&2 + exit 1 +fi + +KSRC=$(cd $KSRC; printf "all:\n\t@echo \${CURDIR}\n" | make -f -) +KOBJ=$(cd $KOBJ; printf "all:\n\t@echo \${CURDIR}\n" | make -f -) + +find $KSRC $KOBJ \( -name '*.gcno' -o -name '*.[ch]' -o -type l \) -a \ + -perm /u+r,g+r | tar cfz $DEST -P -T - + +if [ $? -eq 0 ] ; then + echo "$DEST successfully created, copy to test system and unpack with:" + echo " tar xfz $DEST -P" +else + echo "Could not create file $DEST" +fi + + +Appendix B: gather_on_test.sh +============================= + +Sample script to gather coverage data files on the test machine +(see 6b): + +#!/bin/bash -e + +DEST=$1 +GCDA=/sys/kernel/debug/gcov + +if [ -z "$DEST" ] ; then + echo "Usage: $0 <output.tar.gz>" >&2 + exit 1 +fi + +TEMPDIR=$(mktemp -d) +echo Collecting data.. +find $GCDA -type d -exec mkdir -p $TEMPDIR/\{\} \; +find $GCDA -name '*.gcda' -exec sh -c 'cat < $0 > '$TEMPDIR'/$0' {} \; +find $GCDA -name '*.gcno' -exec sh -c 'cp -d $0 '$TEMPDIR'/$0' {} \; +tar czf $DEST -C $TEMPDIR sys +rm -rf $TEMPDIR + +echo "$DEST successfully created, copy to build system and unpack with:" +echo " tar xfz $DEST" diff --git a/Documentation/hwmon/pcf8591 b/Documentation/hwmon/pcf8591 index 5628fcf4207..e76a7892f68 100644 --- a/Documentation/hwmon/pcf8591 +++ b/Documentation/hwmon/pcf8591 @@ -2,11 +2,11 @@ Kernel driver pcf8591 ===================== Supported chips: - * Philips PCF8591 + * Philips/NXP PCF8591 Prefix: 'pcf8591' Addresses scanned: I2C 0x48 - 0x4f - Datasheet: Publicly available at the Philips Semiconductor website - http://www.semiconductors.philips.com/pip/PCF8591P.html + Datasheet: Publicly available at the NXP website + http://www.nxp.com/pip/PCF8591_6.html Authors: Aurelien Jarno <aurelien@aurel32.net> @@ -16,9 +16,10 @@ Authors: Description ----------- + The PCF8591 is an 8-bit A/D and D/A converter (4 analog inputs and one -analog output) for the I2C bus produced by Philips Semiconductors. It -is designed to provide a byte I2C interface to up to 4 separate devices. +analog output) for the I2C bus produced by Philips Semiconductors (now NXP). +It is designed to provide a byte I2C interface to up to 4 separate devices. The PCF8591 has 4 analog inputs programmable as single-ended or differential inputs : @@ -58,8 +59,8 @@ Accessing PCF8591 via /sys interface ------------------------------------- ! Be careful ! -The PCF8591 is plainly impossible to detect ! Stupid chip. -So every chip with address in the interval [48..4f] is +The PCF8591 is plainly impossible to detect! Stupid chip. +So every chip with address in the interval [0x48..0x4f] is detected as PCF8591. If you have other chips in this address range, the workaround is to load this module after the one for your others chips. @@ -67,19 +68,20 @@ for your others chips. On detection (i.e. insmod, modprobe et al.), directories are being created for each detected PCF8591: -/sys/bus/devices/<0>-<1>/ +/sys/bus/i2c/devices/<0>-<1>/ where <0> is the bus the chip was detected on (e. g. i2c-0) and <1> the chip address ([48..4f]) Inside these directories, there are such files: -in0, in1, in2, in3, out0_enable, out0_output, name +in0_input, in1_input, in2_input, in3_input, out0_enable, out0_output, name Name contains chip name. -The in0, in1, in2 and in3 files are RO. Reading gives the value of the -corresponding channel. Depending on the current analog inputs configuration, -files in2 and/or in3 do not exist. Values range are from 0 to 255 for single -ended inputs and -128 to +127 for differential inputs (8-bit ADC). +The in0_input, in1_input, in2_input and in3_input files are RO. Reading gives +the value of the corresponding channel. Depending on the current analog inputs +configuration, files in2_input and in3_input may not exist. Values range +from 0 to 255 for single ended inputs and -128 to +127 for differential inputs +(8-bit ADC). The out0_enable file is RW. Reading gives "1" for analog output enabled and "0" for analog output disabled. Writing accepts "0" and "1" accordingly. diff --git a/Documentation/hwmon/tmp421 b/Documentation/hwmon/tmp421 new file mode 100644 index 00000000000..0cf07f82474 --- /dev/null +++ b/Documentation/hwmon/tmp421 @@ -0,0 +1,36 @@ +Kernel driver tmp421 +==================== + +Supported chips: + * Texas Instruments TMP421 + Prefix: 'tmp421' + Addresses scanned: I2C 0x2a, 0x4c, 0x4d, 0x4e and 0x4f + Datasheet: http://focus.ti.com/docs/prod/folders/print/tmp421.html + * Texas Instruments TMP422 + Prefix: 'tmp422' + Addresses scanned: I2C 0x2a, 0x4c, 0x4d, 0x4e and 0x4f + Datasheet: http://focus.ti.com/docs/prod/folders/print/tmp421.html + * Texas Instruments TMP423 + Prefix: 'tmp423' + Addresses scanned: I2C 0x2a, 0x4c, 0x4d, 0x4e and 0x4f + Datasheet: http://focus.ti.com/docs/prod/folders/print/tmp421.html + +Authors: + Andre Prendel <andre.prendel@gmx.de> + +Description +----------- + +This driver implements support for Texas Instruments TMP421, TMP422 +and TMP423 temperature sensor chips. These chips implement one local +and up to one (TMP421), up to two (TMP422) or up to three (TMP423) +remote sensors. Temperature is measured in degrees Celsius. The chips +are wired over I2C/SMBus and specified over a temperature range of -40 +to +125 degrees Celsius. Resolution for both the local and remote +channels is 0.0625 degree C. + +The chips support only temperature measurement. The driver exports +the temperature values via the following sysfs files: + +temp[1-4]_input +temp[2-4]_fault diff --git a/Documentation/i2c/instantiating-devices b/Documentation/i2c/instantiating-devices index b55ce57a84d..c740b7b4108 100644 --- a/Documentation/i2c/instantiating-devices +++ b/Documentation/i2c/instantiating-devices @@ -165,3 +165,47 @@ was done there. Two significant differences are: Once again, method 3 should be avoided wherever possible. Explicit device instantiation (methods 1 and 2) is much preferred for it is safer and faster. + + +Method 4: Instantiate from user-space +------------------------------------- + +In general, the kernel should know which I2C devices are connected and +what addresses they live at. However, in certain cases, it does not, so a +sysfs interface was added to let the user provide the information. This +interface is made of 2 attribute files which are created in every I2C bus +directory: new_device and delete_device. Both files are write only and you +must write the right parameters to them in order to properly instantiate, +respectively delete, an I2C device. + +File new_device takes 2 parameters: the name of the I2C device (a string) +and the address of the I2C device (a number, typically expressed in +hexadecimal starting with 0x, but can also be expressed in decimal.) + +File delete_device takes a single parameter: the address of the I2C +device. As no two devices can live at the same address on a given I2C +segment, the address is sufficient to uniquely identify the device to be +deleted. + +Example: +# echo eeprom 0x50 > /sys/class/i2c-adapter/i2c-3/new_device + +While this interface should only be used when in-kernel device declaration +can't be done, there is a variety of cases where it can be helpful: +* The I2C driver usually detects devices (method 3 above) but the bus + segment your device lives on doesn't have the proper class bit set and + thus detection doesn't trigger. +* The I2C driver usually detects devices, but your device lives at an + unexpected address. +* The I2C driver usually detects devices, but your device is not detected, + either because the detection routine is too strict, or because your + device is not officially supported yet but you know it is compatible. +* You are developing a driver on a test board, where you soldered the I2C + device yourself. + +This interface is a replacement for the force_* module parameters some I2C +drivers implement. Being implemented in i2c-core rather than in each +device driver individually, it is much more efficient, and also has the +advantage that you do not have to reload the driver to change a setting. +You can also instantiate the device before the driver is loaded or even +available, and you don't need to know what driver the device needs. diff --git a/Documentation/i2c/writing-clients b/Documentation/i2c/writing-clients index c1a06f989cf..7860aafb483 100644 --- a/Documentation/i2c/writing-clients +++ b/Documentation/i2c/writing-clients @@ -126,19 +126,9 @@ different) configuration information, as do drivers handling chip variants that can't be distinguished by protocol probing, or which need some board specific information to operate correctly. -Accordingly, the I2C stack now has two models for associating I2C devices -with their drivers: the original "legacy" model, and a newer one that's -fully compatible with the Linux 2.6 driver model. These models do not mix, -since the "legacy" model requires drivers to create "i2c_client" device -objects after SMBus style probing, while the Linux driver model expects -drivers to be given such device objects in their probe() routines. -The legacy model is deprecated now and will soon be removed, so we no -longer document it here. - - -Standard Driver Model Binding ("New Style") -------------------------------------------- +Device/Driver Binding +--------------------- System infrastructure, typically board-specific initialization code or boot firmware, reports what I2C devices exist. For example, there may be @@ -201,7 +191,7 @@ a given I2C bus. This is for example the case of hardware monitoring devices on a PC's SMBus. In that case, you may want to let your driver detect supported devices automatically. This is how the legacy model was working, and is now available as an extension to the standard -driver model (so that we can finally get rid of the legacy model.) +driver model. You simply have to define a detect callback which will attempt to identify supported devices (returning 0 for supported ones and -ENODEV diff --git a/Documentation/input/input.txt b/Documentation/input/input.txt index 686ee9932df..b93c08442e3 100644 --- a/Documentation/input/input.txt +++ b/Documentation/input/input.txt @@ -278,7 +278,7 @@ struct input_event { }; 'time' is the timestamp, it returns the time at which the event happened. -Type is for example EV_REL for relative moment, REL_KEY for a keypress or +Type is for example EV_REL for relative moment, EV_KEY for a keypress or release. More types are defined in include/linux/input.h. 'code' is event code, for example REL_X or KEY_BACKSPACE, again a complete diff --git a/Documentation/input/rotary-encoder.txt b/Documentation/input/rotary-encoder.txt index 435102a26d9..3a6aec40c0b 100644 --- a/Documentation/input/rotary-encoder.txt +++ b/Documentation/input/rotary-encoder.txt @@ -67,7 +67,12 @@ data with it. struct rotary_encoder_platform_data is declared in include/linux/rotary-encoder.h and needs to be filled with the number of steps the encoder has and can carry information about externally inverted -signals (because of used invertig buffer or other reasons). +signals (because of an inverting buffer or other reasons). The encoder +can be set up to deliver input information as either an absolute or relative +axes. For relative axes the input event returns +/-1 for each step. For +absolute axes the position of the encoder can either roll over between zero +and the number of steps or will clamp at the maximum and zero depending on +the configuration. Because GPIO to IRQ mapping is platform specific, this information must be given in seperately to the driver. See the example below. @@ -85,6 +90,8 @@ be given in seperately to the driver. See the example below. static struct rotary_encoder_platform_data my_rotary_encoder_info = { .steps = 24, .axis = ABS_X, + .relative_axis = false, + .rollover = false, .gpio_a = GPIO_ROTARY_A, .gpio_b = GPIO_ROTARY_B, .inverted_a = 0, diff --git a/Documentation/input/sentelic.txt b/Documentation/input/sentelic.txt new file mode 100644 index 00000000000..f7160a2fb6a --- /dev/null +++ b/Documentation/input/sentelic.txt @@ -0,0 +1,475 @@ +Copyright (C) 2002-2008 Sentelic Corporation. +Last update: Oct-31-2008 + +============================================================================== +* Finger Sensing Pad Intellimouse Mode(scrolling wheel, 4th and 5th buttons) +============================================================================== +A) MSID 4: Scrolling wheel mode plus Forward page(4th button) and Backward + page (5th button) +@1. Set sample rate to 200; +@2. Set sample rate to 200; +@3. Set sample rate to 80; +@4. Issuing the "Get device ID" command (0xF2) and waits for the response; +@5. FSP will respond 0x04. + +Packet 1 + Bit 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 +BYTE |---------------|BYTE |---------------|BYTE|---------------|BYTE|---------------| + 1 |Y|X|y|x|1|M|R|L| 2 |X|X|X|X|X|X|X|X| 3 |Y|Y|Y|Y|Y|Y|Y|Y| 4 | | |B|F|W|W|W|W| + |---------------| |---------------| |---------------| |---------------| + +Byte 1: Bit7 => Y overflow + Bit6 => X overflow + Bit5 => Y sign bit + Bit4 => X sign bit + Bit3 => 1 + Bit2 => Middle Button, 1 is pressed, 0 is not pressed. + Bit1 => Right Button, 1 is pressed, 0 is not pressed. + Bit0 => Left Button, 1 is pressed, 0 is not pressed. +Byte 2: X Movement(9-bit 2's complement integers) +Byte 3: Y Movement(9-bit 2's complement integers) +Byte 4: Bit3~Bit0 => the scrolling wheel's movement since the last data report. + valid values, -8 ~ +7 + Bit4 => 1 = 4th mouse button is pressed, Forward one page. + 0 = 4th mouse button is not pressed. + Bit5 => 1 = 5th mouse button is pressed, Backward one page. + 0 = 5th mouse button is not pressed. + +B) MSID 6: Horizontal and Vertical scrolling. +@ Set bit 1 in register 0x40 to 1 + +# FSP replaces scrolling wheel's movement as 4 bits to show horizontal and + vertical scrolling. + +Packet 1 + Bit 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 +BYTE |---------------|BYTE |---------------|BYTE|---------------|BYTE|---------------| + 1 |Y|X|y|x|1|M|R|L| 2 |X|X|X|X|X|X|X|X| 3 |Y|Y|Y|Y|Y|Y|Y|Y| 4 | | |B|F|l|r|u|d| + |---------------| |---------------| |---------------| |---------------| + +Byte 1: Bit7 => Y overflow + Bit6 => X overflow + Bit5 => Y sign bit + Bit4 => X sign bit + Bit3 => 1 + Bit2 => Middle Button, 1 is pressed, 0 is not pressed. + Bit1 => Right Button, 1 is pressed, 0 is not pressed. + Bit0 => Left Button, 1 is pressed, 0 is not pressed. +Byte 2: X Movement(9-bit 2's complement integers) +Byte 3: Y Movement(9-bit 2's complement integers) +Byte 4: Bit0 => the Vertical scrolling movement downward. + Bit1 => the Vertical scrolling movement upward. + Bit2 => the Vertical scrolling movement rightward. + Bit3 => the Vertical scrolling movement leftward. + Bit4 => 1 = 4th mouse button is pressed, Forward one page. + 0 = 4th mouse button is not pressed. + Bit5 => 1 = 5th mouse button is pressed, Backward one page. + 0 = 5th mouse button is not pressed. + +C) MSID 7: +# FSP uses 2 packets(8 Bytes) data to represent Absolute Position + so we have PACKET NUMBER to identify packets. + If PACKET NUMBER is 0, the packet is Packet 1. + If PACKET NUMBER is 1, the packet is Packet 2. + Please count this number in program. + +# MSID6 special packet will be enable at the same time when enable MSID 7. + +============================================================================== +* Absolute position for STL3886-G0. +============================================================================== +@ Set bit 2 or 3 in register 0x40 to 1 +@ Set bit 6 in register 0x40 to 1 + +Packet 1 (ABSOLUTE POSITION) + Bit 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 +BYTE |---------------|BYTE |---------------|BYTE|---------------|BYTE|---------------| + 1 |0|1|V|1|1|M|R|L| 2 |X|X|X|X|X|X|X|X| 3 |Y|Y|Y|Y|Y|Y|Y|Y| 4 |r|l|d|u|X|X|Y|Y| + |---------------| |---------------| |---------------| |---------------| + +Byte 1: Bit7~Bit6 => 00, Normal data packet + => 01, Absolute coordination packet + => 10, Notify packet + Bit5 => valid bit + Bit4 => 1 + Bit3 => 1 + Bit2 => Middle Button, 1 is pressed, 0 is not pressed. + Bit1 => Right Button, 1 is pressed, 0 is not pressed. + Bit0 => Left Button, 1 is pressed, 0 is not pressed. +Byte 2: X coordinate (xpos[9:2]) +Byte 3: Y coordinate (ypos[9:2]) +Byte 4: Bit1~Bit0 => Y coordinate (xpos[1:0]) + Bit3~Bit2 => X coordinate (ypos[1:0]) + Bit4 => scroll up + Bit5 => scroll down + Bit6 => scroll left + Bit7 => scroll right + +Notify Packet for G0 + Bit 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 +BYTE |---------------|BYTE |---------------|BYTE|---------------|BYTE|---------------| + 1 |1|0|0|1|1|M|R|L| 2 |C|C|C|C|C|C|C|C| 3 |M|M|M|M|M|M|M|M| 4 |0|0|0|0|0|0|0|0| + |---------------| |---------------| |---------------| |---------------| + +Byte 1: Bit7~Bit6 => 00, Normal data packet + => 01, Absolute coordination packet + => 10, Notify packet + Bit5 => 0 + Bit4 => 1 + Bit3 => 1 + Bit2 => Middle Button, 1 is pressed, 0 is not pressed. + Bit1 => Right Button, 1 is pressed, 0 is not pressed. + Bit0 => Left Button, 1 is pressed, 0 is not pressed. +Byte 2: Message Type => 0x5A (Enable/Disable status packet) + Mode Type => 0xA5 (Normal/Icon mode status) +Byte 3: Message Type => 0x00 (Disabled) + => 0x01 (Enabled) + Mode Type => 0x00 (Normal) + => 0x01 (Icon) +Byte 4: Bit7~Bit0 => Don't Care + +============================================================================== +* Absolute position for STL3888-A0. +============================================================================== +Packet 1 (ABSOLUTE POSITION) + Bit 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 +BYTE |---------------|BYTE |---------------|BYTE|---------------|BYTE|---------------| + 1 |0|1|V|A|1|L|0|1| 2 |X|X|X|X|X|X|X|X| 3 |Y|Y|Y|Y|Y|Y|Y|Y| 4 |x|x|y|y|X|X|Y|Y| + |---------------| |---------------| |---------------| |---------------| + +Byte 1: Bit7~Bit6 => 00, Normal data packet + => 01, Absolute coordination packet + => 10, Notify packet + Bit5 => Valid bit, 0 means that the coordinate is invalid or finger up. + When both fingers are up, the last two reports have zero valid + bit. + Bit4 => arc + Bit3 => 1 + Bit2 => Left Button, 1 is pressed, 0 is released. + Bit1 => 0 + Bit0 => 1 +Byte 2: X coordinate (xpos[9:2]) +Byte 3: Y coordinate (ypos[9:2]) +Byte 4: Bit1~Bit0 => Y coordinate (xpos[1:0]) + Bit3~Bit2 => X coordinate (ypos[1:0]) + Bit5~Bit4 => y1_g + Bit7~Bit6 => x1_g + +Packet 2 (ABSOLUTE POSITION) + Bit 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 +BYTE |---------------|BYTE |---------------|BYTE|---------------|BYTE|---------------| + 1 |0|1|V|A|1|R|1|0| 2 |X|X|X|X|X|X|X|X| 3 |Y|Y|Y|Y|Y|Y|Y|Y| 4 |x|x|y|y|X|X|Y|Y| + |---------------| |---------------| |---------------| |---------------| + +Byte 1: Bit7~Bit6 => 00, Normal data packet + => 01, Absolute coordinates packet + => 10, Notify packet + Bit5 => Valid bit, 0 means that the coordinate is invalid or finger up. + When both fingers are up, the last two reports have zero valid + bit. + Bit4 => arc + Bit3 => 1 + Bit2 => Right Button, 1 is pressed, 0 is released. + Bit1 => 1 + Bit0 => 0 +Byte 2: X coordinate (xpos[9:2]) +Byte 3: Y coordinate (ypos[9:2]) +Byte 4: Bit1~Bit0 => Y coordinate (xpos[1:0]) + Bit3~Bit2 => X coordinate (ypos[1:0]) + Bit5~Bit4 => y2_g + Bit7~Bit6 => x2_g + +Notify Packet for STL3888-A0 + Bit 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 +BYTE |---------------|BYTE |---------------|BYTE|---------------|BYTE|---------------| + 1 |1|0|1|P|1|M|R|L| 2 |C|C|C|C|C|C|C|C| 3 |0|0|F|F|0|0|0|i| 4 |r|l|d|u|0|0|0|0| + |---------------| |---------------| |---------------| |---------------| + +Byte 1: Bit7~Bit6 => 00, Normal data packet + => 01, Absolute coordination packet + => 10, Notify packet + Bit5 => 1 + Bit4 => when in absolute coordinates mode (valid when EN_PKT_GO is 1): + 0: left button is generated by the on-pad command + 1: left button is generated by the external button + Bit3 => 1 + Bit2 => Middle Button, 1 is pressed, 0 is not pressed. + Bit1 => Right Button, 1 is pressed, 0 is not pressed. + Bit0 => Left Button, 1 is pressed, 0 is not pressed. +Byte 2: Message Type => 0xB7 (Multi Finger, Multi Coordinate mode) +Byte 3: Bit7~Bit6 => Don't care + Bit5~Bit4 => Number of fingers + Bit3~Bit1 => Reserved + Bit0 => 1: enter gesture mode; 0: leaving gesture mode +Byte 4: Bit7 => scroll right button + Bit6 => scroll left button + Bit5 => scroll down button + Bit4 => scroll up button + * Note that if gesture and additional button (Bit4~Bit7) + happen at the same time, the button information will not + be sent. + Bit3~Bit0 => Reserved + +Sample sequence of Multi-finger, Multi-coordinate mode: + + notify packet (valid bit == 1), abs pkt 1, abs pkt 2, abs pkt 1, + abs pkt 2, ..., notify packet(valid bit == 0) + +============================================================================== +* FSP Enable/Disable packet +============================================================================== + Bit 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 +BYTE |---------------|BYTE |---------------|BYTE|---------------|BYTE|---------------| + 1 |Y|X|0|0|1|M|R|L| 2 |0|1|0|1|1|0|1|E| 3 | | | | | | | | | 4 | | | | | | | | | + |---------------| |---------------| |---------------| |---------------| + +FSP will send out enable/disable packet when FSP receive PS/2 enable/disable +command. Host will receive the packet which Middle, Right, Left button will +be set. The packet only use byte 0 and byte 1 as a pattern of original packet. +Ignore the other bytes of the packet. + +Byte 1: Bit7 => 0, Y overflow + Bit6 => 0, X overflow + Bit5 => 0, Y sign bit + Bit4 => 0, X sign bit + Bit3 => 1 + Bit2 => 1, Middle Button + Bit1 => 1, Right Button + Bit0 => 1, Left Button +Byte 2: Bit7~1 => (0101101b) + Bit0 => 1 = Enable + 0 = Disable +Byte 3: Don't care +Byte 4: Don't care (MOUSE ID 3, 4) +Byte 5~8: Don't care (Absolute packet) + +============================================================================== +* PS/2 Command Set +============================================================================== + +FSP supports basic PS/2 commanding set and modes, refer to following URL for +details about PS/2 commands: + +http://www.computer-engineering.org/index.php?title=PS/2_Mouse_Interface + +============================================================================== +* Programming Sequence for Determining Packet Parsing Flow +============================================================================== +1. Identify FSP by reading device ID(0x00) and version(0x01) register + +2. Determine number of buttons by reading status2 (0x0b) register + + buttons = reg[0x0b] & 0x30 + + if buttons == 0x30 or buttons == 0x20: + # two/four buttons + Refer to 'Finger Sensing Pad PS/2 Mouse Intellimouse' + section A for packet parsing detail(ignore byte 4, bit ~ 7) + elif buttons == 0x10: + # 6 buttons + Refer to 'Finger Sensing Pad PS/2 Mouse Intellimouse' + section B for packet parsing detail + elif buttons == 0x00: + # 6 buttons + Refer to 'Finger Sensing Pad PS/2 Mouse Intellimouse' + section A for packet parsing detail + +============================================================================== +* Programming Sequence for Register Reading/Writing +============================================================================== + +Register inversion requirement: + + Following values needed to be inverted(the '~' operator in C) before being +sent to FSP: + + 0xe9, 0xee, 0xf2 and 0xff. + +Register swapping requirement: + + Following values needed to have their higher 4 bits and lower 4 bits being +swapped before being sent to FSP: + + 10, 20, 40, 60, 80, 100 and 200. + +Register reading sequence: + + 1. send 0xf3 PS/2 command to FSP; + + 2. send 0x66 PS/2 command to FSP; + + 3. send 0x88 PS/2 command to FSP; + + 4. send 0xf3 PS/2 command to FSP; + + 5. if the register address being to read is not required to be + inverted(refer to the 'Register inversion requirement' section), + goto step 6 + + 5a. send 0x68 PS/2 command to FSP; + + 5b. send the inverted register address to FSP and goto step 8; + + 6. if the register address being to read is not required to be + swapped(refer to the 'Register swapping requirement' section), + goto step 7 + + 6a. send 0xcc PS/2 command to FSP; + + 6b. send the swapped register address to FSP and goto step 8; + + 7. send 0x66 PS/2 command to FSP; + + 7a. send the original register address to FSP and goto step 8; + + 8. send 0xe9(status request) PS/2 command to FSP; + + 9. the response read from FSP should be the requested register value. + +Register writing sequence: + + 1. send 0xf3 PS/2 command to FSP; + + 2. if the register address being to write is not required to be + inverted(refer to the 'Register inversion requirement' section), + goto step 3 + + 2a. send 0x74 PS/2 command to FSP; + + 2b. send the inverted register address to FSP and goto step 5; + + 3. if the register address being to write is not required to be + swapped(refer to the 'Register swapping requirement' section), + goto step 4 + + 3a. send 0x77 PS/2 command to FSP; + + 3b. send the swapped register address to FSP and goto step 5; + + 4. send 0x55 PS/2 command to FSP; + + 4a. send the register address to FSP and goto step 5; + + 5. send 0xf3 PS/2 command to FSP; + + 6. if the register value being to write is not required to be + inverted(refer to the 'Register inversion requirement' section), + goto step 7 + + 6a. send 0x47 PS/2 command to FSP; + + 6b. send the inverted register value to FSP and goto step 9; + + 7. if the register value being to write is not required to be + swapped(refer to the 'Register swapping requirement' section), + goto step 8 + + 7a. send 0x44 PS/2 command to FSP; + + 7b. send the swapped register value to FSP and goto step 9; + + 8. send 0x33 PS/2 command to FSP; + + 8a. send the register value to FSP; + + 9. the register writing sequence is completed. + +============================================================================== +* Register Listing +============================================================================== + +offset width default r/w name +0x00 bit7~bit0 0x01 RO device ID + +0x01 bit7~bit0 0xc0 RW version ID + +0x02 bit7~bit0 0x01 RO vendor ID + +0x03 bit7~bit0 0x01 RO product ID + +0x04 bit3~bit0 0x01 RW revision ID + +0x0b RO test mode status 1 + bit3 1 RO 0: rotate 180 degree, 1: no rotation + + bit5~bit4 RO number of buttons + 11 => 2, lbtn/rbtn + 10 => 4, lbtn/rbtn/scru/scrd + 01 => 6, lbtn/rbtn/scru/scrd/scrl/scrr + 00 => 6, lbtn/rbtn/scru/scrd/fbtn/bbtn + +0x0f RW register file page control + bit0 0 RW 1 to enable page 1 register files + +0x10 RW system control 1 + bit0 1 RW Reserved, must be 1 + bit1 0 RW Reserved, must be 0 + bit4 1 RW Reserved, must be 0 + bit5 0 RW register clock gating enable + 0: read only, 1: read/write enable + (Note that following registers does not require clock gating being + enabled prior to write: 05 06 07 08 09 0c 0f 10 11 12 16 17 18 23 2e + 40 41 42 43.) + +0x31 RW on-pad command detection + bit7 0 RW on-pad command left button down tag + enable + 0: disable, 1: enable + +0x34 RW on-pad command control 5 + bit4~bit0 0x05 RW XLO in 0s/4/1, so 03h = 0010.1b = 2.5 + (Note that position unit is in 0.5 scanline) + + bit7 0 RW on-pad tap zone enable + 0: disable, 1: enable + +0x35 RW on-pad command control 6 + bit4~bit0 0x1d RW XHI in 0s/4/1, so 19h = 1100.1b = 12.5 + (Note that position unit is in 0.5 scanline) + +0x36 RW on-pad command control 7 + bit4~bit0 0x04 RW YLO in 0s/4/1, so 03h = 0010.1b = 2.5 + (Note that position unit is in 0.5 scanline) + +0x37 RW on-pad command control 8 + bit4~bit0 0x13 RW YHI in 0s/4/1, so 11h = 1000.1b = 8.5 + (Note that position unit is in 0.5 scanline) + +0x40 RW system control 5 + bit1 0 RW FSP Intellimouse mode enable + 0: disable, 1: enable + + bit2 0 RW movement + abs. coordinate mode enable + 0: disable, 1: enable + (Note that this function has the functionality of bit 1 even when + bit 1 is not set. However, the format is different from that of bit 1. + In addition, when bit 1 and bit 2 are set at the same time, bit 2 will + override bit 1.) + + bit3 0 RW abs. coordinate only mode enable + 0: disable, 1: enable + (Note that this function has the functionality of bit 1 even when + bit 1 is not set. However, the format is different from that of bit 1. + In addition, when bit 1, bit 2 and bit 3 are set at the same time, + bit 3 will override bit 1 and 2.) + + bit5 0 RW auto switch enable + 0: disable, 1: enable + + bit6 0 RW G0 abs. + notify packet format enable + 0: disable, 1: enable + (Note that the absolute/relative coordinate output still depends on + bit 2 and 3. That is, if any of those bit is 1, host will receive + absolute coordinates; otherwise, host only receives packets with + relative coordinate.) + +0x43 RW on-pad control + bit0 0 RW on-pad control enable + 0: disable, 1: enable + (Note that if this bit is cleared, bit 3/5 will be ineffective) + + bit3 0 RW on-pad fix vertical scrolling enable + 0: disable, 1: enable + + bit5 0 RW on-pad fix horizontal scrolling enable + 0: disable, 1: enable diff --git a/Documentation/intel_txt.txt b/Documentation/intel_txt.txt new file mode 100644 index 00000000000..f40a1f03001 --- /dev/null +++ b/Documentation/intel_txt.txt @@ -0,0 +1,210 @@ +Intel(R) TXT Overview: +===================== + +Intel's technology for safer computing, Intel(R) Trusted Execution +Technology (Intel(R) TXT), defines platform-level enhancements that +provide the building blocks for creating trusted platforms. + +Intel TXT was formerly known by the code name LaGrande Technology (LT). + +Intel TXT in Brief: +o Provides dynamic root of trust for measurement (DRTM) +o Data protection in case of improper shutdown +o Measurement and verification of launched environment + +Intel TXT is part of the vPro(TM) brand and is also available some +non-vPro systems. It is currently available on desktop systems +based on the Q35, X38, Q45, and Q43 Express chipsets (e.g. Dell +Optiplex 755, HP dc7800, etc.) and mobile systems based on the GM45, +PM45, and GS45 Express chipsets. + +For more information, see http://www.intel.com/technology/security/. +This site also has a link to the Intel TXT MLE Developers Manual, +which has been updated for the new released platforms. + +Intel TXT has been presented at various events over the past few +years, some of which are: + LinuxTAG 2008: + http://www.linuxtag.org/2008/en/conf/events/vp-donnerstag/ + details.html?talkid=110 + TRUST2008: + http://www.trust2008.eu/downloads/Keynote-Speakers/ + 3_David-Grawrock_The-Front-Door-of-Trusted-Computing.pdf + IDF 2008, Shanghai: + http://inteldeveloperforum.com.edgesuite.net/shanghai_2008/ + aep/PROS003/index.html + IDFs 2006, 2007 (I'm not sure if/where they are online) + +Trusted Boot Project Overview: +============================= + +Trusted Boot (tboot) is an open source, pre- kernel/VMM module that +uses Intel TXT to perform a measured and verified launch of an OS +kernel/VMM. + +It is hosted on SourceForge at http://sourceforge.net/projects/tboot. +The mercurial source repo is available at http://www.bughost.org/ +repos.hg/tboot.hg. + +Tboot currently supports launching Xen (open source VMM/hypervisor +w/ TXT support since v3.2), and now Linux kernels. + + +Value Proposition for Linux or "Why should you care?" +===================================================== + +While there are many products and technologies that attempt to +measure or protect the integrity of a running kernel, they all +assume the kernel is "good" to begin with. The Integrity +Measurement Architecture (IMA) and Linux Integrity Module interface +are examples of such solutions. + +To get trust in the initial kernel without using Intel TXT, a +static root of trust must be used. This bases trust in BIOS +starting at system reset and requires measurement of all code +executed between system reset through the completion of the kernel +boot as well as data objects used by that code. In the case of a +Linux kernel, this means all of BIOS, any option ROMs, the +bootloader and the boot config. In practice, this is a lot of +code/data, much of which is subject to change from boot to boot +(e.g. changing NICs may change option ROMs). Without reference +hashes, these measurement changes are difficult to assess or +confirm as benign. This process also does not provide DMA +protection, memory configuration/alias checks and locks, crash +protection, or policy support. + +By using the hardware-based root of trust that Intel TXT provides, +many of these issues can be mitigated. Specifically: many +pre-launch components can be removed from the trust chain, DMA +protection is provided to all launched components, a large number +of platform configuration checks are performed and values locked, +protection is provided for any data in the event of an improper +shutdown, and there is support for policy-based execution/verification. +This provides a more stable measurement and a higher assurance of +system configuration and initial state than would be otherwise +possible. Since the tboot project is open source, source code for +almost all parts of the trust chain is available (excepting SMM and +Intel-provided firmware). + +How Does it Work? +================= + +o Tboot is an executable that is launched by the bootloader as + the "kernel" (the binary the bootloader executes). +o It performs all of the work necessary to determine if the + platform supports Intel TXT and, if so, executes the GETSEC[SENTER] + processor instruction that initiates the dynamic root of trust. + - If tboot determines that the system does not support Intel TXT + or is not configured correctly (e.g. the SINIT AC Module was + incorrect), it will directly launch the kernel with no changes + to any state. + - Tboot will output various information about its progress to the + terminal, serial port, and/or an in-memory log; the output + locations can be configured with a command line switch. +o The GETSEC[SENTER] instruction will return control to tboot and + tboot then verifies certain aspects of the environment (e.g. TPM NV + lock, e820 table does not have invalid entries, etc.). +o It will wake the APs from the special sleep state the GETSEC[SENTER] + instruction had put them in and place them into a wait-for-SIPI + state. + - Because the processors will not respond to an INIT or SIPI when + in the TXT environment, it is necessary to create a small VT-x + guest for the APs. When they run in this guest, they will + simply wait for the INIT-SIPI-SIPI sequence, which will cause + VMEXITs, and then disable VT and jump to the SIPI vector. This + approach seemed like a better choice than having to insert + special code into the kernel's MP wakeup sequence. +o Tboot then applies an (optional) user-defined launch policy to + verify the kernel and initrd. + - This policy is rooted in TPM NV and is described in the tboot + project. The tboot project also contains code for tools to + create and provision the policy. + - Policies are completely under user control and if not present + then any kernel will be launched. + - Policy action is flexible and can include halting on failures + or simply logging them and continuing. +o Tboot adjusts the e820 table provided by the bootloader to reserve + its own location in memory as well as to reserve certain other + TXT-related regions. +o As part of it's launch, tboot DMA protects all of RAM (using the + VT-d PMRs). Thus, the kernel must be booted with 'intel_iommu=on' + in order to remove this blanket protection and use VT-d's + page-level protection. +o Tboot will populate a shared page with some data about itself and + pass this to the Linux kernel as it transfers control. + - The location of the shared page is passed via the boot_params + struct as a physical address. +o The kernel will look for the tboot shared page address and, if it + exists, map it. +o As one of the checks/protections provided by TXT, it makes a copy + of the VT-d DMARs in a DMA-protected region of memory and verifies + them for correctness. The VT-d code will detect if the kernel was + launched with tboot and use this copy instead of the one in the + ACPI table. +o At this point, tboot and TXT are out of the picture until a + shutdown (S<n>) +o In order to put a system into any of the sleep states after a TXT + launch, TXT must first be exited. This is to prevent attacks that + attempt to crash the system to gain control on reboot and steal + data left in memory. + - The kernel will perform all of its sleep preparation and + populate the shared page with the ACPI data needed to put the + platform in the desired sleep state. + - Then the kernel jumps into tboot via the vector specified in the + shared page. + - Tboot will clean up the environment and disable TXT, then use the + kernel-provided ACPI information to actually place the platform + into the desired sleep state. + - In the case of S3, tboot will also register itself as the resume + vector. This is necessary because it must re-establish the + measured environment upon resume. Once the TXT environment + has been restored, it will restore the TPM PCRs and then + transfer control back to the kernel's S3 resume vector. + In order to preserve system integrity across S3, the kernel + provides tboot with a set of memory ranges (kernel + code/data/bss, S3 resume code, and AP trampoline) that tboot + will calculate a MAC (message authentication code) over and then + seal with the TPM. On resume and once the measured environment + has been re-established, tboot will re-calculate the MAC and + verify it against the sealed value. Tboot's policy determines + what happens if the verification fails. + +That's pretty much it for TXT support. + + +Configuring the System: +====================== + +This code works with 32bit, 32bit PAE, and 64bit (x86_64) kernels. + +In BIOS, the user must enable: TPM, TXT, VT-x, VT-d. Not all BIOSes +allow these to be individually enabled/disabled and the screens in +which to find them are BIOS-specific. + +grub.conf needs to be modified as follows: + title Linux 2.6.29-tip w/ tboot + root (hd0,0) + kernel /tboot.gz logging=serial,vga,memory + module /vmlinuz-2.6.29-tip intel_iommu=on ro + root=LABEL=/ rhgb console=ttyS0,115200 3 + module /initrd-2.6.29-tip.img + module /Q35_SINIT_17.BIN + +The kernel option for enabling Intel TXT support is found under the +Security top-level menu and is called "Enable Intel(R) Trusted +Execution Technology (TXT)". It is marked as EXPERIMENTAL and +depends on the generic x86 support (to allow maximum flexibility in +kernel build options), since the tboot code will detect whether the +platform actually supports Intel TXT and thus whether any of the +kernel code is executed. + +The Q35_SINIT_17.BIN file is what Intel TXT refers to as an +Authenticated Code Module. It is specific to the chipset in the +system and can also be found on the Trusted Boot site. It is an +(unencrypted) module signed by Intel that is used as part of the +DRTM process to verify and configure the system. It is signed +because it operates at a higher privilege level in the system than +any other macrocode and its correct operation is critical to the +establishment of the DRTM. The process for determining the correct +SINIT ACM for a system is documented in the SINIT-guide.txt file +that is on the tboot SourceForge site under the SINIT ACM downloads. diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 1f779a25c70..aafca0a8f66 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -121,6 +121,7 @@ Code Seq# Include File Comments 'c' 00-7F linux/comstats.h conflict! 'c' 00-7F linux/coda.h conflict! 'c' 80-9F arch/s390/include/asm/chsc.h +'c' A0-AF arch/x86/include/asm/msr.h 'd' 00-FF linux/char/drm/drm/h conflict! 'd' F0-FF linux/digi1.h 'e' all linux/digi1.h conflict! @@ -139,6 +140,7 @@ Code Seq# Include File Comments 'm' all linux/synclink.h conflict! 'm' 00-1F net/irda/irmod.h conflict! 'n' 00-7F linux/ncp_fs.h +'n' 80-8F linux/nilfs2_fs.h NILFS2 'n' E0-FF video/matrox.h matroxfb 'o' 00-1F fs/ocfs2/ocfs2_fs.h OCFS2 'o' 00-03 include/mtd/ubi-user.h conflict! (OCFS2 and UBI overlaps) @@ -149,6 +151,8 @@ Code Seq# Include File Comments 'p' 40-7F linux/nvram.h 'p' 80-9F user-space parport <mailto:tim@cyberelk.net> +'p' a1-a4 linux/pps.h LinuxPPS + <mailto:giometti@linux.it> 'q' 00-1F linux/serio.h 'q' 80-FF Internet PhoneJACK, Internet LineJACK <http://www.quicknet.net> @@ -189,7 +193,7 @@ Code Seq# Include File Comments 0xAD 00 Netfilter device in development: <mailto:rusty@rustcorp.com.au> 0xAE all linux/kvm.h Kernel-based Virtual Machine - <mailto:kvm-devel@lists.sourceforge.net> + <mailto:kvm@vger.kernel.org> 0xB0 all RATIO devices in development: <mailto:vgo@ratio.de> 0xB1 00-1F PPPoX <mailto:mostrows@styx.uwaterloo.ca> diff --git a/Documentation/isdn/00-INDEX b/Documentation/isdn/00-INDEX index f6010a53659..e87e336f590 100644 --- a/Documentation/isdn/00-INDEX +++ b/Documentation/isdn/00-INDEX @@ -14,25 +14,14 @@ README - general info on what you need and what to do for Linux ISDN. README.FAQ - general info for FAQ. -README.audio - - info for running audio over ISDN. -README.fax - - info for using Fax over ISDN. -README.gigaset - - info on the drivers for Siemens Gigaset ISDN adapters. -README.icn - - info on the ICN-ISDN-card and its driver. ->>>>>>> 93af7aca44f0e82e67bda10a0fb73d383edcc8bd:Documentation/isdn/00-INDEX README.HiSax - info on the HiSax driver which replaces the old teles. +README.act2000 + - info on driver for IBM ACT-2000 card. README.audio - info for running audio over ISDN. README.avmb1 - info on driver for AVM-B1 ISDN card. -README.act2000 - - info on driver for IBM ACT-2000 card. -README.eicon - - info on driver for Eicon active cards. README.concap - info on "CONCAP" encapsulation protocol interface used for X.25. README.diversion @@ -59,7 +48,3 @@ README.x25 - info for running X.25 over ISDN. syncPPP.FAQ - frequently asked questions about running PPP over ISDN. -README.hysdn - - info on driver for Hypercope active HYSDN cards -README.mISDN - - info on the Modular ISDN subsystem (mISDN). diff --git a/Documentation/ja_JP/SubmitChecklist b/Documentation/ja_JP/SubmitChecklist index 6c42e071d72..2df4576f117 100644 --- a/Documentation/ja_JP/SubmitChecklist +++ b/Documentation/ja_JP/SubmitChecklist @@ -75,7 +75,7 @@ Linux カーãƒãƒ«ãƒ‘ッãƒæŠ•ç¨¿è€…å‘ã‘ãƒã‚§ãƒƒã‚¯ãƒªã‚¹ãƒˆ ビルドã—ãŸä¸Šã€å‹•ä½œç¢ºèªã‚’è¡Œã£ã¦ãã ã•ã„。 14: ã‚‚ã—パッãƒãŒãƒ‡ã‚£ã‚¹ã‚¯ã®I/O性能ãªã©ã«å½±éŸ¿ã‚’与ãˆã‚‹ã‚ˆã†ã§ã‚ã‚Œã°ã€ - 'CONFIG_LBD'オプションを有効ã«ã—ãŸå ´åˆã¨ç„¡åŠ¹ã«ã—ãŸå ´åˆã®ä¸¡æ–¹ã§ + 'CONFIG_LBDAF'オプションを有効ã«ã—ãŸå ´åˆã¨ç„¡åŠ¹ã«ã—ãŸå ´åˆã®ä¸¡æ–¹ã§ テストを実施ã—ã¦ã¿ã¦ãã ã•ã„。 15: lockdepã®æ©Ÿèƒ½ã‚’å…¨ã¦æœ‰åŠ¹ã«ã—ãŸä¸Šã§ã€å…¨ã¦ã®ã‚³ãƒ¼ãƒ‰ãƒ‘スを評価ã—ã¦ãã ã•ã„。 diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index ad380063077..f45d0d8e71d 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -48,6 +48,7 @@ parameter is applicable: EFI EFI Partitioning (GPT) is enabled EIDE EIDE/ATAPI support is enabled. FB The frame buffer device is enabled. + GCOV GCOV profiling is enabled. HW Appropriate hardware is enabled. IA-64 IA-64 architecture is enabled. IMA Integrity measurement architecture is enabled. @@ -56,6 +57,7 @@ parameter is applicable: ISAPNP ISA PnP code is enabled. ISDN Appropriate ISDN support is enabled. JOY Appropriate joystick support is enabled. + KVM Kernel Virtual Machine support is enabled. LIBATA Libata driver is enabled LP Printer support is enabled. LOOP Loopback device support is enabled. @@ -228,14 +230,6 @@ and is between 256 and 4096 characters. It is defined in the file to assume that this machine's pmtimer latches its value and always returns good values. - acpi.power_nocheck= [HW,ACPI] - Format: 1/0 enable/disable the check of power state. - On some bogus BIOS the _PSC object/_STA object of - power resource can't return the correct device power - state. In such case it is unneccessary to check its - power state again in power transition. - 1 : disable the power state check - acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode Format: { level | edge | high | low } @@ -546,6 +540,10 @@ and is between 256 and 4096 characters. It is defined in the file console=brl,ttyS0 For now, only VisioBraille is supported. + consoleblank= [KNL] The console blank (screen saver) timeout in + seconds. Defaults to 10*60 = 10mins. A value of 0 + disables the blank timer. + coredump_filter= [KNL] Change the default value for /proc/<pid>/coredump_filter. @@ -792,6 +790,12 @@ and is between 256 and 4096 characters. It is defined in the file Format: off | on default: on + gcov_persist= [GCOV] When non-zero (default), profiling data for + kernel modules is saved and remains accessible via + debugfs, even when the module is unloaded/reloaded. + When zero, profiling data is discarded and associated + debugfs files are removed at module unload time. + gdth= [HW,SCSI] See header of drivers/scsi/gdth.c. @@ -995,6 +999,7 @@ and is between 256 and 4096 characters. It is defined in the file nomerge forcesac soft + pt [x86, IA64] io7= [HW] IO7 for Marvel based alpha systems See comment before marvel_specify_io7 in @@ -1094,6 +1099,44 @@ and is between 256 and 4096 characters. It is defined in the file kstack=N [X86] Print N words from the kernel stack in oops dumps. + kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs. + Default is 0 (don't ignore, but inject #GP) + + kvm.oos_shadow= [KVM] Disable out-of-sync shadow paging. + Default is 1 (enabled) + + kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM. + Default is 0 (off) + + kvm-amd.npt= [KVM,AMD] Disable nested paging (virtualized MMU) + for all guests. + Default is 1 (enabled) if in 64bit or 32bit-PAE mode + + kvm-intel.bypass_guest_pf= + [KVM,Intel] Disables bypassing of guest page faults + on Intel chips. Default is 1 (enabled) + + kvm-intel.ept= [KVM,Intel] Disable extended page tables + (virtualized MMU) support on capable Intel chips. + Default is 1 (enabled) + + kvm-intel.emulate_invalid_guest_state= + [KVM,Intel] Enable emulation of invalid guest states + Default is 0 (disabled) + + kvm-intel.flexpriority= + [KVM,Intel] Disable FlexPriority feature (TPR shadow). + Default is 1 (enabled) + + kvm-intel.unrestricted_guest= + [KVM,Intel] Disable unrestricted guest feature + (virtualized real and unpaged mode) on capable + Intel chips. Default is 1 (enabled) + + kvm-intel.vpid= [KVM,Intel] Disable Virtual Processor Identification + feature (tagged TLBs) on capable Intel chips. + Default is 1 (enabled) + l2cr= [PPC] l3cr= [PPC] @@ -1111,6 +1154,10 @@ and is between 256 and 4096 characters. It is defined in the file libata.dma=4 Compact Flash DMA only Combinations also work, so libata.dma=3 enables DMA for disks and CDROMs, but not CFs. + + libata.ignore_hpa= [LIBATA] Ignore HPA limit + libata.ignore_hpa=0 keep BIOS limits (default) + libata.ignore_hpa=1 ignore limits, using full disk libata.noacpi [LIBATA] Disables use of ACPI in libata suspend/resume when set. @@ -1239,6 +1286,10 @@ and is between 256 and 4096 characters. It is defined in the file (machvec) in a generic kernel. Example: machvec=hpzx1_swiotlb + machtype= [Loongson] Share the same kernel image file between different + yeeloong laptop. + Example: machtype=lemote-yeeloong-2f-7inch + max_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory greater than or equal to this physical address is ignored. @@ -1358,6 +1409,27 @@ and is between 256 and 4096 characters. It is defined in the file min_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory below this physical address is ignored. + mini2440= [ARM,HW,KNL] + Format:[0..2][b][c][t] + Default: "0tb" + MINI2440 configuration specification: + 0 - The attached screen is the 3.5" TFT + 1 - The attached screen is the 7" TFT + 2 - The VGA Shield is attached (1024x768) + Leaving out the screen size parameter will not load + the TFT driver, and the framebuffer will be left + unconfigured. + b - Enable backlight. The TFT backlight pin will be + linked to the kernel VESA blanking code and a GPIO + LED. This parameter is not necessary when using the + VGA shield. + c - Enable the s3c camera interface. + t - Reserved for enabling touchscreen support. The + touchscreen support is not enabled in the mainstream + kernel as of 2.6.30, a preliminary port can be found + in the "bleeding edge" mini2440 support kernel at + http://repo.or.cz/w/linux-2.6/mini2440.git + mminit_loglevel= [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this parameter allows control of the logging verbosity for @@ -1399,6 +1471,16 @@ and is between 256 and 4096 characters. It is defined in the file mtdparts= [MTD] See drivers/mtd/cmdlinepart.c. + onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration + + Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock] + + boundary - index of last SLC block on Flex-OneNAND. + The remaining blocks are configured as MLC blocks. + lock - Configure if Flex-OneNAND boundary should be locked. + Once locked, the boundary cannot be changed. + 1 indicates lock status, 0 indicates unlock status. + mtdset= [ARM] ARM/S3C2412 JIVE boot control @@ -1464,6 +1546,14 @@ and is between 256 and 4096 characters. It is defined in the file [NFS] set the TCP port on which the NFSv4 callback channel should listen. + nfs.cache_getent= + [NFS] sets the pathname to the program which is used + to update the NFS client cache entries. + + nfs.cache_getent_timeout= + [NFS] sets the timeout after which an attempt to + update a cache entry is deemed to have failed. + nfs.idmap_cache_timeout= [NFS] set the maximum lifetime for idmapper cache entries. @@ -1496,6 +1586,11 @@ and is between 256 and 4096 characters. It is defined in the file symbolic names: lapic and ioapic Example: nmi_watchdog=2 or nmi_watchdog=panic,lapic + netpoll.carrier_timeout= + [NET] Specifies amount of time (in seconds) that + netpoll should wait for a carrier. By default netpoll + waits 4 seconds. + no387 [BUGS=X86-32] Tells the kernel to use the 387 maths emulation library even if a 387 maths coprocessor is present. @@ -1685,8 +1780,8 @@ and is between 256 and 4096 characters. It is defined in the file oprofile.cpu_type= Force an oprofile cpu type This might be useful if you have an older oprofile userland or if you want common events. - Format: { archperfmon } - archperfmon: [X86] Force use of architectural + Format: { arch_perfmon } + arch_perfmon: [X86] Force use of architectural perfmon on Intel CPUs instead of the CPU specific event set. @@ -1765,6 +1860,9 @@ and is between 256 and 4096 characters. It is defined in the file root domains (aka PCI segments, in ACPI-speak). nommconf [X86] Disable use of MMCONFIG for PCI Configuration + check_enable_amd_mmconf [X86] check for and enable + properly configured MMIO access to PCI + config space on AMD family 10h CPU nomsi [MSI] If the PCI_MSI kernel config parameter is enabled, this kernel boot option can be used to disable the use of MSI interrupts system-wide. @@ -1854,6 +1952,12 @@ and is between 256 and 4096 characters. It is defined in the file PAGE_SIZE is used as alignment. PCI-PCI bridge can be specified, if resource windows need to be expanded. + ecrc= Enable/disable PCIe ECRC (transaction layer + end-to-end CRC checking). + bios: Use BIOS/firmware settings. This is the + the default. + off: Turn ECRC off + on: Turn ECRC on. pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power Management. @@ -1871,6 +1975,13 @@ and is between 256 and 4096 characters. It is defined in the file Format: { 0 | 1 } See arch/parisc/kernel/pdc_chassis.c + percpu_alloc= Select which percpu first chunk allocator to use. + Currently supported values are "embed" and "page". + Archs may support subset or none of the selections. + See comments in mm/percpu.c for details on each + allocator. This parameter is primarily for debugging + and performance comparison. + pf. [PARIDE] See Documentation/blockdev/paride.txt. @@ -2341,6 +2452,18 @@ and is between 256 and 4096 characters. It is defined in the file stifb= [HW] Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]] + sunrpc.min_resvport= + sunrpc.max_resvport= + [NFS,SUNRPC] + SunRPC servers often require that client requests + originate from a privileged port (i.e. a port in the + range 0 < portnr < 1024). + An administrator who wishes to reserve some of these + ports for other uses may adjust the range that the + kernel's sunrpc client considers to be privileged + using these two parameters to set the minimum and + maximum port values. + sunrpc.pool_mode= [NFS] Control how the NFS server code allocates CPUs to @@ -2357,6 +2480,15 @@ and is between 256 and 4096 characters. It is defined in the file pernode one pool for each NUMA node (equivalent to global on non-NUMA machines) + sunrpc.tcp_slot_table_entries= + sunrpc.udp_slot_table_entries= + [NFS,SUNRPC] + Sets the upper limit on the number of simultaneous + RPC calls that can be sent from the client to a + server. Increasing these values may allow you to + improve throughput, but will also increase the + amount of memory reserved for use by the client. + swiotlb= [IA-64] Number of I/O TLB slabs switches= [HW,M68k] @@ -2423,7 +2555,13 @@ and is between 256 and 4096 characters. It is defined in the file tp720= [HW,PS2] - trace_buf_size=nn[KMG] [ftrace] will set tracing buffer size. + trace_buf_size=nn[KMG] + [FTRACE] will set tracing buffer size. + + trace_event=[event-list] + [FTRACE] Set and start specified trace events in order + to facilitate early boot debugging. + See also Documentation/trace/events.txt trix= [HW,OSS] MediaTrix AudioTrix Pro Format: diff --git a/Documentation/keys.txt b/Documentation/keys.txt index b56aacc1fff..e4dbbdb1bd9 100644 --- a/Documentation/keys.txt +++ b/Documentation/keys.txt @@ -26,7 +26,7 @@ This document has the following sections: - Notes on accessing payload contents - Defining a key type - Request-key callback service - - Key access filesystem + - Garbage collection ============ @@ -113,6 +113,9 @@ Each key has a number of attributes: (*) Dead. The key's type was unregistered, and so the key is now useless. +Keys in the last three states are subject to garbage collection. See the +section on "Garbage collection". + ==================== KEY SERVICE OVERVIEW @@ -754,6 +757,26 @@ The keyctl syscall functions are: successful. + (*) Install the calling process's session keyring on its parent. + + long keyctl(KEYCTL_SESSION_TO_PARENT); + + This functions attempts to install the calling process's session keyring + on to the calling process's parent, replacing the parent's current session + keyring. + + The calling process must have the same ownership as its parent, the + keyring must have the same ownership as the calling process, the calling + process must have LINK permission on the keyring and the active LSM module + mustn't deny permission, otherwise error EPERM will be returned. + + Error ENOMEM will be returned if there was insufficient memory to complete + the operation, otherwise 0 will be returned to indicate success. + + The keyring will be replaced next time the parent process leaves the + kernel and resumes executing userspace. + + =============== KERNEL SERVICES =============== @@ -1231,3 +1254,17 @@ by executing: In this case, the program isn't required to actually attach the key to a ring; the rings are provided for reference. + + +================== +GARBAGE COLLECTION +================== + +Dead keys (for which the type has been removed) will be automatically unlinked +from those keyrings that point to them and deleted as soon as possible by a +background garbage collector. + +Similarly, revoked and expired keys will be garbage collected, but only after a +certain amount of time has passed. This time is set as a number of seconds in: + + /proc/sys/kernel/keys/gc_delay diff --git a/Documentation/kmemcheck.txt b/Documentation/kmemcheck.txt new file mode 100644 index 00000000000..363044609da --- /dev/null +++ b/Documentation/kmemcheck.txt @@ -0,0 +1,773 @@ +GETTING STARTED WITH KMEMCHECK +============================== + +Vegard Nossum <vegardno@ifi.uio.no> + + +Contents +======== +0. Introduction +1. Downloading +2. Configuring and compiling +3. How to use +3.1. Booting +3.2. Run-time enable/disable +3.3. Debugging +3.4. Annotating false positives +4. Reporting errors +5. Technical description + + +0. Introduction +=============== + +kmemcheck is a debugging feature for the Linux Kernel. More specifically, it +is a dynamic checker that detects and warns about some uses of uninitialized +memory. + +Userspace programmers might be familiar with Valgrind's memcheck. The main +difference between memcheck and kmemcheck is that memcheck works for userspace +programs only, and kmemcheck works for the kernel only. The implementations +are of course vastly different. Because of this, kmemcheck is not as accurate +as memcheck, but it turns out to be good enough in practice to discover real +programmer errors that the compiler is not able to find through static +analysis. + +Enabling kmemcheck on a kernel will probably slow it down to the extent that +the machine will not be usable for normal workloads such as e.g. an +interactive desktop. kmemcheck will also cause the kernel to use about twice +as much memory as normal. For this reason, kmemcheck is strictly a debugging +feature. + + +1. Downloading +============== + +kmemcheck can only be downloaded using git. If you want to write patches +against the current code, you should use the kmemcheck development branch of +the tip tree. It is also possible to use the linux-next tree, which also +includes the latest version of kmemcheck. + +Assuming that you've already cloned the linux-2.6.git repository, all you +have to do is add the -tip tree as a remote, like this: + + $ git remote add tip git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git + +To actually download the tree, fetch the remote: + + $ git fetch tip + +And to check out a new local branch with the kmemcheck code: + + $ git checkout -b kmemcheck tip/kmemcheck + +General instructions for the -tip tree can be found here: +http://people.redhat.com/mingo/tip.git/readme.txt + + +2. Configuring and compiling +============================ + +kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of +configuration variables must have specific settings in order for the kmemcheck +menu to even appear in "menuconfig". These are: + + o CONFIG_CC_OPTIMIZE_FOR_SIZE=n + + This option is located under "General setup" / "Optimize for size". + + Without this, gcc will use certain optimizations that usually lead to + false positive warnings from kmemcheck. An example of this is a 16-bit + field in a struct, where gcc may load 32 bits, then discard the upper + 16 bits. kmemcheck sees only the 32-bit load, and may trigger a + warning for the upper 16 bits (if they're uninitialized). + + o CONFIG_SLAB=y or CONFIG_SLUB=y + + This option is located under "General setup" / "Choose SLAB + allocator". + + o CONFIG_FUNCTION_TRACER=n + + This option is located under "Kernel hacking" / "Tracers" / "Kernel + Function Tracer" + + When function tracing is compiled in, gcc emits a call to another + function at the beginning of every function. This means that when the + page fault handler is called, the ftrace framework will be called + before kmemcheck has had a chance to handle the fault. If ftrace then + modifies memory that was tracked by kmemcheck, the result is an + endless recursive page fault. + + o CONFIG_DEBUG_PAGEALLOC=n + + This option is located under "Kernel hacking" / "Debug page memory + allocations". + +In addition, I highly recommend turning on CONFIG_DEBUG_INFO=y. This is also +located under "Kernel hacking". With this, you will be able to get line number +information from the kmemcheck warnings, which is extremely valuable in +debugging a problem. This option is not mandatory, however, because it slows +down the compilation process and produces a much bigger kernel image. + +Now the kmemcheck menu should be visible (under "Kernel hacking" / "kmemcheck: +trap use of uninitialized memory"). Here follows a description of the +kmemcheck configuration variables: + + o CONFIG_KMEMCHECK + + This must be enabled in order to use kmemcheck at all... + + o CONFIG_KMEMCHECK_[DISABLED | ENABLED | ONESHOT]_BY_DEFAULT + + This option controls the status of kmemcheck at boot-time. "Enabled" + will enable kmemcheck right from the start, "disabled" will boot the + kernel as normal (but with the kmemcheck code compiled in, so it can + be enabled at run-time after the kernel has booted), and "one-shot" is + a special mode which will turn kmemcheck off automatically after + detecting the first use of uninitialized memory. + + If you are using kmemcheck to actively debug a problem, then you + probably want to choose "enabled" here. + + The one-shot mode is mostly useful in automated test setups because it + can prevent floods of warnings and increase the chances of the machine + surviving in case something is really wrong. In other cases, the one- + shot mode could actually be counter-productive because it would turn + itself off at the very first error -- in the case of a false positive + too -- and this would come in the way of debugging the specific + problem you were interested in. + + If you would like to use your kernel as normal, but with a chance to + enable kmemcheck in case of some problem, it might be a good idea to + choose "disabled" here. When kmemcheck is disabled, most of the run- + time overhead is not incurred, and the kernel will be almost as fast + as normal. + + o CONFIG_KMEMCHECK_QUEUE_SIZE + + Select the maximum number of error reports to store in an internal + (fixed-size) buffer. Since errors can occur virtually anywhere and in + any context, we need a temporary storage area which is guaranteed not + to generate any other page faults when accessed. The queue will be + emptied as soon as a tasklet may be scheduled. If the queue is full, + new error reports will be lost. + + The default value of 64 is probably fine. If some code produces more + than 64 errors within an irqs-off section, then the code is likely to + produce many, many more, too, and these additional reports seldom give + any more information (the first report is usually the most valuable + anyway). + + This number might have to be adjusted if you are not using serial + console or similar to capture the kernel log. If you are using the + "dmesg" command to save the log, then getting a lot of kmemcheck + warnings might overflow the kernel log itself, and the earlier reports + will get lost in that way instead. Try setting this to 10 or so on + such a setup. + + o CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT + + Select the number of shadow bytes to save along with each entry of the + error-report queue. These bytes indicate what parts of an allocation + are initialized, uninitialized, etc. and will be displayed when an + error is detected to help the debugging of a particular problem. + + The number entered here is actually the logarithm of the number of + bytes that will be saved. So if you pick for example 5 here, kmemcheck + will save 2^5 = 32 bytes. + + The default value should be fine for debugging most problems. It also + fits nicely within 80 columns. + + o CONFIG_KMEMCHECK_PARTIAL_OK + + This option (when enabled) works around certain GCC optimizations that + produce 32-bit reads from 16-bit variables where the upper 16 bits are + thrown away afterwards. + + The default value (enabled) is recommended. This may of course hide + some real errors, but disabling it would probably produce a lot of + false positives. + + o CONFIG_KMEMCHECK_BITOPS_OK + + This option silences warnings that would be generated for bit-field + accesses where not all the bits are initialized at the same time. This + may also hide some real bugs. + + This option is probably obsolete, or it should be replaced with + the kmemcheck-/bitfield-annotations for the code in question. The + default value is therefore fine. + +Now compile the kernel as usual. + + +3. How to use +============= + +3.1. Booting +============ + +First some information about the command-line options. There is only one +option specific to kmemcheck, and this is called "kmemcheck". It can be used +to override the default mode as chosen by the CONFIG_KMEMCHECK_*_BY_DEFAULT +option. Its possible settings are: + + o kmemcheck=0 (disabled) + o kmemcheck=1 (enabled) + o kmemcheck=2 (one-shot mode) + +If SLUB debugging has been enabled in the kernel, it may take precedence over +kmemcheck in such a way that the slab caches which are under SLUB debugging +will not be tracked by kmemcheck. In order to ensure that this doesn't happen +(even though it shouldn't by default), use SLUB's boot option "slub_debug", +like this: slub_debug=- + +In fact, this option may also be used for fine-grained control over SLUB vs. +kmemcheck. For example, if the command line includes "kmemcheck=1 +slub_debug=,dentry", then SLUB debugging will be used only for the "dentry" +slab cache, and with kmemcheck tracking all the other caches. This is advanced +usage, however, and is not generally recommended. + + +3.2. Run-time enable/disable +============================ + +When the kernel has booted, it is possible to enable or disable kmemcheck at +run-time. WARNING: This feature is still experimental and may cause false +positive warnings to appear. Therefore, try not to use this. If you find that +it doesn't work properly (e.g. you see an unreasonable amount of warnings), I +will be happy to take bug reports. + +Use the file /proc/sys/kernel/kmemcheck for this purpose, e.g.: + + $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck + +The numbers are the same as for the kmemcheck= command-line option. + + +3.3. Debugging +============== + +A typical report will look something like this: + +WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) +80000000000000000000000000000000000000000088ffff0000000000000000 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u + ^ + +Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A +RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 +RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002 +RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 +RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84 +RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000 +R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e +R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8 +FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000 +CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 +CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0 +DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 +DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 + [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 + [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 + [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 + [<ffffffff8100c7b5>] int_signal+0x12/0x17 + [<ffffffffffffffff>] 0xffffffffffffffff + +The single most valuable information in this report is the RIP (or EIP on 32- +bit) value. This will help us pinpoint exactly which instruction that caused +the warning. + +If your kernel was compiled with CONFIG_DEBUG_INFO=y, then all we have to do +is give this address to the addr2line program, like this: + + $ addr2line -e vmlinux -i ffffffff8104ede8 + arch/x86/include/asm/string_64.h:12 + include/asm-generic/siginfo.h:287 + kernel/signal.c:380 + kernel/signal.c:410 + +The "-e vmlinux" tells addr2line which file to look in. IMPORTANT: This must +be the vmlinux of the kernel that produced the warning in the first place! If +not, the line number information will almost certainly be wrong. + +The "-i" tells addr2line to also print the line numbers of inlined functions. +In this case, the flag was very important, because otherwise, it would only +have printed the first line, which is just a call to memcpy(), which could be +called from a thousand places in the kernel, and is therefore not very useful. +These inlined functions would not show up in the stack trace above, simply +because the kernel doesn't load the extra debugging information. This +technique can of course be used with ordinary kernel oopses as well. + +In this case, it's the caller of memcpy() that is interesting, and it can be +found in include/asm-generic/siginfo.h, line 287: + +281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from) +282 { +283 if (from->si_code < 0) +284 memcpy(to, from, sizeof(*to)); +285 else +286 /* _sigchld is currently the largest know union member */ +287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld)); +288 } + +Since this was a read (kmemcheck usually warns about reads only, though it can +warn about writes to unallocated or freed memory as well), it was probably the +"from" argument which contained some uninitialized bytes. Following the chain +of calls, we move upwards to see where "from" was allocated or initialized, +kernel/signal.c, line 380: + +359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info) +360 { +... +367 list_for_each_entry(q, &list->list, list) { +368 if (q->info.si_signo == sig) { +369 if (first) +370 goto still_pending; +371 first = q; +... +377 if (first) { +378 still_pending: +379 list_del_init(&first->list); +380 copy_siginfo(info, &first->info); +381 __sigqueue_free(first); +... +392 } +393 } + +Here, it is &first->info that is being passed on to copy_siginfo(). The +variable "first" was found on a list -- passed in as the second argument to +collect_signal(). We continue our journey through the stack, to figure out +where the item on "list" was allocated or initialized. We move to line 410: + +395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask, +396 siginfo_t *info) +397 { +... +410 collect_signal(sig, pending, info); +... +414 } + +Now we need to follow the "pending" pointer, since that is being passed on to +collect_signal() as "list". At this point, we've run out of lines from the +"addr2line" output. Not to worry, we just paste the next addresses from the +kmemcheck stack dump, i.e.: + + [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 + [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 + [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 + [<ffffffff8100c7b5>] int_signal+0x12/0x17 + + $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \ + ffffffff8100b87d ffffffff8100c7b5 + kernel/signal.c:446 + kernel/signal.c:1806 + arch/x86/kernel/signal.c:805 + arch/x86/kernel/signal.c:871 + arch/x86/kernel/entry_64.S:694 + +Remember that since these addresses were found on the stack and not as the +RIP value, they actually point to the _next_ instruction (they are return +addresses). This becomes obvious when we look at the code for line 446: + +422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) +423 { +... +431 signr = __dequeue_signal(&tsk->signal->shared_pending, +432 mask, info); +433 /* +434 * itimer signal ? +435 * +436 * itimers are process shared and we restart periodic +437 * itimers in the signal delivery path to prevent DoS +438 * attacks in the high resolution timer case. This is +439 * compliant with the old way of self restarting +440 * itimers, as the SIGALRM is a legacy signal and only +441 * queued once. Changing the restart behaviour to +442 * restart the timer in the signal dequeue path is +443 * reducing the timer noise on heavy loaded !highres +444 * systems too. +445 */ +446 if (unlikely(signr == SIGALRM)) { +... +489 } + +So instead of looking at 446, we should be looking at 431, which is the line +that executes just before 446. Here we see that what we are looking for is +&tsk->signal->shared_pending. + +Our next task is now to figure out which function that puts items on this +"shared_pending" list. A crude, but efficient tool, is git grep: + + $ git grep -n 'shared_pending' kernel/ + ... + kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending; + kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending; + ... + +There were more results, but none of them were related to list operations, +and these were the only assignments. We inspect the line numbers more closely +and find that this is indeed where items are being added to the list: + +816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, +817 int group) +818 { +... +828 pending = group ? &t->signal->shared_pending : &t->pending; +... +851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && +852 (is_si_special(info) || +853 info->si_code >= 0))); +854 if (q) { +855 list_add_tail(&q->list, &pending->list); +... +890 } + +and: + +1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group) +1310 { +.... +1339 pending = group ? &t->signal->shared_pending : &t->pending; +1340 list_add_tail(&q->list, &pending->list); +.... +1347 } + +In the first case, the list element we are looking for, "q", is being returned +from the function __sigqueue_alloc(), which looks like an allocation function. +Let's take a look at it: + +187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, +188 int override_rlimit) +189 { +190 struct sigqueue *q = NULL; +191 struct user_struct *user; +192 +193 /* +194 * We won't get problems with the target's UID changing under us +195 * because changing it requires RCU be used, and if t != current, the +196 * caller must be holding the RCU readlock (by way of a spinlock) and +197 * we use RCU protection here +198 */ +199 user = get_uid(__task_cred(t)->user); +200 atomic_inc(&user->sigpending); +201 if (override_rlimit || +202 atomic_read(&user->sigpending) <= +203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur) +204 q = kmem_cache_alloc(sigqueue_cachep, flags); +205 if (unlikely(q == NULL)) { +206 atomic_dec(&user->sigpending); +207 free_uid(user); +208 } else { +209 INIT_LIST_HEAD(&q->list); +210 q->flags = 0; +211 q->user = user; +212 } +213 +214 return q; +215 } + +We see that this function initializes q->list, q->flags, and q->user. It seems +that now is the time to look at the definition of "struct sigqueue", e.g.: + +14 struct sigqueue { +15 struct list_head list; +16 int flags; +17 siginfo_t info; +18 struct user_struct *user; +19 }; + +And, you might remember, it was a memcpy() on &first->info that caused the +warning, so this makes perfect sense. It also seems reasonable to assume that +it is the caller of __sigqueue_alloc() that has the responsibility of filling +out (initializing) this member. + +But just which fields of the struct were uninitialized? Let's look at +kmemcheck's report again: + +WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) +80000000000000000000000000000000000000000088ffff0000000000000000 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u + ^ + +These first two lines are the memory dump of the memory object itself, and the +shadow bytemap, respectively. The memory object itself is in this case +&first->info. Just beware that the start of this dump is NOT the start of the +object itself! The position of the caret (^) corresponds with the address of +the read (ffff88003e4a2024). + +The shadow bytemap dump legend is as follows: + + i - initialized + u - uninitialized + a - unallocated (memory has been allocated by the slab layer, but has not + yet been handed off to anybody) + f - freed (memory has been allocated by the slab layer, but has been freed + by the previous owner) + +In order to figure out where (relative to the start of the object) the +uninitialized memory was located, we have to look at the disassembly. For +that, we'll need the RIP address again: + +RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 + + $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8: + ffffffff8104edc8: mov %r8,0x8(%r8) + ffffffff8104edcc: test %r10d,%r10d + ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168> + ffffffff8104edd5: mov %rax,%rdx + ffffffff8104edd8: mov $0xc,%ecx + ffffffff8104eddd: mov %r13,%rdi + ffffffff8104ede0: mov $0x30,%eax + ffffffff8104ede5: mov %rdx,%rsi + ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi) + ffffffff8104edea: test $0x2,%al + ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0> + ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi) + ffffffff8104edf0: test $0x1,%al + ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5> + ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi) + ffffffff8104edf5: mov %r8,%rdi + ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free> + +As expected, it's the "rep movsl" instruction from the memcpy() that causes +the warning. We know about REP MOVSL that it uses the register RCX to count +the number of remaining iterations. By taking a look at the register dump +again (from the kmemcheck report), we can figure out how many bytes were left +to copy: + +RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 + +By looking at the disassembly, we also see that %ecx is being loaded with the +value $0xc just before (ffffffff8104edd8), so we are very lucky. Keep in mind +that this is the number of iterations, not bytes. And since this is a "long" +operation, we need to multiply by 4 to get the number of bytes. So this means +that the uninitialized value was encountered at 4 * (0xc - 0x9) = 12 bytes +from the start of the object. + +We can now try to figure out which field of the "struct siginfo" that was not +initialized. This is the beginning of the struct: + +40 typedef struct siginfo { +41 int si_signo; +42 int si_errno; +43 int si_code; +44 +45 union { +.. +92 } _sifields; +93 } siginfo_t; + +On 64-bit, the int is 4 bytes long, so it must the the union member that has +not been initialized. We can verify this using gdb: + + $ gdb vmlinux + ... + (gdb) p &((struct siginfo *) 0)->_sifields + $1 = (union {...} *) 0x10 + +Actually, it seems that the union member is located at offset 0x10 -- which +means that gcc has inserted 4 bytes of padding between the members si_code +and _sifields. We can now get a fuller picture of the memory dump: + + _----------------------------=> si_code + / _--------------------=> (padding) + | / _------------=> _sifields(._kill._pid) + | | / _----=> _sifields(._kill._uid) + | | | / +-------|-------|-------|-------| +80000000000000000000000000000000000000000088ffff0000000000000000 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u + +This allows us to realize another important fact: si_code contains the value +0x80. Remember that x86 is little endian, so the first 4 bytes "80000000" are +really the number 0x00000080. With a bit of research, we find that this is +actually the constant SI_KERNEL defined in include/asm-generic/siginfo.h: + +144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */ + +This macro is used in exactly one place in the x86 kernel: In send_signal() +in kernel/signal.c: + +816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, +817 int group) +818 { +... +828 pending = group ? &t->signal->shared_pending : &t->pending; +... +851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && +852 (is_si_special(info) || +853 info->si_code >= 0))); +854 if (q) { +855 list_add_tail(&q->list, &pending->list); +856 switch ((unsigned long) info) { +... +865 case (unsigned long) SEND_SIG_PRIV: +866 q->info.si_signo = sig; +867 q->info.si_errno = 0; +868 q->info.si_code = SI_KERNEL; +869 q->info.si_pid = 0; +870 q->info.si_uid = 0; +871 break; +... +890 } + +Not only does this match with the .si_code member, it also matches the place +we found earlier when looking for where siginfo_t objects are enqueued on the +"shared_pending" list. + +So to sum up: It seems that it is the padding introduced by the compiler +between two struct fields that is uninitialized, and this gets reported when +we do a memcpy() on the struct. This means that we have identified a false +positive warning. + +Normally, kmemcheck will not report uninitialized accesses in memcpy() calls +when both the source and destination addresses are tracked. (Instead, we copy +the shadow bytemap as well). In this case, the destination address clearly +was not tracked. We can dig a little deeper into the stack trace from above: + + arch/x86/kernel/signal.c:805 + arch/x86/kernel/signal.c:871 + arch/x86/kernel/entry_64.S:694 + +And we clearly see that the destination siginfo object is located on the +stack: + +782 static void do_signal(struct pt_regs *regs) +783 { +784 struct k_sigaction ka; +785 siginfo_t info; +... +804 signr = get_signal_to_deliver(&info, &ka, regs, NULL); +... +854 } + +And this &info is what eventually gets passed to copy_siginfo() as the +destination argument. + +Now, even though we didn't find an actual error here, the example is still a +good one, because it shows how one would go about to find out what the report +was all about. + + +3.4. Annotating false positives +=============================== + +There are a few different ways to make annotations in the source code that +will keep kmemcheck from checking and reporting certain allocations. Here +they are: + + o __GFP_NOTRACK_FALSE_POSITIVE + + This flag can be passed to kmalloc() or kmem_cache_alloc() (therefore + also to other functions that end up calling one of these) to indicate + that the allocation should not be tracked because it would lead to + a false positive report. This is a "big hammer" way of silencing + kmemcheck; after all, even if the false positive pertains to + particular field in a struct, for example, we will now lose the + ability to find (real) errors in other parts of the same struct. + + Example: + + /* No warnings will ever trigger on accessing any part of x */ + x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE); + + o kmemcheck_bitfield_begin(name)/kmemcheck_bitfield_end(name) and + kmemcheck_annotate_bitfield(ptr, name) + + The first two of these three macros can be used inside struct + definitions to signal, respectively, the beginning and end of a + bitfield. Additionally, this will assign the bitfield a name, which + is given as an argument to the macros. + + Having used these markers, one can later use + kmemcheck_annotate_bitfield() at the point of allocation, to indicate + which parts of the allocation is part of a bitfield. + + Example: + + struct foo { + int x; + + kmemcheck_bitfield_begin(flags); + int flag_a:1; + int flag_b:1; + kmemcheck_bitfield_end(flags); + + int y; + }; + + struct foo *x = kmalloc(sizeof *x); + + /* No warnings will trigger on accessing the bitfield of x */ + kmemcheck_annotate_bitfield(x, flags); + + Note that kmemcheck_annotate_bitfield() can be used even before the + return value of kmalloc() is checked -- in other words, passing NULL + as the first argument is legal (and will do nothing). + + +4. Reporting errors +=================== + +As we have seen, kmemcheck will produce false positive reports. Therefore, it +is not very wise to blindly post kmemcheck warnings to mailing lists and +maintainers. Instead, I encourage maintainers and developers to find errors +in their own code. If you get a warning, you can try to work around it, try +to figure out if it's a real error or not, or simply ignore it. Most +developers know their own code and will quickly and efficiently determine the +root cause of a kmemcheck report. This is therefore also the most efficient +way to work with kmemcheck. + +That said, we (the kmemcheck maintainers) will always be on the lookout for +false positives that we can annotate and silence. So whatever you find, +please drop us a note privately! Kernel configs and steps to reproduce (if +available) are of course a great help too. + +Happy hacking! + + +5. Technical description +======================== + +kmemcheck works by marking memory pages non-present. This means that whenever +somebody attempts to access the page, a page fault is generated. The page +fault handler notices that the page was in fact only hidden, and so it calls +on the kmemcheck code to make further investigations. + +When the investigations are completed, kmemcheck "shows" the page by marking +it present (as it would be under normal circumstances). This way, the +interrupted code can continue as usual. + +But after the instruction has been executed, we should hide the page again, so +that we can catch the next access too! Now kmemcheck makes use of a debugging +feature of the processor, namely single-stepping. When the processor has +finished the one instruction that generated the memory access, a debug +exception is raised. From here, we simply hide the page again and continue +execution, this time with the single-stepping feature turned off. + +kmemcheck requires some assistance from the memory allocator in order to work. +The memory allocator needs to + + 1. Tell kmemcheck about newly allocated pages and pages that are about to + be freed. This allows kmemcheck to set up and tear down the shadow memory + for the pages in question. The shadow memory stores the status of each + byte in the allocation proper, e.g. whether it is initialized or + uninitialized. + + 2. Tell kmemcheck which parts of memory should be marked uninitialized. + There are actually a few more states, such as "not yet allocated" and + "recently freed". + +If a slab cache is set up using the SLAB_NOTRACK flag, it will never return +memory that can take page faults because of kmemcheck. + +If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still +request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags. +This does not prevent the page faults from occurring, however, but marks the +object in question as being initialized so that no warnings will ever be +produced for this object. + +Currently, the SLAB and SLUB allocators are supported by kmemcheck. diff --git a/Documentation/kmemleak.txt b/Documentation/kmemleak.txt index 0112da3b9ab..34f6638aa5a 100644 --- a/Documentation/kmemleak.txt +++ b/Documentation/kmemleak.txt @@ -16,13 +16,24 @@ Usage ----- CONFIG_DEBUG_KMEMLEAK in "Kernel hacking" has to be enabled. A kernel -thread scans the memory every 10 minutes (by default) and prints any new -unreferenced objects found. To trigger an intermediate scan and display -all the possible memory leaks: +thread scans the memory every 10 minutes (by default) and prints the +number of new unreferenced objects found. To display the details of all +the possible memory leaks: # mount -t debugfs nodev /sys/kernel/debug/ # cat /sys/kernel/debug/kmemleak +To trigger an intermediate memory scan: + + # echo scan > /sys/kernel/debug/kmemleak + +To clear the list of all current possible memory leaks: + + # echo clear > /sys/kernel/debug/kmemleak + +New leaks will then come up upon reading /sys/kernel/debug/kmemleak +again. + Note that the orphan objects are listed in the order they were allocated and one object at the beginning of the list may cause other subsequent objects to be reported as orphan. @@ -31,16 +42,24 @@ Memory scanning parameters can be modified at run-time by writing to the /sys/kernel/debug/kmemleak file. The following parameters are supported: off - disable kmemleak (irreversible) - stack=on - enable the task stacks scanning + stack=on - enable the task stacks scanning (default) stack=off - disable the tasks stacks scanning - scan=on - start the automatic memory scanning thread + scan=on - start the automatic memory scanning thread (default) scan=off - stop the automatic memory scanning thread - scan=<secs> - set the automatic memory scanning period in seconds (0 - to disable it) + scan=<secs> - set the automatic memory scanning period in seconds + (default 600, 0 to stop the automatic scanning) + scan - trigger a memory scan + clear - clear list of current memory leak suspects, done by + marking all current reported unreferenced objects grey + dump=<addr> - dump information about the object found at <addr> Kmemleak can also be disabled at boot-time by passing "kmemleak=off" on the kernel command line. +Memory may be allocated or freed before kmemleak is initialised and +these actions are stored in an early log buffer. The size of this buffer +is configured via the CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE option. + Basic Algorithm --------------- @@ -77,6 +96,27 @@ avoid this, kmemleak can also store the number of values pointing to an address inside the block address range that need to be found so that the block is not considered a leak. One example is __vmalloc(). +Testing specific sections with kmemleak +--------------------------------------- + +Upon initial bootup your /sys/kernel/debug/kmemleak output page may be +quite extensive. This can also be the case if you have very buggy code +when doing development. To work around these situations you can use the +'clear' command to clear all reported unreferenced objects from the +/sys/kernel/debug/kmemleak output. By issuing a 'scan' after a 'clear' +you can find new unreferenced objects; this should help with testing +specific sections of code. + +To test a critical section on demand with a clean kmemleak do: + + # echo clear > /sys/kernel/debug/kmemleak + ... test your kernel or modules ... + # echo scan > /sys/kernel/debug/kmemleak + +Then as usual to get your report with: + + # cat /sys/kernel/debug/kmemleak + Kmemleak API ------------ diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt index 1e7a769a10f..053037a1fe6 100644 --- a/Documentation/kprobes.txt +++ b/Documentation/kprobes.txt @@ -507,9 +507,9 @@ http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf (pages 101-115) Appendix A: The kprobes debugfs interface With recent kernels (> 2.6.20) the list of registered kprobes is visible -under the /debug/kprobes/ directory (assuming debugfs is mounted at /debug). +under the /sys/kernel/debug/kprobes/ directory (assuming debugfs is mounted at //sys/kernel/debug). -/debug/kprobes/list: Lists all registered probes on the system +/sys/kernel/debug/kprobes/list: Lists all registered probes on the system c015d71a k vfs_read+0x0 c011a316 j do_fork+0x0 @@ -525,7 +525,7 @@ virtual addresses that correspond to modules that've been unloaded), such probes are marked with [GONE]. If the probe is temporarily disabled, such probes are marked with [DISABLED]. -/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly. +/sys/kernel/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly. Provides a knob to globally and forcibly turn registered kprobes ON or OFF. By default, all kprobes are enabled. By echoing "0" to this file, all diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt new file mode 100644 index 00000000000..5a4bc8cf6d0 --- /dev/null +++ b/Documentation/kvm/api.txt @@ -0,0 +1,759 @@ +The Definitive KVM (Kernel-based Virtual Machine) API Documentation +=================================================================== + +1. General description + +The kvm API is a set of ioctls that are issued to control various aspects +of a virtual machine. The ioctls belong to three classes + + - System ioctls: These query and set global attributes which affect the + whole kvm subsystem. In addition a system ioctl is used to create + virtual machines + + - VM ioctls: These query and set attributes that affect an entire virtual + machine, for example memory layout. In addition a VM ioctl is used to + create virtual cpus (vcpus). + + Only run VM ioctls from the same process (address space) that was used + to create the VM. + + - vcpu ioctls: These query and set attributes that control the operation + of a single virtual cpu. + + Only run vcpu ioctls from the same thread that was used to create the + vcpu. + +2. File descritpors + +The kvm API is centered around file descriptors. An initial +open("/dev/kvm") obtains a handle to the kvm subsystem; this handle +can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this +handle will create a VM file descripror which can be used to issue VM +ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu +and return a file descriptor pointing to it. Finally, ioctls on a vcpu +fd can be used to control the vcpu, including the important task of +actually running guest code. + +In general file descriptors can be migrated among processes by means +of fork() and the SCM_RIGHTS facility of unix domain socket. These +kinds of tricks are explicitly not supported by kvm. While they will +not cause harm to the host, their actual behavior is not guaranteed by +the API. The only supported use is one virtual machine per process, +and one vcpu per thread. + +3. Extensions + +As of Linux 2.6.22, the KVM ABI has been stabilized: no backward +incompatible change are allowed. However, there is an extension +facility that allows backward-compatible extensions to the API to be +queried and used. + +The extension mechanism is not based on on the Linux version number. +Instead, kvm defines extension identifiers and a facility to query +whether a particular extension identifier is available. If it is, a +set of ioctls is available for application use. + +4. API description + +This section describes ioctls that can be used to control kvm guests. +For each ioctl, the following information is provided along with a +description: + + Capability: which KVM extension provides this ioctl. Can be 'basic', + which means that is will be provided by any kernel that supports + API version 12 (see section 4.1), or a KVM_CAP_xyz constant, which + means availability needs to be checked with KVM_CHECK_EXTENSION + (see section 4.4). + + Architectures: which instruction set architectures provide this ioctl. + x86 includes both i386 and x86_64. + + Type: system, vm, or vcpu. + + Parameters: what parameters are accepted by the ioctl. + + Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL) + are not detailed, but errors with specific meanings are. + +4.1 KVM_GET_API_VERSION + +Capability: basic +Architectures: all +Type: system ioctl +Parameters: none +Returns: the constant KVM_API_VERSION (=12) + +This identifies the API version as the stable kvm API. It is not +expected that this number will change. However, Linux 2.6.20 and +2.6.21 report earlier versions; these are not documented and not +supported. Applications should refuse to run if KVM_GET_API_VERSION +returns a value other than 12. If this check passes, all ioctls +described as 'basic' will be available. + +4.2 KVM_CREATE_VM + +Capability: basic +Architectures: all +Type: system ioctl +Parameters: none +Returns: a VM fd that can be used to control the new virtual machine. + +The new VM has no virtual cpus and no memory. An mmap() of a VM fd +will access the virtual machine's physical address space; offset zero +corresponds to guest physical address zero. Use of mmap() on a VM fd +is discouraged if userspace memory allocation (KVM_CAP_USER_MEMORY) is +available. + +4.3 KVM_GET_MSR_INDEX_LIST + +Capability: basic +Architectures: x86 +Type: system +Parameters: struct kvm_msr_list (in/out) +Returns: 0 on success; -1 on error +Errors: + E2BIG: the msr index list is to be to fit in the array specified by + the user. + +struct kvm_msr_list { + __u32 nmsrs; /* number of msrs in entries */ + __u32 indices[0]; +}; + +This ioctl returns the guest msrs that are supported. The list varies +by kvm version and host processor, but does not change otherwise. The +user fills in the size of the indices array in nmsrs, and in return +kvm adjusts nmsrs to reflect the actual number of msrs and fills in +the indices array with their numbers. + +4.4 KVM_CHECK_EXTENSION + +Capability: basic +Architectures: all +Type: system ioctl +Parameters: extension identifier (KVM_CAP_*) +Returns: 0 if unsupported; 1 (or some other positive integer) if supported + +The API allows the application to query about extensions to the core +kvm API. Userspace passes an extension identifier (an integer) and +receives an integer that describes the extension availability. +Generally 0 means no and 1 means yes, but some extensions may report +additional information in the integer return value. + +4.5 KVM_GET_VCPU_MMAP_SIZE + +Capability: basic +Architectures: all +Type: system ioctl +Parameters: none +Returns: size of vcpu mmap area, in bytes + +The KVM_RUN ioctl (cf.) communicates with userspace via a shared +memory region. This ioctl returns the size of that region. See the +KVM_RUN documentation for details. + +4.6 KVM_SET_MEMORY_REGION + +Capability: basic +Architectures: all +Type: vm ioctl +Parameters: struct kvm_memory_region (in) +Returns: 0 on success, -1 on error + +struct kvm_memory_region { + __u32 slot; + __u32 flags; + __u64 guest_phys_addr; + __u64 memory_size; /* bytes */ +}; + +/* for kvm_memory_region::flags */ +#define KVM_MEM_LOG_DIRTY_PAGES 1UL + +This ioctl allows the user to create or modify a guest physical memory +slot. When changing an existing slot, it may be moved in the guest +physical memory space, or its flags may be modified. It may not be +resized. Slots may not overlap. + +The flags field supports just one flag, KVM_MEM_LOG_DIRTY_PAGES, which +instructs kvm to keep track of writes to memory within the slot. See +the KVM_GET_DIRTY_LOG ioctl. + +It is recommended to use the KVM_SET_USER_MEMORY_REGION ioctl instead +of this API, if available. This newer API allows placing guest memory +at specified locations in the host address space, yielding better +control and easy access. + +4.6 KVM_CREATE_VCPU + +Capability: basic +Architectures: all +Type: vm ioctl +Parameters: vcpu id (apic id on x86) +Returns: vcpu fd on success, -1 on error + +This API adds a vcpu to a virtual machine. The vcpu id is a small integer +in the range [0, max_vcpus). + +4.7 KVM_GET_DIRTY_LOG (vm ioctl) + +Capability: basic +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_dirty_log (in/out) +Returns: 0 on success, -1 on error + +/* for KVM_GET_DIRTY_LOG */ +struct kvm_dirty_log { + __u32 slot; + __u32 padding; + union { + void __user *dirty_bitmap; /* one bit per page */ + __u64 padding; + }; +}; + +Given a memory slot, return a bitmap containing any pages dirtied +since the last call to this ioctl. Bit 0 is the first page in the +memory slot. Ensure the entire structure is cleared to avoid padding +issues. + +4.8 KVM_SET_MEMORY_ALIAS + +Capability: basic +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_memory_alias (in) +Returns: 0 (success), -1 (error) + +struct kvm_memory_alias { + __u32 slot; /* this has a different namespace than memory slots */ + __u32 flags; + __u64 guest_phys_addr; + __u64 memory_size; + __u64 target_phys_addr; +}; + +Defines a guest physical address space region as an alias to another +region. Useful for aliased address, for example the VGA low memory +window. Should not be used with userspace memory. + +4.9 KVM_RUN + +Capability: basic +Architectures: all +Type: vcpu ioctl +Parameters: none +Returns: 0 on success, -1 on error +Errors: + EINTR: an unmasked signal is pending + +This ioctl is used to run a guest virtual cpu. While there are no +explicit parameters, there is an implicit parameter block that can be +obtained by mmap()ing the vcpu fd at offset 0, with the size given by +KVM_GET_VCPU_MMAP_SIZE. The parameter block is formatted as a 'struct +kvm_run' (see below). + +4.10 KVM_GET_REGS + +Capability: basic +Architectures: all +Type: vcpu ioctl +Parameters: struct kvm_regs (out) +Returns: 0 on success, -1 on error + +Reads the general purpose registers from the vcpu. + +/* x86 */ +struct kvm_regs { + /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */ + __u64 rax, rbx, rcx, rdx; + __u64 rsi, rdi, rsp, rbp; + __u64 r8, r9, r10, r11; + __u64 r12, r13, r14, r15; + __u64 rip, rflags; +}; + +4.11 KVM_SET_REGS + +Capability: basic +Architectures: all +Type: vcpu ioctl +Parameters: struct kvm_regs (in) +Returns: 0 on success, -1 on error + +Writes the general purpose registers into the vcpu. + +See KVM_GET_REGS for the data structure. + +4.12 KVM_GET_SREGS + +Capability: basic +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_sregs (out) +Returns: 0 on success, -1 on error + +Reads special registers from the vcpu. + +/* x86 */ +struct kvm_sregs { + struct kvm_segment cs, ds, es, fs, gs, ss; + struct kvm_segment tr, ldt; + struct kvm_dtable gdt, idt; + __u64 cr0, cr2, cr3, cr4, cr8; + __u64 efer; + __u64 apic_base; + __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64]; +}; + +interrupt_bitmap is a bitmap of pending external interrupts. At most +one bit may be set. This interrupt has been acknowledged by the APIC +but not yet injected into the cpu core. + +4.13 KVM_SET_SREGS + +Capability: basic +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_sregs (in) +Returns: 0 on success, -1 on error + +Writes special registers into the vcpu. See KVM_GET_SREGS for the +data structures. + +4.14 KVM_TRANSLATE + +Capability: basic +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_translation (in/out) +Returns: 0 on success, -1 on error + +Translates a virtual address according to the vcpu's current address +translation mode. + +struct kvm_translation { + /* in */ + __u64 linear_address; + + /* out */ + __u64 physical_address; + __u8 valid; + __u8 writeable; + __u8 usermode; + __u8 pad[5]; +}; + +4.15 KVM_INTERRUPT + +Capability: basic +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_interrupt (in) +Returns: 0 on success, -1 on error + +Queues a hardware interrupt vector to be injected. This is only +useful if in-kernel local APIC is not used. + +/* for KVM_INTERRUPT */ +struct kvm_interrupt { + /* in */ + __u32 irq; +}; + +Note 'irq' is an interrupt vector, not an interrupt pin or line. + +4.16 KVM_DEBUG_GUEST + +Capability: basic +Architectures: none +Type: vcpu ioctl +Parameters: none) +Returns: -1 on error + +Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead. + +4.17 KVM_GET_MSRS + +Capability: basic +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_msrs (in/out) +Returns: 0 on success, -1 on error + +Reads model-specific registers from the vcpu. Supported msr indices can +be obtained using KVM_GET_MSR_INDEX_LIST. + +struct kvm_msrs { + __u32 nmsrs; /* number of msrs in entries */ + __u32 pad; + + struct kvm_msr_entry entries[0]; +}; + +struct kvm_msr_entry { + __u32 index; + __u32 reserved; + __u64 data; +}; + +Application code should set the 'nmsrs' member (which indicates the +size of the entries array) and the 'index' member of each array entry. +kvm will fill in the 'data' member. + +4.18 KVM_SET_MSRS + +Capability: basic +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_msrs (in) +Returns: 0 on success, -1 on error + +Writes model-specific registers to the vcpu. See KVM_GET_MSRS for the +data structures. + +Application code should set the 'nmsrs' member (which indicates the +size of the entries array), and the 'index' and 'data' members of each +array entry. + +4.19 KVM_SET_CPUID + +Capability: basic +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_cpuid (in) +Returns: 0 on success, -1 on error + +Defines the vcpu responses to the cpuid instruction. Applications +should use the KVM_SET_CPUID2 ioctl if available. + + +struct kvm_cpuid_entry { + __u32 function; + __u32 eax; + __u32 ebx; + __u32 ecx; + __u32 edx; + __u32 padding; +}; + +/* for KVM_SET_CPUID */ +struct kvm_cpuid { + __u32 nent; + __u32 padding; + struct kvm_cpuid_entry entries[0]; +}; + +4.20 KVM_SET_SIGNAL_MASK + +Capability: basic +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_signal_mask (in) +Returns: 0 on success, -1 on error + +Defines which signals are blocked during execution of KVM_RUN. This +signal mask temporarily overrides the threads signal mask. Any +unblocked signal received (except SIGKILL and SIGSTOP, which retain +their traditional behaviour) will cause KVM_RUN to return with -EINTR. + +Note the signal will only be delivered if not blocked by the original +signal mask. + +/* for KVM_SET_SIGNAL_MASK */ +struct kvm_signal_mask { + __u32 len; + __u8 sigset[0]; +}; + +4.21 KVM_GET_FPU + +Capability: basic +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_fpu (out) +Returns: 0 on success, -1 on error + +Reads the floating point state from the vcpu. + +/* for KVM_GET_FPU and KVM_SET_FPU */ +struct kvm_fpu { + __u8 fpr[8][16]; + __u16 fcw; + __u16 fsw; + __u8 ftwx; /* in fxsave format */ + __u8 pad1; + __u16 last_opcode; + __u64 last_ip; + __u64 last_dp; + __u8 xmm[16][16]; + __u32 mxcsr; + __u32 pad2; +}; + +4.22 KVM_SET_FPU + +Capability: basic +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_fpu (in) +Returns: 0 on success, -1 on error + +Writes the floating point state to the vcpu. + +/* for KVM_GET_FPU and KVM_SET_FPU */ +struct kvm_fpu { + __u8 fpr[8][16]; + __u16 fcw; + __u16 fsw; + __u8 ftwx; /* in fxsave format */ + __u8 pad1; + __u16 last_opcode; + __u64 last_ip; + __u64 last_dp; + __u8 xmm[16][16]; + __u32 mxcsr; + __u32 pad2; +}; + +4.23 KVM_CREATE_IRQCHIP + +Capability: KVM_CAP_IRQCHIP +Architectures: x86, ia64 +Type: vm ioctl +Parameters: none +Returns: 0 on success, -1 on error + +Creates an interrupt controller model in the kernel. On x86, creates a virtual +ioapic, a virtual PIC (two PICs, nested), and sets up future vcpus to have a +local APIC. IRQ routing for GSIs 0-15 is set to both PIC and IOAPIC; GSI 16-23 +only go to the IOAPIC. On ia64, a IOSAPIC is created. + +4.24 KVM_IRQ_LINE + +Capability: KVM_CAP_IRQCHIP +Architectures: x86, ia64 +Type: vm ioctl +Parameters: struct kvm_irq_level +Returns: 0 on success, -1 on error + +Sets the level of a GSI input to the interrupt controller model in the kernel. +Requires that an interrupt controller model has been previously created with +KVM_CREATE_IRQCHIP. Note that edge-triggered interrupts require the level +to be set to 1 and then back to 0. + +struct kvm_irq_level { + union { + __u32 irq; /* GSI */ + __s32 status; /* not used for KVM_IRQ_LEVEL */ + }; + __u32 level; /* 0 or 1 */ +}; + +4.25 KVM_GET_IRQCHIP + +Capability: KVM_CAP_IRQCHIP +Architectures: x86, ia64 +Type: vm ioctl +Parameters: struct kvm_irqchip (in/out) +Returns: 0 on success, -1 on error + +Reads the state of a kernel interrupt controller created with +KVM_CREATE_IRQCHIP into a buffer provided by the caller. + +struct kvm_irqchip { + __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */ + __u32 pad; + union { + char dummy[512]; /* reserving space */ + struct kvm_pic_state pic; + struct kvm_ioapic_state ioapic; + } chip; +}; + +4.26 KVM_SET_IRQCHIP + +Capability: KVM_CAP_IRQCHIP +Architectures: x86, ia64 +Type: vm ioctl +Parameters: struct kvm_irqchip (in) +Returns: 0 on success, -1 on error + +Sets the state of a kernel interrupt controller created with +KVM_CREATE_IRQCHIP from a buffer provided by the caller. + +struct kvm_irqchip { + __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */ + __u32 pad; + union { + char dummy[512]; /* reserving space */ + struct kvm_pic_state pic; + struct kvm_ioapic_state ioapic; + } chip; +}; + +5. The kvm_run structure + +Application code obtains a pointer to the kvm_run structure by +mmap()ing a vcpu fd. From that point, application code can control +execution by changing fields in kvm_run prior to calling the KVM_RUN +ioctl, and obtain information about the reason KVM_RUN returned by +looking up structure members. + +struct kvm_run { + /* in */ + __u8 request_interrupt_window; + +Request that KVM_RUN return when it becomes possible to inject external +interrupts into the guest. Useful in conjunction with KVM_INTERRUPT. + + __u8 padding1[7]; + + /* out */ + __u32 exit_reason; + +When KVM_RUN has returned successfully (return value 0), this informs +application code why KVM_RUN has returned. Allowable values for this +field are detailed below. + + __u8 ready_for_interrupt_injection; + +If request_interrupt_window has been specified, this field indicates +an interrupt can be injected now with KVM_INTERRUPT. + + __u8 if_flag; + +The value of the current interrupt flag. Only valid if in-kernel +local APIC is not used. + + __u8 padding2[2]; + + /* in (pre_kvm_run), out (post_kvm_run) */ + __u64 cr8; + +The value of the cr8 register. Only valid if in-kernel local APIC is +not used. Both input and output. + + __u64 apic_base; + +The value of the APIC BASE msr. Only valid if in-kernel local +APIC is not used. Both input and output. + + union { + /* KVM_EXIT_UNKNOWN */ + struct { + __u64 hardware_exit_reason; + } hw; + +If exit_reason is KVM_EXIT_UNKNOWN, the vcpu has exited due to unknown +reasons. Further architecture-specific information is available in +hardware_exit_reason. + + /* KVM_EXIT_FAIL_ENTRY */ + struct { + __u64 hardware_entry_failure_reason; + } fail_entry; + +If exit_reason is KVM_EXIT_FAIL_ENTRY, the vcpu could not be run due +to unknown reasons. Further architecture-specific information is +available in hardware_entry_failure_reason. + + /* KVM_EXIT_EXCEPTION */ + struct { + __u32 exception; + __u32 error_code; + } ex; + +Unused. + + /* KVM_EXIT_IO */ + struct { +#define KVM_EXIT_IO_IN 0 +#define KVM_EXIT_IO_OUT 1 + __u8 direction; + __u8 size; /* bytes */ + __u16 port; + __u32 count; + __u64 data_offset; /* relative to kvm_run start */ + } io; + +If exit_reason is KVM_EXIT_IO_IN or KVM_EXIT_IO_OUT, then the vcpu has +executed a port I/O instruction which could not be satisfied by kvm. +data_offset describes where the data is located (KVM_EXIT_IO_OUT) or +where kvm expects application code to place the data for the next +KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a patcked array. + + struct { + struct kvm_debug_exit_arch arch; + } debug; + +Unused. + + /* KVM_EXIT_MMIO */ + struct { + __u64 phys_addr; + __u8 data[8]; + __u32 len; + __u8 is_write; + } mmio; + +If exit_reason is KVM_EXIT_MMIO or KVM_EXIT_IO_OUT, then the vcpu has +executed a memory-mapped I/O instruction which could not be satisfied +by kvm. The 'data' member contains the written data if 'is_write' is +true, and should be filled by application code otherwise. + + /* KVM_EXIT_HYPERCALL */ + struct { + __u64 nr; + __u64 args[6]; + __u64 ret; + __u32 longmode; + __u32 pad; + } hypercall; + +Unused. + + /* KVM_EXIT_TPR_ACCESS */ + struct { + __u64 rip; + __u32 is_write; + __u32 pad; + } tpr_access; + +To be documented (KVM_TPR_ACCESS_REPORTING). + + /* KVM_EXIT_S390_SIEIC */ + struct { + __u8 icptcode; + __u64 mask; /* psw upper half */ + __u64 addr; /* psw lower half */ + __u16 ipa; + __u32 ipb; + } s390_sieic; + +s390 specific. + + /* KVM_EXIT_S390_RESET */ +#define KVM_S390_RESET_POR 1 +#define KVM_S390_RESET_CLEAR 2 +#define KVM_S390_RESET_SUBSYSTEM 4 +#define KVM_S390_RESET_CPU_INIT 8 +#define KVM_S390_RESET_IPL 16 + __u64 s390_reset_flags; + +s390 specific. + + /* KVM_EXIT_DCR */ + struct { + __u32 dcrn; + __u32 data; + __u8 is_write; + } dcr; + +powerpc specific. + + /* Fix the size of the union. */ + char padding[256]; + }; +}; diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index 78e354b42f6..e2ddcdeb61b 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -36,8 +36,6 @@ detailed description): - Bluetooth enable and disable - video output switching, expansion control - ThinkLight on and off - - limited docking and undocking - - UltraBay eject - CMOS/UCMS control - LED control - ACPI sounds @@ -729,131 +727,6 @@ cannot be read or if it is unknown, thinkpad-acpi will report it as "off". It is impossible to know if the status returned through sysfs is valid. -Docking / undocking -- /proc/acpi/ibm/dock ------------------------------------------- - -Docking and undocking (e.g. with the X4 UltraBase) requires some -actions to be taken by the operating system to safely make or break -the electrical connections with the dock. - -The docking feature of this driver generates the following ACPI events: - - ibm/dock GDCK 00000003 00000001 -- eject request - ibm/dock GDCK 00000003 00000002 -- undocked - ibm/dock GDCK 00000000 00000003 -- docked - -NOTE: These events will only be generated if the laptop was docked -when originally booted. This is due to the current lack of support for -hot plugging of devices in the Linux ACPI framework. If the laptop was -booted while not in the dock, the following message is shown in the -logs: - - Mar 17 01:42:34 aero kernel: thinkpad_acpi: dock device not present - -In this case, no dock-related events are generated but the dock and -undock commands described below still work. They can be executed -manually or triggered by Fn key combinations (see the example acpid -configuration files included in the driver tarball package available -on the web site). - -When the eject request button on the dock is pressed, the first event -above is generated. The handler for this event should issue the -following command: - - echo undock > /proc/acpi/ibm/dock - -After the LED on the dock goes off, it is safe to eject the laptop. -Note: if you pressed this key by mistake, go ahead and eject the -laptop, then dock it back in. Otherwise, the dock may not function as -expected. - -When the laptop is docked, the third event above is generated. The -handler for this event should issue the following command to fully -enable the dock: - - echo dock > /proc/acpi/ibm/dock - -The contents of the /proc/acpi/ibm/dock file shows the current status -of the dock, as provided by the ACPI framework. - -The docking support in this driver does not take care of enabling or -disabling any other devices you may have attached to the dock. For -example, a CD drive plugged into the UltraBase needs to be disabled or -enabled separately. See the provided example acpid configuration files -for how this can be accomplished. - -There is no support yet for PCI devices that may be attached to a -docking station, e.g. in the ThinkPad Dock II. The driver currently -does not recognize, enable or disable such devices. This means that -the only docking stations currently supported are the X-series -UltraBase docks and "dumb" port replicators like the Mini Dock (the -latter don't need any ACPI support, actually). - - -UltraBay eject -- /proc/acpi/ibm/bay ------------------------------------- - -Inserting or ejecting an UltraBay device requires some actions to be -taken by the operating system to safely make or break the electrical -connections with the device. - -This feature generates the following ACPI events: - - ibm/bay MSTR 00000003 00000000 -- eject request - ibm/bay MSTR 00000001 00000000 -- eject lever inserted - -NOTE: These events will only be generated if the UltraBay was present -when the laptop was originally booted (on the X series, the UltraBay -is in the dock, so it may not be present if the laptop was undocked). -This is due to the current lack of support for hot plugging of devices -in the Linux ACPI framework. If the laptop was booted without the -UltraBay, the following message is shown in the logs: - - Mar 17 01:42:34 aero kernel: thinkpad_acpi: bay device not present - -In this case, no bay-related events are generated but the eject -command described below still works. It can be executed manually or -triggered by a hot key combination. - -Sliding the eject lever generates the first event shown above. The -handler for this event should take whatever actions are necessary to -shut down the device in the UltraBay (e.g. call idectl), then issue -the following command: - - echo eject > /proc/acpi/ibm/bay - -After the LED on the UltraBay goes off, it is safe to pull out the -device. - -When the eject lever is inserted, the second event above is -generated. The handler for this event should take whatever actions are -necessary to enable the UltraBay device (e.g. call idectl). - -The contents of the /proc/acpi/ibm/bay file shows the current status -of the UltraBay, as provided by the ACPI framework. - -EXPERIMENTAL warm eject support on the 600e/x, A22p and A3x (To use -this feature, you need to supply the experimental=1 parameter when -loading the module): - -These models do not have a button near the UltraBay device to request -a hot eject but rather require the laptop to be put to sleep -(suspend-to-ram) before the bay device is ejected or inserted). -The sequence of steps to eject the device is as follows: - - echo eject > /proc/acpi/ibm/bay - put the ThinkPad to sleep - remove the drive - resume from sleep - cat /proc/acpi/ibm/bay should show that the drive was removed - -On the A3x, both the UltraBay 2000 and UltraBay Plus devices are -supported. Use "eject2" instead of "eject" for the second bay. - -Note: the UltraBay eject support on the 600e/x, A22p and A3x is -EXPERIMENTAL and may not work as expected. USE WITH CAUTION! - - CMOS/UCMS control ----------------- @@ -920,7 +793,7 @@ The available commands are: echo '<LED number> off' >/proc/acpi/ibm/led echo '<LED number> blink' >/proc/acpi/ibm/led -The <LED number> range is 0 to 7. The set of LEDs that can be +The <LED number> range is 0 to 15. The set of LEDs that can be controlled varies from model to model. Here is the common ThinkPad mapping: @@ -932,6 +805,11 @@ mapping: 5 - UltraBase battery slot 6 - (unknown) 7 - standby + 8 - dock status 1 + 9 - dock status 2 + 10, 11 - (unknown) + 12 - thinkvantage + 13, 14, 15 - (unknown) All of the above can be turned on and off and can be made to blink. @@ -940,10 +818,12 @@ sysfs notes: The ThinkPad LED sysfs interface is described in detail by the LED class documentation, in Documentation/leds-class.txt. -The leds are named (in LED ID order, from 0 to 7): +The LEDs are named (in LED ID order, from 0 to 12): "tpacpi::power", "tpacpi:orange:batt", "tpacpi:green:batt", "tpacpi::dock_active", "tpacpi::bay_active", "tpacpi::dock_batt", -"tpacpi::unknown_led", "tpacpi::standby". +"tpacpi::unknown_led", "tpacpi::standby", "tpacpi::dock_status1", +"tpacpi::dock_status2", "tpacpi::unknown_led2", "tpacpi::unknown_led3", +"tpacpi::thinkvantage". Due to limitations in the sysfs LED class, if the status of the LED indicators cannot be read due to an error, thinkpad-acpi will report it as @@ -958,6 +838,12 @@ ThinkPad indicator LED should blink in hardware accelerated mode, use the "timer" trigger, and leave the delay_on and delay_off parameters set to zero (to request hardware acceleration autodetection). +LEDs that are known not to exist in a given ThinkPad model are not +made available through the sysfs interface. If you have a dock and you +notice there are LEDs listed for your ThinkPad that do not exist (and +are not in the dock), or if you notice that there are missing LEDs, +a report to ibm-acpi-devel@lists.sourceforge.net is appreciated. + ACPI sounds -- /proc/acpi/ibm/beep ---------------------------------- @@ -1156,17 +1042,19 @@ may not be distinct. Later Lenovo models that implement the ACPI display backlight brightness control methods have 16 levels, ranging from 0 to 15. -There are two interfaces to the firmware for direct brightness control, -EC and UCMS (or CMOS). To select which one should be used, use the -brightness_mode module parameter: brightness_mode=1 selects EC mode, -brightness_mode=2 selects UCMS mode, brightness_mode=3 selects EC -mode with NVRAM backing (so that brightness changes are remembered -across shutdown/reboot). +For IBM ThinkPads, there are two interfaces to the firmware for direct +brightness control, EC and UCMS (or CMOS). To select which one should be +used, use the brightness_mode module parameter: brightness_mode=1 selects +EC mode, brightness_mode=2 selects UCMS mode, brightness_mode=3 selects EC +mode with NVRAM backing (so that brightness changes are remembered across +shutdown/reboot). The driver tries to select which interface to use from a table of defaults for each ThinkPad model. If it makes a wrong choice, please report this as a bug, so that we can fix it. +Lenovo ThinkPads only support brightness_mode=2 (UCMS). + When display backlight brightness controls are available through the standard ACPI interface, it is best to use it instead of this direct ThinkPad-specific interface. The driver will disable its native @@ -1254,7 +1142,7 @@ Fan control and monitoring: fan speed, fan enable/disable procfs: /proc/acpi/ibm/fan sysfs device attributes: (hwmon "thinkpad") fan1_input, pwm1, - pwm1_enable + pwm1_enable, fan2_input sysfs hwmon driver attributes: fan_watchdog NOTE NOTE NOTE: fan control operations are disabled by default for @@ -1267,6 +1155,9 @@ from the hardware registers of the embedded controller. This is known to work on later R, T, X and Z series ThinkPads but may show a bogus value on other models. +Some Lenovo ThinkPads support a secondary fan. This fan cannot be +controlled separately, it shares the main fan control. + Fan levels: Most ThinkPad fans work in "levels" at the firmware interface. Level 0 @@ -1397,6 +1288,11 @@ hwmon device attribute fan1_input: which can take up to two minutes. May return rubbish on older ThinkPads. +hwmon device attribute fan2_input: + Fan tachometer reading, in RPM, for the secondary fan. + Available only on some ThinkPads. If the secondary fan is + not installed, will always read 0. + hwmon driver attribute fan_watchdog: Fan safety watchdog timer interval, in seconds. Minimum is 1 second, maximum is 120 seconds. 0 disables the watchdog. @@ -1555,3 +1451,7 @@ Sysfs interface changelog: 0x020300: hotkey enable/disable support removed, attributes hotkey_bios_enabled and hotkey_enable deprecated and marked for removal. + +0x020400: Marker for 16 LEDs support. Also, LEDs that are known + to not exist in a given model are not registered with + the LED sysfs class anymore. diff --git a/Documentation/leds-lp3944.txt b/Documentation/leds-lp3944.txt new file mode 100644 index 00000000000..c6eda18b15e --- /dev/null +++ b/Documentation/leds-lp3944.txt @@ -0,0 +1,50 @@ +Kernel driver lp3944 +==================== + + * National Semiconductor LP3944 Fun-light Chip + Prefix: 'lp3944' + Addresses scanned: None (see the Notes section below) + Datasheet: Publicly available at the National Semiconductor website + http://www.national.com/pf/LP/LP3944.html + +Authors: + Antonio Ospite <ospite@studenti.unina.it> + + +Description +----------- +The LP3944 is a helper chip that can drive up to 8 leds, with two programmable +DIM modes; it could even be used as a gpio expander but this driver assumes it +is used as a led controller. + +The DIM modes are used to set _blink_ patterns for leds, the pattern is +specified supplying two parameters: + - period: from 0s to 1.6s + - duty cycle: percentage of the period the led is on, from 0 to 100 + +Setting a led in DIM0 or DIM1 mode makes it blink according to the pattern. +See the datasheet for details. + +LP3944 can be found on Motorola A910 smartphone, where it drives the rgb +leds, the camera flash light and the lcds power. + + +Notes +----- +The chip is used mainly in embedded contexts, so this driver expects it is +registered using the i2c_board_info mechanism. + +To register the chip at address 0x60 on adapter 0, set the platform data +according to include/linux/leds-lp3944.h, set the i2c board info: + + static struct i2c_board_info __initdata a910_i2c_board_info[] = { + { + I2C_BOARD_INFO("lp3944", 0x60), + .platform_data = &a910_lp3944_leds, + }, + }; + +and register it in the platform init function + + i2c_register_board_info(0, a910_i2c_board_info, + ARRAY_SIZE(a910_i2c_board_info)); diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index 9ebcd6ef361..950cde6d6e5 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -1,7 +1,9 @@ -/*P:100 This is the Launcher code, a simple program which lays out the - * "physical" memory for the new Guest by mapping the kernel image and - * the virtual devices, then opens /dev/lguest to tell the kernel - * about the Guest and control it. :*/ +/*P:100 + * This is the Launcher code, a simple program which lays out the "physical" + * memory for the new Guest by mapping the kernel image and the virtual + * devices, then opens /dev/lguest to tell the kernel about the Guest and + * control it. +:*/ #define _LARGEFILE64_SOURCE #define _GNU_SOURCE #include <stdio.h> @@ -46,13 +48,15 @@ #include "linux/virtio_rng.h" #include "linux/virtio_ring.h" #include "asm/bootparam.h" -/*L:110 We can ignore the 39 include files we need for this program, but I do - * want to draw attention to the use of kernel-style types. +/*L:110 + * We can ignore the 42 include files we need for this program, but I do want + * to draw attention to the use of kernel-style types. * * As Linus said, "C is a Spartan language, and so should your naming be." I * like these abbreviations, so we define them here. Note that u64 is always * unsigned long long, which works on all Linux systems: this means that we can - * use %llu in printf for any u64. */ + * use %llu in printf for any u64. + */ typedef unsigned long long u64; typedef uint32_t u32; typedef uint16_t u16; @@ -69,8 +73,10 @@ typedef uint8_t u8; /* This will occupy 3 pages: it must be a power of 2. */ #define VIRTQUEUE_NUM 256 -/*L:120 verbose is both a global flag and a macro. The C preprocessor allows - * this, and although I wouldn't recommend it, it works quite nicely here. */ +/*L:120 + * verbose is both a global flag and a macro. The C preprocessor allows + * this, and although I wouldn't recommend it, it works quite nicely here. + */ static bool verbose; #define verbose(args...) \ do { if (verbose) printf(args); } while(0) @@ -87,8 +93,7 @@ static int lguest_fd; static unsigned int __thread cpu_id; /* This is our list of devices. */ -struct device_list -{ +struct device_list { /* Counter to assign interrupt numbers. */ unsigned int next_irq; @@ -100,8 +105,7 @@ struct device_list /* A single linked list of devices. */ struct device *dev; - /* And a pointer to the last device for easy append and also for - * configuration appending. */ + /* And a pointer to the last device for easy append. */ struct device *lastdev; }; @@ -109,8 +113,7 @@ struct device_list static struct device_list devices; /* The device structure describes a single device. */ -struct device -{ +struct device { /* The linked-list pointer. */ struct device *next; @@ -135,8 +138,7 @@ struct device }; /* The virtqueue structure describes a queue attached to a device. */ -struct virtqueue -{ +struct virtqueue { struct virtqueue *next; /* Which device owns me. */ @@ -168,20 +170,24 @@ static char **main_args; /* The original tty settings to restore on exit. */ static struct termios orig_term; -/* We have to be careful with barriers: our devices are all run in separate +/* + * We have to be careful with barriers: our devices are all run in separate * threads and so we need to make sure that changes visible to the Guest happen - * in precise order. */ + * in precise order. + */ #define wmb() __asm__ __volatile__("" : : : "memory") #define mb() __asm__ __volatile__("" : : : "memory") -/* Convert an iovec element to the given type. +/* + * Convert an iovec element to the given type. * * This is a fairly ugly trick: we need to know the size of the type and * alignment requirement to check the pointer is kosher. It's also nice to * have the name of the type in case we report failure. * * Typing those three things all the time is cumbersome and error prone, so we - * have a macro which sets them all up and passes to the real function. */ + * have a macro which sets them all up and passes to the real function. + */ #define convert(iov, type) \ ((type *)_convert((iov), sizeof(type), __alignof__(type), #type)) @@ -198,8 +204,10 @@ static void *_convert(struct iovec *iov, size_t size, size_t align, /* Wrapper for the last available index. Makes it easier to change. */ #define lg_last_avail(vq) ((vq)->last_avail_idx) -/* The virtio configuration space is defined to be little-endian. x86 is - * little-endian too, but it's nice to be explicit so we have these helpers. */ +/* + * The virtio configuration space is defined to be little-endian. x86 is + * little-endian too, but it's nice to be explicit so we have these helpers. + */ #define cpu_to_le16(v16) (v16) #define cpu_to_le32(v32) (v32) #define cpu_to_le64(v64) (v64) @@ -241,11 +249,12 @@ static u8 *get_feature_bits(struct device *dev) + dev->num_vq * sizeof(struct lguest_vqconfig); } -/*L:100 The Launcher code itself takes us out into userspace, that scary place - * where pointers run wild and free! Unfortunately, like most userspace - * programs, it's quite boring (which is why everyone likes to hack on the - * kernel!). Perhaps if you make up an Lguest Drinking Game at this point, it - * will get you through this section. Or, maybe not. +/*L:100 + * The Launcher code itself takes us out into userspace, that scary place where + * pointers run wild and free! Unfortunately, like most userspace programs, + * it's quite boring (which is why everyone likes to hack on the kernel!). + * Perhaps if you make up an Lguest Drinking Game at this point, it will get + * you through this section. Or, maybe not. * * The Launcher sets up a big chunk of memory to be the Guest's "physical" * memory and stores it in "guest_base". In other words, Guest physical == @@ -253,7 +262,8 @@ static u8 *get_feature_bits(struct device *dev) * * This can be tough to get your head around, but usually it just means that we * use these trivial conversion functions when the Guest gives us it's - * "physical" addresses: */ + * "physical" addresses: + */ static void *from_guest_phys(unsigned long addr) { return guest_base + addr; @@ -268,7 +278,8 @@ static unsigned long to_guest_phys(const void *addr) * Loading the Kernel. * * We start with couple of simple helper routines. open_or_die() avoids - * error-checking code cluttering the callers: */ + * error-checking code cluttering the callers: + */ static int open_or_die(const char *name, int flags) { int fd = open(name, flags); @@ -283,12 +294,19 @@ static void *map_zeroed_pages(unsigned int num) int fd = open_or_die("/dev/zero", O_RDONLY); void *addr; - /* We use a private mapping (ie. if we write to the page, it will be - * copied). */ + /* + * We use a private mapping (ie. if we write to the page, it will be + * copied). + */ addr = mmap(NULL, getpagesize() * num, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, fd, 0); if (addr == MAP_FAILED) err(1, "Mmaping %u pages of /dev/zero", num); + + /* + * One neat mmap feature is that you can close the fd, and it + * stays mapped. + */ close(fd); return addr; @@ -305,20 +323,24 @@ static void *get_pages(unsigned int num) return addr; } -/* This routine is used to load the kernel or initrd. It tries mmap, but if +/* + * This routine is used to load the kernel or initrd. It tries mmap, but if * that fails (Plan 9's kernel file isn't nicely aligned on page boundaries), - * it falls back to reading the memory in. */ + * it falls back to reading the memory in. + */ static void map_at(int fd, void *addr, unsigned long offset, unsigned long len) { ssize_t r; - /* We map writable even though for some segments are marked read-only. + /* + * We map writable even though for some segments are marked read-only. * The kernel really wants to be writable: it patches its own * instructions. * * MAP_PRIVATE means that the page won't be copied until a write is * done to it. This allows us to share untouched memory between - * Guests. */ + * Guests. + */ if (mmap(addr, len, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_FIXED|MAP_PRIVATE, fd, offset) != MAP_FAILED) return; @@ -329,7 +351,8 @@ static void map_at(int fd, void *addr, unsigned long offset, unsigned long len) err(1, "Reading offset %lu len %lu gave %zi", offset, len, r); } -/* This routine takes an open vmlinux image, which is in ELF, and maps it into +/* + * This routine takes an open vmlinux image, which is in ELF, and maps it into * the Guest memory. ELF = Embedded Linking Format, which is the format used * by all modern binaries on Linux including the kernel. * @@ -337,23 +360,28 @@ static void map_at(int fd, void *addr, unsigned long offset, unsigned long len) * address. We use the physical address; the Guest will map itself to the * virtual address. * - * We return the starting address. */ + * We return the starting address. + */ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr) { Elf32_Phdr phdr[ehdr->e_phnum]; unsigned int i; - /* Sanity checks on the main ELF header: an x86 executable with a - * reasonable number of correctly-sized program headers. */ + /* + * Sanity checks on the main ELF header: an x86 executable with a + * reasonable number of correctly-sized program headers. + */ if (ehdr->e_type != ET_EXEC || ehdr->e_machine != EM_386 || ehdr->e_phentsize != sizeof(Elf32_Phdr) || ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf32_Phdr)) errx(1, "Malformed elf header"); - /* An ELF executable contains an ELF header and a number of "program" + /* + * An ELF executable contains an ELF header and a number of "program" * headers which indicate which parts ("segments") of the program to - * load where. */ + * load where. + */ /* We read in all the program headers at once: */ if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0) @@ -361,8 +389,10 @@ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr) if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr)) err(1, "Reading program headers"); - /* Try all the headers: there are usually only three. A read-only one, - * a read-write one, and a "note" section which we don't load. */ + /* + * Try all the headers: there are usually only three. A read-only one, + * a read-write one, and a "note" section which we don't load. + */ for (i = 0; i < ehdr->e_phnum; i++) { /* If this isn't a loadable segment, we ignore it */ if (phdr[i].p_type != PT_LOAD) @@ -380,13 +410,15 @@ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr) return ehdr->e_entry; } -/*L:150 A bzImage, unlike an ELF file, is not meant to be loaded. You're - * supposed to jump into it and it will unpack itself. We used to have to - * perform some hairy magic because the unpacking code scared me. +/*L:150 + * A bzImage, unlike an ELF file, is not meant to be loaded. You're supposed + * to jump into it and it will unpack itself. We used to have to perform some + * hairy magic because the unpacking code scared me. * * Fortunately, Jeremy Fitzhardinge convinced me it wasn't that hard and wrote * a small patch to jump over the tricky bits in the Guest, so now we just read - * the funky header so we know where in the file to load, and away we go! */ + * the funky header so we know where in the file to load, and away we go! + */ static unsigned long load_bzimage(int fd) { struct boot_params boot; @@ -394,8 +426,10 @@ static unsigned long load_bzimage(int fd) /* Modern bzImages get loaded at 1M. */ void *p = from_guest_phys(0x100000); - /* Go back to the start of the file and read the header. It should be - * a Linux boot header (see Documentation/x86/i386/boot.txt) */ + /* + * Go back to the start of the file and read the header. It should be + * a Linux boot header (see Documentation/x86/i386/boot.txt) + */ lseek(fd, 0, SEEK_SET); read(fd, &boot, sizeof(boot)); @@ -414,9 +448,11 @@ static unsigned long load_bzimage(int fd) return boot.hdr.code32_start; } -/*L:140 Loading the kernel is easy when it's a "vmlinux", but most kernels +/*L:140 + * Loading the kernel is easy when it's a "vmlinux", but most kernels * come wrapped up in the self-decompressing "bzImage" format. With a little - * work, we can load those, too. */ + * work, we can load those, too. + */ static unsigned long load_kernel(int fd) { Elf32_Ehdr hdr; @@ -433,24 +469,28 @@ static unsigned long load_kernel(int fd) return load_bzimage(fd); } -/* This is a trivial little helper to align pages. Andi Kleen hated it because +/* + * This is a trivial little helper to align pages. Andi Kleen hated it because * it calls getpagesize() twice: "it's dumb code." * * Kernel guys get really het up about optimization, even when it's not - * necessary. I leave this code as a reaction against that. */ + * necessary. I leave this code as a reaction against that. + */ static inline unsigned long page_align(unsigned long addr) { /* Add upwards and truncate downwards. */ return ((addr + getpagesize()-1) & ~(getpagesize()-1)); } -/*L:180 An "initial ram disk" is a disk image loaded into memory along with - * the kernel which the kernel can use to boot from without needing any - * drivers. Most distributions now use this as standard: the initrd contains - * the code to load the appropriate driver modules for the current machine. +/*L:180 + * An "initial ram disk" is a disk image loaded into memory along with the + * kernel which the kernel can use to boot from without needing any drivers. + * Most distributions now use this as standard: the initrd contains the code to + * load the appropriate driver modules for the current machine. * * Importantly, James Morris works for RedHat, and Fedora uses initrds for its - * kernels. He sent me this (and tells me when I break it). */ + * kernels. He sent me this (and tells me when I break it). + */ static unsigned long load_initrd(const char *name, unsigned long mem) { int ifd; @@ -462,12 +502,16 @@ static unsigned long load_initrd(const char *name, unsigned long mem) if (fstat(ifd, &st) < 0) err(1, "fstat() on initrd '%s'", name); - /* We map the initrd at the top of memory, but mmap wants it to be - * page-aligned, so we round the size up for that. */ + /* + * We map the initrd at the top of memory, but mmap wants it to be + * page-aligned, so we round the size up for that. + */ len = page_align(st.st_size); map_at(ifd, from_guest_phys(mem - len), 0, st.st_size); - /* Once a file is mapped, you can close the file descriptor. It's a - * little odd, but quite useful. */ + /* + * Once a file is mapped, you can close the file descriptor. It's a + * little odd, but quite useful. + */ close(ifd); verbose("mapped initrd %s size=%lu @ %p\n", name, len, (void*)mem-len); @@ -476,8 +520,10 @@ static unsigned long load_initrd(const char *name, unsigned long mem) } /*:*/ -/* Simple routine to roll all the commandline arguments together with spaces - * between them. */ +/* + * Simple routine to roll all the commandline arguments together with spaces + * between them. + */ static void concat(char *dst, char *args[]) { unsigned int i, len = 0; @@ -494,10 +540,12 @@ static void concat(char *dst, char *args[]) dst[len] = '\0'; } -/*L:185 This is where we actually tell the kernel to initialize the Guest. We +/*L:185 + * This is where we actually tell the kernel to initialize the Guest. We * saw the arguments it expects when we looked at initialize() in lguest_user.c: * the base of Guest "physical" memory, the top physical page to allow and the - * entry point for the Guest. */ + * entry point for the Guest. + */ static void tell_kernel(unsigned long start) { unsigned long args[] = { LHREQ_INITIALIZE, @@ -511,7 +559,7 @@ static void tell_kernel(unsigned long start) } /*:*/ -/* +/*L:200 * Device Handling. * * When the Guest gives us a buffer, it sends an array of addresses and sizes. @@ -522,20 +570,26 @@ static void tell_kernel(unsigned long start) static void *_check_pointer(unsigned long addr, unsigned int size, unsigned int line) { - /* We have to separately check addr and addr+size, because size could - * be huge and addr + size might wrap around. */ + /* + * We have to separately check addr and addr+size, because size could + * be huge and addr + size might wrap around. + */ if (addr >= guest_limit || addr + size >= guest_limit) errx(1, "%s:%i: Invalid address %#lx", __FILE__, line, addr); - /* We return a pointer for the caller's convenience, now we know it's - * safe to use. */ + /* + * We return a pointer for the caller's convenience, now we know it's + * safe to use. + */ return from_guest_phys(addr); } /* A macro which transparently hands the line number to the real function. */ #define check_pointer(addr,size) _check_pointer(addr, size, __LINE__) -/* Each buffer in the virtqueues is actually a chain of descriptors. This +/* + * Each buffer in the virtqueues is actually a chain of descriptors. This * function returns the next descriptor in the chain, or vq->vring.num if we're - * at the end. */ + * at the end. + */ static unsigned next_desc(struct vring_desc *desc, unsigned int i, unsigned int max) { @@ -556,7 +610,10 @@ static unsigned next_desc(struct vring_desc *desc, return next; } -/* This actually sends the interrupt for this virtqueue */ +/* + * This actually sends the interrupt for this virtqueue, if we've used a + * buffer. + */ static void trigger_irq(struct virtqueue *vq) { unsigned long buf[] = { LHREQ_IRQ, vq->config.irq }; @@ -576,12 +633,14 @@ static void trigger_irq(struct virtqueue *vq) err(1, "Triggering irq %i", vq->config.irq); } -/* This looks in the virtqueue and for the first available buffer, and converts +/* + * This looks in the virtqueue for the first available buffer, and converts * it to an iovec for convenient access. Since descriptors consist of some * number of output then some number of input descriptors, it's actually two * iovecs, but we pack them into one and note how many of each there were. * - * This function returns the descriptor number found. */ + * This function waits if necessary, and returns the descriptor number found. + */ static unsigned wait_for_vq_desc(struct virtqueue *vq, struct iovec iov[], unsigned int *out_num, unsigned int *in_num) @@ -590,17 +649,23 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq, struct vring_desc *desc; u16 last_avail = lg_last_avail(vq); + /* There's nothing available? */ while (last_avail == vq->vring.avail->idx) { u64 event; - /* OK, tell Guest about progress up to now. */ + /* + * Since we're about to sleep, now is a good time to tell the + * Guest about what we've used up to now. + */ trigger_irq(vq); /* OK, now we need to know about added descriptors. */ vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY; - /* They could have slipped one in as we were doing that: make - * sure it's written, then check again. */ + /* + * They could have slipped one in as we were doing that: make + * sure it's written, then check again. + */ mb(); if (last_avail != vq->vring.avail->idx) { vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY; @@ -620,8 +685,10 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq, errx(1, "Guest moved used index from %u to %u", last_avail, vq->vring.avail->idx); - /* Grab the next descriptor number they're advertising, and increment - * the index we've seen. */ + /* + * Grab the next descriptor number they're advertising, and increment + * the index we've seen. + */ head = vq->vring.avail->ring[last_avail % vq->vring.num]; lg_last_avail(vq)++; @@ -636,8 +703,10 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq, desc = vq->vring.desc; i = head; - /* If this is an indirect entry, then this buffer contains a descriptor - * table which we handle as if it's any normal descriptor chain. */ + /* + * If this is an indirect entry, then this buffer contains a descriptor + * table which we handle as if it's any normal descriptor chain. + */ if (desc[i].flags & VRING_DESC_F_INDIRECT) { if (desc[i].len % sizeof(struct vring_desc)) errx(1, "Invalid size for indirect buffer table"); @@ -656,8 +725,10 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq, if (desc[i].flags & VRING_DESC_F_WRITE) (*in_num)++; else { - /* If it's an output descriptor, they're all supposed - * to come before any input descriptors. */ + /* + * If it's an output descriptor, they're all supposed + * to come before any input descriptors. + */ if (*in_num) errx(1, "Descriptor has out after in"); (*out_num)++; @@ -671,14 +742,19 @@ static unsigned wait_for_vq_desc(struct virtqueue *vq, return head; } -/* After we've used one of their buffers, we tell them about it. We'll then - * want to send them an interrupt, using trigger_irq(). */ +/* + * After we've used one of their buffers, we tell the Guest about it. Sometime + * later we'll want to send them an interrupt using trigger_irq(); note that + * wait_for_vq_desc() does that for us if it has to wait. + */ static void add_used(struct virtqueue *vq, unsigned int head, int len) { struct vring_used_elem *used; - /* The virtqueue contains a ring of used buffers. Get a pointer to the - * next entry in that used ring. */ + /* + * The virtqueue contains a ring of used buffers. Get a pointer to the + * next entry in that used ring. + */ used = &vq->vring.used->ring[vq->vring.used->idx % vq->vring.num]; used->id = head; used->len = len; @@ -698,9 +774,9 @@ static void add_used_and_trigger(struct virtqueue *vq, unsigned head, int len) /* * The Console * - * We associate some data with the console for our exit hack. */ -struct console_abort -{ + * We associate some data with the console for our exit hack. + */ +struct console_abort { /* How many times have they hit ^C? */ int count; /* When did they start? */ @@ -715,30 +791,35 @@ static void console_input(struct virtqueue *vq) struct console_abort *abort = vq->dev->priv; struct iovec iov[vq->vring.num]; - /* Make sure there's a descriptor waiting. */ + /* Make sure there's a descriptor available. */ head = wait_for_vq_desc(vq, iov, &out_num, &in_num); if (out_num) errx(1, "Output buffers in console in queue?"); - /* Read it in. */ + /* Read into it. This is where we usually wait. */ len = readv(STDIN_FILENO, iov, in_num); if (len <= 0) { /* Ran out of input? */ warnx("Failed to get console input, ignoring console."); - /* For simplicity, dying threads kill the whole Launcher. So - * just nap here. */ + /* + * For simplicity, dying threads kill the whole Launcher. So + * just nap here. + */ for (;;) pause(); } + /* Tell the Guest we used a buffer. */ add_used_and_trigger(vq, head, len); - /* Three ^C within one second? Exit. + /* + * Three ^C within one second? Exit. * * This is such a hack, but works surprisingly well. Each ^C has to * be in a buffer by itself, so they can't be too fast. But we check * that we get three within about a second, so they can't be too - * slow. */ + * slow. + */ if (len != 1 || ((char *)iov[0].iov_base)[0] != 3) { abort->count = 0; return; @@ -763,15 +844,23 @@ static void console_output(struct virtqueue *vq) unsigned int head, out, in; struct iovec iov[vq->vring.num]; + /* We usually wait in here, for the Guest to give us something. */ head = wait_for_vq_desc(vq, iov, &out, &in); if (in) errx(1, "Input buffers in console output queue?"); + + /* writev can return a partial write, so we loop here. */ while (!iov_empty(iov, out)) { int len = writev(STDOUT_FILENO, iov, out); if (len <= 0) err(1, "Write to stdout gave %i", len); iov_consume(iov, out, len); } + + /* + * We're finished with that buffer: if we're going to sleep, + * wait_for_vq_desc() will prod the Guest with an interrupt. + */ add_used(vq, head, 0); } @@ -791,15 +880,30 @@ static void net_output(struct virtqueue *vq) unsigned int head, out, in; struct iovec iov[vq->vring.num]; + /* We usually wait in here for the Guest to give us a packet. */ head = wait_for_vq_desc(vq, iov, &out, &in); if (in) errx(1, "Input buffers in net output queue?"); + /* + * Send the whole thing through to /dev/net/tun. It expects the exact + * same format: what a coincidence! + */ if (writev(net_info->tunfd, iov, out) < 0) errx(1, "Write to tun failed?"); + + /* + * Done with that one; wait_for_vq_desc() will send the interrupt if + * all packets are processed. + */ add_used(vq, head, 0); } -/* Will reading from this file descriptor block? */ +/* + * Handling network input is a bit trickier, because I've tried to optimize it. + * + * First we have a helper routine which tells is if from this file descriptor + * (ie. the /dev/net/tun device) will block: + */ static bool will_block(int fd) { fd_set fdset; @@ -809,8 +913,11 @@ static bool will_block(int fd) return select(fd+1, &fdset, NULL, NULL, &zero) != 1; } -/* This is where we handle packets coming in from the tun device to our - * Guest. */ +/* + * This handles packets coming in from the tun device to our Guest. Like all + * service routines, it gets called again as soon as it returns, so you don't + * see a while(1) loop here. + */ static void net_input(struct virtqueue *vq) { int len; @@ -818,21 +925,38 @@ static void net_input(struct virtqueue *vq) struct iovec iov[vq->vring.num]; struct net_info *net_info = vq->dev->priv; + /* + * Get a descriptor to write an incoming packet into. This will also + * send an interrupt if they're out of descriptors. + */ head = wait_for_vq_desc(vq, iov, &out, &in); if (out) errx(1, "Output buffers in net input queue?"); - /* Deliver interrupt now, since we're about to sleep. */ + /* + * If it looks like we'll block reading from the tun device, send them + * an interrupt. + */ if (vq->pending_used && will_block(net_info->tunfd)) trigger_irq(vq); + /* + * Read in the packet. This is where we normally wait (when there's no + * incoming network traffic). + */ len = readv(net_info->tunfd, iov, in); if (len <= 0) err(1, "Failed to read from tun."); + + /* + * Mark that packet buffer as used, but don't interrupt here. We want + * to wait until we've done as much work as we can. + */ add_used(vq, head, len); } +/*:*/ -/* This is the helper to create threads. */ +/* This is the helper to create threads: run the service routine in a loop. */ static int do_thread(void *_vq) { struct virtqueue *vq = _vq; @@ -842,8 +966,10 @@ static int do_thread(void *_vq) return 0; } -/* When a child dies, we kill our entire process group with SIGTERM. This - * also has the side effect that the shell restores the console for us! */ +/* + * When a child dies, we kill our entire process group with SIGTERM. This + * also has the side effect that the shell restores the console for us! + */ static void kill_launcher(int signal) { kill(0, SIGTERM); @@ -878,11 +1004,15 @@ static void reset_device(struct device *dev) signal(SIGCHLD, (void *)kill_launcher); } +/*L:216 + * This actually creates the thread which services the virtqueue for a device. + */ static void create_thread(struct virtqueue *vq) { - /* Create stack for thread and run it. Since stack grows - * upwards, we point the stack pointer to the end of this - * region. */ + /* + * Create stack for thread. Since the stack grows upwards, we point + * the stack pointer to the end of this region. + */ char *stack = malloc(32768); unsigned long args[] = { LHREQ_EVENTFD, vq->config.pfn*getpagesize(), 0 }; @@ -893,17 +1023,22 @@ static void create_thread(struct virtqueue *vq) err(1, "Creating eventfd"); args[2] = vq->eventfd; - /* Attach an eventfd to this virtqueue: it will go off - * when the Guest does an LHCALL_NOTIFY for this vq. */ + /* + * Attach an eventfd to this virtqueue: it will go off when the Guest + * does an LHCALL_NOTIFY for this vq. + */ if (write(lguest_fd, &args, sizeof(args)) != 0) err(1, "Attaching eventfd"); - /* CLONE_VM: because it has to access the Guest memory, and - * SIGCHLD so we get a signal if it dies. */ + /* + * CLONE_VM: because it has to access the Guest memory, and SIGCHLD so + * we get a signal if it dies. + */ vq->thread = clone(do_thread, stack + 32768, CLONE_VM | SIGCHLD, vq); if (vq->thread == (pid_t)-1) err(1, "Creating clone"); - /* We close our local copy, now the child has it. */ + + /* We close our local copy now the child has it. */ close(vq->eventfd); } @@ -955,7 +1090,10 @@ static void update_device_status(struct device *dev) } } -/* This is the generic routine we call when the Guest uses LHCALL_NOTIFY. */ +/*L:215 + * This is the generic routine we call when the Guest uses LHCALL_NOTIFY. In + * particular, it's used to notify us of device status changes during boot. + */ static void handle_output(unsigned long addr) { struct device *i; @@ -964,25 +1102,42 @@ static void handle_output(unsigned long addr) for (i = devices.dev; i; i = i->next) { struct virtqueue *vq; - /* Notifications to device descriptors update device status. */ + /* + * Notifications to device descriptors mean they updated the + * device status. + */ if (from_guest_phys(addr) == i->desc) { update_device_status(i); return; } - /* Devices *can* be used before status is set to DRIVER_OK. */ + /* + * Devices *can* be used before status is set to DRIVER_OK. + * The original plan was that they would never do this: they + * would always finish setting up their status bits before + * actually touching the virtqueues. In practice, we allowed + * them to, and they do (eg. the disk probes for partition + * tables as part of initialization). + * + * If we see this, we start the device: once it's running, we + * expect the device to catch all the notifications. + */ for (vq = i->vq; vq; vq = vq->next) { if (addr != vq->config.pfn*getpagesize()) continue; if (i->running) errx(1, "Notification on running %s", i->name); + /* This just calls create_thread() for each virtqueue */ start_device(i); return; } } - /* Early console write is done using notify on a nul-terminated string - * in Guest memory. */ + /* + * Early console write is done using notify on a nul-terminated string + * in Guest memory. It's also great for hacking debugging messages + * into a Guest. + */ if (addr >= guest_limit) errx(1, "Bad NOTIFY %#lx", addr); @@ -998,10 +1153,12 @@ static void handle_output(unsigned long addr) * routines to allocate and manage them. */ -/* The layout of the device page is a "struct lguest_device_desc" followed by a +/* + * The layout of the device page is a "struct lguest_device_desc" followed by a * number of virtqueue descriptors, then two sets of feature bits, then an * array of configuration bytes. This routine returns the configuration - * pointer. */ + * pointer. + */ static u8 *device_config(const struct device *dev) { return (void *)(dev->desc + 1) @@ -1009,9 +1166,11 @@ static u8 *device_config(const struct device *dev) + dev->feature_len * 2; } -/* This routine allocates a new "struct lguest_device_desc" from descriptor +/* + * This routine allocates a new "struct lguest_device_desc" from descriptor * table page just above the Guest's normal memory. It returns a pointer to - * that descriptor. */ + * that descriptor. + */ static struct lguest_device_desc *new_dev_desc(u16 type) { struct lguest_device_desc d = { .type = type }; @@ -1032,8 +1191,10 @@ static struct lguest_device_desc *new_dev_desc(u16 type) return memcpy(p, &d, sizeof(d)); } -/* Each device descriptor is followed by the description of its virtqueues. We - * specify how many descriptors the virtqueue is to have. */ +/* + * Each device descriptor is followed by the description of its virtqueues. We + * specify how many descriptors the virtqueue is to have. + */ static void add_virtqueue(struct device *dev, unsigned int num_descs, void (*service)(struct virtqueue *)) { @@ -1050,6 +1211,11 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs, vq->next = NULL; vq->last_avail_idx = 0; vq->dev = dev; + + /* + * This is the routine the service thread will run, and its Process ID + * once it's running. + */ vq->service = service; vq->thread = (pid_t)-1; @@ -1061,10 +1227,12 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs, /* Initialize the vring. */ vring_init(&vq->vring, num_descs, p, LGUEST_VRING_ALIGN); - /* Append virtqueue to this device's descriptor. We use + /* + * Append virtqueue to this device's descriptor. We use * device_config() to get the end of the device's current virtqueues; * we check that we haven't added any config or feature information - * yet, otherwise we'd be overwriting them. */ + * yet, otherwise we'd be overwriting them. + */ assert(dev->desc->config_len == 0 && dev->desc->feature_len == 0); memcpy(device_config(dev), &vq->config, sizeof(vq->config)); dev->num_vq++; @@ -1072,14 +1240,18 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs, verbose("Virtqueue page %#lx\n", to_guest_phys(p)); - /* Add to tail of list, so dev->vq is first vq, dev->vq->next is - * second. */ + /* + * Add to tail of list, so dev->vq is first vq, dev->vq->next is + * second. + */ for (i = &dev->vq; *i; i = &(*i)->next); *i = vq; } -/* The first half of the feature bitmask is for us to advertise features. The - * second half is for the Guest to accept features. */ +/* + * The first half of the feature bitmask is for us to advertise features. The + * second half is for the Guest to accept features. + */ static void add_feature(struct device *dev, unsigned bit) { u8 *features = get_feature_bits(dev); @@ -1093,9 +1265,11 @@ static void add_feature(struct device *dev, unsigned bit) features[bit / CHAR_BIT] |= (1 << (bit % CHAR_BIT)); } -/* This routine sets the configuration fields for an existing device's +/* + * This routine sets the configuration fields for an existing device's * descriptor. It only works for the last device, but that's OK because that's - * how we use it. */ + * how we use it. + */ static void set_config(struct device *dev, unsigned len, const void *conf) { /* Check we haven't overflowed our single page. */ @@ -1105,12 +1279,18 @@ static void set_config(struct device *dev, unsigned len, const void *conf) /* Copy in the config information, and store the length. */ memcpy(device_config(dev), conf, len); dev->desc->config_len = len; + + /* Size must fit in config_len field (8 bits)! */ + assert(dev->desc->config_len == len); } -/* This routine does all the creation and setup of a new device, including - * calling new_dev_desc() to allocate the descriptor and device memory. +/* + * This routine does all the creation and setup of a new device, including + * calling new_dev_desc() to allocate the descriptor and device memory. We + * don't actually start the service threads until later. * - * See what I mean about userspace being boring? */ + * See what I mean about userspace being boring? + */ static struct device *new_device(const char *name, u16 type) { struct device *dev = malloc(sizeof(*dev)); @@ -1123,10 +1303,12 @@ static struct device *new_device(const char *name, u16 type) dev->num_vq = 0; dev->running = false; - /* Append to device list. Prepending to a single-linked list is + /* + * Append to device list. Prepending to a single-linked list is * easier, but the user expects the devices to be arranged on the bus * in command-line order. The first network device on the command line - * is eth0, the first block device /dev/vda, etc. */ + * is eth0, the first block device /dev/vda, etc. + */ if (devices.lastdev) devices.lastdev->next = dev; else @@ -1136,8 +1318,10 @@ static struct device *new_device(const char *name, u16 type) return dev; } -/* Our first setup routine is the console. It's a fairly simple device, but - * UNIX tty handling makes it uglier than it could be. */ +/* + * Our first setup routine is the console. It's a fairly simple device, but + * UNIX tty handling makes it uglier than it could be. + */ static void setup_console(void) { struct device *dev; @@ -1145,8 +1329,10 @@ static void setup_console(void) /* If we can save the initial standard input settings... */ if (tcgetattr(STDIN_FILENO, &orig_term) == 0) { struct termios term = orig_term; - /* Then we turn off echo, line buffering and ^C etc. We want a - * raw input stream to the Guest. */ + /* + * Then we turn off echo, line buffering and ^C etc: We want a + * raw input stream to the Guest. + */ term.c_lflag &= ~(ISIG|ICANON|ECHO); tcsetattr(STDIN_FILENO, TCSANOW, &term); } @@ -1157,10 +1343,12 @@ static void setup_console(void) dev->priv = malloc(sizeof(struct console_abort)); ((struct console_abort *)dev->priv)->count = 0; - /* The console needs two virtqueues: the input then the output. When + /* + * The console needs two virtqueues: the input then the output. When * they put something the input queue, we make sure we're listening to * stdin. When they put something in the output queue, we write it to - * stdout. */ + * stdout. + */ add_virtqueue(dev, VIRTQUEUE_NUM, console_input); add_virtqueue(dev, VIRTQUEUE_NUM, console_output); @@ -1168,7 +1356,8 @@ static void setup_console(void) } /*:*/ -/*M:010 Inter-guest networking is an interesting area. Simplest is to have a +/*M:010 + * Inter-guest networking is an interesting area. Simplest is to have a * --sharenet=<name> option which opens or creates a named pipe. This can be * used to send packets to another guest in a 1:1 manner. * @@ -1182,7 +1371,8 @@ static void setup_console(void) * multiple inter-guest channels behind one interface, although it would * require some manner of hotplugging new virtio channels. * - * Finally, we could implement a virtio network switch in the kernel. :*/ + * Finally, we could implement a virtio network switch in the kernel. +:*/ static u32 str2ip(const char *ipaddr) { @@ -1207,11 +1397,13 @@ static void str2mac(const char *macaddr, unsigned char mac[6]) mac[5] = m[5]; } -/* This code is "adapted" from libbridge: it attaches the Host end of the +/* + * This code is "adapted" from libbridge: it attaches the Host end of the * network device to the bridge device specified by the command line. * * This is yet another James Morris contribution (I'm an IP-level guy, so I - * dislike bridging), and I just try not to break it. */ + * dislike bridging), and I just try not to break it. + */ static void add_to_bridge(int fd, const char *if_name, const char *br_name) { int ifidx; @@ -1231,9 +1423,11 @@ static void add_to_bridge(int fd, const char *if_name, const char *br_name) err(1, "can't add %s to bridge %s", if_name, br_name); } -/* This sets up the Host end of the network device with an IP address, brings +/* + * This sets up the Host end of the network device with an IP address, brings * it up so packets will flow, the copies the MAC address into the hwaddr - * pointer. */ + * pointer. + */ static void configure_device(int fd, const char *tapif, u32 ipaddr) { struct ifreq ifr; @@ -1260,10 +1454,12 @@ static int get_tun_device(char tapif[IFNAMSIZ]) /* Start with this zeroed. Messy but sure. */ memset(&ifr, 0, sizeof(ifr)); - /* We open the /dev/net/tun device and tell it we want a tap device. A + /* + * We open the /dev/net/tun device and tell it we want a tap device. A * tap device is like a tun device, only somehow different. To tell * the truth, I completely blundered my way through this code, but it - * works now! */ + * works now! + */ netfd = open_or_die("/dev/net/tun", O_RDWR); ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR; strcpy(ifr.ifr_name, "tap%d"); @@ -1274,18 +1470,22 @@ static int get_tun_device(char tapif[IFNAMSIZ]) TUN_F_CSUM|TUN_F_TSO4|TUN_F_TSO6|TUN_F_TSO_ECN) != 0) err(1, "Could not set features for tun device"); - /* We don't need checksums calculated for packets coming in this - * device: trust us! */ + /* + * We don't need checksums calculated for packets coming in this + * device: trust us! + */ ioctl(netfd, TUNSETNOCSUM, 1); memcpy(tapif, ifr.ifr_name, IFNAMSIZ); return netfd; } -/*L:195 Our network is a Host<->Guest network. This can either use bridging or +/*L:195 + * Our network is a Host<->Guest network. This can either use bridging or * routing, but the principle is the same: it uses the "tun" device to inject * packets into the Host as if they came in from a normal network card. We - * just shunt packets between the Guest and the tun device. */ + * just shunt packets between the Guest and the tun device. + */ static void setup_tun_net(char *arg) { struct device *dev; @@ -1302,13 +1502,14 @@ static void setup_tun_net(char *arg) dev = new_device("net", VIRTIO_ID_NET); dev->priv = net_info; - /* Network devices need a receive and a send queue, just like - * console. */ + /* Network devices need a recv and a send queue, just like console. */ add_virtqueue(dev, VIRTQUEUE_NUM, net_input); add_virtqueue(dev, VIRTQUEUE_NUM, net_output); - /* We need a socket to perform the magic network ioctls to bring up the - * tap interface, connect to the bridge etc. Any socket will do! */ + /* + * We need a socket to perform the magic network ioctls to bring up the + * tap interface, connect to the bridge etc. Any socket will do! + */ ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP); if (ipfd < 0) err(1, "opening IP socket"); @@ -1362,39 +1563,31 @@ static void setup_tun_net(char *arg) verbose("device %u: tun %s: %s\n", devices.device_num, tapif, arg); } - -/* Our block (disk) device should be really simple: the Guest asks for a block - * number and we read or write that position in the file. Unfortunately, that - * was amazingly slow: the Guest waits until the read is finished before - * running anything else, even if it could have been doing useful work. - * - * We could use async I/O, except it's reputed to suck so hard that characters - * actually go missing from your code when you try to use it. - * - * So we farm the I/O out to thread, and communicate with it via a pipe. */ +/*:*/ /* This hangs off device->priv. */ -struct vblk_info -{ +struct vblk_info { /* The size of the file. */ off64_t len; /* The file descriptor for the file. */ int fd; - /* IO thread listens on this file descriptor [0]. */ - int workpipe[2]; - - /* IO thread writes to this file descriptor to mark it done, then - * Launcher triggers interrupt to Guest. */ - int done_fd; }; /*L:210 * The Disk * - * Remember that the block device is handled by a separate I/O thread. We head - * straight into the core of that thread here: + * The disk only has one virtqueue, so it only has one thread. It is really + * simple: the Guest asks for a block number and we read or write that position + * in the file. + * + * Before we serviced each virtqueue in a separate thread, that was unacceptably + * slow: the Guest waits until the read is finished before running anything + * else, even if it could have been doing useful work. + * + * We could have used async I/O, except it's reputed to suck so hard that + * characters actually go missing from your code when you try to use it. */ static void blk_request(struct virtqueue *vq) { @@ -1406,47 +1599,64 @@ static void blk_request(struct virtqueue *vq) struct iovec iov[vq->vring.num]; off64_t off; - /* Get the next request. */ + /* + * Get the next request, where we normally wait. It triggers the + * interrupt to acknowledge previously serviced requests (if any). + */ head = wait_for_vq_desc(vq, iov, &out_num, &in_num); - /* Every block request should contain at least one output buffer + /* + * Every block request should contain at least one output buffer * (detailing the location on disk and the type of request) and one - * input buffer (to hold the result). */ + * input buffer (to hold the result). + */ if (out_num == 0 || in_num == 0) errx(1, "Bad virtblk cmd %u out=%u in=%u", head, out_num, in_num); out = convert(&iov[0], struct virtio_blk_outhdr); in = convert(&iov[out_num+in_num-1], u8); + /* + * For historical reasons, block operations are expressed in 512 byte + * "sectors". + */ off = out->sector * 512; - /* The block device implements "barriers", where the Guest indicates + /* + * The block device implements "barriers", where the Guest indicates * that it wants all previous writes to occur before this write. We * don't have a way of asking our kernel to do a barrier, so we just - * synchronize all the data in the file. Pretty poor, no? */ + * synchronize all the data in the file. Pretty poor, no? + */ if (out->type & VIRTIO_BLK_T_BARRIER) fdatasync(vblk->fd); - /* In general the virtio block driver is allowed to try SCSI commands. - * It'd be nice if we supported eject, for example, but we don't. */ + /* + * In general the virtio block driver is allowed to try SCSI commands. + * It'd be nice if we supported eject, for example, but we don't. + */ if (out->type & VIRTIO_BLK_T_SCSI_CMD) { fprintf(stderr, "Scsi commands unsupported\n"); *in = VIRTIO_BLK_S_UNSUPP; wlen = sizeof(*in); } else if (out->type & VIRTIO_BLK_T_OUT) { - /* Write */ - - /* Move to the right location in the block file. This can fail - * if they try to write past end. */ + /* + * Write + * + * Move to the right location in the block file. This can fail + * if they try to write past end. + */ if (lseek64(vblk->fd, off, SEEK_SET) != off) err(1, "Bad seek to sector %llu", out->sector); ret = writev(vblk->fd, iov+1, out_num-1); verbose("WRITE to sector %llu: %i\n", out->sector, ret); - /* Grr... Now we know how long the descriptor they sent was, we + /* + * Grr... Now we know how long the descriptor they sent was, we * make sure they didn't try to write over the end of the block - * file (possibly extending it). */ + * file (possibly extending it). + */ if (ret > 0 && off + ret > vblk->len) { /* Trim it back to the correct length */ ftruncate64(vblk->fd, vblk->len); @@ -1456,10 +1666,12 @@ static void blk_request(struct virtqueue *vq) wlen = sizeof(*in); *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR); } else { - /* Read */ - - /* Move to the right location in the block file. This can fail - * if they try to read past end. */ + /* + * Read + * + * Move to the right location in the block file. This can fail + * if they try to read past end. + */ if (lseek64(vblk->fd, off, SEEK_SET) != off) err(1, "Bad seek to sector %llu", out->sector); @@ -1474,13 +1686,16 @@ static void blk_request(struct virtqueue *vq) } } - /* OK, so we noted that it was pretty poor to use an fdatasync as a + /* + * OK, so we noted that it was pretty poor to use an fdatasync as a * barrier. But Christoph Hellwig points out that we need a sync * *afterwards* as well: "Barriers specify no reordering to the front - * or the back." And Jens Axboe confirmed it, so here we are: */ + * or the back." And Jens Axboe confirmed it, so here we are: + */ if (out->type & VIRTIO_BLK_T_BARRIER) fdatasync(vblk->fd); + /* Finished that request. */ add_used(vq, head, wlen); } @@ -1491,7 +1706,7 @@ static void setup_block_file(const char *filename) struct vblk_info *vblk; struct virtio_blk_config conf; - /* The device responds to return from I/O thread. */ + /* Creat the device. */ dev = new_device("block", VIRTIO_ID_BLOCK); /* The device has one virtqueue, where the Guest places requests. */ @@ -1510,27 +1725,32 @@ static void setup_block_file(const char *filename) /* Tell Guest how many sectors this device has. */ conf.capacity = cpu_to_le64(vblk->len / 512); - /* Tell Guest not to put in too many descriptors at once: two are used - * for the in and out elements. */ + /* + * Tell Guest not to put in too many descriptors at once: two are used + * for the in and out elements. + */ add_feature(dev, VIRTIO_BLK_F_SEG_MAX); conf.seg_max = cpu_to_le32(VIRTQUEUE_NUM - 2); - set_config(dev, sizeof(conf), &conf); + /* Don't try to put whole struct: we have 8 bit limit. */ + set_config(dev, offsetof(struct virtio_blk_config, geometry), &conf); verbose("device %u: virtblock %llu sectors\n", ++devices.device_num, le64_to_cpu(conf.capacity)); } -struct rng_info { - int rfd; -}; - -/* Our random number generator device reads from /dev/random into the Guest's +/*L:211 + * Our random number generator device reads from /dev/random into the Guest's * input buffers. The usual case is that the Guest doesn't want random numbers * and so has no buffers although /dev/random is still readable, whereas * console is the reverse. * - * The same logic applies, however. */ + * The same logic applies, however. + */ +struct rng_info { + int rfd; +}; + static void rng_input(struct virtqueue *vq) { int len; @@ -1543,9 +1763,10 @@ static void rng_input(struct virtqueue *vq) if (out_num) errx(1, "Output buffers in rng?"); - /* This is why we convert to iovecs: the readv() call uses them, and so - * it reads straight into the Guest's buffer. We loop to make sure we - * fill it. */ + /* + * Just like the console write, we loop to cover the whole iovec. + * In this case, short reads actually happen quite a bit. + */ while (!iov_empty(iov, in_num)) { len = readv(rng_info->rfd, iov, in_num); if (len <= 0) @@ -1558,15 +1779,18 @@ static void rng_input(struct virtqueue *vq) add_used(vq, head, totlen); } -/* And this creates a "hardware" random number device for the Guest. */ +/*L:199 + * This creates a "hardware" random number device for the Guest. + */ static void setup_rng(void) { struct device *dev; struct rng_info *rng_info = malloc(sizeof(*rng_info)); + /* Our device's privat info simply contains the /dev/random fd. */ rng_info->rfd = open_or_die("/dev/random", O_RDONLY); - /* The device responds to return from I/O thread. */ + /* Create the new device. */ dev = new_device("rng", VIRTIO_ID_RNG); dev->priv = rng_info; @@ -1582,8 +1806,10 @@ static void __attribute__((noreturn)) restart_guest(void) { unsigned int i; - /* Since we don't track all open fds, we simply close everything beyond - * stderr. */ + /* + * Since we don't track all open fds, we simply close everything beyond + * stderr. + */ for (i = 3; i < FD_SETSIZE; i++) close(i); @@ -1594,8 +1820,10 @@ static void __attribute__((noreturn)) restart_guest(void) err(1, "Could not exec %s", main_args[0]); } -/*L:220 Finally we reach the core of the Launcher which runs the Guest, serves - * its input and output, and finally, lays it to rest. */ +/*L:220 + * Finally we reach the core of the Launcher which runs the Guest, serves + * its input and output, and finally, lays it to rest. + */ static void __attribute__((noreturn)) run_guest(void) { for (;;) { @@ -1630,7 +1858,7 @@ static void __attribute__((noreturn)) run_guest(void) * * Are you ready? Take a deep breath and join me in the core of the Host, in * "make Host". - :*/ +:*/ static struct option opts[] = { { "verbose", 0, NULL, 'v' }, @@ -1651,8 +1879,7 @@ static void usage(void) /*L:105 The main routine is where the real work begins: */ int main(int argc, char *argv[]) { - /* Memory, top-level pagetable, code startpoint and size of the - * (optional) initrd. */ + /* Memory, code startpoint and size of the (optional) initrd. */ unsigned long mem = 0, start, initrd_size = 0; /* Two temporaries. */ int i, c; @@ -1664,24 +1891,32 @@ int main(int argc, char *argv[]) /* Save the args: we "reboot" by execing ourselves again. */ main_args = argv; - /* First we initialize the device list. We keep a pointer to the last + /* + * First we initialize the device list. We keep a pointer to the last * device, and the next interrupt number to use for devices (1: - * remember that 0 is used by the timer). */ + * remember that 0 is used by the timer). + */ devices.lastdev = NULL; devices.next_irq = 1; + /* We're CPU 0. In fact, that's the only CPU possible right now. */ cpu_id = 0; - /* We need to know how much memory so we can set up the device + + /* + * We need to know how much memory so we can set up the device * descriptor and memory pages for the devices as we parse the command * line. So we quickly look through the arguments to find the amount - * of memory now. */ + * of memory now. + */ for (i = 1; i < argc; i++) { if (argv[i][0] != '-') { mem = atoi(argv[i]) * 1024 * 1024; - /* We start by mapping anonymous pages over all of + /* + * We start by mapping anonymous pages over all of * guest-physical memory range. This fills it with 0, * and ensures that the Guest won't be killed when it - * tries to access it. */ + * tries to access it. + */ guest_base = map_zeroed_pages(mem / getpagesize() + DEVICE_PAGES); guest_limit = mem; @@ -1714,8 +1949,10 @@ int main(int argc, char *argv[]) usage(); } } - /* After the other arguments we expect memory and kernel image name, - * followed by command line arguments for the kernel. */ + /* + * After the other arguments we expect memory and kernel image name, + * followed by command line arguments for the kernel. + */ if (optind + 2 > argc) usage(); @@ -1733,20 +1970,26 @@ int main(int argc, char *argv[]) /* Map the initrd image if requested (at top of physical memory) */ if (initrd_name) { initrd_size = load_initrd(initrd_name, mem); - /* These are the location in the Linux boot header where the - * start and size of the initrd are expected to be found. */ + /* + * These are the location in the Linux boot header where the + * start and size of the initrd are expected to be found. + */ boot->hdr.ramdisk_image = mem - initrd_size; boot->hdr.ramdisk_size = initrd_size; /* The bootloader type 0xFF means "unknown"; that's OK. */ boot->hdr.type_of_loader = 0xFF; } - /* The Linux boot header contains an "E820" memory map: ours is a - * simple, single region. */ + /* + * The Linux boot header contains an "E820" memory map: ours is a + * simple, single region. + */ boot->e820_entries = 1; boot->e820_map[0] = ((struct e820entry) { 0, mem, E820_RAM }); - /* The boot header contains a command line pointer: we put the command - * line after the boot header. */ + /* + * The boot header contains a command line pointer: we put the command + * line after the boot header. + */ boot->hdr.cmd_line_ptr = to_guest_phys(boot + 1); /* We use a simple helper to copy the arguments separated by spaces. */ concat((char *)(boot + 1), argv+optind+2); @@ -1760,11 +2003,13 @@ int main(int argc, char *argv[]) /* Tell the entry path not to try to reload segment registers. */ boot->hdr.loadflags |= KEEP_SEGMENTS; - /* We tell the kernel to initialize the Guest: this returns the open - * /dev/lguest file descriptor. */ + /* + * We tell the kernel to initialize the Guest: this returns the open + * /dev/lguest file descriptor. + */ tell_kernel(start); - /* Ensure that we terminate if a child dies. */ + /* Ensure that we terminate if a device-servicing child dies. */ signal(SIGCHLD, kill_launcher); /* If we exit via err(), this kills all the threads, restores tty. */ diff --git a/Documentation/lockdep-design.txt b/Documentation/lockdep-design.txt index e20d913d591..abf768c681e 100644 --- a/Documentation/lockdep-design.txt +++ b/Documentation/lockdep-design.txt @@ -30,9 +30,9 @@ State The validator tracks lock-class usage history into 4n + 1 separate state bits: - 'ever held in STATE context' -- 'ever head as readlock in STATE context' -- 'ever head with STATE enabled' -- 'ever head as readlock with STATE enabled' +- 'ever held as readlock in STATE context' +- 'ever held with STATE enabled' +- 'ever held as readlock with STATE enabled' Where STATE can be either one of (kernel/lockdep_states.h) - hardirq diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index 1634c6dceca..50189bf07d5 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -60,6 +60,8 @@ framerelay.txt - info on using Frame Relay/Data Link Connection Identifier (DLCI). generic_netlink.txt - info on Generic Netlink +ieee802154.txt + - Linux IEEE 802.15.4 implementation, API and drivers ip-sysctl.txt - /proc/sys/net/ipv4/* variables ip_dynaddr.txt diff --git a/Documentation/networking/6pack.txt b/Documentation/networking/6pack.txt index d0777a1200e..8f339428fdf 100644 --- a/Documentation/networking/6pack.txt +++ b/Documentation/networking/6pack.txt @@ -1,7 +1,7 @@ This is the 6pack-mini-HOWTO, written by Andreas Könsgen DG3KQ -Internet: ajk@iehk.rwth-aachen.de +Internet: ajk@comnets.uni-bremen.de AMPR-net: dg3kq@db0pra.ampr.org AX.25: dg3kq@db0ach.#nrw.deu.eu diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt index a0280ad2edc..23c995e6403 100644 --- a/Documentation/networking/ieee802154.txt +++ b/Documentation/networking/ieee802154.txt @@ -22,7 +22,7 @@ int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0); ..... The address family, socket addresses etc. are defined in the -include/net/ieee802154/af_ieee802154.h header or in the special header +include/net/af_ieee802154.h header or in the special header in our userspace package (see either linux-zigbee sourceforge download page or git tree at git://linux-zigbee.git.sourceforge.net/gitroot/linux-zigbee). @@ -33,7 +33,7 @@ MLME - MAC Level Management ============================ Most of IEEE 802.15.4 MLME interfaces are directly mapped on netlink commands. -See the include/net/ieee802154/nl802154.h header. Our userspace tools package +See the include/net/nl802154.h header. Our userspace tools package (see above) provides CLI configuration utility for radio interfaces and simple coordinator for IEEE 802.15.4 networks as an example users of MLME protocol. @@ -54,10 +54,14 @@ Those types of devices require different approach to be hooked into Linux kernel HardMAC ======= -See the header include/net/ieee802154/netdevice.h. You have to implement Linux +See the header include/net/ieee802154_netdev.h. You have to implement Linux net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family -code via plain sk_buffs. The control block of sk_buffs will contain additional -info as described in the struct ieee802154_mac_cb. +code via plain sk_buffs. On skb reception skb->cb must contain additional +info as described in the struct ieee802154_mac_cb. During packet transmission +the skb->cb is used to provide additional data to device's header_ops->create +function. Be aware, that this data can be overriden later (when socket code +submits skb to qdisc), so if you need something from that cb later, you should +store info in the skb->data on your own. To hook the MLME interface you have to populate the ml_priv field of your net_device with a pointer to struct ieee802154_mlme_ops instance. All fields are @@ -69,8 +73,8 @@ We provide an example of simple HardMAC driver at drivers/ieee802154/fakehard.c SoftMAC ======= -We are going to provide intermediate layer impelementing IEEE 802.15.4 MAC +We are going to provide intermediate layer implementing IEEE 802.15.4 MAC in software. This is currently WIP. -See header include/net/ieee802154/mac802154.h and several drivers in -drivers/ieee802154/ +See header include/net/mac802154.h and several drivers in drivers/ieee802154/. + diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 8be76235fe6..fbe427a6580 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -311,9 +311,12 @@ tcp_no_metrics_save - BOOLEAN connections. tcp_orphan_retries - INTEGER - How may times to retry before killing TCP connection, closed - by our side. Default value 7 corresponds to ~50sec-16min - depending on RTO. If you machine is loaded WEB server, + This value influences the timeout of a locally closed TCP connection, + when RTO retransmissions remain unacknowledged. + See tcp_retries2 for more details. + + The default value is 7. + If your machine is a loaded WEB server, you should think about lowering this value, such sockets may consume significant resources. Cf. tcp_max_orphans. @@ -327,16 +330,28 @@ tcp_retrans_collapse - BOOLEAN certain TCP stacks. tcp_retries1 - INTEGER - How many times to retry before deciding that something is wrong - and it is necessary to report this suspicion to network layer. - Minimal RFC value is 3, it is default, which corresponds - to ~3sec-8min depending on RTO. + This value influences the time, after which TCP decides, that + something is wrong due to unacknowledged RTO retransmissions, + and reports this suspicion to the network layer. + See tcp_retries2 for more details. + + RFC 1122 recommends at least 3 retransmissions, which is the + default. tcp_retries2 - INTEGER - How may times to retry before killing alive TCP connection. - RFC1122 says that the limit should be longer than 100 sec. - It is too small number. Default value 15 corresponds to ~13-30min - depending on RTO. + This value influences the timeout of an alive TCP connection, + when RTO retransmissions remain unacknowledged. + Given a value of N, a hypothetical TCP connection following + exponential backoff with an initial RTO of TCP_RTO_MIN would + retransmit N times before killing the connection at the (N+1)th RTO. + + The default value of 15 yields a hypothetical timeout of 924.6 + seconds and is a lower bound for the effective timeout. + TCP will effectively time out at the first RTO which exceeds the + hypothetical timeout. + + RFC 1122 recommends at least 100 seconds for the timeout, + which corresponds to a value of at least 8. tcp_rfc1337 - BOOLEAN If set, the TCP stack behaves conforming to RFC1337. If unset, @@ -1282,6 +1297,16 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max sctp_wmem - vector of 3 INTEGERs: min, default, max See tcp_wmem for a description. +addr_scope_policy - INTEGER + Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00 + + 0 - Disable IPv4 address scoping + 1 - Enable IPv4 address scoping + 2 - Follow draft but allow IPv4 private addresses + 3 - Follow draft but allow IPv4 link local addresses + + Default: 1 + /proc/sys/net/core/* dev_weight - INTEGER diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt new file mode 100644 index 00000000000..f49a33b704d --- /dev/null +++ b/Documentation/power/runtime_pm.txt @@ -0,0 +1,378 @@ +Run-time Power Management Framework for I/O Devices + +(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc. + +1. Introduction + +Support for run-time power management (run-time PM) of I/O devices is provided +at the power management core (PM core) level by means of: + +* The power management workqueue pm_wq in which bus types and device drivers can + put their PM-related work items. It is strongly recommended that pm_wq be + used for queuing all work items related to run-time PM, because this allows + them to be synchronized with system-wide power transitions (suspend to RAM, + hibernation and resume from system sleep states). pm_wq is declared in + include/linux/pm_runtime.h and defined in kernel/power/main.c. + +* A number of run-time PM fields in the 'power' member of 'struct device' (which + is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can + be used for synchronizing run-time PM operations with one another. + +* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in + include/linux/pm.h). + +* A set of helper functions defined in drivers/base/power/runtime.c that can be + used for carrying out run-time PM operations in such a way that the + synchronization between them is taken care of by the PM core. Bus types and + device drivers are encouraged to use these functions. + +The run-time PM callbacks present in 'struct dev_pm_ops', the device run-time PM +fields of 'struct dev_pm_info' and the core helper functions provided for +run-time PM are described below. + +2. Device Run-time PM Callbacks + +There are three device run-time PM callbacks defined in 'struct dev_pm_ops': + +struct dev_pm_ops { + ... + int (*runtime_suspend)(struct device *dev); + int (*runtime_resume)(struct device *dev); + void (*runtime_idle)(struct device *dev); + ... +}; + +The ->runtime_suspend() callback is executed by the PM core for the bus type of +the device being suspended. The bus type's callback is then _entirely_ +_responsible_ for handling the device as appropriate, which may, but need not +include executing the device driver's own ->runtime_suspend() callback (from the +PM core's point of view it is not necessary to implement a ->runtime_suspend() +callback in a device driver as long as the bus type's ->runtime_suspend() knows +what to do to handle the device). + + * Once the bus type's ->runtime_suspend() callback has completed successfully + for given device, the PM core regards the device as suspended, which need + not mean that the device has been put into a low power state. It is + supposed to mean, however, that the device will not process data and will + not communicate with the CPU(s) and RAM until its bus type's + ->runtime_resume() callback is executed for it. The run-time PM status of + a device after successful execution of its bus type's ->runtime_suspend() + callback is 'suspended'. + + * If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, + the device's run-time PM status is supposed to be 'active', which means that + the device _must_ be fully operational afterwards. + + * If the bus type's ->runtime_suspend() callback returns an error code + different from -EBUSY or -EAGAIN, the PM core regards this as a fatal + error and will refuse to run the helper functions described in Section 4 + for the device, until the status of it is directly set either to 'active' + or to 'suspended' (the PM core provides special helper functions for this + purpose). + +In particular, if the driver requires remote wakeup capability for proper +functioning and device_may_wakeup() returns 'false' for the device, then +->runtime_suspend() should return -EBUSY. On the other hand, if +device_may_wakeup() returns 'true' for the device and the device is put +into a low power state during the execution of its bus type's +->runtime_suspend(), it is expected that remote wake-up (i.e. hardware mechanism +allowing the device to request a change of its power state, such as PCI PME) +will be enabled for the device. Generally, remote wake-up should be enabled +for all input devices put into a low power state at run time. + +The ->runtime_resume() callback is executed by the PM core for the bus type of +the device being woken up. The bus type's callback is then _entirely_ +_responsible_ for handling the device as appropriate, which may, but need not +include executing the device driver's own ->runtime_resume() callback (from the +PM core's point of view it is not necessary to implement a ->runtime_resume() +callback in a device driver as long as the bus type's ->runtime_resume() knows +what to do to handle the device). + + * Once the bus type's ->runtime_resume() callback has completed successfully, + the PM core regards the device as fully operational, which means that the + device _must_ be able to complete I/O operations as needed. The run-time + PM status of the device is then 'active'. + + * If the bus type's ->runtime_resume() callback returns an error code, the PM + core regards this as a fatal error and will refuse to run the helper + functions described in Section 4 for the device, until its status is + directly set either to 'active' or to 'suspended' (the PM core provides + special helper functions for this purpose). + +The ->runtime_idle() callback is executed by the PM core for the bus type of +given device whenever the device appears to be idle, which is indicated to the +PM core by two counters, the device's usage counter and the counter of 'active' +children of the device. + + * If any of these counters is decreased using a helper function provided by + the PM core and it turns out to be equal to zero, the other counter is + checked. If that counter also is equal to zero, the PM core executes the + device bus type's ->runtime_idle() callback (with the device as an + argument). + +The action performed by a bus type's ->runtime_idle() callback is totally +dependent on the bus type in question, but the expected and recommended action +is to check if the device can be suspended (i.e. if all of the conditions +necessary for suspending the device are satisfied) and to queue up a suspend +request for the device in that case. + +The helper functions provided by the PM core, described in Section 4, guarantee +that the following constraints are met with respect to the bus type's run-time +PM callbacks: + +(1) The callbacks are mutually exclusive (e.g. it is forbidden to execute + ->runtime_suspend() in parallel with ->runtime_resume() or with another + instance of ->runtime_suspend() for the same device) with the exception that + ->runtime_suspend() or ->runtime_resume() can be executed in parallel with + ->runtime_idle() (although ->runtime_idle() will not be started while any + of the other callbacks is being executed for the same device). + +(2) ->runtime_idle() and ->runtime_suspend() can only be executed for 'active' + devices (i.e. the PM core will only execute ->runtime_idle() or + ->runtime_suspend() for the devices the run-time PM status of which is + 'active'). + +(3) ->runtime_idle() and ->runtime_suspend() can only be executed for a device + the usage counter of which is equal to zero _and_ either the counter of + 'active' children of which is equal to zero, or the 'power.ignore_children' + flag of which is set. + +(4) ->runtime_resume() can only be executed for 'suspended' devices (i.e. the + PM core will only execute ->runtime_resume() for the devices the run-time + PM status of which is 'suspended'). + +Additionally, the helper functions provided by the PM core obey the following +rules: + + * If ->runtime_suspend() is about to be executed or there's a pending request + to execute it, ->runtime_idle() will not be executed for the same device. + + * A request to execute or to schedule the execution of ->runtime_suspend() + will cancel any pending requests to execute ->runtime_idle() for the same + device. + + * If ->runtime_resume() is about to be executed or there's a pending request + to execute it, the other callbacks will not be executed for the same device. + + * A request to execute ->runtime_resume() will cancel any pending or + scheduled requests to execute the other callbacks for the same device. + +3. Run-time PM Device Fields + +The following device run-time PM fields are present in 'struct dev_pm_info', as +defined in include/linux/pm.h: + + struct timer_list suspend_timer; + - timer used for scheduling (delayed) suspend request + + unsigned long timer_expires; + - timer expiration time, in jiffies (if this is different from zero, the + timer is running and will expire at that time, otherwise the timer is not + running) + + struct work_struct work; + - work structure used for queuing up requests (i.e. work items in pm_wq) + + wait_queue_head_t wait_queue; + - wait queue used if any of the helper functions needs to wait for another + one to complete + + spinlock_t lock; + - lock used for synchronisation + + atomic_t usage_count; + - the usage counter of the device + + atomic_t child_count; + - the count of 'active' children of the device + + unsigned int ignore_children; + - if set, the value of child_count is ignored (but still updated) + + unsigned int disable_depth; + - used for disabling the helper funcions (they work normally if this is + equal to zero); the initial value of it is 1 (i.e. run-time PM is + initially disabled for all devices) + + unsigned int runtime_error; + - if set, there was a fatal error (one of the callbacks returned error code + as described in Section 2), so the helper funtions will not work until + this flag is cleared; this is the error code returned by the failing + callback + + unsigned int idle_notification; + - if set, ->runtime_idle() is being executed + + unsigned int request_pending; + - if set, there's a pending request (i.e. a work item queued up into pm_wq) + + enum rpm_request request; + - type of request that's pending (valid if request_pending is set) + + unsigned int deferred_resume; + - set if ->runtime_resume() is about to be run while ->runtime_suspend() is + being executed for that device and it is not practical to wait for the + suspend to complete; means "start a resume as soon as you've suspended" + + enum rpm_status runtime_status; + - the run-time PM status of the device; this field's initial value is + RPM_SUSPENDED, which means that each device is initially regarded by the + PM core as 'suspended', regardless of its real hardware status + +All of the above fields are members of the 'power' member of 'struct device'. + +4. Run-time PM Device Helper Functions + +The following run-time PM helper functions are defined in +drivers/base/power/runtime.c and include/linux/pm_runtime.h: + + void pm_runtime_init(struct device *dev); + - initialize the device run-time PM fields in 'struct dev_pm_info' + + void pm_runtime_remove(struct device *dev); + - make sure that the run-time PM of the device will be disabled after + removing the device from device hierarchy + + int pm_runtime_idle(struct device *dev); + - execute ->runtime_idle() for the device's bus type; returns 0 on success + or error code on failure, where -EINPROGRESS means that ->runtime_idle() + is already being executed + + int pm_runtime_suspend(struct device *dev); + - execute ->runtime_suspend() for the device's bus type; returns 0 on + success, 1 if the device's run-time PM status was already 'suspended', or + error code on failure, where -EAGAIN or -EBUSY means it is safe to attempt + to suspend the device again in future + + int pm_runtime_resume(struct device *dev); + - execute ->runtime_resume() for the device's bus type; returns 0 on + success, 1 if the device's run-time PM status was already 'active' or + error code on failure, where -EAGAIN means it may be safe to attempt to + resume the device again in future, but 'power.runtime_error' should be + checked additionally + + int pm_request_idle(struct device *dev); + - submit a request to execute ->runtime_idle() for the device's bus type + (the request is represented by a work item in pm_wq); returns 0 on success + or error code if the request has not been queued up + + int pm_schedule_suspend(struct device *dev, unsigned int delay); + - schedule the execution of ->runtime_suspend() for the device's bus type + in future, where 'delay' is the time to wait before queuing up a suspend + work item in pm_wq, in milliseconds (if 'delay' is zero, the work item is + queued up immediately); returns 0 on success, 1 if the device's PM + run-time status was already 'suspended', or error code if the request + hasn't been scheduled (or queued up if 'delay' is 0); if the execution of + ->runtime_suspend() is already scheduled and not yet expired, the new + value of 'delay' will be used as the time to wait + + int pm_request_resume(struct device *dev); + - submit a request to execute ->runtime_resume() for the device's bus type + (the request is represented by a work item in pm_wq); returns 0 on + success, 1 if the device's run-time PM status was already 'active', or + error code if the request hasn't been queued up + + void pm_runtime_get_noresume(struct device *dev); + - increment the device's usage counter + + int pm_runtime_get(struct device *dev); + - increment the device's usage counter, run pm_request_resume(dev) and + return its result + + int pm_runtime_get_sync(struct device *dev); + - increment the device's usage counter, run pm_runtime_resume(dev) and + return its result + + void pm_runtime_put_noidle(struct device *dev); + - decrement the device's usage counter + + int pm_runtime_put(struct device *dev); + - decrement the device's usage counter, run pm_request_idle(dev) and return + its result + + int pm_runtime_put_sync(struct device *dev); + - decrement the device's usage counter, run pm_runtime_idle(dev) and return + its result + + void pm_runtime_enable(struct device *dev); + - enable the run-time PM helper functions to run the device bus type's + run-time PM callbacks described in Section 2 + + int pm_runtime_disable(struct device *dev); + - prevent the run-time PM helper functions from running the device bus + type's run-time PM callbacks, make sure that all of the pending run-time + PM operations on the device are either completed or canceled; returns + 1 if there was a resume request pending and it was necessary to execute + ->runtime_resume() for the device's bus type to satisfy that request, + otherwise 0 is returned + + void pm_suspend_ignore_children(struct device *dev, bool enable); + - set/unset the power.ignore_children flag of the device + + int pm_runtime_set_active(struct device *dev); + - clear the device's 'power.runtime_error' flag, set the device's run-time + PM status to 'active' and update its parent's counter of 'active' + children as appropriate (it is only valid to use this function if + 'power.runtime_error' is set or 'power.disable_depth' is greater than + zero); it will fail and return error code if the device has a parent + which is not active and the 'power.ignore_children' flag of which is unset + + void pm_runtime_set_suspended(struct device *dev); + - clear the device's 'power.runtime_error' flag, set the device's run-time + PM status to 'suspended' and update its parent's counter of 'active' + children as appropriate (it is only valid to use this function if + 'power.runtime_error' is set or 'power.disable_depth' is greater than + zero) + +It is safe to execute the following helper functions from interrupt context: + +pm_request_idle() +pm_schedule_suspend() +pm_request_resume() +pm_runtime_get_noresume() +pm_runtime_get() +pm_runtime_put_noidle() +pm_runtime_put() +pm_suspend_ignore_children() +pm_runtime_set_active() +pm_runtime_set_suspended() +pm_runtime_enable() + +5. Run-time PM Initialization, Device Probing and Removal + +Initially, the run-time PM is disabled for all devices, which means that the +majority of the run-time PM helper funtions described in Section 4 will return +-EAGAIN until pm_runtime_enable() is called for the device. + +In addition to that, the initial run-time PM status of all devices is +'suspended', but it need not reflect the actual physical state of the device. +Thus, if the device is initially active (i.e. it is able to process I/O), its +run-time PM status must be changed to 'active', with the help of +pm_runtime_set_active(), before pm_runtime_enable() is called for the device. + +However, if the device has a parent and the parent's run-time PM is enabled, +calling pm_runtime_set_active() for the device will affect the parent, unless +the parent's 'power.ignore_children' flag is set. Namely, in that case the +parent won't be able to suspend at run time, using the PM core's helper +functions, as long as the child's status is 'active', even if the child's +run-time PM is still disabled (i.e. pm_runtime_enable() hasn't been called for +the child yet or pm_runtime_disable() has been called for it). For this reason, +once pm_runtime_set_active() has been called for the device, pm_runtime_enable() +should be called for it too as soon as reasonably possible or its run-time PM +status should be changed back to 'suspended' with the help of +pm_runtime_set_suspended(). + +If the default initial run-time PM status of the device (i.e. 'suspended') +reflects the actual state of the device, its bus type's or its driver's +->probe() callback will likely need to wake it up using one of the PM core's +helper functions described in Section 4. In that case, pm_runtime_resume() +should be used. Of course, for this purpose the device's run-time PM has to be +enabled earlier by calling pm_runtime_enable(). + +If the device bus type's or driver's ->probe() or ->remove() callback runs +pm_runtime_suspend() or pm_runtime_idle() or their asynchronous counterparts, +they will fail returning -EAGAIN, because the device's usage counter is +incremented by the core before executing ->probe() and ->remove(). Still, it +may be desirable to suspend the device as soon as ->probe() or ->remove() has +finished, so the PM core uses pm_runtime_idle_sync() to invoke the device bus +type's ->runtime_idle() callback at that time. diff --git a/Documentation/powerpc/booting-without-of.txt b/Documentation/powerpc/booting-without-of.txt index 8d999d862d0..79f533f38c6 100644 --- a/Documentation/powerpc/booting-without-of.txt +++ b/Documentation/powerpc/booting-without-of.txt @@ -1238,1122 +1238,7 @@ descriptions for the SOC devices for which new nodes have been defined; this list will expand as more and more SOC-containing platforms are moved over to use the flattened-device-tree model. - a) PHY nodes - - Required properties: - - - device_type : Should be "ethernet-phy" - - interrupts : <a b> where a is the interrupt number and b is a - field that represents an encoding of the sense and level - information for the interrupt. This should be encoded based on - the information in section 2) depending on the type of interrupt - controller you have. - - interrupt-parent : the phandle for the interrupt controller that - services interrupts for this device. - - reg : The ID number for the phy, usually a small integer - - linux,phandle : phandle for this node; likely referenced by an - ethernet controller node. - - - Example: - - ethernet-phy@0 { - linux,phandle = <2452000> - interrupt-parent = <40000>; - interrupts = <35 1>; - reg = <0>; - device_type = "ethernet-phy"; - }; - - - b) Interrupt controllers - - Some SOC devices contain interrupt controllers that are different - from the standard Open PIC specification. The SOC device nodes for - these types of controllers should be specified just like a standard - OpenPIC controller. Sense and level information should be encoded - as specified in section 2) of this chapter for each device that - specifies an interrupt. - - Example : - - pic@40000 { - linux,phandle = <40000>; - interrupt-controller; - #address-cells = <0>; - reg = <40000 40000>; - compatible = "chrp,open-pic"; - device_type = "open-pic"; - }; - - c) 4xx/Axon EMAC ethernet nodes - - The EMAC ethernet controller in IBM and AMCC 4xx chips, and also - the Axon bridge. To operate this needs to interact with a ths - special McMAL DMA controller, and sometimes an RGMII or ZMII - interface. In addition to the nodes and properties described - below, the node for the OPB bus on which the EMAC sits must have a - correct clock-frequency property. - - i) The EMAC node itself - - Required properties: - - device_type : "network" - - - compatible : compatible list, contains 2 entries, first is - "ibm,emac-CHIP" where CHIP is the host ASIC (440gx, - 405gp, Axon) and second is either "ibm,emac" or - "ibm,emac4". For Axon, thus, we have: "ibm,emac-axon", - "ibm,emac4" - - interrupts : <interrupt mapping for EMAC IRQ and WOL IRQ> - - interrupt-parent : optional, if needed for interrupt mapping - - reg : <registers mapping> - - local-mac-address : 6 bytes, MAC address - - mal-device : phandle of the associated McMAL node - - mal-tx-channel : 1 cell, index of the tx channel on McMAL associated - with this EMAC - - mal-rx-channel : 1 cell, index of the rx channel on McMAL associated - with this EMAC - - cell-index : 1 cell, hardware index of the EMAC cell on a given - ASIC (typically 0x0 and 0x1 for EMAC0 and EMAC1 on - each Axon chip) - - max-frame-size : 1 cell, maximum frame size supported in bytes - - rx-fifo-size : 1 cell, Rx fifo size in bytes for 10 and 100 Mb/sec - operations. - For Axon, 2048 - - tx-fifo-size : 1 cell, Tx fifo size in bytes for 10 and 100 Mb/sec - operations. - For Axon, 2048. - - fifo-entry-size : 1 cell, size of a fifo entry (used to calculate - thresholds). - For Axon, 0x00000010 - - mal-burst-size : 1 cell, MAL burst size (used to calculate thresholds) - in bytes. - For Axon, 0x00000100 (I think ...) - - phy-mode : string, mode of operations of the PHY interface. - Supported values are: "mii", "rmii", "smii", "rgmii", - "tbi", "gmii", rtbi", "sgmii". - For Axon on CAB, it is "rgmii" - - mdio-device : 1 cell, required iff using shared MDIO registers - (440EP). phandle of the EMAC to use to drive the - MDIO lines for the PHY used by this EMAC. - - zmii-device : 1 cell, required iff connected to a ZMII. phandle of - the ZMII device node - - zmii-channel : 1 cell, required iff connected to a ZMII. Which ZMII - channel or 0xffffffff if ZMII is only used for MDIO. - - rgmii-device : 1 cell, required iff connected to an RGMII. phandle - of the RGMII device node. - For Axon: phandle of plb5/plb4/opb/rgmii - - rgmii-channel : 1 cell, required iff connected to an RGMII. Which - RGMII channel is used by this EMAC. - Fox Axon: present, whatever value is appropriate for each - EMAC, that is the content of the current (bogus) "phy-port" - property. - - Optional properties: - - phy-address : 1 cell, optional, MDIO address of the PHY. If absent, - a search is performed. - - phy-map : 1 cell, optional, bitmap of addresses to probe the PHY - for, used if phy-address is absent. bit 0x00000001 is - MDIO address 0. - For Axon it can be absent, though my current driver - doesn't handle phy-address yet so for now, keep - 0x00ffffff in it. - - rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec - operations (if absent the value is the same as - rx-fifo-size). For Axon, either absent or 2048. - - tx-fifo-size-gige : 1 cell, Tx fifo size in bytes for 1000 Mb/sec - operations (if absent the value is the same as - tx-fifo-size). For Axon, either absent or 2048. - - tah-device : 1 cell, optional. If connected to a TAH engine for - offload, phandle of the TAH device node. - - tah-channel : 1 cell, optional. If appropriate, channel used on the - TAH engine. - - Example: - - EMAC0: ethernet@40000800 { - device_type = "network"; - compatible = "ibm,emac-440gp", "ibm,emac"; - interrupt-parent = <&UIC1>; - interrupts = <1c 4 1d 4>; - reg = <40000800 70>; - local-mac-address = [00 04 AC E3 1B 1E]; - mal-device = <&MAL0>; - mal-tx-channel = <0 1>; - mal-rx-channel = <0>; - cell-index = <0>; - max-frame-size = <5dc>; - rx-fifo-size = <1000>; - tx-fifo-size = <800>; - phy-mode = "rmii"; - phy-map = <00000001>; - zmii-device = <&ZMII0>; - zmii-channel = <0>; - }; - - ii) McMAL node - - Required properties: - - device_type : "dma-controller" - - compatible : compatible list, containing 2 entries, first is - "ibm,mcmal-CHIP" where CHIP is the host ASIC (like - emac) and the second is either "ibm,mcmal" or - "ibm,mcmal2". - For Axon, "ibm,mcmal-axon","ibm,mcmal2" - - interrupts : <interrupt mapping for the MAL interrupts sources: - 5 sources: tx_eob, rx_eob, serr, txde, rxde>. - For Axon: This is _different_ from the current - firmware. We use the "delayed" interrupts for txeob - and rxeob. Thus we end up with mapping those 5 MPIC - interrupts, all level positive sensitive: 10, 11, 32, - 33, 34 (in decimal) - - dcr-reg : < DCR registers range > - - dcr-parent : if needed for dcr-reg - - num-tx-chans : 1 cell, number of Tx channels - - num-rx-chans : 1 cell, number of Rx channels - - iii) ZMII node - - Required properties: - - compatible : compatible list, containing 2 entries, first is - "ibm,zmii-CHIP" where CHIP is the host ASIC (like - EMAC) and the second is "ibm,zmii". - For Axon, there is no ZMII node. - - reg : <registers mapping> - - iv) RGMII node - - Required properties: - - compatible : compatible list, containing 2 entries, first is - "ibm,rgmii-CHIP" where CHIP is the host ASIC (like - EMAC) and the second is "ibm,rgmii". - For Axon, "ibm,rgmii-axon","ibm,rgmii" - - reg : <registers mapping> - - revision : as provided by the RGMII new version register if - available. - For Axon: 0x0000012a - - d) Xilinx IP cores - - The Xilinx EDK toolchain ships with a set of IP cores (devices) for use - in Xilinx Spartan and Virtex FPGAs. The devices cover the whole range - of standard device types (network, serial, etc.) and miscellaneous - devices (gpio, LCD, spi, etc). Also, since these devices are - implemented within the fpga fabric every instance of the device can be - synthesised with different options that change the behaviour. - - Each IP-core has a set of parameters which the FPGA designer can use to - control how the core is synthesized. Historically, the EDK tool would - extract the device parameters relevant to device drivers and copy them - into an 'xparameters.h' in the form of #define symbols. This tells the - device drivers how the IP cores are configured, but it requres the kernel - to be recompiled every time the FPGA bitstream is resynthesized. - - The new approach is to export the parameters into the device tree and - generate a new device tree each time the FPGA bitstream changes. The - parameters which used to be exported as #defines will now become - properties of the device node. In general, device nodes for IP-cores - will take the following form: - - (name): (generic-name)@(base-address) { - compatible = "xlnx,(ip-core-name)-(HW_VER)" - [, (list of compatible devices), ...]; - reg = <(baseaddr) (size)>; - interrupt-parent = <&interrupt-controller-phandle>; - interrupts = < ... >; - xlnx,(parameter1) = "(string-value)"; - xlnx,(parameter2) = <(int-value)>; - }; - - (generic-name): an open firmware-style name that describes the - generic class of device. Preferably, this is one word, such - as 'serial' or 'ethernet'. - (ip-core-name): the name of the ip block (given after the BEGIN - directive in system.mhs). Should be in lowercase - and all underscores '_' converted to dashes '-'. - (name): is derived from the "PARAMETER INSTANCE" value. - (parameter#): C_* parameters from system.mhs. The C_ prefix is - dropped from the parameter name, the name is converted - to lowercase and all underscore '_' characters are - converted to dashes '-'. - (baseaddr): the baseaddr parameter value (often named C_BASEADDR). - (HW_VER): from the HW_VER parameter. - (size): the address range size (often C_HIGHADDR - C_BASEADDR + 1). - - Typically, the compatible list will include the exact IP core version - followed by an older IP core version which implements the same - interface or any other device with the same interface. - - 'reg', 'interrupt-parent' and 'interrupts' are all optional properties. - - For example, the following block from system.mhs: - - BEGIN opb_uartlite - PARAMETER INSTANCE = opb_uartlite_0 - PARAMETER HW_VER = 1.00.b - PARAMETER C_BAUDRATE = 115200 - PARAMETER C_DATA_BITS = 8 - PARAMETER C_ODD_PARITY = 0 - PARAMETER C_USE_PARITY = 0 - PARAMETER C_CLK_FREQ = 50000000 - PARAMETER C_BASEADDR = 0xEC100000 - PARAMETER C_HIGHADDR = 0xEC10FFFF - BUS_INTERFACE SOPB = opb_7 - PORT OPB_Clk = CLK_50MHz - PORT Interrupt = opb_uartlite_0_Interrupt - PORT RX = opb_uartlite_0_RX - PORT TX = opb_uartlite_0_TX - PORT OPB_Rst = sys_bus_reset_0 - END - - becomes the following device tree node: - - opb_uartlite_0: serial@ec100000 { - device_type = "serial"; - compatible = "xlnx,opb-uartlite-1.00.b"; - reg = <ec100000 10000>; - interrupt-parent = <&opb_intc_0>; - interrupts = <1 0>; // got this from the opb_intc parameters - current-speed = <d#115200>; // standard serial device prop - clock-frequency = <d#50000000>; // standard serial device prop - xlnx,data-bits = <8>; - xlnx,odd-parity = <0>; - xlnx,use-parity = <0>; - }; - - Some IP cores actually implement 2 or more logical devices. In - this case, the device should still describe the whole IP core with - a single node and add a child node for each logical device. The - ranges property can be used to translate from parent IP-core to the - registers of each device. In addition, the parent node should be - compatible with the bus type 'xlnx,compound', and should contain - #address-cells and #size-cells, as with any other bus. (Note: this - makes the assumption that both logical devices have the same bus - binding. If this is not true, then separate nodes should be used - for each logical device). The 'cell-index' property can be used to - enumerate logical devices within an IP core. For example, the - following is the system.mhs entry for the dual ps2 controller found - on the ml403 reference design. - - BEGIN opb_ps2_dual_ref - PARAMETER INSTANCE = opb_ps2_dual_ref_0 - PARAMETER HW_VER = 1.00.a - PARAMETER C_BASEADDR = 0xA9000000 - PARAMETER C_HIGHADDR = 0xA9001FFF - BUS_INTERFACE SOPB = opb_v20_0 - PORT Sys_Intr1 = ps2_1_intr - PORT Sys_Intr2 = ps2_2_intr - PORT Clkin1 = ps2_clk_rx_1 - PORT Clkin2 = ps2_clk_rx_2 - PORT Clkpd1 = ps2_clk_tx_1 - PORT Clkpd2 = ps2_clk_tx_2 - PORT Rx1 = ps2_d_rx_1 - PORT Rx2 = ps2_d_rx_2 - PORT Txpd1 = ps2_d_tx_1 - PORT Txpd2 = ps2_d_tx_2 - END - - It would result in the following device tree nodes: - - opb_ps2_dual_ref_0: opb-ps2-dual-ref@a9000000 { - #address-cells = <1>; - #size-cells = <1>; - compatible = "xlnx,compound"; - ranges = <0 a9000000 2000>; - // If this device had extra parameters, then they would - // go here. - ps2@0 { - compatible = "xlnx,opb-ps2-dual-ref-1.00.a"; - reg = <0 40>; - interrupt-parent = <&opb_intc_0>; - interrupts = <3 0>; - cell-index = <0>; - }; - ps2@1000 { - compatible = "xlnx,opb-ps2-dual-ref-1.00.a"; - reg = <1000 40>; - interrupt-parent = <&opb_intc_0>; - interrupts = <3 0>; - cell-index = <0>; - }; - }; - - Also, the system.mhs file defines bus attachments from the processor - to the devices. The device tree structure should reflect the bus - attachments. Again an example; this system.mhs fragment: - - BEGIN ppc405_virtex4 - PARAMETER INSTANCE = ppc405_0 - PARAMETER HW_VER = 1.01.a - BUS_INTERFACE DPLB = plb_v34_0 - BUS_INTERFACE IPLB = plb_v34_0 - END - - BEGIN opb_intc - PARAMETER INSTANCE = opb_intc_0 - PARAMETER HW_VER = 1.00.c - PARAMETER C_BASEADDR = 0xD1000FC0 - PARAMETER C_HIGHADDR = 0xD1000FDF - BUS_INTERFACE SOPB = opb_v20_0 - END - - BEGIN opb_uart16550 - PARAMETER INSTANCE = opb_uart16550_0 - PARAMETER HW_VER = 1.00.d - PARAMETER C_BASEADDR = 0xa0000000 - PARAMETER C_HIGHADDR = 0xa0001FFF - BUS_INTERFACE SOPB = opb_v20_0 - END - - BEGIN plb_v34 - PARAMETER INSTANCE = plb_v34_0 - PARAMETER HW_VER = 1.02.a - END - - BEGIN plb_bram_if_cntlr - PARAMETER INSTANCE = plb_bram_if_cntlr_0 - PARAMETER HW_VER = 1.00.b - PARAMETER C_BASEADDR = 0xFFFF0000 - PARAMETER C_HIGHADDR = 0xFFFFFFFF - BUS_INTERFACE SPLB = plb_v34_0 - END - - BEGIN plb2opb_bridge - PARAMETER INSTANCE = plb2opb_bridge_0 - PARAMETER HW_VER = 1.01.a - PARAMETER C_RNG0_BASEADDR = 0x20000000 - PARAMETER C_RNG0_HIGHADDR = 0x3FFFFFFF - PARAMETER C_RNG1_BASEADDR = 0x60000000 - PARAMETER C_RNG1_HIGHADDR = 0x7FFFFFFF - PARAMETER C_RNG2_BASEADDR = 0x80000000 - PARAMETER C_RNG2_HIGHADDR = 0xBFFFFFFF - PARAMETER C_RNG3_BASEADDR = 0xC0000000 - PARAMETER C_RNG3_HIGHADDR = 0xDFFFFFFF - BUS_INTERFACE SPLB = plb_v34_0 - BUS_INTERFACE MOPB = opb_v20_0 - END - - Gives this device tree (some properties removed for clarity): - - plb@0 { - #address-cells = <1>; - #size-cells = <1>; - compatible = "xlnx,plb-v34-1.02.a"; - device_type = "ibm,plb"; - ranges; // 1:1 translation - - plb_bram_if_cntrl_0: bram@ffff0000 { - reg = <ffff0000 10000>; - } - - opb@20000000 { - #address-cells = <1>; - #size-cells = <1>; - ranges = <20000000 20000000 20000000 - 60000000 60000000 20000000 - 80000000 80000000 40000000 - c0000000 c0000000 20000000>; - - opb_uart16550_0: serial@a0000000 { - reg = <a00000000 2000>; - }; - - opb_intc_0: interrupt-controller@d1000fc0 { - reg = <d1000fc0 20>; - }; - }; - }; - - That covers the general approach to binding xilinx IP cores into the - device tree. The following are bindings for specific devices: - - i) Xilinx ML300 Framebuffer - - Simple framebuffer device from the ML300 reference design (also on the - ML403 reference design as well as others). - - Optional properties: - - resolution = <xres yres> : pixel resolution of framebuffer. Some - implementations use a different resolution. - Default is <d#640 d#480> - - virt-resolution = <xvirt yvirt> : Size of framebuffer in memory. - Default is <d#1024 d#480>. - - rotate-display (empty) : rotate display 180 degrees. - - ii) Xilinx SystemACE - - The Xilinx SystemACE device is used to program FPGAs from an FPGA - bitstream stored on a CF card. It can also be used as a generic CF - interface device. - - Optional properties: - - 8-bit (empty) : Set this property for SystemACE in 8 bit mode - - iii) Xilinx EMAC and Xilinx TEMAC - - Xilinx Ethernet devices. In addition to general xilinx properties - listed above, nodes for these devices should include a phy-handle - property, and may include other common network device properties - like local-mac-address. - - iv) Xilinx Uartlite - - Xilinx uartlite devices are simple fixed speed serial ports. - - Required properties: - - current-speed : Baud rate of uartlite - - v) Xilinx hwicap - - Xilinx hwicap devices provide access to the configuration logic - of the FPGA through the Internal Configuration Access Port - (ICAP). The ICAP enables partial reconfiguration of the FPGA, - readback of the configuration information, and some control over - 'warm boots' of the FPGA fabric. - - Required properties: - - xlnx,family : The family of the FPGA, necessary since the - capabilities of the underlying ICAP hardware - differ between different families. May be - 'virtex2p', 'virtex4', or 'virtex5'. - - vi) Xilinx Uart 16550 - - Xilinx UART 16550 devices are very similar to the NS16550 but with - different register spacing and an offset from the base address. - - Required properties: - - clock-frequency : Frequency of the clock input - - reg-offset : A value of 3 is required - - reg-shift : A value of 2 is required - - e) USB EHCI controllers - - Required properties: - - compatible : should be "usb-ehci". - - reg : should contain at least address and length of the standard EHCI - register set for the device. Optional platform-dependent registers - (debug-port or other) can be also specified here, but only after - definition of standard EHCI registers. - - interrupts : one EHCI interrupt should be described here. - If device registers are implemented in big endian mode, the device - node should have "big-endian-regs" property. - If controller implementation operates with big endian descriptors, - "big-endian-desc" property should be specified. - If both big endian registers and descriptors are used by the controller - implementation, "big-endian" property can be specified instead of having - both "big-endian-regs" and "big-endian-desc". - - Example (Sequoia 440EPx): - ehci@e0000300 { - compatible = "ibm,usb-ehci-440epx", "usb-ehci"; - interrupt-parent = <&UIC0>; - interrupts = <1a 4>; - reg = <0 e0000300 90 0 e0000390 70>; - big-endian; - }; - - f) MDIO on GPIOs - - Currently defined compatibles: - - virtual,gpio-mdio - - MDC and MDIO lines connected to GPIO controllers are listed in the - gpios property as described in section VIII.1 in the following order: - - MDC, MDIO. - - Example: - - mdio { - compatible = "virtual,mdio-gpio"; - #address-cells = <1>; - #size-cells = <0>; - gpios = <&qe_pio_a 11 - &qe_pio_c 6>; - }; - - g) SPI (Serial Peripheral Interface) busses - - SPI busses can be described with a node for the SPI master device - and a set of child nodes for each SPI slave on the bus. For this - discussion, it is assumed that the system's SPI controller is in - SPI master mode. This binding does not describe SPI controllers - in slave mode. - - The SPI master node requires the following properties: - - #address-cells - number of cells required to define a chip select - address on the SPI bus. - - #size-cells - should be zero. - - compatible - name of SPI bus controller following generic names - recommended practice. - No other properties are required in the SPI bus node. It is assumed - that a driver for an SPI bus device will understand that it is an SPI bus. - However, the binding does not attempt to define the specific method for - assigning chip select numbers. Since SPI chip select configuration is - flexible and non-standardized, it is left out of this binding with the - assumption that board specific platform code will be used to manage - chip selects. Individual drivers can define additional properties to - support describing the chip select layout. - - SPI slave nodes must be children of the SPI master node and can - contain the following properties. - - reg - (required) chip select address of device. - - compatible - (required) name of SPI device following generic names - recommended practice - - spi-max-frequency - (required) Maximum SPI clocking speed of device in Hz - - spi-cpol - (optional) Empty property indicating device requires - inverse clock polarity (CPOL) mode - - spi-cpha - (optional) Empty property indicating device requires - shifted clock phase (CPHA) mode - - spi-cs-high - (optional) Empty property indicating device requires - chip select active high - - SPI example for an MPC5200 SPI bus: - spi@f00 { - #address-cells = <1>; - #size-cells = <0>; - compatible = "fsl,mpc5200b-spi","fsl,mpc5200-spi"; - reg = <0xf00 0x20>; - interrupts = <2 13 0 2 14 0>; - interrupt-parent = <&mpc5200_pic>; - - ethernet-switch@0 { - compatible = "micrel,ks8995m"; - spi-max-frequency = <1000000>; - reg = <0>; - }; - - codec@1 { - compatible = "ti,tlv320aic26"; - spi-max-frequency = <100000>; - reg = <1>; - }; - }; - -VII - Marvell Discovery mv64[345]6x System Controller chips -=========================================================== - -The Marvell mv64[345]60 series of system controller chips contain -many of the peripherals needed to implement a complete computer -system. In this section, we define device tree nodes to describe -the system controller chip itself and each of the peripherals -which it contains. Compatible string values for each node are -prefixed with the string "marvell,", for Marvell Technology Group Ltd. - -1) The /system-controller node - - This node is used to represent the system-controller and must be - present when the system uses a system controller chip. The top-level - system-controller node contains information that is global to all - devices within the system controller chip. The node name begins - with "system-controller" followed by the unit address, which is - the base address of the memory-mapped register set for the system - controller chip. - - Required properties: - - - ranges : Describes the translation of system controller addresses - for memory mapped registers. - - clock-frequency: Contains the main clock frequency for the system - controller chip. - - reg : This property defines the address and size of the - memory-mapped registers contained within the system controller - chip. The address specified in the "reg" property should match - the unit address of the system-controller node. - - #address-cells : Address representation for system controller - devices. This field represents the number of cells needed to - represent the address of the memory-mapped registers of devices - within the system controller chip. - - #size-cells : Size representation for for the memory-mapped - registers within the system controller chip. - - #interrupt-cells : Defines the width of cells used to represent - interrupts. - - Optional properties: - - - model : The specific model of the system controller chip. Such - as, "mv64360", "mv64460", or "mv64560". - - compatible : A string identifying the compatibility identifiers - of the system controller chip. - - The system-controller node contains child nodes for each system - controller device that the platform uses. Nodes should not be created - for devices which exist on the system controller chip but are not used - - Example Marvell Discovery mv64360 system-controller node: - - system-controller@f1000000 { /* Marvell Discovery mv64360 */ - #address-cells = <1>; - #size-cells = <1>; - model = "mv64360"; /* Default */ - compatible = "marvell,mv64360"; - clock-frequency = <133333333>; - reg = <0xf1000000 0x10000>; - virtual-reg = <0xf1000000>; - ranges = <0x88000000 0x88000000 0x1000000 /* PCI 0 I/O Space */ - 0x80000000 0x80000000 0x8000000 /* PCI 0 MEM Space */ - 0xa0000000 0xa0000000 0x4000000 /* User FLASH */ - 0x00000000 0xf1000000 0x0010000 /* Bridge's regs */ - 0xf2000000 0xf2000000 0x0040000>;/* Integrated SRAM */ - - [ child node definitions... ] - } - -2) Child nodes of /system-controller - - a) Marvell Discovery MDIO bus - - The MDIO is a bus to which the PHY devices are connected. For each - device that exists on this bus, a child node should be created. See - the definition of the PHY node below for an example of how to define - a PHY. - - Required properties: - - #address-cells : Should be <1> - - #size-cells : Should be <0> - - device_type : Should be "mdio" - - compatible : Should be "marvell,mv64360-mdio" - - Example: - - mdio { - #address-cells = <1>; - #size-cells = <0>; - device_type = "mdio"; - compatible = "marvell,mv64360-mdio"; - - ethernet-phy@0 { - ...... - }; - }; - - - b) Marvell Discovery ethernet controller - - The Discover ethernet controller is described with two levels - of nodes. The first level describes an ethernet silicon block - and the second level describes up to 3 ethernet nodes within - that block. The reason for the multiple levels is that the - registers for the node are interleaved within a single set - of registers. The "ethernet-block" level describes the - shared register set, and the "ethernet" nodes describe ethernet - port-specific properties. - - Ethernet block node - - Required properties: - - #address-cells : <1> - - #size-cells : <0> - - compatible : "marvell,mv64360-eth-block" - - reg : Offset and length of the register set for this block - - Example Discovery Ethernet block node: - ethernet-block@2000 { - #address-cells = <1>; - #size-cells = <0>; - compatible = "marvell,mv64360-eth-block"; - reg = <0x2000 0x2000>; - ethernet@0 { - ....... - }; - }; - - Ethernet port node - - Required properties: - - device_type : Should be "network". - - compatible : Should be "marvell,mv64360-eth". - - reg : Should be <0>, <1>, or <2>, according to which registers - within the silicon block the device uses. - - interrupts : <a> where a is the interrupt number for the port. - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - phy : the phandle for the PHY connected to this ethernet - controller. - - local-mac-address : 6 bytes, MAC address - - Example Discovery Ethernet port node: - ethernet@0 { - device_type = "network"; - compatible = "marvell,mv64360-eth"; - reg = <0>; - interrupts = <32>; - interrupt-parent = <&PIC>; - phy = <&PHY0>; - local-mac-address = [ 00 00 00 00 00 00 ]; - }; - - - - c) Marvell Discovery PHY nodes - - Required properties: - - device_type : Should be "ethernet-phy" - - interrupts : <a> where a is the interrupt number for this phy. - - interrupt-parent : the phandle for the interrupt controller that - services interrupts for this device. - - reg : The ID number for the phy, usually a small integer - - Example Discovery PHY node: - ethernet-phy@1 { - device_type = "ethernet-phy"; - compatible = "broadcom,bcm5421"; - interrupts = <76>; /* GPP 12 */ - interrupt-parent = <&PIC>; - reg = <1>; - }; - - - d) Marvell Discovery SDMA nodes - - Represent DMA hardware associated with the MPSC (multiprotocol - serial controllers). - - Required properties: - - compatible : "marvell,mv64360-sdma" - - reg : Offset and length of the register set for this device - - interrupts : <a> where a is the interrupt number for the DMA - device. - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery SDMA node: - sdma@4000 { - compatible = "marvell,mv64360-sdma"; - reg = <0x4000 0xc18>; - virtual-reg = <0xf1004000>; - interrupts = <36>; - interrupt-parent = <&PIC>; - }; - - - e) Marvell Discovery BRG nodes - - Represent baud rate generator hardware associated with the MPSC - (multiprotocol serial controllers). - - Required properties: - - compatible : "marvell,mv64360-brg" - - reg : Offset and length of the register set for this device - - clock-src : A value from 0 to 15 which selects the clock - source for the baud rate generator. This value corresponds - to the CLKS value in the BRGx configuration register. See - the mv64x60 User's Manual. - - clock-frequence : The frequency (in Hz) of the baud rate - generator's input clock. - - current-speed : The current speed setting (presumably by - firmware) of the baud rate generator. - - Example Discovery BRG node: - brg@b200 { - compatible = "marvell,mv64360-brg"; - reg = <0xb200 0x8>; - clock-src = <8>; - clock-frequency = <133333333>; - current-speed = <9600>; - }; - - - f) Marvell Discovery CUNIT nodes - - Represent the Serial Communications Unit device hardware. - - Required properties: - - reg : Offset and length of the register set for this device - - Example Discovery CUNIT node: - cunit@f200 { - reg = <0xf200 0x200>; - }; - - - g) Marvell Discovery MPSCROUTING nodes - - Represent the Discovery's MPSC routing hardware - - Required properties: - - reg : Offset and length of the register set for this device - - Example Discovery CUNIT node: - mpscrouting@b500 { - reg = <0xb400 0xc>; - }; - - - h) Marvell Discovery MPSCINTR nodes - - Represent the Discovery's MPSC DMA interrupt hardware registers - (SDMA cause and mask registers). - - Required properties: - - reg : Offset and length of the register set for this device - - Example Discovery MPSCINTR node: - mpsintr@b800 { - reg = <0xb800 0x100>; - }; - - - i) Marvell Discovery MPSC nodes - - Represent the Discovery's MPSC (Multiprotocol Serial Controller) - serial port. - - Required properties: - - device_type : "serial" - - compatible : "marvell,mv64360-mpsc" - - reg : Offset and length of the register set for this device - - sdma : the phandle for the SDMA node used by this port - - brg : the phandle for the BRG node used by this port - - cunit : the phandle for the CUNIT node used by this port - - mpscrouting : the phandle for the MPSCROUTING node used by this port - - mpscintr : the phandle for the MPSCINTR node used by this port - - cell-index : the hardware index of this cell in the MPSC core - - max_idle : value needed for MPSC CHR3 (Maximum Frame Length) - register - - interrupts : <a> where a is the interrupt number for the MPSC. - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery MPSCINTR node: - mpsc@8000 { - device_type = "serial"; - compatible = "marvell,mv64360-mpsc"; - reg = <0x8000 0x38>; - virtual-reg = <0xf1008000>; - sdma = <&SDMA0>; - brg = <&BRG0>; - cunit = <&CUNIT>; - mpscrouting = <&MPSCROUTING>; - mpscintr = <&MPSCINTR>; - cell-index = <0>; - max_idle = <40>; - interrupts = <40>; - interrupt-parent = <&PIC>; - }; - - - j) Marvell Discovery Watch Dog Timer nodes - - Represent the Discovery's watchdog timer hardware - - Required properties: - - compatible : "marvell,mv64360-wdt" - - reg : Offset and length of the register set for this device - - Example Discovery Watch Dog Timer node: - wdt@b410 { - compatible = "marvell,mv64360-wdt"; - reg = <0xb410 0x8>; - }; - - - k) Marvell Discovery I2C nodes - - Represent the Discovery's I2C hardware - - Required properties: - - device_type : "i2c" - - compatible : "marvell,mv64360-i2c" - - reg : Offset and length of the register set for this device - - interrupts : <a> where a is the interrupt number for the I2C. - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery I2C node: - compatible = "marvell,mv64360-i2c"; - reg = <0xc000 0x20>; - virtual-reg = <0xf100c000>; - interrupts = <37>; - interrupt-parent = <&PIC>; - }; - - - l) Marvell Discovery PIC (Programmable Interrupt Controller) nodes - - Represent the Discovery's PIC hardware - - Required properties: - - #interrupt-cells : <1> - - #address-cells : <0> - - compatible : "marvell,mv64360-pic" - - reg : Offset and length of the register set for this device - - interrupt-controller - - Example Discovery PIC node: - pic { - #interrupt-cells = <1>; - #address-cells = <0>; - compatible = "marvell,mv64360-pic"; - reg = <0x0 0x88>; - interrupt-controller; - }; - - - m) Marvell Discovery MPP (Multipurpose Pins) multiplexing nodes - - Represent the Discovery's MPP hardware - - Required properties: - - compatible : "marvell,mv64360-mpp" - - reg : Offset and length of the register set for this device - - Example Discovery MPP node: - mpp@f000 { - compatible = "marvell,mv64360-mpp"; - reg = <0xf000 0x10>; - }; - - - n) Marvell Discovery GPP (General Purpose Pins) nodes - - Represent the Discovery's GPP hardware - - Required properties: - - compatible : "marvell,mv64360-gpp" - - reg : Offset and length of the register set for this device - - Example Discovery GPP node: - gpp@f000 { - compatible = "marvell,mv64360-gpp"; - reg = <0xf100 0x20>; - }; - - - o) Marvell Discovery PCI host bridge node - - Represents the Discovery's PCI host bridge device. The properties - for this node conform to Rev 2.1 of the PCI Bus Binding to IEEE - 1275-1994. A typical value for the compatible property is - "marvell,mv64360-pci". - - Example Discovery PCI host bridge node - pci@80000000 { - #address-cells = <3>; - #size-cells = <2>; - #interrupt-cells = <1>; - device_type = "pci"; - compatible = "marvell,mv64360-pci"; - reg = <0xcf8 0x8>; - ranges = <0x01000000 0x0 0x0 - 0x88000000 0x0 0x01000000 - 0x02000000 0x0 0x80000000 - 0x80000000 0x0 0x08000000>; - bus-range = <0 255>; - clock-frequency = <66000000>; - interrupt-parent = <&PIC>; - interrupt-map-mask = <0xf800 0x0 0x0 0x7>; - interrupt-map = < - /* IDSEL 0x0a */ - 0x5000 0 0 1 &PIC 80 - 0x5000 0 0 2 &PIC 81 - 0x5000 0 0 3 &PIC 91 - 0x5000 0 0 4 &PIC 93 - - /* IDSEL 0x0b */ - 0x5800 0 0 1 &PIC 91 - 0x5800 0 0 2 &PIC 93 - 0x5800 0 0 3 &PIC 80 - 0x5800 0 0 4 &PIC 81 - - /* IDSEL 0x0c */ - 0x6000 0 0 1 &PIC 91 - 0x6000 0 0 2 &PIC 93 - 0x6000 0 0 3 &PIC 80 - 0x6000 0 0 4 &PIC 81 - - /* IDSEL 0x0d */ - 0x6800 0 0 1 &PIC 93 - 0x6800 0 0 2 &PIC 80 - 0x6800 0 0 3 &PIC 81 - 0x6800 0 0 4 &PIC 91 - >; - }; - - - p) Marvell Discovery CPU Error nodes - - Represent the Discovery's CPU error handler device. - - Required properties: - - compatible : "marvell,mv64360-cpu-error" - - reg : Offset and length of the register set for this device - - interrupts : the interrupt number for this device - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery CPU Error node: - cpu-error@0070 { - compatible = "marvell,mv64360-cpu-error"; - reg = <0x70 0x10 0x128 0x28>; - interrupts = <3>; - interrupt-parent = <&PIC>; - }; - - - q) Marvell Discovery SRAM Controller nodes - - Represent the Discovery's SRAM controller device. - - Required properties: - - compatible : "marvell,mv64360-sram-ctrl" - - reg : Offset and length of the register set for this device - - interrupts : the interrupt number for this device - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery SRAM Controller node: - sram-ctrl@0380 { - compatible = "marvell,mv64360-sram-ctrl"; - reg = <0x380 0x80>; - interrupts = <13>; - interrupt-parent = <&PIC>; - }; - - - r) Marvell Discovery PCI Error Handler nodes - - Represent the Discovery's PCI error handler device. - - Required properties: - - compatible : "marvell,mv64360-pci-error" - - reg : Offset and length of the register set for this device - - interrupts : the interrupt number for this device - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery PCI Error Handler node: - pci-error@1d40 { - compatible = "marvell,mv64360-pci-error"; - reg = <0x1d40 0x40 0xc28 0x4>; - interrupts = <12>; - interrupt-parent = <&PIC>; - }; - - - s) Marvell Discovery Memory Controller nodes - - Represent the Discovery's memory controller device. - - Required properties: - - compatible : "marvell,mv64360-mem-ctrl" - - reg : Offset and length of the register set for this device - - interrupts : the interrupt number for this device - - interrupt-parent : the phandle for the interrupt controller - that services interrupts for this device. - - Example Discovery Memory Controller node: - mem-ctrl@1400 { - compatible = "marvell,mv64360-mem-ctrl"; - reg = <0x1400 0x60>; - interrupts = <17>; - interrupt-parent = <&PIC>; - }; - - -VIII - Specifying interrupt information for devices +VII - Specifying interrupt information for devices =================================================== The device tree represents the busses and devices of a hardware @@ -2439,56 +1324,7 @@ encodings listed below: 2 = high to low edge sensitive type enabled 3 = low to high edge sensitive type enabled -IX - Specifying GPIO information for devices -============================================ - -1) gpios property ------------------ - -Nodes that makes use of GPIOs should define them using `gpios' property, -format of which is: <&gpio-controller1-phandle gpio1-specifier - &gpio-controller2-phandle gpio2-specifier - 0 /* holes are permitted, means no GPIO 3 */ - &gpio-controller4-phandle gpio4-specifier - ...>; - -Note that gpio-specifier length is controller dependent. - -gpio-specifier may encode: bank, pin position inside the bank, -whether pin is open-drain and whether pin is logically inverted. - -Example of the node using GPIOs: - - node { - gpios = <&qe_pio_e 18 0>; - }; - -In this example gpio-specifier is "18 0" and encodes GPIO pin number, -and empty GPIO flags as accepted by the "qe_pio_e" gpio-controller. - -2) gpio-controller nodes ------------------------- - -Every GPIO controller node must have #gpio-cells property defined, -this information will be used to translate gpio-specifiers. - -Example of two SOC GPIO banks defined as gpio-controller nodes: - - qe_pio_a: gpio-controller@1400 { - #gpio-cells = <2>; - compatible = "fsl,qe-pario-bank-a", "fsl,qe-pario-bank"; - reg = <0x1400 0x18>; - gpio-controller; - }; - - qe_pio_e: gpio-controller@1460 { - #gpio-cells = <2>; - compatible = "fsl,qe-pario-bank-e", "fsl,qe-pario-bank"; - reg = <0x1460 0x18>; - gpio-controller; - }; - -X - Specifying Device Power Management Information (sleep property) +VIII - Specifying Device Power Management Information (sleep property) =================================================================== Devices on SOCs often have mechanisms for placing devices into low-power diff --git a/Documentation/powerpc/dts-bindings/4xx/emac.txt b/Documentation/powerpc/dts-bindings/4xx/emac.txt new file mode 100644 index 00000000000..2161334a7ca --- /dev/null +++ b/Documentation/powerpc/dts-bindings/4xx/emac.txt @@ -0,0 +1,148 @@ + 4xx/Axon EMAC ethernet nodes + + The EMAC ethernet controller in IBM and AMCC 4xx chips, and also + the Axon bridge. To operate this needs to interact with a ths + special McMAL DMA controller, and sometimes an RGMII or ZMII + interface. In addition to the nodes and properties described + below, the node for the OPB bus on which the EMAC sits must have a + correct clock-frequency property. + + i) The EMAC node itself + + Required properties: + - device_type : "network" + + - compatible : compatible list, contains 2 entries, first is + "ibm,emac-CHIP" where CHIP is the host ASIC (440gx, + 405gp, Axon) and second is either "ibm,emac" or + "ibm,emac4". For Axon, thus, we have: "ibm,emac-axon", + "ibm,emac4" + - interrupts : <interrupt mapping for EMAC IRQ and WOL IRQ> + - interrupt-parent : optional, if needed for interrupt mapping + - reg : <registers mapping> + - local-mac-address : 6 bytes, MAC address + - mal-device : phandle of the associated McMAL node + - mal-tx-channel : 1 cell, index of the tx channel on McMAL associated + with this EMAC + - mal-rx-channel : 1 cell, index of the rx channel on McMAL associated + with this EMAC + - cell-index : 1 cell, hardware index of the EMAC cell on a given + ASIC (typically 0x0 and 0x1 for EMAC0 and EMAC1 on + each Axon chip) + - max-frame-size : 1 cell, maximum frame size supported in bytes + - rx-fifo-size : 1 cell, Rx fifo size in bytes for 10 and 100 Mb/sec + operations. + For Axon, 2048 + - tx-fifo-size : 1 cell, Tx fifo size in bytes for 10 and 100 Mb/sec + operations. + For Axon, 2048. + - fifo-entry-size : 1 cell, size of a fifo entry (used to calculate + thresholds). + For Axon, 0x00000010 + - mal-burst-size : 1 cell, MAL burst size (used to calculate thresholds) + in bytes. + For Axon, 0x00000100 (I think ...) + - phy-mode : string, mode of operations of the PHY interface. + Supported values are: "mii", "rmii", "smii", "rgmii", + "tbi", "gmii", rtbi", "sgmii". + For Axon on CAB, it is "rgmii" + - mdio-device : 1 cell, required iff using shared MDIO registers + (440EP). phandle of the EMAC to use to drive the + MDIO lines for the PHY used by this EMAC. + - zmii-device : 1 cell, required iff connected to a ZMII. phandle of + the ZMII device node + - zmii-channel : 1 cell, required iff connected to a ZMII. Which ZMII + channel or 0xffffffff if ZMII is only used for MDIO. + - rgmii-device : 1 cell, required iff connected to an RGMII. phandle + of the RGMII device node. + For Axon: phandle of plb5/plb4/opb/rgmii + - rgmii-channel : 1 cell, required iff connected to an RGMII. Which + RGMII channel is used by this EMAC. + Fox Axon: present, whatever value is appropriate for each + EMAC, that is the content of the current (bogus) "phy-port" + property. + + Optional properties: + - phy-address : 1 cell, optional, MDIO address of the PHY. If absent, + a search is performed. + - phy-map : 1 cell, optional, bitmap of addresses to probe the PHY + for, used if phy-address is absent. bit 0x00000001 is + MDIO address 0. + For Axon it can be absent, though my current driver + doesn't handle phy-address yet so for now, keep + 0x00ffffff in it. + - rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec + operations (if absent the value is the same as + rx-fifo-size). For Axon, either absent or 2048. + - tx-fifo-size-gige : 1 cell, Tx fifo size in bytes for 1000 Mb/sec + operations (if absent the value is the same as + tx-fifo-size). For Axon, either absent or 2048. + - tah-device : 1 cell, optional. If connected to a TAH engine for + offload, phandle of the TAH device node. + - tah-channel : 1 cell, optional. If appropriate, channel used on the + TAH engine. + + Example: + + EMAC0: ethernet@40000800 { + device_type = "network"; + compatible = "ibm,emac-440gp", "ibm,emac"; + interrupt-parent = <&UIC1>; + interrupts = <1c 4 1d 4>; + reg = <40000800 70>; + local-mac-address = [00 04 AC E3 1B 1E]; + mal-device = <&MAL0>; + mal-tx-channel = <0 1>; + mal-rx-channel = <0>; + cell-index = <0>; + max-frame-size = <5dc>; + rx-fifo-size = <1000>; + tx-fifo-size = <800>; + phy-mode = "rmii"; + phy-map = <00000001>; + zmii-device = <&ZMII0>; + zmii-channel = <0>; + }; + + ii) McMAL node + + Required properties: + - device_type : "dma-controller" + - compatible : compatible list, containing 2 entries, first is + "ibm,mcmal-CHIP" where CHIP is the host ASIC (like + emac) and the second is either "ibm,mcmal" or + "ibm,mcmal2". + For Axon, "ibm,mcmal-axon","ibm,mcmal2" + - interrupts : <interrupt mapping for the MAL interrupts sources: + 5 sources: tx_eob, rx_eob, serr, txde, rxde>. + For Axon: This is _different_ from the current + firmware. We use the "delayed" interrupts for txeob + and rxeob. Thus we end up with mapping those 5 MPIC + interrupts, all level positive sensitive: 10, 11, 32, + 33, 34 (in decimal) + - dcr-reg : < DCR registers range > + - dcr-parent : if needed for dcr-reg + - num-tx-chans : 1 cell, number of Tx channels + - num-rx-chans : 1 cell, number of Rx channels + + iii) ZMII node + + Required properties: + - compatible : compatible list, containing 2 entries, first is + "ibm,zmii-CHIP" where CHIP is the host ASIC (like + EMAC) and the second is "ibm,zmii". + For Axon, there is no ZMII node. + - reg : <registers mapping> + + iv) RGMII node + + Required properties: + - compatible : compatible list, containing 2 entries, first is + "ibm,rgmii-CHIP" where CHIP is the host ASIC (like + EMAC) and the second is "ibm,rgmii". + For Axon, "ibm,rgmii-axon","ibm,rgmii" + - reg : <registers mapping> + - revision : as provided by the RGMII new version register if + available. + For Axon: 0x0000012a + diff --git a/Documentation/powerpc/dts-bindings/fsl/esdhc.txt b/Documentation/powerpc/dts-bindings/fsl/esdhc.txt index 5093ddf900d..3ed3797b508 100644 --- a/Documentation/powerpc/dts-bindings/fsl/esdhc.txt +++ b/Documentation/powerpc/dts-bindings/fsl/esdhc.txt @@ -10,6 +10,8 @@ Required properties: - interrupts : should contain eSDHC interrupt. - interrupt-parent : interrupt source phandle. - clock-frequency : specifies eSDHC base clock frequency. + - sdhci,1-bit-only : (optional) specifies that a controller can + only handle 1-bit data transfers. Example: diff --git a/Documentation/powerpc/dts-bindings/gpio/gpio.txt b/Documentation/powerpc/dts-bindings/gpio/gpio.txt new file mode 100644 index 00000000000..edaa84d288a --- /dev/null +++ b/Documentation/powerpc/dts-bindings/gpio/gpio.txt @@ -0,0 +1,50 @@ +Specifying GPIO information for devices +============================================ + +1) gpios property +----------------- + +Nodes that makes use of GPIOs should define them using `gpios' property, +format of which is: <&gpio-controller1-phandle gpio1-specifier + &gpio-controller2-phandle gpio2-specifier + 0 /* holes are permitted, means no GPIO 3 */ + &gpio-controller4-phandle gpio4-specifier + ...>; + +Note that gpio-specifier length is controller dependent. + +gpio-specifier may encode: bank, pin position inside the bank, +whether pin is open-drain and whether pin is logically inverted. + +Example of the node using GPIOs: + + node { + gpios = <&qe_pio_e 18 0>; + }; + +In this example gpio-specifier is "18 0" and encodes GPIO pin number, +and empty GPIO flags as accepted by the "qe_pio_e" gpio-controller. + +2) gpio-controller nodes +------------------------ + +Every GPIO controller node must have #gpio-cells property defined, +this information will be used to translate gpio-specifiers. + +Example of two SOC GPIO banks defined as gpio-controller nodes: + + qe_pio_a: gpio-controller@1400 { + #gpio-cells = <2>; + compatible = "fsl,qe-pario-bank-a", "fsl,qe-pario-bank"; + reg = <0x1400 0x18>; + gpio-controller; + }; + + qe_pio_e: gpio-controller@1460 { + #gpio-cells = <2>; + compatible = "fsl,qe-pario-bank-e", "fsl,qe-pario-bank"; + reg = <0x1460 0x18>; + gpio-controller; + }; + + diff --git a/Documentation/powerpc/dts-bindings/gpio/led.txt b/Documentation/powerpc/dts-bindings/gpio/led.txt index 4fe14deedc0..064db928c3c 100644 --- a/Documentation/powerpc/dts-bindings/gpio/led.txt +++ b/Documentation/powerpc/dts-bindings/gpio/led.txt @@ -16,10 +16,17 @@ LED sub-node properties: string defining the trigger assigned to the LED. Current triggers are: "backlight" - LED will act as a back-light, controlled by the framebuffer system - "default-on" - LED will turn on + "default-on" - LED will turn on, but see "default-state" below "heartbeat" - LED "double" flashes at a load average based rate "ide-disk" - LED indicates disk activity "timer" - LED flashes at a fixed, configurable rate +- default-state: (optional) The initial state of the LED. Valid + values are "on", "off", and "keep". If the LED is already on or off + and the default-state property is set the to same value, then no + glitch should be produced where the LED momentarily turns off (or + on). The "keep" setting will keep the LED at whatever its current + state is, without producing a glitch. The default is off if this + property is not present. Examples: @@ -30,14 +37,22 @@ leds { gpios = <&mcu_pio 0 1>; /* Active low */ linux,default-trigger = "ide-disk"; }; + + fault { + gpios = <&mcu_pio 1 0>; + /* Keep LED on if BIOS detected hardware fault */ + default-state = "keep"; + }; }; run-control { compatible = "gpio-leds"; red { gpios = <&mpc8572 6 0>; + default-state = "off"; }; green { gpios = <&mpc8572 7 0>; + default-state = "on"; }; } diff --git a/Documentation/powerpc/dts-bindings/gpio/mdio.txt b/Documentation/powerpc/dts-bindings/gpio/mdio.txt new file mode 100644 index 00000000000..bc954952901 --- /dev/null +++ b/Documentation/powerpc/dts-bindings/gpio/mdio.txt @@ -0,0 +1,19 @@ +MDIO on GPIOs + +Currently defined compatibles: +- virtual,gpio-mdio + +MDC and MDIO lines connected to GPIO controllers are listed in the +gpios property as described in section VIII.1 in the following order: + +MDC, MDIO. + +Example: + +mdio { + compatible = "virtual,mdio-gpio"; + #address-cells = <1>; + #size-cells = <0>; + gpios = <&qe_pio_a 11 + &qe_pio_c 6>; +}; diff --git a/Documentation/powerpc/dts-bindings/marvell.txt b/Documentation/powerpc/dts-bindings/marvell.txt new file mode 100644 index 00000000000..3708a2fd474 --- /dev/null +++ b/Documentation/powerpc/dts-bindings/marvell.txt @@ -0,0 +1,521 @@ +Marvell Discovery mv64[345]6x System Controller chips +=========================================================== + +The Marvell mv64[345]60 series of system controller chips contain +many of the peripherals needed to implement a complete computer +system. In this section, we define device tree nodes to describe +the system controller chip itself and each of the peripherals +which it contains. Compatible string values for each node are +prefixed with the string "marvell,", for Marvell Technology Group Ltd. + +1) The /system-controller node + + This node is used to represent the system-controller and must be + present when the system uses a system controller chip. The top-level + system-controller node contains information that is global to all + devices within the system controller chip. The node name begins + with "system-controller" followed by the unit address, which is + the base address of the memory-mapped register set for the system + controller chip. + + Required properties: + + - ranges : Describes the translation of system controller addresses + for memory mapped registers. + - clock-frequency: Contains the main clock frequency for the system + controller chip. + - reg : This property defines the address and size of the + memory-mapped registers contained within the system controller + chip. The address specified in the "reg" property should match + the unit address of the system-controller node. + - #address-cells : Address representation for system controller + devices. This field represents the number of cells needed to + represent the address of the memory-mapped registers of devices + within the system controller chip. + - #size-cells : Size representation for for the memory-mapped + registers within the system controller chip. + - #interrupt-cells : Defines the width of cells used to represent + interrupts. + + Optional properties: + + - model : The specific model of the system controller chip. Such + as, "mv64360", "mv64460", or "mv64560". + - compatible : A string identifying the compatibility identifiers + of the system controller chip. + + The system-controller node contains child nodes for each system + controller device that the platform uses. Nodes should not be created + for devices which exist on the system controller chip but are not used + + Example Marvell Discovery mv64360 system-controller node: + + system-controller@f1000000 { /* Marvell Discovery mv64360 */ + #address-cells = <1>; + #size-cells = <1>; + model = "mv64360"; /* Default */ + compatible = "marvell,mv64360"; + clock-frequency = <133333333>; + reg = <0xf1000000 0x10000>; + virtual-reg = <0xf1000000>; + ranges = <0x88000000 0x88000000 0x1000000 /* PCI 0 I/O Space */ + 0x80000000 0x80000000 0x8000000 /* PCI 0 MEM Space */ + 0xa0000000 0xa0000000 0x4000000 /* User FLASH */ + 0x00000000 0xf1000000 0x0010000 /* Bridge's regs */ + 0xf2000000 0xf2000000 0x0040000>;/* Integrated SRAM */ + + [ child node definitions... ] + } + +2) Child nodes of /system-controller + + a) Marvell Discovery MDIO bus + + The MDIO is a bus to which the PHY devices are connected. For each + device that exists on this bus, a child node should be created. See + the definition of the PHY node below for an example of how to define + a PHY. + + Required properties: + - #address-cells : Should be <1> + - #size-cells : Should be <0> + - device_type : Should be "mdio" + - compatible : Should be "marvell,mv64360-mdio" + + Example: + + mdio { + #address-cells = <1>; + #size-cells = <0>; + device_type = "mdio"; + compatible = "marvell,mv64360-mdio"; + + ethernet-phy@0 { + ...... + }; + }; + + + b) Marvell Discovery ethernet controller + + The Discover ethernet controller is described with two levels + of nodes. The first level describes an ethernet silicon block + and the second level describes up to 3 ethernet nodes within + that block. The reason for the multiple levels is that the + registers for the node are interleaved within a single set + of registers. The "ethernet-block" level describes the + shared register set, and the "ethernet" nodes describe ethernet + port-specific properties. + + Ethernet block node + + Required properties: + - #address-cells : <1> + - #size-cells : <0> + - compatible : "marvell,mv64360-eth-block" + - reg : Offset and length of the register set for this block + + Example Discovery Ethernet block node: + ethernet-block@2000 { + #address-cells = <1>; + #size-cells = <0>; + compatible = "marvell,mv64360-eth-block"; + reg = <0x2000 0x2000>; + ethernet@0 { + ....... + }; + }; + + Ethernet port node + + Required properties: + - device_type : Should be "network". + - compatible : Should be "marvell,mv64360-eth". + - reg : Should be <0>, <1>, or <2>, according to which registers + within the silicon block the device uses. + - interrupts : <a> where a is the interrupt number for the port. + - interrupt-parent : the phandle for the interrupt controller + that services interrupts for this device. + - phy : the phandle for the PHY connected to this ethernet + controller. + - local-mac-address : 6 bytes, MAC address + + Example Discovery Ethernet port node: + ethernet@0 { + device_type = "network"; + compatible = "marvell,mv64360-eth"; + reg = <0>; + interrupts = <32>; + interrupt-parent = <&PIC>; + phy = <&PHY0>; + local-mac-address = [ 00 00 00 00 00 00 ]; + }; + + + + c) Marvell Discovery PHY nodes + + Required properties: + - device_type : Should be "ethernet-phy" + - interrupts : <a> where a is the interrupt number for this phy. + - interrupt-parent : the phandle for the interrupt controller that + services interrupts for this device. + - reg : The ID number for the phy, usually a small integer + + Example Discovery PHY node: + ethernet-phy@1 { + device_type = "ethernet-phy"; + compatible = "broadcom,bcm5421"; + interrupts = <76>; /* GPP 12 */ + interrupt-parent = <&PIC>; + reg = <1>; + }; + + + d) Marvell Discovery SDMA nodes + + Represent DMA hardware associated with the MPSC (multiprotocol + serial controllers). + + Required properties: + - compatible : "marvell,mv64360-sdma" + - reg : Offset and length of the register set for this device + - interrupts : <a> where a is the interrupt number for the DMA + device. + - interrupt-parent : the phandle for the interrupt controller + that services interrupts for this device. + + Example Discovery SDMA node: + sdma@4000 { + compatible = "marvell,mv64360-sdma"; + reg = <0x4000 0xc18>; + virtual-reg = <0xf1004000>; + interrupts = <36>; + interrupt-parent = <&PIC>; + }; + + + e) Marvell Discovery BRG nodes + + Represent baud rate generator hardware associated with the MPSC + (multiprotocol serial controllers). + + Required properties: + - compatible : "marvell,mv64360-brg" + - reg : Offset and length of the register set for this device + - clock-src : A value from 0 to 15 which selects the clock + source for the baud rate generator. This value corresponds + to the CLKS value in the BRGx configuration register. See + the mv64x60 User's Manual. + - clock-frequence : The frequency (in Hz) of the baud rate + generator's input clock. + - current-speed : The current speed setting (presumably by + firmware) of the baud rate generator. + + Example Discovery BRG node: + brg@b200 { + compatible = "marvell,mv64360-brg"; + reg = <0xb200 0x8>; + clock-src = <8>; + clock-frequency = <133333333>; + current-speed = <9600>; + }; + + + f) Marvell Discovery CUNIT nodes + + Represent the Serial Communications Unit device hardware. + + Required properties: + - reg : Offset and length of the register set for this device + + Example Discovery CUNIT node: + cunit@f200 { + reg = <0xf200 0x200>; + }; + + + g) Marvell Discovery MPSCROUTING nodes + + Represent the Discovery's MPSC routing hardware + + Required properties: + - reg : Offset and length of the register set for this device + + Example Discovery CUNIT node: + mpscrouting@b500 { + reg = <0xb400 0xc>; + }; + + + h) Marvell Discovery MPSCINTR nodes + + Represent the Discovery's MPSC DMA interrupt hardware registers + (SDMA cause and mask registers). + + Required properties: + - reg : Offset and length of the register set for this device + + Example Discovery MPSCINTR node: + mpsintr@b800 { + reg = <0xb800 0x100>; + }; + + + i) Marvell Discovery MPSC nodes + + Represent the Discovery's MPSC (Multiprotocol Serial Controller) + serial port. + + Required properties: + - device_type : "serial" + - compatible : "marvell,mv64360-mpsc" + - reg : Offset and length of the register set for this device + - sdma : the phandle for the SDMA node used by this port + - brg : the phandle for the BRG node used by this port + - cunit : the phandle for the CUNIT node used by this port + - mpscrouting : the phandle for the MPSCROUTING node used by this port + - mpscintr : the phandle for the MPSCINTR node used by this port + - cell-index : the hardware index of this cell in the MPSC core + - max_idle : value needed for MPSC CHR3 (Maximum Frame Length) + register + - interrupts : <a> where a is the interrupt number for the MPSC. + - interrupt-parent : the phandle for the interrupt controller + that services interrupts for this device. + + Example Discovery MPSCINTR node: + mpsc@8000 { + device_type = "serial"; + compatible = "marvell,mv64360-mpsc"; + reg = <0x8000 0x38>; + virtual-reg = <0xf1008000>; + sdma = <&SDMA0>; + brg = <&BRG0>; + cunit = <&CUNIT>; + mpscrouting = <&MPSCROUTING>; + mpscintr = <&MPSCINTR>; + cell-index = <0>; + max_idle = <40>; + interrupts = <40>; + interrupt-parent = <&PIC>; + }; + + + j) Marvell Discovery Watch Dog Timer nodes + + Represent the Discovery's watchdog timer hardware + + Required properties: + - compatible : "marvell,mv64360-wdt" + - reg : Offset and length of the register set for this device + + Example Discovery Watch Dog Timer node: + wdt@b410 { + compatible = "marvell,mv64360-wdt"; + reg = <0xb410 0x8>; + }; + + + k) Marvell Discovery I2C nodes + + Represent the Discovery's I2C hardware + + Required properties: + - device_type : "i2c" + - compatible : "marvell,mv64360-i2c" + - reg : Offset and length of the register set for this device + - interrupts : <a> where a is the interrupt number for the I2C. + - interrupt-parent : the phandle for the interrupt controller + that services interrupts for this device. + + Example Discovery I2C node: + compatible = "marvell,mv64360-i2c"; + reg = <0xc000 0x20>; + virtual-reg = <0xf100c000>; + interrupts = <37>; + interrupt-parent = <&PIC>; + }; + + + l) Marvell Discovery PIC (Programmable Interrupt Controller) nodes + + Represent the Discovery's PIC hardware + + Required properties: + - #interrupt-cells : <1> + - #address-cells : <0> + - compatible : "marvell,mv64360-pic" + - reg : Offset and length of the register set for this device + - interrupt-controller + + Example Discovery PIC node: + pic { + #interrupt-cells = <1>; + #address-cells = <0>; + compatible = "marvell,mv64360-pic"; + reg = <0x0 0x88>; + interrupt-controller; + }; + + + m) Marvell Discovery MPP (Multipurpose Pins) multiplexing nodes + + Represent the Discovery's MPP hardware + + Required properties: + - compatible : "marvell,mv64360-mpp" + - reg : Offset and length of the register set for this device + + Example Discovery MPP node: + mpp@f000 { + compatible = "marvell,mv64360-mpp"; + reg = <0xf000 0x10>; + }; + + + n) Marvell Discovery GPP (General Purpose Pins) nodes + + Represent the Discovery's GPP hardware + + Required properties: + - compatible : "marvell,mv64360-gpp" + - reg : Offset and length of the register set for this device + + Example Discovery GPP node: + gpp@f000 { + compatible = "marvell,mv64360-gpp"; + reg = <0xf100 0x20>; + }; + + + o) Marvell Discovery PCI host bridge node + + Represents the Discovery's PCI host bridge device. The properties + for this node conform to Rev 2.1 of the PCI Bus Binding to IEEE + 1275-1994. A typical value for the compatible property is + "marvell,mv64360-pci". + + Example Discovery PCI host bridge node + pci@80000000 { + #address-cells = <3>; + #size-cells = <2>; + #interrupt-cells = <1>; + device_type = "pci"; + compatible = "marvell,mv64360-pci"; + reg = <0xcf8 0x8>; + ranges = <0x01000000 0x0 0x0 + 0x88000000 0x0 0x01000000 + 0x02000000 0x0 0x80000000 + 0x80000000 0x0 0x08000000>; + bus-range = <0 255>; + clock-frequency = <66000000>; + interrupt-parent = <&PIC>; + interrupt-map-mask = <0xf800 0x0 0x0 0x7>; + interrupt-map = < + /* IDSEL 0x0a */ + 0x5000 0 0 1 &PIC 80 + 0x5000 0 0 2 &PIC 81 + 0x5000 0 0 3 &PIC 91 + 0x5000 0 0 4 &PIC 93 + + /* IDSEL 0x0b */ + 0x5800 0 0 1 &PIC 91 + 0x5800 0 0 2 &PIC 93 + 0x5800 0 0 3 &PIC 80 + 0x5800 0 0 4 &PIC 81 + + /* IDSEL 0x0c */ + 0x6000 0 0 1 &PIC 91 + 0x6000 0 0 2 &PIC 93 + 0x6000 0 0 3 &PIC 80 + 0x6000 0 0 4 &PIC 81 + + /* IDSEL 0x0d */ + 0x6800 0 0 1 &PIC 93 + 0x6800 0 0 2 &PIC 80 + 0x6800 0 0 3 &PIC 81 + 0x6800 0 0 4 &PIC 91 + >; + }; + + + p) Marvell Discovery CPU Error nodes + + Represent the Discovery's CPU error handler device. + + Required properties: + - compatible : "marvell,mv64360-cpu-error" + - reg : Offset and length of the register set for this device + - interrupts : the interrupt number for this device + - interrupt-parent : the phandle for the interrupt controller + that services interrupts for this device. + + Example Discovery CPU Error node: + cpu-error@0070 { + compatible = "marvell,mv64360-cpu-error"; + reg = <0x70 0x10 0x128 0x28>; + interrupts = <3>; + interrupt-parent = <&PIC>; + }; + + + q) Marvell Discovery SRAM Controller nodes + + Represent the Discovery's SRAM controller device. + + Required properties: + - compatible : "marvell,mv64360-sram-ctrl" + - reg : Offset and length of the register set for this device + - interrupts : the interrupt number for this device + - interrupt-parent : the phandle for the interrupt controller + that services interrupts for this device. + + Example Discovery SRAM Controller node: + sram-ctrl@0380 { + compatible = "marvell,mv64360-sram-ctrl"; + reg = <0x380 0x80>; + interrupts = <13>; + interrupt-parent = <&PIC>; + }; + + + r) Marvell Discovery PCI Error Handler nodes + + Represent the Discovery's PCI error handler device. + + Required properties: + - compatible : "marvell,mv64360-pci-error" + - reg : Offset and length of the register set for this device + - interrupts : the interrupt number for this device + - interrupt-parent : the phandle for the interrupt controller + that services interrupts for this device. + + Example Discovery PCI Error Handler node: + pci-error@1d40 { + compatible = "marvell,mv64360-pci-error"; + reg = <0x1d40 0x40 0xc28 0x4>; + interrupts = <12>; + interrupt-parent = <&PIC>; + }; + + + s) Marvell Discovery Memory Controller nodes + + Represent the Discovery's memory controller device. + + Required properties: + - compatible : "marvell,mv64360-mem-ctrl" + - reg : Offset and length of the register set for this device + - interrupts : the interrupt number for this device + - interrupt-parent : the phandle for the interrupt controller + that services interrupts for this device. + + Example Discovery Memory Controller node: + mem-ctrl@1400 { + compatible = "marvell,mv64360-mem-ctrl"; + reg = <0x1400 0x60>; + interrupts = <17>; + interrupt-parent = <&PIC>; + }; + + diff --git a/Documentation/powerpc/dts-bindings/phy.txt b/Documentation/powerpc/dts-bindings/phy.txt new file mode 100644 index 00000000000..bb8c742eb8c --- /dev/null +++ b/Documentation/powerpc/dts-bindings/phy.txt @@ -0,0 +1,25 @@ +PHY nodes + +Required properties: + + - device_type : Should be "ethernet-phy" + - interrupts : <a b> where a is the interrupt number and b is a + field that represents an encoding of the sense and level + information for the interrupt. This should be encoded based on + the information in section 2) depending on the type of interrupt + controller you have. + - interrupt-parent : the phandle for the interrupt controller that + services interrupts for this device. + - reg : The ID number for the phy, usually a small integer + - linux,phandle : phandle for this node; likely referenced by an + ethernet controller node. + +Example: + +ethernet-phy@0 { + linux,phandle = <2452000> + interrupt-parent = <40000>; + interrupts = <35 1>; + reg = <0>; + device_type = "ethernet-phy"; +}; diff --git a/Documentation/powerpc/dts-bindings/spi-bus.txt b/Documentation/powerpc/dts-bindings/spi-bus.txt new file mode 100644 index 00000000000..e782add2e45 --- /dev/null +++ b/Documentation/powerpc/dts-bindings/spi-bus.txt @@ -0,0 +1,57 @@ +SPI (Serial Peripheral Interface) busses + +SPI busses can be described with a node for the SPI master device +and a set of child nodes for each SPI slave on the bus. For this +discussion, it is assumed that the system's SPI controller is in +SPI master mode. This binding does not describe SPI controllers +in slave mode. + +The SPI master node requires the following properties: +- #address-cells - number of cells required to define a chip select + address on the SPI bus. +- #size-cells - should be zero. +- compatible - name of SPI bus controller following generic names + recommended practice. +No other properties are required in the SPI bus node. It is assumed +that a driver for an SPI bus device will understand that it is an SPI bus. +However, the binding does not attempt to define the specific method for +assigning chip select numbers. Since SPI chip select configuration is +flexible and non-standardized, it is left out of this binding with the +assumption that board specific platform code will be used to manage +chip selects. Individual drivers can define additional properties to +support describing the chip select layout. + +SPI slave nodes must be children of the SPI master node and can +contain the following properties. +- reg - (required) chip select address of device. +- compatible - (required) name of SPI device following generic names + recommended practice +- spi-max-frequency - (required) Maximum SPI clocking speed of device in Hz +- spi-cpol - (optional) Empty property indicating device requires + inverse clock polarity (CPOL) mode +- spi-cpha - (optional) Empty property indicating device requires + shifted clock phase (CPHA) mode +- spi-cs-high - (optional) Empty property indicating device requires + chip select active high + +SPI example for an MPC5200 SPI bus: + spi@f00 { + #address-cells = <1>; + #size-cells = <0>; + compatible = "fsl,mpc5200b-spi","fsl,mpc5200-spi"; + reg = <0xf00 0x20>; + interrupts = <2 13 0 2 14 0>; + interrupt-parent = <&mpc5200_pic>; + + ethernet-switch@0 { + compatible = "micrel,ks8995m"; + spi-max-frequency = <1000000>; + reg = <0>; + }; + + codec@1 { + compatible = "ti,tlv320aic26"; + spi-max-frequency = <100000>; + reg = <1>; + }; + }; diff --git a/Documentation/powerpc/dts-bindings/usb-ehci.txt b/Documentation/powerpc/dts-bindings/usb-ehci.txt new file mode 100644 index 00000000000..fa18612f757 --- /dev/null +++ b/Documentation/powerpc/dts-bindings/usb-ehci.txt @@ -0,0 +1,25 @@ +USB EHCI controllers + +Required properties: + - compatible : should be "usb-ehci". + - reg : should contain at least address and length of the standard EHCI + register set for the device. Optional platform-dependent registers + (debug-port or other) can be also specified here, but only after + definition of standard EHCI registers. + - interrupts : one EHCI interrupt should be described here. +If device registers are implemented in big endian mode, the device +node should have "big-endian-regs" property. +If controller implementation operates with big endian descriptors, +"big-endian-desc" property should be specified. +If both big endian registers and descriptors are used by the controller +implementation, "big-endian" property can be specified instead of having +both "big-endian-regs" and "big-endian-desc". + +Example (Sequoia 440EPx): + ehci@e0000300 { + compatible = "ibm,usb-ehci-440epx", "usb-ehci"; + interrupt-parent = <&UIC0>; + interrupts = <1a 4>; + reg = <0 e0000300 90 0 e0000390 70>; + big-endian; + }; diff --git a/Documentation/powerpc/dts-bindings/xilinx.txt b/Documentation/powerpc/dts-bindings/xilinx.txt new file mode 100644 index 00000000000..80339fe4300 --- /dev/null +++ b/Documentation/powerpc/dts-bindings/xilinx.txt @@ -0,0 +1,295 @@ + d) Xilinx IP cores + + The Xilinx EDK toolchain ships with a set of IP cores (devices) for use + in Xilinx Spartan and Virtex FPGAs. The devices cover the whole range + of standard device types (network, serial, etc.) and miscellaneous + devices (gpio, LCD, spi, etc). Also, since these devices are + implemented within the fpga fabric every instance of the device can be + synthesised with different options that change the behaviour. + + Each IP-core has a set of parameters which the FPGA designer can use to + control how the core is synthesized. Historically, the EDK tool would + extract the device parameters relevant to device drivers and copy them + into an 'xparameters.h' in the form of #define symbols. This tells the + device drivers how the IP cores are configured, but it requres the kernel + to be recompiled every time the FPGA bitstream is resynthesized. + + The new approach is to export the parameters into the device tree and + generate a new device tree each time the FPGA bitstream changes. The + parameters which used to be exported as #defines will now become + properties of the device node. In general, device nodes for IP-cores + will take the following form: + + (name): (generic-name)@(base-address) { + compatible = "xlnx,(ip-core-name)-(HW_VER)" + [, (list of compatible devices), ...]; + reg = <(baseaddr) (size)>; + interrupt-parent = <&interrupt-controller-phandle>; + interrupts = < ... >; + xlnx,(parameter1) = "(string-value)"; + xlnx,(parameter2) = <(int-value)>; + }; + + (generic-name): an open firmware-style name that describes the + generic class of device. Preferably, this is one word, such + as 'serial' or 'ethernet'. + (ip-core-name): the name of the ip block (given after the BEGIN + directive in system.mhs). Should be in lowercase + and all underscores '_' converted to dashes '-'. + (name): is derived from the "PARAMETER INSTANCE" value. + (parameter#): C_* parameters from system.mhs. The C_ prefix is + dropped from the parameter name, the name is converted + to lowercase and all underscore '_' characters are + converted to dashes '-'. + (baseaddr): the baseaddr parameter value (often named C_BASEADDR). + (HW_VER): from the HW_VER parameter. + (size): the address range size (often C_HIGHADDR - C_BASEADDR + 1). + + Typically, the compatible list will include the exact IP core version + followed by an older IP core version which implements the same + interface or any other device with the same interface. + + 'reg', 'interrupt-parent' and 'interrupts' are all optional properties. + + For example, the following block from system.mhs: + + BEGIN opb_uartlite + PARAMETER INSTANCE = opb_uartlite_0 + PARAMETER HW_VER = 1.00.b + PARAMETER C_BAUDRATE = 115200 + PARAMETER C_DATA_BITS = 8 + PARAMETER C_ODD_PARITY = 0 + PARAMETER C_USE_PARITY = 0 + PARAMETER C_CLK_FREQ = 50000000 + PARAMETER C_BASEADDR = 0xEC100000 + PARAMETER C_HIGHADDR = 0xEC10FFFF + BUS_INTERFACE SOPB = opb_7 + PORT OPB_Clk = CLK_50MHz + PORT Interrupt = opb_uartlite_0_Interrupt + PORT RX = opb_uartlite_0_RX + PORT TX = opb_uartlite_0_TX + PORT OPB_Rst = sys_bus_reset_0 + END + + becomes the following device tree node: + + opb_uartlite_0: serial@ec100000 { + device_type = "serial"; + compatible = "xlnx,opb-uartlite-1.00.b"; + reg = <ec100000 10000>; + interrupt-parent = <&opb_intc_0>; + interrupts = <1 0>; // got this from the opb_intc parameters + current-speed = <d#115200>; // standard serial device prop + clock-frequency = <d#50000000>; // standard serial device prop + xlnx,data-bits = <8>; + xlnx,odd-parity = <0>; + xlnx,use-parity = <0>; + }; + + Some IP cores actually implement 2 or more logical devices. In + this case, the device should still describe the whole IP core with + a single node and add a child node for each logical device. The + ranges property can be used to translate from parent IP-core to the + registers of each device. In addition, the parent node should be + compatible with the bus type 'xlnx,compound', and should contain + #address-cells and #size-cells, as with any other bus. (Note: this + makes the assumption that both logical devices have the same bus + binding. If this is not true, then separate nodes should be used + for each logical device). The 'cell-index' property can be used to + enumerate logical devices within an IP core. For example, the + following is the system.mhs entry for the dual ps2 controller found + on the ml403 reference design. + + BEGIN opb_ps2_dual_ref + PARAMETER INSTANCE = opb_ps2_dual_ref_0 + PARAMETER HW_VER = 1.00.a + PARAMETER C_BASEADDR = 0xA9000000 + PARAMETER C_HIGHADDR = 0xA9001FFF + BUS_INTERFACE SOPB = opb_v20_0 + PORT Sys_Intr1 = ps2_1_intr + PORT Sys_Intr2 = ps2_2_intr + PORT Clkin1 = ps2_clk_rx_1 + PORT Clkin2 = ps2_clk_rx_2 + PORT Clkpd1 = ps2_clk_tx_1 + PORT Clkpd2 = ps2_clk_tx_2 + PORT Rx1 = ps2_d_rx_1 + PORT Rx2 = ps2_d_rx_2 + PORT Txpd1 = ps2_d_tx_1 + PORT Txpd2 = ps2_d_tx_2 + END + + It would result in the following device tree nodes: + + opb_ps2_dual_ref_0: opb-ps2-dual-ref@a9000000 { + #address-cells = <1>; + #size-cells = <1>; + compatible = "xlnx,compound"; + ranges = <0 a9000000 2000>; + // If this device had extra parameters, then they would + // go here. + ps2@0 { + compatible = "xlnx,opb-ps2-dual-ref-1.00.a"; + reg = <0 40>; + interrupt-parent = <&opb_intc_0>; + interrupts = <3 0>; + cell-index = <0>; + }; + ps2@1000 { + compatible = "xlnx,opb-ps2-dual-ref-1.00.a"; + reg = <1000 40>; + interrupt-parent = <&opb_intc_0>; + interrupts = <3 0>; + cell-index = <0>; + }; + }; + + Also, the system.mhs file defines bus attachments from the processor + to the devices. The device tree structure should reflect the bus + attachments. Again an example; this system.mhs fragment: + + BEGIN ppc405_virtex4 + PARAMETER INSTANCE = ppc405_0 + PARAMETER HW_VER = 1.01.a + BUS_INTERFACE DPLB = plb_v34_0 + BUS_INTERFACE IPLB = plb_v34_0 + END + + BEGIN opb_intc + PARAMETER INSTANCE = opb_intc_0 + PARAMETER HW_VER = 1.00.c + PARAMETER C_BASEADDR = 0xD1000FC0 + PARAMETER C_HIGHADDR = 0xD1000FDF + BUS_INTERFACE SOPB = opb_v20_0 + END + + BEGIN opb_uart16550 + PARAMETER INSTANCE = opb_uart16550_0 + PARAMETER HW_VER = 1.00.d + PARAMETER C_BASEADDR = 0xa0000000 + PARAMETER C_HIGHADDR = 0xa0001FFF + BUS_INTERFACE SOPB = opb_v20_0 + END + + BEGIN plb_v34 + PARAMETER INSTANCE = plb_v34_0 + PARAMETER HW_VER = 1.02.a + END + + BEGIN plb_bram_if_cntlr + PARAMETER INSTANCE = plb_bram_if_cntlr_0 + PARAMETER HW_VER = 1.00.b + PARAMETER C_BASEADDR = 0xFFFF0000 + PARAMETER C_HIGHADDR = 0xFFFFFFFF + BUS_INTERFACE SPLB = plb_v34_0 + END + + BEGIN plb2opb_bridge + PARAMETER INSTANCE = plb2opb_bridge_0 + PARAMETER HW_VER = 1.01.a + PARAMETER C_RNG0_BASEADDR = 0x20000000 + PARAMETER C_RNG0_HIGHADDR = 0x3FFFFFFF + PARAMETER C_RNG1_BASEADDR = 0x60000000 + PARAMETER C_RNG1_HIGHADDR = 0x7FFFFFFF + PARAMETER C_RNG2_BASEADDR = 0x80000000 + PARAMETER C_RNG2_HIGHADDR = 0xBFFFFFFF + PARAMETER C_RNG3_BASEADDR = 0xC0000000 + PARAMETER C_RNG3_HIGHADDR = 0xDFFFFFFF + BUS_INTERFACE SPLB = plb_v34_0 + BUS_INTERFACE MOPB = opb_v20_0 + END + + Gives this device tree (some properties removed for clarity): + + plb@0 { + #address-cells = <1>; + #size-cells = <1>; + compatible = "xlnx,plb-v34-1.02.a"; + device_type = "ibm,plb"; + ranges; // 1:1 translation + + plb_bram_if_cntrl_0: bram@ffff0000 { + reg = <ffff0000 10000>; + } + + opb@20000000 { + #address-cells = <1>; + #size-cells = <1>; + ranges = <20000000 20000000 20000000 + 60000000 60000000 20000000 + 80000000 80000000 40000000 + c0000000 c0000000 20000000>; + + opb_uart16550_0: serial@a0000000 { + reg = <a00000000 2000>; + }; + + opb_intc_0: interrupt-controller@d1000fc0 { + reg = <d1000fc0 20>; + }; + }; + }; + + That covers the general approach to binding xilinx IP cores into the + device tree. The following are bindings for specific devices: + + i) Xilinx ML300 Framebuffer + + Simple framebuffer device from the ML300 reference design (also on the + ML403 reference design as well as others). + + Optional properties: + - resolution = <xres yres> : pixel resolution of framebuffer. Some + implementations use a different resolution. + Default is <d#640 d#480> + - virt-resolution = <xvirt yvirt> : Size of framebuffer in memory. + Default is <d#1024 d#480>. + - rotate-display (empty) : rotate display 180 degrees. + + ii) Xilinx SystemACE + + The Xilinx SystemACE device is used to program FPGAs from an FPGA + bitstream stored on a CF card. It can also be used as a generic CF + interface device. + + Optional properties: + - 8-bit (empty) : Set this property for SystemACE in 8 bit mode + + iii) Xilinx EMAC and Xilinx TEMAC + + Xilinx Ethernet devices. In addition to general xilinx properties + listed above, nodes for these devices should include a phy-handle + property, and may include other common network device properties + like local-mac-address. + + iv) Xilinx Uartlite + + Xilinx uartlite devices are simple fixed speed serial ports. + + Required properties: + - current-speed : Baud rate of uartlite + + v) Xilinx hwicap + + Xilinx hwicap devices provide access to the configuration logic + of the FPGA through the Internal Configuration Access Port + (ICAP). The ICAP enables partial reconfiguration of the FPGA, + readback of the configuration information, and some control over + 'warm boots' of the FPGA fabric. + + Required properties: + - xlnx,family : The family of the FPGA, necessary since the + capabilities of the underlying ICAP hardware + differ between different families. May be + 'virtex2p', 'virtex4', or 'virtex5'. + + vi) Xilinx Uart 16550 + + Xilinx UART 16550 devices are very similar to the NS16550 but with + different register spacing and an offset from the base address. + + Required properties: + - clock-frequency : Frequency of the clock input + - reg-offset : A value of 3 is required + - reg-shift : A value of 2 is required + + diff --git a/Documentation/pps/pps.txt b/Documentation/pps/pps.txt new file mode 100644 index 00000000000..125f4ab4899 --- /dev/null +++ b/Documentation/pps/pps.txt @@ -0,0 +1,172 @@ + + PPS - Pulse Per Second + ---------------------- + +(C) Copyright 2007 Rodolfo Giometti <giometti@enneenne.com> + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation; either version 2 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + + + +Overview +-------- + +LinuxPPS provides a programming interface (API) to define in the +system several PPS sources. + +PPS means "pulse per second" and a PPS source is just a device which +provides a high precision signal each second so that an application +can use it to adjust system clock time. + +A PPS source can be connected to a serial port (usually to the Data +Carrier Detect pin) or to a parallel port (ACK-pin) or to a special +CPU's GPIOs (this is the common case in embedded systems) but in each +case when a new pulse arrives the system must apply to it a timestamp +and record it for userland. + +Common use is the combination of the NTPD as userland program, with a +GPS receiver as PPS source, to obtain a wallclock-time with +sub-millisecond synchronisation to UTC. + + +RFC considerations +------------------ + +While implementing a PPS API as RFC 2783 defines and using an embedded +CPU GPIO-Pin as physical link to the signal, I encountered a deeper +problem: + + At startup it needs a file descriptor as argument for the function + time_pps_create(). + +This implies that the source has a /dev/... entry. This assumption is +ok for the serial and parallel port, where you can do something +useful besides(!) the gathering of timestamps as it is the central +task for a PPS-API. But this assumption does not work for a single +purpose GPIO line. In this case even basic file-related functionality +(like read() and write()) makes no sense at all and should not be a +precondition for the use of a PPS-API. + +The problem can be simply solved if you consider that a PPS source is +not always connected with a GPS data source. + +So your programs should check if the GPS data source (the serial port +for instance) is a PPS source too, and if not they should provide the +possibility to open another device as PPS source. + +In LinuxPPS the PPS sources are simply char devices usually mapped +into files /dev/pps0, /dev/pps1, etc.. + + +Coding example +-------------- + +To register a PPS source into the kernel you should define a struct +pps_source_info_s as follows: + + static struct pps_source_info pps_ktimer_info = { + .name = "ktimer", + .path = "", + .mode = PPS_CAPTUREASSERT | PPS_OFFSETASSERT | \ + PPS_ECHOASSERT | \ + PPS_CANWAIT | PPS_TSFMT_TSPEC, + .echo = pps_ktimer_echo, + .owner = THIS_MODULE, + }; + +and then calling the function pps_register_source() in your +intialization routine as follows: + + source = pps_register_source(&pps_ktimer_info, + PPS_CAPTUREASSERT | PPS_OFFSETASSERT); + +The pps_register_source() prototype is: + + int pps_register_source(struct pps_source_info_s *info, int default_params) + +where "info" is a pointer to a structure that describes a particular +PPS source, "default_params" tells the system what the initial default +parameters for the device should be (it is obvious that these parameters +must be a subset of ones defined in the struct +pps_source_info_s which describe the capabilities of the driver). + +Once you have registered a new PPS source into the system you can +signal an assert event (for example in the interrupt handler routine) +just using: + + pps_event(source, &ts, PPS_CAPTUREASSERT, ptr) + +where "ts" is the event's timestamp. + +The same function may also run the defined echo function +(pps_ktimer_echo(), passing to it the "ptr" pointer) if the user +asked for that... etc.. + +Please see the file drivers/pps/clients/ktimer.c for example code. + + +SYSFS support +------------- + +If the SYSFS filesystem is enabled in the kernel it provides a new class: + + $ ls /sys/class/pps/ + pps0/ pps1/ pps2/ + +Every directory is the ID of a PPS sources defined in the system and +inside you find several files: + + $ ls /sys/class/pps/pps0/ + assert clear echo mode name path subsystem@ uevent + +Inside each "assert" and "clear" file you can find the timestamp and a +sequence number: + + $ cat /sys/class/pps/pps0/assert + 1170026870.983207967#8 + +Where before the "#" is the timestamp in seconds; after it is the +sequence number. Other files are: + +* echo: reports if the PPS source has an echo function or not; + +* mode: reports available PPS functioning modes; + +* name: reports the PPS source's name; + +* path: reports the PPS source's device path, that is the device the + PPS source is connected to (if it exists). + + +Testing the PPS support +----------------------- + +In order to test the PPS support even without specific hardware you can use +the ktimer driver (see the client subsection in the PPS configuration menu) +and the userland tools provided into Documentaion/pps/ directory. + +Once you have enabled the compilation of ktimer just modprobe it (if +not statically compiled): + + # modprobe ktimer + +and the run ppstest as follow: + + $ ./ppstest /dev/pps0 + trying PPS source "/dev/pps1" + found PPS source "/dev/pps1" + ok, found 1 source(s), now start fetching data... + source 0 - assert 1186592699.388832443, sequence: 364 - clear 0.000000000, sequence: 0 + source 0 - assert 1186592700.388931295, sequence: 365 - clear 0.000000000, sequence: 0 + source 0 - assert 1186592701.389032765, sequence: 366 - clear 0.000000000, sequence: 0 + +Please, note that to compile userland programs you need the file timepps.h +(see Documentation/pps/). diff --git a/Documentation/rfkill.txt b/Documentation/rfkill.txt index 1b74b5f30af..b4860509c31 100644 --- a/Documentation/rfkill.txt +++ b/Documentation/rfkill.txt @@ -3,9 +3,8 @@ rfkill - RF kill switch support 1. Introduction 2. Implementation details -3. Kernel driver guidelines -4. Kernel API -5. Userspace support +3. Kernel API +4. Userspace support 1. Introduction @@ -19,82 +18,62 @@ disable all transmitters of a certain type (or all). This is intended for situations where transmitters need to be turned off, for example on aircraft. +The rfkill subsystem has a concept of "hard" and "soft" block, which +differ little in their meaning (block == transmitters off) but rather in +whether they can be changed or not: + - hard block: read-only radio block that cannot be overriden by software + - soft block: writable radio block (need not be readable) that is set by + the system software. 2. Implementation details -The rfkill subsystem is composed of various components: the rfkill class, the -rfkill-input module (an input layer handler), and some specific input layer -events. - -The rfkill class is provided for kernel drivers to register their radio -transmitter with the kernel, provide methods for turning it on and off and, -optionally, letting the system know about hardware-disabled states that may -be implemented on the device. This code is enabled with the CONFIG_RFKILL -Kconfig option, which drivers can "select". - -The rfkill class code also notifies userspace of state changes, this is -achieved via uevents. It also provides some sysfs files for userspace to -check the status of radio transmitters. See the "Userspace support" section -below. +The rfkill subsystem is composed of three main components: + * the rfkill core, + * the deprecated rfkill-input module (an input layer handler, being + replaced by userspace policy code) and + * the rfkill drivers. +The rfkill core provides API for kernel drivers to register their radio +transmitter with the kernel, methods for turning it on and off and, letting +the system know about hardware-disabled states that may be implemented on +the device. -The rfkill-input code implements a basic response to rfkill buttons -- it -implements turning on/off all devices of a certain class (or all). +The rfkill core code also notifies userspace of state changes, and provides +ways for userspace to query the current states. See the "Userspace support" +section below. When the device is hard-blocked (either by a call to rfkill_set_hw_state() -or from query_hw_block) set_block() will be invoked but drivers can well -ignore the method call since they can use the return value of the function -rfkill_set_hw_state() to sync the software state instead of keeping track -of calls to set_block(). - - -The entire functionality is spread over more than one subsystem: - - * The kernel input layer generates KEY_WWAN, KEY_WLAN etc. and - SW_RFKILL_ALL -- when the user presses a button. Drivers for radio - transmitters generally do not register to the input layer, unless the - device really provides an input device (i.e. a button that has no - effect other than generating a button press event) - - * The rfkill-input code hooks up to these events and switches the soft-block - of the various radio transmitters, depending on the button type. - - * The rfkill drivers turn off/on their transmitters as requested. - - * The rfkill class will generate userspace notifications (uevents) to tell - userspace what the current state is. +or from query_hw_block) set_block() will be invoked for additional software +block, but drivers can ignore the method call since they can use the return +value of the function rfkill_set_hw_state() to sync the software state +instead of keeping track of calls to set_block(). In fact, drivers should +use the return value of rfkill_set_hw_state() unless the hardware actually +keeps track of soft and hard block separately. +3. Kernel API -3. Kernel driver guidelines - -Drivers for radio transmitters normally implement only the rfkill class. -These drivers may not unblock the transmitter based on own decisions, they -should act on information provided by the rfkill class only. +Drivers for radio transmitters normally implement an rfkill driver. Platform drivers might implement input devices if the rfkill button is just that, a button. If that button influences the hardware then you need to -implement an rfkill class instead. This also applies if the platform provides +implement an rfkill driver instead. This also applies if the platform provides a way to turn on/off the transmitter(s). -During suspend/hibernation, transmitters should only be left enabled when -wake-on wlan or similar functionality requires it and the device wasn't -blocked before suspend/hibernate. Note that it may be necessary to update -the rfkill subsystem's idea of what the current state is at resume time if -the state may have changed over suspend. - +For some platforms, it is possible that the hardware state changes during +suspend/hibernation, in which case it will be necessary to update the rfkill +core with the current state is at resume time. +To create an rfkill driver, driver's Kconfig needs to have -4. Kernel API + depends on RFKILL || !RFKILL -To build a driver with rfkill subsystem support, the driver should depend on -(or select) the Kconfig symbol RFKILL. - -The hardware the driver talks to may be write-only (where the current state -of the hardware is unknown), or read-write (where the hardware can be queried -about its current state). +to ensure the driver cannot be built-in when rfkill is modular. The !RFKILL +case allows the driver to be built when rfkill is not configured, which which +case all rfkill API can still be used but will be provided by static inlines +which compile to almost nothing. Calling rfkill_set_hw_state() when a state change happens is required from rfkill drivers that control devices that can be hard-blocked unless they also @@ -105,10 +84,35 @@ device). Don't do this unless you cannot get the event in any other way. 5. Userspace support -The following sysfs entries exist for every rfkill device: +The recommended userspace interface to use is /dev/rfkill, which is a misc +character device that allows userspace to obtain and set the state of rfkill +devices and sets of devices. It also notifies userspace about device addition +and removal. The API is a simple read/write API that is defined in +linux/rfkill.h, with one ioctl that allows turning off the deprecated input +handler in the kernel for the transition period. + +Except for the one ioctl, communication with the kernel is done via read() +and write() of instances of 'struct rfkill_event'. In this structure, the +soft and hard block are properly separated (unlike sysfs, see below) and +userspace is able to get a consistent snapshot of all rfkill devices in the +system. Also, it is possible to switch all rfkill drivers (or all drivers of +a specified type) into a state which also updates the default state for +hotplugged devices. + +After an application opens /dev/rfkill, it can read the current state of +all devices, and afterwards can poll the descriptor for hotplug or state +change events. + +Applications must ignore operations (the "op" field) they do not handle, +this allows the API to be extended in the future. + +Additionally, each rfkill device is registered in sysfs and there has the +following attributes: name: Name assigned by driver to this key (interface or driver name). - type: Name of the key type ("wlan", "bluetooth", etc). + type: Driver type string ("wlan", "bluetooth", etc). + persistent: Whether the soft blocked state is initialised from + non-volatile storage at startup. state: Current state of the transmitter 0: RFKILL_STATE_SOFT_BLOCKED transmitter is turned off by software @@ -117,7 +121,12 @@ The following sysfs entries exist for every rfkill device: 2: RFKILL_STATE_HARD_BLOCKED transmitter is forced off by something outside of the driver's control. - claim: 0: Kernel handles events (currently always reads that value) + This file is deprecated because it can only properly show + three of the four possible states, soft-and-hard-blocked is + missing. + claim: 0: Kernel handles events + This file is deprecated because there no longer is a way to + claim just control over a single rfkill instance. rfkill devices also issue uevents (with an action of "change"), with the following environment variables set: @@ -128,9 +137,3 @@ RFKILL_TYPE The contents of these variables corresponds to the "name", "state" and "type" sysfs files explained above. - -An alternative userspace interface exists as a misc device /dev/rfkill, -which allows userspace to obtain and set the state of rfkill devices and -sets of devices. It also notifies userspace about device addition and -removal. The API is a simple read/write API that is defined in -linux/rfkill.h. diff --git a/Documentation/robust-futex-ABI.txt b/Documentation/robust-futex-ABI.txt index 535f69fab45..fd1cd8aae4e 100644 --- a/Documentation/robust-futex-ABI.txt +++ b/Documentation/robust-futex-ABI.txt @@ -135,7 +135,7 @@ manipulating this list), the user code must observe the following protocol on 'lock entry' insertion and removal: On insertion: - 1) set the 'list_op_pending' word to the address of the 'lock word' + 1) set the 'list_op_pending' word to the address of the 'lock entry' to be inserted, 2) acquire the futex lock, 3) add the lock entry, with its thread id (TID) in the bottom 29 bits @@ -143,7 +143,7 @@ On insertion: 4) clear the 'list_op_pending' word. On removal: - 1) set the 'list_op_pending' word to the address of the 'lock word' + 1) set the 'list_op_pending' word to the address of the 'lock entry' to be removed, 2) remove the lock entry for this lock from the 'head' list, 2) release the futex lock, and diff --git a/Documentation/s390/s390dbf.txt b/Documentation/s390/s390dbf.txt index 2d10053dd97..ae66f9b90a2 100644 --- a/Documentation/s390/s390dbf.txt +++ b/Documentation/s390/s390dbf.txt @@ -495,6 +495,13 @@ and for each vararg a long value. So e.g. for a debug entry with a format string plus two varargs one would need to allocate a (3 * sizeof(long)) byte data area in the debug_register() function. +IMPORTANT: Using "%s" in sprintf event functions is dangerous. You can only +use "%s" in the sprintf event functions, if the memory for the passed string is +available as long as the debug feature exists. The reason behind this is that +due to performance considerations only a pointer to the string is stored in +the debug feature. If you log a string that is freed afterwards, you will get +an OOPS when inspecting the debug feature, because then the debug feature will +access the already freed memory. NOTE: If using the sprintf view do NOT use other event/exception functions than the sprintf-event and -exception functions. diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt index 1df7f9cdab0..86eabe6c341 100644 --- a/Documentation/scheduler/sched-rt-group.txt +++ b/Documentation/scheduler/sched-rt-group.txt @@ -73,7 +73,7 @@ The remaining CPU time will be used for user input and other tasks. Because realtime tasks have explicitly allocated the CPU time they need to perform their tasks, buffer underruns in the graphics or audio can be eliminated. -NOTE: the above example is not fully implemented as of yet (2.6.25). We still +NOTE: the above example is not fully implemented yet. We still lack an EDF scheduler to make non-uniform periods usable. @@ -140,14 +140,15 @@ The other option is: .o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups") -This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us" -to control the CPU time reserved for each control group instead. +This uses the /cgroup virtual file system and +"/cgroup/<cgroup>/cpu.rt_runtime_us" to control the CPU time reserved for each +control group instead. For more information on working with control groups, you should read Documentation/cgroups/cgroups.txt as well. -Group settings are checked against the following limits in order to keep the configuration -schedulable: +Group settings are checked against the following limits in order to keep the +configuration schedulable: \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period @@ -189,7 +190,7 @@ Implementing SCHED_EDF might take a while to complete. Priority Inheritance is the biggest challenge as the current linux PI infrastructure is geared towards the limited static priority levels 0-99. With deadline scheduling you need to do deadline inheritance (since priority is inversely proportional to the -deadline delta (deadline - now). +deadline delta (deadline - now)). This means the whole PI machinery will have to be reworked - and that is one of the most complex pieces of code we have. diff --git a/Documentation/scsi/scsi_fc_transport.txt b/Documentation/scsi/scsi_fc_transport.txt index e5b071d4661..d7f181701dc 100644 --- a/Documentation/scsi/scsi_fc_transport.txt +++ b/Documentation/scsi/scsi_fc_transport.txt @@ -1,10 +1,11 @@ SCSI FC Tansport ============================================= -Date: 4/12/2007 +Date: 11/18/2008 Kernel Revisions for features: rports : <<TBS>> - vports : 2.6.22 (? TBD) + vports : 2.6.22 + bsg support : 2.6.30 (?TBD?) Introduction @@ -15,6 +16,7 @@ The FC transport can be found at: drivers/scsi/scsi_transport_fc.c include/scsi/scsi_transport_fc.h include/scsi/scsi_netlink_fc.h + include/scsi/scsi_bsg_fc.h This file is found at Documentation/scsi/scsi_fc_transport.txt @@ -472,6 +474,14 @@ int fc_vport_terminate(struct fc_vport *vport) +FC BSG support (CT & ELS passthru, and more) +======================================================================== +<< To Be Supplied >> + + + + + Credits ======= The following people have contributed to this document: diff --git a/Documentation/scsi/scsi_mid_low_api.txt b/Documentation/scsi/scsi_mid_low_api.txt index a6d5354639b..de67229251d 100644 --- a/Documentation/scsi/scsi_mid_low_api.txt +++ b/Documentation/scsi/scsi_mid_low_api.txt @@ -1271,6 +1271,11 @@ of interest: hostdata[0] - area reserved for LLD at end of struct Scsi_Host. Size is set by the second argument (named 'xtr_bytes') to scsi_host_alloc() or scsi_register(). + vendor_id - a unique value that identifies the vendor supplying + the LLD for the Scsi_Host. Used most often in validating + vendor-specific message requests. Value consists of an + identifier type and a vendor-specific value. + See scsi_netlink.h for a description of valid formats. The scsi_host structure is defined in include/scsi/scsi_host.h diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt index 4252697a95d..1c8eb4518ce 100644 --- a/Documentation/sound/alsa/ALSA-Configuration.txt +++ b/Documentation/sound/alsa/ALSA-Configuration.txt @@ -60,6 +60,12 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. slots - Reserve the slot index for the given driver. This option takes multiple strings. See "Module Autoloading Support" section for details. + debug - Specifies the debug message level + (0 = disable debug prints, 1 = normal debug messages, + 2 = verbose debug messages) + This option appears only when CONFIG_SND_DEBUG=y. + This option can be dynamically changed via sysfs + /sys/modules/snd/parameters/debug file. Module snd-pcm-oss ------------------ @@ -513,6 +519,26 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. or input, but you may use this module for any application which requires a sound card (like RealPlayer). + pcm_devs - Number of PCM devices assigned to each card + (default = 1, up to 4) + pcm_substreams - Number of PCM substreams assigned to each PCM + (default = 8, up to 16) + hrtimer - Use hrtimer (=1, default) or system timer (=0) + fake_buffer - Fake buffer allocations (default = 1) + + When multiple PCM devices are created, snd-dummy gives different + behavior to each PCM device: + 0 = interleaved with mmap support + 1 = non-interleaved with mmap support + 2 = interleaved without mmap + 3 = non-interleaved without mmap + + As default, snd-dummy drivers doesn't allocate the real buffers + but either ignores read/write or mmap a single dummy page to all + buffer pages, in order to save the resouces. If your apps need + the read/ written buffer data to be consistent, pass fake_buffer=0 + option. + The power-management is supported. Module snd-echo3g @@ -768,6 +794,10 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. bdl_pos_adj - Specifies the DMA IRQ timing delay in samples. Passing -1 will make the driver to choose the appropriate value based on the controller chip. + patch - Specifies the early "patch" files to modify the HD-audio + setup before initializing the codecs. This option is + available only when CONFIG_SND_HDA_PATCH_LOADER=y is set. + See HD-Audio.txt for details. [Single (global) options] single_cmd - Use single immediate commands to communicate with diff --git a/Documentation/sound/alsa/HD-Audio-Models.txt b/Documentation/sound/alsa/HD-Audio-Models.txt index de8e10a9410..97eebd63bed 100644 --- a/Documentation/sound/alsa/HD-Audio-Models.txt +++ b/Documentation/sound/alsa/HD-Audio-Models.txt @@ -114,8 +114,8 @@ ALC662/663/272 samsung-nc10 Samsung NC10 mini notebook auto auto-config reading BIOS (default) -ALC882/885 -========== +ALC882/883/885/888/889 +====================== 3stack-dig 3-jack with SPDIF I/O 6stack-dig 6-jack digital with SPDIF I/O arima Arima W820Di1 @@ -127,18 +127,16 @@ ALC882/885 mbp3 Macbook Pro rev3 imac24 iMac 24'' with jack detection w2jc ASUS W2JC - auto auto-config reading BIOS (default) - -ALC883/888 -========== - 3stack-dig 3-jack with SPDIF I/O - 6stack-dig 6-jack digital with SPDIF I/O + 3stack-2ch-dig 3-jack with SPDIF I/O (ALC883) + alc883-6stack-dig 6-jack digital with SPDIF I/O (ALC883) 3stack-6ch 3-jack 6-channel 3stack-6ch-dig 3-jack 6-channel with SPDIF I/O 6stack-dig-demo 6-jack digital for Intel demo board acer Acer laptops (Travelmate 3012WTMi, Aspire 5600, etc) acer-aspire Acer Aspire 9810 acer-aspire-4930g Acer Aspire 4930G + acer-aspire-6530g Acer Aspire 6530G + acer-aspire-7730g Acer Aspire 7730G acer-aspire-8930g Acer Aspire 8930G medion Medion Laptops medion-md2 Medion MD2 @@ -154,10 +152,13 @@ ALC883/888 3stack-hp HP machines with 3stack (Lucknow, Samba boards) 6stack-dell Dell machines with 6stack (Inspiron 530) mitac Mitac 8252D + clevo-m540r Clevo M540R (6ch + digital) clevo-m720 Clevo M720 laptop series fujitsu-pi2515 Fujitsu AMILO Pi2515 fujitsu-xa3530 Fujitsu AMILO XA3530 3stack-6ch-intel Intel DG33* boards + intel-alc889a Intel IbexPeak with ALC889A + intel-x58 Intel DX58 with ALC889 asus-p5q ASUS P5Q-EM boards mb31 MacBook 3,1 sony-vaio-tt Sony VAIO TT @@ -228,7 +229,7 @@ AD1984 ====== basic default configuration thinkpad Lenovo Thinkpad T61/X61 - dell Dell T3400 + dell_desktop Dell T3400 AD1986A ======= @@ -239,6 +240,7 @@ AD1986A laptop-automute 2-channel with EAPD and HP-automute (Lenovo N100) ultra 2-channel with EAPD (Samsung Ultra tablet PC) samsung 2-channel with EAPD (Samsung R65) + samsung-p50 2-channel with HP-automute (Samsung P50) AD1988/AD1988B/AD1989A/AD1989B ============================== @@ -256,6 +258,7 @@ Conexant 5045 laptop-micsense Laptop with Mic sense (old model fujitsu) laptop-hpmicsense Laptop with HP and Mic senses benq Benq R55E + laptop-hp530 HP 530 laptop test for testing/debugging purpose, almost all controls can be adjusted. Appearing only when compiled with $CONFIG_SND_DEBUG=y @@ -276,9 +279,16 @@ Conexant 5051 hp-dv6736 HP dv6736 lenovo-x200 Lenovo X200 laptop +Conexant 5066 +============= + laptop Basic Laptop config (default) + dell-laptop Dell laptops + olpc-xo-1_5 OLPC XO 1.5 + STAC9200 ======== ref Reference board + oqo OQO Model 2 dell-d21 Dell (unknown) dell-d22 Dell (unknown) dell-d23 Dell (unknown) @@ -366,10 +376,12 @@ STAC92HD73* =========== ref Reference board no-jd BIOS setup but without jack-detection + intel Intel DG45* mobos dell-m6-amic Dell desktops/laptops with analog mics dell-m6-dmic Dell desktops/laptops with digital mics dell-m6 Dell desktops/laptops with both type of mics dell-eq Dell desktops/laptops + alienware Alienware M17x auto BIOS setup (default) STAC92HD83* @@ -383,3 +395,8 @@ STAC9872 ======== vaio VAIO laptop without SPDIF auto BIOS setup (default) + +Cirrus Logic CS4206/4207 +======================== + mbp55 MacBook Pro 5,5 + auto BIOS setup (default) diff --git a/Documentation/sound/alsa/HD-Audio.txt b/Documentation/sound/alsa/HD-Audio.txt index 71ac995b191..7b8a5f947d1 100644 --- a/Documentation/sound/alsa/HD-Audio.txt +++ b/Documentation/sound/alsa/HD-Audio.txt @@ -139,6 +139,10 @@ The driver checks PCI SSID and looks through the static configuration table until any matching entry is found. If you have a new machine, you may see a message like below: ------------------------------------------------------------------------ + hda_codec: ALC880: BIOS auto-probing. +------------------------------------------------------------------------ +Meanwhile, in the earlier versions, you would see a message like: +------------------------------------------------------------------------ hda_codec: Unknown model for ALC880, trying auto-probe from BIOS... ------------------------------------------------------------------------ Even if you see such a message, DON'T PANIC. Take a deep breath and @@ -403,6 +407,66 @@ re-configure based on that state, run like below: ------------------------------------------------------------------------ +Early Patching +~~~~~~~~~~~~~~ +When CONFIG_SND_HDA_PATCH_LOADER=y is set, you can pass a "patch" as a +firmware file for modifying the HD-audio setup before initializing the +codec. This can work basically like the reconfiguration via sysfs in +the above, but it does it before the first codec configuration. + +A patch file is a plain text file which looks like below: + +------------------------------------------------------------------------ + [codec] + 0x12345678 0xabcd1234 2 + + [model] + auto + + [pincfg] + 0x12 0x411111f0 + + [verb] + 0x20 0x500 0x03 + 0x20 0x400 0xff + + [hint] + hp_detect = yes +------------------------------------------------------------------------ + +The file needs to have a line `[codec]`. The next line should contain +three numbers indicating the codec vendor-id (0x12345678 in the +example), the codec subsystem-id (0xabcd1234) and the address (2) of +the codec. The rest patch entries are applied to this specified codec +until another codec entry is given. + +The `[model]` line allows to change the model name of the each codec. +In the example above, it will be changed to model=auto. +Note that this overrides the module option. + +After the `[pincfg]` line, the contents are parsed as the initial +default pin-configurations just like `user_pin_configs` sysfs above. +The values can be shown in user_pin_configs sysfs file, too. + +Similarly, the lines after `[verb]` are parsed as `init_verbs` +sysfs entries, and the lines after `[hint]` are parsed as `hints` +sysfs entries, respectively. + +The hd-audio driver reads the file via request_firmware(). Thus, +a patch file has to be located on the appropriate firmware path, +typically, /lib/firmware. For example, when you pass the option +`patch=hda-init.fw`, the file /lib/firmware/hda-init-fw must be +present. + +The patch module option is specific to each card instance, and you +need to give one file name for each instance, separated by commas. +For example, if you have two cards, one for an on-board analog and one +for an HDMI video board, you may pass patch option like below: +------------------------------------------------------------------------ + options snd-hda-intel patch=on-board-patch,hdmi-patch +------------------------------------------------------------------------ + + Power-Saving ~~~~~~~~~~~~ The power-saving is a kind of auto-suspend of the device. When the diff --git a/Documentation/sound/alsa/Procfile.txt b/Documentation/sound/alsa/Procfile.txt index 381908d8ca4..719a819f8cc 100644 --- a/Documentation/sound/alsa/Procfile.txt +++ b/Documentation/sound/alsa/Procfile.txt @@ -101,6 +101,8 @@ card*/pcm*/xrun_debug bit 0 = Enable XRUN/jiffies debug messages bit 1 = Show stack trace at XRUN / jiffies check bit 2 = Enable additional jiffies check + bit 3 = Log hwptr update at each period interrupt + bit 4 = Log hwptr update at each snd_pcm_update_hw_ptr() When the bit 0 is set, the driver will show the messages to kernel log when an xrun is detected. The debug message is @@ -117,6 +119,9 @@ card*/pcm*/xrun_debug buggy) hardware that doesn't give smooth pointer updates. This feature is enabled via the bit 2. + Bits 3 and 4 are for logging the hwptr records. Note that + these will give flood of kernel messages. + card*/pcm*/sub*/info The general information of this PCM sub-stream. diff --git a/Documentation/spi/spidev_test.c b/Documentation/spi/spidev_test.c index cf0e3ce0d52..c1a5aad3c75 100644 --- a/Documentation/spi/spidev_test.c +++ b/Documentation/spi/spidev_test.c @@ -99,11 +99,13 @@ void parse_opts(int argc, char *argv[]) { "lsb", 0, 0, 'L' }, { "cs-high", 0, 0, 'C' }, { "3wire", 0, 0, '3' }, + { "no-cs", 0, 0, 'N' }, + { "ready", 0, 0, 'R' }, { NULL, 0, 0, 0 }, }; int c; - c = getopt_long(argc, argv, "D:s:d:b:lHOLC3", lopts, NULL); + c = getopt_long(argc, argv, "D:s:d:b:lHOLC3NR", lopts, NULL); if (c == -1) break; @@ -139,6 +141,12 @@ void parse_opts(int argc, char *argv[]) case '3': mode |= SPI_3WIRE; break; + case 'N': + mode |= SPI_NO_CS; + break; + case 'R': + mode |= SPI_READY; + break; default: print_usage(argv[0]); break; diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 322a00bb99d..2dbff53369d 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -19,6 +19,7 @@ Currently, these files might (depending on your configuration) show up in /proc/sys/kernel: - acpi_video_flags - acct +- callhome [ S390 only ] - auto_msgmni - core_pattern - core_uses_pid @@ -91,6 +92,21 @@ valid for 30 seconds. ============================================================== +callhome: + +Controls the kernel's callhome behavior in case of a kernel panic. + +The s390 hardware allows an operating system to send a notification +to a service organization (callhome) in case of an operating system panic. + +When the value in this file is 0 (which is the default behavior) +nothing happens in case of a kernel panic. If this value is set to "1" +the complete kernel oops message is send to the IBM customer service +organization in case the mainframe the Linux operating system is running +on has a service contract with IBM. + +============================================================== + core_pattern: core_pattern is used to specify a core dumpfile pattern name. diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 6fab2dcbb4d..c4de6359d44 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -233,8 +233,8 @@ These protections are added to score to judge whether this zone should be used for page allocation or should be reclaimed. In this example, if normal pages (index=2) are required to this DMA zone and -pages_high is used for watermark, the kernel judges this zone should not be -used because pages_free(1355) is smaller than watermark + protection[2] +watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should +not be used because pages_free(1355) is smaller than watermark + protection[2] (4 + 2004 = 2008). If this protection value is 0, this zone would be used for normal page requirement. If requirement is DMA zone(index=0), protection[0] (=0) is used. @@ -280,9 +280,10 @@ The default value is 65536. min_free_kbytes: This is used to force the Linux VM to keep a minimum number -of kilobytes free. The VM uses this number to compute a pages_min -value for each lowmem zone in the system. Each lowmem zone gets -a number of reserved free pages based proportionally on its size. +of kilobytes free. The VM uses this number to compute a +watermark[WMARK_MIN] value for each lowmem zone in the system. +Each lowmem zone gets a number of reserved free pages based +proportionally on its size. Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024KB, your system will @@ -314,10 +315,14 @@ min_unmapped_ratio: This is available only on NUMA kernels. -A percentage of the total pages in each zone. Zone reclaim will only -occur if more than this percentage of pages are file backed and unmapped. -This is to insure that a minimal amount of local pages is still available for -file I/O even if the node is overallocated. +This is a percentage of the total pages in each zone. Zone reclaim will +only occur if more than this percentage of pages are in a state that +zone_reclaim_mode allows to be reclaimed. + +If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared +against all file-backed unmapped pages including swapcache pages and tmpfs +files. Otherwise, only unmapped pages backed by normal files but not tmpfs +files and similar are considered. The default is 1 percent. diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index cf42b820ff9..d56a0177542 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt @@ -66,7 +66,8 @@ On all - write a character to /proc/sysrq-trigger. e.g.: 'b' - Will immediately reboot the system without syncing or unmounting your disks. -'c' - Will perform a kexec reboot in order to take a crashdump. +'c' - Will perform a system crash by a NULL pointer dereference. + A crashdump will be taken if configured. 'd' - Shows all locks that are held. @@ -141,8 +142,8 @@ useful when you want to exit a program that will not let you switch consoles. re'B'oot is good when you're unable to shut down. But you should also 'S'ync and 'U'mount first. -'C'rashdump can be used to manually trigger a crashdump when the system is hung. -The kernel needs to have been built with CONFIG_KEXEC enabled. +'C'rash can be used to manually trigger a crashdump when the system is hung. +Note that this just triggers a crash if there is no dump mechanism available. 'S'ync is great when your system is locked up, it allows you to sync your disks and will certainly lessen the chance of data loss and fscking. Note diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt index f157d7594ea..78c45a87be5 100644 --- a/Documentation/trace/events.txt +++ b/Documentation/trace/events.txt @@ -1,7 +1,7 @@ Event Tracing Documentation written by Theodore Ts'o - Updated by Li Zefan + Updated by Li Zefan and Tom Zanussi 1. Introduction =============== @@ -22,12 +22,12 @@ tracing information should be printed. --------------------------------- The events which are available for tracing can be found in the file -/debug/tracing/available_events. +/sys/kernel/debug/tracing/available_events. To enable a particular event, such as 'sched_wakeup', simply echo it -to /debug/tracing/set_event. For example: +to /sys/kernel/debug/tracing/set_event. For example: - # echo sched_wakeup >> /debug/tracing/set_event + # echo sched_wakeup >> /sys/kernel/debug/tracing/set_event [ Note: '>>' is necessary, otherwise it will firstly disable all the events. ] @@ -35,15 +35,15 @@ to /debug/tracing/set_event. For example: To disable an event, echo the event name to the set_event file prefixed with an exclamation point: - # echo '!sched_wakeup' >> /debug/tracing/set_event + # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event To disable all events, echo an empty line to the set_event file: - # echo > /debug/tracing/set_event + # echo > /sys/kernel/debug/tracing/set_event To enable all events, echo '*:*' or '*:' to the set_event file: - # echo *:* > /debug/tracing/set_event + # echo *:* > /sys/kernel/debug/tracing/set_event The events are organized into subsystems, such as ext4, irq, sched, etc., and a full event name looks like this: <subsystem>:<event>. The @@ -52,29 +52,29 @@ file. All of the events in a subsystem can be specified via the syntax "<subsystem>:*"; for example, to enable all irq events, you can use the command: - # echo 'irq:*' > /debug/tracing/set_event + # echo 'irq:*' > /sys/kernel/debug/tracing/set_event 2.2 Via the 'enable' toggle --------------------------- -The events available are also listed in /debug/tracing/events/ hierarchy +The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy of directories. To enable event 'sched_wakeup': - # echo 1 > /debug/tracing/events/sched/sched_wakeup/enable + # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable To disable it: - # echo 0 > /debug/tracing/events/sched/sched_wakeup/enable + # echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable To enable all events in sched subsystem: - # echo 1 > /debug/tracing/events/sched/enable + # echo 1 > /sys/kernel/debug/tracing/events/sched/enable To eanble all events: - # echo 1 > /debug/tracing/events/enable + # echo 1 > /sys/kernel/debug/tracing/events/enable When reading one of these enable files, there are four results: @@ -83,8 +83,199 @@ When reading one of these enable files, there are four results: X - there is a mixture of events enabled and disabled ? - this file does not affect any event +2.3 Boot option +--------------- + +In order to facilitate early boot debugging, use boot option: + + trace_event=[event-list] + +The format of this boot option is the same as described in section 2.1. + 3. Defining an event-enabled tracepoint ======================================= See The example provided in samples/trace_events +4. Event formats +================ + +Each trace event has a 'format' file associated with it that contains +a description of each field in a logged event. This information can +be used to parse the binary trace stream, and is also the place to +find the field names that can be used in event filters (see section 5). + +It also displays the format string that will be used to print the +event in text mode, along with the event name and ID used for +profiling. + +Every event has a set of 'common' fields associated with it; these are +the fields prefixed with 'common_'. The other fields vary between +events and correspond to the fields defined in the TRACE_EVENT +definition for that event. + +Each field in the format has the form: + + field:field-type field-name; offset:N; size:N; + +where offset is the offset of the field in the trace record and size +is the size of the data item, in bytes. + +For example, here's the information displayed for the 'sched_wakeup' +event: + +# cat /debug/tracing/events/sched/sched_wakeup/format + +name: sched_wakeup +ID: 60 +format: + field:unsigned short common_type; offset:0; size:2; + field:unsigned char common_flags; offset:2; size:1; + field:unsigned char common_preempt_count; offset:3; size:1; + field:int common_pid; offset:4; size:4; + field:int common_tgid; offset:8; size:4; + + field:char comm[TASK_COMM_LEN]; offset:12; size:16; + field:pid_t pid; offset:28; size:4; + field:int prio; offset:32; size:4; + field:int success; offset:36; size:4; + field:int cpu; offset:40; size:4; + +print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid, + REC->prio, REC->success, REC->cpu + +This event contains 10 fields, the first 5 common and the remaining 5 +event-specific. All the fields for this event are numeric, except for +'comm' which is a string, a distinction important for event filtering. + +5. Event filtering +================== + +Trace events can be filtered in the kernel by associating boolean +'filter expressions' with them. As soon as an event is logged into +the trace buffer, its fields are checked against the filter expression +associated with that event type. An event with field values that +'match' the filter will appear in the trace output, and an event whose +values don't match will be discarded. An event with no filter +associated with it matches everything, and is the default when no +filter has been set for an event. + +5.1 Expression syntax +--------------------- + +A filter expression consists of one or more 'predicates' that can be +combined using the logical operators '&&' and '||'. A predicate is +simply a clause that compares the value of a field contained within a +logged event with a constant value and returns either 0 or 1 depending +on whether the field value matched (1) or didn't match (0): + + field-name relational-operator value + +Parentheses can be used to provide arbitrary logical groupings and +double-quotes can be used to prevent the shell from interpreting +operators as shell metacharacters. + +The field-names available for use in filters can be found in the +'format' files for trace events (see section 4). + +The relational-operators depend on the type of the field being tested: + +The operators available for numeric fields are: + +==, !=, <, <=, >, >= + +And for string fields they are: + +==, != + +Currently, only exact string matches are supported. + +Currently, the maximum number of predicates in a filter is 16. + +5.2 Setting filters +------------------- + +A filter for an individual event is set by writing a filter expression +to the 'filter' file for the given event. + +For example: + +# cd /debug/tracing/events/sched/sched_wakeup +# echo "common_preempt_count > 4" > filter + +A slightly more involved example: + +# cd /debug/tracing/events/sched/sched_signal_send +# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter + +If there is an error in the expression, you'll get an 'Invalid +argument' error when setting it, and the erroneous string along with +an error message can be seen by looking at the filter e.g.: + +# cd /debug/tracing/events/sched/sched_signal_send +# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter +-bash: echo: write error: Invalid argument +# cat filter +((sig >= 10 && sig < 15) || dsig == 17) && comm != bash +^ +parse_error: Field not found + +Currently the caret ('^') for an error always appears at the beginning of +the filter string; the error message should still be useful though +even without more accurate position info. + +5.3 Clearing filters +-------------------- + +To clear the filter for an event, write a '0' to the event's filter +file. + +To clear the filters for all events in a subsystem, write a '0' to the +subsystem's filter file. + +5.3 Subsystem filters +--------------------- + +For convenience, filters for every event in a subsystem can be set or +cleared as a group by writing a filter expression into the filter file +at the root of the subsytem. Note however, that if a filter for any +event within the subsystem lacks a field specified in the subsystem +filter, or if the filter can't be applied for any other reason, the +filter for that event will retain its previous setting. This can +result in an unintended mixture of filters which could lead to +confusing (to the user who might think different filters are in +effect) trace output. Only filters that reference just the common +fields can be guaranteed to propagate successfully to all events. + +Here are a few subsystem filter examples that also illustrate the +above points: + +Clear the filters on all events in the sched subsytem: + +# cd /sys/kernel/debug/tracing/events/sched +# echo 0 > filter +# cat sched_switch/filter +none +# cat sched_wakeup/filter +none + +Set a filter using only common fields for all events in the sched +subsytem (all events end up with the same filter): + +# cd /sys/kernel/debug/tracing/events/sched +# echo common_pid == 0 > filter +# cat sched_switch/filter +common_pid == 0 +# cat sched_wakeup/filter +common_pid == 0 + +Attempt to set a filter using a non-common field for all events in the +sched subsytem (all events but those that have a prev_pid field retain +their old filters): + +# cd /sys/kernel/debug/tracing/events/sched +# echo prev_pid == 0 > filter +# cat sched_switch/filter +prev_pid == 0 +# cat sched_wakeup/filter +common_pid == 0 diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt new file mode 100644 index 00000000000..7003e10f10f --- /dev/null +++ b/Documentation/trace/ftrace-design.txt @@ -0,0 +1,233 @@ + function tracer guts + ==================== + +Introduction +------------ + +Here we will cover the architecture pieces that the common function tracing +code relies on for proper functioning. Things are broken down into increasing +complexity so that you can start simple and at least get basic functionality. + +Note that this focuses on architecture implementation details only. If you +want more explanation of a feature in terms of common code, review the common +ftrace.txt file. + + +Prerequisites +------------- + +Ftrace relies on these features being implemented: + STACKTRACE_SUPPORT - implement save_stack_trace() + TRACE_IRQFLAGS_SUPPORT - implement include/asm/irqflags.h + + +HAVE_FUNCTION_TRACER +-------------------- + +You will need to implement the mcount and the ftrace_stub functions. + +The exact mcount symbol name will depend on your toolchain. Some call it +"mcount", "_mcount", or even "__mcount". You can probably figure it out by +running something like: + $ echo 'main(){}' | gcc -x c -S -o - - -pg | grep mcount + call mcount +We'll make the assumption below that the symbol is "mcount" just to keep things +nice and simple in the examples. + +Keep in mind that the ABI that is in effect inside of the mcount function is +*highly* architecture/toolchain specific. We cannot help you in this regard, +sorry. Dig up some old documentation and/or find someone more familiar than +you to bang ideas off of. Typically, register usage (argument/scratch/etc...) +is a major issue at this point, especially in relation to the location of the +mcount call (before/after function prologue). You might also want to look at +how glibc has implemented the mcount function for your architecture. It might +be (semi-)relevant. + +The mcount function should check the function pointer ftrace_trace_function +to see if it is set to ftrace_stub. If it is, there is nothing for you to do, +so return immediately. If it isn't, then call that function in the same way +the mcount function normally calls __mcount_internal -- the first argument is +the "frompc" while the second argument is the "selfpc" (adjusted to remove the +size of the mcount call that is embedded in the function). + +For example, if the function foo() calls bar(), when the bar() function calls +mcount(), the arguments mcount() will pass to the tracer are: + "frompc" - the address bar() will use to return to foo() + "selfpc" - the address bar() (with _mcount() size adjustment) + +Also keep in mind that this mcount function will be called *a lot*, so +optimizing for the default case of no tracer will help the smooth running of +your system when tracing is disabled. So the start of the mcount function is +typically the bare min with checking things before returning. That also means +the code flow should usually kept linear (i.e. no branching in the nop case). +This is of course an optimization and not a hard requirement. + +Here is some pseudo code that should help (these functions should actually be +implemented in assembly): + +void ftrace_stub(void) +{ + return; +} + +void mcount(void) +{ + /* save any bare state needed in order to do initial checking */ + + extern void (*ftrace_trace_function)(unsigned long, unsigned long); + if (ftrace_trace_function != ftrace_stub) + goto do_trace; + + /* restore any bare state */ + + return; + +do_trace: + + /* save all state needed by the ABI (see paragraph above) */ + + unsigned long frompc = ...; + unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; + ftrace_trace_function(frompc, selfpc); + + /* restore all state needed by the ABI */ +} + +Don't forget to export mcount for modules ! +extern void mcount(void); +EXPORT_SYMBOL(mcount); + + +HAVE_FUNCTION_TRACE_MCOUNT_TEST +------------------------------- + +This is an optional optimization for the normal case when tracing is turned off +in the system. If you do not enable this Kconfig option, the common ftrace +code will take care of doing the checking for you. + +To support this feature, you only need to check the function_trace_stop +variable in the mcount function. If it is non-zero, there is no tracing to be +done at all, so you can return. + +This additional pseudo code would simply be: +void mcount(void) +{ + /* save any bare state needed in order to do initial checking */ + ++ if (function_trace_stop) ++ return; + + extern void (*ftrace_trace_function)(unsigned long, unsigned long); + if (ftrace_trace_function != ftrace_stub) +... + + +HAVE_FUNCTION_GRAPH_TRACER +-------------------------- + +Deep breath ... time to do some real work. Here you will need to update the +mcount function to check ftrace graph function pointers, as well as implement +some functions to save (hijack) and restore the return address. + +The mcount function should check the function pointers ftrace_graph_return +(compare to ftrace_stub) and ftrace_graph_entry (compare to +ftrace_graph_entry_stub). If either of those are not set to the relevant stub +function, call the arch-specific function ftrace_graph_caller which in turn +calls the arch-specific function prepare_ftrace_return. Neither of these +function names are strictly required, but you should use them anyways to stay +consistent across the architecture ports -- easier to compare & contrast +things. + +The arguments to prepare_ftrace_return are slightly different than what are +passed to ftrace_trace_function. The second argument "selfpc" is the same, +but the first argument should be a pointer to the "frompc". Typically this is +located on the stack. This allows the function to hijack the return address +temporarily to have it point to the arch-specific function return_to_handler. +That function will simply call the common ftrace_return_to_handler function and +that will return the original return address with which, you can return to the +original call site. + +Here is the updated mcount pseudo code: +void mcount(void) +{ +... + if (ftrace_trace_function != ftrace_stub) + goto do_trace; + ++#ifdef CONFIG_FUNCTION_GRAPH_TRACER ++ extern void (*ftrace_graph_return)(...); ++ extern void (*ftrace_graph_entry)(...); ++ if (ftrace_graph_return != ftrace_stub || ++ ftrace_graph_entry != ftrace_graph_entry_stub) ++ ftrace_graph_caller(); ++#endif + + /* restore any bare state */ +... + +Here is the pseudo code for the new ftrace_graph_caller assembly function: +#ifdef CONFIG_FUNCTION_GRAPH_TRACER +void ftrace_graph_caller(void) +{ + /* save all state needed by the ABI */ + + unsigned long *frompc = &...; + unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; + prepare_ftrace_return(frompc, selfpc); + + /* restore all state needed by the ABI */ +} +#endif + +For information on how to implement prepare_ftrace_return(), simply look at +the x86 version. The only architecture-specific piece in it is the setup of +the fault recovery table (the asm(...) code). The rest should be the same +across architectures. + +Here is the pseudo code for the new return_to_handler assembly function. Note +that the ABI that applies here is different from what applies to the mcount +code. Since you are returning from a function (after the epilogue), you might +be able to skimp on things saved/restored (usually just registers used to pass +return values). + +#ifdef CONFIG_FUNCTION_GRAPH_TRACER +void return_to_handler(void) +{ + /* save all state needed by the ABI (see paragraph above) */ + + void (*original_return_point)(void) = ftrace_return_to_handler(); + + /* restore all state needed by the ABI */ + + /* this is usually either a return or a jump */ + original_return_point(); +} +#endif + + +HAVE_FTRACE_NMI_ENTER +--------------------- + +If you can't trace NMI functions, then skip this option. + +<details to be filled> + + +HAVE_FTRACE_SYSCALLS +--------------------- + +<details to be filled> + + +HAVE_FTRACE_MCOUNT_RECORD +------------------------- + +See scripts/recordmcount.pl for more info. + +<details to be filled> + + +HAVE_DYNAMIC_FTRACE +--------------------- + +<details to be filled> diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index 7bd27f0e288..1b6292bbdd6 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -7,7 +7,6 @@ Copyright 2008 Red Hat Inc. (dual licensed under the GPL v2) Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, John Kacur, and David Teigland. - Written for: 2.6.28-rc2 Introduction @@ -27,19 +26,38 @@ disabled, and more (ftrace allows for tracer plugins, which means that the list of tracers can always grow). +Implementation Details +---------------------- + +See ftrace-design.txt for details for arch porters and such. + + The File System --------------- Ftrace uses the debugfs file system to hold the control files as well as the files to display output. -To mount the debugfs system: +When debugfs is configured into the kernel (which selecting any ftrace +option will do) the directory /sys/kernel/debug will be created. To mount +this directory, you can add to your /etc/fstab file: + + debugfs /sys/kernel/debug debugfs defaults 0 0 - # mkdir /debug - # mount -t debugfs nodev /debug +Or you can mount it at run time with: -( Note: it is more common to mount at /sys/kernel/debug, but for - simplicity this document will use /debug) + mount -t debugfs nodev /sys/kernel/debug + +For quicker access to that directory you may want to make a soft link to +it: + + ln -s /sys/kernel/debug /debug + +Any selected ftrace option will also create a directory called tracing +within the debugfs. The rest of the document will assume that you are in +the ftrace directory (cd /sys/kernel/debug/tracing) and will only concentrate +on the files within that directory and not distract from the content with +the extended "/sys/kernel/debug/tracing" path name. That's it! (assuming that you have ftrace configured into your kernel) @@ -73,26 +91,19 @@ of ftrace. Here is a list of some of the key files: This file holds the output of the trace in a human readable format (described below). - latency_trace: - - This file shows the same trace but the information - is organized more to display possible latencies - in the system (described below). - trace_pipe: The output is the same as the "trace" file but this file is meant to be streamed with live tracing. - Reads from this file will block until new data - is retrieved. Unlike the "trace" and "latency_trace" - files, this file is a consumer. This means reading - from this file causes sequential reads to display - more current data. Once data is read from this - file, it is consumed, and will not be read - again with a sequential read. The "trace" and - "latency_trace" files are static, and if the - tracer is not adding more data, they will display - the same information every time they are read. + Reads from this file will block until new data is + retrieved. Unlike the "trace" file, this file is a + consumer. This means reading from this file causes + sequential reads to display more current data. Once + data is read from this file, it is consumed, and + will not be read again with a sequential read. The + "trace" file is static, and if the tracer is not + adding more data,they will display the same + information every time they are read. trace_options: @@ -105,10 +116,10 @@ of ftrace. Here is a list of some of the key files: Some of the tracers record the max latency. For example, the time interrupts are disabled. This time is saved in this file. The max trace - will also be stored, and displayed by either - "trace" or "latency_trace". A new max trace will - only be recorded if the latency is greater than - the value in this file. (in microseconds) + will also be stored, and displayed by "trace". + A new max trace will only be recorded if the + latency is greater than the value in this + file. (in microseconds) buffer_size_kb: @@ -198,7 +209,7 @@ Here is the list of current tracers that may be configured. the trace with the longest max latency. See tracing_max_latency. When a new max is recorded, it replaces the old trace. It is best to view this - trace via the latency_trace file. + trace with the latency-format option enabled. "preemptoff" @@ -295,8 +306,8 @@ the lowest priority thread (pid 0). Latency trace format -------------------- -For traces that display latency times, the latency_trace file -gives somewhat more information to see why a latency happened. +When the latency-format option is enabled, the trace file gives +somewhat more information to see why a latency happened. Here is a typical trace. # tracer: irqsoff @@ -368,9 +379,10 @@ explains which is which. The above is mostly meaningful for kernel developers. - time: This differs from the trace file output. The trace file output - includes an absolute timestamp. The timestamp used by the - latency_trace file is relative to the start of the trace. + time: When the latency-format option is enabled, the trace file + output includes a timestamp relative to the start of the + trace. This differs from the output when latency-format + is disabled, which includes an absolute timestamp. delay: This is just to help catch your eye a bit better. And needs to be fixed to be only relative to the same CPU. @@ -389,18 +401,18 @@ trace_options The trace_options file is used to control what gets printed in the trace output. To see what is available, simply cat the file: - cat /debug/tracing/trace_options + cat trace_options print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj To disable one of the options, echo in the option prepended with "no". - echo noprint-parent > /debug/tracing/trace_options + echo noprint-parent > trace_options To enable an option, leave off the "no". - echo sym-offset > /debug/tracing/trace_options + echo sym-offset > trace_options Here are the available options: @@ -428,7 +440,8 @@ Here are the available options: sym-addr: bash-4000 [01] 1477.606694: simple_strtoul <c0339346> - verbose - This deals with the latency_trace file. + verbose - This deals with the trace file when the + latency-format option is enabled. bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ (+0.000ms): simple_strtoul (strict_strtoul) @@ -460,7 +473,7 @@ Here are the available options: the app is no longer running The lookup is performed when you read - trace,trace_pipe,latency_trace. Example: + trace,trace_pipe. Example: a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] @@ -469,6 +482,11 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] every scheduling event. Will add overhead if there's a lot of tasks running at once. + latency-format - This option changes the trace. When + it is enabled, the trace displays + additional information about the + latencies, as described in "Latency + trace format". sched_switch ------------ @@ -476,11 +494,11 @@ sched_switch This tracer simply records schedule switches. Here is an example of how to use it. - # echo sched_switch > /debug/tracing/current_tracer - # echo 1 > /debug/tracing/tracing_enabled + # echo sched_switch > current_tracer + # echo 1 > tracing_enabled # sleep 1 - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/trace + # echo 0 > tracing_enabled + # cat trace # tracer: sched_switch # @@ -583,13 +601,14 @@ new trace is saved. To reset the maximum, echo 0 into tracing_max_latency. Here is an example: - # echo irqsoff > /debug/tracing/current_tracer - # echo 0 > /debug/tracing/tracing_max_latency - # echo 1 > /debug/tracing/tracing_enabled + # echo irqsoff > current_tracer + # echo latency-format > trace_options + # echo 0 > tracing_max_latency + # echo 1 > tracing_enabled # ls -ltr [...] - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/latency_trace + # echo 0 > tracing_enabled + # cat trace # tracer: irqsoff # irqsoff latency trace v1.1.5 on 2.6.26 @@ -690,13 +709,14 @@ Like the irqsoff tracer, it records the maximum latency for which preemption was disabled. The control of preemptoff tracer is much like the irqsoff tracer. - # echo preemptoff > /debug/tracing/current_tracer - # echo 0 > /debug/tracing/tracing_max_latency - # echo 1 > /debug/tracing/tracing_enabled + # echo preemptoff > current_tracer + # echo latency-format > trace_options + # echo 0 > tracing_max_latency + # echo 1 > tracing_enabled # ls -ltr [...] - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/latency_trace + # echo 0 > tracing_enabled + # cat trace # tracer: preemptoff # preemptoff latency trace v1.1.5 on 2.6.26-rc8 @@ -837,13 +857,14 @@ tracer. Again, using this trace is much like the irqsoff and preemptoff tracers. - # echo preemptirqsoff > /debug/tracing/current_tracer - # echo 0 > /debug/tracing/tracing_max_latency - # echo 1 > /debug/tracing/tracing_enabled + # echo preemptirqsoff > current_tracer + # echo latency-format > trace_options + # echo 0 > tracing_max_latency + # echo 1 > tracing_enabled # ls -ltr [...] - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/latency_trace + # echo 0 > tracing_enabled + # cat trace # tracer: preemptirqsoff # preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 @@ -999,12 +1020,13 @@ slightly differently than we did with the previous tracers. Instead of performing an 'ls', we will run 'sleep 1' under 'chrt' which changes the priority of the task. - # echo wakeup > /debug/tracing/current_tracer - # echo 0 > /debug/tracing/tracing_max_latency - # echo 1 > /debug/tracing/tracing_enabled + # echo wakeup > current_tracer + # echo latency-format > trace_options + # echo 0 > tracing_max_latency + # echo 1 > tracing_enabled # chrt -f 5 sleep 1 - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/latency_trace + # echo 0 > tracing_enabled + # cat trace # tracer: wakeup # wakeup latency trace v1.1.5 on 2.6.26-rc8 @@ -1114,11 +1136,11 @@ can be done from the debug file system. Make sure the ftrace_enabled is set; otherwise this tracer is a nop. # sysctl kernel.ftrace_enabled=1 - # echo function > /debug/tracing/current_tracer - # echo 1 > /debug/tracing/tracing_enabled + # echo function > current_tracer + # echo 1 > tracing_enabled # usleep 1 - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/trace + # echo 0 > tracing_enabled + # cat trace # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION @@ -1155,7 +1177,7 @@ int trace_fd; [...] int main(int argc, char *argv[]) { [...] - trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY); + trace_fd = open(tracing_file("tracing_enabled"), O_WRONLY); [...] if (condition_hit()) { write(trace_fd, "0", 1); @@ -1163,26 +1185,20 @@ int main(int argc, char *argv[]) { [...] } -Note: Here we hard coded the path name. The debugfs mount is not -guaranteed to be at /debug (and is more commonly at -/sys/kernel/debug). For simple one time traces, the above is -sufficent. For anything else, a search through /proc/mounts may -be needed to find where the debugfs file-system is mounted. - Single thread tracing --------------------- -By writing into /debug/tracing/set_ftrace_pid you can trace a +By writing into set_ftrace_pid you can trace a single thread. For example: -# cat /debug/tracing/set_ftrace_pid +# cat set_ftrace_pid no pid -# echo 3111 > /debug/tracing/set_ftrace_pid -# cat /debug/tracing/set_ftrace_pid +# echo 3111 > set_ftrace_pid +# cat set_ftrace_pid 3111 -# echo function > /debug/tracing/current_tracer -# cat /debug/tracing/trace | head +# echo function > current_tracer +# cat trace | head # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION @@ -1193,8 +1209,8 @@ no pid yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll -# echo -1 > /debug/tracing/set_ftrace_pid -# cat /debug/tracing/trace |head +# echo -1 > set_ftrace_pid +# cat trace |head # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION @@ -1216,6 +1232,51 @@ something like this simple program: #include <fcntl.h> #include <unistd.h> +#define _STR(x) #x +#define STR(x) _STR(x) +#define MAX_PATH 256 + +const char *find_debugfs(void) +{ + static char debugfs[MAX_PATH+1]; + static int debugfs_found; + char type[100]; + FILE *fp; + + if (debugfs_found) + return debugfs; + + if ((fp = fopen("/proc/mounts","r")) == NULL) { + perror("/proc/mounts"); + return NULL; + } + + while (fscanf(fp, "%*s %" + STR(MAX_PATH) + "s %99s %*s %*d %*d\n", + debugfs, type) == 2) { + if (strcmp(type, "debugfs") == 0) + break; + } + fclose(fp); + + if (strcmp(type, "debugfs") != 0) { + fprintf(stderr, "debugfs not mounted"); + return NULL; + } + + debugfs_found = 1; + + return debugfs; +} + +const char *tracing_file(const char *file_name) +{ + static char trace_file[MAX_PATH+1]; + snprintf(trace_file, MAX_PATH, "%s/%s", find_debugfs(), file_name); + return trace_file; +} + int main (int argc, char **argv) { if (argc < 1) @@ -1226,12 +1287,12 @@ int main (int argc, char **argv) char line[64]; int s; - ffd = open("/debug/tracing/current_tracer", O_WRONLY); + ffd = open(tracing_file("current_tracer"), O_WRONLY); if (ffd < 0) exit(-1); write(ffd, "nop", 3); - fd = open("/debug/tracing/set_ftrace_pid", O_WRONLY); + fd = open(tracing_file("set_ftrace_pid"), O_WRONLY); s = sprintf(line, "%d\n", getpid()); write(fd, line, s); @@ -1383,22 +1444,22 @@ want, depending on your needs. tracing_cpu_mask file) or you might sometimes see unordered function calls while cpu tracing switch. - hide: echo nofuncgraph-cpu > /debug/tracing/trace_options - show: echo funcgraph-cpu > /debug/tracing/trace_options + hide: echo nofuncgraph-cpu > trace_options + show: echo funcgraph-cpu > trace_options - The duration (function's time of execution) is displayed on the closing bracket line of a function or on the same line than the current function in case of a leaf one. It is default enabled. - hide: echo nofuncgraph-duration > /debug/tracing/trace_options - show: echo funcgraph-duration > /debug/tracing/trace_options + hide: echo nofuncgraph-duration > trace_options + show: echo funcgraph-duration > trace_options - The overhead field precedes the duration field in case of reached duration thresholds. - hide: echo nofuncgraph-overhead > /debug/tracing/trace_options - show: echo funcgraph-overhead > /debug/tracing/trace_options + hide: echo nofuncgraph-overhead > trace_options + show: echo funcgraph-overhead > trace_options depends on: funcgraph-duration ie: @@ -1427,8 +1488,8 @@ want, depending on your needs. - The task/pid field displays the thread cmdline and pid which executed the function. It is default disabled. - hide: echo nofuncgraph-proc > /debug/tracing/trace_options - show: echo funcgraph-proc > /debug/tracing/trace_options + hide: echo nofuncgraph-proc > trace_options + show: echo funcgraph-proc > trace_options ie: @@ -1451,8 +1512,8 @@ want, depending on your needs. system clock since it started. A snapshot of this time is given on each entry/exit of functions - hide: echo nofuncgraph-abstime > /debug/tracing/trace_options - show: echo funcgraph-abstime > /debug/tracing/trace_options + hide: echo nofuncgraph-abstime > trace_options + show: echo funcgraph-abstime > trace_options ie: @@ -1549,7 +1610,7 @@ listed in: available_filter_functions - # cat /debug/tracing/available_filter_functions + # cat available_filter_functions put_prev_task_idle kmem_cache_create pick_next_task_rt @@ -1561,12 +1622,12 @@ mutex_lock If I am only interested in sys_nanosleep and hrtimer_interrupt: # echo sys_nanosleep hrtimer_interrupt \ - > /debug/tracing/set_ftrace_filter - # echo ftrace > /debug/tracing/current_tracer - # echo 1 > /debug/tracing/tracing_enabled + > set_ftrace_filter + # echo ftrace > current_tracer + # echo 1 > tracing_enabled # usleep 1 - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/trace + # echo 0 > tracing_enabled + # cat trace # tracer: ftrace # # TASK-PID CPU# TIMESTAMP FUNCTION @@ -1577,7 +1638,7 @@ If I am only interested in sys_nanosleep and hrtimer_interrupt: To see which functions are being traced, you can cat the file: - # cat /debug/tracing/set_ftrace_filter + # cat set_ftrace_filter hrtimer_interrupt sys_nanosleep @@ -1597,7 +1658,7 @@ Note: It is better to use quotes to enclose the wild cards, otherwise the shell may expand the parameters into names of files in the local directory. - # echo 'hrtimer_*' > /debug/tracing/set_ftrace_filter + # echo 'hrtimer_*' > set_ftrace_filter Produces: @@ -1618,7 +1679,7 @@ Produces: Notice that we lost the sys_nanosleep. - # cat /debug/tracing/set_ftrace_filter + # cat set_ftrace_filter hrtimer_run_queues hrtimer_run_pending hrtimer_init @@ -1644,17 +1705,17 @@ To append to the filters, use '>>' To clear out a filter so that all functions will be recorded again: - # echo > /debug/tracing/set_ftrace_filter - # cat /debug/tracing/set_ftrace_filter + # echo > set_ftrace_filter + # cat set_ftrace_filter # Again, now we want to append. - # echo sys_nanosleep > /debug/tracing/set_ftrace_filter - # cat /debug/tracing/set_ftrace_filter + # echo sys_nanosleep > set_ftrace_filter + # cat set_ftrace_filter sys_nanosleep - # echo 'hrtimer_*' >> /debug/tracing/set_ftrace_filter - # cat /debug/tracing/set_ftrace_filter + # echo 'hrtimer_*' >> set_ftrace_filter + # cat set_ftrace_filter hrtimer_run_queues hrtimer_run_pending hrtimer_init @@ -1677,7 +1738,7 @@ hrtimer_init_sleeper The set_ftrace_notrace prevents those functions from being traced. - # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace + # echo '*preempt*' '*lock*' > set_ftrace_notrace Produces: @@ -1767,13 +1828,13 @@ the effect on the tracing is different. Every read from trace_pipe is consumed. This means that subsequent reads will be different. The trace is live. - # echo function > /debug/tracing/current_tracer - # cat /debug/tracing/trace_pipe > /tmp/trace.out & + # echo function > current_tracer + # cat trace_pipe > /tmp/trace.out & [1] 4153 - # echo 1 > /debug/tracing/tracing_enabled + # echo 1 > tracing_enabled # usleep 1 - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/trace + # echo 0 > tracing_enabled + # cat trace # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION @@ -1809,7 +1870,7 @@ number listed is the number of entries that can be recorded per CPU. To know the full size, multiply the number of possible CPUS with the number of entries. - # cat /debug/tracing/buffer_size_kb + # cat buffer_size_kb 1408 (units kilobytes) Note, to modify this, you must have tracing completely disabled. @@ -1817,18 +1878,18 @@ To do that, echo "nop" into the current_tracer. If the current_tracer is not set to "nop", an EINVAL error will be returned. - # echo nop > /debug/tracing/current_tracer - # echo 10000 > /debug/tracing/buffer_size_kb - # cat /debug/tracing/buffer_size_kb + # echo nop > current_tracer + # echo 10000 > buffer_size_kb + # cat buffer_size_kb 10000 (units kilobytes) The number of pages which will be allocated is limited to a percentage of available memory. Allocating too much will produce an error. - # echo 1000000000000 > /debug/tracing/buffer_size_kb + # echo 1000000000000 > buffer_size_kb -bash: echo: write error: Cannot allocate memory - # cat /debug/tracing/buffer_size_kb + # cat buffer_size_kb 85 ----------- diff --git a/Documentation/trace/function-graph-fold.vim b/Documentation/trace/function-graph-fold.vim new file mode 100644 index 00000000000..0544b504c8b --- /dev/null +++ b/Documentation/trace/function-graph-fold.vim @@ -0,0 +1,42 @@ +" Enable folding for ftrace function_graph traces. +" +" To use, :source this file while viewing a function_graph trace, or use vim's +" -S option to load from the command-line together with a trace. You can then +" use the usual vim fold commands, such as "za", to open and close nested +" functions. While closed, a fold will show the total time taken for a call, +" as would normally appear on the line with the closing brace. Folded +" functions will not include finish_task_switch(), so folding should remain +" relatively sane even through a context switch. +" +" Note that this will almost certainly only work well with a +" single-CPU trace (e.g. trace-cmd report --cpu 1). + +function! FunctionGraphFoldExpr(lnum) + let line = getline(a:lnum) + if line[-1:] == '{' + if line =~ 'finish_task_switch() {$' + return '>1' + endif + return 'a1' + elseif line[-1:] == '}' + return 's1' + else + return '=' + endif +endfunction + +function! FunctionGraphFoldText() + let s = split(getline(v:foldstart), '|', 1) + if getline(v:foldend+1) =~ 'finish_task_switch() {$' + let s[2] = ' task switch ' + else + let e = split(getline(v:foldend), '|', 1) + let s[2] = e[2] + endif + return join(s, '|') +endfunction + +setlocal foldexpr=FunctionGraphFoldExpr(v:lnum) +setlocal foldtext=FunctionGraphFoldText() +setlocal foldcolumn=12 +setlocal foldmethod=expr diff --git a/Documentation/trace/mmiotrace.txt b/Documentation/trace/mmiotrace.txt index 5731c67abc5..162effbfbde 100644 --- a/Documentation/trace/mmiotrace.txt +++ b/Documentation/trace/mmiotrace.txt @@ -32,41 +32,41 @@ is no way to automatically detect if you are losing events due to CPUs racing. Usage Quick Reference --------------------- -$ mount -t debugfs debugfs /debug -$ echo mmiotrace > /debug/tracing/current_tracer -$ cat /debug/tracing/trace_pipe > mydump.txt & +$ mount -t debugfs debugfs /sys/kernel/debug +$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer +$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt & Start X or whatever. -$ echo "X is up" > /debug/tracing/trace_marker -$ echo nop > /debug/tracing/current_tracer +$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker +$ echo nop > /sys/kernel/debug/tracing/current_tracer Check for lost events. Usage ----- -Make sure debugfs is mounted to /debug. If not, (requires root privileges) -$ mount -t debugfs debugfs /debug +Make sure debugfs is mounted to /sys/kernel/debug. If not, (requires root privileges) +$ mount -t debugfs debugfs /sys/kernel/debug Check that the driver you are about to trace is not loaded. Activate mmiotrace (requires root privileges): -$ echo mmiotrace > /debug/tracing/current_tracer +$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer Start storing the trace: -$ cat /debug/tracing/trace_pipe > mydump.txt & +$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt & The 'cat' process should stay running (sleeping) in the background. Load the driver you want to trace and use it. Mmiotrace will only catch MMIO accesses to areas that are ioremapped while mmiotrace is active. During tracing you can place comments (markers) into the trace by -$ echo "X is up" > /debug/tracing/trace_marker +$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker This makes it easier to see which part of the (huge) trace corresponds to which action. It is recommended to place descriptive markers about what you do. Shut down mmiotrace (requires root privileges): -$ echo nop > /debug/tracing/current_tracer +$ echo nop > /sys/kernel/debug/tracing/current_tracer The 'cat' process exits. If it does not, kill it by issuing 'fg' command and pressing ctrl+c. @@ -78,10 +78,10 @@ to view your kernel log and look for "mmiotrace has lost events" warning. If events were lost, the trace is incomplete. You should enlarge the buffers and try again. Buffers are enlarged by first seeing how large the current buffers are: -$ cat /debug/tracing/buffer_size_kb +$ cat /sys/kernel/debug/tracing/buffer_size_kb gives you a number. Approximately double this number and write it back, for instance: -$ echo 128000 > /debug/tracing/buffer_size_kb +$ echo 128000 > /sys/kernel/debug/tracing/buffer_size_kb Then start again from the top. If you are doing a trace for a driver project, e.g. Nouveau, you should also diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt new file mode 100644 index 00000000000..5b1d23d604c --- /dev/null +++ b/Documentation/trace/ring-buffer-design.txt @@ -0,0 +1,955 @@ + Lockless Ring Buffer Design + =========================== + +Copyright 2009 Red Hat Inc. + Author: Steven Rostedt <srostedt@redhat.com> + License: The GNU Free Documentation License, Version 1.2 + (dual licensed under the GPL v2) +Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, + and Frederic Weisbecker. + + +Written for: 2.6.31 + +Terminology used in this Document +--------------------------------- + +tail - where new writes happen in the ring buffer. + +head - where new reads happen in the ring buffer. + +producer - the task that writes into the ring buffer (same as writer) + +writer - same as producer + +consumer - the task that reads from the buffer (same as reader) + +reader - same as consumer. + +reader_page - A page outside the ring buffer used solely (for the most part) + by the reader. + +head_page - a pointer to the page that the reader will use next + +tail_page - a pointer to the page that will be written to next + +commit_page - a pointer to the page with the last finished non nested write. + +cmpxchg - hardware assisted atomic transaction that performs the following: + + A = B iff previous A == C + + R = cmpxchg(A, C, B) is saying that we replace A with B if and only if + current A is equal to C, and we put the old (current) A into R + + R gets the previous A regardless if A is updated with B or not. + + To see if the update was successful a compare of R == C may be used. + +The Generic Ring Buffer +----------------------- + +The ring buffer can be used in either an overwrite mode or in +producer/consumer mode. + +Producer/consumer mode is where the producer were to fill up the +buffer before the consumer could free up anything, the producer +will stop writing to the buffer. This will lose most recent events. + +Overwrite mode is where the produce were to fill up the buffer +before the consumer could free up anything, the producer will +overwrite the older data. This will lose the oldest events. + +No two writers can write at the same time (on the same per cpu buffer), +but a writer may interrupt another writer, but it must finish writing +before the previous writer may continue. This is very important to the +algorithm. The writers act like a "stack". The way interrupts works +enforces this behavior. + + + writer1 start + <preempted> writer2 start + <preempted> writer3 start + writer3 finishes + writer2 finishes + writer1 finishes + +This is very much like a writer being preempted by an interrupt and +the interrupt doing a write as well. + +Readers can happen at any time. But no two readers may run at the +same time, nor can a reader preempt/interrupt another reader. A reader +can not preempt/interrupt a writer, but it may read/consume from the +buffer at the same time as a writer is writing, but the reader must be +on another processor to do so. A reader may read on its own processor +and can be preempted by a writer. + +A writer can preempt a reader, but a reader can not preempt a writer. +But a reader can read the buffer at the same time (on another processor) +as a writer. + +The ring buffer is made up of a list of pages held together by a link list. + +At initialization a reader page is allocated for the reader that is not +part of the ring buffer. + +The head_page, tail_page and commit_page are all initialized to point +to the same page. + +The reader page is initialized to have its next pointer pointing to +the head page, and its previous pointer pointing to a page before +the head page. + +The reader has its own page to use. At start up time, this page is +allocated but is not attached to the list. When the reader wants +to read from the buffer, if its page is empty (like it is on start up) +it will swap its page with the head_page. The old reader page will +become part of the ring buffer and the head_page will be removed. +The page after the inserted page (old reader_page) will become the +new head page. + +Once the new page is given to the reader, the reader could do what +it wants with it, as long as a writer has left that page. + +A sample of how the reader page is swapped: Note this does not +show the head page in the buffer, it is for demonstrating a swap +only. + + +------+ + |reader| RING BUFFER + |page | + +------+ + +---+ +---+ +---+ + | |-->| |-->| | + | |<--| |<--| | + +---+ +---+ +---+ + ^ | ^ | + | +-------------+ | + +-----------------+ + + + +------+ + |reader| RING BUFFER + |page |-------------------+ + +------+ v + | +---+ +---+ +---+ + | | |-->| |-->| | + | | |<--| |<--| |<-+ + | +---+ +---+ +---+ | + | ^ | ^ | | + | | +-------------+ | | + | +-----------------+ | + +------------------------------------+ + + +------+ + |reader| RING BUFFER + |page |-------------------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | | | |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + + +------+ + |buffer| RING BUFFER + |page |-------------------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | | | |-->| | + | | New | | | |<--| |<-+ + | | Reader +---+ +---+ +---+ | + | | page ----^ | | + | | | | + | +-----------------------------+ | + +------------------------------------+ + + + +It is possible that the page swapped is the commit page and the tail page, +if what is in the ring buffer is less than what is held in a buffer page. + + + reader page commit page tail page + | | | + v | | + +---+ | | + | |<----------+ | + | |<------------------------+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +This case is still valid for this algorithm. +When the writer leaves the page, it simply goes into the ring buffer +since the reader page still points to the next location in the ring +buffer. + + +The main pointers: + + reader page - The page used solely by the reader and is not part + of the ring buffer (may be swapped in) + + head page - the next page in the ring buffer that will be swapped + with the reader page. + + tail page - the page where the next write will take place. + + commit page - the page that last finished a write. + +The commit page only is updated by the outer most writer in the +writer stack. A writer that preempts another writer will not move the +commit page. + +When data is written into the ring buffer, a position is reserved +in the ring buffer and passed back to the writer. When the writer +is finished writing data into that position, it commits the write. + +Another write (or a read) may take place at anytime during this +transaction. If another write happens it must finish before continuing +with the previous write. + + + Write reserve: + + Buffer page + +---------+ + |written | + +---------+ <--- given back to writer (current commit) + |reserved | + +---------+ <--- tail pointer + | empty | + +---------+ + + Write commit: + + Buffer page + +---------+ + |written | + +---------+ + |written | + +---------+ <--- next positon for write (current commit) + | empty | + +---------+ + + + If a write happens after the first reserve: + + Buffer page + +---------+ + |written | + +---------+ <-- current commit + |reserved | + +---------+ <--- given back to second writer + |reserved | + +---------+ <--- tail pointer + + After second writer commits: + + + Buffer page + +---------+ + |written | + +---------+ <--(last full commit) + |reserved | + +---------+ + |pending | + |commit | + +---------+ <--- tail pointer + + When the first writer commits: + + Buffer page + +---------+ + |written | + +---------+ + |written | + +---------+ + |written | + +---------+ <--(last full commit and tail pointer) + + +The commit pointer points to the last write location that was +committed without preempting another write. When a write that +preempted another write is committed, it only becomes a pending commit +and will not be a full commit till all writes have been committed. + +The commit page points to the page that has the last full commit. +The tail page points to the page with the last write (before +committing). + +The tail page is always equal to or after the commit page. It may +be several pages ahead. If the tail page catches up to the commit +page then no more writes may take place (regardless of the mode +of the ring buffer: overwrite and produce/consumer). + +The order of pages are: + + head page + commit page + tail page + +Possible scenario: + tail page + head page commit page | + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +There is a special case that the head page is after either the commit page +and possibly the tail page. That is when the commit (and tail) page has been +swapped with the reader page. This is because the head page is always +part of the ring buffer, but the reader page is not. When ever there +has been less than a full page that has been committed inside the ring buffer, +and a reader swaps out a page, it will be swapping out the commit page. + + + reader page commit page tail page + | | | + v | | + +---+ | | + | |<----------+ | + | |<------------------------+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + +In this case, the head page will not move when the tail and commit +move back into the ring buffer. + +The reader can not swap a page into the ring buffer if the commit page +is still on that page. If the read meets the last commit (real commit +not pending or reserved), then there is nothing more to read. +The buffer is considered empty until another full commit finishes. + +When the tail meets the head page, if the buffer is in overwrite mode, +the head page will be pushed ahead one. If the buffer is in producer/consumer +mode, the write will fail. + +Overwrite mode: + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + +Note, the reader page will still point to the previous head page. +But when a swap takes place, it will use the most recent head page. + + +Making the Ring Buffer Lockless: +-------------------------------- + +The main idea behind the lockless algorithm is to combine the moving +of the head_page pointer with the swapping of pages with the reader. +State flags are placed inside the pointer to the page. To do this, +each page must be aligned in memory by 4 bytes. This will allow the 2 +least significant bits of the address to be used as flags. Since +they will always be zero for the address. To get the address, +simply mask out the flags. + + MASK = ~3 + + address & MASK + +Two flags will be kept by these two bits: + + HEADER - the page being pointed to is a head page + + UPDATE - the page being pointed to is being updated by a writer + and was or is about to be a head page. + + + reader page + | + v + +---+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The above pointer "-H->" would have the HEADER flag set. That is +the next page is the next page to be swapped out by the reader. +This pointer means the next page is the head page. + +When the tail page meets the head pointer, it will use cmpxchg to +change the pointer to the UPDATE state: + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +"-U->" represents a pointer in the UPDATE state. + +Any access to the reader will need to take some sort of lock to serialize +the readers. But the writers will never take a lock to write to the +ring buffer. This means we only need to worry about a single reader, +and writes only preempt in "stack" formation. + +When the reader tries to swap the page with the ring buffer, it +will also use cmpxchg. If the flag bit in the pointer to the +head page does not have the HEADER flag set, the compare will fail +and the reader will need to look for the new head page and try again. +Note, the flag UPDATE and HEADER are never set at the same time. + +The reader swaps the reader page as follows: + + +------+ + |reader| RING BUFFER + |page | + +------+ + +---+ +---+ +---+ + | |--->| |--->| | + | |<---| |<---| | + +---+ +---+ +---+ + ^ | ^ | + | +---------------+ | + +-----H-------------+ + +The reader sets the reader page next pointer as HEADER to the page after +the head page. + + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ v + | +---+ +---+ +---+ + | | |--->| |--->| | + | | |<---| |<---| |<-+ + | +---+ +---+ +---+ | + | ^ | ^ | | + | | +---------------+ | | + | +-----H-------------+ | + +--------------------------------------+ + +It does a cmpxchg with the pointer to the previous head page to make it +point to the reader page. Note that the new pointer does not have the HEADER +flag set. This action atomically moves the head page forward. + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | |<--| |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + +After the new head page is set, the previous pointer of the head page is +updated to the reader page. + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | | | |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + + +------+ + |buffer| RING BUFFER + |page |-------H-----------+ <--- New head page + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | | | |-->| | + | | New | | | |<--| |<-+ + | | Reader +---+ +---+ +---+ | + | | page ----^ | | + | | | | + | +-----------------------------+ | + +------------------------------------+ + +Another important point. The page that the reader page points back to +by its previous pointer (the one that now points to the new head page) +never points back to the reader page. That is because the reader page is +not part of the ring buffer. Traversing the ring buffer via the next pointers +will always stay in the ring buffer. Traversing the ring buffer via the +prev pointers may not. + +Note, the way to determine a reader page is simply by examining the previous +pointer of the page. If the next pointer of the previous page does not +point back to the original page, then the original page is a reader page: + + + +--------+ + | reader | next +----+ + | page |-------->| |<====== (buffer page) + +--------+ +----+ + | | ^ + | v | next + prev | +----+ + +------------->| | + +----+ + +The way the head page moves forward: + +When the tail page meets the head page and the buffer is in overwrite mode +and more writes take place, the head page must be moved forward before the +writer may move the tail page. The way this is done is that the writer +performs a cmpxchg to convert the pointer to the head page from the HEADER +flag to have the UPDATE flag set. Once this is done, the reader will +not be able to swap the head page from the buffer, nor will it be able to +move the head page, until the writer is finished with the move. + +This eliminates any races that the reader can have on the writer. The reader +must spin, and this is why the reader can not preempt the writer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The following page will be made into the new head page. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the new head page has been set, we can set the old head page +pointer back to NORMAL. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the head page has been moved, the tail page may now move forward. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The above are the trivial updates. Now for the more complex scenarios. + + +As stated before, if enough writes preempt the first write, the +tail page may make it all the way around the buffer and meet the commit +page. At this time, we must start dropping writes (usually with some kind +of warning to the user). But what happens if the commit was still on the +reader page? The commit page is not part of the ring buffer. The tail page +must account for this. + + + reader page commit page + | | + v | + +---+ | + | |<----------+ + | | + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + tail page + +If the tail page were to simply push the head page forward, the commit when +leaving the reader page would not be pointing to the correct page. + +The solution to this is to test if the commit page is on the reader page +before pushing the head page. If it is, then it can be assumed that the +tail page wrapped the buffer, and we must drop new writes. + +This is not a race condition, because the commit page can only be moved +by the outter most writer (the writer that was preempted). +This means that the commit will not move while a writer is moving the +tail page. The reader can not swap the reader page if it is also being +used as the commit page. The reader can simply check that the commit +is off the reader page. Once the commit page leaves the reader page +it will never go back on it unless a reader does another swap with the +buffer page that is also the commit page. + + +Nested writes +------------- + +In the pushing forward of the tail page we must first push forward +the head page if the head page is the next page. If the head page +is not the next page, the tail page is simply updated with a cmpxchg. + +Only writers move the tail page. This must be done atomically to protect +against nested writers. + + temp_page = tail_page + next_page = temp_page->next + cmpxchg(tail_page, temp_page, next_page) + +The above will update the tail page if it is still pointing to the expected +page. If this fails, a nested write pushed it forward, the the current write +does not need to push it. + + + temp page + | + v + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Nested write comes in and moves the tail page forward: + + tail page (moved by nested writer) + temp page | + | | + v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The above would fail the cmpxchg, but since the tail page has already +been moved forward, the writer will just try again to reserve storage +on the new tail page. + +But the moving of the head page is a bit more complex. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The write converts the head page pointer to UPDATE. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But if a nested writer preempts here. It will see that the next +page is a head page, but it is also nested. It will detect that +it is nested and will save that information. The detection is the +fact that it sees the UPDATE flag instead of a HEADER or NORMAL +pointer. + +The nested writer will set the new head page pointer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But it will not reset the update back to normal. Only the writer +that converted a pointer from HEAD to UPDATE will convert it back +to NORMAL. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the nested writer finishes, the outer most writer will convert +the UPDATE pointer to NORMAL. + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +It can be even more complex if several nested writes came in and moved +the tail page ahead several pages: + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The write converts the head page pointer to UPDATE. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Next writer comes in, and sees the update and sets up the new +head page. + +(second writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The nested writer moves the tail page forward. But does not set the old +update page to NORMAL because it is not the outer most writer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Another writer preempts and sees the page after the tail page is a head page. +It changes it from HEAD to UPDATE. + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-U->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The writer will move the head page forward: + + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-U->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But now that the third writer did change the HEAD flag to UPDATE it +will convert it to normal: + + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +Then it will move the tail page, and return back to the second writer. + + +(second writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The second writer will fail to move the tail page because it was already +moved, so it will try again and add its data to the new tail page. +It will return to the first writer. + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The first writer can not know atomically test if the tail page moved +while it updates the HEAD page. It will then update the head page to +what it thinks is the new head page. + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Since the cmpxchg returns the old value of the pointer the first writer +will see it succeeded in updating the pointer from NORMAL to HEAD. +But as we can see, this is not good enough. It must also check to see +if the tail page is either where it use to be or on the next page: + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +If tail page != A and tail page does not equal B, then it must reset the +pointer back to NORMAL. The fact that it only needs to worry about +nested writers, it only needs to check this after setting the HEAD page. + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Now the writer can update the head page. This is also why the head page must +remain in UPDATE and only reset by the outer most writer. This prevents +the reader from seeing the incorrect head page. + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + diff --git a/Documentation/vgaarbiter.txt b/Documentation/vgaarbiter.txt new file mode 100644 index 00000000000..987f9b0a5ec --- /dev/null +++ b/Documentation/vgaarbiter.txt @@ -0,0 +1,194 @@ + +VGA Arbiter +=========== + +Graphic devices are accessed through ranges in I/O or memory space. While most +modern devices allow relocation of such ranges, some "Legacy" VGA devices +implemented on PCI will typically have the same "hard-decoded" addresses as +they did on ISA. For more details see "PCI Bus Binding to IEEE Std 1275-1994 +Standard for Boot (Initialization Configuration) Firmware Revision 2.1" +Section 7, Legacy Devices. + +The Resource Access Control (RAC) module inside the X server [0] existed for +the legacy VGA arbitration task (besides other bus management tasks) when more +than one legacy device co-exists on the same machine. But the problem happens +when these devices are trying to be accessed by different userspace clients +(e.g. two server in parallel). Their address assignments conflict. Moreover, +ideally, being an userspace application, it is not the role of the the X +server to control bus resources. Therefore an arbitration scheme outside of +the X server is needed to control the sharing of these resources. This +document introduces the operation of the VGA arbiter implemented for Linux +kernel. + +---------------------------------------------------------------------------- + +I. Details and Theory of Operation + I.1 vgaarb + I.2 libpciaccess + I.3 xf86VGAArbiter (X server implementation) +II. Credits +III.References + + +I. Details and Theory of Operation +================================== + +I.1 vgaarb +---------- + +The vgaarb is a module of the Linux Kernel. When it is initially loaded, it +scans all PCI devices and adds the VGA ones inside the arbitration. The +arbiter then enables/disables the decoding on different devices of the VGA +legacy instructions. Device which do not want/need to use the arbiter may +explicitly tell it by calling vga_set_legacy_decoding(). + +The kernel exports a char device interface (/dev/vga_arbiter) to the clients, +which has the following semantics: + + open : open user instance of the arbiter. By default, it's attached to + the default VGA device of the system. + + close : close user instance. Release locks made by the user + + read : return a string indicating the status of the target like: + + "<card_ID>,decodes=<io_state>,owns=<io_state>,locks=<io_state> (ic,mc)" + + An IO state string is of the form {io,mem,io+mem,none}, mc and + ic are respectively mem and io lock counts (for debugging/ + diagnostic only). "decodes" indicate what the card currently + decodes, "owns" indicates what is currently enabled on it, and + "locks" indicates what is locked by this card. If the card is + unplugged, we get "invalid" then for card_ID and an -ENODEV + error is returned for any command until a new card is targeted. + + + write : write a command to the arbiter. List of commands: + + target <card_ID> : switch target to card <card_ID> (see below) + lock <io_state> : acquires locks on target ("none" is an invalid io_state) + trylock <io_state> : non-blocking acquire locks on target (returns EBUSY if + unsuccessful) + unlock <io_state> : release locks on target + unlock all : release all locks on target held by this user (not + implemented yet) + decodes <io_state> : set the legacy decoding attributes for the card + + poll : event if something changes on any card (not just the + target) + + card_ID is of the form "PCI:domain:bus:dev.fn". It can be set to "default" + to go back to the system default card (TODO: not implemented yet). Currently, + only PCI is supported as a prefix, but the userland API may support other bus + types in the future, even if the current kernel implementation doesn't. + +Note about locks: + +The driver keeps track of which user has which locks on which card. It +supports stacking, like the kernel one. This complexifies the implementation +a bit, but makes the arbiter more tolerant to user space problems and able +to properly cleanup in all cases when a process dies. +Currently, a max of 16 cards can have locks simultaneously issued from +user space for a given user (file descriptor instance) of the arbiter. + +In the case of devices hot-{un,}plugged, there is a hook - pci_notify() - to +notify them being added/removed in the system and automatically added/removed +in the arbiter. + +There's also a in-kernel API of the arbiter in the case of DRM, vgacon and +others which may use the arbiter. + + +I.2 libpciaccess +---------------- + +To use the vga arbiter char device it was implemented an API inside the +libpciaccess library. One fieldd was added to struct pci_device (each device +on the system): + + /* the type of resource decoded by the device */ + int vgaarb_rsrc; + +Besides it, in pci_system were added: + + int vgaarb_fd; + int vga_count; + struct pci_device *vga_target; + struct pci_device *vga_default_dev; + + +The vga_count is usually need to keep informed how many cards are being +arbitrated, so for instance if there's only one then it can totally escape the +scheme. + + +These functions below acquire VGA resources for the given card and mark those +resources as locked. If the resources requested are "normal" (and not legacy) +resources, the arbiter will first check whether the card is doing legacy +decoding for that type of resource. If yes, the lock is "converted" into a +legacy resource lock. The arbiter will first look for all VGA cards that +might conflict and disable their IOs and/or Memory access, including VGA +forwarding on P2P bridges if necessary, so that the requested resources can +be used. Then, the card is marked as locking these resources and the IO and/or +Memory access is enabled on the card (including VGA forwarding on parent +P2P bridges if any). In the case of vga_arb_lock(), the function will block +if some conflicting card is already locking one of the required resources (or +any resource on a different bus segment, since P2P bridges don't differentiate +VGA memory and IO afaik). If the card already owns the resources, the function +succeeds. vga_arb_trylock() will return (-EBUSY) instead of blocking. Nested +calls are supported (a per-resource counter is maintained). + + +Set the target device of this client. + int pci_device_vgaarb_set_target (struct pci_device *dev); + + +For instance, in x86 if two devices on the same bus want to lock different +resources, both will succeed (lock). If devices are in different buses and +trying to lock different resources, only the first who tried succeeds. + int pci_device_vgaarb_lock (void); + int pci_device_vgaarb_trylock (void); + +Unlock resources of device. + int pci_device_vgaarb_unlock (void); + +Indicates to the arbiter if the card decodes legacy VGA IOs, legacy VGA +Memory, both, or none. All cards default to both, the card driver (fbdev for +example) should tell the arbiter if it has disabled legacy decoding, so the +card can be left out of the arbitration process (and can be safe to take +interrupts at any time. + int pci_device_vgaarb_decodes (int new_vgaarb_rsrc); + +Connects to the arbiter device, allocates the struct + int pci_device_vgaarb_init (void); + +Close the connection + void pci_device_vgaarb_fini (void); + + +I.3 xf86VGAArbiter (X server implementation) +-------------------------------------------- + +(TODO) + +X server basically wraps all the functions that touch VGA registers somehow. + + +II. Credits +=========== + +Benjamin Herrenschmidt (IBM?) started this work when he discussed such design +with the Xorg community in 2005 [1, 2]. In the end of 2007, Paulo Zanoni and +Tiago Vignatti (both of C3SL/Federal University of Paraná) proceeded his work +enhancing the kernel code to adapt as a kernel module and also did the +implementation of the user space side [3]. Now (2009) Tiago Vignatti and Dave +Airlie finally put this work in shape and queued to Jesse Barnes' PCI tree. + + +III. References +============== + +[0] http://cgit.freedesktop.org/xorg/xserver/commit/?id=4b42448a2388d40f257774fbffdccaea87bd0347 +[1] http://lists.freedesktop.org/archives/xorg/2005-March/006663.html +[2] http://lists.freedesktop.org/archives/xorg/2005-March/006745.html +[3] http://lists.freedesktop.org/archives/xorg/2007-October/029507.html diff --git a/Documentation/video4linux/CARDLIST.cx23885 b/Documentation/video4linux/CARDLIST.cx23885 index 91aa3c0f0dd..525edb37c75 100644 --- a/Documentation/video4linux/CARDLIST.cx23885 +++ b/Documentation/video4linux/CARDLIST.cx23885 @@ -16,3 +16,10 @@ 15 -> TeVii S470 [d470:9022] 16 -> DVBWorld DVB-S2 2005 [0001:2005] 17 -> NetUP Dual DVB-S2 CI [1b55:2a2c] + 18 -> Hauppauge WinTV-HVR1270 [0070:2211] + 19 -> Hauppauge WinTV-HVR1275 [0070:2215] + 20 -> Hauppauge WinTV-HVR1255 [0070:2251] + 21 -> Hauppauge WinTV-HVR1210 [0070:2291,0070:2295] + 22 -> Mygica X8506 DMB-TH [14f1:8651] + 23 -> Magic-Pro ProHDTV Extreme 2 [14f1:8657] + 24 -> Hauppauge WinTV-HVR1850 [0070:8541] diff --git a/Documentation/video4linux/CARDLIST.cx88 b/Documentation/video4linux/CARDLIST.cx88 index 71e9db0b26f..3385f8b094a 100644 --- a/Documentation/video4linux/CARDLIST.cx88 +++ b/Documentation/video4linux/CARDLIST.cx88 @@ -6,8 +6,8 @@ 5 -> Leadtek Winfast 2000XP Expert [107d:6611,107d:6613] 6 -> AverTV Studio 303 (M126) [1461:000b] 7 -> MSI TV-@nywhere Master [1462:8606] - 8 -> Leadtek Winfast DV2000 [107d:6620] - 9 -> Leadtek PVR 2000 [107d:663b,107d:663c,107d:6632] + 8 -> Leadtek Winfast DV2000 [107d:6620,107d:6621] + 9 -> Leadtek PVR 2000 [107d:663b,107d:663c,107d:6632,107d:6630,107d:6638,107d:6631,107d:6637,107d:663d] 10 -> IODATA GV-VCP3/PCI [10fc:d003] 11 -> Prolink PlayTV PVR 12 -> ASUS PVR-416 [1043:4823,1461:c111] @@ -59,7 +59,7 @@ 58 -> Pinnacle PCTV HD 800i [11bd:0051] 59 -> DViCO FusionHDTV 5 PCI nano [18ac:d530] 60 -> Pinnacle Hybrid PCTV [12ab:1788] - 61 -> Winfast TV2000 XP Global [107d:6f18] + 61 -> Leadtek TV2000 XP Global [107d:6f18,107d:6618] 62 -> PowerColor RA330 [14f1:ea3d] 63 -> Geniatech X8000-MT DVBT [14f1:8852] 64 -> DViCO FusionHDTV DVB-T PRO [18ac:db30] @@ -78,3 +78,6 @@ 77 -> TBS 8910 DVB-S [8910:8888] 78 -> Prof 6200 DVB-S [b022:3022] 79 -> Terratec Cinergy HT PCI MKII [153b:1177] + 80 -> Hauppauge WinTV-IR Only [0070:9290] + 81 -> Leadtek WinFast DTV1800 Hybrid [107d:6654] + 82 -> WinFast DTV2000 H rev. J [107d:6f2b] diff --git a/Documentation/video4linux/CARDLIST.em28xx b/Documentation/video4linux/CARDLIST.em28xx index 78d0a6eed57..b13fcbd5d94 100644 --- a/Documentation/video4linux/CARDLIST.em28xx +++ b/Documentation/video4linux/CARDLIST.em28xx @@ -1,5 +1,5 @@ 0 -> Unknown EM2800 video grabber (em2800) [eb1a:2800] - 1 -> Unknown EM2750/28xx video grabber (em2820/em2840) [eb1a:2820,eb1a:2821,eb1a:2860,eb1a:2861,eb1a:2870,eb1a:2881,eb1a:2883] + 1 -> Unknown EM2750/28xx video grabber (em2820/em2840) [eb1a:2710,eb1a:2820,eb1a:2821,eb1a:2860,eb1a:2861,eb1a:2870,eb1a:2881,eb1a:2883] 2 -> Terratec Cinergy 250 USB (em2820/em2840) [0ccd:0036] 3 -> Pinnacle PCTV USB 2 (em2820/em2840) [2304:0208] 4 -> Hauppauge WinTV USB 2 (em2820/em2840) [2040:4200,2040:4201] @@ -7,7 +7,7 @@ 6 -> Terratec Cinergy 200 USB (em2800) 7 -> Leadtek Winfast USB II (em2800) [0413:6023] 8 -> Kworld USB2800 (em2800) - 9 -> Pinnacle Dazzle DVC 90/100/101/107 / Kaiser Baas Video to DVD maker (em2820/em2840) [1b80:e302,2304:0207,2304:021a] + 9 -> Pinnacle Dazzle DVC 90/100/101/107 / Kaiser Baas Video to DVD maker (em2820/em2840) [1b80:e302,1b80:e304,2304:0207,2304:021a] 10 -> Hauppauge WinTV HVR 900 (em2880) [2040:6500] 11 -> Terratec Hybrid XS (em2880) [0ccd:0042] 12 -> Kworld PVR TV 2800 RF (em2820/em2840) @@ -17,10 +17,10 @@ 16 -> Hauppauge WinTV HVR 950 (em2883) [2040:6513,2040:6517,2040:651b] 17 -> Pinnacle PCTV HD Pro Stick (em2880) [2304:0227] 18 -> Hauppauge WinTV HVR 900 (R2) (em2880) [2040:6502] - 19 -> PointNix Intra-Oral Camera (em2860) + 19 -> EM2860/SAA711X Reference Design (em2860) 20 -> AMD ATI TV Wonder HD 600 (em2880) [0438:b002] 21 -> eMPIA Technology, Inc. GrabBeeX+ Video Encoder (em2800) [eb1a:2801] - 22 -> Unknown EM2750/EM2751 webcam grabber (em2750) [eb1a:2750,eb1a:2751] + 22 -> EM2710/EM2750/EM2751 webcam grabber (em2750) [eb1a:2750,eb1a:2751] 23 -> Huaqi DLCW-130 (em2750) 24 -> D-Link DUB-T210 TV Tuner (em2820/em2840) [2001:f112] 25 -> Gadmei UTV310 (em2820/em2840) @@ -33,7 +33,7 @@ 34 -> Terratec Cinergy A Hybrid XS (em2860) [0ccd:004f] 35 -> Typhoon DVD Maker (em2860) 36 -> NetGMBH Cam (em2860) - 37 -> Gadmei UTV330 (em2860) + 37 -> Gadmei UTV330 (em2860) [eb1a:50a6] 38 -> Yakumo MovieMixer (em2861) 39 -> KWorld PVRTV 300U (em2861) [eb1a:e300] 40 -> Plextor ConvertX PX-TV100U (em2861) [093b:a005] @@ -61,3 +61,10 @@ 63 -> Kaiomy TVnPC U2 (em2860) [eb1a:e303] 64 -> Easy Cap Capture DC-60 (em2860) 65 -> IO-DATA GV-MVP/SZ (em2820/em2840) [04bb:0515] + 66 -> Empire dual TV (em2880) + 67 -> Terratec Grabby (em2860) [0ccd:0096] + 68 -> Terratec AV350 (em2860) [0ccd:0084] + 69 -> KWorld ATSC 315U HDTV TV Box (em2882) [eb1a:a313] + 70 -> Evga inDtube (em2882) + 71 -> Silvercrest Webcam 1.3mpix (em2820/em2840) + 72 -> Gadmei UTV330+ (em2861) diff --git a/Documentation/video4linux/CARDLIST.saa7134 b/Documentation/video4linux/CARDLIST.saa7134 index 6dacf282525..0ac4d254477 100644 --- a/Documentation/video4linux/CARDLIST.saa7134 +++ b/Documentation/video4linux/CARDLIST.saa7134 @@ -124,10 +124,10 @@ 123 -> Beholder BeholdTV 407 [0000:4070] 124 -> Beholder BeholdTV 407 FM [0000:4071] 125 -> Beholder BeholdTV 409 [0000:4090] -126 -> Beholder BeholdTV 505 FM/RDS [0000:5051,0000:505B,5ace:5050] -127 -> Beholder BeholdTV 507 FM/RDS / BeholdTV 509 FM [0000:5071,0000:507B,5ace:5070,5ace:5090] +126 -> Beholder BeholdTV 505 FM [5ace:5050] +127 -> Beholder BeholdTV 507 FM / BeholdTV 509 FM [5ace:5070,5ace:5090] 128 -> Beholder BeholdTV Columbus TVFM [0000:5201] -129 -> Beholder BeholdTV 607 / BeholdTV 609 [5ace:6070,5ace:6071,5ace:6072,5ace:6073,5ace:6090,5ace:6091,5ace:6092,5ace:6093] +129 -> Beholder BeholdTV 607 FM [5ace:6070] 130 -> Beholder BeholdTV M6 [5ace:6190] 131 -> Twinhan Hybrid DTV-DVB 3056 PCI [1822:0022] 132 -> Genius TVGO AM11MCE @@ -143,7 +143,7 @@ 142 -> Beholder BeholdTV H6 [5ace:6290] 143 -> Beholder BeholdTV M63 [5ace:6191] 144 -> Beholder BeholdTV M6 Extra [5ace:6193] -145 -> AVerMedia MiniPCI DVB-T Hybrid M103 [1461:f636] +145 -> AVerMedia MiniPCI DVB-T Hybrid M103 [1461:f636,1461:f736] 146 -> ASUSTeK P7131 Analog 147 -> Asus Tiger 3in1 [1043:4878] 148 -> Encore ENLTV-FM v5.3 [1a7f:2008] @@ -153,5 +153,21 @@ 152 -> Asus Tiger Rev:1.00 [1043:4857] 153 -> Kworld Plus TV Analog Lite PCI [17de:7128] 154 -> Avermedia AVerTV GO 007 FM Plus [1461:f31d] -155 -> Hauppauge WinTV-HVR1120 ATSC/QAM-Hybrid [0070:6706,0070:6708] -156 -> Hauppauge WinTV-HVR1110r3 [0070:6707,0070:6709,0070:670a] +155 -> Hauppauge WinTV-HVR1150 ATSC/QAM-Hybrid [0070:6706,0070:6708] +156 -> Hauppauge WinTV-HVR1120 DVB-T/Hybrid [0070:6707,0070:6709,0070:670a] +157 -> Avermedia AVerTV Studio 507UA [1461:a11b] +158 -> AVerMedia Cardbus TV/Radio (E501R) [1461:b7e9] +159 -> Beholder BeholdTV 505 RDS [0000:505B] +160 -> Beholder BeholdTV 507 RDS [0000:5071] +161 -> Beholder BeholdTV 507 RDS [0000:507B] +162 -> Beholder BeholdTV 607 FM [5ace:6071] +163 -> Beholder BeholdTV 609 FM [5ace:6090] +164 -> Beholder BeholdTV 609 FM [5ace:6091] +165 -> Beholder BeholdTV 607 RDS [5ace:6072] +166 -> Beholder BeholdTV 607 RDS [5ace:6073] +167 -> Beholder BeholdTV 609 RDS [5ace:6092] +168 -> Beholder BeholdTV 609 RDS [5ace:6093] +169 -> Compro VideoMate S350/S300 [185b:c900] +170 -> AverMedia AverTV Studio 505 [1461:a115] +171 -> Beholder BeholdTV X7 [5ace:7595] +172 -> RoverMedia TV Link Pro FM [19d1:0138] diff --git a/Documentation/video4linux/CARDLIST.tuner b/Documentation/video4linux/CARDLIST.tuner index 691d2f37dc5..ba9fa679e2d 100644 --- a/Documentation/video4linux/CARDLIST.tuner +++ b/Documentation/video4linux/CARDLIST.tuner @@ -76,3 +76,6 @@ tuner=75 - Philips TEA5761 FM Radio tuner=76 - Xceive 5000 tuner tuner=77 - TCL tuner MF02GIP-5N-E tuner=78 - Philips FMD1216MEX MK3 Hybrid Tuner +tuner=79 - Philips PAL/SECAM multi (FM1216 MK5) +tuner=80 - Philips FQ1216LME MK3 PAL/SECAM w/active loopthrough +tuner=81 - Partsnic (Daewoo) PTI-5NF05 diff --git a/Documentation/video4linux/CQcam.txt b/Documentation/video4linux/CQcam.txt index 04986efb731..d230878e473 100644 --- a/Documentation/video4linux/CQcam.txt +++ b/Documentation/video4linux/CQcam.txt @@ -18,8 +18,8 @@ Table of Contents 1.0 Introduction - The file ../drivers/char/c-qcam.c is a device driver for the -Logitech (nee Connectix) parallel port interface color CCD camera. + The file ../../drivers/media/video/c-qcam.c is a device driver for +the Logitech (nee Connectix) parallel port interface color CCD camera. This is a fairly inexpensive device for capturing images. Logitech does not currently provide information for developers, but many people have engineered several solutions for non-Microsoft use of the Color diff --git a/Documentation/video4linux/gspca.txt b/Documentation/video4linux/gspca.txt index 98529e03a46..4686e84dd80 100644 --- a/Documentation/video4linux/gspca.txt +++ b/Documentation/video4linux/gspca.txt @@ -44,7 +44,9 @@ zc3xx 0458:7007 Genius VideoCam V2 zc3xx 0458:700c Genius VideoCam V3 zc3xx 0458:700f Genius VideoCam Web V2 sonixj 0458:7025 Genius Eye 311Q +sn9c20x 0458:7029 Genius Look 320s sonixj 0458:702e Genius Slim 310 NB +sn9c20x 045e:00f4 LifeCam VX-6000 (SN9C20x + OV9650) sonixj 045e:00f5 MicroSoft VX3000 sonixj 045e:00f7 MicroSoft VX1000 ov519 045e:028c Micro$oft xbox cam @@ -138,6 +140,7 @@ spca500 04fc:7333 PalmPixDC85 sunplus 04fc:ffff Pure DigitalDakota spca501 0506:00df 3Com HomeConnect Lite sunplus 052b:1513 Megapix V4 +sunplus 052b:1803 MegaImage VI tv8532 0545:808b Veo Stingray tv8532 0545:8333 Veo Stingray sunplus 0546:3155 Polaroid PDC3070 @@ -163,10 +166,11 @@ sunplus 055f:c650 Mustek MDC5500Z zc3xx 055f:d003 Mustek WCam300A zc3xx 055f:d004 Mustek WCam300 AN conex 0572:0041 Creative Notebook cx11646 -ov519 05a9:0519 OmniVision +ov519 05a9:0519 OV519 Microphone ov519 05a9:0530 OmniVision -ov519 05a9:4519 OmniVision +ov519 05a9:4519 Webcam Classic ov519 05a9:8519 OmniVision +ov519 05a9:a518 D-Link DSB-C310 Webcam sunplus 05da:1018 Digital Dream Enigma 1.3 stk014 05e1:0893 Syntek DV4000 spca561 060b:a001 Maxell Compact Pc PM3 @@ -178,6 +182,8 @@ spca506 06e1:a190 ADS Instant VCD ov534 06f8:3002 Hercules Blog Webcam ov534 06f8:3003 Hercules Dualpix HD Weblog sonixj 06f8:3004 Hercules Classic Silver +sonixj 06f8:3008 Hercules Deluxe Optical Glass +pac7311 06f8:3009 Hercules Classic Link spca508 0733:0110 ViewQuest VQ110 spca508 0130:0130 Clone Digital Webcam 11043 spca501 0733:0401 Intel Create and Share @@ -209,6 +215,7 @@ sunplus 08ca:2050 Medion MD 41437 sunplus 08ca:2060 Aiptek PocketDV5300 tv8532 0923:010f ICM532 cams mars 093a:050f Mars-Semi Pc-Camera +mr97310a 093a:010f Sakar Digital no. 77379 pac207 093a:2460 Qtec Webcam 100 pac207 093a:2461 HP Webcam pac207 093a:2463 Philips SPC 220 NC @@ -230,8 +237,10 @@ pac7311 093a:2621 PAC731x pac7311 093a:2622 Genius Eye 312 pac7311 093a:2624 PAC7302 pac7311 093a:2626 Labtec 2200 +pac7311 093a:2629 Genious iSlim 300 pac7311 093a:262a Webcam 300k pac7311 093a:262c Philips SPC 230 NC +jeilinj 0979:0280 Sakar 57379 zc3xx 0ac8:0302 Z-star Vimicro zc0302 vc032x 0ac8:0321 Vimicro generic vc0321 vc032x 0ac8:0323 Vimicro Vc0323 @@ -242,6 +251,7 @@ zc3xx 0ac8:305b Z-star Vimicro zc0305b zc3xx 0ac8:307b Ldlc VC302+Ov7620 vc032x 0ac8:c001 Sony embedded vimicro vc032x 0ac8:c002 Sony embedded vimicro +vc032x 0ac8:c301 Samsung Q1 Ultra Premium spca508 0af9:0010 Hama USB Sightcam 100 spca508 0af9:0011 Hama USB Sightcam 100 sonixb 0c45:6001 Genius VideoCAM NB @@ -265,6 +275,11 @@ sonixj 0c45:60ec SN9C105+MO4000 sonixj 0c45:60fb Surfer NoName sonixj 0c45:60fc LG-LIC300 sonixj 0c45:60fe Microdia Audio +sonixj 0c45:6100 PC Camera (SN9C128) +sonixj 0c45:610a PC Camera (SN9C128) +sonixj 0c45:610b PC Camera (SN9C128) +sonixj 0c45:610c PC Camera (SN9C128) +sonixj 0c45:610e PC Camera (SN9C128) sonixj 0c45:6128 Microdia/Sonix SNP325 sonixj 0c45:612a Avant Camera sonixj 0c45:612c Typhoon Rasy Cam 1.3MPix @@ -274,6 +289,29 @@ sonixj 0c45:613a Microdia Sonix PC Camera sonixj 0c45:613b Surfer SN-206 sonixj 0c45:613c Sonix Pccam168 sonixj 0c45:6143 Sonix Pccam168 +sonixj 0c45:6148 Digitus DA-70811/ZSMC USB PC Camera ZS211/Microdia +sn9c20x 0c45:6240 PC Camera (SN9C201 + MT9M001) +sn9c20x 0c45:6242 PC Camera (SN9C201 + MT9M111) +sn9c20x 0c45:6248 PC Camera (SN9C201 + OV9655) +sn9c20x 0c45:624e PC Camera (SN9C201 + SOI968) +sn9c20x 0c45:624f PC Camera (SN9C201 + OV9650) +sn9c20x 0c45:6251 PC Camera (SN9C201 + OV9650) +sn9c20x 0c45:6253 PC Camera (SN9C201 + OV9650) +sn9c20x 0c45:6260 PC Camera (SN9C201 + OV7670) +sn9c20x 0c45:6270 PC Camera (SN9C201 + MT9V011/MT9V111/MT9V112) +sn9c20x 0c45:627b PC Camera (SN9C201 + OV7660) +sn9c20x 0c45:627c PC Camera (SN9C201 + HV7131R) +sn9c20x 0c45:627f PC Camera (SN9C201 + OV9650) +sn9c20x 0c45:6280 PC Camera (SN9C202 + MT9M001) +sn9c20x 0c45:6282 PC Camera (SN9C202 + MT9M111) +sn9c20x 0c45:6288 PC Camera (SN9C202 + OV9655) +sn9c20x 0c45:628e PC Camera (SN9C202 + SOI968) +sn9c20x 0c45:628f PC Camera (SN9C202 + OV9650) +sn9c20x 0c45:62a0 PC Camera (SN9C202 + OV7670) +sn9c20x 0c45:62b0 PC Camera (SN9C202 + MT9V011/MT9V111/MT9V112) +sn9c20x 0c45:62b3 PC Camera (SN9C202 + OV9655) +sn9c20x 0c45:62bb PC Camera (SN9C202 + OV7660) +sn9c20x 0c45:62bc PC Camera (SN9C202 + HV7131R) sunplus 0d64:0303 Sunplus FashionCam DXG etoms 102c:6151 Qcam Sangha CIF etoms 102c:6251 Qcam xxxxxx VGA @@ -282,6 +320,7 @@ spca561 10fd:7e50 FlyCam Usb 100 zc3xx 10fd:8050 Typhoon Webshot II USB 300k ov534 1415:2000 Sony HD Eye for PS3 (SLEH 00201) pac207 145f:013a Trust WB-1300N +sn9c20x 145f:013d Trust WB-3600R vc032x 15b8:6001 HP 2.0 Megapixel vc032x 15b8:6002 HP 2.0 Megapixel rz406aa spca501 1776:501c Arowana 300K CMOS Camera @@ -292,4 +331,11 @@ spca500 2899:012c Toptro Industrial spca508 8086:0110 Intel Easy PC Camera spca500 8086:0630 Intel Pocket PC Camera spca506 99fa:8988 Grandtec V.cap +sn9c20x a168:0610 Dino-Lite Digital Microscope (SN9C201 + HV7131R) +sn9c20x a168:0611 Dino-Lite Digital Microscope (SN9C201 + HV7131R) +sn9c20x a168:0613 Dino-Lite Digital Microscope (SN9C201 + HV7131R) +sn9c20x a168:0618 Dino-Lite Digital Microscope (SN9C201 + HV7131R) +sn9c20x a168:0614 Dino-Lite Digital Microscope (SN9C201 + MT9M111) +sn9c20x a168:0615 Dino-Lite Digital Microscope (SN9C201 + MT9M111) +sn9c20x a168:0617 Dino-Lite Digital Microscope (SN9C201 + MT9M111) spca561 abcd:cdee Petcam diff --git a/Documentation/video4linux/pxa_camera.txt b/Documentation/video4linux/pxa_camera.txt index b1137f9a53e..4f6d0ca0195 100644 --- a/Documentation/video4linux/pxa_camera.txt +++ b/Documentation/video4linux/pxa_camera.txt @@ -26,6 +26,55 @@ Global video workflow Once the last buffer is filled in, the QCI interface stops. + c) Capture global finite state machine schema + + +----+ +---+ +----+ + | DQ | | Q | | DQ | + | v | v | v + +-----------+ +------------------------+ + | STOP | | Wait for capture start | + +-----------+ Q +------------------------+ ++-> | QCI: stop | ------------------> | QCI: run | <------------+ +| | DMA: stop | | DMA: stop | | +| +-----------+ +-----> +------------------------+ | +| / | | +| / +---+ +----+ | | +|capture list empty / | Q | | DQ | | QCI Irq EOF | +| / | v | v v | +| +--------------------+ +----------------------+ | +| | DMA hotlink missed | | Capture running | | +| +--------------------+ +----------------------+ | +| | QCI: run | +-----> | QCI: run | <-+ | +| | DMA: stop | / | DMA: run | | | +| +--------------------+ / +----------------------+ | Other | +| ^ /DMA still | | channels | +| | capture list / running | DMA Irq End | not | +| | not empty / | | finished | +| | / v | yet | +| +----------------------+ +----------------------+ | | +| | Videobuf released | | Channel completed | | | +| +----------------------+ +----------------------+ | | ++-- | QCI: run | | QCI: run | --+ | + | DMA: run | | DMA: run | | + +----------------------+ +----------------------+ | + ^ / | | + | no overrun / | overrun | + | / v | + +--------------------+ / +----------------------+ | + | Frame completed | / | Frame overran | | + +--------------------+ <-----+ +----------------------+ restart frame | + | QCI: run | | QCI: stop | --------------+ + | DMA: run | | DMA: stop | + +--------------------+ +----------------------+ + + Legend: - each box is a FSM state + - each arrow is the condition to transition to another state + - an arrow with a comment is a mandatory transition (no condition) + - arrow "Q" means : a buffer was enqueued + - arrow "DQ" means : a buffer was dequeued + - "QCI: stop" means the QCI interface is not enabled + - "DMA: stop" means all 3 DMA channels are stopped + - "DMA: run" means at least 1 DMA channel is still running DMA usage --------- diff --git a/Documentation/video4linux/si4713.txt b/Documentation/video4linux/si4713.txt new file mode 100644 index 00000000000..25abdb78209 --- /dev/null +++ b/Documentation/video4linux/si4713.txt @@ -0,0 +1,176 @@ +Driver for I2C radios for the Silicon Labs Si4713 FM Radio Transmitters + +Copyright (c) 2009 Nokia Corporation +Contact: Eduardo Valentin <eduardo.valentin@nokia.com> + + +Information about the Device +============================ +This chip is a Silicon Labs product. It is a I2C device, currently on 0x63 address. +Basically, it has transmission and signal noise level measurement features. + +The Si4713 integrates transmit functions for FM broadcast stereo transmission. +The chip also allows integrated receive power scanning to identify low signal +power FM channels. + +The chip is programmed using commands and responses. There are also several +properties which can change the behavior of this chip. + +Users must comply with local regulations on radio frequency (RF) transmission. + +Device driver description +========================= +There are two modules to handle this device. One is a I2C device driver +and the other is a platform driver. + +The I2C device driver exports a v4l2-subdev interface to the kernel. +All properties can also be accessed by v4l2 extended controls interface, by +using the v4l2-subdev calls (g_ext_ctrls, s_ext_ctrls). + +The platform device driver exports a v4l2 radio device interface to user land. +So, it uses the I2C device driver as a sub device in order to send the user +commands to the actual device. Basically it is a wrapper to the I2C device driver. + +Applications can use v4l2 radio API to specify frequency of operation, mute state, +etc. But mostly of its properties will be present in the extended controls. + +When the v4l2 mute property is set to 1 (true), the driver will turn the chip off. + +Properties description +====================== + +The properties can be accessed using v4l2 extended controls. +Here is an output from v4l2-ctl util: +/ # v4l2-ctl -d /dev/radio0 --all -L +Driver Info: + Driver name : radio-si4713 + Card type : Silicon Labs Si4713 Modulator + Bus info : + Driver version: 0 + Capabilities : 0x00080800 + RDS Output + Modulator +Audio output: 0 (FM Modulator Audio Out) +Frequency: 1408000 (88.000000 MHz) +Video Standard = 0x00000000 +Modulator: + Name : FM Modulator + Capabilities : 62.5 Hz stereo rds + Frequency range : 76.0 MHz - 108.0 MHz + Subchannel modulation: stereo+rds + +User Controls + + mute (bool) : default=1 value=0 + +FM Radio Modulator Controls + + rds_signal_deviation (int) : min=0 max=90000 step=10 default=200 value=200 flags=slider + rds_program_id (int) : min=0 max=65535 step=1 default=0 value=0 + rds_program_type (int) : min=0 max=31 step=1 default=0 value=0 + rds_ps_name (str) : min=0 max=96 step=8 value='si4713 ' + rds_radio_text (str) : min=0 max=384 step=32 value='' + audio_limiter_feature_enabled (bool) : default=1 value=1 + audio_limiter_release_time (int) : min=250 max=102390 step=50 default=5010 value=5010 flags=slider + audio_limiter_deviation (int) : min=0 max=90000 step=10 default=66250 value=66250 flags=slider +audio_compression_feature_enabl (bool) : default=1 value=1 + audio_compression_gain (int) : min=0 max=20 step=1 default=15 value=15 flags=slider + audio_compression_threshold (int) : min=-40 max=0 step=1 default=-40 value=-40 flags=slider + audio_compression_attack_time (int) : min=0 max=5000 step=500 default=0 value=0 flags=slider + audio_compression_release_time (int) : min=100000 max=1000000 step=100000 default=1000000 value=1000000 flags=slider + pilot_tone_feature_enabled (bool) : default=1 value=1 + pilot_tone_deviation (int) : min=0 max=90000 step=10 default=6750 value=6750 flags=slider + pilot_tone_frequency (int) : min=0 max=19000 step=1 default=19000 value=19000 flags=slider + pre_emphasis_settings (menu) : min=0 max=2 default=1 value=1 + tune_power_level (int) : min=0 max=120 step=1 default=88 value=88 flags=slider + tune_antenna_capacitor (int) : min=0 max=191 step=1 default=0 value=110 flags=slider +/ # + +Here is a summary of them: + +* Pilot is an audible tone sent by the device. + +pilot_frequency - Configures the frequency of the stereo pilot tone. +pilot_deviation - Configures pilot tone frequency deviation level. +pilot_enabled - Enables or disables the pilot tone feature. + +* The si4713 device is capable of applying audio compression to the transmitted signal. + +acomp_enabled - Enables or disables the audio dynamic range control feature. +acomp_gain - Sets the gain for audio dynamic range control. +acomp_threshold - Sets the threshold level for audio dynamic range control. +acomp_attack_time - Sets the attack time for audio dynamic range control. +acomp_release_time - Sets the release time for audio dynamic range control. + +* Limiter setups audio deviation limiter feature. Once a over deviation occurs, +it is possible to adjust the front-end gain of the audio input and always +prevent over deviation. + +limiter_enabled - Enables or disables the limiter feature. +limiter_deviation - Configures audio frequency deviation level. +limiter_release_time - Sets the limiter release time. + +* Tuning power + +power_level - Sets the output power level for signal transmission. +antenna_capacitor - This selects the value of antenna tuning capacitor manually +or automatically if set to zero. + +* RDS related + +rds_ps_name - Sets the RDS ps name field for transmission. +rds_radio_text - Sets the RDS radio text for transmission. +rds_pi - Sets the RDS PI field for transmission. +rds_pty - Sets the RDS PTY field for transmission. + +* Region related + +preemphasis - sets the preemphasis to be applied for transmission. + +RNL +=== + +This device also has an interface to measure received noise level. To do that, you should +ioctl the device node. Here is an code of example: + +int main (int argc, char *argv[]) +{ + struct si4713_rnl rnl; + int fd = open("/dev/radio0", O_RDWR); + int rval; + + if (argc < 2) + return -EINVAL; + + if (fd < 0) + return fd; + + sscanf(argv[1], "%d", &rnl.frequency); + + rval = ioctl(fd, SI4713_IOC_MEASURE_RNL, &rnl); + if (rval < 0) + return rval; + + printf("received noise level: %d\n", rnl.rnl); + + close(fd); +} + +The struct si4713_rnl and SI4713_IOC_MEASURE_RNL are defined under +include/media/si4713.h. + +Stereo/Mono and RDS subchannels +=============================== + +The device can also be configured using the available sub channels for +transmission. To do that use S/G_MODULATOR ioctl and configure txsubchans properly. +Refer to v4l2-spec for proper use of this ioctl. + +Testing +======= +Testing is usually done with v4l2-ctl utility for managing FM tuner cards. +The tool can be found in v4l-dvb repository under v4l2-apps/util directory. + +Example for setting rds ps name: +# v4l2-ctl -d /dev/radio0 --set-ctrl=rds_ps_name="Dummy" + diff --git a/Documentation/video4linux/v4l2-framework.txt b/Documentation/video4linux/v4l2-framework.txt index 854808b67fa..ba4706afc5f 100644 --- a/Documentation/video4linux/v4l2-framework.txt +++ b/Documentation/video4linux/v4l2-framework.txt @@ -89,6 +89,11 @@ from dev (driver name followed by the bus_id, to be precise). If you set it up before calling v4l2_device_register then it will be untouched. If dev is NULL, then you *must* setup v4l2_dev->name before calling v4l2_device_register. +You can use v4l2_device_set_name() to set the name based on a driver name and +a driver-global atomic_t instance. This will generate names like ivtv0, ivtv1, +etc. If the name ends with a digit, then it will insert a dash: cx18-0, +cx18-1, etc. This function returns the instance number. + The first 'dev' argument is normally the struct device pointer of a pci_dev, usb_interface or platform_device. It is rare for dev to be NULL, but it happens with ISA devices or when one device creates multiple PCI devices, thus making @@ -385,6 +390,30 @@ later date. It differs between i2c drivers and as such can be confusing. To see which chip variants are supported you can look in the i2c driver code for the i2c_device_id table. This lists all the possibilities. +There are two more helper functions: + +v4l2_i2c_new_subdev_cfg: this function adds new irq and platform_data +arguments and has both 'addr' and 'probed_addrs' arguments: if addr is not +0 then that will be used (non-probing variant), otherwise the probed_addrs +are probed. + +For example: this will probe for address 0x10: + +struct v4l2_subdev *sd = v4l2_i2c_new_subdev_cfg(v4l2_dev, adapter, + "module_foo", "chipid", 0, NULL, 0, I2C_ADDRS(0x10)); + +v4l2_i2c_new_subdev_board uses an i2c_board_info struct which is passed +to the i2c driver and replaces the irq, platform_data and addr arguments. + +If the subdev supports the s_config core ops, then that op is called with +the irq and platform_data arguments after the subdev was setup. The older +v4l2_i2c_new_(probed_)subdev functions will call s_config as well, but with +irq set to 0 and platform_data set to NULL. + +Note that in the next kernel release the functions v4l2_i2c_new_subdev, +v4l2_i2c_new_probed_subdev and v4l2_i2c_new_probed_subdev_addr will all be +replaced by a single v4l2_i2c_new_subdev that is identical to +v4l2_i2c_new_subdev_cfg but without the irq and platform_data arguments. struct video_device ------------------- diff --git a/Documentation/vm/Makefile b/Documentation/vm/Makefile index 6f562f778b2..5bd269b3731 100644 --- a/Documentation/vm/Makefile +++ b/Documentation/vm/Makefile @@ -2,7 +2,7 @@ obj- := dummy.o # List of programs to build -hostprogs-y := slabinfo +hostprogs-y := slabinfo page-types # Tell kbuild to always build the programs always := $(hostprogs-y) diff --git a/Documentation/vm/balance b/Documentation/vm/balance index bd3d31bc491..c46e68cf934 100644 --- a/Documentation/vm/balance +++ b/Documentation/vm/balance @@ -75,15 +75,15 @@ Page stealing from process memory and shm is done if stealing the page would alleviate memory pressure on any zone in the page's node that has fallen below its watermark. -pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are -per-zone fields, used to determine when a zone needs to be balanced. When -the number of pages falls below pages_min, the hysteric field low_on_memory -gets set. This stays set till the number of free pages becomes pages_high. -When low_on_memory is set, page allocation requests will try to free some -pages in the zone (providing GFP_WAIT is set in the request). Orthogonal -to this, is the decision to poke kswapd to free some zone pages. That -decision is not hysteresis based, and is done when the number of free -pages is below pages_low; in which case zone_wake_kswapd is also set. +watemark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These +are per-zone fields, used to determine when a zone needs to be balanced. When +the number of pages falls below watermark[WMARK_MIN], the hysteric field +low_on_memory gets set. This stays set till the number of free pages becomes +watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will +try to free some pages in the zone (providing GFP_WAIT is set in the request). +Orthogonal to this, is the decision to poke kswapd to free some zone pages. +That decision is not hysteresis based, and is done when the number of free +pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set. (Good) Ideas that I have heard: diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c new file mode 100644 index 00000000000..0833f44ba16 --- /dev/null +++ b/Documentation/vm/page-types.c @@ -0,0 +1,698 @@ +/* + * page-types: Tool for querying page flags + * + * Copyright (C) 2009 Intel corporation + * Copyright (C) 2009 Wu Fengguang <fengguang.wu@intel.com> + */ + +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <stdint.h> +#include <stdarg.h> +#include <string.h> +#include <getopt.h> +#include <limits.h> +#include <sys/types.h> +#include <sys/errno.h> +#include <sys/fcntl.h> + + +/* + * kernel page flags + */ + +#define KPF_BYTES 8 +#define PROC_KPAGEFLAGS "/proc/kpageflags" + +/* copied from kpageflags_read() */ +#define KPF_LOCKED 0 +#define KPF_ERROR 1 +#define KPF_REFERENCED 2 +#define KPF_UPTODATE 3 +#define KPF_DIRTY 4 +#define KPF_LRU 5 +#define KPF_ACTIVE 6 +#define KPF_SLAB 7 +#define KPF_WRITEBACK 8 +#define KPF_RECLAIM 9 +#define KPF_BUDDY 10 + +/* [11-20] new additions in 2.6.31 */ +#define KPF_MMAP 11 +#define KPF_ANON 12 +#define KPF_SWAPCACHE 13 +#define KPF_SWAPBACKED 14 +#define KPF_COMPOUND_HEAD 15 +#define KPF_COMPOUND_TAIL 16 +#define KPF_HUGE 17 +#define KPF_UNEVICTABLE 18 +#define KPF_NOPAGE 20 + +/* [32-] kernel hacking assistances */ +#define KPF_RESERVED 32 +#define KPF_MLOCKED 33 +#define KPF_MAPPEDTODISK 34 +#define KPF_PRIVATE 35 +#define KPF_PRIVATE_2 36 +#define KPF_OWNER_PRIVATE 37 +#define KPF_ARCH 38 +#define KPF_UNCACHED 39 + +/* [48-] take some arbitrary free slots for expanding overloaded flags + * not part of kernel API + */ +#define KPF_READAHEAD 48 +#define KPF_SLOB_FREE 49 +#define KPF_SLUB_FROZEN 50 +#define KPF_SLUB_DEBUG 51 + +#define KPF_ALL_BITS ((uint64_t)~0ULL) +#define KPF_HACKERS_BITS (0xffffULL << 32) +#define KPF_OVERLOADED_BITS (0xffffULL << 48) +#define BIT(name) (1ULL << KPF_##name) +#define BITS_COMPOUND (BIT(COMPOUND_HEAD) | BIT(COMPOUND_TAIL)) + +static char *page_flag_names[] = { + [KPF_LOCKED] = "L:locked", + [KPF_ERROR] = "E:error", + [KPF_REFERENCED] = "R:referenced", + [KPF_UPTODATE] = "U:uptodate", + [KPF_DIRTY] = "D:dirty", + [KPF_LRU] = "l:lru", + [KPF_ACTIVE] = "A:active", + [KPF_SLAB] = "S:slab", + [KPF_WRITEBACK] = "W:writeback", + [KPF_RECLAIM] = "I:reclaim", + [KPF_BUDDY] = "B:buddy", + + [KPF_MMAP] = "M:mmap", + [KPF_ANON] = "a:anonymous", + [KPF_SWAPCACHE] = "s:swapcache", + [KPF_SWAPBACKED] = "b:swapbacked", + [KPF_COMPOUND_HEAD] = "H:compound_head", + [KPF_COMPOUND_TAIL] = "T:compound_tail", + [KPF_HUGE] = "G:huge", + [KPF_UNEVICTABLE] = "u:unevictable", + [KPF_NOPAGE] = "n:nopage", + + [KPF_RESERVED] = "r:reserved", + [KPF_MLOCKED] = "m:mlocked", + [KPF_MAPPEDTODISK] = "d:mappedtodisk", + [KPF_PRIVATE] = "P:private", + [KPF_PRIVATE_2] = "p:private_2", + [KPF_OWNER_PRIVATE] = "O:owner_private", + [KPF_ARCH] = "h:arch", + [KPF_UNCACHED] = "c:uncached", + + [KPF_READAHEAD] = "I:readahead", + [KPF_SLOB_FREE] = "P:slob_free", + [KPF_SLUB_FROZEN] = "A:slub_frozen", + [KPF_SLUB_DEBUG] = "E:slub_debug", +}; + + +/* + * data structures + */ + +static int opt_raw; /* for kernel developers */ +static int opt_list; /* list pages (in ranges) */ +static int opt_no_summary; /* don't show summary */ +static pid_t opt_pid; /* process to walk */ + +#define MAX_ADDR_RANGES 1024 +static int nr_addr_ranges; +static unsigned long opt_offset[MAX_ADDR_RANGES]; +static unsigned long opt_size[MAX_ADDR_RANGES]; + +#define MAX_BIT_FILTERS 64 +static int nr_bit_filters; +static uint64_t opt_mask[MAX_BIT_FILTERS]; +static uint64_t opt_bits[MAX_BIT_FILTERS]; + +static int page_size; + +#define PAGES_BATCH (64 << 10) /* 64k pages */ +static int kpageflags_fd; +static uint64_t kpageflags_buf[KPF_BYTES * PAGES_BATCH]; + +#define HASH_SHIFT 13 +#define HASH_SIZE (1 << HASH_SHIFT) +#define HASH_MASK (HASH_SIZE - 1) +#define HASH_KEY(flags) (flags & HASH_MASK) + +static unsigned long total_pages; +static unsigned long nr_pages[HASH_SIZE]; +static uint64_t page_flags[HASH_SIZE]; + + +/* + * helper functions + */ + +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) + +#define min_t(type, x, y) ({ \ + type __min1 = (x); \ + type __min2 = (y); \ + __min1 < __min2 ? __min1 : __min2; }) + +unsigned long pages2mb(unsigned long pages) +{ + return (pages * page_size) >> 20; +} + +void fatal(const char *x, ...) +{ + va_list ap; + + va_start(ap, x); + vfprintf(stderr, x, ap); + va_end(ap); + exit(EXIT_FAILURE); +} + + +/* + * page flag names + */ + +char *page_flag_name(uint64_t flags) +{ + static char buf[65]; + int present; + int i, j; + + for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) { + present = (flags >> i) & 1; + if (!page_flag_names[i]) { + if (present) + fatal("unkown flag bit %d\n", i); + continue; + } + buf[j++] = present ? page_flag_names[i][0] : '_'; + } + + return buf; +} + +char *page_flag_longname(uint64_t flags) +{ + static char buf[1024]; + int i, n; + + for (i = 0, n = 0; i < ARRAY_SIZE(page_flag_names); i++) { + if (!page_flag_names[i]) + continue; + if ((flags >> i) & 1) + n += snprintf(buf + n, sizeof(buf) - n, "%s,", + page_flag_names[i] + 2); + } + if (n) + n--; + buf[n] = '\0'; + + return buf; +} + + +/* + * page list and summary + */ + +void show_page_range(unsigned long offset, uint64_t flags) +{ + static uint64_t flags0; + static unsigned long index; + static unsigned long count; + + if (flags == flags0 && offset == index + count) { + count++; + return; + } + + if (count) + printf("%lu\t%lu\t%s\n", + index, count, page_flag_name(flags0)); + + flags0 = flags; + index = offset; + count = 1; +} + +void show_page(unsigned long offset, uint64_t flags) +{ + printf("%lu\t%s\n", offset, page_flag_name(flags)); +} + +void show_summary(void) +{ + int i; + + printf(" flags\tpage-count MB" + " symbolic-flags\t\t\tlong-symbolic-flags\n"); + + for (i = 0; i < ARRAY_SIZE(nr_pages); i++) { + if (nr_pages[i]) + printf("0x%016llx\t%10lu %8lu %s\t%s\n", + (unsigned long long)page_flags[i], + nr_pages[i], + pages2mb(nr_pages[i]), + page_flag_name(page_flags[i]), + page_flag_longname(page_flags[i])); + } + + printf(" total\t%10lu %8lu\n", + total_pages, pages2mb(total_pages)); +} + + +/* + * page flag filters + */ + +int bit_mask_ok(uint64_t flags) +{ + int i; + + for (i = 0; i < nr_bit_filters; i++) { + if (opt_bits[i] == KPF_ALL_BITS) { + if ((flags & opt_mask[i]) == 0) + return 0; + } else { + if ((flags & opt_mask[i]) != opt_bits[i]) + return 0; + } + } + + return 1; +} + +uint64_t expand_overloaded_flags(uint64_t flags) +{ + /* SLOB/SLUB overload several page flags */ + if (flags & BIT(SLAB)) { + if (flags & BIT(PRIVATE)) + flags ^= BIT(PRIVATE) | BIT(SLOB_FREE); + if (flags & BIT(ACTIVE)) + flags ^= BIT(ACTIVE) | BIT(SLUB_FROZEN); + if (flags & BIT(ERROR)) + flags ^= BIT(ERROR) | BIT(SLUB_DEBUG); + } + + /* PG_reclaim is overloaded as PG_readahead in the read path */ + if ((flags & (BIT(RECLAIM) | BIT(WRITEBACK))) == BIT(RECLAIM)) + flags ^= BIT(RECLAIM) | BIT(READAHEAD); + + return flags; +} + +uint64_t well_known_flags(uint64_t flags) +{ + /* hide flags intended only for kernel hacker */ + flags &= ~KPF_HACKERS_BITS; + + /* hide non-hugeTLB compound pages */ + if ((flags & BITS_COMPOUND) && !(flags & BIT(HUGE))) + flags &= ~BITS_COMPOUND; + + return flags; +} + + +/* + * page frame walker + */ + +int hash_slot(uint64_t flags) +{ + int k = HASH_KEY(flags); + int i; + + /* Explicitly reserve slot 0 for flags 0: the following logic + * cannot distinguish an unoccupied slot from slot (flags==0). + */ + if (flags == 0) + return 0; + + /* search through the remaining (HASH_SIZE-1) slots */ + for (i = 1; i < ARRAY_SIZE(page_flags); i++, k++) { + if (!k || k >= ARRAY_SIZE(page_flags)) + k = 1; + if (page_flags[k] == 0) { + page_flags[k] = flags; + return k; + } + if (page_flags[k] == flags) + return k; + } + + fatal("hash table full: bump up HASH_SHIFT?\n"); + exit(EXIT_FAILURE); +} + +void add_page(unsigned long offset, uint64_t flags) +{ + flags = expand_overloaded_flags(flags); + + if (!opt_raw) + flags = well_known_flags(flags); + + if (!bit_mask_ok(flags)) + return; + + if (opt_list == 1) + show_page_range(offset, flags); + else if (opt_list == 2) + show_page(offset, flags); + + nr_pages[hash_slot(flags)]++; + total_pages++; +} + +void walk_pfn(unsigned long index, unsigned long count) +{ + unsigned long batch; + unsigned long n; + unsigned long i; + + if (index > ULONG_MAX / KPF_BYTES) + fatal("index overflow: %lu\n", index); + + lseek(kpageflags_fd, index * KPF_BYTES, SEEK_SET); + + while (count) { + batch = min_t(unsigned long, count, PAGES_BATCH); + n = read(kpageflags_fd, kpageflags_buf, batch * KPF_BYTES); + if (n == 0) + break; + if (n < 0) { + perror(PROC_KPAGEFLAGS); + exit(EXIT_FAILURE); + } + + if (n % KPF_BYTES != 0) + fatal("partial read: %lu bytes\n", n); + n = n / KPF_BYTES; + + for (i = 0; i < n; i++) + add_page(index + i, kpageflags_buf[i]); + + index += batch; + count -= batch; + } +} + +void walk_addr_ranges(void) +{ + int i; + + kpageflags_fd = open(PROC_KPAGEFLAGS, O_RDONLY); + if (kpageflags_fd < 0) { + perror(PROC_KPAGEFLAGS); + exit(EXIT_FAILURE); + } + + if (!nr_addr_ranges) + walk_pfn(0, ULONG_MAX); + + for (i = 0; i < nr_addr_ranges; i++) + walk_pfn(opt_offset[i], opt_size[i]); + + close(kpageflags_fd); +} + + +/* + * user interface + */ + +const char *page_flag_type(uint64_t flag) +{ + if (flag & KPF_HACKERS_BITS) + return "(r)"; + if (flag & KPF_OVERLOADED_BITS) + return "(o)"; + return " "; +} + +void usage(void) +{ + int i, j; + + printf( +"page-types [options]\n" +" -r|--raw Raw mode, for kernel developers\n" +" -a|--addr addr-spec Walk a range of pages\n" +" -b|--bits bits-spec Walk pages with specified bits\n" +#if 0 /* planned features */ +" -p|--pid pid Walk process address space\n" +" -f|--file filename Walk file address space\n" +#endif +" -l|--list Show page details in ranges\n" +" -L|--list-each Show page details one by one\n" +" -N|--no-summary Don't show summay info\n" +" -h|--help Show this usage message\n" +"addr-spec:\n" +" N one page at offset N (unit: pages)\n" +" N+M pages range from N to N+M-1\n" +" N,M pages range from N to M-1\n" +" N, pages range from N to end\n" +" ,M pages range from 0 to M\n" +"bits-spec:\n" +" bit1,bit2 (flags & (bit1|bit2)) != 0\n" +" bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1\n" +" bit1,~bit2 (flags & (bit1|bit2)) == bit1\n" +" =bit1,bit2 flags == (bit1|bit2)\n" +"bit-names:\n" + ); + + for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) { + if (!page_flag_names[i]) + continue; + printf("%16s%s", page_flag_names[i] + 2, + page_flag_type(1ULL << i)); + if (++j > 3) { + j = 0; + putchar('\n'); + } + } + printf("\n " + "(r) raw mode bits (o) overloaded bits\n"); +} + +unsigned long long parse_number(const char *str) +{ + unsigned long long n; + + n = strtoll(str, NULL, 0); + + if (n == 0 && str[0] != '0') + fatal("invalid name or number: %s\n", str); + + return n; +} + +void parse_pid(const char *str) +{ + opt_pid = parse_number(str); +} + +void parse_file(const char *name) +{ +} + +void add_addr_range(unsigned long offset, unsigned long size) +{ + if (nr_addr_ranges >= MAX_ADDR_RANGES) + fatal("too much addr ranges\n"); + + opt_offset[nr_addr_ranges] = offset; + opt_size[nr_addr_ranges] = size; + nr_addr_ranges++; +} + +void parse_addr_range(const char *optarg) +{ + unsigned long offset; + unsigned long size; + char *p; + + p = strchr(optarg, ','); + if (!p) + p = strchr(optarg, '+'); + + if (p == optarg) { + offset = 0; + size = parse_number(p + 1); + } else if (p) { + offset = parse_number(optarg); + if (p[1] == '\0') + size = ULONG_MAX; + else { + size = parse_number(p + 1); + if (*p == ',') { + if (size < offset) + fatal("invalid range: %lu,%lu\n", + offset, size); + size -= offset; + } + } + } else { + offset = parse_number(optarg); + size = 1; + } + + add_addr_range(offset, size); +} + +void add_bits_filter(uint64_t mask, uint64_t bits) +{ + if (nr_bit_filters >= MAX_BIT_FILTERS) + fatal("too much bit filters\n"); + + opt_mask[nr_bit_filters] = mask; + opt_bits[nr_bit_filters] = bits; + nr_bit_filters++; +} + +uint64_t parse_flag_name(const char *str, int len) +{ + int i; + + if (!*str || !len) + return 0; + + if (len <= 8 && !strncmp(str, "compound", len)) + return BITS_COMPOUND; + + for (i = 0; i < ARRAY_SIZE(page_flag_names); i++) { + if (!page_flag_names[i]) + continue; + if (!strncmp(str, page_flag_names[i] + 2, len)) + return 1ULL << i; + } + + return parse_number(str); +} + +uint64_t parse_flag_names(const char *str, int all) +{ + const char *p = str; + uint64_t flags = 0; + + while (1) { + if (*p == ',' || *p == '=' || *p == '\0') { + if ((*str != '~') || (*str == '~' && all && *++str)) + flags |= parse_flag_name(str, p - str); + if (*p != ',') + break; + str = p + 1; + } + p++; + } + + return flags; +} + +void parse_bits_mask(const char *optarg) +{ + uint64_t mask; + uint64_t bits; + const char *p; + + p = strchr(optarg, '='); + if (p == optarg) { + mask = KPF_ALL_BITS; + bits = parse_flag_names(p + 1, 0); + } else if (p) { + mask = parse_flag_names(optarg, 0); + bits = parse_flag_names(p + 1, 0); + } else if (strchr(optarg, '~')) { + mask = parse_flag_names(optarg, 1); + bits = parse_flag_names(optarg, 0); + } else { + mask = parse_flag_names(optarg, 0); + bits = KPF_ALL_BITS; + } + + add_bits_filter(mask, bits); +} + + +struct option opts[] = { + { "raw" , 0, NULL, 'r' }, + { "pid" , 1, NULL, 'p' }, + { "file" , 1, NULL, 'f' }, + { "addr" , 1, NULL, 'a' }, + { "bits" , 1, NULL, 'b' }, + { "list" , 0, NULL, 'l' }, + { "list-each" , 0, NULL, 'L' }, + { "no-summary", 0, NULL, 'N' }, + { "help" , 0, NULL, 'h' }, + { NULL , 0, NULL, 0 } +}; + +int main(int argc, char *argv[]) +{ + int c; + + page_size = getpagesize(); + + while ((c = getopt_long(argc, argv, + "rp:f:a:b:lLNh", opts, NULL)) != -1) { + switch (c) { + case 'r': + opt_raw = 1; + break; + case 'p': + parse_pid(optarg); + break; + case 'f': + parse_file(optarg); + break; + case 'a': + parse_addr_range(optarg); + break; + case 'b': + parse_bits_mask(optarg); + break; + case 'l': + opt_list = 1; + break; + case 'L': + opt_list = 2; + break; + case 'N': + opt_no_summary = 1; + break; + case 'h': + usage(); + exit(0); + default: + usage(); + exit(1); + } + } + + if (opt_list == 1) + printf("offset\tcount\tflags\n"); + if (opt_list == 2) + printf("offset\tflags\n"); + + walk_addr_ranges(); + + if (opt_list == 1) + show_page_range(0, 0); /* drain the buffer */ + + if (opt_no_summary) + return 0; + + if (opt_list) + printf("\n\n"); + + show_summary(); + + return 0; +} diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index ce72c0fe617..600a304a828 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -12,9 +12,9 @@ There are three components to pagemap: value for each virtual page, containing the following data (from fs/proc/task_mmu.c, above pagemap_read): - * Bits 0-55 page frame number (PFN) if present + * Bits 0-54 page frame number (PFN) if present * Bits 0-4 swap type if swapped - * Bits 5-55 swap offset if swapped + * Bits 5-54 swap offset if swapped * Bits 55-60 page shift (page size = 1<<page shift) * Bit 61 reserved for future use * Bit 62 page swapped @@ -36,7 +36,7 @@ There are three components to pagemap: * /proc/kpageflags. This file contains a 64-bit set of flags for each page, indexed by PFN. - The flags are (from fs/proc/proc_misc, above kpageflags_read): + The flags are (from fs/proc/page.c, above kpageflags_read): 0. LOCKED 1. ERROR @@ -49,6 +49,68 @@ There are three components to pagemap: 8. WRITEBACK 9. RECLAIM 10. BUDDY + 11. MMAP + 12. ANON + 13. SWAPCACHE + 14. SWAPBACKED + 15. COMPOUND_HEAD + 16. COMPOUND_TAIL + 16. HUGE + 18. UNEVICTABLE + 20. NOPAGE + +Short descriptions to the page flags: + + 0. LOCKED + page is being locked for exclusive access, eg. by undergoing read/write IO + + 7. SLAB + page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator + When compound page is used, SLUB/SLQB will only set this flag on the head + page; SLOB will not flag it at all. + +10. BUDDY + a free memory block managed by the buddy system allocator + The buddy system organizes free memory in blocks of various orders. + An order N block has 2^N physically contiguous pages, with the BUDDY flag + set for and _only_ for the first page. + +15. COMPOUND_HEAD +16. COMPOUND_TAIL + A compound page with order N consists of 2^N physically contiguous pages. + A compound page with order 2 takes the form of "HTTT", where H donates its + head page and T donates its tail page(s). The major consumers of compound + pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc. + memory allocators and various device drivers. However in this interface, + only huge/giga pages are made visible to end users. +17. HUGE + this is an integral part of a HugeTLB page + +20. NOPAGE + no page frame exists at the requested address + + [IO related page flags] + 1. ERROR IO error occurred + 3. UPTODATE page has up-to-date data + ie. for file backed page: (in-memory data revision >= on-disk one) + 4. DIRTY page has been written to, hence contains new data + ie. for file backed page: (in-memory data revision > on-disk one) + 8. WRITEBACK page is being synced to disk + + [LRU related page flags] + 5. LRU page is in one of the LRU lists + 6. ACTIVE page is in the active LRU list +18. UNEVICTABLE page is in the unevictable (non-)LRU list + It is somehow pinned and not a candidate for LRU page reclaims, + eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments + 2. REFERENCED page has been referenced since last LRU list enqueue/requeue + 9. RECLAIM page will be reclaimed soon after its pageout IO completed +11. MMAP a memory mapped page +12. ANON a memory mapped page that is not part of a file +13. SWAPCACHE page is mapped to swap space, ie. has an associated swap entry +14. SWAPBACKED page is backed by swap/RAM + +The page-types tool in this directory can be used to query the above flags. Using pagemap to do something useful: diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt index bb1f5c6e28b..510917ff59e 100644 --- a/Documentation/vm/slub.txt +++ b/Documentation/vm/slub.txt @@ -41,6 +41,8 @@ Possible debug options are P Poisoning (object and padding) U User tracking (free and alloc) T Trace (please only use on single slabs) + O Switch debugging off for caches that would have + caused higher minimum slab orders - Switch all debugging off (useful if the kernel is configured with CONFIG_SLUB_DEBUG_ON) @@ -59,6 +61,14 @@ to the dentry cache with slub_debug=F,dentry +Debugging options may require the minimum possible slab order to increase as +a result of storing the metadata (for example, caches with PAGE_SIZE object +sizes). This has a higher liklihood of resulting in slab allocation errors +in low memory situations or if there's high fragmentation of memory. To +switch off debugging for such caches by default, use + + slub_debug=O + In case you forgot to enable debugging on the kernel command line: It is possible to enable debugging manually when the kernel is up. Look at the contents of: diff --git a/Documentation/watchdog/hpwdt.txt b/Documentation/watchdog/hpwdt.txt new file mode 100644 index 00000000000..9c24d5ffbb0 --- /dev/null +++ b/Documentation/watchdog/hpwdt.txt @@ -0,0 +1,95 @@ +Last reviewed: 06/02/2009 + + HP iLO2 NMI Watchdog Driver + NMI sourcing for iLO2 based ProLiant Servers + Documentation and Driver by + Thomas Mingarelli <thomas.mingarelli@hp.com> + + The HP iLO2 NMI Watchdog driver is a kernel module that provides basic + watchdog functionality and the added benefit of NMI sourcing. Both the + watchdog functionality and the NMI sourcing capability need to be enabled + by the user. Remember that the two modes are not dependant on one another. + A user can have the NMI sourcing without the watchdog timer and vice-versa. + + Watchdog functionality is enabled like any other common watchdog driver. That + is, an application needs to be started that kicks off the watchdog timer. A + basic application exists in the Documentation/watchdog/src directory called + watchdog-test.c. Simply compile the C file and kick it off. If the system + gets into a bad state and hangs, the HP ProLiant iLO 2 timer register will + not be updated in a timely fashion and a hardware system reset (also known as + an Automatic Server Recovery (ASR)) event will occur. + + The hpwdt driver also has four (4) module parameters. They are the following: + + soft_margin - allows the user to set the watchdog timer value + allow_kdump - allows the user to save off a kernel dump image after an NMI + nowayout - basic watchdog parameter that does not allow the timer to + be restarted or an impending ASR to be escaped. + priority - determines whether or not the hpwdt driver is first on the + die_notify list to handle NMIs or last. The default value + for this module parameter is 0 or LAST. If the user wants to + enable NMI sourcing then reload the hpwdt driver with + priority=1 (and boot with nmi_watchdog=0). + + NOTE: More information about watchdog drivers in general, including the ioctl + interface to /dev/watchdog can be found in + Documentation/watchdog/watchdog-api.txt and Documentation/IPMI.txt. + + The priority parameter was introduced due to other kernel software that relied + on handling NMIs (like oprofile). Keeping hpwdt's priority at 0 (or LAST) + enables the users of NMIs for non critical events to be work as expected. + + The NMI sourcing capability is disabled by default due to the inability to + distinguish between "NMI Watchdog Ticks" and "HW generated NMI events" in the + Linux kernel. What this means is that the hpwdt nmi handler code is called + each time the NMI signal fires off. This could amount to several thousands of + NMIs in a matter of seconds. If a user sees the Linux kernel's "dazed and + confused" message in the logs or if the system gets into a hung state, then + the hpwdt driver can be reloaded with the "priority" module parameter set + (priority=1). + + 1. If the kernel has not been booted with nmi_watchdog turned off then + edit /boot/grub/menu.lst and place the nmi_watchdog=0 at the end of the + currently booting kernel line. + 2. reboot the sever + 3. Once the system comes up perform a rmmod hpwdt + 4. insmod /lib/modules/`uname -r`/kernel/drivers/char/watchdog/hpwdt.ko priority=1 + + Now, the hpwdt can successfully receive and source the NMI and provide a log + message that details the reason for the NMI (as determined by the HP BIOS). + + Below is a list of NMIs the HP BIOS understands along with the associated + code (reason): + + No source found 00h + + Uncorrectable Memory Error 01h + + ASR NMI 1Bh + + PCI Parity Error 20h + + NMI Button Press 27h + + SB_BUS_NMI 28h + + ILO Doorbell NMI 29h + + ILO IOP NMI 2Ah + + ILO Watchdog NMI 2Bh + + Proc Throt NMI 2Ch + + Front Side Bus NMI 2Dh + + PCI Express Error 2Fh + + DMA controller NMI 30h + + Hypertransport/CSI Error 31h + + + + -- Tom Mingarelli + (thomas.mingarelli@hp.com) diff --git a/Documentation/x86/00-INDEX b/Documentation/x86/00-INDEX index dbe3377754a..f37b46d3486 100644 --- a/Documentation/x86/00-INDEX +++ b/Documentation/x86/00-INDEX @@ -2,3 +2,5 @@ - this file mtrr.txt - how to use x86 Memory Type Range Registers to increase performance +exception-tables.txt + - why and how Linux kernel uses exception tables on x86 diff --git a/Documentation/exception.txt b/Documentation/x86/exception-tables.txt index 2d5aded6424..32901aa36f0 100644 --- a/Documentation/exception.txt +++ b/Documentation/x86/exception-tables.txt @@ -1,123 +1,123 @@ - Kernel level exception handling in Linux 2.1.8 + Kernel level exception handling in Linux Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com> -When a process runs in kernel mode, it often has to access user -mode memory whose address has been passed by an untrusted program. +When a process runs in kernel mode, it often has to access user +mode memory whose address has been passed by an untrusted program. To protect itself the kernel has to verify this address. -In older versions of Linux this was done with the -int verify_area(int type, const void * addr, unsigned long size) +In older versions of Linux this was done with the +int verify_area(int type, const void * addr, unsigned long size) function (which has since been replaced by access_ok()). -This function verified that the memory area starting at address +This function verified that the memory area starting at address 'addr' and of size 'size' was accessible for the operation specified -in type (read or write). To do this, verify_read had to look up the -virtual memory area (vma) that contained the address addr. In the -normal case (correctly working program), this test was successful. +in type (read or write). To do this, verify_read had to look up the +virtual memory area (vma) that contained the address addr. In the +normal case (correctly working program), this test was successful. It only failed for a few buggy programs. In some kernel profiling tests, this normally unneeded verification used up a considerable amount of time. -To overcome this situation, Linus decided to let the virtual memory +To overcome this situation, Linus decided to let the virtual memory hardware present in every Linux-capable CPU handle this test. How does this work? -Whenever the kernel tries to access an address that is currently not -accessible, the CPU generates a page fault exception and calls the -page fault handler +Whenever the kernel tries to access an address that is currently not +accessible, the CPU generates a page fault exception and calls the +page fault handler void do_page_fault(struct pt_regs *regs, unsigned long error_code) -in arch/i386/mm/fault.c. The parameters on the stack are set up by -the low level assembly glue in arch/i386/kernel/entry.S. The parameter -regs is a pointer to the saved registers on the stack, error_code +in arch/x86/mm/fault.c. The parameters on the stack are set up by +the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter +regs is a pointer to the saved registers on the stack, error_code contains a reason code for the exception. -do_page_fault first obtains the unaccessible address from the CPU -control register CR2. If the address is within the virtual address -space of the process, the fault probably occurred, because the page -was not swapped in, write protected or something similar. However, -we are interested in the other case: the address is not valid, there -is no vma that contains this address. In this case, the kernel jumps -to the bad_area label. - -There it uses the address of the instruction that caused the exception -(i.e. regs->eip) to find an address where the execution can continue -(fixup). If this search is successful, the fault handler modifies the -return address (again regs->eip) and returns. The execution will +do_page_fault first obtains the unaccessible address from the CPU +control register CR2. If the address is within the virtual address +space of the process, the fault probably occurred, because the page +was not swapped in, write protected or something similar. However, +we are interested in the other case: the address is not valid, there +is no vma that contains this address. In this case, the kernel jumps +to the bad_area label. + +There it uses the address of the instruction that caused the exception +(i.e. regs->eip) to find an address where the execution can continue +(fixup). If this search is successful, the fault handler modifies the +return address (again regs->eip) and returns. The execution will continue at the address in fixup. Where does fixup point to? -Since we jump to the contents of fixup, fixup obviously points -to executable code. This code is hidden inside the user access macros. -I have picked the get_user macro defined in include/asm/uaccess.h as an -example. The definition is somewhat hard to follow, so let's peek at +Since we jump to the contents of fixup, fixup obviously points +to executable code. This code is hidden inside the user access macros. +I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h +as an example. The definition is somewhat hard to follow, so let's peek at the code generated by the preprocessor and the compiler. I selected -the get_user call in drivers/char/console.c for a detailed examination. +the get_user call in drivers/char/sysrq.c for a detailed examination. -The original code in console.c line 1405: +The original code in sysrq.c line 587: get_user(c, buf); The preprocessor output (edited to become somewhat readable): ( - { - long __gu_err = - 14 , __gu_val = 0; - const __typeof__(*( ( buf ) )) *__gu_addr = ((buf)); - if (((((0 + current_set[0])->tss.segment) == 0x18 ) || - (((sizeof(*(buf))) <= 0xC0000000UL) && - ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf))))))) + { + long __gu_err = - 14 , __gu_val = 0; + const __typeof__(*( ( buf ) )) *__gu_addr = ((buf)); + if (((((0 + current_set[0])->tss.segment) == 0x18 ) || + (((sizeof(*(buf))) <= 0xC0000000UL) && + ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf))))))) do { - __gu_err = 0; - switch ((sizeof(*(buf)))) { - case 1: - __asm__ __volatile__( - "1: mov" "b" " %2,%" "b" "1\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl %3,%0\n" - " xor" "b" " %" "b" "1,%" "b" "1\n" - " jmp 2b\n" - ".section __ex_table,\"a\"\n" - " .align 4\n" - " .long 1b,3b\n" + __gu_err = 0; + switch ((sizeof(*(buf)))) { + case 1: + __asm__ __volatile__( + "1: mov" "b" " %2,%" "b" "1\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl %3,%0\n" + " xor" "b" " %" "b" "1,%" "b" "1\n" + " jmp 2b\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" + " .long 1b,3b\n" ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *) - ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ; - break; - case 2: + ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ; + break; + case 2: __asm__ __volatile__( - "1: mov" "w" " %2,%" "w" "1\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl %3,%0\n" - " xor" "w" " %" "w" "1,%" "w" "1\n" - " jmp 2b\n" - ".section __ex_table,\"a\"\n" - " .align 4\n" - " .long 1b,3b\n" + "1: mov" "w" " %2,%" "w" "1\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl %3,%0\n" + " xor" "w" " %" "w" "1,%" "w" "1\n" + " jmp 2b\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" + " .long 1b,3b\n" ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) - ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )); - break; - case 4: - __asm__ __volatile__( - "1: mov" "l" " %2,%" "" "1\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl %3,%0\n" - " xor" "l" " %" "" "1,%" "" "1\n" - " jmp 2b\n" - ".section __ex_table,\"a\"\n" - " .align 4\n" " .long 1b,3b\n" + ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )); + break; + case 4: + __asm__ __volatile__( + "1: mov" "l" " %2,%" "" "1\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl %3,%0\n" + " xor" "l" " %" "" "1,%" "" "1\n" + " jmp 2b\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" " .long 1b,3b\n" ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) - ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err)); - break; - default: - (__gu_val) = __get_user_bad(); - } - } while (0) ; - ((c)) = (__typeof__(*((buf))))__gu_val; + ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err)); + break; + default: + (__gu_val) = __get_user_bad(); + } + } while (0) ; + ((c)) = (__typeof__(*((buf))))__gu_val; __gu_err; } ); @@ -127,12 +127,12 @@ see what code gcc generates: > xorl %edx,%edx > movl current_set,%eax - > cmpl $24,788(%eax) - > je .L1424 + > cmpl $24,788(%eax) + > je .L1424 > cmpl $-1073741825,64(%esp) - > ja .L1423 + > ja .L1423 > .L1424: - > movl %edx,%eax + > movl %edx,%eax > movl 64(%esp),%ebx > #APP > 1: movb (%ebx),%dl /* this is the actual user access */ @@ -149,17 +149,17 @@ see what code gcc generates: > .L1423: > movzbl %dl,%esi -The optimizer does a good job and gives us something we can actually -understand. Can we? The actual user access is quite obvious. Thanks -to the unified address space we can just access the address in user +The optimizer does a good job and gives us something we can actually +understand. Can we? The actual user access is quite obvious. Thanks +to the unified address space we can just access the address in user memory. But what does the .section stuff do????? To understand this we have to look at the final kernel: > objdump --section-headers vmlinux - > + > > vmlinux: file format elf32-i386 - > + > > Sections: > Idx Name Size VMA LMA File off Algn > 0 .text 00098f40 c0100000 c0100000 00001000 2**4 @@ -198,18 +198,18 @@ final kernel executable: The whole user memory access is reduced to 10 x86 machine instructions. The instructions bracketed in the .section directives are no longer -in the normal execution path. They are located in a different section +in the normal execution path. They are located in a different section of the executable file: > objdump --disassemble --section=.fixup vmlinux - > + > > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax > c0199ffa <.fixup+10ba> xorb %dl,%dl > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3> And finally: > objdump --full-contents --section=__ex_table vmlinux - > + > > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................ > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................ > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................ @@ -235,8 +235,8 @@ sections in the ELF object file. So the instructions ended up in the .fixup section of the object file and the addresses .long 1b,3b ended up in the __ex_table section of the object file. 1b and 3b -are local labels. The local label 1b (1b stands for next label 1 -backward) is the address of the instruction that might fault, i.e. +are local labels. The local label 1b (1b stands for next label 1 +backward) is the address of the instruction that might fault, i.e. in our case the address of the label 1 is c017e7a5: the original assembly code: > 1: movb (%ebx),%dl and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl @@ -254,7 +254,7 @@ The assembly code becomes the value pair > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ ^this is ^this is - 1b 3b + 1b 3b c017e7a5,c0199ff5 in the exception table of the kernel. So, what actually happens if a fault from kernel mode with no suitable @@ -266,9 +266,9 @@ vma occurs? 3.) CPU calls do_page_fault 4.) do page fault calls search_exception_table (regs->eip == c017e7a5); 5.) search_exception_table looks up the address c017e7a5 in the - exception table (i.e. the contents of the ELF section __ex_table) + exception table (i.e. the contents of the ELF section __ex_table) and returns the address of the associated fault handle code c0199ff5. -6.) do_page_fault modifies its own return address to point to the fault +6.) do_page_fault modifies its own return address to point to the fault handle code and returns. 7.) execution continues in the fault handling code. 8.) 8a) EAX becomes -EFAULT (== -14) diff --git a/Documentation/x86/zero-page.txt b/Documentation/x86/zero-page.txt index 4f913857b8a..feb37e17701 100644 --- a/Documentation/x86/zero-page.txt +++ b/Documentation/x86/zero-page.txt @@ -12,6 +12,7 @@ Offset Proto Name Meaning 000/040 ALL screen_info Text mode or frame buffer information (struct screen_info) 040/014 ALL apm_bios_info APM BIOS information (struct apm_bios_info) +058/008 ALL tboot_addr Physical address of tboot shared page 060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information (struct ist_info) 080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!! |