aboutsummaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/DocBook/kernel-api.tmpl44
-rw-r--r--Documentation/RCU/checklist.txt44
-rw-r--r--Documentation/RCU/whatisRCU.txt12
-rw-r--r--Documentation/devices.txt7
-rw-r--r--Documentation/filesystems/fuse.txt118
-rw-r--r--Documentation/filesystems/ramfs-rootfs-initramfs.txt146
-rw-r--r--Documentation/kdump/kdump.txt420
-rw-r--r--Documentation/memory-barriers.txt34
-rw-r--r--Documentation/rtc.txt7
-rw-r--r--Documentation/sysrq.txt5
10 files changed, 611 insertions, 226 deletions
diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl
index 31b727ceb12..3630a0d7695 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -62,6 +62,8 @@
<sect1><title>Internal Functions</title>
!Ikernel/exit.c
!Ikernel/signal.c
+!Iinclude/linux/kthread.h
+!Ekernel/kthread.c
</sect1>
<sect1><title>Kernel objects manipulation</title>
@@ -114,6 +116,29 @@ X!Ilib/string.c
</sect1>
</chapter>
+ <chapter id="kernel-lib">
+ <title>Basic Kernel Library Functions</title>
+
+ <para>
+ The Linux kernel provides more basic utility functions.
+ </para>
+
+ <sect1><title>Bitmap Operations</title>
+!Elib/bitmap.c
+!Ilib/bitmap.c
+ </sect1>
+
+ <sect1><title>Command-line Parsing</title>
+!Elib/cmdline.c
+ </sect1>
+
+ <sect1><title>CRC Functions</title>
+!Elib/crc16.c
+!Elib/crc32.c
+!Elib/crc-ccitt.c
+ </sect1>
+ </chapter>
+
<chapter id="mm">
<title>Memory Management in Linux</title>
<sect1><title>The Slab Cache</title>
@@ -281,12 +306,13 @@ X!Ekernel/module.c
<sect1><title>MTRR Handling</title>
!Earch/i386/kernel/cpu/mtrr/main.c
</sect1>
+
<sect1><title>PCI Support Library</title>
!Edrivers/pci/pci.c
!Edrivers/pci/pci-driver.c
!Edrivers/pci/remove.c
!Edrivers/pci/pci-acpi.c
-<!-- kerneldoc does not understand to __devinit
+<!-- kerneldoc does not understand __devinit
X!Edrivers/pci/search.c
-->
!Edrivers/pci/msi.c
@@ -315,6 +341,13 @@ X!Earch/i386/kernel/mca.c
</sect1>
</chapter>
+ <chapter id="firmware">
+ <title>Firmware Interfaces</title>
+ <sect1><title>DMI Interfaces</title>
+!Edrivers/firmware/dmi_scan.c
+ </sect1>
+ </chapter>
+
<chapter id="devfs">
<title>The Device File System</title>
!Efs/devfs/base.c
@@ -403,7 +436,6 @@ X!Edrivers/pnp/system.c
</sect1>
</chapter>
-
<chapter id="blkdev">
<title>Block Devices</title>
!Eblock/ll_rw_blk.c
@@ -414,6 +446,14 @@ X!Edrivers/pnp/system.c
!Edrivers/char/misc.c
</chapter>
+ <chapter id="parportdev">
+ <title>Parallel Port Devices</title>
+!Iinclude/linux/parport.h
+!Edrivers/parport/ieee1284.c
+!Edrivers/parport/share.c
+!Idrivers/parport/daisy.c
+ </chapter>
+
<chapter id="viddev">
<title>Video4Linux</title>
!Edrivers/media/video/videodev.c
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 49e27cc1938..1d50cf0c905 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -144,9 +144,47 @@ over a rather long period of time, but improvements are always welcome!
whether the increased speed is worth it.
8. Although synchronize_rcu() is a bit slower than is call_rcu(),
- it usually results in simpler code. So, unless update performance
- is important or the updaters cannot block, synchronize_rcu()
- should be used in preference to call_rcu().
+ it usually results in simpler code. So, unless update
+ performance is critically important or the updaters cannot block,
+ synchronize_rcu() should be used in preference to call_rcu().
+
+ An especially important property of the synchronize_rcu()
+ primitive is that it automatically self-limits: if grace periods
+ are delayed for whatever reason, then the synchronize_rcu()
+ primitive will correspondingly delay updates. In contrast,
+ code using call_rcu() should explicitly limit update rate in
+ cases where grace periods are delayed, as failing to do so can
+ result in excessive realtime latencies or even OOM conditions.
+
+ Ways of gaining this self-limiting property when using call_rcu()
+ include:
+
+ a. Keeping a count of the number of data-structure elements
+ used by the RCU-protected data structure, including those
+ waiting for a grace period to elapse. Enforce a limit
+ on this number, stalling updates as needed to allow
+ previously deferred frees to complete.
+
+ Alternatively, limit only the number awaiting deferred
+ free rather than the total number of elements.
+
+ b. Limiting update rate. For example, if updates occur only
+ once per hour, then no explicit rate limiting is required,
+ unless your system is already badly broken. The dcache
+ subsystem takes this approach -- updates are guarded
+ by a global lock, limiting their rate.
+
+ c. Trusted update -- if updates can only be done manually by
+ superuser or some other trusted user, then it might not
+ be necessary to automatically limit them. The theory
+ here is that superuser already has lots of ways to crash
+ the machine.
+
+ d. Use call_rcu_bh() rather than call_rcu(), in order to take
+ advantage of call_rcu_bh()'s faster grace periods.
+
+ e. Periodically invoke synchronize_rcu(), permitting a limited
+ number of updates per grace period.
9. All RCU list-traversal primitives, which include
list_for_each_rcu(), list_for_each_entry_rcu(),
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 6e459420ee9..4f41a60e511 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -184,7 +184,17 @@ synchronize_rcu()
blocking, it registers a function and argument which are invoked
after all ongoing RCU read-side critical sections have completed.
This callback variant is particularly useful in situations where
- it is illegal to block.
+ it is illegal to block or where update-side performance is
+ critically important.
+
+ However, the call_rcu() API should not be used lightly, as use
+ of the synchronize_rcu() API generally results in simpler code.
+ In addition, the synchronize_rcu() API has the nice property
+ of automatically limiting update rate should grace periods
+ be delayed. This property results in system resilience in face
+ of denial-of-service attacks. Code using call_rcu() should limit
+ update rate in order to gain this same sort of resilience. See
+ checklist.txt for some approaches to limiting the update rate.
rcu_assign_pointer()
diff --git a/Documentation/devices.txt b/Documentation/devices.txt
index b2f593fc76c..4aaf68fafeb 100644
--- a/Documentation/devices.txt
+++ b/Documentation/devices.txt
@@ -3,7 +3,7 @@
Maintained by Torben Mathiasen <device@lanana.org>
- Last revised: 01 March 2006
+ Last revised: 15 May 2006
This list is the Linux Device List, the official registry of allocated
device numbers and /dev directory nodes for the Linux operating
@@ -2791,6 +2791,7 @@ Your cooperation is appreciated.
170 = /dev/ttyNX0 Hilscher netX serial port 0
...
185 = /dev/ttyNX15 Hilscher netX serial port 15
+ 186 = /dev/ttyJ0 JTAG1 DCC protocol based serial port emulation
205 char Low-density serial ports (alternate device)
0 = /dev/culu0 Callout device for ttyLU0
@@ -3108,6 +3109,10 @@ Your cooperation is appreciated.
...
240 = /dev/rfdp 16th RFD FTL layer
+257 char Phoenix Technologies Cryptographic Services Driver
+ 0 = /dev/ptlsec Crypto Services Driver
+
+
**** ADDITIONAL /dev DIRECTORY ENTRIES
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
index 33f74310d16..a584f05403a 100644
--- a/Documentation/filesystems/fuse.txt
+++ b/Documentation/filesystems/fuse.txt
@@ -18,6 +18,14 @@ Non-privileged mount (or user mount):
user. NOTE: this is not the same as mounts allowed with the "user"
option in /etc/fstab, which is not discussed here.
+Filesystem connection:
+
+ A connection between the filesystem daemon and the kernel. The
+ connection exists until either the daemon dies, or the filesystem is
+ umounted. Note that detaching (or lazy umounting) the filesystem
+ does _not_ break the connection, in this case it will exist until
+ the last reference to the filesystem is released.
+
Mount owner:
The user who does the mounting.
@@ -86,16 +94,20 @@ Mount options
The default is infinite. Note that the size of read requests is
limited anyway to 32 pages (which is 128kbyte on i386).
-Sysfs
-~~~~~
+Control filesystem
+~~~~~~~~~~~~~~~~~~
+
+There's a control filesystem for FUSE, which can be mounted by:
-FUSE sets up the following hierarchy in sysfs:
+ mount -t fusectl none /sys/fs/fuse/connections
- /sys/fs/fuse/connections/N/
+Mounting it under the '/sys/fs/fuse/connections' directory makes it
+backwards compatible with earlier versions.
-where N is an increasing number allocated to each new connection.
+Under the fuse control filesystem each connection has a directory
+named by a unique number.
-For each connection the following attributes are defined:
+For each connection the following files exist within this directory:
'waiting'
@@ -110,7 +122,47 @@ For each connection the following attributes are defined:
connection. This means that all waiting requests will be aborted an
error returned for all aborted and new requests.
-Only a privileged user may read or write these attributes.
+Only the owner of the mount may read or write these files.
+
+Interrupting filesystem operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a process issuing a FUSE filesystem request is interrupted, the
+following will happen:
+
+ 1) If the request is not yet sent to userspace AND the signal is
+ fatal (SIGKILL or unhandled fatal signal), then the request is
+ dequeued and returns immediately.
+
+ 2) If the request is not yet sent to userspace AND the signal is not
+ fatal, then an 'interrupted' flag is set for the request. When
+ the request has been successfully transfered to userspace and
+ this flag is set, an INTERRUPT request is queued.
+
+ 3) If the request is already sent to userspace, then an INTERRUPT
+ request is queued.
+
+INTERRUPT requests take precedence over other requests, so the
+userspace filesystem will receive queued INTERRUPTs before any others.
+
+The userspace filesystem may ignore the INTERRUPT requests entirely,
+or may honor them by sending a reply to the _original_ request, with
+the error set to EINTR.
+
+It is also possible that there's a race between processing the
+original request and it's INTERRUPT request. There are two possibilities:
+
+ 1) The INTERRUPT request is processed before the original request is
+ processed
+
+ 2) The INTERRUPT request is processed after the original request has
+ been answered
+
+If the filesystem cannot find the original request, it should wait for
+some timeout and/or a number of new requests to arrive, after which it
+should reply to the INTERRUPT request with an EAGAIN error. In case
+1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT
+reply will be ignored.
Aborting a filesystem connection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -139,8 +191,8 @@ the filesystem. There are several ways to do this:
- Use forced umount (umount -f). Works in all cases but only if
filesystem is still attached (it hasn't been lazy unmounted)
- - Abort filesystem through the sysfs interface. Most powerful
- method, always works.
+ - Abort filesystem through the FUSE control filesystem. Most
+ powerful method, always works.
How do non-privileged mounts work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -304,25 +356,7 @@ Scenario 1 - Simple deadlock
| | for "file"]
| | *DEADLOCK*
-The solution for this is to allow requests to be interrupted while
-they are in userspace:
-
- | [interrupted by signal] |
- | <fuse_unlink() |
- | [release semaphore] | [semaphore acquired]
- | <sys_unlink() |
- | | >fuse_unlink()
- | | [queue req on fc->pending]
- | | [wake up fc->waitq]
- | | [sleep on req->waitq]
-
-If the filesystem daemon was single threaded, this will stop here,
-since there's no other thread to dequeue and execute the request.
-In this case the solution is to kill the FUSE daemon as well. If
-there are multiple serving threads, you just have to kill them as
-long as any remain.
-
-Moral: a filesystem which deadlocks, can soon find itself dead.
+The solution for this is to allow the filesystem to be aborted.
Scenario 2 - Tricky deadlock
----------------------------
@@ -355,24 +389,14 @@ but is caused by a pagefault.
| | [lock page]
| | * DEADLOCK *
-Solution is again to let the the request be interrupted (not
-elaborated further).
-
-An additional problem is that while the write buffer is being
-copied to the request, the request must not be interrupted. This
-is because the destination address of the copy may not be valid
-after the request is interrupted.
-
-This is solved with doing the copy atomically, and allowing
-interruption while the page(s) belonging to the write buffer are
-faulted with get_user_pages(). The 'req->locked' flag indicates
-when the copy is taking place, and interruption is delayed until
-this flag is unset.
+Solution is basically the same as above.
-Scenario 3 - Tricky deadlock with asynchronous read
----------------------------------------------------
+An additional problem is that while the write buffer is being copied
+to the request, the request must not be interrupted/aborted. This is
+because the destination address of the copy may not be valid after the
+request has returned.
-The same situation as above, except thread-1 will wait on page lock
-and hence it will be uninterruptible as well. The solution is to
-abort the connection with forced umount (if mount is attached) or
-through the abort attribute in sysfs.
+This is solved with doing the copy atomically, and allowing abort
+while the page(s) belonging to the write buffer are faulted with
+get_user_pages(). The 'req->locked' flag indicates when the copy is
+taking place, and abort is delayed until this flag is unset.
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
index 60ab61e54e8..25981e2e51b 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
@@ -70,11 +70,13 @@ tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information.
What is rootfs?
---------------
-Rootfs is a special instance of ramfs, which is always present in 2.6 systems.
-(It's used internally as the starting and stopping point for searches of the
-kernel's doubly-linked list of mount points.)
+Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
+always present in 2.6 systems. You can't unmount rootfs for approximately the
+same reason you can't kill the init process; rather than having special code
+to check for and handle an empty list, it's smaller and simpler for the kernel
+to just make sure certain lists can't become empty.
-Most systems just mount another filesystem over it and ignore it. The
+Most systems just mount another filesystem over rootfs and ignore it. The
amount of space an empty instance of ramfs takes up is tiny.
What is initramfs?
@@ -92,14 +94,16 @@ out of that.
All this differs from the old initrd in several ways:
- - The old initrd was a separate file, while the initramfs archive is linked
- into the linux kernel image. (The directory linux-*/usr is devoted to
- generating this archive during the build.)
+ - The old initrd was always a separate file, while the initramfs archive is
+ linked into the linux kernel image. (The directory linux-*/usr is devoted
+ to generating this archive during the build.)
- The old initrd file was a gzipped filesystem image (in some file format,
- such as ext2, that had to be built into the kernel), while the new
+ such as ext2, that needed a driver built into the kernel), while the new
initramfs archive is a gzipped cpio archive (like tar only simpler,
- see cpio(1) and Documentation/early-userspace/buffer-format.txt).
+ see cpio(1) and Documentation/early-userspace/buffer-format.txt). The
+ kernel's cpio extraction code is not only extremely small, it's also
+ __init data that can be discarded during the boot process.
- The program run by the old initrd (which was called /initrd, not /init) did
some setup and then returned to the kernel, while the init program from
@@ -124,13 +128,14 @@ Populating initramfs:
The 2.6 kernel build process always creates a gzipped cpio format initramfs
archive and links it into the resulting kernel binary. By default, this
-archive is empty (consuming 134 bytes on x86). The config option
-CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices
-in menuconfig, and living in usr/Kconfig) can be used to specify a source for
-the initramfs archive, which will automatically be incorporated into the
-resulting binary. This option can point to an existing gzipped cpio archive, a
-directory containing files to be archived, or a text file specification such
-as the following example:
+archive is empty (consuming 134 bytes on x86).
+
+The config option CONFIG_INITRAMFS_SOURCE (for some reason buried under
+devices->block devices in menuconfig, and living in usr/Kconfig) can be used
+to specify a source for the initramfs archive, which will automatically be
+incorporated into the resulting binary. This option can point to an existing
+gzipped cpio archive, a directory containing files to be archived, or a text
+file specification such as the following example:
dir /dev 755 0 0
nod /dev/console 644 0 0 c 5 1
@@ -146,23 +151,84 @@ as the following example:
Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
documenting the above file format.
-One advantage of the text file is that root access is not required to
+One advantage of the configuration file is that root access is not required to
set permissions or create device nodes in the new archive. (Note that those
two example "file" entries expect to find files named "init.sh" and "busybox" in
a directory called "initramfs", under the linux-2.6.* directory. See
Documentation/early-userspace/README for more details.)
-The kernel does not depend on external cpio tools, gen_init_cpio is created
-from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's
-boot-time extractor is also (obviously) self-contained. However, if you _do_
-happen to have cpio installed, the following command line can extract the
-generated cpio image back into its component files:
+The kernel does not depend on external cpio tools. If you specify a
+directory instead of a configuration file, the kernel's build infrastructure
+creates a configuration file from that directory (usr/Makefile calls
+scripts/gen_initramfs_list.sh), and proceeds to package up that directory
+using the config file (by feeding it to usr/gen_init_cpio, which is created
+from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is
+entirely self-contained, and the kernel's boot-time extractor is also
+(obviously) self-contained.
+
+The one thing you might need external cpio utilities installed for is creating
+or extracting your own preprepared cpio files to feed to the kernel build
+(instead of a config file or directory).
+
+The following command line can extract a cpio image (either by the above script
+or by the kernel build) back into its component files:
cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
+The following shell script can create a prebuilt cpio archive you can
+use in place of the above config file:
+
+ #!/bin/sh
+
+ # Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation.
+ # Licensed under GPL version 2
+
+ if [ $# -ne 2 ]
+ then
+ echo "usage: mkinitramfs directory imagename.cpio.gz"
+ exit 1
+ fi
+
+ if [ -d "$1" ]
+ then
+ echo "creating $2 from $1"
+ (cd "$1"; find . | cpio -o -H newc | gzip) > "$2"
+ else
+ echo "First argument must be a directory"
+ exit 1
+ fi
+
+Note: The cpio man page contains some bad advice that will break your initramfs
+archive if you follow it. It says "A typical way to generate the list
+of filenames is with the find command; you should give find the -depth option
+to minimize problems with permissions on directories that are unwritable or not
+searchable." Don't do this when creating initramfs.cpio.gz images, it won't
+work. The Linux kernel cpio extractor won't create files in a directory that
+doesn't exist, so the directory entries must go before the files that go in
+those directories. The above script gets them in the right order.
+
+External initramfs images:
+--------------------------
+
+If the kernel has initrd support enabled, an external cpio.gz archive can also
+be passed into a 2.6 kernel in place of an initrd. In this case, the kernel
+will autodetect the type (initramfs, not initrd) and extract the external cpio
+archive into rootfs before trying to run /init.
+
+This has the memory efficiency advantages of initramfs (no ramdisk block
+device) but the separate packaging of initrd (which is nice if you have
+non-GPL code you'd like to run from initramfs, without conflating it with
+the GPL licensed Linux kernel binary).
+
+It can also be used to supplement the kernel's built-in initamfs image. The
+files in the external archive will overwrite any conflicting files in
+the built-in initramfs archive. Some distributors also prefer to customize
+a single kernel image with task-specific initramfs images, without recompiling.
+
Contents of initramfs:
----------------------
+An initramfs archive is a complete self-contained root filesystem for Linux.
If you don't already understand what shared libraries, devices, and paths
you need to get a minimal root filesystem up and running, here are some
references:
@@ -176,13 +242,36 @@ code against, along with some related utilities. It is BSD licensed.
I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
myself. These are LGPL and GPL, respectively. (A self-contained initramfs
-package is planned for the busybox 1.2 release.)
+package is planned for the busybox 1.3 release.)
In theory you could use glibc, but that's not well suited for small embedded
uses like this. (A "hello world" program statically linked against glibc is
over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do
name lookups, even when otherwise statically linked.)
+A good first step is to get initramfs to run a statically linked "hello world"
+program as init, and test it under an emulator like qemu (www.qemu.org) or
+User Mode Linux, like so:
+
+ cat > hello.c << EOF
+ #include <stdio.h>
+ #include <unistd.h>
+
+ int main(int argc, char *argv[])
+ {
+ printf("Hello world!\n");
+ sleep(999999999);
+ }
+ EOF
+ gcc -static hello2.c -o init
+ echo init | cpio -o -H newc | gzip > test.cpio.gz
+ # Testing external initramfs using the initrd loading mechanism.
+ qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero
+
+When debugging a normal root filesystem, it's nice to be able to boot with
+"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's
+just as useful.
+
Why cpio rather than tar?
-------------------------
@@ -241,7 +330,7 @@ the above threads) is:
Future directions:
------------------
-Today (2.6.14), initramfs is always compiled in, but not always used. The
+Today (2.6.16), initramfs is always compiled in, but not always used. The
kernel falls back to legacy boot code that is reached only if initramfs does
not contain an /init program. The fallback is legacy code, there to ensure a
smooth transition and allowing early boot functionality to gradually move to
@@ -258,8 +347,9 @@ and so on.
This kind of complexity (which inevitably includes policy) is rightly handled
in userspace. Both klibc and busybox/uClibc are working on simple initramfs
-packages to drop into a kernel build, and when standard solutions are ready
-and widely deployed, the kernel's legacy early boot code will become obsolete
-and a candidate for the feature removal schedule.
+packages to drop into a kernel build.
-But that's a while off yet.
+The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree.
+The kernel's current early boot code (partition detection, etc) will probably
+be migrated into a default initramfs, automatically created and used by the
+kernel build.
diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index 212cf3c21ab..08bafa8c1ca 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -1,155 +1,325 @@
-Documentation for kdump - the kexec-based crash dumping solution
+================================================================
+Documentation for Kdump - The kexec-based Crash Dumping Solution
================================================================
-DESIGN
-======
+This document includes overview, setup and installation, and analysis
+information.
-Kdump uses kexec to reboot to a second kernel whenever a dump needs to be
-taken. This second kernel is booted with very little memory. The first kernel
-reserves the section of memory that the second kernel uses. This ensures that
-on-going DMA from the first kernel does not corrupt the second kernel.
+Overview
+========
-All the necessary information about Core image is encoded in ELF format and
-stored in reserved area of memory before crash. Physical address of start of
-ELF header is passed to new kernel through command line parameter elfcorehdr=.
+Kdump uses kexec to quickly boot to a dump-capture kernel whenever a
+dump of the system kernel's memory needs to be taken (for example, when
+the system panics). The system kernel's memory image is preserved across
+the reboot and is accessible to the dump-capture kernel.
-On i386, the first 640 KB of physical memory is needed to boot, irrespective
-of where the kernel loads. Hence, this region is backed up by kexec just before
-rebooting into the new kernel.
+You can use common Linux commands, such as cp and scp, to copy the
+memory image to a dump file on the local disk, or across the network to
+a remote system.
-In the second kernel, "old memory" can be accessed in two ways.
+Kdump and kexec are currently supported on the x86, x86_64, and ppc64
+architectures.
-- The first one is through a /dev/oldmem device interface. A capture utility
- can read the device file and write out the memory in raw format. This is raw
- dump of memory and analysis/capture tool should be intelligent enough to
- determine where to look for the right information. ELF headers (elfcorehdr=)
- can become handy here.
+When the system kernel boots, it reserves a small section of memory for
+the dump-capture kernel. This ensures that ongoing Direct Memory Access
+(DMA) from the system kernel does not corrupt the dump-capture kernel.
+The kexec -p command loads the dump-capture kernel into this reserved
+memory.
-- The second interface is through /proc/vmcore. This exports the dump as an ELF
- format file which can be written out using any file copy command
- (cp, scp, etc). Further, gdb can be used to perform limited debugging on
- the dump file. This method ensures methods ensure that there is correct
- ordering of the dump pages (corresponding to the first 640 KB that has been
- relocated).
+On x86 machines, the first 640 KB of physical memory is needed to boot,
+regardless of where the kernel loads. Therefore, kexec backs up this
+region just before rebooting into the dump-capture kernel.
-SETUP
-=====
+All of the necessary information about the system kernel's core image is
+encoded in the ELF format, and stored in a reserved area of memory
+before a crash. The physical address of the start of the ELF header is
+passed to the dump-capture kernel through the elfcorehdr= boot
+parameter.
+
+With the dump-capture kernel, you can access the memory image, or "old
+memory," in two ways:
+
+- Through a /dev/oldmem device interface. A capture utility can read the
+ device file and write out the memory in raw format. This is a raw dump
+ of memory. Analysis and capture tools must be intelligent enough to
+ determine where to look for the right information.
+
+- Through /proc/vmcore. This exports the dump as an ELF-format file that
+ you can write out using file copy commands such as cp or scp. Further,
+ you can use analysis tools such as the GNU Debugger (GDB) and the Crash
+ tool to debug the dump file. This method ensures that the dump pages are
+ correctly ordered.
+
+
+Setup and Installation
+======================
+
+Install kexec-tools and the Kdump patch
+---------------------------------------
+
+1) Login as the root user.
+
+2) Download the kexec-tools user-space package from the following URL:
+
+ http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz
+
+3) Unpack the tarball with the tar command, as follows:
+
+ tar xvpzf kexec-tools-1.101.tar.gz
+
+4) Download the latest consolidated Kdump patch from the following URL:
+
+ http://lse.sourceforge.net/kdump/
+
+ (This location is being used until all the user-space Kdump patches
+ are integrated with the kexec-tools package.)
+
+5) Change to the kexec-tools-1.101 directory, as follows:
+
+ cd kexec-tools-1.101
+
+6) Apply the consolidated patch to the kexec-tools-1.101 source tree
+ with the patch command, as follows. (Modify the path to the downloaded
+ patch as necessary.)
+
+ patch -p1 < /path-to-kdump-patch/kexec-tools-1.101-kdump.patch
+
+7) Configure the package, as follows:
+
+ ./configure
+
+8) Compile the package, as follows:
+
+ make
+
+9) Install the package, as follows:
+
+ make install
+
+
+Download and build the system and dump-capture kernels
+------------------------------------------------------
+
+Download the mainline (vanilla) kernel source code (2.6.13-rc1 or newer)
+from http://www.kernel.org. Two kernels must be built: a system kernel
+and a dump-capture kernel. Use the following steps to configure these
+kernels with the necessary kexec and Kdump features:
+
+System kernel
+-------------
+
+1) Enable "kexec system call" in "Processor type and features."
+
+ CONFIG_KEXEC=y
+
+2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo
+ filesystems." This is usually enabled by default.
+
+ CONFIG_SYSFS=y
+
+ Note that "sysfs file system support" might not appear in the "Pseudo
+ filesystems" menu if "Configure standard kernel features (for small
+ systems)" is not enabled in "General Setup." In this case, check the
+ .config file itself to ensure that sysfs is turned on, as follows:
+
+ grep 'CONFIG_SYSFS' .config
+
+3) Enable "Compile the kernel with debug info" in "Kernel hacking."
+
+ CONFIG_DEBUG_INFO=Y
+
+ This causes the kernel to be built with debug symbols. The dump
+ analysis tools require a vmlinux with debug symbols in order to read
+ and analyze a dump file.
+
+4) Make and install the kernel and its modules. Update the boot loader
+ (such as grub, yaboot, or lilo) configuration files as necessary.
+
+5) Boot the system kernel with the boot parameter "crashkernel=Y@X",
+ where Y specifies how much memory to reserve for the dump-capture kernel
+ and X specifies the beginning of this reserved memory. For example,
+ "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
+ starting at physical address 0x01000000 for the dump-capture kernel.
+
+ On x86 and x86_64, use "crashkernel=64M@16M".
+
+ On ppc64, use "crashkernel=128M@32M".
+
+
+The dump-capture kernel
+-----------------------
-1) Download the upstream kexec-tools userspace package from
- http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz.
-
- Apply the latest consolidated kdump patch on top of kexec-tools-1.101
- from http://lse.sourceforge.net/kdump/. This arrangment has been made
- till all the userspace patches supporting kdump are integrated with
- upstream kexec-tools userspace.
-
-2) Download and build the appropriate (2.6.13-rc1 onwards) vanilla kernels.
- Two kernels need to be built in order to get this feature working.
- Following are the steps to properly configure the two kernels specific
- to kexec and kdump features:
-
- A) First kernel or regular kernel:
- ----------------------------------
- a) Enable "kexec system call" feature (in Processor type and features).
- CONFIG_KEXEC=y
- b) Enable "sysfs file system support" (in Pseudo filesystems).
- CONFIG_SYSFS=y
- c) make
- d) Boot into first kernel with the command line parameter "crashkernel=Y@X".
- Use appropriate values for X and Y. Y denotes how much memory to reserve
- for the second kernel, and X denotes at what physical address the
- reserved memory section starts. For example: "crashkernel=64M@16M".
-
-
- B) Second kernel or dump capture kernel:
- ---------------------------------------
- a) For i386 architecture enable Highmem support
- CONFIG_HIGHMEM=y
- b) Enable "kernel crash dumps" feature (under "Processor type and features")
- CONFIG_CRASH_DUMP=y
- c) Make sure a suitable value for "Physical address where the kernel is
- loaded" (under "Processor type and features"). By default this value
- is 0x1000000 (16MB) and it should be same as X (See option d above),
- e.g., 16 MB or 0x1000000.
- CONFIG_PHYSICAL_START=0x1000000
- d) Enable "/proc/vmcore support" (Optional, under "Pseudo filesystems").
- CONFIG_PROC_VMCORE=y
-
-3) After booting to regular kernel or first kernel, load the second kernel
- using the following command:
-
- kexec -p <second-kernel> --args-linux --elf32-core-headers
- --append="root=<root-dev> init 1 irqpoll maxcpus=1"
-
- Notes:
- ======
- i) <second-kernel> has to be a vmlinux image ie uncompressed elf image.
- bzImage will not work, as of now.
- ii) --args-linux has to be speicfied as if kexec it loading an elf image,
- it needs to know that the arguments supplied are of linux type.
- iii) By default ELF headers are stored in ELF64 format to support systems
- with more than 4GB memory. Option --elf32-core-headers forces generation
- of ELF32 headers. The reason for this option being, as of now gdb can
- not open vmcore file with ELF64 headers on a 32 bit systems. So ELF32
- headers can be used if one has non-PAE systems and hence memory less
- than 4GB.
- iv) Specify "irqpoll" as command line parameter. This reduces driver
- initialization failures in second kernel due to shared interrupts.
- v) <root-dev> needs to be specified in a format corresponding to the root
- device name in the output of mount command.
- vi) If you have built the drivers required to mount root file system as
- modules in <second-kernel>, then, specify
- --initrd=<initrd-for-second-kernel>.
- vii) Specify maxcpus=1 as, if during first kernel run, if panic happens on
- non-boot cpus, second kernel doesn't seem to be boot up all the cpus.
- The other option is to always built the second kernel without SMP
- support ie CONFIG_SMP=n
-
-4) After successfully loading the second kernel as above, if a panic occurs
- system reboots into the second kernel. A module can be written to force
- the panic or "ALT-SysRq-c" can be used initiate a crash dump for testing
- purposes.
-
-5) Once the second kernel has booted, write out the dump file using
+1) Under "General setup," append "-kdump" to the current string in
+ "Local version."
+
+2) On x86, enable high memory support under "Processor type and
+ features":
+
+ CONFIG_HIGHMEM64G=y
+ or
+ CONFIG_HIGHMEM4G
+
+3) On x86 and x86_64, disable symmetric multi-processing support
+ under "Processor type and features":
+
+ CONFIG_SMP=n
+ (If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line
+ when loading the dump-capture kernel, see section "Load the Dump-capture
+ Kernel".)
+
+4) On ppc64, disable NUMA support and enable EMBEDDED support:
+
+ CONFIG_NUMA=n
+ CONFIG_EMBEDDED=y
+ CONFIG_EEH=N for the dump-capture kernel
+
+5) Enable "kernel crash dumps" support under "Processor type and
+ features":
+
+ CONFIG_CRASH_DUMP=y
+
+6) Use a suitable value for "Physical address where the kernel is
+ loaded" (under "Processor type and features"). This only appears when
+ "kernel crash dumps" is enabled. By default this value is 0x1000000
+ (16MB). It should be the same as X in the "crashkernel=Y@X" boot
+ parameter discussed above.
+
+ On x86 and x86_64, use "CONFIG_PHYSICAL_START=0x1000000".
+
+ On ppc64 the value is automatically set at 32MB when
+ CONFIG_CRASH_DUMP is set.
+
+6) Optionally enable "/proc/vmcore support" under "Filesystems" ->
+ "Pseudo filesystems".
+
+ CONFIG_PROC_VMCORE=y
+ (CONFIG_PROC_VMCORE is set by default when CONFIG_CRASH_DUMP is selected.)
+
+7) Make and install the kernel and its modules. DO NOT add this kernel
+ to the boot loader configuration files.
+
+
+Load the Dump-capture Kernel
+============================
+
+After booting to the system kernel, load the dump-capture kernel using
+the following command:
+
+ kexec -p <dump-capture-kernel> \
+ --initrd=<initrd-for-dump-capture-kernel> --args-linux \
+ --append="root=<root-dev> init 1 irqpoll"
+
+
+Notes on loading the dump-capture kernel:
+
+* <dump-capture-kernel> must be a vmlinux image (that is, an
+ uncompressed ELF image). bzImage does not work at this time.
+
+* By default, the ELF headers are stored in ELF64 format to support
+ systems with more than 4GB memory. The --elf32-core-headers option can
+ be used to force the generation of ELF32 headers. This is necessary
+ because GDB currently cannot open vmcore files with ELF64 headers on
+ 32-bit systems. ELF32 headers can be used on non-PAE systems (that is,
+ less than 4GB of memory).
+
+* The "irqpoll" boot parameter reduces driver initialization failures
+ due to shared interrupts in the dump-capture kernel.
+
+* You must specify <root-dev> in the format corresponding to the root
+ device name in the output of mount command.
+
+* "init 1" boots the dump-capture kernel into single-user mode without
+ networking. If you want networking, use "init 3."
+
+
+Kernel Panic
+============
+
+After successfully loading the dump-capture kernel as previously
+described, the system will reboot into the dump-capture kernel if a
+system crash is triggered. Trigger points are located in panic(),
+die(), die_nmi() and in the sysrq handler (ALT-SysRq-c).
+
+The following conditions will execute a crash trigger point:
+
+If a hard lockup is detected and "NMI watchdog" is configured, the system
+will boot into the dump-capture kernel ( die_nmi() ).
+
+If die() is called, and it happens to be a thread with pid 0 or 1, or die()
+is called inside interrupt context or die() is called and panic_on_oops is set,
+the system will boot into the dump-capture kernel.
+
+On powererpc systems when a soft-reset is generated, die() is called by all cpus and the system system will boot into the dump-capture kernel.
+
+For testing purposes, you can trigger a crash by using "ALT-SysRq-c",
+"echo c > /proc/sysrq-trigger or write a module to force the panic.
+
+Write Out the Dump File
+=======================
+
+After the dump-capture kernel is booted, write out the dump file with
+the following command:
cp /proc/vmcore <dump-file>
- Dump memory can also be accessed as a /dev/oldmem device for a linear/raw
- view. To create the device, type:
+You can also access dumped memory as a /dev/oldmem device for a linear
+and raw view. To create the device, use the following command:
- mknod /dev/oldmem c 1 12
+ mknod /dev/oldmem c 1 12
- Use "dd" with suitable options for count, bs and skip to access specific
- portions of the dump.
+Use the dd command with suitable options for count, bs, and skip to
+access specific portions of the dump.
- Entire memory: dd if=/dev/oldmem of=oldmem.001
+To see the entire memory, use the following command:
+ dd if=/dev/oldmem of=oldmem.001
-ANALYSIS
+
+Analysis
========
-Limited analysis can be done using gdb on the dump file copied out of
-/proc/vmcore. Use vmlinux built with -g and run
- gdb vmlinux <dump-file>
+Before analyzing the dump image, you should reboot into a stable kernel.
+
+You can do limited analysis using GDB on the dump file copied out of
+/proc/vmcore. Use the debug vmlinux built with -g and run the following
+command:
+
+ gdb vmlinux <dump-file>
-Stack trace for the task on processor 0, register display, memory display
-work fine.
+Stack trace for the task on processor 0, register display, and memory
+display work fine.
-Note: gdb cannot analyse core files generated in ELF64 format for i386.
+Note: GDB cannot analyze core files generated in ELF64 format for x86.
+On systems with a maximum of 4GB of memory, you can generate
+ELF32-format headers using the --elf32-core-headers kernel option on the
+dump kernel.
-Latest "crash" (crash-4.0-2.18) as available on Dave Anderson's site
-http://people.redhat.com/~anderson/ works well with kdump format.
+You can also use the Crash utility to analyze dump files in Kdump
+format. Crash is available on Dave Anderson's site at the following URL:
+ http://people.redhat.com/~anderson/
+
+
+To Do
+=====
-TODO
-====
-1) Provide a kernel pages filtering mechanism so that core file size is not
- insane on systems having huge memory banks.
-2) Relocatable kernel can help in maintaining multiple kernels for crashdump
- and same kernel as the first kernel can be used to capture the dump.
+1) Provide a kernel pages filtering mechanism, so core file size is not
+ extreme on systems with huge memory banks.
+2) Relocatable kernel can help in maintaining multiple kernels for
+ crash_dump, and the same kernel as the system kernel can be used to
+ capture the dump.
-CONTACT
+
+Contact
=======
+
Vivek Goyal (vgoyal@in.ibm.com)
Maneesh Soni (maneesh@in.ibm.com)
+
+
+Trademark
+=========
+
+Linux is a trademark of Linus Torvalds in the United States, other
+countries, or both.
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 4710845dbac..cf0d5416a4c 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -262,9 +262,14 @@ What is required is some way of intervening to instruct the compiler and the
CPU to restrict the order.
Memory barriers are such interventions. They impose a perceived partial
-ordering between the memory operations specified on either side of the barrier.
-They request that the sequence of memory events generated appears to other
-parts of the system as if the barrier is effective on that CPU.
+ordering over the memory operations on either side of the barrier.
+
+Such enforcement is important because the CPUs and other devices in a system
+can use a variety of tricks to improve performance - including reordering,
+deferral and combination of memory operations; speculative loads; speculative
+branch prediction and various types of caching. Memory barriers are used to
+override or suppress these tricks, allowing the code to sanely control the
+interaction of multiple CPUs and/or devices.
VARIETIES OF MEMORY BARRIER
@@ -282,7 +287,7 @@ Memory barriers come in four basic varieties:
A write barrier is a partial ordering on stores only; it is not required
to have any effect on loads.
- A CPU can be viewed as as commiting a sequence of store operations to the
+ A CPU can be viewed as committing a sequence of store operations to the
memory system as time progresses. All stores before a write barrier will
occur in the sequence _before_ all the stores after the write barrier.
@@ -413,7 +418,7 @@ There are certain things that the Linux kernel memory barriers do not guarantee:
indirect effect will be the order in which the second CPU sees the effects
of the first CPU's accesses occur, but see the next point:
- (*) There is no guarantee that the a CPU will see the correct order of effects
+ (*) There is no guarantee that a CPU will see the correct order of effects
from a second CPU's accesses, even _if_ the second CPU uses a memory
barrier, unless the first CPU _also_ uses a matching memory barrier (see
the subsection on "SMP Barrier Pairing").
@@ -461,8 +466,8 @@ Whilst this may seem like a failure of coherency or causality maintenance, it
isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
Alpha).
-To deal with this, a data dependency barrier must be inserted between the
-address load and the data load:
+To deal with this, a data dependency barrier or better must be inserted
+between the address load and the data load:
CPU 1 CPU 2
=============== ===============
@@ -484,7 +489,7 @@ lines. The pointer P might be stored in an odd-numbered cache line, and the
variable B might be stored in an even-numbered cache line. Then, if the
even-numbered bank of the reading CPU's cache is extremely busy while the
odd-numbered bank is idle, one can see the new value of the pointer P (&B),
-but the old value of the variable B (1).
+but the old value of the variable B (2).
Another example of where data dependency barriers might by required is where a
@@ -744,7 +749,7 @@ some effectively random order, despite the write barrier issued by CPU 1:
: :
-If, however, a read barrier were to be placed between the load of E and the
+If, however, a read barrier were to be placed between the load of B and the
load of A on CPU 2:
CPU 1 CPU 2
@@ -1461,9 +1466,8 @@ instruction itself is complete.
On a UP system - where this wouldn't be a problem - the smp_mb() is just a
compiler barrier, thus making sure the compiler emits the instructions in the
-right order without actually intervening in the CPU. Since there there's only
-one CPU, that CPU's dependency ordering logic will take care of everything
-else.
+right order without actually intervening in the CPU. Since there's only one
+CPU, that CPU's dependency ordering logic will take care of everything else.
ATOMIC OPERATIONS
@@ -1640,9 +1644,9 @@ functions:
The PCI bus, amongst others, defines an I/O space concept - which on such
CPUs as i386 and x86_64 cpus readily maps to the CPU's concept of I/O
- space. However, it may also mapped as a virtual I/O space in the CPU's
- memory map, particularly on those CPUs that don't support alternate
- I/O spaces.
+ space. However, it may also be mapped as a virtual I/O space in the CPU's
+ memory map, particularly on those CPUs that don't support alternate I/O
+ spaces.
Accesses to this space may be fully synchronous (as on i386), but
intermediary bridges (such as the PCI host bridge) may not fully honour
diff --git a/Documentation/rtc.txt b/Documentation/rtc.txt
index 95d17b3e2ee..2a58f985795 100644
--- a/Documentation/rtc.txt
+++ b/Documentation/rtc.txt
@@ -44,8 +44,10 @@ normal timer interrupt, which is 100Hz.
Programming and/or enabling interrupt frequencies greater than 64Hz is
only allowed by root. This is perhaps a bit conservative, but we don't want
an evil user generating lots of IRQs on a slow 386sx-16, where it might have
-a negative impact on performance. Note that the interrupt handler is only
-a few lines of code to minimize any possibility of this effect.
+a negative impact on performance. This 64Hz limit can be changed by writing
+a different value to /proc/sys/dev/rtc/max-user-freq. Note that the
+interrupt handler is only a few lines of code to minimize any possibility
+of this effect.
Also, if the kernel time is synchronized with an external source, the
kernel will write the time back to the CMOS clock every 11 minutes. In
@@ -81,6 +83,7 @@ that will be using this driver.
*/
#include <stdio.h>
+#include <stdlib.h>
#include <linux/rtc.h>
#include <sys/ioctl.h>
#include <sys/time.h>
diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt
index ad0bedf678b..e0188a23fd5 100644
--- a/Documentation/sysrq.txt
+++ b/Documentation/sysrq.txt
@@ -115,8 +115,9 @@ trojan program is running at console and which could grab your password
when you would try to login. It will kill all programs on given console
and thus letting you make sure that the login prompt you see is actually
the one from init, not some trojan program.
-IMPORTANT:In its true form it is not a true SAK like the one in :IMPORTANT
-IMPORTANT:c2 compliant systems, and it should be mistook as such. :IMPORTANT
+IMPORTANT: In its true form it is not a true SAK like the one in a :IMPORTANT
+IMPORTANT: c2 compliant system, and it should not be mistaken as :IMPORTANT
+IMPORTANT: such. :IMPORTANT
It seems other find it useful as (System Attention Key) which is
useful when you want to exit a program that will not let you switch consoles.
(For example, X or a svgalib program.)