4 files changed, 490 insertions, 20 deletions
diff --git a/Documentation/Changes b/Documentation/Changes
index 5eaab0441d7..27232be26e1 100644
--- a/Documentation/Changes
+++ b/Documentation/Changes
@@ -237,6 +237,12 @@ udev
 udev is a userspace application for populating /dev dynamically with
 only entries for devices actually present. udev replaces devfs.
 
+FUSE
+----
+
+Needs libfuse 2.4.0 or later.  Absolute minimum is 2.3.0 but mount
+options 'direct_io' and 'kernel_cache' won't work.
+
 Networking
 ==========
 
@@ -390,6 +396,10 @@ udev
 ----
 o <http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html>
 
+FUSE
+----
+o <http://sourceforge.net/projects/fuse>
+
 Networking
 **********
 
diff --git a/Documentation/DocBook/libata.tmpl b/Documentation/DocBook/libata.tmpl
index 375ae760dc1..b2ec780bcda 100644
--- a/Documentation/DocBook/libata.tmpl
+++ b/Documentation/DocBook/libata.tmpl
@@ -415,6 +415,362 @@ and other resources, etc.
      </sect1>
   </chapter>
 
+  <chapter id="libataEH">
+        <title>Error handling</title>
+
+	<para>
+	This chapter describes how errors are handled under libata.
+	Readers are advised to read SCSI EH
+	(Documentation/scsi/scsi_eh.txt) and ATA exceptions doc first.
+	</para>
+
+	<sect1><title>Origins of commands</title>
+	<para>
+	In libata, a command is represented with struct ata_queued_cmd
+	or qc.  qc's are preallocated during port initialization and
+	repetitively used for command executions.  Currently only one
+	qc is allocated per port but yet-to-be-merged NCQ branch
+	allocates one for each tag and maps each qc to NCQ tag 1-to-1.
+	</para>
+	<para>
+	libata commands can originate from two sources - libata itself
+	and SCSI midlayer.  libata internal commands are used for
+	initialization and error handling.  All normal blk requests
+	and commands for SCSI emulation are passed as SCSI commands
+	through queuecommand callback of SCSI host template.
+	</para>
+	</sect1>
+
+	<sect1><title>How commands are issued</title>
+
+	<variablelist>
+
+	<varlistentry><term>Internal commands</term>
+	<listitem>
+	<para>
+	First, qc is allocated and initialized using
+	ata_qc_new_init().  Although ata_qc_new_init() doesn't
+	implement any wait or retry mechanism when qc is not
+	available, internal commands are currently issued only during
+	initialization and error recovery, so no other command is
+	active and allocation is guaranteed to succeed.
+	</para>
+	<para>
+	Once allocated qc's taskfile is initialized for the command to
+	be executed.  qc currently has two mechanisms to notify
+	completion.  One is via qc->complete_fn() callback and the
+	other is completion qc->waiting.  qc->complete_fn() callback
+	is the asynchronous path used by normal SCSI translated
+	commands and qc->waiting is the synchronous (issuer sleeps in
+	process context) path used by internal commands.
+	</para>
+	<para>
+	Once initialization is complete, host_set lock is acquired
+	and the qc is issued.
+	</para>
+	</listitem>
+	</varlistentry>
+
+	<varlistentry><term>SCSI commands</term>
+	<listitem>
+	<para>
+	All libata drivers use ata_scsi_queuecmd() as
+	hostt->queuecommand callback.  scmds can either be simulated
+	or translated.  No qc is involved in processing a simulated
+	scmd.  The result is computed right away and the scmd is
+	completed.
+	</para>
+	<para>
+	For a translated scmd, ata_qc_new_init() is invoked to
+	allocate a qc and the scmd is translated into the qc.  SCSI
+	midlayer's completion notification function pointer is stored
+	into qc->scsidone.
+	</para>
+	<para>
+	qc->complete_fn() callback is used for completion
+	notification.  ATA commands use ata_scsi_qc_complete() while
+	ATAPI commands use atapi_qc_complete().  Both functions end up
+	calling qc->scsidone to notify upper layer when the qc is
+	finished.  After translation is completed, the qc is issued
+	with ata_qc_issue().
+	</para>
+	<para>
+	Note that SCSI midlayer invokes hostt->queuecommand while
+	holding host_set lock, so all above occur while holding
+	host_set lock.
+	</para>
+	</listitem>
+	</varlistentry>
+
+	</variablelist>
+	</sect1>
+
+	<sect1><title>How commands are processed</title>
+	<para>
+	Depending on which protocol and which controller are used,
+	commands are processed differently.  For the purpose of
+	discussion, a controller which uses taskfile interface and all
+	standard callbacks is assumed.
+	</para>
+	<para>
+	Currently 6 ATA command protocols are used.  They can be
+	sorted into the following four categories according to how
+	they are processed.
+	</para>
+
+	<variablelist>
+	   <varlistentry><term>ATA NO DATA or DMA</term>
+	   <listitem>
+	   <para>
+	   ATA_PROT_NODATA and ATA_PROT_DMA fall into this category.
+	   These types of commands don't require any software
+	   intervention once issued.  Device will raise interrupt on
+	   completion.
+	   </para>
+	   </listitem>
+	   </varlistentry>
+
+	   <varlistentry><term>ATA PIO</term>
+	   <listitem>
+	   <para>
+	   ATA_PROT_PIO is in this category.  libata currently
+	   implements PIO with polling.  ATA_NIEN bit is set to turn
+	   off interrupt and pio_task on ata_wq performs polling and
+	   IO.
+	   </para>
+	   </listitem>
+	   </varlistentry>
+
+	   <varlistentry><term>ATAPI NODATA or DMA</term>
+	   <listitem>
+	   <para>
+	   ATA_PROT_ATAPI_NODATA and ATA_PROT_ATAPI_DMA are in this
+	   category.  packet_task is used to poll BSY bit after
+	   issuing PACKET command.  Once BSY is turned off by the
+	   device, packet_task transfers CDB and hands off processing
+	   to interrupt handler.
+	   </para>
+	   </listitem>
+	   </varlistentry>
+
+	   <varlistentry><term>ATAPI PIO</term>
+	   <listitem>
+	   <para>
+	   ATA_PROT_ATAPI is in this category.  ATA_NIEN bit is set
+	   and, as in ATAPI NODATA or DMA, packet_task submits cdb.
+	   However, after submitting cdb, further processing (data
+	   transfer) is handed off to pio_task.
+	   </para>
+	   </listitem>
+	   </varlistentry>
+	</variablelist>
+        </sect1>
+
+	<sect1><title>How commands are completed</title>
+	<para>
+	Once issued, all qc's are either completed with
+	ata_qc_complete() or time out.  For commands which are handled
+	by interrupts, ata_host_intr() invokes ata_qc_complete(), and,
+	for PIO tasks, pio_task invokes ata_qc_complete().  In error
+	cases, packet_task may also complete commands.
+	</para>
+	<para>
+	ata_qc_complete() does the following.
+	</para>
+
+	<orderedlist>
+
+	<listitem>
+	<para>
+	DMA memory is unmapped.
+	</para>
+	</listitem>
+
+	<listitem>
+	<para>
+	ATA_QCFLAG_ACTIVE is clared from qc->flags.
+	</para>
+	</listitem>
+
+	<listitem>
+	<para>
+	qc->complete_fn() callback is invoked.  If the return value of
+	the callback is not zero.  Completion is short circuited and
+	ata_qc_complete() returns.
+	</para>
+	</listitem>
+
+	<listitem>
+	<para>
+	__ata_qc_complete() is called, which does
+	   <orderedlist>
+
+	   <listitem>
+	   <para>
+	   qc->flags is cleared to zero.
+	   </para>
+	   </listitem>
+
+	   <listitem>
+	   <para>
+	   ap->active_tag and qc->tag are poisoned.
+	   </para>
+	   </listitem>
+
+	   <listitem>
+	   <para>
+	   qc->waiting is claread &amp; completed (in that order).
+	   </para>
+	   </listitem>
+
+	   <listitem>
+	   <para>
+	   qc is deallocated by clearing appropriate bit in ap->qactive.
+	   </para>
+	   </listitem>
+
+	   </orderedlist>
+	</para>
+	</listitem>
+
+	</orderedlist>
+
+	<para>
+	So, it basically notifies upper layer and deallocates qc.  One
+	exception is short-circuit path in #3 which is used by
+	atapi_qc_complete().
+	</para>
+	<para>
+	For all non-ATAPI commands, whether it fails or not, almost
+	the same code path is taken and very little error handling
+	takes place.  A qc is completed with success status if it
+	succeeded, with failed status otherwise.
+	</para>
+	<para>
+	However, failed ATAPI commands require more handling as
+	REQUEST SENSE is needed to acquire sense data.  If an ATAPI
+	command fails, ata_qc_complete() is invoked with error status,
+	which in turn invokes atapi_qc_complete() via
+	qc->complete_fn() callback.
+	</para>
+	<para>
+	This makes atapi_qc_complete() set scmd->result to
+	SAM_STAT_CHECK_CONDITION, complete the scmd and return 1.  As
+	the sense data is empty but scmd->result is CHECK CONDITION,
+	SCSI midlayer will invoke EH for the scmd, and returning 1
+	makes ata_qc_complete() to return without deallocating the qc.
+	This leads us to ata_scsi_error() with partially completed qc.
+	</para>
+
+	</sect1>
+
+	<sect1><title>ata_scsi_error()</title>
+	<para>
+	ata_scsi_error() is the current hostt->eh_strategy_handler()
+	for libata.  As discussed above, this will be entered in two
+	cases - timeout and ATAPI error completion.  This function
+	calls low level libata driver's eng_timeout() callback, the
+	standard callback for which is ata_eng_timeout().  It checks
+	if a qc is active and calls ata_qc_timeout() on the qc if so.
+	Actual error handling occurs in ata_qc_timeout().
+	</para>
+	<para>
+	If EH is invoked for timeout, ata_qc_timeout() stops BMDMA and
+	completes the qc.  Note that as we're currently in EH, we
+	cannot call scsi_done.  As described in SCSI EH doc, a
+	recovered scmd should be either retried with
+	scsi_queue_insert() or finished with scsi_finish_command().
+	Here, we override qc->scsidone with scsi_finish_command() and
+	calls ata_qc_complete().
+	</para>
+	<para>
+	If EH is invoked due to a failed ATAPI qc, the qc here is
+	completed but not deallocated.  The purpose of this
+	half-completion is to use the qc as place holder to make EH
+	code reach this place.  This is a bit hackish, but it works.
+	</para>
+	<para>
+	Once control reaches here, the qc is deallocated by invoking
+	__ata_qc_complete() explicitly.  Then, internal qc for REQUEST
+	SENSE is issued.  Once sense data is acquired, scmd is
+	finished by directly invoking scsi_finish_command() on the
+	scmd.  Note that as we already have completed and deallocated
+	the qc which was associated with the scmd, we don't need
+	to/cannot call ata_qc_complete() again.
+	</para>
+
+	</sect1>
+
+	<sect1><title>Problems with the current EH</title>
+
+	<itemizedlist>
+
+	<listitem>
+	<para>
+	Error representation is too crude.  Currently any and all
+	error conditions are represented with ATA STATUS and ERROR
+	registers.  Errors which aren't ATA device errors are treated
+	as ATA device errors by setting ATA_ERR bit.  Better error
+	descriptor which can properly represent ATA and other
+	errors/exceptions is needed.
+	</para>
+	</listitem>
+
+	<listitem>
+	<para>
+	When handling timeouts, no action is taken to make device
+	forget about the timed out command and ready for new commands.
+	</para>
+	</listitem>
+
+	<listitem>
+	<para>
+	EH handling via ata_scsi_error() is not properly protected
+	from usual command processing.  On EH entrance, the device is
+	not in quiescent state.  Timed out commands may succeed or
+	fail any time.  pio_task and atapi_task may still be running.
+	</para>
+	</listitem>
+
+	<listitem>
+	<para>
+	Too weak error recovery.  Devices / controllers causing HSM
+	mismatch errors and other errors quite often require reset to
+	return to known state.  Also, advanced error handling is
+	necessary to support features like NCQ and hotplug.
+	</para>
+	</listitem>
+
+	<listitem>
+	<para>
+	ATA errors are directly handled in the interrupt handler and
+	PIO errors in pio_task.  This is problematic for advanced
+	error handling for the following reasons.
+	</para>
+	<para>
+	First, advanced error handling often requires context and
+	internal qc execution.
+	</para>
+	<para>
+	Second, even a simple failure (say, CRC error) needs
+	information gathering and could trigger complex error handling
+	(say, resetting &amp; reconfiguring).  Having multiple code
+	paths to gather information, enter EH and trigger actions
+	makes life painful.
+	</para>
+	<para>
+	Third, scattered EH code makes implementing low level drivers
+	difficult.  Low level drivers override libata callbacks.  If
+	EH is scattered over several places, each affected callbacks
+	should perform its part of error handling.  This can be error
+	prone and painful.
+	</para>
+	</listitem>
+
+	</itemizedlist>
+	</sect1>
+  </chapter>
+
   <chapter id="libataExt">
      <title>libata Library</title>
 !Edrivers/scsi/libata-core.c
diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
index 7f43b040311..1d96efec5e8 100644
--- a/Documentation/SubmittingPatches
+++ b/Documentation/SubmittingPatches
@@ -301,8 +301,68 @@ now, but you can do this to mark internal company procedures or just
 point out some special detail about the sign-off. 
 
 
+12) The canonical patch format
 
-12) More references for submitting patches
+The canonical patch subject line is:
+
+    Subject: [PATCH 001/123] [<area>:] <explanation>
+
+The canonical patch message body contains the following:
+
+  - A "from" line specifying the patch author.
+
+  - An empty line.
+
+  - The body of the explanation, which will be copied to the
+    permanent changelog to describe this patch.
+
+  - The "Signed-off-by:" lines, described above, which will
+    also go in the changelog.
+
+  - A marker line containing simply "---".
+
+  - Any additional comments not suitable for the changelog.
+
+  - The actual patch (diff output).
+
+The Subject line format makes it very easy to sort the emails
+alphabetically by subject line - pretty much any email reader will
+support that - since because the sequence number is zero-padded,
+the numerical and alphabetic sort is the same.
+
+See further details on how to phrase the "<explanation>" in the
+"Subject:" line in Andrew Morton's "The perfect patch", referenced
+below.
+
+The "from" line must be the very first line in the message body,
+and has the form:
+
+        From: Original Author <author@example.com>
+
+The "from" line specifies who will be credited as the author of the
+patch in the permanent changelog.  If the "from" line is missing,
+then the "From:" line from the email header will be used to determine
+the patch author in the changelog.
+
+The explanation body will be committed to the permanent source
+changelog, so should make sense to a competent reader who has long
+since forgotten the immediate details of the discussion that might
+have led to this patch.
+
+The "---" marker line serves the essential purpose of marking for patch
+handling tools where the changelog message ends.
+
+One good use for the additional comments after the "---" marker is for
+a diffstat, to show what files have changed, and the number of inserted
+and deleted lines per file.  A diffstat is especially useful on bigger
+patches.  Other comments relevant only to the moment or the maintainer,
+not suitable for the permanent changelog, should also go here.
+
+See more details on the proper patch format in the following
+references.
+
+
+13) More references for submitting patches
 
 Andrew Morton, "The perfect patch" (tpp).
   <http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
@@ -310,6 +370,14 @@ Andrew Morton, "The perfect patch" (tpp).
 Jeff Garzik, "Linux kernel patch submission format."
   <http://linux.yyz.us/patch-format.html>
 
+Greg KH, "How to piss off a kernel subsystem maintainer"
+  <http://www.kroah.com/log/2005/03/31/>
+
+Kernel Documentation/CodingStyle
+  <http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
+
+Linus Torvald's mail on the canonical patch format:
+  <http://lkml.org/lkml/2005/4/7/183>
 
 
 -----------------------------------
diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index 0321ded4b9a..b22e7c8d059 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -195,8 +195,8 @@ KEY ACCESS PERMISSIONS
 ======================
 
 Keys have an owner user ID, a group access ID, and a permissions mask. The mask
-has up to eight bits each for user, group and other access. Only five of each
-set of eight bits are defined. These permissions granted are:
+has up to eight bits each for possessor, user, group and other access. Only
+five of each set of eight bits are defined. These permissions granted are:
 
  (*) View
 
@@ -241,16 +241,16 @@ about the status of the key service:
      type, description and permissions. The payload of the key is not available
      this way:
 
-	SERIAL   FLAGS  USAGE EXPY PERM   UID   GID   TYPE      DESCRIPTION: SUMMARY
-	00000001 I-----    39 perm 1f0000     0     0 keyring   _uid_ses.0: 1/4
-	00000002 I-----     2 perm 1f0000     0     0 keyring   _uid.0: empty
-	00000007 I-----     1 perm 1f0000     0     0 keyring   _pid.1: empty
-	0000018d I-----     1 perm 1f0000     0     0 keyring   _pid.412: empty
-	000004d2 I--Q--     1 perm 1f0000    32    -1 keyring   _uid.32: 1/4
-	000004d3 I--Q--     3 perm 1f0000    32    -1 keyring   _uid_ses.32: empty
-	00000892 I--QU-     1 perm 1f0000     0     0 user      metal:copper: 0
-	00000893 I--Q-N     1  35s 1f0000     0     0 user      metal:silver: 0
-	00000894 I--Q--     1  10h 1f0000     0     0 user      metal:gold: 0
+	SERIAL   FLAGS  USAGE EXPY PERM     UID   GID   TYPE      DESCRIPTION: SUMMARY
+	00000001 I-----    39 perm 1f1f0000     0     0 keyring   _uid_ses.0: 1/4
+	00000002 I-----     2 perm 1f1f0000     0     0 keyring   _uid.0: empty
+	00000007 I-----     1 perm 1f1f0000     0     0 keyring   _pid.1: empty
+	0000018d I-----     1 perm 1f1f0000     0     0 keyring   _pid.412: empty
+	000004d2 I--Q--     1 perm 1f1f0000    32    -1 keyring   _uid.32: 1/4
+	000004d3 I--Q--     3 perm 1f1f0000    32    -1 keyring   _uid_ses.32: empty
+	00000892 I--QU-     1 perm 1f000000     0     0 user      metal:copper: 0
+	00000893 I--Q-N     1  35s 1f1f0000     0     0 user      metal:silver: 0
+	00000894 I--Q--     1  10h 001f0000     0     0 user      metal:gold: 0
 
      The flags are:
 
@@ -637,6 +637,34 @@ call, and the key released upon close. How to deal with conflicting keys due to
 two different users opening the same file is left to the filesystem author to
 solve.
 
+Note that there are two different types of pointers to keys that may be
+encountered:
+
+ (*) struct key *
+
+     This simply points to the key structure itself. Key structures will be at
+     least four-byte aligned.
+
+ (*) key_ref_t
+
+     This is equivalent to a struct key *, but the least significant bit is set
+     if the caller "possesses" the key. By "possession" it is meant that the
+     calling processes has a searchable link to the key from one of its
+     keyrings. There are three functions for dealing with these:
+
+	key_ref_t make_key_ref(const struct key *key,
+			       unsigned long possession);
+
+	struct key *key_ref_to_ptr(const key_ref_t key_ref);
+
+	unsigned long is_key_possessed(const key_ref_t key_ref);
+
+     The first function constructs a key reference from a key pointer and
+     possession information (which must be 0 or 1 and not any other value).
+
+     The second function retrieves the key pointer from a reference and the
+     third retrieves the possession flag.
+
 When accessing a key's payload contents, certain precautions must be taken to
 prevent access vs modification races. See the section "Notes on accessing
 payload contents" for more information.
@@ -665,7 +693,11 @@ payload contents" for more information.
 
 	void key_put(struct key *key);
 
-    This can be called from interrupt context. If CONFIG_KEYS is not set then
+    Or:
+
+	void key_ref_put(key_ref_t key_ref);
+
+    These can be called from interrupt context. If CONFIG_KEYS is not set then
     the argument will not be parsed.
 
 
@@ -689,13 +721,17 @@ payload contents" for more information.
 
 (*) If a keyring was found in the search, this can be further searched by:
 
-	struct key *keyring_search(struct key *keyring,
-				   const struct key_type *type,
-				   const char *description)
+	key_ref_t keyring_search(key_ref_t keyring_ref,
+				 const struct key_type *type,
+				 const char *description)
 
     This searches the keyring tree specified for a matching key. Error ENOKEY
-    is returned upon failure. If successful, the returned key will need to be
-    released.
+    is returned upon failure (use IS_ERR/PTR_ERR to determine). If successful,
+    the returned key will need to be released.
+
+    The possession attribute from the keyring reference is used to control
+    access through the permissions mask and is propagated to the returned key
+    reference pointer if successful.
 
 
 (*) To check the validity of a key, this function can be called:
@@ -732,7 +768,7 @@ More complex payload contents must be allocated and a pointer to them set in
 key->payload.data. One of the following ways must be selected to access the
 data:
 
- (1) Unmodifyable key type.
+ (1) Unmodifiable key type.
 
      If the key type does not have a modify method, then the key's payload can
      be accessed without any form of locking, provided that it's known to be