diff options
Diffstat (limited to 'Documentation')
35 files changed, 1939 insertions, 251 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index d05737aaa84..06b982affe7 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -82,6 +82,8 @@ block/ - info on the Block I/O (BIO) layer. blockdev/ - info on block devices & drivers +btmrvl.txt + - info on Marvell Bluetooth driver usage. cachetlb.txt - describes the cache/TLB flushing interfaces Linux uses. cdrom/ diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt index 9f711d2df91..d2b85237c76 100644 --- a/Documentation/RCU/RTFP.txt +++ b/Documentation/RCU/RTFP.txt @@ -743,3 +743,80 @@ Revised: RCU, realtime RCU, sleepable RCU, performance. " } + +@article{PaulEMcKenney2008RCUOSR +,author="Paul E. McKenney and Jonathan Walpole" +,title="Introducing technology into the {Linux} kernel: a case study" +,Year="2008" +,journal="SIGOPS Oper. Syst. Rev." +,volume="42" +,number="5" +,pages="4--17" +,issn="0163-5980" +,doi={http://doi.acm.org/10.1145/1400097.1400099} +,publisher="ACM" +,address="New York, NY, USA" +,annotation={ + Linux changed RCU to a far greater degree than RCU has changed Linux. +} +} + +@unpublished{PaulEMcKenney2008HierarchicalRCU +,Author="Paul E. McKenney" +,Title="Hierarchical {RCU}" +,month="November" +,day="3" +,year="2008" +,note="Available: +\url{http://lwn.net/Articles/305782/} +[Viewed November 6, 2008]" +,annotation=" + RCU with combining-tree-based grace-period detection, + permitting it to handle thousands of CPUs. +" +} + +@conference{PaulEMcKenney2009MaliciousURCU +,Author="Paul E. McKenney" +,Title="Using a Malicious User-Level {RCU} to Torture {RCU}-Based Algorithms" +,Booktitle="linux.conf.au 2009" +,month="January" +,year="2009" +,address="Hobart, Australia" +,note="Available: +\url{http://www.rdrop.com/users/paulmck/RCU/urcutorture.2009.01.22a.pdf} +[Viewed February 2, 2009]" +,annotation=" + Realtime RCU and torture-testing RCU uses. +" +} + +@unpublished{MathieuDesnoyers2009URCU +,Author="Mathieu Desnoyers" +,Title="[{RFC} git tree] Userspace {RCU} (urcu) for {Linux}" +,month="February" +,day="5" +,year="2009" +,note="Available: +\url{http://lkml.org/lkml/2009/2/5/572} +\url{git://lttng.org/userspace-rcu.git} +[Viewed February 20, 2009]" +,annotation=" + Mathieu Desnoyers's user-space RCU implementation. + git://lttng.org/userspace-rcu.git +" +} + +@unpublished{PaulEMcKenney2009BloatWatchRCU +,Author="Paul E. McKenney" +,Title="{RCU}: The {Bloatwatch} Edition" +,month="March" +,day="17" +,year="2009" +,note="Available: +\url{http://lwn.net/Articles/323929/} +[Viewed March 20, 2009]" +,annotation=" + Uniprocessor assumptions allow simplified RCU implementation. +" +} diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt index aab4a9ec393..90ec5341ee9 100644 --- a/Documentation/RCU/UP.txt +++ b/Documentation/RCU/UP.txt @@ -2,14 +2,13 @@ RCU on Uniprocessor Systems A common misconception is that, on UP systems, the call_rcu() primitive -may immediately invoke its function, and that the synchronize_rcu() -primitive may return immediately. The basis of this misconception +may immediately invoke its function. The basis of this misconception is that since there is only one CPU, it should not be necessary to wait for anything else to get done, since there are no other CPUs for anything else to be happening on. Although this approach will -sort- -of- work a surprising amount of the time, it is a very bad idea in general. -This document presents three examples that demonstrate exactly how bad an -idea this is. +This document presents three examples that demonstrate exactly how bad +an idea this is. Example 1: softirq Suicide @@ -82,11 +81,18 @@ Quick Quiz #2: What locking restriction must RCU callbacks respect? Summary -Permitting call_rcu() to immediately invoke its arguments or permitting -synchronize_rcu() to immediately return breaks RCU, even on a UP system. -So do not do it! Even on a UP system, the RCU infrastructure -must- -respect grace periods, and -must- invoke callbacks from a known environment -in which no locks are held. +Permitting call_rcu() to immediately invoke its arguments breaks RCU, +even on a UP system. So do not do it! Even on a UP system, the RCU +infrastructure -must- respect grace periods, and -must- invoke callbacks +from a known environment in which no locks are held. + +It -is- safe for synchronize_sched() and synchronize_rcu_bh() to return +immediately on an UP system. It is also safe for synchronize_rcu() +to return immediately on UP systems, except when running preemptable +RCU. + +Quick Quiz #3: Why can't synchronize_rcu() return immediately on + UP systems running preemptable RCU? Answer to Quick Quiz #1: @@ -117,3 +123,13 @@ Answer to Quick Quiz #2: callbacks acquire locks directly. However, a great many RCU callbacks do acquire locks -indirectly-, for example, via the kfree() primitive. + +Answer to Quick Quiz #3: + Why can't synchronize_rcu() return immediately on UP systems + running preemptable RCU? + + Because some other task might have been preempted in the middle + of an RCU read-side critical section. If synchronize_rcu() + simply immediately returned, it would prematurely signal the + end of the grace period, which would come as a nasty shock to + that other thread when it started running again. diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index accfe2f5247..51525a30e8b 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -11,7 +11,10 @@ over a rather long period of time, but improvements are always welcome! structure is updated more than about 10% of the time, then you should strongly consider some other approach, unless detailed performance measurements show that RCU is nonetheless - the right tool for the job. + the right tool for the job. Yes, you might think of RCU + as simply cutting overhead off of the readers and imposing it + on the writers. That is exactly why normal uses of RCU will + do much more reading than updating. Another exception is where performance is not an issue, and RCU provides a simpler implementation. An example of this situation @@ -240,10 +243,11 @@ over a rather long period of time, but improvements are always welcome! instead need to use synchronize_irq() or synchronize_sched(). 12. Any lock acquired by an RCU callback must be acquired elsewhere - with irq disabled, e.g., via spin_lock_irqsave(). Failing to - disable irq on a given acquisition of that lock will result in - deadlock as soon as the RCU callback happens to interrupt that - acquisition's critical section. + with softirq disabled, e.g., via spin_lock_irqsave(), + spin_lock_bh(), etc. Failing to disable irq on a given + acquisition of that lock will result in deadlock as soon as the + RCU callback happens to interrupt that acquisition's critical + section. 13. RCU callbacks can be and are executed in parallel. In many cases, the callback code simply wrappers around kfree(), so that this @@ -310,3 +314,9 @@ over a rather long period of time, but improvements are always welcome! Because these primitives only wait for pre-existing readers, it is the caller's responsibility to guarantee safety to any subsequent readers. + +16. The various RCU read-side primitives do -not- contain memory + barriers. The CPU (and in some cases, the compiler) is free + to reorder code into and out of RCU read-side critical sections. + It is the responsibility of the RCU update-side primitives to + deal with this. diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt index 7aa2002ade7..2a23523ce47 100644 --- a/Documentation/RCU/rcu.txt +++ b/Documentation/RCU/rcu.txt @@ -36,7 +36,7 @@ o How can the updater tell when a grace period has completed executed in user mode, or executed in the idle loop, we can safely free up that item. - Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the + Preemptible variants of RCU (CONFIG_TREE_PREEMPT_RCU) get the same effect, but require that the readers manipulate CPU-local counters. These counters allow limited types of blocking within RCU read-side critical sections. SRCU also uses @@ -79,10 +79,10 @@ o I hear that RCU is patented? What is with that? o I hear that RCU needs work in order to support realtime kernels? This work is largely completed. Realtime-friendly RCU can be - enabled via the CONFIG_PREEMPT_RCU kernel configuration parameter. - However, work is in progress for enabling priority boosting of - preempted RCU read-side critical sections. This is needed if you - have CPU-bound realtime threads. + enabled via the CONFIG_TREE_PREEMPT_RCU kernel configuration + parameter. However, work is in progress for enabling priority + boosting of preempted RCU read-side critical sections. This is + needed if you have CPU-bound realtime threads. o Where can I find more information on RCU? diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt index 909602d409b..e439a0edee2 100644 --- a/Documentation/RCU/rcubarrier.txt +++ b/Documentation/RCU/rcubarrier.txt @@ -170,6 +170,13 @@ module invokes call_rcu() from timers, you will need to first cancel all the timers, and only then invoke rcu_barrier() to wait for any remaining RCU callbacks to complete. +Of course, if you module uses call_rcu_bh(), you will need to invoke +rcu_barrier_bh() before unloading. Similarly, if your module uses +call_rcu_sched(), you will need to invoke rcu_barrier_sched() before +unloading. If your module uses call_rcu(), call_rcu_bh(), -and- +call_rcu_sched(), then you will need to invoke each of rcu_barrier(), +rcu_barrier_bh(), and rcu_barrier_sched(). + Implementing rcu_barrier() diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt index a342b6e1cc1..9dba3bb90e6 100644 --- a/Documentation/RCU/torture.txt +++ b/Documentation/RCU/torture.txt @@ -76,8 +76,10 @@ torture_type The type of RCU to test: "rcu" for the rcu_read_lock() API, "rcu_sync" for rcu_read_lock() with synchronous reclamation, "rcu_bh" for the rcu_read_lock_bh() API, "rcu_bh_sync" for rcu_read_lock_bh() with synchronous reclamation, "srcu" for - the "srcu_read_lock()" API, and "sched" for the use of - preempt_disable() together with synchronize_sched(). + the "srcu_read_lock()" API, "sched" for the use of + preempt_disable() together with synchronize_sched(), + and "sched_expedited" for the use of preempt_disable() + with synchronize_sched_expedited(). verbose Enable debug printk()s. Default is disabled. @@ -162,6 +164,23 @@ of the "old" and "current" counters for the corresponding CPU. The "idx" value maps the "old" and "current" values to the underlying array, and is useful for debugging. +Similarly, sched_expedited RCU provides the following: + + sched_expedited-torture: rtc: d0000000016c1880 ver: 1090796 tfle: 0 rta: 1090796 rtaf: 0 rtf: 1090787 rtmbe: 0 nt: 27713319 + sched_expedited-torture: Reader Pipe: 12660320201 95875 0 0 0 0 0 0 0 0 0 + sched_expedited-torture: Reader Batch: 12660424885 0 0 0 0 0 0 0 0 0 0 + sched_expedited-torture: Free-Block Circulation: 1090795 1090795 1090794 1090793 1090792 1090791 1090790 1090789 1090788 1090787 0 + state: -1 / 0:0 3:0 4:0 + +As before, the first four lines are similar to those for RCU. +The last line shows the task-migration state. The first number is +-1 if synchronize_sched_expedited() is idle, -2 if in the process of +posting wakeups to the migration kthreads, and N when waiting on CPU N. +Each of the colon-separated fields following the "/" is a CPU:state pair. +Valid states are "0" for idle, "1" for waiting for quiescent state, +"2" for passed through quiescent state, and "3" when a race with a +CPU-hotplug event forces use of the synchronize_sched() primitive. + USAGE diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt index 02cced183b2..187bbf10c92 100644 --- a/Documentation/RCU/trace.txt +++ b/Documentation/RCU/trace.txt @@ -191,8 +191,7 @@ rcu/rcuhier (which displays the struct rcu_node hierarchy). The output of "cat rcu/rcudata" looks as follows: -rcu: -rcu: +rcu_sched: 0 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=10951/1 dn=0 df=1101 of=0 ri=36 ql=0 b=10 1 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=16117/1 dn=0 df=1015 of=0 ri=0 ql=0 b=10 2 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1445/1 dn=0 df=1839 of=0 ri=0 ql=0 b=10 @@ -306,7 +305,7 @@ comma-separated-variable spreadsheet format. The output of "cat rcu/rcugp" looks as follows: -rcu: completed=33062 gpnum=33063 +rcu_sched: completed=33062 gpnum=33063 rcu_bh: completed=464 gpnum=464 Again, this output is for both "rcu" and "rcu_bh". The fields are @@ -413,7 +412,7 @@ o Each element of the form "1/1 0:127 ^0" represents one struct The output of "cat rcu/rcu_pending" looks as follows: -rcu: +rcu_sched: 0 np=255892 qsp=53936 cbr=0 cng=14417 gpc=10033 gps=24320 nf=6445 nn=146741 1 np=261224 qsp=54638 cbr=0 cng=25723 gpc=16310 gps=2849 nf=5912 nn=155792 2 np=237496 qsp=49664 cbr=0 cng=2762 gpc=45478 gps=1762 nf=1201 nn=136629 diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 96170824a71..e41a7fecf0d 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -136,10 +136,10 @@ rcu_read_lock() Used by a reader to inform the reclaimer that the reader is entering an RCU read-side critical section. It is illegal to block while in an RCU read-side critical section, though - kernels built with CONFIG_PREEMPT_RCU can preempt RCU read-side - critical sections. Any RCU-protected data structure accessed - during an RCU read-side critical section is guaranteed to remain - unreclaimed for the full duration of that critical section. + kernels built with CONFIG_TREE_PREEMPT_RCU can preempt RCU + read-side critical sections. Any RCU-protected data structure + accessed during an RCU read-side critical section is guaranteed to + remain unreclaimed for the full duration of that critical section. Reference counts may be used in conjunction with RCU to maintain longer-term references to data structures. @@ -785,6 +785,7 @@ RCU pointer/list traversal: rcu_dereference list_for_each_entry_rcu hlist_for_each_entry_rcu + hlist_nulls_for_each_entry_rcu list_for_each_continue_rcu (to be deprecated in favor of new list_for_each_entry_continue_rcu) @@ -807,19 +808,23 @@ RCU: Critical sections Grace period Barrier rcu_read_lock synchronize_net rcu_barrier rcu_read_unlock synchronize_rcu + synchronize_rcu_expedited call_rcu bh: Critical sections Grace period Barrier rcu_read_lock_bh call_rcu_bh rcu_barrier_bh - rcu_read_unlock_bh + rcu_read_unlock_bh synchronize_rcu_bh + synchronize_rcu_bh_expedited sched: Critical sections Grace period Barrier - [preempt_disable] synchronize_sched rcu_barrier_sched - [and friends] call_rcu_sched + rcu_read_lock_sched synchronize_sched rcu_barrier_sched + rcu_read_unlock_sched call_rcu_sched + [preempt_disable] synchronize_sched_expedited + [and friends] SRCU: Critical sections Grace period Barrier @@ -827,6 +832,9 @@ SRCU: Critical sections Grace period Barrier srcu_read_lock synchronize_srcu N/A srcu_read_unlock +SRCU: Initialization/cleanup + init_srcu_struct + cleanup_srcu_struct See the comment headers in the source code (or the docbook generated from them) for more information. diff --git a/Documentation/btmrvl.txt b/Documentation/btmrvl.txt new file mode 100644 index 00000000000..34916a46c09 --- /dev/null +++ b/Documentation/btmrvl.txt @@ -0,0 +1,119 @@ +======================================================================= + README for btmrvl driver +======================================================================= + + +All commands are used via debugfs interface. + +===================== +Set/get driver configurations: + +Path: /debug/btmrvl/config/ + +gpiogap=[n] +hscfgcmd + These commands are used to configure the host sleep parameters. + bit 8:0 -- Gap + bit 16:8 -- GPIO + + where GPIO is the pin number of GPIO used to wake up the host. + It could be any valid GPIO pin# (e.g. 0-7) or 0xff (SDIO interface + wakeup will be used instead). + + where Gap is the gap in milli seconds between wakeup signal and + wakeup event, or 0xff for special host sleep setting. + + Usage: + # Use SDIO interface to wake up the host and set GAP to 0x80: + echo 0xff80 > /debug/btmrvl/config/gpiogap + echo 1 > /debug/btmrvl/config/hscfgcmd + + # Use GPIO pin #3 to wake up the host and set GAP to 0xff: + echo 0x03ff > /debug/btmrvl/config/gpiogap + echo 1 > /debug/btmrvl/config/hscfgcmd + +psmode=[n] +pscmd + These commands are used to enable/disable auto sleep mode + + where the option is: + 1 -- Enable auto sleep mode + 0 -- Disable auto sleep mode + + Usage: + # Enable auto sleep mode + echo 1 > /debug/btmrvl/config/psmode + echo 1 > /debug/btmrvl/config/pscmd + + # Disable auto sleep mode + echo 0 > /debug/btmrvl/config/psmode + echo 1 > /debug/btmrvl/config/pscmd + + +hsmode=[n] +hscmd + These commands are used to enable host sleep or wake up firmware + + where the option is: + 1 -- Enable host sleep + 0 -- Wake up firmware + + Usage: + # Enable host sleep + echo 1 > /debug/btmrvl/config/hsmode + echo 1 > /debug/btmrvl/config/hscmd + + # Wake up firmware + echo 0 > /debug/btmrvl/config/hsmode + echo 1 > /debug/btmrvl/config/hscmd + + +====================== +Get driver status: + +Path: /debug/btmrvl/status/ + +Usage: + cat /debug/btmrvl/status/<args> + +where the args are: + +curpsmode + This command displays current auto sleep status. + +psstate + This command display the power save state. + +hsstate + This command display the host sleep state. + +txdnldrdy + This command displays the value of Tx download ready flag. + + +===================== + +Use hcitool to issue raw hci command, refer to hcitool manual + + Usage: Hcitool cmd <ogf> <ocf> [Parameters] + + Interface Control Command + hcitool cmd 0x3f 0x5b 0xf5 0x01 0x00 --Enable All interface + hcitool cmd 0x3f 0x5b 0xf5 0x01 0x01 --Enable Wlan interface + hcitool cmd 0x3f 0x5b 0xf5 0x01 0x02 --Enable BT interface + hcitool cmd 0x3f 0x5b 0xf5 0x00 0x00 --Disable All interface + hcitool cmd 0x3f 0x5b 0xf5 0x00 0x01 --Disable Wlan interface + hcitool cmd 0x3f 0x5b 0xf5 0x00 0x02 --Disable BT interface + +======================================================================= + + +SD8688 firmware: + +/lib/firmware/sd8688_helper.bin +/lib/firmware/sd8688.bin + + +The images can be downloaded from: + +git.infradead.org/users/dwmw2/linux-firmware.git/libertas/ diff --git a/Documentation/connector/Makefile b/Documentation/connector/Makefile index 8df1a7285a0..d98e4df98e2 100644 --- a/Documentation/connector/Makefile +++ b/Documentation/connector/Makefile @@ -9,3 +9,8 @@ hostprogs-y := ucon always := $(hostprogs-y) HOSTCFLAGS_ucon.o += -I$(objtree)/usr/include + +all: modules + +modules clean: + $(MAKE) -C ../.. SUBDIRS=$(PWD) $@ diff --git a/Documentation/connector/cn_test.c b/Documentation/connector/cn_test.c index 6a5be5d5c8e..1711adc3337 100644 --- a/Documentation/connector/cn_test.c +++ b/Documentation/connector/cn_test.c @@ -19,6 +19,8 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#define pr_fmt(fmt) "cn_test: " fmt + #include <linux/kernel.h> #include <linux/module.h> #include <linux/moduleparam.h> @@ -27,18 +29,17 @@ #include <linux/connector.h> -static struct cb_id cn_test_id = { 0x123, 0x456 }; +static struct cb_id cn_test_id = { CN_NETLINK_USERS + 3, 0x456 }; static char cn_test_name[] = "cn_test"; static struct sock *nls; static struct timer_list cn_test_timer; -void cn_test_callback(void *data) +static void cn_test_callback(struct cn_msg *msg) { - struct cn_msg *msg = (struct cn_msg *)data; - - printk("%s: %lu: idx=%x, val=%x, seq=%u, ack=%u, len=%d: %s.\n", - __func__, jiffies, msg->id.idx, msg->id.val, - msg->seq, msg->ack, msg->len, (char *)msg->data); + pr_info("%s: %lu: idx=%x, val=%x, seq=%u, ack=%u, len=%d: %s.\n", + __func__, jiffies, msg->id.idx, msg->id.val, + msg->seq, msg->ack, msg->len, + msg->len ? (char *)msg->data : ""); } /* @@ -63,9 +64,7 @@ static int cn_test_want_notify(void) skb = alloc_skb(size, GFP_ATOMIC); if (!skb) { - printk(KERN_ERR "Failed to allocate new skb with size=%u.\n", - size); - + pr_err("failed to allocate new skb with size=%u\n", size); return -ENOMEM; } @@ -114,12 +113,12 @@ static int cn_test_want_notify(void) //netlink_broadcast(nls, skb, 0, ctl->group, GFP_ATOMIC); netlink_unicast(nls, skb, 0, 0); - printk(KERN_INFO "Request was sent. Group=0x%x.\n", ctl->group); + pr_info("request was sent: group=0x%x\n", ctl->group); return 0; nlmsg_failure: - printk(KERN_ERR "Failed to send %u.%u\n", msg->seq, msg->ack); + pr_err("failed to send %u.%u\n", msg->seq, msg->ack); kfree_skb(skb); return -EINVAL; } @@ -131,6 +130,8 @@ static void cn_test_timer_func(unsigned long __data) struct cn_msg *m; char data[32]; + pr_debug("%s: timer fired with data %lu\n", __func__, __data); + m = kzalloc(sizeof(*m) + sizeof(data), GFP_ATOMIC); if (m) { @@ -150,7 +151,7 @@ static void cn_test_timer_func(unsigned long __data) cn_test_timer_counter++; - mod_timer(&cn_test_timer, jiffies + HZ); + mod_timer(&cn_test_timer, jiffies + msecs_to_jiffies(1000)); } static int cn_test_init(void) @@ -168,8 +169,10 @@ static int cn_test_init(void) } setup_timer(&cn_test_timer, cn_test_timer_func, 0); - cn_test_timer.expires = jiffies + HZ; - add_timer(&cn_test_timer); + mod_timer(&cn_test_timer, jiffies + msecs_to_jiffies(1000)); + + pr_info("initialized with id={%u.%u}\n", + cn_test_id.idx, cn_test_id.val); return 0; diff --git a/Documentation/connector/connector.txt b/Documentation/connector/connector.txt index ad6e0ba7b38..81e6bf6ead5 100644 --- a/Documentation/connector/connector.txt +++ b/Documentation/connector/connector.txt @@ -5,10 +5,10 @@ Kernel Connector. Kernel connector - new netlink based userspace <-> kernel space easy to use communication module. -Connector driver adds possibility to connect various agents using -netlink based network. One must register callback and -identifier. When driver receives special netlink message with -appropriate identifier, appropriate callback will be called. +The Connector driver makes it easy to connect various agents using a +netlink based network. One must register a callback and an identifier. +When the driver receives a special netlink message with the appropriate +identifier, the appropriate callback will be called. From the userspace point of view it's quite straightforward: @@ -17,10 +17,10 @@ From the userspace point of view it's quite straightforward: send(); recv(); -But if kernelspace want to use full power of such connections, driver -writer must create special sockets, must know about struct sk_buff -handling... Connector allows any kernelspace agents to use netlink -based networking for inter-process communication in a significantly +But if kernelspace wants to use the full power of such connections, the +driver writer must create special sockets, must know about struct sk_buff +handling, etc... The Connector driver allows any kernelspace agents to use +netlink based networking for inter-process communication in a significantly easier way: int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *)); @@ -32,15 +32,15 @@ struct cb_id __u32 val; }; -idx and val are unique identifiers which must be registered in -connector.h for in-kernel usage. void (*callback) (void *) - is a -callback function which will be called when message with above idx.val -will be received by connector core. Argument for that function must +idx and val are unique identifiers which must be registered in the +connector.h header for in-kernel usage. void (*callback) (void *) is a +callback function which will be called when a message with above idx.val +is received by the connector core. The argument for that function must be dereferenced to struct cn_msg *. struct cn_msg { - struct cb_id id; + struct cb_id id; __u32 seq; __u32 ack; @@ -55,92 +55,95 @@ Connector interfaces. int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *)); -Registers new callback with connector core. + Registers new callback with connector core. -struct cb_id *id - unique connector's user identifier. - It must be registered in connector.h for legal in-kernel users. -char *name - connector's callback symbolic name. -void (*callback) (void *) - connector's callback. + struct cb_id *id - unique connector's user identifier. + It must be registered in connector.h for legal in-kernel users. + char *name - connector's callback symbolic name. + void (*callback) (void *) - connector's callback. Argument must be dereferenced to struct cn_msg *. + void cn_del_callback(struct cb_id *id); -Unregisters new callback with connector core. + Unregisters new callback with connector core. + + struct cb_id *id - unique connector's user identifier. -struct cb_id *id - unique connector's user identifier. int cn_netlink_send(struct cn_msg *msg, u32 __groups, int gfp_mask); -Sends message to the specified groups. It can be safely called from -softirq context, but may silently fail under strong memory pressure. -If there are no listeners for given group -ESRCH can be returned. + Sends message to the specified groups. It can be safely called from + softirq context, but may silently fail under strong memory pressure. + If there are no listeners for given group -ESRCH can be returned. -struct cn_msg * - message header(with attached data). -u32 __group - destination group. + struct cn_msg * - message header(with attached data). + u32 __group - destination group. If __group is zero, then appropriate group will be searched through all registered connector users, and message will be delivered to the group which was created for user with the same ID as in msg. If __group is not zero, then message will be delivered to the specified group. -int gfp_mask - GFP mask. + int gfp_mask - GFP mask. -Note: When registering new callback user, connector core assigns -netlink group to the user which is equal to it's id.idx. + Note: When registering new callback user, connector core assigns + netlink group to the user which is equal to it's id.idx. /*****************************************/ Protocol description. /*****************************************/ -Current offers transport layer with fixed header. Recommended -protocol which uses such header is following: +The current framework offers a transport layer with fixed headers. The +recommended protocol which uses such a header is as following: msg->seq and msg->ack are used to determine message genealogy. When -someone sends message it puts there locally unique sequence and random -acknowledge numbers. Sequence number may be copied into +someone sends a message, they use a locally unique sequence and random +acknowledge number. The sequence number may be copied into nlmsghdr->nlmsg_seq too. -Sequence number is incremented with each message to be sent. +The sequence number is incremented with each message sent. -If we expect reply to our message, then sequence number in received -message MUST be the same as in original message, and acknowledge -number MUST be the same + 1. +If you expect a reply to the message, then the sequence number in the +received message MUST be the same as in the original message, and the +acknowledge number MUST be the same + 1. -If we receive message and it's sequence number is not equal to one we -are expecting, then it is new message. If we receive message and it's -sequence number is the same as one we are expecting, but it's -acknowledge is not equal acknowledge number in original message + 1, -then it is new message. +If we receive a message and its sequence number is not equal to one we +are expecting, then it is a new message. If we receive a message and +its sequence number is the same as one we are expecting, but its +acknowledge is not equal to the acknowledge number in the original +message + 1, then it is a new message. -Obviously, protocol header contains above id. +Obviously, the protocol header contains the above id. -connector allows event notification in the following form: kernel +The connector allows event notification in the following form: kernel driver or userspace process can ask connector to notify it when -selected id's will be turned on or off(registered or unregistered it's -callback). It is done by sending special command to connector -driver(it also registers itself with id={-1, -1}). +selected ids will be turned on or off (registered or unregistered its +callback). It is done by sending a special command to the connector +driver (it also registers itself with id={-1, -1}). -As example of usage Documentation/connector now contains cn_test.c - -testing module which uses connector to request notification and to -send messages. +As example of this usage can be found in the cn_test.c module which +uses the connector to request notification and to send messages. /*****************************************/ Reliability. /*****************************************/ -Netlink itself is not reliable protocol, that means that messages can +Netlink itself is not a reliable protocol. That means that messages can be lost due to memory pressure or process' receiving queue overflowed, -so caller is warned must be prepared. That is why struct cn_msg [main -connector's message header] contains u32 seq and u32 ack fields. +so caller is warned that it must be prepared. That is why the struct +cn_msg [main connector's message header] contains u32 seq and u32 ack +fields. /*****************************************/ Userspace usage. /*****************************************/ + 2.6.14 has a new netlink socket implementation, which by default does not -allow to send data to netlink groups other than 1. -So, if to use netlink socket (for example using connector) -with different group number userspace application must subscribe to -that group. It can be achieved by following pseudocode: +allow people to send data to netlink groups other than 1. +So, if you wish to use a netlink socket (for example using connector) +with a different group number, the userspace application must subscribe to +that group first. It can be achieved by the following pseudocode: s = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR); @@ -160,8 +163,8 @@ if (bind(s, (struct sockaddr *)&l_local, sizeof(struct sockaddr_nl)) == -1) { } Where 270 above is SOL_NETLINK, and 1 is a NETLINK_ADD_MEMBERSHIP socket -option. To drop multicast subscription one should call above socket option -with NETLINK_DROP_MEMBERSHIP parameter which is defined as 0. +option. To drop a multicast subscription, one should call the above socket +option with the NETLINK_DROP_MEMBERSHIP parameter which is defined as 0. 2.6.14 netlink code only allows to select a group which is less or equal to the maximum group number, which is used at netlink_kernel_create() time. diff --git a/Documentation/connector/ucon.c b/Documentation/connector/ucon.c index c5092ad0ce4..4848db8c71f 100644 --- a/Documentation/connector/ucon.c +++ b/Documentation/connector/ucon.c @@ -30,18 +30,24 @@ #include <arpa/inet.h> +#include <stdbool.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <string.h> #include <errno.h> #include <time.h> +#include <getopt.h> #include <linux/connector.h> #define DEBUG #define NETLINK_CONNECTOR 11 +/* Hopefully your userspace connector.h matches this kernel */ +#define CN_TEST_IDX CN_NETLINK_USERS + 3 +#define CN_TEST_VAL 0x456 + #ifdef DEBUG #define ulog(f, a...) fprintf(stdout, f, ##a) #else @@ -83,6 +89,25 @@ static int netlink_send(int s, struct cn_msg *msg) return err; } +static void usage(void) +{ + printf( + "Usage: ucon [options] [output file]\n" + "\n" + "\t-h\tthis help screen\n" + "\t-s\tsend buffers to the test module\n" + "\n" + "The default behavior of ucon is to subscribe to the test module\n" + "and wait for state messages. Any ones received are dumped to the\n" + "specified output file (or stdout). The test module is assumed to\n" + "have an id of {%u.%u}\n" + "\n" + "If you get no output, then verify the cn_test module id matches\n" + "the expected id above.\n" + , CN_TEST_IDX, CN_TEST_VAL + ); +} + int main(int argc, char *argv[]) { int s; @@ -94,17 +119,34 @@ int main(int argc, char *argv[]) FILE *out; time_t tm; struct pollfd pfd; + bool send_msgs = false; - if (argc < 2) - out = stdout; - else { - out = fopen(argv[1], "a+"); + while ((s = getopt(argc, argv, "hs")) != -1) { + switch (s) { + case 's': + send_msgs = true; + break; + + case 'h': + usage(); + return 0; + + default: + /* getopt() outputs an error for us */ + usage(); + return 1; + } + } + + if (argc != optind) { + out = fopen(argv[optind], "a+"); if (!out) { ulog("Unable to open %s for writing: %s\n", argv[1], strerror(errno)); out = stdout; } - } + } else + out = stdout; memset(buf, 0, sizeof(buf)); @@ -115,9 +157,11 @@ int main(int argc, char *argv[]) } l_local.nl_family = AF_NETLINK; - l_local.nl_groups = 0x123; /* bitmask of requested groups */ + l_local.nl_groups = -1; /* bitmask of requested groups */ l_local.nl_pid = 0; + ulog("subscribing to %u.%u\n", CN_TEST_IDX, CN_TEST_VAL); + if (bind(s, (struct sockaddr *)&l_local, sizeof(struct sockaddr_nl)) == -1) { perror("bind"); close(s); @@ -130,15 +174,15 @@ int main(int argc, char *argv[]) setsockopt(s, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, &on, sizeof(on)); } #endif - if (0) { + if (send_msgs) { int i, j; memset(buf, 0, sizeof(buf)); data = (struct cn_msg *)buf; - data->id.idx = 0x123; - data->id.val = 0x456; + data->id.idx = CN_TEST_IDX; + data->id.val = CN_TEST_VAL; data->seq = seq++; data->ack = 0; data->len = 0; diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 09e031c5588..503d21216d5 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -6,6 +6,35 @@ be removed from this file. --------------------------- +What: PRISM54 +When: 2.6.34 + +Why: prism54 FullMAC PCI / Cardbus devices used to be supported only by the + prism54 wireless driver. After Intersil stopped selling these + devices in preference for the newer more flexible SoftMAC devices + a SoftMAC device driver was required and prism54 did not support + them. The p54pci driver now exists and has been present in the kernel for + a while. This driver supports both SoftMAC devices and FullMAC devices. + The main difference between these devices was the amount of memory which + could be used for the firmware. The SoftMAC devices support a smaller + amount of memory. Because of this the SoftMAC firmware fits into FullMAC + devices's memory. p54pci supports not only PCI / Cardbus but also USB + and SPI. Since p54pci supports all devices prism54 supports + you will have a conflict. I'm not quite sure how distributions are + handling this conflict right now. prism54 was kept around due to + claims users may experience issues when using the SoftMAC driver. + Time has passed users have not reported issues. If you use prism54 + and for whatever reason you cannot use p54pci please let us know! + E-mail us at: linux-wireless@vger.kernel.org + + For more information see the p54 wiki page: + + http://wireless.kernel.org/en/users/Drivers/p54 + +Who: Luis R. Rodriguez <lrodriguez@atheros.com> + +--------------------------- + What: IRQF_SAMPLE_RANDOM Check: IRQF_SAMPLE_RANDOM When: July 2009 @@ -206,24 +235,6 @@ Who: Len Brown <len.brown@intel.com> --------------------------- -What: libata spindown skipping and warning -When: Dec 2008 -Why: Some halt(8) implementations synchronize caches for and spin - down libata disks because libata didn't use to spin down disk on - system halt (only synchronized caches). - Spin down on system halt is now implemented. sysfs node - /sys/class/scsi_disk/h:c:i:l/manage_start_stop is present if - spin down support is available. - Because issuing spin down command to an already spun down disk - makes some disks spin up just to spin down again, libata tracks - device spindown status to skip the extra spindown command and - warn about it. - This is to give userspace tools the time to get updated and will - be removed after userspace is reasonably updated. -Who: Tejun Heo <htejun@gmail.com> - ---------------------------- - What: i386/x86_64 bzImage symlinks When: April 2010 @@ -235,31 +246,6 @@ Who: Thomas Gleixner <tglx@linutronix.de> --------------------------- What (Why): - - include/linux/netfilter_ipv4/ipt_TOS.h ipt_tos.h header files - (superseded by xt_TOS/xt_tos target & match) - - - "forwarding" header files like ipt_mac.h in - include/linux/netfilter_ipv4/ and include/linux/netfilter_ipv6/ - - - xt_CONNMARK match revision 0 - (superseded by xt_CONNMARK match revision 1) - - - xt_MARK target revisions 0 and 1 - (superseded by xt_MARK match revision 2) - - - xt_connmark match revision 0 - (superseded by xt_connmark match revision 1) - - - xt_conntrack match revision 0 - (superseded by xt_conntrack match revision 1) - - - xt_iprange match revision 0, - include/linux/netfilter_ipv4/ipt_iprange.h - (superseded by xt_iprange match revision 1) - - - xt_mark match revision 0 - (superseded by xt_mark match revision 1) - - xt_recent: the old ipt_recent proc dir (superseded by /proc/net/xt_recent) @@ -394,15 +380,6 @@ Who: Thomas Gleixner <tglx@linutronix.de> ----------------------------- -What: obsolete generic irq defines and typedefs -When: 2.6.30 -Why: The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) - have been kept around for migration reasons. After more than two years - it's time to remove them finally -Who: Thomas Gleixner <tglx@linutronix.de> - ---------------------------- - What: fakephp and associated sysfs files in /sys/bus/pci/slots/ When: 2011 Why: In 2.6.27, the semantics of /sys/bus/pci/slots was redefined to @@ -468,3 +445,27 @@ Why: cpu_policy_rwsem has a new cleaner definition making it local to cpufreq core and contained inside cpufreq.c. Other dependent drivers should not use it in order to safely avoid lockdep issues. Who: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> + +---------------------------- + +What: sound-slot/service-* module aliases and related clutters in + sound/sound_core.c +When: August 2010 +Why: OSS sound_core grabs all legacy minors (0-255) of SOUND_MAJOR + (14) and requests modules using custom sound-slot/service-* + module aliases. The only benefit of doing this is allowing + use of custom module aliases which might as well be considered + a bug at this point. This preemptive claiming prevents + alternative OSS implementations. + + Till the feature is removed, the kernel will be requesting + both sound-slot/service-* and the standard char-major-* module + aliases and allow turning off the pre-claiming selectively via + CONFIG_SOUND_OSS_CORE_PRECLAIM and soundcore.preclaim_oss + kernel parameter. + + After the transition phase is complete, both the custom module + aliases and switches to disable it will go away. This removal + will also allow making ALSA OSS emulation independent of + sound_core. The dependency will be broken then too. +Who: Tejun Heo <tj@kernel.org> diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt index bf8080640eb..6208f55c44c 100644 --- a/Documentation/filesystems/9p.txt +++ b/Documentation/filesystems/9p.txt @@ -123,6 +123,9 @@ available from the same CVS repository. There are user and developer mailing lists available through the v9fs project on sourceforge (http://sourceforge.net/projects/v9fs). +A stand-alone version of the module (which should build for any 2.6 kernel) +is available via (http://github.com/ericvh/9p-sac/tree/master) + News and other information is maintained on SWiK (http://swik.net/v9fs). Bug reports may be issued through the kernel.org bugzilla diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt index 12ad6c7f4e5..ffef91c4e0d 100644 --- a/Documentation/filesystems/afs.txt +++ b/Documentation/filesystems/afs.txt @@ -23,15 +23,13 @@ it does support include: (*) Security (currently only AFS kaserver and KerberosIV tickets). - (*) File reading. + (*) File reading and writing. (*) Automounting. -It does not yet support the following AFS features: - - (*) Write support. + (*) Local caching (via fscache). - (*) Local caching. +It does not yet support the following AFS features: (*) pioctl() system call. @@ -56,7 +54,7 @@ They permit the debugging messages to be turned on dynamically by manipulating the masks in the following files: /sys/module/af_rxrpc/parameters/debug - /sys/module/afs/parameters/debug + /sys/module/kafs/parameters/debug ===== @@ -66,9 +64,9 @@ USAGE When inserting the driver modules the root cell must be specified along with a list of volume location server IP addresses: - insmod af_rxrpc.o - insmod rxkad.o - insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 + modprobe af_rxrpc + modprobe rxkad + modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 The first module is the AF_RXRPC network protocol driver. This provides the RxRPC remote operation protocol and may also be accessed from userspace. See: @@ -81,7 +79,7 @@ is the actual filesystem driver for the AFS filesystem. Once the module has been loaded, more modules can be added by the following procedure: - echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells Where the parameters to the "add" command are the name of a cell and a list of volume location servers within that cell, with the latter separated by colons. @@ -101,7 +99,7 @@ The name of the volume can be suffixes with ".backup" or ".readonly" to specify connection to only volumes of those types. The name of the cell is optional, and if not given during a mount, then the -named volume will be looked up in the cell specified during insmod. +named volume will be looked up in the cell specified during modprobe. Additional cells can be added through /proc (see later section). @@ -163,14 +161,14 @@ THE CELL DATABASE The filesystem maintains an internal database of all the cells it knows and the IP addresses of the volume location servers for those cells. The cell to which -the system belongs is added to the database when insmod is performed by the +the system belongs is added to the database when modprobe is performed by the "rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on the kernel command line. Further cells can be added by commands similar to the following: echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells - echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells No other cell database operations are available at this time. @@ -233,7 +231,7 @@ insmod /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.91 mount -t afs \%root.afs. /afs mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/ -echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells +echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 > /proc/fs/afs/cells mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/ mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib diff --git a/Documentation/filesystems/nfs.txt b/Documentation/filesystems/nfs.txt new file mode 100644 index 00000000000..f50f26ce6cd --- /dev/null +++ b/Documentation/filesystems/nfs.txt @@ -0,0 +1,98 @@ + +The NFS client +============== + +The NFS version 2 protocol was first documented in RFC1094 (March 1989). +Since then two more major releases of NFS have been published, with NFSv3 +being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April +2003). + +The Linux NFS client currently supports all the above published versions, +and work is in progress on adding support for minor version 1 of the NFSv4 +protocol. + +The purpose of this document is to provide information on some of the +upcall interfaces that are used in order to provide the NFS client with +some of the information that it requires in order to fully comply with +the NFS spec. + +The DNS resolver +================ + +NFSv4 allows for one server to refer the NFS client to data that has been +migrated onto another server by means of the special "fs_locations" +attribute. See + http://tools.ietf.org/html/rfc3530#section-6 +and + http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00 + +The fs_locations information can take the form of either an ip address and +a path, or a DNS hostname and a path. The latter requires the NFS client to +do a DNS lookup in order to mount the new volume, and hence the need for an +upcall to allow userland to provide this service. + +Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual +/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps: + + (1) The process checks the dns_resolve cache to see if it contains a + valid entry. If so, it returns that entry and exits. + + (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent' + (may be changed using the 'nfs.cache_getent' kernel boot parameter) + is run, with two arguments: + - the cache name, "dns_resolve" + - the hostname to resolve + + (3) After looking up the corresponding ip address, the helper script + writes the result into the rpc_pipefs pseudo-file + '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel' + in the following (text) format: + + "<ip address> <hostname> <ttl>\n" + + Where <ip address> is in the usual IPv4 (123.456.78.90) or IPv6 + (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format. + <hostname> is identical to the second argument of the helper + script, and <ttl> is the 'time to live' of this cache entry (in + units of seconds). + + Note: If <ip address> is invalid, say the string "0", then a negative + entry is created, which will cause the kernel to treat the hostname + as having no valid DNS translation. + + + + +A basic sample /sbin/nfs_cache_getent +===================================== + +#!/bin/bash +# +ttl=600 +# +cut=/usr/bin/cut +getent=/usr/bin/getent +rpc_pipefs=/var/lib/nfs/rpc_pipefs +# +die() +{ + echo "Usage: $0 cache_name entry_name" + exit 1 +} + +[ $# -lt 2 ] && die +cachename="$1" +cache_path=${rpc_pipefs}/cache/${cachename}/channel + +case "${cachename}" in + dns_resolve) + name="$2" + result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )" + [ -z "${result}" ] && result="0" + ;; + *) + die + ;; +esac +echo "${result} ${name} ${ttl}" >${cache_path} + diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index fad18f9456e..ffead13f944 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1167,13 +1167,11 @@ CHAPTER 3: PER-PROCESS PARAMETERS 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score ------------------------------------------------------ -This file can be used to adjust the score used to select which processes should -be killed in an out-of-memory situation. The oom_adj value is a characteristic -of the task's mm, so all threads that share an mm with pid will have the same -oom_adj value. A high value will increase the likelihood of this process being -killed by the oom-killer. Valid values are in the range -16 to +15 as -explained below and a special value of -17, which disables oom-killing -altogether for threads sharing pid's mm. +This file can be used to adjust the score used to select which processes +should be killed in an out-of-memory situation. Giving it a high score will +increase the likelihood of this process being killed by the oom-killer. Valid +values are in the range -16 to +15, plus the special value -17, which disables +oom-killing altogether for this process. The process to be killed in an out-of-memory situation is selected among all others based on its badness score. This value equals the original memory size of the process @@ -1187,9 +1185,6 @@ the parent's score if they do not share the same memory. Thus forking servers are the prime candidates to be killed. Having only one 'hungry' child will make parent less preferable than the child. -/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from -oom-killing already. - /proc/<pid>/oom_score shows process' current badness score. The following heuristics are then applied: diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index dbea4f95fc8..1c058b552e9 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -121,6 +121,7 @@ Code Seq# Include File Comments 'c' 00-7F linux/comstats.h conflict! 'c' 00-7F linux/coda.h conflict! 'c' 80-9F arch/s390/include/asm/chsc.h +'c' A0-AF arch/x86/include/asm/msr.h 'd' 00-FF linux/char/drm/drm/h conflict! 'd' F0-FF linux/digi1.h 'e' all linux/digi1.h conflict! diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 7936b801fe6..cb3a169e372 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1503,6 +1503,14 @@ and is between 256 and 4096 characters. It is defined in the file [NFS] set the TCP port on which the NFSv4 callback channel should listen. + nfs.cache_getent= + [NFS] sets the pathname to the program which is used + to update the NFS client cache entries. + + nfs.cache_getent_timeout= + [NFS] sets the timeout after which an attempt to + update a cache entry is deemed to have failed. + nfs.idmap_cache_timeout= [NFS] set the maximum lifetime for idmapper cache entries. @@ -1535,6 +1543,11 @@ and is between 256 and 4096 characters. It is defined in the file symbolic names: lapic and ioapic Example: nmi_watchdog=2 or nmi_watchdog=panic,lapic + netpoll.carrier_timeout= + [NET] Specifies amount of time (in seconds) that + netpoll should wait for a carrier. By default netpoll + waits 4 seconds. + no387 [BUGS=X86-32] Tells the kernel to use the 387 maths emulation library even if a 387 maths coprocessor is present. @@ -2395,6 +2408,18 @@ and is between 256 and 4096 characters. It is defined in the file stifb= [HW] Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]] + sunrpc.min_resvport= + sunrpc.max_resvport= + [NFS,SUNRPC] + SunRPC servers often require that client requests + originate from a privileged port (i.e. a port in the + range 0 < portnr < 1024). + An administrator who wishes to reserve some of these + ports for other uses may adjust the range that the + kernel's sunrpc client considers to be privileged + using these two parameters to set the minimum and + maximum port values. + sunrpc.pool_mode= [NFS] Control how the NFS server code allocates CPUs to @@ -2411,6 +2436,15 @@ and is between 256 and 4096 characters. It is defined in the file pernode one pool for each NUMA node (equivalent to global on non-NUMA machines) + sunrpc.tcp_slot_table_entries= + sunrpc.udp_slot_table_entries= + [NFS,SUNRPC] + Sets the upper limit on the number of simultaneous + RPC calls that can be sent from the client to a + server. Increasing these values may allow you to + improve throughput, but will also increase the + amount of memory reserved for use by the client. + swiotlb= [IA-64] Number of I/O TLB slabs switches= [HW,M68k] @@ -2480,6 +2514,11 @@ and is between 256 and 4096 characters. It is defined in the file trace_buf_size=nn[KMG] [FTRACE] will set tracing buffer size. + trace_event=[event-list] + [FTRACE] Set and start specified trace events in order + to facilitate early boot debugging. + See also Documentation/trace/events.txt + trix= [HW,OSS] MediaTrix AudioTrix Pro Format: <io>,<irq>,<dma>,<dma2>,<sb_io>,<sb_irq>,<sb_dma>,<mpu_io>,<mpu_irq> diff --git a/Documentation/keys.txt b/Documentation/keys.txt index b56aacc1fff..e4dbbdb1bd9 100644 --- a/Documentation/keys.txt +++ b/Documentation/keys.txt @@ -26,7 +26,7 @@ This document has the following sections: - Notes on accessing payload contents - Defining a key type - Request-key callback service - - Key access filesystem + - Garbage collection ============ @@ -113,6 +113,9 @@ Each key has a number of attributes: (*) Dead. The key's type was unregistered, and so the key is now useless. +Keys in the last three states are subject to garbage collection. See the +section on "Garbage collection". + ==================== KEY SERVICE OVERVIEW @@ -754,6 +757,26 @@ The keyctl syscall functions are: successful. + (*) Install the calling process's session keyring on its parent. + + long keyctl(KEYCTL_SESSION_TO_PARENT); + + This functions attempts to install the calling process's session keyring + on to the calling process's parent, replacing the parent's current session + keyring. + + The calling process must have the same ownership as its parent, the + keyring must have the same ownership as the calling process, the calling + process must have LINK permission on the keyring and the active LSM module + mustn't deny permission, otherwise error EPERM will be returned. + + Error ENOMEM will be returned if there was insufficient memory to complete + the operation, otherwise 0 will be returned to indicate success. + + The keyring will be replaced next time the parent process leaves the + kernel and resumes executing userspace. + + =============== KERNEL SERVICES =============== @@ -1231,3 +1254,17 @@ by executing: In this case, the program isn't required to actually attach the key to a ring; the rings are provided for reference. + + +================== +GARBAGE COLLECTION +================== + +Dead keys (for which the type has been removed) will be automatically unlinked +from those keyrings that point to them and deleted as soon as possible by a +background garbage collector. + +Similarly, revoked and expired keys will be garbage collected, but only after a +certain amount of time has passed. This time is set as a number of seconds in: + + /proc/sys/kernel/keys/gc_delay diff --git a/Documentation/kmemleak.txt b/Documentation/kmemleak.txt index 89068030b01..34f6638aa5a 100644 --- a/Documentation/kmemleak.txt +++ b/Documentation/kmemleak.txt @@ -27,6 +27,13 @@ To trigger an intermediate memory scan: # echo scan > /sys/kernel/debug/kmemleak +To clear the list of all current possible memory leaks: + + # echo clear > /sys/kernel/debug/kmemleak + +New leaks will then come up upon reading /sys/kernel/debug/kmemleak +again. + Note that the orphan objects are listed in the order they were allocated and one object at the beginning of the list may cause other subsequent objects to be reported as orphan. @@ -42,6 +49,9 @@ Memory scanning parameters can be modified at run-time by writing to the scan=<secs> - set the automatic memory scanning period in seconds (default 600, 0 to stop the automatic scanning) scan - trigger a memory scan + clear - clear list of current memory leak suspects, done by + marking all current reported unreferenced objects grey + dump=<addr> - dump information about the object found at <addr> Kmemleak can also be disabled at boot-time by passing "kmemleak=off" on the kernel command line. @@ -86,6 +96,27 @@ avoid this, kmemleak can also store the number of values pointing to an address inside the block address range that need to be found so that the block is not considered a leak. One example is __vmalloc(). +Testing specific sections with kmemleak +--------------------------------------- + +Upon initial bootup your /sys/kernel/debug/kmemleak output page may be +quite extensive. This can also be the case if you have very buggy code +when doing development. To work around these situations you can use the +'clear' command to clear all reported unreferenced objects from the +/sys/kernel/debug/kmemleak output. By issuing a 'scan' after a 'clear' +you can find new unreferenced objects; this should help with testing +specific sections of code. + +To test a critical section on demand with a clean kmemleak do: + + # echo clear > /sys/kernel/debug/kmemleak + ... test your kernel or modules ... + # echo scan > /sys/kernel/debug/kmemleak + +Then as usual to get your report with: + + # cat /sys/kernel/debug/kmemleak + Kmemleak API ------------ diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index 1634c6dceca..50189bf07d5 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -60,6 +60,8 @@ framerelay.txt - info on using Frame Relay/Data Link Connection Identifier (DLCI). generic_netlink.txt - info on Generic Netlink +ieee802154.txt + - Linux IEEE 802.15.4 implementation, API and drivers ip-sysctl.txt - /proc/sys/net/ipv4/* variables ip_dynaddr.txt diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt index a0280ad2edc..23c995e6403 100644 --- a/Documentation/networking/ieee802154.txt +++ b/Documentation/networking/ieee802154.txt @@ -22,7 +22,7 @@ int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0); ..... The address family, socket addresses etc. are defined in the -include/net/ieee802154/af_ieee802154.h header or in the special header +include/net/af_ieee802154.h header or in the special header in our userspace package (see either linux-zigbee sourceforge download page or git tree at git://linux-zigbee.git.sourceforge.net/gitroot/linux-zigbee). @@ -33,7 +33,7 @@ MLME - MAC Level Management ============================ Most of IEEE 802.15.4 MLME interfaces are directly mapped on netlink commands. -See the include/net/ieee802154/nl802154.h header. Our userspace tools package +See the include/net/nl802154.h header. Our userspace tools package (see above) provides CLI configuration utility for radio interfaces and simple coordinator for IEEE 802.15.4 networks as an example users of MLME protocol. @@ -54,10 +54,14 @@ Those types of devices require different approach to be hooked into Linux kernel HardMAC ======= -See the header include/net/ieee802154/netdevice.h. You have to implement Linux +See the header include/net/ieee802154_netdev.h. You have to implement Linux net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family -code via plain sk_buffs. The control block of sk_buffs will contain additional -info as described in the struct ieee802154_mac_cb. +code via plain sk_buffs. On skb reception skb->cb must contain additional +info as described in the struct ieee802154_mac_cb. During packet transmission +the skb->cb is used to provide additional data to device's header_ops->create +function. Be aware, that this data can be overriden later (when socket code +submits skb to qdisc), so if you need something from that cb later, you should +store info in the skb->data on your own. To hook the MLME interface you have to populate the ml_priv field of your net_device with a pointer to struct ieee802154_mlme_ops instance. All fields are @@ -69,8 +73,8 @@ We provide an example of simple HardMAC driver at drivers/ieee802154/fakehard.c SoftMAC ======= -We are going to provide intermediate layer impelementing IEEE 802.15.4 MAC +We are going to provide intermediate layer implementing IEEE 802.15.4 MAC in software. This is currently WIP. -See header include/net/ieee802154/mac802154.h and several drivers in -drivers/ieee802154/ +See header include/net/mac802154.h and several drivers in drivers/ieee802154/. + diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 8be76235fe6..fbe427a6580 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -311,9 +311,12 @@ tcp_no_metrics_save - BOOLEAN connections. tcp_orphan_retries - INTEGER - How may times to retry before killing TCP connection, closed - by our side. Default value 7 corresponds to ~50sec-16min - depending on RTO. If you machine is loaded WEB server, + This value influences the timeout of a locally closed TCP connection, + when RTO retransmissions remain unacknowledged. + See tcp_retries2 for more details. + + The default value is 7. + If your machine is a loaded WEB server, you should think about lowering this value, such sockets may consume significant resources. Cf. tcp_max_orphans. @@ -327,16 +330,28 @@ tcp_retrans_collapse - BOOLEAN certain TCP stacks. tcp_retries1 - INTEGER - How many times to retry before deciding that something is wrong - and it is necessary to report this suspicion to network layer. - Minimal RFC value is 3, it is default, which corresponds - to ~3sec-8min depending on RTO. + This value influences the time, after which TCP decides, that + something is wrong due to unacknowledged RTO retransmissions, + and reports this suspicion to the network layer. + See tcp_retries2 for more details. + + RFC 1122 recommends at least 3 retransmissions, which is the + default. tcp_retries2 - INTEGER - How may times to retry before killing alive TCP connection. - RFC1122 says that the limit should be longer than 100 sec. - It is too small number. Default value 15 corresponds to ~13-30min - depending on RTO. + This value influences the timeout of an alive TCP connection, + when RTO retransmissions remain unacknowledged. + Given a value of N, a hypothetical TCP connection following + exponential backoff with an initial RTO of TCP_RTO_MIN would + retransmit N times before killing the connection at the (N+1)th RTO. + + The default value of 15 yields a hypothetical timeout of 924.6 + seconds and is a lower bound for the effective timeout. + TCP will effectively time out at the first RTO which exceeds the + hypothetical timeout. + + RFC 1122 recommends at least 100 seconds for the timeout, + which corresponds to a value of at least 8. tcp_rfc1337 - BOOLEAN If set, the TCP stack behaves conforming to RFC1337. If unset, @@ -1282,6 +1297,16 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max sctp_wmem - vector of 3 INTEGERs: min, default, max See tcp_wmem for a description. +addr_scope_policy - INTEGER + Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00 + + 0 - Disable IPv4 address scoping + 1 - Enable IPv4 address scoping + 2 - Follow draft but allow IPv4 private addresses + 3 - Follow draft but allow IPv4 link local addresses + + Default: 1 + /proc/sys/net/core/* dev_weight - INTEGER diff --git a/Documentation/s390/s390dbf.txt b/Documentation/s390/s390dbf.txt index 2d10053dd97..ae66f9b90a2 100644 --- a/Documentation/s390/s390dbf.txt +++ b/Documentation/s390/s390dbf.txt @@ -495,6 +495,13 @@ and for each vararg a long value. So e.g. for a debug entry with a format string plus two varargs one would need to allocate a (3 * sizeof(long)) byte data area in the debug_register() function. +IMPORTANT: Using "%s" in sprintf event functions is dangerous. You can only +use "%s" in the sprintf event functions, if the memory for the passed string is +available as long as the debug feature exists. The reason behind this is that +due to performance considerations only a pointer to the string is stored in +the debug feature. If you log a string that is freed afterwards, you will get +an OOPS when inspecting the debug feature, because then the debug feature will +access the already freed memory. NOTE: If using the sprintf view do NOT use other event/exception functions than the sprintf-event and -exception functions. diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt index 4252697a95d..1c8eb4518ce 100644 --- a/Documentation/sound/alsa/ALSA-Configuration.txt +++ b/Documentation/sound/alsa/ALSA-Configuration.txt @@ -60,6 +60,12 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. slots - Reserve the slot index for the given driver. This option takes multiple strings. See "Module Autoloading Support" section for details. + debug - Specifies the debug message level + (0 = disable debug prints, 1 = normal debug messages, + 2 = verbose debug messages) + This option appears only when CONFIG_SND_DEBUG=y. + This option can be dynamically changed via sysfs + /sys/modules/snd/parameters/debug file. Module snd-pcm-oss ------------------ @@ -513,6 +519,26 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. or input, but you may use this module for any application which requires a sound card (like RealPlayer). + pcm_devs - Number of PCM devices assigned to each card + (default = 1, up to 4) + pcm_substreams - Number of PCM substreams assigned to each PCM + (default = 8, up to 16) + hrtimer - Use hrtimer (=1, default) or system timer (=0) + fake_buffer - Fake buffer allocations (default = 1) + + When multiple PCM devices are created, snd-dummy gives different + behavior to each PCM device: + 0 = interleaved with mmap support + 1 = non-interleaved with mmap support + 2 = interleaved without mmap + 3 = non-interleaved without mmap + + As default, snd-dummy drivers doesn't allocate the real buffers + but either ignores read/write or mmap a single dummy page to all + buffer pages, in order to save the resouces. If your apps need + the read/ written buffer data to be consistent, pass fake_buffer=0 + option. + The power-management is supported. Module snd-echo3g @@ -768,6 +794,10 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. bdl_pos_adj - Specifies the DMA IRQ timing delay in samples. Passing -1 will make the driver to choose the appropriate value based on the controller chip. + patch - Specifies the early "patch" files to modify the HD-audio + setup before initializing the codecs. This option is + available only when CONFIG_SND_HDA_PATCH_LOADER=y is set. + See HD-Audio.txt for details. [Single (global) options] single_cmd - Use single immediate commands to communicate with diff --git a/Documentation/sound/alsa/HD-Audio-Models.txt b/Documentation/sound/alsa/HD-Audio-Models.txt index 939a3dd5814..97eebd63bed 100644 --- a/Documentation/sound/alsa/HD-Audio-Models.txt +++ b/Documentation/sound/alsa/HD-Audio-Models.txt @@ -114,8 +114,8 @@ ALC662/663/272 samsung-nc10 Samsung NC10 mini notebook auto auto-config reading BIOS (default) -ALC882/885 -========== +ALC882/883/885/888/889 +====================== 3stack-dig 3-jack with SPDIF I/O 6stack-dig 6-jack digital with SPDIF I/O arima Arima W820Di1 @@ -127,12 +127,8 @@ ALC882/885 mbp3 Macbook Pro rev3 imac24 iMac 24'' with jack detection w2jc ASUS W2JC - auto auto-config reading BIOS (default) - -ALC883/888 -========== - 3stack-dig 3-jack with SPDIF I/O - 6stack-dig 6-jack digital with SPDIF I/O + 3stack-2ch-dig 3-jack with SPDIF I/O (ALC883) + alc883-6stack-dig 6-jack digital with SPDIF I/O (ALC883) 3stack-6ch 3-jack 6-channel 3stack-6ch-dig 3-jack 6-channel with SPDIF I/O 6stack-dig-demo 6-jack digital for Intel demo board @@ -140,6 +136,7 @@ ALC883/888 acer-aspire Acer Aspire 9810 acer-aspire-4930g Acer Aspire 4930G acer-aspire-6530g Acer Aspire 6530G + acer-aspire-7730g Acer Aspire 7730G acer-aspire-8930g Acer Aspire 8930G medion Medion Laptops medion-md2 Medion MD2 @@ -155,10 +152,13 @@ ALC883/888 3stack-hp HP machines with 3stack (Lucknow, Samba boards) 6stack-dell Dell machines with 6stack (Inspiron 530) mitac Mitac 8252D + clevo-m540r Clevo M540R (6ch + digital) clevo-m720 Clevo M720 laptop series fujitsu-pi2515 Fujitsu AMILO Pi2515 fujitsu-xa3530 Fujitsu AMILO XA3530 3stack-6ch-intel Intel DG33* boards + intel-alc889a Intel IbexPeak with ALC889A + intel-x58 Intel DX58 with ALC889 asus-p5q ASUS P5Q-EM boards mb31 MacBook 3,1 sony-vaio-tt Sony VAIO TT @@ -229,7 +229,7 @@ AD1984 ====== basic default configuration thinkpad Lenovo Thinkpad T61/X61 - dell Dell T3400 + dell_desktop Dell T3400 AD1986A ======= @@ -258,6 +258,7 @@ Conexant 5045 laptop-micsense Laptop with Mic sense (old model fujitsu) laptop-hpmicsense Laptop with HP and Mic senses benq Benq R55E + laptop-hp530 HP 530 laptop test for testing/debugging purpose, almost all controls can be adjusted. Appearing only when compiled with $CONFIG_SND_DEBUG=y @@ -278,9 +279,16 @@ Conexant 5051 hp-dv6736 HP dv6736 lenovo-x200 Lenovo X200 laptop +Conexant 5066 +============= + laptop Basic Laptop config (default) + dell-laptop Dell laptops + olpc-xo-1_5 OLPC XO 1.5 + STAC9200 ======== ref Reference board + oqo OQO Model 2 dell-d21 Dell (unknown) dell-d22 Dell (unknown) dell-d23 Dell (unknown) @@ -368,10 +376,12 @@ STAC92HD73* =========== ref Reference board no-jd BIOS setup but without jack-detection + intel Intel DG45* mobos dell-m6-amic Dell desktops/laptops with analog mics dell-m6-dmic Dell desktops/laptops with digital mics dell-m6 Dell desktops/laptops with both type of mics dell-eq Dell desktops/laptops + alienware Alienware M17x auto BIOS setup (default) STAC92HD83* @@ -385,3 +395,8 @@ STAC9872 ======== vaio VAIO laptop without SPDIF auto BIOS setup (default) + +Cirrus Logic CS4206/4207 +======================== + mbp55 MacBook Pro 5,5 + auto BIOS setup (default) diff --git a/Documentation/sound/alsa/HD-Audio.txt b/Documentation/sound/alsa/HD-Audio.txt index 71ac995b191..7b8a5f947d1 100644 --- a/Documentation/sound/alsa/HD-Audio.txt +++ b/Documentation/sound/alsa/HD-Audio.txt @@ -139,6 +139,10 @@ The driver checks PCI SSID and looks through the static configuration table until any matching entry is found. If you have a new machine, you may see a message like below: ------------------------------------------------------------------------ + hda_codec: ALC880: BIOS auto-probing. +------------------------------------------------------------------------ +Meanwhile, in the earlier versions, you would see a message like: +------------------------------------------------------------------------ hda_codec: Unknown model for ALC880, trying auto-probe from BIOS... ------------------------------------------------------------------------ Even if you see such a message, DON'T PANIC. Take a deep breath and @@ -403,6 +407,66 @@ re-configure based on that state, run like below: ------------------------------------------------------------------------ +Early Patching +~~~~~~~~~~~~~~ +When CONFIG_SND_HDA_PATCH_LOADER=y is set, you can pass a "patch" as a +firmware file for modifying the HD-audio setup before initializing the +codec. This can work basically like the reconfiguration via sysfs in +the above, but it does it before the first codec configuration. + +A patch file is a plain text file which looks like below: + +------------------------------------------------------------------------ + [codec] + 0x12345678 0xabcd1234 2 + + [model] + auto + + [pincfg] + 0x12 0x411111f0 + + [verb] + 0x20 0x500 0x03 + 0x20 0x400 0xff + + [hint] + hp_detect = yes +------------------------------------------------------------------------ + +The file needs to have a line `[codec]`. The next line should contain +three numbers indicating the codec vendor-id (0x12345678 in the +example), the codec subsystem-id (0xabcd1234) and the address (2) of +the codec. The rest patch entries are applied to this specified codec +until another codec entry is given. + +The `[model]` line allows to change the model name of the each codec. +In the example above, it will be changed to model=auto. +Note that this overrides the module option. + +After the `[pincfg]` line, the contents are parsed as the initial +default pin-configurations just like `user_pin_configs` sysfs above. +The values can be shown in user_pin_configs sysfs file, too. + +Similarly, the lines after `[verb]` are parsed as `init_verbs` +sysfs entries, and the lines after `[hint]` are parsed as `hints` +sysfs entries, respectively. + +The hd-audio driver reads the file via request_firmware(). Thus, +a patch file has to be located on the appropriate firmware path, +typically, /lib/firmware. For example, when you pass the option +`patch=hda-init.fw`, the file /lib/firmware/hda-init-fw must be +present. + +The patch module option is specific to each card instance, and you +need to give one file name for each instance, separated by commas. +For example, if you have two cards, one for an on-board analog and one +for an HDMI video board, you may pass patch option like below: +------------------------------------------------------------------------ + options snd-hda-intel patch=on-board-patch,hdmi-patch +------------------------------------------------------------------------ + + Power-Saving ~~~~~~~~~~~~ The power-saving is a kind of auto-suspend of the device. When the diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 322a00bb99d..2dbff53369d 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -19,6 +19,7 @@ Currently, these files might (depending on your configuration) show up in /proc/sys/kernel: - acpi_video_flags - acct +- callhome [ S390 only ] - auto_msgmni - core_pattern - core_uses_pid @@ -91,6 +92,21 @@ valid for 30 seconds. ============================================================== +callhome: + +Controls the kernel's callhome behavior in case of a kernel panic. + +The s390 hardware allows an operating system to send a notification +to a service organization (callhome) in case of an operating system panic. + +When the value in this file is 0 (which is the default behavior) +nothing happens in case of a kernel panic. If this value is set to "1" +the complete kernel oops message is send to the IBM customer service +organization in case the mainframe the Linux operating system is running +on has a service contract with IBM. + +============================================================== + core_pattern: core_pattern is used to specify a core dumpfile pattern name. diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt index f157d7594ea..2bcc8d4dea2 100644 --- a/Documentation/trace/events.txt +++ b/Documentation/trace/events.txt @@ -83,6 +83,15 @@ When reading one of these enable files, there are four results: X - there is a mixture of events enabled and disabled ? - this file does not affect any event +2.3 Boot option +--------------- + +In order to facilitate early boot debugging, use boot option: + + trace_event=[event-list] + +The format of this boot option is the same as described in section 2.1. + 3. Defining an event-enabled tracepoint ======================================= diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index a39b3c749de..355d0f1f8c5 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -85,26 +85,19 @@ of ftrace. Here is a list of some of the key files: This file holds the output of the trace in a human readable format (described below). - latency_trace: - - This file shows the same trace but the information - is organized more to display possible latencies - in the system (described below). - trace_pipe: The output is the same as the "trace" file but this file is meant to be streamed with live tracing. - Reads from this file will block until new data - is retrieved. Unlike the "trace" and "latency_trace" - files, this file is a consumer. This means reading - from this file causes sequential reads to display - more current data. Once data is read from this - file, it is consumed, and will not be read - again with a sequential read. The "trace" and - "latency_trace" files are static, and if the - tracer is not adding more data, they will display - the same information every time they are read. + Reads from this file will block until new data is + retrieved. Unlike the "trace" file, this file is a + consumer. This means reading from this file causes + sequential reads to display more current data. Once + data is read from this file, it is consumed, and + will not be read again with a sequential read. The + "trace" file is static, and if the tracer is not + adding more data,they will display the same + information every time they are read. trace_options: @@ -117,10 +110,10 @@ of ftrace. Here is a list of some of the key files: Some of the tracers record the max latency. For example, the time interrupts are disabled. This time is saved in this file. The max trace - will also be stored, and displayed by either - "trace" or "latency_trace". A new max trace will - only be recorded if the latency is greater than - the value in this file. (in microseconds) + will also be stored, and displayed by "trace". + A new max trace will only be recorded if the + latency is greater than the value in this + file. (in microseconds) buffer_size_kb: @@ -210,7 +203,7 @@ Here is the list of current tracers that may be configured. the trace with the longest max latency. See tracing_max_latency. When a new max is recorded, it replaces the old trace. It is best to view this - trace via the latency_trace file. + trace with the latency-format option enabled. "preemptoff" @@ -307,8 +300,8 @@ the lowest priority thread (pid 0). Latency trace format -------------------- -For traces that display latency times, the latency_trace file -gives somewhat more information to see why a latency happened. +When the latency-format option is enabled, the trace file gives +somewhat more information to see why a latency happened. Here is a typical trace. # tracer: irqsoff @@ -380,9 +373,10 @@ explains which is which. The above is mostly meaningful for kernel developers. - time: This differs from the trace file output. The trace file output - includes an absolute timestamp. The timestamp used by the - latency_trace file is relative to the start of the trace. + time: When the latency-format option is enabled, the trace file + output includes a timestamp relative to the start of the + trace. This differs from the output when latency-format + is disabled, which includes an absolute timestamp. delay: This is just to help catch your eye a bit better. And needs to be fixed to be only relative to the same CPU. @@ -440,7 +434,8 @@ Here are the available options: sym-addr: bash-4000 [01] 1477.606694: simple_strtoul <c0339346> - verbose - This deals with the latency_trace file. + verbose - This deals with the trace file when the + latency-format option is enabled. bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ (+0.000ms): simple_strtoul (strict_strtoul) @@ -472,7 +467,7 @@ Here are the available options: the app is no longer running The lookup is performed when you read - trace,trace_pipe,latency_trace. Example: + trace,trace_pipe. Example: a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] @@ -481,6 +476,11 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] every scheduling event. Will add overhead if there's a lot of tasks running at once. + latency-format - This option changes the trace. When + it is enabled, the trace displays + additional information about the + latencies, as described in "Latency + trace format". sched_switch ------------ @@ -596,12 +596,13 @@ To reset the maximum, echo 0 into tracing_max_latency. Here is an example: # echo irqsoff > current_tracer + # echo latency-format > trace_options # echo 0 > tracing_max_latency # echo 1 > tracing_enabled # ls -ltr [...] # echo 0 > tracing_enabled - # cat latency_trace + # cat trace # tracer: irqsoff # irqsoff latency trace v1.1.5 on 2.6.26 @@ -703,12 +704,13 @@ which preemption was disabled. The control of preemptoff tracer is much like the irqsoff tracer. # echo preemptoff > current_tracer + # echo latency-format > trace_options # echo 0 > tracing_max_latency # echo 1 > tracing_enabled # ls -ltr [...] # echo 0 > tracing_enabled - # cat latency_trace + # cat trace # tracer: preemptoff # preemptoff latency trace v1.1.5 on 2.6.26-rc8 @@ -850,12 +852,13 @@ Again, using this trace is much like the irqsoff and preemptoff tracers. # echo preemptirqsoff > current_tracer + # echo latency-format > trace_options # echo 0 > tracing_max_latency # echo 1 > tracing_enabled # ls -ltr [...] # echo 0 > tracing_enabled - # cat latency_trace + # cat trace # tracer: preemptirqsoff # preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 @@ -1012,11 +1015,12 @@ Instead of performing an 'ls', we will run 'sleep 1' under 'chrt' which changes the priority of the task. # echo wakeup > current_tracer + # echo latency-format > trace_options # echo 0 > tracing_max_latency # echo 1 > tracing_enabled # chrt -f 5 sleep 1 # echo 0 > tracing_enabled - # cat latency_trace + # cat trace # tracer: wakeup # wakeup latency trace v1.1.5 on 2.6.26-rc8 diff --git a/Documentation/trace/function-graph-fold.vim b/Documentation/trace/function-graph-fold.vim new file mode 100644 index 00000000000..0544b504c8b --- /dev/null +++ b/Documentation/trace/function-graph-fold.vim @@ -0,0 +1,42 @@ +" Enable folding for ftrace function_graph traces. +" +" To use, :source this file while viewing a function_graph trace, or use vim's +" -S option to load from the command-line together with a trace. You can then +" use the usual vim fold commands, such as "za", to open and close nested +" functions. While closed, a fold will show the total time taken for a call, +" as would normally appear on the line with the closing brace. Folded +" functions will not include finish_task_switch(), so folding should remain +" relatively sane even through a context switch. +" +" Note that this will almost certainly only work well with a +" single-CPU trace (e.g. trace-cmd report --cpu 1). + +function! FunctionGraphFoldExpr(lnum) + let line = getline(a:lnum) + if line[-1:] == '{' + if line =~ 'finish_task_switch() {$' + return '>1' + endif + return 'a1' + elseif line[-1:] == '}' + return 's1' + else + return '=' + endif +endfunction + +function! FunctionGraphFoldText() + let s = split(getline(v:foldstart), '|', 1) + if getline(v:foldend+1) =~ 'finish_task_switch() {$' + let s[2] = ' task switch ' + else + let e = split(getline(v:foldend), '|', 1) + let s[2] = e[2] + endif + return join(s, '|') +endfunction + +setlocal foldexpr=FunctionGraphFoldExpr(v:lnum) +setlocal foldtext=FunctionGraphFoldText() +setlocal foldcolumn=12 +setlocal foldmethod=expr diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt new file mode 100644 index 00000000000..5b1d23d604c --- /dev/null +++ b/Documentation/trace/ring-buffer-design.txt @@ -0,0 +1,955 @@ + Lockless Ring Buffer Design + =========================== + +Copyright 2009 Red Hat Inc. + Author: Steven Rostedt <srostedt@redhat.com> + License: The GNU Free Documentation License, Version 1.2 + (dual licensed under the GPL v2) +Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, + and Frederic Weisbecker. + + +Written for: 2.6.31 + +Terminology used in this Document +--------------------------------- + +tail - where new writes happen in the ring buffer. + +head - where new reads happen in the ring buffer. + +producer - the task that writes into the ring buffer (same as writer) + +writer - same as producer + +consumer - the task that reads from the buffer (same as reader) + +reader - same as consumer. + +reader_page - A page outside the ring buffer used solely (for the most part) + by the reader. + +head_page - a pointer to the page that the reader will use next + +tail_page - a pointer to the page that will be written to next + +commit_page - a pointer to the page with the last finished non nested write. + +cmpxchg - hardware assisted atomic transaction that performs the following: + + A = B iff previous A == C + + R = cmpxchg(A, C, B) is saying that we replace A with B if and only if + current A is equal to C, and we put the old (current) A into R + + R gets the previous A regardless if A is updated with B or not. + + To see if the update was successful a compare of R == C may be used. + +The Generic Ring Buffer +----------------------- + +The ring buffer can be used in either an overwrite mode or in +producer/consumer mode. + +Producer/consumer mode is where the producer were to fill up the +buffer before the consumer could free up anything, the producer +will stop writing to the buffer. This will lose most recent events. + +Overwrite mode is where the produce were to fill up the buffer +before the consumer could free up anything, the producer will +overwrite the older data. This will lose the oldest events. + +No two writers can write at the same time (on the same per cpu buffer), +but a writer may interrupt another writer, but it must finish writing +before the previous writer may continue. This is very important to the +algorithm. The writers act like a "stack". The way interrupts works +enforces this behavior. + + + writer1 start + <preempted> writer2 start + <preempted> writer3 start + writer3 finishes + writer2 finishes + writer1 finishes + +This is very much like a writer being preempted by an interrupt and +the interrupt doing a write as well. + +Readers can happen at any time. But no two readers may run at the +same time, nor can a reader preempt/interrupt another reader. A reader +can not preempt/interrupt a writer, but it may read/consume from the +buffer at the same time as a writer is writing, but the reader must be +on another processor to do so. A reader may read on its own processor +and can be preempted by a writer. + +A writer can preempt a reader, but a reader can not preempt a writer. +But a reader can read the buffer at the same time (on another processor) +as a writer. + +The ring buffer is made up of a list of pages held together by a link list. + +At initialization a reader page is allocated for the reader that is not +part of the ring buffer. + +The head_page, tail_page and commit_page are all initialized to point +to the same page. + +The reader page is initialized to have its next pointer pointing to +the head page, and its previous pointer pointing to a page before +the head page. + +The reader has its own page to use. At start up time, this page is +allocated but is not attached to the list. When the reader wants +to read from the buffer, if its page is empty (like it is on start up) +it will swap its page with the head_page. The old reader page will +become part of the ring buffer and the head_page will be removed. +The page after the inserted page (old reader_page) will become the +new head page. + +Once the new page is given to the reader, the reader could do what +it wants with it, as long as a writer has left that page. + +A sample of how the reader page is swapped: Note this does not +show the head page in the buffer, it is for demonstrating a swap +only. + + +------+ + |reader| RING BUFFER + |page | + +------+ + +---+ +---+ +---+ + | |-->| |-->| | + | |<--| |<--| | + +---+ +---+ +---+ + ^ | ^ | + | +-------------+ | + +-----------------+ + + + +------+ + |reader| RING BUFFER + |page |-------------------+ + +------+ v + | +---+ +---+ +---+ + | | |-->| |-->| | + | | |<--| |<--| |<-+ + | +---+ +---+ +---+ | + | ^ | ^ | | + | | +-------------+ | | + | +-----------------+ | + +------------------------------------+ + + +------+ + |reader| RING BUFFER + |page |-------------------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | | | |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + + +------+ + |buffer| RING BUFFER + |page |-------------------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | | | |-->| | + | | New | | | |<--| |<-+ + | | Reader +---+ +---+ +---+ | + | | page ----^ | | + | | | | + | +-----------------------------+ | + +------------------------------------+ + + + +It is possible that the page swapped is the commit page and the tail page, +if what is in the ring buffer is less than what is held in a buffer page. + + + reader page commit page tail page + | | | + v | | + +---+ | | + | |<----------+ | + | |<------------------------+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +This case is still valid for this algorithm. +When the writer leaves the page, it simply goes into the ring buffer +since the reader page still points to the next location in the ring +buffer. + + +The main pointers: + + reader page - The page used solely by the reader and is not part + of the ring buffer (may be swapped in) + + head page - the next page in the ring buffer that will be swapped + with the reader page. + + tail page - the page where the next write will take place. + + commit page - the page that last finished a write. + +The commit page only is updated by the outer most writer in the +writer stack. A writer that preempts another writer will not move the +commit page. + +When data is written into the ring buffer, a position is reserved +in the ring buffer and passed back to the writer. When the writer +is finished writing data into that position, it commits the write. + +Another write (or a read) may take place at anytime during this +transaction. If another write happens it must finish before continuing +with the previous write. + + + Write reserve: + + Buffer page + +---------+ + |written | + +---------+ <--- given back to writer (current commit) + |reserved | + +---------+ <--- tail pointer + | empty | + +---------+ + + Write commit: + + Buffer page + +---------+ + |written | + +---------+ + |written | + +---------+ <--- next positon for write (current commit) + | empty | + +---------+ + + + If a write happens after the first reserve: + + Buffer page + +---------+ + |written | + +---------+ <-- current commit + |reserved | + +---------+ <--- given back to second writer + |reserved | + +---------+ <--- tail pointer + + After second writer commits: + + + Buffer page + +---------+ + |written | + +---------+ <--(last full commit) + |reserved | + +---------+ + |pending | + |commit | + +---------+ <--- tail pointer + + When the first writer commits: + + Buffer page + +---------+ + |written | + +---------+ + |written | + +---------+ + |written | + +---------+ <--(last full commit and tail pointer) + + +The commit pointer points to the last write location that was +committed without preempting another write. When a write that +preempted another write is committed, it only becomes a pending commit +and will not be a full commit till all writes have been committed. + +The commit page points to the page that has the last full commit. +The tail page points to the page with the last write (before +committing). + +The tail page is always equal to or after the commit page. It may +be several pages ahead. If the tail page catches up to the commit +page then no more writes may take place (regardless of the mode +of the ring buffer: overwrite and produce/consumer). + +The order of pages are: + + head page + commit page + tail page + +Possible scenario: + tail page + head page commit page | + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +There is a special case that the head page is after either the commit page +and possibly the tail page. That is when the commit (and tail) page has been +swapped with the reader page. This is because the head page is always +part of the ring buffer, but the reader page is not. When ever there +has been less than a full page that has been committed inside the ring buffer, +and a reader swaps out a page, it will be swapping out the commit page. + + + reader page commit page tail page + | | | + v | | + +---+ | | + | |<----------+ | + | |<------------------------+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + +In this case, the head page will not move when the tail and commit +move back into the ring buffer. + +The reader can not swap a page into the ring buffer if the commit page +is still on that page. If the read meets the last commit (real commit +not pending or reserved), then there is nothing more to read. +The buffer is considered empty until another full commit finishes. + +When the tail meets the head page, if the buffer is in overwrite mode, +the head page will be pushed ahead one. If the buffer is in producer/consumer +mode, the write will fail. + +Overwrite mode: + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + +Note, the reader page will still point to the previous head page. +But when a swap takes place, it will use the most recent head page. + + +Making the Ring Buffer Lockless: +-------------------------------- + +The main idea behind the lockless algorithm is to combine the moving +of the head_page pointer with the swapping of pages with the reader. +State flags are placed inside the pointer to the page. To do this, +each page must be aligned in memory by 4 bytes. This will allow the 2 +least significant bits of the address to be used as flags. Since +they will always be zero for the address. To get the address, +simply mask out the flags. + + MASK = ~3 + + address & MASK + +Two flags will be kept by these two bits: + + HEADER - the page being pointed to is a head page + + UPDATE - the page being pointed to is being updated by a writer + and was or is about to be a head page. + + + reader page + | + v + +---+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The above pointer "-H->" would have the HEADER flag set. That is +the next page is the next page to be swapped out by the reader. +This pointer means the next page is the head page. + +When the tail page meets the head pointer, it will use cmpxchg to +change the pointer to the UPDATE state: + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +"-U->" represents a pointer in the UPDATE state. + +Any access to the reader will need to take some sort of lock to serialize +the readers. But the writers will never take a lock to write to the +ring buffer. This means we only need to worry about a single reader, +and writes only preempt in "stack" formation. + +When the reader tries to swap the page with the ring buffer, it +will also use cmpxchg. If the flag bit in the pointer to the +head page does not have the HEADER flag set, the compare will fail +and the reader will need to look for the new head page and try again. +Note, the flag UPDATE and HEADER are never set at the same time. + +The reader swaps the reader page as follows: + + +------+ + |reader| RING BUFFER + |page | + +------+ + +---+ +---+ +---+ + | |--->| |--->| | + | |<---| |<---| | + +---+ +---+ +---+ + ^ | ^ | + | +---------------+ | + +-----H-------------+ + +The reader sets the reader page next pointer as HEADER to the page after +the head page. + + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ v + | +---+ +---+ +---+ + | | |--->| |--->| | + | | |<---| |<---| |<-+ + | +---+ +---+ +---+ | + | ^ | ^ | | + | | +---------------+ | | + | +-----H-------------+ | + +--------------------------------------+ + +It does a cmpxchg with the pointer to the previous head page to make it +point to the reader page. Note that the new pointer does not have the HEADER +flag set. This action atomically moves the head page forward. + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | |<--| |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + +After the new head page is set, the previous pointer of the head page is +updated to the reader page. + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | | | |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + + +------+ + |buffer| RING BUFFER + |page |-------H-----------+ <--- New head page + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | | | |-->| | + | | New | | | |<--| |<-+ + | | Reader +---+ +---+ +---+ | + | | page ----^ | | + | | | | + | +-----------------------------+ | + +------------------------------------+ + +Another important point. The page that the reader page points back to +by its previous pointer (the one that now points to the new head page) +never points back to the reader page. That is because the reader page is +not part of the ring buffer. Traversing the ring buffer via the next pointers +will always stay in the ring buffer. Traversing the ring buffer via the +prev pointers may not. + +Note, the way to determine a reader page is simply by examining the previous +pointer of the page. If the next pointer of the previous page does not +point back to the original page, then the original page is a reader page: + + + +--------+ + | reader | next +----+ + | page |-------->| |<====== (buffer page) + +--------+ +----+ + | | ^ + | v | next + prev | +----+ + +------------->| | + +----+ + +The way the head page moves forward: + +When the tail page meets the head page and the buffer is in overwrite mode +and more writes take place, the head page must be moved forward before the +writer may move the tail page. The way this is done is that the writer +performs a cmpxchg to convert the pointer to the head page from the HEADER +flag to have the UPDATE flag set. Once this is done, the reader will +not be able to swap the head page from the buffer, nor will it be able to +move the head page, until the writer is finished with the move. + +This eliminates any races that the reader can have on the writer. The reader +must spin, and this is why the reader can not preempt the writer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The following page will be made into the new head page. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the new head page has been set, we can set the old head page +pointer back to NORMAL. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the head page has been moved, the tail page may now move forward. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The above are the trivial updates. Now for the more complex scenarios. + + +As stated before, if enough writes preempt the first write, the +tail page may make it all the way around the buffer and meet the commit +page. At this time, we must start dropping writes (usually with some kind +of warning to the user). But what happens if the commit was still on the +reader page? The commit page is not part of the ring buffer. The tail page +must account for this. + + + reader page commit page + | | + v | + +---+ | + | |<----------+ + | | + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + tail page + +If the tail page were to simply push the head page forward, the commit when +leaving the reader page would not be pointing to the correct page. + +The solution to this is to test if the commit page is on the reader page +before pushing the head page. If it is, then it can be assumed that the +tail page wrapped the buffer, and we must drop new writes. + +This is not a race condition, because the commit page can only be moved +by the outter most writer (the writer that was preempted). +This means that the commit will not move while a writer is moving the +tail page. The reader can not swap the reader page if it is also being +used as the commit page. The reader can simply check that the commit +is off the reader page. Once the commit page leaves the reader page +it will never go back on it unless a reader does another swap with the +buffer page that is also the commit page. + + +Nested writes +------------- + +In the pushing forward of the tail page we must first push forward +the head page if the head page is the next page. If the head page +is not the next page, the tail page is simply updated with a cmpxchg. + +Only writers move the tail page. This must be done atomically to protect +against nested writers. + + temp_page = tail_page + next_page = temp_page->next + cmpxchg(tail_page, temp_page, next_page) + +The above will update the tail page if it is still pointing to the expected +page. If this fails, a nested write pushed it forward, the the current write +does not need to push it. + + + temp page + | + v + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Nested write comes in and moves the tail page forward: + + tail page (moved by nested writer) + temp page | + | | + v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The above would fail the cmpxchg, but since the tail page has already +been moved forward, the writer will just try again to reserve storage +on the new tail page. + +But the moving of the head page is a bit more complex. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The write converts the head page pointer to UPDATE. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But if a nested writer preempts here. It will see that the next +page is a head page, but it is also nested. It will detect that +it is nested and will save that information. The detection is the +fact that it sees the UPDATE flag instead of a HEADER or NORMAL +pointer. + +The nested writer will set the new head page pointer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But it will not reset the update back to normal. Only the writer +that converted a pointer from HEAD to UPDATE will convert it back +to NORMAL. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the nested writer finishes, the outer most writer will convert +the UPDATE pointer to NORMAL. + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +It can be even more complex if several nested writes came in and moved +the tail page ahead several pages: + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The write converts the head page pointer to UPDATE. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Next writer comes in, and sees the update and sets up the new +head page. + +(second writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The nested writer moves the tail page forward. But does not set the old +update page to NORMAL because it is not the outer most writer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Another writer preempts and sees the page after the tail page is a head page. +It changes it from HEAD to UPDATE. + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-U->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The writer will move the head page forward: + + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-U->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But now that the third writer did change the HEAD flag to UPDATE it +will convert it to normal: + + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +Then it will move the tail page, and return back to the second writer. + + +(second writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The second writer will fail to move the tail page because it was already +moved, so it will try again and add its data to the new tail page. +It will return to the first writer. + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The first writer can not know atomically test if the tail page moved +while it updates the HEAD page. It will then update the head page to +what it thinks is the new head page. + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Since the cmpxchg returns the old value of the pointer the first writer +will see it succeeded in updating the pointer from NORMAL to HEAD. +But as we can see, this is not good enough. It must also check to see +if the tail page is either where it use to be or on the next page: + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +If tail page != A and tail page does not equal B, then it must reset the +pointer back to NORMAL. The fact that it only needs to worry about +nested writers, it only needs to check this after setting the HEAD page. + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Now the writer can update the head page. This is also why the head page must +remain in UPDATE and only reset by the outer most writer. This prevents +the reader from seeing the incorrect head page. + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + |