From 97e1c18e8d17bd87e1e383b2e9d9fc740332c8e2 Mon Sep 17 00:00:00 2001 From: Mathieu Desnoyers Date: Fri, 18 Jul 2008 12:16:16 -0400 Subject: tracing: Kernel Tracepoints Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers. Allows complete typing verification by declaring both tracing statement inline functions and probe registration/unregistration static inline functions within the same macro "DEFINE_TRACE". No format string is required. See the tracepoint Documentation and Samples patches for usage examples. Taken from the documentation patch : "A tracepoint placed in code provides a hook to call a function (probe) that you can provide at runtime. A tracepoint can be "on" (a probe is connected to it) or "off" (no probe is attached). When a tracepoint is "off" it has no effect, except for adding a tiny time penalty (checking a condition for a branch) and space penalty (adding a few bytes for the function call at the end of the instrumented function and adds a data structure in a separate section). When a tracepoint is "on", the function you provide is called each time the tracepoint is executed, in the execution context of the caller. When the function provided ends its execution, it returns to the caller (continuing from the tracepoint site). You can put tracepoints at important locations in the code. They are lightweight hooks that can pass an arbitrary number of parameters, which prototypes are described in a tracepoint declaration placed in a header file." Addition and removal of tracepoints is synchronized by RCU using the scheduler (and preempt_disable) as guarantees to find a quiescent state (this is really RCU "classic"). The update side uses rcu_barrier_sched() with call_rcu_sched() and the read/execute side uses "preempt_disable()/preempt_enable()". We make sure the previous array containing probes, which has been scheduled for deletion by the rcu callback, is indeed freed before we proceed to the next update. It therefore limits the rate of modification of a single tracepoint to one update per RCU period. The objective here is to permit fast batch add/removal of probes on _different_ tracepoints. Changelog : - Use #name ":" #proto as string to identify the tracepoint in the tracepoint table. This will make sure not type mismatch happens due to connexion of a probe with the wrong type to a tracepoint declared with the same name in a different header. - Add tracepoint_entry_free_old. - Change __TO_TRACE to get rid of the 'i' iterator. Masami Hiramatsu : Tested on x86-64. Performance impact of a tracepoint : same as markers, except that it adds about 70 bytes of instructions in an unlikely branch of each instrumented function (the for loop, the stack setup and the function call). It currently adds a memory read, a test and a conditional branch at the instrumentation site (in the hot path). Immediate values will eventually change this into a load immediate, test and branch, which removes the memory read which will make the i-cache impact smaller (changing the memory read for a load immediate removes 3-4 bytes per site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it also saves the d-cache hit). About the performance impact of tracepoints (which is comparable to markers), even without immediate values optimizations, tests done by Hideo Aoki on ia64 show no regression. His test case was using hackbench on a kernel where scheduler instrumentation (about 5 events in code scheduler code) was added. Quoting Hideo Aoki about Markers : I evaluated overhead of kernel marker using linux-2.6-sched-fixes git tree, which includes several markers for LTTng, using an ia64 server. While the immediate trace mark feature isn't implemented on ia64, there is no major performance regression. So, I think that we don't have any issues to propose merging marker point patches into Linus's tree from the viewpoint of performance impact. I prepared two kernels to evaluate. The first one was compiled without CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS. I downloaded the original hackbench from the following URL: http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c I ran hackbench 5 times in each condition and calculated the average and difference between the kernels. The parameter of hackbench: every 50 from 50 to 800 The number of CPUs of the server: 2, 4, and 8 Below is the results. As you can see, major performance regression wasn't found in any case. Even if number of processes increases, differences between marker-enabled kernel and marker- disabled kernel doesn't increase. Moreover, if number of CPUs increases, the differences doesn't increase either. Curiously, marker-enabled kernel is better than marker-disabled kernel in more than half cases, although I guess it comes from the difference of memory access pattern. * 2 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 4.811 | 4.872 | +0.061 | +1.27 | 100 | 9.854 | 10.309 | +0.454 | +4.61 | 150 | 15.602 | 15.040 | -0.562 | -3.6 | 200 | 20.489 | 20.380 | -0.109 | -0.53 | 250 | 25.798 | 25.652 | -0.146 | -0.56 | 300 | 31.260 | 30.797 | -0.463 | -1.48 | 350 | 36.121 | 35.770 | -0.351 | -0.97 | 400 | 42.288 | 42.102 | -0.186 | -0.44 | 450 | 47.778 | 47.253 | -0.526 | -1.1 | 500 | 51.953 | 52.278 | +0.325 | +0.63 | 550 | 58.401 | 57.700 | -0.701 | -1.2 | 600 | 63.334 | 63.222 | -0.112 | -0.18 | 650 | 68.816 | 68.511 | -0.306 | -0.44 | 700 | 74.667 | 74.088 | -0.579 | -0.78 | 750 | 78.612 | 79.582 | +0.970 | +1.23 | 800 | 85.431 | 85.263 | -0.168 | -0.2 | -------------------------------------------------------------- * 4 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 2.586 | 2.584 | -0.003 | -0.1 | 100 | 5.254 | 5.283 | +0.030 | +0.56 | 150 | 8.012 | 8.074 | +0.061 | +0.76 | 200 | 11.172 | 11.000 | -0.172 | -1.54 | 250 | 13.917 | 14.036 | +0.119 | +0.86 | 300 | 16.905 | 16.543 | -0.362 | -2.14 | 350 | 19.901 | 20.036 | +0.135 | +0.68 | 400 | 22.908 | 23.094 | +0.186 | +0.81 | 450 | 26.273 | 26.101 | -0.172 | -0.66 | 500 | 29.554 | 29.092 | -0.461 | -1.56 | 550 | 32.377 | 32.274 | -0.103 | -0.32 | 600 | 35.855 | 35.322 | -0.533 | -1.49 | 650 | 39.192 | 38.388 | -0.804 | -2.05 | 700 | 41.744 | 41.719 | -0.025 | -0.06 | 750 | 45.016 | 44.496 | -0.520 | -1.16 | 800 | 48.212 | 47.603 | -0.609 | -1.26 | -------------------------------------------------------------- * 8 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 2.094 | 2.072 | -0.022 | -1.07 | 100 | 4.162 | 4.273 | +0.111 | +2.66 | 150 | 6.485 | 6.540 | +0.055 | +0.84 | 200 | 8.556 | 8.478 | -0.078 | -0.91 | 250 | 10.458 | 10.258 | -0.200 | -1.91 | 300 | 12.425 | 12.750 | +0.325 | +2.62 | 350 | 14.807 | 14.839 | +0.032 | +0.22 | 400 | 16.801 | 16.959 | +0.158 | +0.94 | 450 | 19.478 | 19.009 | -0.470 | -2.41 | 500 | 21.296 | 21.504 | +0.208 | +0.98 | 550 | 23.842 | 23.979 | +0.137 | +0.57 | 600 | 26.309 | 26.111 | -0.198 | -0.75 | 650 | 28.705 | 28.446 | -0.259 | -0.9 | 700 | 31.233 | 31.394 | +0.161 | +0.52 | 750 | 34.064 | 33.720 | -0.344 | -1.01 | 800 | 36.320 | 36.114 | -0.206 | -0.57 | -------------------------------------------------------------- Signed-off-by: Mathieu Desnoyers Acked-by: Masami Hiramatsu Acked-by: 'Peter Zijlstra' Signed-off-by: Ingo Molnar --- init/Kconfig | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index c11da38837e..70082678a91 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -771,6 +771,13 @@ config PROFILING Say Y here to enable the extended profiling support mechanisms used by profilers such as OProfile. +config TRACEPOINTS + bool "Activate tracepoints" + default y + help + Place an empty function call at each tracepoint site. Can be + dynamically changed for a probe function. + config MARKERS bool "Activate markers" help -- cgit v1.2.3 From fa340d9c050e78fb21a142b617304214ae5e0c2d Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Wed, 23 Jul 2008 13:38:00 +0200 Subject: tracing: disable tracepoints by default while it's arguably low overhead, we dont enable new features by default. Signed-off-by: Ingo Molnar --- init/Kconfig | 1 - 1 file changed, 1 deletion(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 70082678a91..d5994490b0b 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -773,7 +773,6 @@ config PROFILING config TRACEPOINTS bool "Activate tracepoints" - default y help Place an empty function call at each tracepoint site. Can be dynamically changed for a probe function. -- cgit v1.2.3 From 5f87f1121895dc09d2d1c1db5f14af6aa4ce3e94 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Wed, 23 Jul 2008 14:15:22 +0200 Subject: tracing: clean up tracepoints kconfig structure do not expose users to CONFIG_TRACEPOINTS - tracers can select it just fine. update ftrace to select CONFIG_TRACEPOINTS. Signed-off-by: Ingo Molnar --- init/Kconfig | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index d5994490b0b..031344f954f 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -771,11 +771,12 @@ config PROFILING Say Y here to enable the extended profiling support mechanisms used by profilers such as OProfile. +# +# Place an empty function call at each tracepoint site. Can be +# dynamically changed for a probe function. +# config TRACEPOINTS - bool "Activate tracepoints" - help - Place an empty function call at each tracepoint site. Can be - dynamically changed for a probe function. + bool config MARKERS bool "Activate markers" -- cgit v1.2.3 From 68bf21aa15c85d2e9b623dcda2b1ed8893275fa1 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Thu, 14 Aug 2008 15:45:08 -0400 Subject: ftrace: mcount call site on boot nops core This is the infrastructure to the converting the mcount call sites recorded by the __mcount_loc section into nops on boot. It also allows for using these sites to enable tracing as normal. When the __mcount_loc section is used, the "ftraced" kernel thread is disabled. This uses the current infrastructure to record the mcount call sites as well as convert them to nops. The mcount function is kept as a stub on boot up and not converted to the ftrace_record_ip function. We use the ftrace_record_ip to only record from the table. This patch does not handle modules. That comes with a later patch. Signed-off-by: Steven Rostedt Signed-off-by: Ingo Molnar --- init/main.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'init') diff --git a/init/main.c b/init/main.c index 3820323c4c8..ded1fae965a 100644 --- a/init/main.c +++ b/init/main.c @@ -60,6 +60,7 @@ #include #include #include +#include #include #include @@ -687,6 +688,8 @@ asmlinkage void __init start_kernel(void) acpi_early_init(); /* before LAPIC and SMP init */ + ftrace_init(); + /* Do the rest non-__init'ed, we're now alive */ rest_init(); } -- cgit v1.2.3 From aa5d9151f745b6ee6a236a1f109118034277eb92 Mon Sep 17 00:00:00 2001 From: Arjan van de Ven Date: Sat, 13 Sep 2008 09:36:06 -0700 Subject: tracing/fastboot: add a script to visualize the kernel boot process / time When optimizing the kernel boot time, it's very valuable to visualize what is going on at which time. In addition, with the fastboot asynchronous initcall level, it's very valuable to see which initcall gets run where and when. This patch adds a script to turn a dmesg into a SVG graph (that can be shown with tools such as InkScape, Gimp or Firefox) and a small change to the initcall code to print the PID of the thread calling the initcall (so that the script can work out the parallelism). Signed-off-by: Arjan van de Ven --- init/main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index ded1fae965a..16abba05c82 100644 --- a/init/main.c +++ b/init/main.c @@ -711,7 +711,8 @@ int do_one_initcall(initcall_t fn) int result; if (initcall_debug) { - printk("calling %pF\n", fn); + printk("calling %pF", fn); + printk(" @ %i\n", task_pid_nr(current)); t0 = ktime_get(); } -- cgit v1.2.3 From 3bf77af6e1fef1124bf71d81f9f84885f0ee0dea Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Weisbecker?= Date: Tue, 23 Sep 2008 11:38:18 +0100 Subject: tracing/ftrace: launch boot tracing after pre-smp initcalls Launch the boot tracing inside the initcall_debug area. Old printk have not been removed to keep the old way of initcall tracing for backward compatibility. [ mingo@elte.hu: resolved conflicts ] Signed-off-by: Frederic Weisbecker Signed-off-by: Ingo Molnar --- init/main.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 16abba05c82..1e39a1eab19 100644 --- a/init/main.c +++ b/init/main.c @@ -709,10 +709,12 @@ int do_one_initcall(initcall_t fn) ktime_t t0, t1, delta; char msgbuf[64]; int result; + struct boot_trace it; if (initcall_debug) { - printk("calling %pF", fn); - printk(" @ %i\n", task_pid_nr(current)); + it.caller = task_pid_nr(current); + it.func = fn; + printk("calling %pF @ %i\n", fn, it.caller); t0 = ktime_get(); } @@ -721,10 +723,11 @@ int do_one_initcall(initcall_t fn) if (initcall_debug) { t1 = ktime_get(); delta = ktime_sub(t1, t0); - - printk("initcall %pF returned %d after %Ld msecs\n", - fn, result, - (unsigned long long) delta.tv64 >> 20); + it.result = result; + it.duration = (unsigned long long) delta.tv64 >> 20; + printk("initcall %pF returned %d after %Ld msecs\n", fn, + result, it.duration); + trace_boot(&it); } msgbuf[0] = 0; @@ -859,6 +862,7 @@ static int __init kernel_init(void * unused) smp_prepare_cpus(setup_max_cpus); do_pre_smp_initcalls(); + start_boot_trace(); smp_init(); sched_init_smp(); -- cgit v1.2.3 From cb5ab74204a6e2579d1119bf1348eb806526b12b Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 2 Oct 2008 12:59:20 +0200 Subject: tracing/fastboot: change the printing of boot tracer according to bootgraph.pl Change the boot tracer printing to make it parsable for the scripts/bootgraph.pl script. We have now to output two lines for each initcall, according to the printk in do_one_initcall() in init/main.c We need now the call's time and the return's time. Signed-off-by: Frederic Weisbecker Signed-off-by: Ingo Molnar --- init/main.c | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 1e39a1eab19..61eb6615939 100644 --- a/init/main.c +++ b/init/main.c @@ -706,34 +706,32 @@ __setup("initcall_debug", initcall_debug_setup); int do_one_initcall(initcall_t fn) { int count = preempt_count(); - ktime_t t0, t1, delta; + ktime_t delta; char msgbuf[64]; - int result; struct boot_trace it; if (initcall_debug) { it.caller = task_pid_nr(current); it.func = fn; printk("calling %pF @ %i\n", fn, it.caller); - t0 = ktime_get(); + it.calltime = ktime_get(); } - result = fn(); + it.result = fn(); if (initcall_debug) { - t1 = ktime_get(); - delta = ktime_sub(t1, t0); - it.result = result; + it.rettime = ktime_get(); + delta = ktime_sub(it.rettime, it.calltime); it.duration = (unsigned long long) delta.tv64 >> 20; printk("initcall %pF returned %d after %Ld msecs\n", fn, - result, it.duration); + it.result, it.duration); trace_boot(&it); } msgbuf[0] = 0; - if (result && result != -ENODEV && initcall_debug) - sprintf(msgbuf, "error code %d ", result); + if (it.result && it.result != -ENODEV && initcall_debug) + sprintf(msgbuf, "error code %d ", it.result); if (preempt_count() != count) { strlcat(msgbuf, "preemption imbalance ", sizeof(msgbuf)); @@ -747,7 +745,7 @@ int do_one_initcall(initcall_t fn) printk("initcall %pF returned with %s\n", fn, msgbuf); } - return result; + return it.result; } -- cgit v1.2.3 From 5601020feb0c3010e9e3e0131e9697ac6a06777b Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 2 Oct 2008 13:26:05 +0200 Subject: tracing/fastboot: get the initcall name before it disappears After some initcall traces, some initcall names may be inconsistent. That's because these functions will disappear from the .init section and also their name from the symbols table. So we have to copy the name of the function in a buffer large enough during the trace appending. It is not costly for the ring_buffer because the number of initcall entries is commonly not really large. Signed-off-by: Frederic Weisbecker Signed-off-by: Ingo Molnar --- init/main.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 61eb6615939..8e96a0ef17f 100644 --- a/init/main.c +++ b/init/main.c @@ -712,7 +712,6 @@ int do_one_initcall(initcall_t fn) if (initcall_debug) { it.caller = task_pid_nr(current); - it.func = fn; printk("calling %pF @ %i\n", fn, it.caller); it.calltime = ktime_get(); } @@ -725,7 +724,7 @@ int do_one_initcall(initcall_t fn) it.duration = (unsigned long long) delta.tv64 >> 20; printk("initcall %pF returned %d after %Ld msecs\n", fn, it.result, it.duration); - trace_boot(&it); + trace_boot(&it, fn); } msgbuf[0] = 0; -- cgit v1.2.3 From 097d036a2f25eecc42435c57e010aaf4a2eed2d9 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 3 Oct 2008 15:39:21 +0200 Subject: tracing/fastboot: only trace non-module initcalls At this time, only built-in initcalls interest us. We can't really produce a relevant graph if we include the modules initcall too. I had good results after this patch (see svg in attachment). Signed-off-by: Frederic Weisbecker Signed-off-by: Ingo Molnar --- init/main.c | 1 + 1 file changed, 1 insertion(+) (limited to 'init') diff --git a/init/main.c b/init/main.c index 8e96a0ef17f..e7939de80f3 100644 --- a/init/main.c +++ b/init/main.c @@ -886,6 +886,7 @@ static int __init kernel_init(void * unused) * we're essentially up and running. Get rid of the * initmem segments and start the user-mode stuff.. */ + stop_boot_trace(); init_post(); return 0; } -- cgit v1.2.3 From ca538f6bbe583406f941f3041d40c41f9a13d1de Mon Sep 17 00:00:00 2001 From: Tim Bird Date: Thu, 9 Oct 2008 15:23:05 -0700 Subject: tracing/fastboot: add better resolution to initcall debug/tracing Change the time resolution for initcall_debug to microseconds, from milliseconds. This is handy to determine which initcalls you want to work on for faster booting. One one of my test machines, over 90% of the initcalls are less than a millisecond and (without this patch) these are all reported as 0 msecs. Working on the 900 us ones is more important than the 4 us ones. With 'quiet' on the kernel command line, this adds no significant overhead to kernel boot time. Signed-off-by: Tim Bird Signed-off-by: Andrew Morton Signed-off-by: Ingo Molnar --- init/main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index e7939de80f3..b2e7ff4a534 100644 --- a/init/main.c +++ b/init/main.c @@ -721,8 +721,8 @@ int do_one_initcall(initcall_t fn) if (initcall_debug) { it.rettime = ktime_get(); delta = ktime_sub(it.rettime, it.calltime); - it.duration = (unsigned long long) delta.tv64 >> 20; - printk("initcall %pF returned %d after %Ld msecs\n", fn, + it.duration = (unsigned long long) delta.tv64 >> 10; + printk("initcall %pF returned %d after %Ld usecs\n", fn, it.result, it.duration); trace_boot(&it, fn); } -- cgit v1.2.3