= Troubleshooting Strategies =

Sooner or later you will need to track down the source of a correctness or performance problem. This page is intended to serve as inspiration and as a guide for that process. Barrelfish has a limited toolkit, which makes troubleshooting challenging; however, you are bound to have fun and learn along the way!

== Hints and Advice ==
 * Adopt a '''binary search strategy''' to (i) localize the issue and (ii) gradually remove unnecessary code.
  * ''Who'': Is the server side or the client side involved? Is the problem perhaps limited to a specific thread or core?
  * ''When'': Does the problem occur repeatedly, sporadically or only during initialization?

 * '''Interpreting a crash message.'''
  * Disassemble the executable ({{{objdump -d}}}) and search for the assembly code corresponding to the instruction pointer.
  * Page fault: look at the address.  Is it on the heap, on the stack, or in executable code?  If the value is small (e.g. 0x30), the likely culprit is an access to a field through a NULL pointer.
  * A warning will be issued if the stack bounds are exceeded. Keep in mind that the default stack size on Barrelfish is significantly smaller (64 KB) than on Linux (8 MB). If you need to override this, use {{{thread_create_varstack}}}.
  * If several errors occur simultaneously, the messages may be garbled. If this happens, try to serialize all involved threads.
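
 The small-address hint above can be made concrete with {{{offsetof}}}. The sketch below uses a hypothetical {{{struct conn_state}}} (not a Barrelfish type) and a hypothetical helper {{{field_at_offset}}}: when the faulting address is 0x30 and the base pointer was NULL, the offending access is to whichever field lives at offset 0x30.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure, used only to illustrate the arithmetic. */
struct conn_state {
    uint64_t id;          /* offset 0x00 */
    uint64_t flags;       /* offset 0x08 */
    char     name[32];    /* offset 0x10 */
    uint64_t bytes_sent;  /* offset 0x30 */
};

/* Hypothetical helper: name of the field that contains byte offset `off`,
 * or NULL if the offset lies outside the structure. */
const char *field_at_offset(size_t off)
{
    if (off < offsetof(struct conn_state, flags))      return "id";
    if (off < offsetof(struct conn_state, name))       return "flags";
    if (off < offsetof(struct conn_state, bytes_sent)) return "name";
    if (off < sizeof(struct conn_state))               return "bytes_sent";
    return NULL;
}
```

 In practice you would compare the faulting address against the layout of the structures used near the faulting instruction, rather than write a helper like this.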

 * '''Liberally add assertions to catch unexpected cases.'''
 For checks that must not be compiled out of release builds, use:
  {{{
if (something_bad_happened)
  USER_PANIC("state = %p", state);
}}}
 If the dispatcher is disabled, use {{{assert_disabled}}} ~-([[http://git.barrelfish.org/?p=barrelfish;a=blob;f=include/barrelfish/debug.h;h=43b9356348a9c7c5579d034c9617498a389d3058;hb=HEAD#l38|link]])-~ instead of the standard {{{assert}}} macro.

 * '''Strategically print program state.'''
  * For debugging, prefer {{{debug_printf}}} (or {{{sys_print}}} if running on the dispatcher stack).
  <<FootNote({{{debug_printf}}} and {{{sys_print}}} write their output directly to the serial port (with {{{kputchar}}}) whereas {{{printf}}} sends Flounder messages via the [[http://git.barrelfish.org/?p=barrelfish;a=blob;f=lib/barrelfish/terminal.c;h=fc18f3c198d070a2bd3ce2f302dfe01c8788a6f9;hb=HEAD|terminal emulator]] to the [[http://git.barrelfish.org/?p=barrelfish;a=tree;f=usr/drivers/serial;hb=HEAD|serial driver]]. This process may fail during the early stages of initialization or with a misbehaving program.)>>
  * {{{<barrelfish/debug.h>}}} ~-([[http://git.barrelfish.org/?p=barrelfish;a=blob;f=include/barrelfish/debug.h;h=43b9356348a9c7c5579d034c9617498a389d3058;hb=HEAD|link]])-~ has functions to dump register state, memory, capabilities, etc.
  * The call stack is valuable for understanding how a code block is reached. In your code, call {{{__builtin_return_address}}}
  <<FootNote(Does not work in all scenarios: the frame pointer must not be disabled, and non-zero levels may not work on ARM.)>>
  then convert the returned pointer to a meaningful location:
  {{{
$ addr2line -a -p -f -e timeserver 8559aa 0x8ba836
0x00000000008559aa: abort at ../lib/newlib/newlib/libc/sys/barrelfish/syscalls.c:45
0x00000000008ba836: tcp_receive at ../lib/lwip/src/core/tcp_in.c:938 (discriminator 2)
}}}
  C++ users: pipe this output through {{{c++filt}}} to demangle function names. For additional investigation {{{objdump}}} and {{{readelf}}} may also be useful.
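
 A minimal sketch of capturing a return address for {{{addr2line}}}, assuming a GCC- or Clang-compatible compiler. The helper names {{{where_was_i_called_from}}} and {{{format_return_address}}} are hypothetical, and {{{snprintf}}} stands in for whatever output routine you use ({{{debug_printf}}} on Barrelfish).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: returns the address its caller will resume at.
 * noinline keeps a real call frame so that level 0 is meaningful. */
__attribute__((noinline))
void *where_was_i_called_from(void)
{
    /* Level 0 is the return address of the current function. */
    return __builtin_return_address(0);
}

/* Hypothetical helper: formats an address as hex text that can be
 * pasted on the addr2line command line. Returns snprintf's result. */
int format_return_address(char *buf, size_t len, void *addr)
{
    return snprintf(buf, len, "0x%lx", (unsigned long)(uintptr_t)addr);
}
```

 Print the formatted address from the code path you are investigating, then pass it to {{{addr2line}}} together with the matching executable.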

 * '''Critically review the code line-by-line.'''
  * Pay attention to anything you don't understand. What effect might this have? (Example: Will this trigger an exchange of messages? Does this depend on certain subsystems being ready?)
  * Do you notice anything missing? For instance: a mutex that was never initialized, a waitset that is never serviced, or a missing {{{thread_join}}} that lets the program exit early.

 * '''Common patterns:'''
  * Often it is helpful to add a custom flag or syscall to trigger logging only in a specific context or after some point in time.  This reduces the volume of messages and lets you focus on the most relevant details.
  * As a replacement for Unix signals, you can trigger an action externally by using a separate domain which sends a Flounder message to the target domain.
  * For networking problems use Wireshark to inspect the packets being exchanged.
  Sample scenario: client sockets blocked on {{{connect()}}} and only the SYN of the TCP three-way handshake was observed. The cause was the limited PCB pool in lwIP: all PCBs had recently been closed and were still in the TIME-WAIT state.
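
 The custom-flag pattern mentioned above might look like the following sketch. The names {{{trace_set}}} and {{{trace_printf}}} are hypothetical, and {{{vfprintf}}} to {{{stderr}}} stands in for the output routine of your choice.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical global switch; flip it from the code path of interest. */
static volatile bool trace_enabled = false;

/* Turns conditional logging on or off. */
void trace_set(bool on)
{
    trace_enabled = on;
}

/* Prints only while tracing is enabled.
 * Returns true if a line was emitted, false if it was suppressed. */
bool trace_printf(const char *fmt, ...)
{
    if (!trace_enabled) {
        return false;
    }
    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    return true;
}
```

 Call {{{trace_set(true)}}} just before the suspicious operation and {{{trace_set(false)}}} right after it, so that only the relevant window produces output.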

 * '''Poor Performance?'''
  * Disable assertions: add '{{{-DNDEBUG}}}' to {{{cOptFlags}}} in {{{hake/Config.hs}}}.
  * Try Doug Lea's malloc.
  * Timing code snippets: {{{<bench/bench.h>}}} ~-([[http://git.barrelfish.org/?p=barrelfish;a=blob;f=include/bench/bench.h;h=8f0dea90781821b5b1d047715f6cdd9af9866bca;hb=HEAD|link]])-~ is based on the underlying hardware cycle counter and subtracts the measurement overhead.
  * Performance counters ([[http://git.barrelfish.org/?p=barrelfish;a=blob;f=usr/tests/perfmontest/perfmon.c;h=9ec7aa1fbe207edbe66614662e7a0aea40618bc1;hb=HEAD|example]]): these operate in caliper mode rather than sampling mode (as with OProfile/Linux perf). Currently only AMD processors are supported.
  * Tracing framework: see the [[http://www.barrelfish.org/TN-008-Tracing.pdf|technical note]] and [[http://git.barrelfish.org/?p=barrelfish;a=tree;f=usr/examples/xmpl-trace;h=3bb76ebf4cab8a9ae65c870e241368e9fb8e8c4b;hb=HEAD|example]]. Use [[http://hg.barrelfish.org/aquarium2/|Aquarium2]] for visualization.
  At first just record default events for an overview, then add custom instrumentation as required.
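
 The idea of subtracting measurement overhead when timing a snippet can be sketched portably as follows. This is not the {{{<bench/bench.h>}}} API: POSIX {{{clock_gettime}}} stands in for the hardware cycle counter, and all function names are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds (stand-in for a cycle counter). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Estimates the cost of taking a timestamp as the smallest gap seen
 * between two back-to-back readings. Returns UINT64_MAX if
 * iterations <= 0. */
uint64_t measure_overhead_ns(int iterations)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < iterations; i++) {
        uint64_t t0 = now_ns();
        uint64_t t1 = now_ns();
        if (t1 - t0 < best) {
            best = t1 - t0;
        }
    }
    return best;
}

/* Raw interval minus the measurement overhead, clamped at zero. */
uint64_t elapsed_minus_overhead(uint64_t start, uint64_t end,
                                uint64_t overhead)
{
    uint64_t raw = end - start;
    return raw > overhead ? raw - overhead : 0;
}
```

 Measure the overhead once at startup, then take a timestamp before and after the snippet and pass both to {{{elapsed_minus_overhead}}}.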

 * Finally, if you get stuck, '''jot down''' your understanding of the problem, try to build a reduced example, then '''ask for help''' (IRC, mailing list).