Inevitably, at some point you will need to track down the source of a correctness or performance problem. This page is intended to serve as inspiration and as a guide for that process. Barrelfish has a limited toolkit, which makes troubleshooting challenging; however, you are bound to have fun and learn along the way!
Hints and Advice
Adopt a binary search strategy to (i) localize the issue and (ii) gradually remove unnecessary code.
Who: Is the server or the client side involved? Perhaps only a specific thread or core?
When: Does the problem occur repeatedly, sporadically or only during initialization?
Interpreting a crash message.
Disassemble the executable (objdump -d) and search for the assembly code corresponding to the instruction pointer.
- Page fault: look at the faulting address. Is it on the heap, on the stack, or in executable code? If the value is small (e.g. 0x30), the likely culprit is an access to a field of a NULL pointer.
A warning will be issued if the stack bounds are exceeded. Keep in mind that the default stack size on Barrelfish is significantly smaller (64 KB) than on Linux (8 MB). If you need to override this use thread_create_varstack.
- If several errors occur simultaneously the messages may be garbled. If this happens try to serialize all involved threads.
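The small-address hint for page faults can be checked against a structure layout. The following is a minimal sketch in standard C; struct conn_state, its fields, and field_at_offset are hypothetical names invented for illustration, and the offsets assume that uint64_t is 8 bytes wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical structure, invented for illustration only. */
struct conn_state {
    uint64_t id;          /* offset 0x00 */
    char     name[32];    /* offset 0x08 */
    uint64_t bytes_sent;  /* offset 0x28 */
    void    *peer;        /* offset 0x30 */
};

static_assert(offsetof(struct conn_state, id) == 0, "id is the first field");

/* Map a small fault address to the field that would be touched if the
 * struct pointer itself were NULL. Returns NULL if the offset lies
 * outside the structure. */
const char *field_at_offset(size_t off)
{
    if (off >= sizeof(struct conn_state)) return NULL;
    if (off >= offsetof(struct conn_state, peer)) return "peer";
    if (off >= offsetof(struct conn_state, bytes_sent)) return "bytes_sent";
    if (off >= offsetof(struct conn_state, name)) return "name";
    return "id";
}

/* Print the layout so it can be compared with the crash message. */
void print_layout(void)
{
    printf("name       at 0x%zx\n", offsetof(struct conn_state, name));
    printf("bytes_sent at 0x%zx\n", offsetof(struct conn_state, bytes_sent));
    printf("peer       at 0x%zx\n", offsetof(struct conn_state, peer));
}
```

With this layout, field_at_offset(0x30) names the peer field, which would suggest that a NULL struct conn_state pointer was dereferenced to reach peer.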
Liberally add assertions to catch unexpected cases. For checks which should not be optimized away in release builds use:
if (something_bad_happened) USER_PANIC("state = %p", state);
If the dispatcher is disabled, use assert_disabled instead of the standard assert macro.
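To illustrate why an explicit check survives a release build while assert does not, here is a minimal sketch in portable C. It is not the Barrelfish implementation: CHECK_ALWAYS, check_handler, check_abort, check_count and checked_div are hypothetical names, and on Barrelfish USER_PANIC would take the place of the handler.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* All names below are hypothetical; they only mimic the pattern. */
typedef void (*check_handler_t)(const char *file, int line, const char *expr);

/* Default handler: report the failed condition and stop the program. */
void check_abort(const char *file, int line, const char *expr)
{
    fprintf(stderr, "%s:%d: check failed: %s\n", file, line, expr);
    abort();
}

/* Alternative handler that only counts failures (handy in tests). */
int check_failures = 0;
void check_count(const char *file, int line, const char *expr)
{
    (void)file; (void)line; (void)expr;
    check_failures++;
}

check_handler_t check_handler = check_abort;

/* Unlike assert(), this macro does not look at NDEBUG, so the check
 * stays active when assertions are compiled out. */
#define CHECK_ALWAYS(cond) \
    do { if (!(cond)) check_handler(__FILE__, __LINE__, #cond); } while (0)

int checked_div(int a, int b)
{
    assert(check_handler != NULL);  /* debug-only: vanishes with -DNDEBUG */
    CHECK_ALWAYS(b != 0);           /* always evaluated */
    return b != 0 ? a / b : 0;
}
```

The function pointer is a design choice for the sketch: it lets a test swap in check_count to observe a failure without terminating the process.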
Strategically print program state.
For debugging purposes it is recommended to use debug_printf, or sys_print if running on the dispatcher stack (1).
<barrelfish/debug.h> has functions to dump register state, memory, capabilities, etc.
The call stack is valuable for understanding how a code block is reached. In your code, call __builtin_return_address (2), then convert the returned pointer to a meaningful location:
$ addr2line -a -p -f -e timeserver 8559aa 0x8ba836
0x00000000008559aa: abort at ../lib/newlib/newlib/libc/sys/barrelfish/syscalls.c:45
0x00000000008ba836: tcp_receive at ../lib/lwip/src/core/tcp_in.c:938 (discriminator 2)
C++ users: pipe this output through c++filt to demangle function names. For additional investigation objdump and readelf may also be useful.
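The address fed to addr2line can be captured as in the following minimal sketch, which assumes GCC or Clang. log_caller is a hypothetical helper, and printf stands in for debug_printf so that the example also builds outside Barrelfish. Only level 0 is used, since higher levels are not reliable everywhere.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* noinline keeps a real call site, so the return address points into
 * the function that called log_caller. */
__attribute__((noinline))
uintptr_t log_caller(const char *tag)
{
    assert(tag != NULL);

    /* Level 0: the address this function will return to. */
    uintptr_t ra = (uintptr_t)__builtin_return_address(0);

    /* Print in hex so the value can be pasted into addr2line. */
    printf("%s: called from 0x%" PRIxPTR "\n", tag, ra);
    return ra;
}
```

Calling log_caller("tcp_in") from the code path under investigation prints one address per call, which can then be passed to addr2line together with the executable.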
Critically review the code line-by-line.
- Pay attention to anything you don't understand. What effect might this have? (Example: will this trigger an exchange of messages? does this depend on certain subsystems being ready?)
Do you notice anything missing? For instance: a mutex that was not initialized, a waitset that is never serviced, or a missing thread_join that lets the program exit early.
- Often it is helpful to add a custom flag or syscall that triggers logging only in a specific context or after some point in time. This reduces the volume of messages and lets you focus on the most relevant details.
- As a replacement for Unix signals, you can trigger an action externally by using a separate domain which sends a Flounder message to the target domain.
- For networking problems use Wireshark to inspect the packets being exchanged.
Sample scenario: client sockets blocked on connect(), and only the SYN of the TCP three-way handshake was observed. The cause was the limited PCB pool in lwIP: all PCBs had recently been closed and were still in the TIME-WAIT state.
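The custom logging flag mentioned in the hints above can be as simple as a global switch that is flipped once the interesting context is reached. This is a minimal sketch in portable C; trace_enabled, trace_lines, trace and process_requests are hypothetical names, and on Barrelfish the output call would more likely be debug_printf.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical names, invented for illustration. */
bool trace_enabled = false;  /* flipped when the interesting context starts */
int  trace_lines   = 0;      /* number of messages actually emitted */

/* printf-style logging that stays silent until the flag is set. */
void trace(const char *fmt, ...)
{
    assert(fmt != NULL);
    if (!trace_enabled) {
        return;
    }
    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    trace_lines++;
}

/* Example workload: logging starts at iteration trace_from. */
int process_requests(int n, int trace_from)
{
    int handled = 0;
    for (int i = 0; i < n; i++) {
        if (i == trace_from) {
            trace_enabled = true;
        }
        trace("request %d handled\n", i);
        handled++;
    }
    trace_enabled = false;
    return handled;
}
```

With process_requests(10, 7), only the last three iterations produce output, while the first seven run silently.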
Disable assertions: add '-DNDEBUG' to cOptFlags in hake/Config.hs.
- Try Doug Lea's malloc.
Timing code snippets: <bench/bench.h> is based on the underlying hardware cycle counter and subtracts the measurement overhead.
Performance counters (example): operates in caliper-mode rather than sampling (as with OProfile/Linux perf). Currently only supports AMD processors.
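The idea of subtracting measurement overhead when timing a snippet can be sketched outside Barrelfish as follows. This is not the <bench/bench.h> API: it assumes a POSIX system and uses clock_gettime in place of the hardware cycle counter, and now_ns, measure_overhead_ns, time_call_ns and spin are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds (stand-in for a cycle counter). */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Estimate the cost of taking two back-to-back timestamps.
 * The minimum over several trials filters out interruptions. */
uint64_t measure_overhead_ns(int trials)
{
    assert(trials > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t t0 = now_ns();
        uint64_t t1 = now_ns();
        if (t1 - t0 < best) {
            best = t1 - t0;
        }
    }
    return best;
}

/* Time one call of fn(arg) and subtract the overhead, clamping at 0. */
uint64_t time_call_ns(void (*fn)(void *), void *arg, uint64_t overhead)
{
    uint64_t t0 = now_ns();
    fn(arg);
    uint64_t t1 = now_ns();
    uint64_t raw = t1 - t0;
    return raw > overhead ? raw - overhead : 0;
}

/* Example workload: volatile keeps the loop from being optimized away. */
void spin(void *arg)
{
    volatile uint64_t *sink = arg;
    for (uint64_t i = 0; i < 1000000; i++) {
        *sink += i;
    }
}
```

A typical use is to call measure_overhead_ns once at start-up, then pass the result to time_call_ns for each snippet being measured.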
Finally, if you get stuck, jot down your understanding of the problem, try to build a reduced example, then ask for help (IRC, mailing list).
(1) debug_printf and sys_print write their output directly to the serial port (with kputchar), whereas printf sends Flounder messages via the terminal emulator to the serial driver. This process may fail during the early stages of initialization or with a misbehaving program.
(2) __builtin_return_address does not work in all scenarios: the frame pointer must not be disabled, and non-zero levels may not work on ARM.