Wednesday, September 29, 2010

A Quick Review of Stack Traces

I'll try to pass on some basic knowledge about those confusing stack traces we sometimes see in the mysql error logs.  What can you tell from them, what are they useful for, and how to validate them?

Debugging Crashes

We tried to improve postmortem debugging of crashes + stack traces in the error log:
o) old versions of mysqld only printed numerical numbers instead of function names (if you're lucky!)
o) some platforms/architectures printed no stack trace what-so-ever!
o) faulty implementations of the crash error reporting and running query.

So, the result is that on modern supported platforms and recent versions of mysqld, you should get a useful stack trace.  A simple example from 5.5.5 on Windows:

mysqld.exe!mysql_admin_table()[sql_table.cc:5150]
mysqld.exe!mysql_optimize_table()[sql_table.cc:5226]
mysqld.exe!mysql_execute_command()[sql_parse.cc:3107]
mysqld.exe!mysql_parse()[sql_parse.cc:5911]
mysqld.exe!dispatch_command()[sql_parse.cc:1138]
mysqld.exe!do_command()[sql_parse.cc:807]
mysqld.exe!do_handle_one_connection()[sql_connect.cc:1196]
mysqld.exe!handle_one_connection()[sql_connect.cc:1136]
mysqld.exe!pthread_start()[my_winthread.c:62]
mysqld.exe!_callthreadstartex()[threadex.c:348]
mysqld.exe!_threadstartex()[threadex.c:331]
kernel32.dll!BaseThreadStart()

It should be obvious that OPTIMIZE TABLE crashed here, since the function name is clear.
The reason I always use windows stack traces in bug reports is because they are readable.  Most GDB and linux error log stack traces are not readable by the human brain at a glance, and therefore not memorable.  One reason is excessive wrapping, another reason is offsets and arguments to the functions are irrelevant and not useful for search engine indexes or the average Joe trying to find a bug report matching a stack trace.

In the old days, a crash only this printed (only 32-bit), and you had to resolve it yourself:

0x81a0705
0x8407eef
0x8408a8f
0x8408b75
0x8408dd2
0x8409af7
0x83e7fc9
0x83eefc3
0x826014c
0x826035c
0x825ec43
0x82106fd
0x81b86da
0x81bd78b
0x81bdd16
0x81bfae2

Which, on a side note is not always possible to do properly with optimized binaries so you get a partially bogus looking stack and you might be tempted to wrongly suspect faulty hardware, foul play, or bad binaries:

0x81a0705 handle_segfault + 805
0x8407eef my_write + 671
0x8408a8f init_key_cache + 1279
0x8408b75 init_key_cache + 1509
0x8408dd2 init_key_cache + 2114
0x8409af7 flush_key_blocks + 55
0x83e7fc9 flush_blocks + 41
0x83eefc3 mi_repair_by_sort + 451
0x826014c ha_myisam::repair(THD*, st_mi_check_param&, bool) + 1900
0x826035c ha_myisam::enable_indexes(unsigned int) + 364
0x825ec43 ha_myisam::end_bulk_insert() + 99
0x82106fd mysql_insert(THD*, TABLE_LIST*, List&, List >&, List&, List&, enum_duplicates, bool) + 5037
0x81b86da mysql_execute_command(THD*) + 9610
0x81bd78b mysql_parse(THD*, char const*, unsigned int, char const**) + 379
0x81bdd16 dispatch_command(enum_server_command, THD*, char*, unsigned int) + 1238
0x81bfae2 handle_one_connection + 2578

Using a technique that involves disassembling the mysqld binary into ASM and piecing together C/C++ source code/comments, it's quite possible to find those inlined functions, expanded macros, or functions that have no name in the symbols file.  A nice topic for another posting.  So, after manually inspecting the binary + numeric offsets, I could get a proper stack trace:

0x81a0705 handle_segfault
0x8407eef unlink_block
0x8408a8f free_block
0x8408b75 flush_cached_blocks
0x8408dd2 flush_key_blocks_int
0x8409af7 flush_key_blocks
0x83e7fc9 flush_blocks
0x83eefc3 mi_repair_by_sort
0x826014c ha_myisam::repair
0x826035c ha_myisam::enable_indexes
0x825ec43 ha_myisam::end_bulk_insert
0x82106fd mysql_insert
0x81b86da mysql_execute_command
0x81bd78b mysql_parse
0x81bdd16 dispatch_command
0x81bfae2 handle_one_connection

This is simply a bulk insert performing a 'repair by sort'.  It crashed in the keycache when flushing blocks, perhaps due to a memory corruption or overrun of something.  I remember fulltext indexes or large table having this problem..

Some Identifying Elements of a Stack Trace
If your 5.1. or 5.5. server ever crashes, please keep the stack trace as it can help identify exactly what the problem is, and you can search google for clues.  
  • prepared statements
Easily distinguishable by looking for the functions similar to this:
.....
mysqld-debug.exe!Prepared_statement::execute()[sql_prepare.cc:3050]
mysqld-debug.exe!mysql_sql_stmt_execute()[sql_prepare.cc:2393]
mysqld-debug.exe!mysql_execute_command()[sql_parse.cc:2935]
.....
  • stored routines and their call depth
Seeing sp_* is a sign of some stored routine activity  You can even see how many SP calls there are nested, whether they called triggers.  Takes some intuition to follow.

.......
06 mysqld_debug!sp_instr_stmt::exec_core
07 mysqld_debug!sp_lex_keeper::reset_lex_and_exec_core
08 mysqld_debug!sp_instr_stmt::execute
09 mysqld_debug!sp_head::execute
0a mysqld_debug!sp_head::execute_procedure
0b mysqld_debug!mysql_execute_command
.......
  • storage engine code (archive, innodb, myisam, merge)
InnoDB mostly asserts, and this is clearly identified in the error log before a stack trace.
"070223 21:47:40  InnoDB: Assertion failure in thread 1655241648 in file row0mysql.c line 3228"

Archive crashes can be easily seen by ha_archive functions:
.....
mysqld.exe!ha_archive::free_share()[ha_archive.cc:411]
mysqld.exe!ha_archive::open()[ha_archive.cc:498]
mysqld.exe!handler::ha_open()[handler.cc:2059]
.....
  • query cache
Any thing involving the Query_Cache class functions:
.....
#5  0x000000000065f390 in Query_cache::insert_table ()
#6  0x000000000065f63a in Query_cache::register_tables_from_list ()
#7  0x000000000065f6a5 in Query_cache::register_all_tables ()
#8  0x000000000065fc0a in Query_cache::store_query ()
#9  0x000000000058cd38 in mysql_execute_command ()
.....
  • mysql functions (string, math, datetime, comparative)
Look out for specific Item_func* methods...
.....
mysqld.exe!Arg_comparator::compare_binary_string()[item_cmpfunc.cc:1158]
mysqld.exe!Item_func_eq::val_int()[item_cmpfunc.cc:1692]
mysqld.exe!Item::val_bool()[item.cc:184]
mysqld.exe!Item_cond_and::val_int()[item_cmpfunc.cc:4222]
.....
  • first calling function in the application
You can nearly always expect a valid stack trace to have these functions at the bottom:
.....
mysqld-debug.exe!mysql_execute_command()[sql_parse.cc:2256]
mysqld-debug.exe!mysql_parse()[sql_parse.cc:5974]
mysqld-debug.exe!dispatch_command()[sql_parse.cc:1233]
mysqld-debug.exe!do_command()[sql_parse.cc:872]
mysqld-debug.exe!handle_one_connection()[sql_connect.cc:1127]
  • last calling functions before the crash
Crashing is usually handled by mysqld's segfault handler.  It depends on the OS and environment.  Most of the time you'll have:

mysqld-debug.exe!my_sigabrt_handler()[mysqld.cc:2048]
mysqld-debug.exe!raise()[winsig.c:597]
mysqld-debug.exe!abort()[abort.c:78]
......

or

0   mysqld   0x00579d3e my_print_stacktrace + 44
1   mysqld   0x00100f78 handle_segfault + 836
......


Debugging Hangs

When mysqld hangs or flatlines the CPU and logging in or killing queries doesn't help, you'd better either create a corefile,  break into the process with a debugger, or just use the PMP.

You'll probably need stack trace of all the threads to determine what is going on:

  1. A single thread is looping endlessly in some loop
  2. Multiple threads are hitting a hot mutex, or totally deadlocked, waiting for each other.


If the deadlock is in innodb you often get useful innodb outputs in the error log
for each waiting thread. But it can be extremely helpful to get full stack traces too.

--Thread 3003468656 has waited at fsp/fsp0fsp.c line 2204 for 556.00 seconds the
semaphore:
X-lock on RW-latch at 0xb759ceb0 created in file fil/fil0fil.c line 1061
That's all for now.!

1 comment:

wlad said...

mysqld-debug.exe!my_sigabrt_handler()[mysqld.cc:2048]
mysqld-debug.exe!raise()[winsig.c:597]
mysqld-debug.exe!abort()[abort.c:78]

This one is likely debug assertion failure. We do not do lot of abort()s otherwise. "normal" crash, e.g access violation would just have the faulty function in the top(or bottom depending on how you count) frame.