Crash dump analysis on (Open)Solaris (mdb)

We've tackled previously how to look at kernel dumps on HP-UX, let's have a look now how to perform them same on OpenSolaris. The kernel debugger is actually 'quite' user-friendly, and gives you mostly enough information how to handle a crash. If your Solaris is too stable to generate crashes, then use the

savecore -L 

command to generate one on the fly. This will generate a dump in /var/adm/crash. Let's have a look at it with mdb :

# mdb -k unix.0 vmcore.0 
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 uppc pcplusmp ufs ip sctp usba lofs zfs random ipc md fcip fctl fcp crypto logindmux ptm nfs ]
>

The ::status command will display high level information regarding this debugging session. This is mostly a one-liner, which reveals the reason of the crash.

> ::status
debugging crash dump vmcore.0 (64-bit) from hostname
operating system: 5.11 snv_43 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe80000ad3d0 addr=0 occurred in module "unix" due to a NULL pointer dereference
dump content: kernel pages only

The ::stack command will prove you with a stack trace, this is the same thing trace you would have seen in syslog or the console.

> ::stack
atomic_add_32()
nfs_async_inactive+0x55(fffffe820d128b80, 0, ffffffffeff0ebcb)
nfs3_inactive+0x38b(fffffe820d128b80, 0)
fop_inactive+0x93(fffffe820d128b80, 0)
vn_rele+0x66(fffffe820d128b80)
snf_smap_desbfree+0x78(fffffe8185e2ff60)
dblk_lastfree_desb+0x25(fffffe817a30f8c0, ffffffffac1d7cc0)
dblk_decref+0x6b(fffffe817a30f8c0, ffffffffac1d7cc0)
freeb+0x89(fffffe817a30f8c0)
tcp_rput_data+0x215f(ffffffffb4af7140, fffffe812085d780, ffffffff993c3c00)
squeue_enter_chain+0x129(ffffffff993c3c00, fffffe812085d780, fffffe812085d780, 1, 1)
ip_input+0x810(ffffffffa23eec68, ffffffffaeab8040, fffffe812085d780, e)

The ::msgbuf command will output the message buffer at the time of crash; the message buffer is most commonly used by sysadmins through the "dmesg" command.

> ::msgbuf
MESSAGE                                                               
....
WARNING: IP: Hardware address '00:14:4f:xxxxxxx' trying to be our address xxxx
WARNING: IP: Hardware address '00:14:4f:xxxx' trying to be our address xxxx

panic[cpu0]/thread=fffffe80000adc80: 
BAD TRAP: type=e (#pf Page fault) rp=fffffe80000ad3d0 addr=0 occurred in module "unix" due to a NULL pointer dereference

sched: 
#pf Page fault
Bad kernel fault at addr=0x0

One of the coolest commands is the cpuinfo -v command, which will show more information about the running processes at the time of the crash, including some nicely ascii-art style formatting :

>  ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  1 ffffffff983b3800  1f    1    0  59  yes    no t-0    fffffe80daac2f20 smtpd
                       |    |
            RUNNING   PRI THREAD           PROC
              READY                99 fffffe8000bacc80 sched
           QUIESCED
             EXISTS
             ENABLE

Other interesting commands are the ::ps (info about running processes), and ::panicinfo, which will reveal thread information, which you can further investigate with the ::walkthread option.

In a following article, I'll write about the Solaris Core Analyzer, which is a Q4 comparabe tool on Solaris to walk through kernel dumps.