Hewlett Packard

Crash dump analysis on HP-UX

A crashing Unix server should be a rare event, which means that postmortem investigation is something you will seldom do. Kernel debuggers are not much fun, and they require a good knowledge of the kernel internals. That's not too difficult if you're a guru in one specific Unix flavour, but if you're housing 3 Unices, each with different kernel versions, then you're in a whole different game! Luckily, there are admin-friendly scripts nowadays which help you with the task of digging out why your machine crashed.

Let's have a look at HP-UX: it ships the adb kernel debugger, but also the Q4 package, which is generally installed by default in the /usr/contrib/Q4 directory. Before first use, you need to copy the initialization script to your home directory:

cp /usr/contrib/Q4/lib/q4lib/sample.q4rc.pl /root/.q4rc.pl

Then you're ready to start up the tool itself :

# q4 -p

HP KWDB 3.2.3 for HP Itanium (32 or 64 bit) and target HP-IPF 11.2x.
Copyright 1986 - 2001 Free Software Foundation, Inc.
Hewlett-Packard KWDB 3.2.3 12-May-2009 21:15 is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
crashdump information:
  hostname  anduril
  model     ia64 hp server Integrity Virtual Machine
  panic     gexcp_hndlr: Unresolved priv 0 interruption.
  release   @(#) $Revision: vmunix:    B.11.31_LR FLAVOR=perf 
  dumptime  1277458512 Fri Jun  25 11:35:12 METDST 2010
  savetime  1277459539 Fri Jun  25 11:52:19 METDST 2010
  dumptype  Non Compressed 

Event selected is 0. It was a panic
#0  0xe000000001da26c0:0 in panic_save_regs_switchstack+0x110
    (0x4000000000000692, 0xe000000001d9d640, 0x144000206c61009f

The Q4 package contains lots of scripts which can give you extra information. The most interesting ones are analyze.pl and whathappened.pl. Beware that these scripts can barf out loads of output! (You can always redirect the output to a file, as if you were at the command prompt.)

q4> include analyze.pl
q4> include whathappened.pl
q4> run Analyze AMUP
...
q4> run WhatHappened
System Name:    HP-UX
Node Name:      anduril
Release:        B.11.31
Version:        U
Model:          no
Machine ID:     123456789
Processors:     1
Architecture:   IA-64
Physical Mem:   1571536 pages

This is a 64 Bit Kernel
The system had been up for 44.12 days (381190776 ticks).
Load averages: 0.76 0.77 0.48.

System went down at: Fri Jun 25 11:35:12 2010

+--------------------------------------------+
| Message Buffer                             |
+--------------------------------------------+
Found adjacent data tr. Growing size.  0x240d000 -> 0x640d000.
Loaded ACPI revision 2.0 tables.
MMIO on this platform supports Write Coalescing.
...
gexcp_hndlr: Reserved Register/Field or Unimplemented Address fault occurs in kernel mode.
gexcp_hndlr: unimplemented data address fault, ISR.ir = 0,
      data memory reference to unimplemented address
******************************************************************************

reg_dump(): Displaying register values (in hex) from the save state at
  ssp  87ffffff_5ffe7200 return_status/reason/flags  0000/0054/00000001

Interruption type: Unimplemented Data Address Fault
panic: gexcp_hndlr: Unresolved priv 0 interruption.

Stack Trace:
  IP                  Function Name
  0xe000000001dea710  gexcp_hndlr+0x2d0
  0xe000000001c0a780  bubbledown+0x0
  0xe000000000afed90  kmem_lpc_alloc+0x2b0
  0xe000000000d6ead0  get_kmem+0x290
  0xe000000000d66070  kmem_arena_xlarge_alloc+0x2f0
  0xe000000000c24e90  kmem_arena_varalloc+0x2d0
  0xe000000000df31c0  vfork_buffer_init+0xb0
  0xe000000000d0b7c0  newproc+0x11f0
  0xe000000000a12930  vfork+0x1440
  0xe000000000c261a0  syscall+0x560
End of Stack Trace
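
The uptime line in the output above can be cross-checked by hand. A minimal sketch, assuming the kernel clock ticks 100 times per second (that rate is an assumption: it is consistent with the numbers shown, but the dump does not state it) and using a hypothetical helper called ticks_to_days:

```shell
# ticks_to_days TICKS [HZ]: convert kernel clock ticks to days.
# HZ defaults to 100 ticks per second, which is an assumption here.
ticks_to_days() {
    awk -v ticks="$1" -v hz="${2:-100}" \
        'BEGIN { printf "%.2f\n", ticks / hz / 86400 }'
}

# 381190776 ticks is the value reported by WhatHappened above.
ticks_to_days 381190776    # prints 44.12
```

The result agrees with the 44.12 days reported by the script, which makes the 100 ticks per second guess look plausible for this machine.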

There's no guarantee that you'll find the exact reason why the machine crashed (especially if it's really kernel related), but at least this gives you a rough idea of what happened.
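
If you saved the WhatHappened output to a file, the call chain is easy to pull out for a bug report or a search in the support database. A minimal sketch, assuming the output looks like the listing above; the helper name stack_functions is made up for this example and is not part of Q4:

```shell
# stack_functions FILE: print the function names found between the
# "Stack Trace:" and "End of Stack Trace" markers, one per line,
# with the trailing +0x... offset stripped.
stack_functions() {
    awk '
        /^End of Stack Trace/  { in_trace = 0 }
        in_trace && $1 ~ /^0x/ { sub(/\+0x[0-9a-fA-F]+$/, "", $2); print $2 }
        /^Stack Trace:/        { in_trace = 1 }
    ' "$1"
}
```

Called as stack_functions /tmp/whathappened.out (a hypothetical file name), it would list gexcp_hndlr first and syscall last for the trace shown above.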

HP-UX is GNU unfriendly

I'm trying to install a reasonably recent rrdtool on an HP-UX 11.11 box, but I'm close to giving up in despair: there's an HP-UX depot in the contrib section of the rrdtool downloads, but that's a very old version. There are very few sites which offer prebuilt GNU binaries for HP-UX; the HP-UX Porting and Archive Centre is the best known. Unfortunately, there are three current versions of HP-UX, spread over 2 architectures, which means that the archive is rather thin. A prebuilt recent rrdtool version is unavailable, which means I get the pleasure of building the thing myself.

HP-UX carries the ccs cc compiler, which dislikes rrdtool (or the other way around), so configure barfs out the following:

configure: error:
Your Compiler does not do proper IEEE math ...

Time to install gcc, but that means installing its dependencies too: libiconv, libgcc and zlib. After a successful gcc installation, it's time for a new configure run:

# export CC=gcc
# ./configure
[...]
checking for gcc... gcc
checking for C compiler default output file name... 
configure: error: C compiler cannot create executables
See `config.log' for more details.

Hmmm, that's weird. Let's check out gcc:

# gcc
/usr/lib/dld.sl: Can't open shared library: /usr/local/lib/libintl.sl
/usr/lib/dld.sl: No such file or directory
Abort

That's even weirder: why is gcc linked to this library? Let's double check:

# ldd `which gcc`
        /usr/lib/libc.2 =>      /usr/lib/libc.2
        /usr/lib/libdld.2 =>    /usr/lib/libdld.2
        /usr/lib/libc.2 =>      /usr/lib/libc.2
        /usr/local/lib/libiconv.sl =>   /usr/local/lib/libiconv.sl
        /usr/lib/libc.2 =>      /usr/lib/libc.2
/usr/lib/dld.sl: Can't open shared library: /usr/local/lib/libintl.sl
/usr/lib/dld.sl: No such file or directory
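
With a longer dependency list it helps to filter out only the libraries the loader could not open. A small sketch, assuming the error lines look exactly like the ones above; missing_libs is a name invented for this example:

```shell
# missing_libs: read ldd / dld.sl output on stdin and print every
# library named in a "Can't open shared library" line, once each.
missing_libs() {
    sed -n "s/.*Can't open shared library: *//p" | sort -u
}
```

Usage would be something like ldd `which gcc` 2>&1 | missing_libs; the 2>&1 is there because the loader complaints may well arrive on stderr.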

But there's a libintl.sl in the /opt/gnome/lib directory; hopefully that one can be used:

# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/gnome/lib
sh: LD_LIBRARY_PATH: Parameter not set.
# gcc
/usr/lib/dld.sl: Can't open shared library: /usr/local/lib/libintl.sl
/usr/lib/dld.sl: No such file or directory
Abort
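
Note what the shell actually said there: "Parameter not set" is the shell refusing to expand a variable that does not exist yet (the nounset option, set -u), so the export never took effect and gcc ran with the same environment as before. A minimal sketch of an append that works whether or not the variable is already set; whether dld.sl then honours the variable for this particular binary is a separate question that this snippet does not answer:

```shell
set -u                      # mimic the shell behaviour seen above
unset LD_LIBRARY_PATH       # start from the same state: variable not set

# ${VAR:+...} expands to nothing when VAR is unset or empty, so the
# "old value plus colon" prefix is only added when there is an old value.
LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}/opt/gnome/lib"
export LD_LIBRARY_PATH

echo "$LD_LIBRARY_PATH"     # prints /opt/gnome/lib
```

The same pattern applies to any other search-path variable you want to extend from a shell that runs with nounset.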

Damn, HP-UX doesn't seem to pick up $LD_LIBRARY_PATH or $SHLIB_PATH here; maybe this dirty hack will work:

# ln -s /opt/gnome/lib/libintl.sl /usr/local/lib/libintl.sl
# gcc                                                      
/usr/lib/dld.sl: Unresolved symbol: libintl_bindtextdomain (code)  from gcc
Abort
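
The "Unresolved symbol" message means the loader found a libintl.sl this time, but not the name gcc was linked against; possibly the GNOME copy was built from a different gettext version, though that is only a guess. A rough way to check what a library offers, assuming nm is available; has_symbol is a hypothetical helper that merely looks for the name in nm's listing, so it cannot tell a defined symbol from an undefined reference:

```shell
# has_symbol NAME: succeed if NAME appears as a separate field in the
# symbol listing read from stdin. Fields are split on blanks and "|",
# since nm output formats differ between platforms.
has_symbol() {
    awk -F'[ \t|]+' -v sym="$1" '
        { for (i = 1; i <= NF; i++) if ($i == sym) found = 1 }
        END { exit !found }
    '
}
```

For example: nm /opt/gnome/lib/libintl.sl | has_symbol libintl_bindtextdomain && echo present. If nothing is printed, the symlink trick cannot work with that library.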

/me kicks the server. Bah.
We coped with this crap on Linux 10 years ago. Maybe I'm among the ignorant, but does anyone know a way out of this mess? Sun is giving away companion CDs with GNU tools on them; maybe HP does the same?

ServiceGuard

I spent the last week in the middle of the English countryside for a course on HP-UX Serviceguard.

Thick fog, Festive Pheasants, Badgers, Guinness and heavy headaches. Luckily we travelled by Eurostar, because Heathrow Airport was rather chaotic due to the fog.

There's a small collection of pictures I took with my mobile phone, so the quality is not exactly excellent. It offers a good, misty overview of a week with little or no distraction.

Uptime record

Another uptime record :

[root #/ ] w
9:23am up 1396 days, 17:21, 15 users, load average: 1.36, 1.07, 0.94

This uptime beats the previous record by far. The machine runs an old copy of HP-UX 10.20 and still sees a fair amount of use, too. To be fair, I don't like machines with such big uptimes: they're old, have a non-standard setup and configuration, and mostly go unpatched through life.