Crash dump analysis on HP-UX

Hewlett Packard

A crashing Unix server should be a rare event, which means that postmortem investigation is something you will seldom do. Kernel debuggers are not much fun, and they basically require a good knowledge of kernel internals. That's not too difficult if you're a guru in one specific Unix flavour, but if you're housing three Unices, each with different kernel versions, then you're in a whole different game! Luckily, there are admin-friendly scripts nowadays which help you with the task of digging out why your machine crashed.

Let's have a look at HP-UX: it features the adb kernel debugger, but also the Q4 package, which is generally installed by default in the /usr/contrib/Q4 directory. Before first use, you need to copy the initialization script to your home directory:

cp /usr/contrib/Q4/lib/q4lib/sample.q4rc.pl /root/.q4rc.pl

Then you're ready to start up the tool itself:

# q4 -p

HP KWDB 3.2.3 for HP Itanium (32 or 64 bit) and target HP-IPF 11.2x.
Copyright 1986 - 2001 Free Software Foundation, Inc.
Hewlett-Packard KWDB 3.2.3 12-May-2009 21:15 is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
crashdump information:
  hostname  anduril
  model     ia64 hp server Integrity Virtual Machine
  panic     gexcp_hndlr: Unresolved priv 0 interruption.
  release   @(#) $Revision: vmunix:    B.11.31_LR FLAVOR=perf 
  dumptime  1277458512 Fri Jun  25 11:35:12 METDST 2010
  savetime  1277459539 Fri Jun  25 11:52:19 METDST 2010
  dumptype  Non Compressed 

Event selected is 0. It was a panic
#0  0xe000000001da26c0:0 in panic_save_regs_switchstack+0x110
    (0x4000000000000692, 0xe000000001d9d640, 0x144000206c61009f

The Q4 package contains lots of scripts that can provide you with extra information. The most interesting ones are analyze.pl and whathappened.pl. Beware that these scripts can produce loads of output! (You can always redirect the output to a file, just as you would at the shell prompt.)

q4> include analyze.pl
q4> include whathappened.pl
q4> run Analyze AMUP
...
q4> run WhatHappened
System Name:    HP-UX
Node Name:      anduril
Release:        B.11.31
Version:        U
Model:          no
Machine ID:     123456789
Processors:     1
Architecture:   IA-64
Physical Mem:   1571536 pages

This is a 64 Bit Kernel
The system had been up for 44.12 days (381190776 ticks).
Load averages: 0.76 0.77 0.48.

System went down at: Fri Jun 25 11:35:12 2010

+--------------------------------------------+
| Message Buffer                             |
+--------------------------------------------+
Found adjacent data tr. Growing size.  0x240d000 -> 0x640d000.
Loaded ACPI revision 2.0 tables.
MMIO on this platform supports Write Coalescing.
...
gexcp_hndlr: Reserved Register/Field or Unimplemented Address fault occurs in kernel mode.
gexcp_hndlr: unimplemented data address fault, ISR.ir = 0,
      data memory reference to unimplemented address
******************************************************************************

reg_dump(): Displaying register values (in hex) from the save state at
  ssp  87ffffff_5ffe7200 return_status/reason/flags  0000/0054/00000001

Interruption type: Unimplemented Data Address Fault
panic: gexcp_hndlr: Unresolved priv 0 interruption.

Stack Trace:
  IP                  Function Name
  0xe000000001dea710  gexcp_hndlr+0x2d0
  0xe000000001c0a780  bubbledown+0x0
  0xe000000000afed90  kmem_lpc_alloc+0x2b0
  0xe000000000d6ead0  get_kmem+0x290
  0xe000000000d66070  kmem_arena_xlarge_alloc+0x2f0
  0xe000000000c24e90  kmem_arena_varalloc+0x2d0
  0xe000000000df31c0  vfork_buffer_init+0xb0
  0xe000000000d0b7c0  newproc+0x11f0
  0xe000000000a12930  vfork+0x1440
  0xe000000000c261a0  syscall+0x560
End of Stack Trace

You are not guaranteed to find the exact reason why the machine crashed (especially if it's really kernel related), but at least this can give you a rough idea of what happened.
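
If you redirected the WhatHappened output to a file, the stack trace can be pulled out of it with standard tools. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the report uses the "Stack Trace:" and "End of Stack Trace" markers shown in the output above.

```shell
# Sketch: work on a saved WhatHappened report (the file name is up to you).
# Assumes the section markers look exactly like the ones in the output above.

# Print only the stack trace section of the report given as $1.
extract_stack_trace() {
    sed -n '/^Stack Trace:/,/^End of Stack Trace/p' "$1"
}

# Print just the function names (second column) of the trace lines.
stack_functions() {
    extract_stack_trace "$1" | awk '$1 ~ /^0x/ { print $2 }'
}
```

With the report saved as, say, /tmp/whathappened.out (a hypothetical path), `stack_functions /tmp/whathappened.out` lists the call chain one function per line, which is handy when comparing several dumps.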

Migrate to ext4

Linux

Since the Ubuntu Lucid upgrade, suspend/resume no longer works on my desktop, which means I must power-cycle it every day. This leads to more frequent fsck runs on my mounted filesystems, and as those keep growing in size over time, booting takes a long time. This is why I decided to migrate everything to ext4, which offers faster fsck times and better overall performance.

The procedure to migrate ext3 volumes to ext4 is quite straightforward:

For non-root filesystems:

First, unmount the partition.

umount /dev/sda5

Next, run a filesystem check on it to make sure it is in sane condition. We are still on ext3.

fsck.ext3 -pf /dev/sda5

Enable new features of ext4 on the filesystem.

tune2fs -O extents,uninit_bg,dir_index /dev/sda5

Option "extents" makes the filesystem use extents instead of indirect block mapping for files, "uninit_bg" reduces filesystem check times by only checking the used portions of the disk, and "dir_index" stores the contents of large directories in an htree for faster access. Option "dir_index" is also supported by ext3, so you may already be using it, but it does no harm to specify it here.
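
To see which of these three features are still missing on a volume, you can look at the "Filesystem features:" line printed by tune2fs -l. A small sketch of that check follows; the helper name is an assumption of mine, and it only parses text on stdin, so it changes nothing on disk.

```shell
# Sketch: read `tune2fs -l` output on stdin and print each of the three
# features discussed above that is not yet enabled on the filesystem.
missing_ext4_features() {
    # Keep only the feature list from the "Filesystem features:" line.
    features=$(sed -n 's/^Filesystem features:[[:space:]]*//p')
    for wanted in extents uninit_bg dir_index; do
        case " $features " in
            *" $wanted "*) ;;          # already enabled, print nothing
            *) echo "$wanted" ;;       # not in the list yet
        esac
    done
}
```

For example, `tune2fs -l /dev/sda5 | missing_ext4_features` prints nothing once all three features are enabled.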

Run a filesystem check. It will find errors; this is normal, so let it fix them.

fsck.ext4 -yfD /dev/sda5

The "-D" parameter makes the "dir_index" option take effect by rebuilding the directory indexes. They can be rebuilt (optimized) at any later time by running the check with this parameter again.

Now edit your /etc/fstab file to say "ext4" instead of "ext3" for /home. Other options may differ for your system.

/dev/sda5 /home ext4 defaults 0 2
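
If you prefer not to edit the file by hand, the change can be scripted. This is a sketch under assumptions: the function name is hypothetical, it prints the modified fstab to stdout instead of touching /etc/fstab, and it rewrites the matching line with single spaces between fields.

```shell
# Sketch: print an fstab with one mount point switched from ext3 to ext4.
# $1 = fstab file to read, $2 = mount point (for example /home).
# Nothing is written back; review the output before replacing /etc/fstab.
fstab_to_ext4() {
    awk -v mp="$2" '
        # Skip comment lines; only touch the entry for the given mount
        # point, and only if its type field is currently ext3.
        $1 !~ /^#/ && $2 == mp && $3 == "ext3" { $3 = "ext4" }
        { print }
    ' "$1"
}
```

A cautious way to use it is `fstab_to_ext4 /etc/fstab /home > /tmp/fstab.new`, then compare the two files with diff before copying the new one into place.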

Try to mount your new ext4 filesystem.

mount /home

If it succeeds, congratulations.

It may seem that the migration from ext3 to ext4 is now complete, and that is almost true. However, any files created before the conversion will continue to use the indirect block mapping of ext3 instead of the extents of ext4. Files will eventually migrate to the new format as they are updated during normal system operation, because on the next write they will be saved using extents. Unfortunately, many frequently used files (like application binaries) are read often and rarely written. The outcome is that these files will keep using the old format for a long time, and you will not experience the full potential of ext4. The extent attribute can be set with chattr, which accepts multiple files at once. Digging through the entire directory tree by hand is not really feasible, so you can use some shell magic to accomplish the task.

find /home -xdev -type f -print0 | xargs -0 chattr +e
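
To get an idea of how many files are still without the extent attribute, before or after that run, the output of lsattr can be filtered. A minimal sketch, assuming the usual lsattr layout of a flags column followed by the path; the helper name is mine.

```shell
# Sketch: read `lsattr` output on stdin and print the paths of files whose
# flags column does not contain the "e" (extents) attribute.
files_without_extents() {
    # $1 is the flags column; strip it and print the rest of the line,
    # so that paths containing spaces stay intact.
    awk '$1 !~ /e/ { sub(/^[^ ]+ /, ""); print }'
}
```

It can be fed in the same way as the chattr command: `find /home -xdev -type f -print0 | xargs -0 lsattr 2>/dev/null | files_without_extents`.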

For root filesystems:


You will need to use a Linux live CD, or the installation CD that came with your distribution. You might want to check whether the kernel on that installation medium supports ext4. Simply follow the above procedure, but don't forget to change /etc/fstab first, so that your root partition is marked as ext4.

Hamaril summer concert

Music

The music school's summer concert takes place this weekend, and yesterday was the kick-off. I was rather impressed by the size of the event: 120 groups, spread over 3 days, each play a rehearsed piece. That includes yours truly, after 3 months of lessons and as the first of two solo musicians; everyone else played with several people. As a concert baptism of fire, that counts. I couldn't complain: I dropped only a few small stitches, mainly because I could hardly hear the music. Lesson learned: bring my own keyboard next time, which saves a lot of fiddling with the settings of an unfamiliar keyboard that also has fewer keys than mine.
Anyway, compliments to the organisers, well done for such a big event with a wonderful atmosphere!

OpenSolaris 2010.05

OpenSolaris

Anyone who thought that OpenSolaris was dead after the Oracle acquisition might be wrong: OpenSolaris 2010.05 has been released with some important new features:

  • ZFS deduplication: I've always predicted that this would one day become a default feature of file systems, and ZFS is the first to implement this
  • IPS, the new Image Packaging System: the reworked package manager for OpenSolaris, a big step in bringing the legacy Solaris package management to a higher level
  • USB support for VirtualBox guests
  • Gnome 2.28

Update: it seems that this was a link to a draft document.

Larry Ellison about the ex-Sun management

Sun

In this Reuters.com article, Larry Ellison speaks candidly about his view of the ex-Sun management's decisions over the last few years. It's quite a critical view:

"Their management made some very bad decisions that damaged their business and allowed us to buy them for a bargain price"

"The underlying engineering teams are so good, but the direction they got was so astonishingly bad that even they couldn't succeed"

Ellison shut down one of Schwartz's pet projects -- development of the "Rock" microprocessor for Sun's high-end SPARC server line, a semiconductor that had struggled in development for five years as engineers sought to overcome a string of technical problems. "This processor had two incredible virtues: It was incredibly slow and it consumed vast amounts of energy."

Ellison says he learned that Sun's pony-tailed chief executive, Jonathan Schwartz, ignored problems as they escalated, made poor strategic decisions and spent too much time working on his blog, which Sun translated into 11 languages.

At least you can't accuse Ellison of being unclear. Much of it is of course corporate chatter; IBM's Power7 chip runs pretty hot, and is equipped with impressive heat sinks too. The article goes on to say that investment in SPARC and OpenSolaris is picking up again, but I'm afraid this will not be enough to restore faith in Solaris for many customers.
