
Crash dump analysis on HP-UX

A crashing Unix server should be a rare event, which means that postmortem investigation is something you will seldom do. Kernel debuggers are not much fun, and they basically require a good knowledge of kernel internals. That is not too difficult if you're a guru in one specific Unix flavour, but if you're housing 3 Unices, each with different kernel versions, you're in a whole different game! Luckily, there are admin-friendly scripts nowadays that help you dig out why your machine crashed.

Let's have a look at HP-UX: it features the adb kernel debugger, but also the Q4 package, which is generally installed by default in the /usr/contrib/Q4 directory. Before first use, you need to copy the initialization script to your home directory:

cp /usr/contrib/Q4/lib/q4lib/sample.q4rc.pl /root/.q4rc.pl

Then you're ready to start up the tool itself :

# q4 -p


HP KWDB 3.2.3 for HP Itanium (32 or 64 bit) and target HP-IPF 11.2x.
Copyright 1986 - 2001 Free Software Foundation, Inc.
Hewlett-Packard KWDB 3.2.3 12-May-2009 21:15 is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
crashdump information:
hostname anduril
model ia64 hp server Integrity Virtual Machine
panic gexcp_hndlr: Unresolved priv 0 interruption.
release @(#) $Revision: vmunix: B.11.31_LR FLAVOR=perf
dumptime 1277458512 Fri Jun 25 11:35:12 METDST 2010
savetime 1277459539 Fri Jun 25 11:52:19 METDST 2010
dumptype Non Compressed


Event selected is 0. It was a panic
#0 0xe000000001da26c0:0 in panic_save_regs_switchstack+0x110
(0x4000000000000692, 0xe000000001d9d640, 0x144000206c61009f

The Q4 package contains lots of scripts that provide extra information. The most interesting ones are analyze.pl and whathappened.pl. Beware that these scripts can barf out loads of output! (You can always redirect the output to a file, as if you were at the command prompt.)

q4> include analyze.pl
q4> include whathappened.pl
q4> run Analyze AMUP
...
q4> run WhatHappened
System Name: HP-UX
Node Name: anduril
Release: B.11.31
Version: U
Model: no
Machine ID: 123456789
Processors: 1
Architecture: IA-64
Physical Mem: 1571536 pages


This is a 64 Bit Kernel
The system had been up for 44.12 days (381190776 ticks).
Load averages: 0.76 0.77 0.48.


System went down at: Fri Jun 25 11:35:12 2010


+--------------------------------------------+

+--------------------------------------------+
Found adjacent data tr. Growing size. 0x240d000 -> 0x640d000.
Loaded ACPI revision 2.0 tables.
MMIO on this platform supports Write Coalescing.
...
gexcp_hndlr: Reserved Register/Field or Unimplemented Address fault occurs in kernel mode.
gexcp_hndlr: unimplemented data address fault, ISR.ir = 0,
data memory reference to unimplemented address
******************************************************************************


reg_dump(): Displaying register values (in hex) from the save state at
ssp 87ffffff_5ffe7200 return_status/reason/flags 0000/0054/00000001


Interruption type: Unimplemented Data Address Fault
panic: gexcp_hndlr: Unresolved priv 0 interruption.


Stack Trace:
IP Function Name
0xe000000001dea710 gexcp_hndlr+0x2d0
0xe000000001c0a780 bubbledown+0x0
0xe000000000afed90 kmem_lpc_alloc+0x2b0
0xe000000000d6ead0 get_kmem+0x290
0xe000000000d66070 kmem_arena_xlarge_alloc+0x2f0
0xe000000000c24e90 kmem_arena_varalloc+0x2d0
0xe000000000df31c0 vfork_buffer_init+0xb0
0xe000000000d0b7c0 newproc+0x11f0
0xe000000000a12930 vfork+0x1440
0xe000000000c261a0 syscall+0x560
End of Stack Trace

You are not guaranteed to find the exact reason why the machine crashed (especially if it's really kernel related), but at least this gives you a rough idea of what happened.
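
The numbers in the crash dump header can be sanity-checked with a bit of shell arithmetic. This is only a sketch: the function names are made up for illustration, and the clock rate of 100 ticks per second is an assumption (it happens to match the 44.12 days reported above).

```shell
# ticks_to_days TICKS [HZ]
# Convert kernel clock ticks to days of uptime.
# HZ defaults to 100 ticks per second, which is an assumption.
ticks_to_days() {
    LC_ALL=C awk -v t="$1" -v hz="${2:-100}" \
        'BEGIN { printf "%.2f\n", t / hz / 86400 }'
}

# dump_to_save_gap DUMPTIME SAVETIME
# Seconds between the panic (dumptime) and the moment the dump was saved
# (savetime), both given as epoch values from the q4 header.
dump_to_save_gap() {
    echo $(( $2 - $1 ))
}
```

With the values from this dump, ticks_to_days 381190776 gives the uptime in days, and dump_to_save_gap 1277458512 1277459539 shows how long it took before the dump was saved (a good 17 minutes here).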

Migrate to ext4

Since the Ubuntu Lucid upgrade, suspend/resume no longer works on my desktop, which means I must powercycle every day. This leads to a higher fsck rate on my mounted filesystems, and as those grow over time, booting takes longer and longer. This is why I decided to migrate everything to ext4, which gives me faster fsck times and better overall performance.
The procedure to migrate ext3 volumes to ext4 is quite straightforward :

For non-root filesystems :


First, unmount the partition.

umount /dev/sda5

Next, run a filesystem check on it to make sure it is in sane condition. We are still on ext3.

fsck.ext3 -pf /dev/sda5

Enable new features of ext4 on the filesystem.

tune2fs -O extents,uninit_bg,dir_index /dev/sda5

Option "extents" enables the filesystem to use extents instead of the indirect block mapping of ext3 for files, "uninit_bg" reduces filesystem check times by only checking the used portions of the disk, and "dir_index" allows storing the contents of large directories in an htree for faster access. Option "dir_index" is also supported by ext3, so you may already be using it, but it does no harm to specify it here.
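
To confirm that the flags were really set, you can look at the "Filesystem features" line printed by tune2fs -l. A minimal sketch, with has_feature being a hypothetical helper name; note that tune2fs typically lists the extents flag as "extent".

```shell
# has_feature NAME
# Reads `tune2fs -l` output on stdin and succeeds (exit 0) if NAME appears
# on the "Filesystem features:" line, fails (exit 1) otherwise.
has_feature() {
    awk -v want="$1" -F': *' '
        /^Filesystem features:/ {
            n = split($2, feats, " ")
            for (i = 1; i <= n; i++)
                if (feats[i] == want) found = 1
        }
        END { exit (found ? 0 : 1) }'
}
```

Usage would be something like: tune2fs -l /dev/sda5 | has_feature uninit_bg && echo "uninit_bg is set"
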
Run a filesystem check. It will find errors. It is normal. Let it fix them.

fsck.ext4 -yfD /dev/sda5

The "-D" parameter makes the "dir_index" option take effect by rebuilding the directory indexes. They can be rebuilt (optimized) at any later time by running the check with this parameter again.
Now edit your /etc/fstab file to say "ext4" instead of "ext3" for /home. Other options may differ for your system.

/dev/sda5 /home ext4 defaults 0 2
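
Before mounting, it doesn't hurt to check what fstab now says for the mount point. A small sketch, where fstab_fstype is a hypothetical helper name:

```shell
# fstab_fstype FSTAB MOUNTPOINT
# Print the filesystem type (third field) that FSTAB records for MOUNTPOINT,
# skipping comment lines.
fstab_fstype() {
    awk -v mp="$2" '$1 !~ /^#/ && $2 == mp { print $3 }' "$1"
}
```

For the entry above, fstab_fstype /etc/fstab /home should print ext4.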

Try to mount your new ext4 filesystem.

mount /home

If it succeeds, congratulations.
It may seem that the migration from ext3 to ext4 is now complete, and that is almost true. However, any files created before the conversion will continue to use the block mapping of ext3 instead of the extents of ext4. Files will eventually migrate to the new format as they are updated during normal system operation, because on the next write they will be saved using extents. Unfortunately, many frequently used files (like application binaries) are often read and rarely written. The outcome is that those files will keep the old format for a long time, and you will not experience the full potential of ext4. To convert such files sooner, their attributes can be modified with chattr, which accepts multiple files at once. Walking through the entire directory tree by hand is not really feasible, though, so you can use some shell magic to accomplish the task.
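
The shell magic could look like the sketch below. It rests on an assumption you should verify on your own system first: that setting the extent attribute with chattr +e makes the kernel convert the file to extents. The function name and the DRY_RUN switch are made up for this example; by default it only prints what it would do.

```shell
# migrate_to_extents DIR
# Walk DIR without crossing filesystem boundaries (-xdev) and set the
# extent attribute on every regular file.
# Assumption: `chattr +e` triggers the conversion to extents on this system.
# With DRY_RUN=1 (the default) the chattr commands are only printed.
migrate_to_extents() {
    dir=$1
    if [ "${DRY_RUN:-1}" = "1" ]; then
        find "$dir" -xdev -type f -exec echo chattr +e {} +
    else
        find "$dir" -xdev -type f -exec chattr +e {} +
    fi
}
```

Run migrate_to_extents /home first to review the list, then repeat it as root with DRY_RUN=0 once you trust the result. Make sure you have a backup before touching a whole filesystem this way.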



For root filesystems :


You will need to use a Linux liveCD, or the installation CD that came with your distribution. You might want to check if the kernel on that installation medium has ext4 capabilities. Simply follow the above procedure, but don't forget to change /etc/fstab first, so your root partition is marked as ext4.

Hamaril summer concert

The music school's summer concert takes place this weekend, and yesterday was the kickoff. I was rather impressed by the size of the event: 120 groups, spread over 3 days, each play a rehearsed piece. Among them yours truly, after 3 months of lessons and as the first of two solo musicians; everyone else played together with several people. As a concert baptism of fire, that counts. I couldn't complain: I only slipped up a few times, mainly because I could barely hear the music. Lesson learned: bring my own keyboard next time, which saves a lot of fiddling with the settings of an unfamiliar keyboard that also has fewer keys than mine.
Anyway, compliments to the organizers: a great job for such a large event with a wonderful atmosphere!

OpenSolaris 2010.05

Whoever thought that OpenSolaris was dead after the Oracle acquisition might be wrong: OpenSolaris 2010.05 has been released with some important new features:

  • ZFS deduplication: I've always predicted that this would one day become a default feature of file systems, and ZFS is the first to implement it

  • IPS, the new Image Packaging System: the reworked package manager for OpenSolaris, a big step in bringing the legacy Solaris package management to a higher level

  • USB support for VirtualBox guests

  • Gnome 2.28




Update : seems that this was a link to a draft document.

Larry Ellison about the ex-Sun management

In this Reuters.com article, Larry Ellison speaks candidly about his view on the decisions made by the ex-Sun management in recent years. It's quite a critical view:


"Their management made some very bad decisions that damaged their business and allowed us to buy them for a bargain price"


"The underlying engineering teams are so good, but the direction they got was so astonishingly bad that even they couldn't succeed"


Ellison shut down one of Schwartz's pet projects -- development of the "Rock" microprocessor for Sun's high-end SPARC server line, a semiconductor that had struggled in development for five years as engineers sought to overcome a string of technical problems. "This processor had two incredible virtues: It was incredibly slow and it consumed vast amounts of energy."


Ellison says he learned that Sun's pony-tailed chief executive, Jonathan Schwartz, ignored problems as they escalated, made poor strategic decisions and spent too much time working on his blog, which Sun translated into 11 languages.



At least you can't accuse Ellison of not being clear. Much of it is of course corporate chatter; IBM's Power7 chip runs pretty hot too, and is equipped with impressive heat sinks. The article goes on to say that investment in Sparc and OpenSolaris is being boosted again, but I'm afraid this will not be enough to restore faith in Solaris for many customers.

AIX Technical University, Budapest 2010

I spent last week in Budapest, attending the IBM AIX Technical University: hundreds of presentations about AIX, Power, Storage and Tape, spread over 4 days. The most interesting talk was about the new stuff in TSM 6.2, where one of the most notable features was client upgrades through the TSM ISC. Client packages are pushed from the ISC to the TSM server, where they reside on a diskpool, ready for shipment to the clients. This is currently a Windows-only feature, but Unix will probably follow in the 6.3 release. Unfortunately there was no possibility to take a TSM certification exam, which meant I could not renew (or upgrade) my 5.4 certification.


There wasn't much time for sightseeing, though we did manage a short visit to the castle on the Buda side of the Lanchid bridge. Pretty nice city, and a good idea for a future city trip...

Jitta

It took a bit more effort than initially expected, but since yesterday we do have a new family member. Jitta is a Landseer pup who currently looks like a cute, white, fluffy teddy bear. Landseer dogs are related to Newfoundlands, grow into dogs of around 50 kilos, and are generally known for their friendly and gentle character.

Discovering Reason

The possibilities of software synthesizers can be endless, and the more possibilities a program offers, the more complex it appears to a newbie. The guys at Propellerhead realized this, and put up a 34-part tutorial covering all aspects of Reason, with of course a large part dedicated to the Thor analogue oscillator.

Visual guide to the Mandelbrot set

Surely everyone knows fractals, the beautiful, organic-looking mathematical drawings. Because they appear similar at all levels of magnification, fractals are often considered to be infinitely complex. Natural objects that are approximated by fractals to a degree include clouds, mountain ranges, lightning bolts, coastlines and snowflakes. The most famous one is the Mandelbrot set, named after the French mathematician Benoît Mandelbrot.


Programs like Fractint or its successor Xaos put fractal exploration within everyone's reach. However, knowing where to zoom in makes the difference between boring and exciting fractals. Here's a visual guide to fractal exploring that makes you feel like Alice in Wonderland.

Boot Solaris with a RO root filesystem

I just finished a very interesting case of a coredumping TSM client on Solaris. Investigation of the core dump suggested that the TSM client barfed over an erroneous inode. Some more diagnosis indeed revealed filesystem corruption, unfortunately on the root filesystem. Normally, one would boot from CDROM or do a netboot to correct the corruption, but it turned out the Jumpstart config of the host was really foobarred. I had neither the time to correct the Jumpstart server config, nor to walk over to the data center to insert a Solaris DVD.


At times like that, I resort to little tricks in the boot sequence of Solaris: if you boot with boot -a -s, you can specify the location of the startup files. If you enter /dev/null for the /etc/system file, the host will continue to boot, but with a read-only filesystem:



Rebooting with command: boot -a -s
Boot device: /pci@8,600000/SUNW,qlc@4/fp@0,0/disk@0,0:b File and args: -a -s
Enter filename [kernel/sparcv9/unix]:
Enter default directory for modules [/platform/SUNW,Sun-Fire-280R/kernel /platform/sun4u/kernel /kernel /usr/kernel]:
=> Name of system file [etc/system]: /dev/null
SunOS Release 5.10 Version Generic_118833-24 64-bit
Copyright 1983-2006 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
root filesystem type [ufs]:
Enter physical name of root device
[/pci@8,600000/SUNW,qlc@4/fp@0,0/disk@w500000e0155145d1,0:b]:
Booting to milestone "milestone/single-user:default".
Hostname: qwerty
SUNW,eri0 : 100 Mbps full duplex link up
Requesting System Maintenance Mode
SINGLE USER MODE


Root password for system maintenance (control-d to bypass):
single-user privilege assigned to /dev/console.
Entering System Maintenance Mode



After a few rounds of fsck, the root filesystem turned out to be corrected, and only 2 files seemed to be impacted by the filesystem check. As the TSM client worked again, I could easily restore those from backup.