Hello, world ! Welcome to the weblog of Kristof Willen. This is the place where I publish some weird and interesting links I encountered during my wanderings in cyberspace. Apart from that, you can find some useful/useless information about myself.

We've previously tackled how to look at kernel dumps on HP-UX; let's now have a look at how to do the same on OpenSolaris. The kernel debugger is actually 'quite' user-friendly, and mostly gives you enough information to handle a crash. If your Solaris is too stable to generate crashes, then use the
savecore -L
command to generate one on the fly. This will generate a dump in /var/adm/crash. Let's have a look at it with mdb :
# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 uppc pcplusmp ufs ip sctp usba lofs zfs random ipc md fcip fctl fcp crypto logindmux ptm nfs ]
>
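savecore numbers the files it writes (unix.0 and vmcore.0, then unix.1 and vmcore.1, and so on), so on a box with several dumps it helps to find the newest one first. Here is a minimal sketch, assuming the dumps live in /var/adm/crash as above; the function name latest_dump is made up for this example :

```shell
# latest_dump [DIR]: print the highest N for which DIR/vmcore.N exists.
# Hypothetical helper; DIR defaults to the crash directory used above.
latest_dump() {
    ld_dir=${1:-/var/adm/crash}
    for ld_file in "$ld_dir"/vmcore.*; do
        # Skip the unexpanded pattern when no dump is present.
        [ -e "$ld_file" ] || continue
        # Strip everything up to and including "vmcore." to keep the suffix.
        echo "${ld_file##*vmcore.}"
    done | grep '^[0-9][0-9]*$' | sort -n | tail -1
}
```

With that in place, something like n=$(latest_dump) followed by mdb -k unix.$n vmcore.$n from within the crash directory opens the most recent dump. Nothing is printed when the directory holds no dumps.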
The ::status command displays high-level information about this debugging session. This is mostly a one-liner, which reveals the reason for the crash.
> ::status
debugging crash dump vmcore.0 (64-bit) from hostname
operating system: 5.11 snv_43 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe80000ad3d0 addr=0 occurred in module "unix" due to a NULL pointer dereference
dump content: kernel pages only
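If you only want the panic reason, for example to paste it into a ticket, a small filter over saved ::status output does the trick. This is just a sketch : the function name panic_message is mine, and it assumes the message sits on the same line as the "panic message:" label.

```shell
# panic_message: read ::status output on stdin and print only the panic text.
# Assumption: the text follows the "panic message:" label on the same line.
panic_message() {
    sed -n 's/^panic message:[[:space:]]*//p'
}
```

Feeding it is a matter of piping, e.g. echo '::status' | mdb -k unix.0 vmcore.0 | panic_message, since mdb reads its commands from standard input when it is not run interactively.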
The ::stack command will provide you with a stack trace; this is the same trace you would have seen in syslog or on the console.
> ::stack
atomic_add_32()
nfs_async_inactive+0x55(fffffe820d128b80, 0, ffffffffeff0ebcb)
nfs3_inactive+0x38b(fffffe820d128b80, 0)
fop_inactive+0x93(fffffe820d128b80, 0)
vn_rele+0x66(fffffe820d128b80)
snf_smap_desbfree+0x78(fffffe8185e2ff60)
dblk_lastfree_desb+0x25(fffffe817a30f8c0, ffffffffac1d7cc0)
dblk_decref+0x6b(fffffe817a30f8c0, ffffffffac1d7cc0)
freeb+0x89(fffffe817a30f8c0)
tcp_rput_data+0x215f(ffffffffb4af7140, fffffe812085d780, ffffffff993c3c00)
squeue_enter_chain+0x129(ffffffff993c3c00, fffffe812085d780, fffffe812085d780, 1, 1)
ip_input+0x810(ffffffffa23eec68, ffffffffaeab8040, fffffe812085d780, e)
The ::msgbuf command will output the message buffer as it was at the time of the crash; this is the same buffer sysadmins usually read through the "dmesg" command.
> ::msgbuf
MESSAGE
....
WARNING: IP: Hardware address '00:14:4f:xxxxxxx' trying to be our address xxxx
WARNING: IP: Hardware address '00:14:4f:xxxx' trying to be our address xxxx
panic[cpu0]/thread=fffffe80000adc80:
BAD TRAP: type=e (#pf Page fault) rp=fffffe80000ad3d0 addr=0 occurred in module "unix" due to a NULL pointer dereference
sched:
#pf Page fault
Bad kernel fault at addr=0x0
One of the coolest commands is ::cpuinfo -v, which will show more information about the running processes at the time of the crash, including some nice ASCII-art style formatting :
> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  1 ffffffff983b3800  1f    1    0  59  yes    no t-0    fffffe80daac2f20 smtpd
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                99 fffffe8000bacc80 sched
           QUIESCED
             EXISTS
             ENABLE
Other interesting commands are ::ps (info about running processes) and ::panicinfo, which reveals thread information that you can investigate further with ::walk thread.
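Since mdb reads its commands from standard input when it is not attached to a terminal, the first commands from this post can be collected in one go. What follows is only a sketch under that assumption; the wrapper name crash_report and its defaults (dump 0 in /var/adm/crash) are made up for this example :

```shell
# crash_report [N] [DIR]: run ::status, ::stack and ::msgbuf against dump N.
# Hypothetical wrapper; relies on mdb reading dcmds from standard input.
crash_report() {
    cr_n=${1:-0}
    cr_dir=${2:-/var/adm/crash}
    # One dcmd per line, in the order they were discussed above.
    printf '%s\n' '::status' '::stack' '::msgbuf' |
        mdb -k "$cr_dir/unix.$cr_n" "$cr_dir/vmcore.$cr_n"
}
```

Something like crash_report 0 > /tmp/crash-0.txt then leaves you with a text file you can mail around, without anyone having to start the debugger.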
In a following article, I'll write about the Solaris Core Analyzer, a tool comparable to Q4 for walking through kernel dumps on Solaris.

Whoever thought that OpenSolaris was dead after the Oracle acquisition, might be wrong : OpenSolaris 2010.05 has been released with some important new features :
Update : seems that this was a link to a draft document.

I just finished a very interesting case of a core-dumping TSM client on Solaris. After investigating the core dump, it seemed that the TSM client barfed over an erroneous inode. Further diagnosis indeed revealed filesystem corruption, unfortunately on the root file system. Normally, one would boot from CD-ROM or issue a netboot to correct the corruption, but it turned out the Jumpstart config of the host was really foobarred. I had neither the time to correct the Jumpstart server config, nor the time to walk over to the data center to insert a Solaris DVD.
At times like that, I resort to little tricks in the boot sequence of Solaris : if you boot with boot -a -s, you are prompted for the location of the startup files. If you enter /dev/null for the /etc/system file, the host will continue to boot, but with a read-only filesystem :
Rebooting with command: boot -a -s
Boot device: /pci@8,600000/SUNW,qlc@4/fp@0,0/disk@0,0:b File and args: -a -s
Enter filename [kernel/sparcv9/unix]:
Enter default directory for modules [/platform/SUNW,Sun-Fire-280R/kernel /platform/sun4u/kernel /kernel /usr/kernel]:
=> Name of system file [etc/system]: /dev/null
SunOS Release 5.10 Version Generic_118833-24 64-bit
Copyright 1983-2006 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
root filesystem type [ufs]:
Enter physical name of root device [/pci@8,600000/SUNW,qlc@4/fp@0,0/disk@w500000e0155145d1,0:b]:
Booting to milestone "milestone/single-user:default".
Hostname: qwerty
SUNW,eri0 : 100 Mbps full duplex link up
Requesting System Maintenance Mode
SINGLE USER MODE
Root password for system maintenance (control-d to bypass):
single-user privilege assigned to /dev/console.
Entering System Maintenance Mode
After a few rounds of fsck, the root filesystem turned out to be repaired, and only 2 files seemed to be affected by the file system check. As the TSM client worked again, I could easily restore those from the backup.

An issue I recently encountered was a colleague complaining about several processes which kept hanging on a Solaris 10 machine. After investigation, it turned out that processes like format, powermt and even a dtrace invoked for diagnostics kept hanging, and could not even be killed :
# pkill -9 format
# ps -ef |grep -c format
2
In such cases, a good old truss session mostly explains what's going on; but in this case, truss came back with a quite peculiar message :
# truss -p 26632
truss: unanticipated system error: 26632
#
# pstack 26632
pstack: cannot examine 26632: unanticipated system error
#
# pfiles 26632
pfiles: unanticipated system error: 26632
In those cases, the only option you have is to rely on the kernel debugger to determine the cause :
# mdb -k
Loading modules: [ unix genunix specfs dtrace ufs sd pcisch md ip hook neti sctp arp usba fcp fctl ssd nca lofs zfs cpc fcip random crypto logindmux ptm nfs ipc ]
> ::pgrep format
S PID PPID PGID SID UID FLAGS ADDR NAME
R 1241 1 942 686 0 0x4a004900 000006001414c060 format
> 000006001414c060::thread
ADDR STATE FLG PFLG SFLG PRI EPRI PIL INTR
000006001414c060 inval/2000 1424 de50 0 0 0 0 n/a
> 000006001414c060::walk thread | ::findstack
stack pointer for thread 300012b7700: 2a10055cb01
[ 000002a10055cb01 cv_wait+0x38() ]
000002a10055cbb1 PowerSleep+0x14()
000002a10055cc71 PowerGetSema+0xe8()
000002a10055cd31 power_open+0x364()
000002a10055cea1 spec_open+0x4f8()
000002a10055cf61 fop_open+0x78()
000002a10055d011 vn_openat+0x500()
000002a10055d1d1 copen+0x260()
000002a10055d2e1 syscall_trap32+0xcc()
In this case, it was the PowerPath MPIO which was blocked on a semaphore. Further investigation revealed that the drivers for PowerPath were removed from the /etc/system file. Restoring the correct version of that file and a reboot solved the problem.
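The lookup above can be done in a single pipeline, because the output of ::pgrep can be piped straight into ::walk thread and ::findstack. A minimal sketch follows; the function name process_stacks is my own, and the name check is only there to keep stray characters out of the mdb command line :

```shell
# process_stacks NAME: print the kernel stack of every thread that belongs
# to a process called NAME. Hypothetical helper around the live kernel (mdb -k).
process_stacks() {
    case $1 in
        ''|*[!A-Za-z0-9._-]*)
            echo "usage: process_stacks NAME" >&2
            return 2 ;;
    esac
    # mdb reads the dcmd pipeline from standard input.
    echo "::pgrep $1 | ::walk thread | ::findstack" | mdb -k
}
```

Running process_stacks format as root should then print stacks like the one shown above for every hanging format process at once.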

(File under : 6yo stuff that I finally integrated into my blog)
Since I became a Unix system administrator, I have had the opportunity to create some Solaris packages. We all know how important decent package management is on Unix systems, and I have decent experience in packaging software on my Debian box at home. Apt-build and such are excellent tools under Debian, so I was a bit surprised that package management seemed so primitive under Solaris.
There are some scripts out there which do the job for you, but as is usual with scripts, you don't know what these things are doing on your machine. If you want to know all the tidbits about packaging on Solaris, you're in the right place.
So here is a HOWTO about creating Solaris packages based on my experience.
/export/home/youruser
!
+--- pkg
!     !
!     +--- usr
!           !
!           +--- local
!
+--- src
We will extract and build our software in the ~/src dir. Installation will happen in the ~/pkg/usr/local dir. So you really don't need to set up a chrooted environment, as you so often see explained elsewhere.
$ ./configure --exec-prefix=/export/home/youruser/pkg/usr/local --prefix=/export/home/youruser/pkg/usr/local
$ make && make install
$ cd /export/home/youruser/pkg/usr/local/
$ find . -print |pkgproto > prototype
With the pkginfo line added at the top, the prototype file looks like this :
i pkginfo=./pkginfo
d none lib 0755 kristof users
f none lib/libslang.a 0644 kristof users
d none include 0755 kristof users
f none include/slang.h 0644 kristof users
f none include/slcurses.h 0644 kristof users
pkgproto records your own user and group as the owner of every file, so change those to bin :
i pkginfo=./pkginfo
d none lib 0755 bin bin
f none lib/libslang.a 0644 bin bin
d none include 0755 bin bin
f none include/slang.h 0644 bin bin
f none include/slcurses.h 0644 bin bin
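Editing the owner and group by hand gets tedious for anything bigger than a small library, so here is a sketch of a filter that does both edits on the pkgproto output. The function name fix_prototype is made up; it assumes file and directory entries have the six fields shown above and leaves shorter entries (such as symlinks) alone :

```shell
# fix_prototype [OWNER] [GROUP]: read pkgproto output on stdin, print a
# prototype with the pkginfo line on top and OWNER/GROUP (default bin bin)
# on every six-field entry. Hypothetical helper.
fix_prototype() {
    fp_owner=${1:-bin}
    fp_group=${2:-bin}
    echo 'i pkginfo=./pkginfo'
    awk -v o="$fp_owner" -v g="$fp_group" '
        # Drop a pkginfo line that is already present, to avoid duplicates.
        $1 == "i" && $2 ~ /^pkginfo/ { next }
        # type class path mode owner group
        NF == 6 { $5 = o; $6 = g }
        { print }
    '
}
```

Used as find . -print | pkgproto | fix_prototype > prototype, it should give the second listing above directly.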
If you have pre- and postinstall scripts, add them to the prototype file with an 'i' line like the pkginfo one. The pkginfo file itself looks like this :
PKG="SCprog"
NAME="prog"
ARCH="intel"
VERSION="1.00"
CATEGORY="application"
VENDOR="Foo, Inc."
EMAIL="foo@net.net"
PSTAMP="Mr Pink"
BASEDIR="/usr/local"
CLASSES="none"
The most important entry is the BASEDIR line - it will specify where your software will be installed.
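If you package more than one piece of software, the pkginfo file is easy to generate. Here is a sketch with a made-up function name, write_pkginfo; ARCH, CATEGORY and CLASSES are simply copied from the example above, and the optional VENDOR, EMAIL and PSTAMP lines are left out :

```shell
# write_pkginfo PKG NAME VERSION [BASEDIR]: print a pkginfo file on stdout.
# Hypothetical helper; BASEDIR defaults to /usr/local as in the example above.
write_pkginfo() {
    if [ $# -lt 3 ]; then
        echo "usage: write_pkginfo PKG NAME VERSION [BASEDIR]" >&2
        return 2
    fi
    cat <<EOF
PKG="$1"
NAME="$2"
ARCH="intel"
VERSION="$3"
CATEGORY="application"
BASEDIR="${4:-/usr/local}"
CLASSES="none"
EOF
}
```

For example, write_pkginfo SCprog prog 1.00 > pkginfo recreates the core of the file shown above.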
$ pkgmk -r `pwd`
$ cd /var/spool/pkg
$ pkgtrans -s `pwd` ~/pkg/foo-0.1
$ pkgadd -d ~/pkg/foo-0.1
$ rm -rf ~/pkg/usr/local/ && mkdir -p ~/pkg/usr/local/
That's it ! Your package has been created !
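For repeated builds, the pkgmk and pkgtrans steps can be wrapped in a small function. This is only a sketch : the name build_package is mine, it assumes you run it from the directory that holds the prototype and pkginfo files, and it adds the -o flag so pkgmk overwrites an earlier build in /var/spool/pkg.

```shell
# build_package PKGINST OUTFILE: build PKGINST from the prototype in the
# current directory and write it to OUTFILE as a single datastream file.
# Hypothetical wrapper around pkgmk and pkgtrans.
build_package() {
    if [ $# -ne 2 ]; then
        echo "usage: build_package PKGINST OUTFILE" >&2
        return 2
    fi
    # The current directory is the root that prototype paths are relative to.
    pkgmk -o -r "$(pwd)" &&
        pkgtrans -s /var/spool/pkg "$2" "$1"
}
```

With the pkginfo from above, build_package SCprog ~/pkg/foo-0.1 should leave a file you can hand to pkgadd -d. Naming the package on the pkgtrans command line avoids its interactive package selection.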