OpenSolaris

Solaris core dump analysis with SUNWscat

Topics

I've previously tackled how Solaris core dumps can be investigated with mdb. There's another utility (comparable with Q4 on HP_UX), called SUNWscat. Scat is a tool to analyze kernel dumps (pun probably intended). Just download the SUNWscat package, install it on your server, and wait for a kernel crash to happen. When this happened, you'll find a unix.0 and vmcore.0 in the coredump directory (default /var/crash). When you fire up SUNWscat, you'll be presented with the following screen :

# scat 0
  Solaris[TM] CAT 4.1 (build 526) for Solaris 10 64-bit SPARC(sun4u)

  Copyright © 2003 Sun Microsystems, Inc. All rights reserved.
  Patents Pending. Use is subject to license terms.
  Sun Microsystems proprietary - DO NOT RE-DISTRIBUTE!

opening vmcore.0 ...dumphdr...symtab...core...done
loading core data: modules...panic...memory...time...misc...done
loading stabs...read_type_db: Wrong number of lines in database, or database
doesn't end in a newline
unable to load any stabs file
patches... - NOT AVAILABLE (No such file or directory) done

core file:      /var/crash/vmcore.0
user:           Super-User (root:0)
release:        5.10 (64-bit)
version:        Generic_112233-11
machine:        sun4u
node name:      boson
domain:         arda.org
hw_provider:    Sun_Microsystems
system type:    SUNW,Sun-Fire-V210
hostid:         837844c7
time of crash:  Tue Apr 22 11:49:52 EDT 2008
age of system:  22 hours 5 minutes 4.48 seconds
panic cpu:      0 (ncpus: 8)
panic string:   free: freeing free block, dev:0x200000016e, block:32032, ino:6057255, 
                fs:/homes

running sanity checks.../etc/system...ndd...sysent...misc...done
SolarisCAT(vmcore.0)>

The first thing you probably want to do, is investigating the crash reason :

SolarisCAT(vmcore.0)> analyze
PANIC: free: freeing free block, dev:0x%lx, block:%ld, ino:%lu, fs:%s
[...]
==== printing for generic panic information ====
cpu 0 had the panic

==== panic thread: 0x2a1003f7d40 ==== cpu: 0 ====
==== panic kernel thread: 0x2a1003f7d40  pid: 0  on cpu: 0 ====
cmd: sched

t_stk: 0x2a1003f7b50  sp: 0x1437751  t_stkbase: 0x2a1003f4000
t_pri: 60(SYS)  pctcpu: 0.000000  t_lwp: 0x0
t_procp: 0x1438518(proc_sched)  p_as: 0x1438400(kas)
last cpuid: 0
idle: 50 ticks (0.50 seconds)
start: Mon Apr 21 13:45:07 2008
age: 79485 seconds (22 hours 4 minutes 45 seconds)
stime: 2132 (22 hours 4 minutes 43.16 seconds earlier)
tstate: TS_ONPROC - thread is being run on a processor
tflg:   T_TALLOCSTK - thread structure allocated from stk
        T_DONTBLOCK - for lockfs
        T_PANIC - thread initiated a system panic
tpflg:  none set
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
        TS_SIGNALLED - thread was awakened by cv_signal()
pflag:  SSYS - system resident process
        SLOAD - in core
        SLOCK - process cannot be swapped

pc: 0x104a720   unix:panicsys+0x44:     call    unix:setjmp
startpc: 0x11a53f8      ufs:ufs_thread_delete+0x0:      save    %sp, -0xd0, %sp

unix:panicsys+0x44 (0x14a3158, 0x2a1003f74a0, 0x1438120, 0x1, 0x0, 0x0)
[...]
unix:thread_start+0x4 (0x3000026e828, 0x0, 0x0, 0x0, 0x0, 0x0)
-- end of kernel thread's stack --

SolarisCAT(vmcore.0)>

The proc command, for example, can tell you about the processes that were running at the time your system crashed. These processes are listed by default in reverse PID order.

SolarisCAT(vmcore.0)> proc

    addr       pid    ppid   uid      size      rss     swresv   time  command
------------- ------ ------ ------ ---------- -------- -------- ------ ---------
0x30003c8e040    283      1      0    3776512  1646592  1302528  90118 /usr/sbin/ssmon
0x30003c96a50    279      1      0    9306112  2514944  1769472     19 /usr/sbin/ssserver
0x30003bee030    256      1      0   27656192  2596864  1138688     57 /usr/sbin/nscd
0x30003c8ea58    243      1      0    2506752  1703936   466944      7 /usr/sbin/cron
0x30003c96038    240      1      0   18874368  2170880  2711552      7 /usr/sbin/syslogd
0x30000f60010    225      1      0    7217152  2400256  1146880    170 /usr/lib/autofs/automountd
0x300020c4a40    217      1      0    2260992  1572864   598016      3 /usr/lib/nfs/lockd
0x300020c5458    213      1      1    4677632  1974272   876544      2 /usr/lib/nfs/statd
0x300020c4028    201      1      0    2629632  2048000   835584     12 /usr/sbin/inetd -s
[...]

Advanced use of scat requires an in-depth understanding of the Solaris kernel. However, you can get a lot of useful information by using just the basic commands.

Crash dump analysis on (Open)Solaris (mdb)

Topics

We've tackled previously how to look at kernel dumps on HP-UX, let's have a look now how to perform them same on OpenSolaris. The kernel debugger is actually 'quite' user-friendly, and gives you mostly enough information how to handle a crash. If your Solaris is too stable to generate crashes, then use the

savecore -L 

command to generate one on the fly. This will generate a dump in /var/adm/crash. Let's have a look at it with mdb :

# mdb -k unix.0 vmcore.0 
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 uppc pcplusmp ufs ip sctp usba lofs zfs random ipc md fcip fctl fcp crypto logindmux ptm nfs ]
>

The ::status command will display high level information regarding this debugging session. This is mostly a one-liner, which reveals the reason of the crash.

> ::status
debugging crash dump vmcore.0 (64-bit) from hostname
operating system: 5.11 snv_43 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe80000ad3d0 addr=0 occurred in module "unix" due to a NULL pointer dereference
dump content: kernel pages only

The ::stack command will prove you with a stack trace, this is the same thing trace you would have seen in syslog or the console.

> ::stack
atomic_add_32()
nfs_async_inactive+0x55(fffffe820d128b80, 0, ffffffffeff0ebcb)
nfs3_inactive+0x38b(fffffe820d128b80, 0)
fop_inactive+0x93(fffffe820d128b80, 0)
vn_rele+0x66(fffffe820d128b80)
snf_smap_desbfree+0x78(fffffe8185e2ff60)
dblk_lastfree_desb+0x25(fffffe817a30f8c0, ffffffffac1d7cc0)
dblk_decref+0x6b(fffffe817a30f8c0, ffffffffac1d7cc0)
freeb+0x89(fffffe817a30f8c0)
tcp_rput_data+0x215f(ffffffffb4af7140, fffffe812085d780, ffffffff993c3c00)
squeue_enter_chain+0x129(ffffffff993c3c00, fffffe812085d780, fffffe812085d780, 1, 1)
ip_input+0x810(ffffffffa23eec68, ffffffffaeab8040, fffffe812085d780, e)

The ::msgbuf command will output the message buffer at the time of crash; the message buffer is most commonly used by sysadmins through the "dmesg" command.

> ::msgbuf
MESSAGE                                                               
....
WARNING: IP: Hardware address '00:14:4f:xxxxxxx' trying to be our address xxxx
WARNING: IP: Hardware address '00:14:4f:xxxx' trying to be our address xxxx

panic[cpu0]/thread=fffffe80000adc80: 
BAD TRAP: type=e (#pf Page fault) rp=fffffe80000ad3d0 addr=0 occurred in module "unix" due to a NULL pointer dereference

sched: 
#pf Page fault
Bad kernel fault at addr=0x0

One of the coolest commands is the cpuinfo -v command, which will show more information about the running processes at the time of the crash, including some nicely ascii-art style formatting :

>  ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  1 ffffffff983b3800  1f    1    0  59  yes    no t-0    fffffe80daac2f20 smtpd
                       |    |
            RUNNING   PRI THREAD           PROC
              READY                99 fffffe8000bacc80 sched
           QUIESCED
             EXISTS
             ENABLE

Other interesting commands are the ::ps (info about running processes), and ::panicinfo, which will reveal thread information, which you can further investigate with the ::walkthread option.

In a following article, I'll write about the Solaris Core Analyzer, which is a Q4 comparabe tool on Solaris to walk through kernel dumps.

OpenSolaris 2010.05

Topics

Whoever thought that OpenSolaris was dead after the Oracle acquisition, might be wrong : OpenSolaris 2010.05 has been released with some important new features :

  • ZFS deduplication : I've always predicted that this would once become a default feature of file systems, and ZFS is the first to implement this
  • IPS : the new Image Packaging System : the reworked package manager for OpenSolaris, a big deal in bringing the legacy Solaris package management to a higher level
  • USB support for VirtualBox guests
  • Gnome 2.28

Update : seems that this was a link to a draft document.

Boot Solaris with a RO root filesystem

Topics

I just finished a very interesting case of a coredumping TSM client on Solaris. After investigation of the core dump, it seemed that the TSM client barfed over an erroneous inode. Some more diagnosis revealed indeed filesystem corruption, unfortunately on the root file system. Normally, one would boot from CDROM or issue a netboot, to correct the corruption, but it turned out the Jumpstart config of the host was really foobarred. I neither did have the time to correct the Jumpstart server config, or walk over to the data center to insert a Solaris DVD.

At times like that, I resort to little tricks in the bootsequence of Solaris : if you boot with boot -a -s, you can specify the location of the startup files. If you enter a /dev/null for the /etc/system file, the host will continue to boot, but with a read-only filesystem :

Rebooting with command: boot -a -s
Boot device: /pci@8,600000/SUNW,qlc@4/fp@0,0/disk@0,0:b File and args: -a -s
Enter filename [kernel/sparcv9/unix]:
Enter default directory for modules [/platform/SUNW,Sun-Fire-280R/kernel /platform/sun4u/kernel /kernel /usr/kernel]:
=> Name of system file [etc/system]: /dev/null
SunOS Release 5.10 Version Generic_118833-24 64-bit
Copyright 1983-2006 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
root filesystem type [ufs]:
Enter physical name of root device
[/pci@8,600000/SUNW,qlc@4/fp@0,0/disk@w500000e0155145d1,0:b]:
Booting to milestone "milestone/single-user:default".
Hostname: qwerty
SUNW,eri0 : 100 Mbps full duplex link up
Requesting System Maintenance Mode
SINGLE USER MODE

Root password for system maintenance (control-d to bypass):
single-user privilege assigned to /dev/console.
Entering System Maintenance Mode

After a few rounds of fsck's, the root filesystem turned out to be corrected, and only 2 files seemed to be impacted by the file system check. As the TSM client worked again, I could easily restore those from the backup.

Unkillable processes

Topics

An issue I lately encountered was that a collegue complained about several processes which kept hanging on a Solaris 10 machine. After investigation, processes like format, powermt and even a for diagnostics invoked dtrace kept hanging, and could not even be killed :

# pkill -9 format
# ps -ef |grep -c format
2

In such cases, a good old truss session mostly explains what's going on; but in this case, truss came back with a quite peculiar message :

# truss -p 26632
truss: unanticipated system error: 26632
#
# pstack 26632
pstack: cannot examine 26632: unanticipated system error
#
# pfiles 26632
pfiles: unanticipated system error: 26632

In those cases, the only option you have is to rely on the kernel debugger to determine the cause :

# mdb -k
Loading modules: [ unix genunix specfs dtrace ufs sd pcisch md ip hook neti sctp arp usba fcp fctl ssd nca lofs zfs cpc fcip random crypto logindmux ptm nfs ipc ]
> ::pgrep format
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R   1241      1    942    686      0 0x4a004900 000006001414c060 format
> 000006001414c060::thread
            ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL             INTR
000006001414c060 inval/2000 1424 de50    0     0     0   0              n/a
> 000006001414c060::walk thread | ::findstack
stack pointer for thread 300012b7700: 2a10055cb01
[ 000002a10055cb01 cv_wait+0x38() ]
  000002a10055cbb1 PowerSleep+0x14()
  000002a10055cc71 PowerGetSema+0xe8()
  000002a10055cd31 power_open+0x364()
  000002a10055cea1 spec_open+0x4f8()
  000002a10055cf61 fop_open+0x78()
  000002a10055d011 vn_openat+0x500()
  000002a10055d1d1 copen+0x260()
  000002a10055d2e1 syscall_trap32+0xcc()

In this case, it was the PowerPath MPIO which was blocked on a semaphore. Further investigation revealed that the drivers for PowerPath were removed from the /etc/system file. Restoring the correct version of that file and a reboot solved the problem.

Creating (Open)Solaris packages

Topics

(File under : 6yo stuff that I finally integrated into my blog)
Since I became a Unix system administrator, I had the opportunity to create some Solaris packages. We all know how important decent package management is on Unix systems, and I have a decent experience in packaging software with my Debian box at home. Apt-build and such are excellent tools under Debian, so I was a bit surprised that package management seemed so primitive under Solaris.

There are some scripts out there which do the job for you, but as it is the habit with scripts, you don't know what these things are doing on your machine. If you want to know all the tidbits about packaging on Solaris, you're on the right place here.

So here is a HOWTO about creating Solaris packages based on my experience.

Intro and setup


You don't need root access for creating these packages; only if you want to install them, root access is needed. First, create the following directory structure in your homedir :
/export/home/youruser
	!
	+---  pkg
		!
		+--- src
		!
		+--- usr
			!
			+--- local

We will extract and build our software in the ~/src dir. Installation will happen in the ~/pkg/usr/local dir. So you really don't need to setup a chrooted environment as you see so many times explained in other places.

The procedure

  1. Extract your source in ~/pkg/src/foo-0.1 and cd into that directory.
  2. First, configure the software with
    $ ./configure --exec-prefix=/export/home/youruser/pkg/usr/local --prefix=/export/home/youruser/pkg/usr/local.
  3. Build your software and install it in your ~/pkg/usr/local dir :
    $ make && make install
    
  4. Now move to the installed software :
    $ cd /export/home/youruser/pkg/usr/local/
  5. We will now create the prototype file with
    $ find . -print |pkgproto > prototype
  6. Edit the prototype file and add i pkginfo=./pkginfo
    You'll get something like :
          i pkginfo=./pkginfo
          d none lib 0755 kristof users
          f none lib/libslang.a 0644 kristof users
          d none include 0755 kristof users
          f none include/slang.h 0644 kristof users
          f none include/slcurses.h 0644 kristof users
    
  7. Change in your prototype file all file owner info into bin:bin
    You should have something like :
          i pkginfo=./pkginfo
          d none lib 0755 bin bin
          f none lib/libslang.a 0644 bin bin
          d none include 0755 bin bin
          f none include/slang.h 0644 bin bin
          f none include/slcurses.h 0644 bin bin
    

    You should add pre- and postinstall scripts with a line like the pkginfo one.

  8. The part I allways forget is creating the pkginfo file :
          PKG="SCprog"
          NAME="prog"
          ARCH="intel"
          VERSION="1.00"
          CATEGORY="application"
          VENDOR="Foo, Inc."
          EMAIL="foo@net.net"
          PSTAMP="Mr Pink"
          BASEDIR="/usr/local"
          CLASSES="none"
    

    The most important entry is the BASEDIR line - it will specify where your software will be installed.

  9. Make the package with
    $ pkgmk -r `pwd`
  10. Go to the package dir :
    $ cd /var/spool/pkg
  11. Now create the package file itself :
    $ pkgtrans -s `pwd` ~/pkg/foo-0.1
  12. Go to your package (cd ~/pkg) and optionally install it with
    $ pkgadd -d foo-0.1
  13. Gzip it with gzip foo-0.1
  14. Don't forget to clean up your ~/pkg/usr/local dir :
    $ rm -rf ~/pkg/usr/local/ && mkdir -p ~/pkg/usr/local/

That's it ! Your package has been created !

LKFS

Topics

Sure, we all know ZFS, but little chance is that you know RBAC, FSS or SMF. If you do know these, you probably won't know SamFS or JET. These are more obfuscated features of Solaris that are quite handy though.
LKFS aka the Lesser Know Solaris Features is a ebook by Joerg Moellenkamp about generally unknown commands and functionalities in (Open)Solaris.

Time Slider

Topics

I recently came across a blogpost that explained one hell of an OpenSolaris feature, and a killer ZFS feature on the desktop : Time Slider. It is basically backup snapshots integrated into Nautilus. By moving a slider, you can browse throughout your snapshot backups. Screencast available !

Now, Glen Brunette developed the glue to combine the Amazon S3 cloud storage , encryption and Time Slider to provide a really neat way to backup your system. With this method, it will be easy (hey, it´s 0.1 ... but works really good at the moment) to have an automated backup at an offsite location.

NexentaCore 1.0

Topics

Nexenta Core platform 1.0 is now available. NexentaCore includes the complete OpenSolaris kernel (Nevada build 82) and runtime, along with Debian and GNU toolchains. NexentaCore is a minimal (core) foundation that can be used to quickly build servers, desktops, and custom distributions tailored for specialized applications. However, unlike NexentaOS desktop distribution, NexentaCore does not aim to provide a complete desktop.

Nexenta alfa5

Topics

Nexenta alfa release 0.5 has arrived. It is the first Nexenta release which supports Solaris zones. It uses debootstrap technology in conjunction with remote APT repository which simplifies *a lot* of Zone software management routines.

It is possible to pre-install fresh zone with custom software out of the APT repository. In this article you will see how apache gets pre-installed and pre-configured to operate in a newly created Nexenta Zone.