Friday, October 21, 2011

Why am I seeing SIGSEGV when I strace a Java application on Linux?!

A customer recently used strace on a WebLogic server that was having some trouble. The first thing that jumped out at them was the hundreds of SIGSEGV (segmentation fault) events in the output. They opened a support incident and asked what might be causing all of those segmentation faults.

Most people that have used Unix for any amount of time are familiar with occasionally seeing "Segmentation Fault (core dumped)" from poorly written programs. If that's all you knew about Unix and you looked at the output of strace on a Java process you'd think something was seriously wrong ("Wow, look at all these segfaults. Those guys at Sun/Oracle must be terrible programmers and they don't know what the hell they're doing!").

The real story is quite different - SIGSEGV in a Java process is almost always perfectly normal and completely safe.

Why?

First we need to go back a little bit and talk about signals.

Deep down under the covers a Unix process is really just the OS executing machine instructions one at a time. Unix needed some way to let a running process know that some external event happened - for example the OS needs a way to tell a process that someone hit control-C. That message gives the application an opportunity to clean up and shut down gracefully or, if the application wants, to ignore the signal, effectively saying "Sorry, not going to shut down right now". The OS inventors created signals for that purpose.

There's more background available at Wikipedia and in the signal man page ("man signal" if you have the right package installed).

The idea of threads was invented (much) later. In the early days of threads OS vendors each added their own thread APIs, and out of the effort to standardize them came the POSIX Threads (pthreads) API. Some smart programmers also figured out that they could use signals to help build threading without having to go change much of the OS at all. The original LinuxThreads implementation of pthreads, for example, registered its own signal handlers and used signals internally to suspend, restart and cancel threads. The current Linux implementation (NPTL) does most of that work with futexes instead, although it still reserves a couple of real-time signals for its own use. The man pages and docs for pthreads include more info about how it all works internally.

OK, so background all done (whew).
What does this have to do with Java? I thought you'd never ask.

The JVM is a multi-threaded process, so its threading library may already be using a signal or two under the covers. But the JVM is also doing a metric ton of other really clever stuff. For example, in a regular C/C++ program dereferencing a NULL (zero) pointer when you're expecting a pointer to some structure would cause your application to crash. That crash is actually, as you can probably guess by now, the OS sending your process a signal - specifically SIGSEGV. If your app didn't register a signal handler for that signal (and the vast majority of C/C++ apps out there don't) then the default action applies: the OS terminates the app and (usually) saves the memory state into a core file. The JVM does register a signal handler for SIGSEGV, and not just because it doesn't want to crash out when something goes wrong. The JVM registers a signal handler for SIGSEGV because it actually uses SIGSEGV and a bunch of other signals for its own purposes.

What purpose?

Signal | Description
SIGSEGV, SIGBUS, SIGFPE, SIGPIPE, SIGILL | Used in the implementation for implicit null check, and so forth.
SIGQUIT | Thread dump support: To dump Java stack traces at the standard error stream. (Optional.)
SIGTERM, SIGINT, SIGHUP | Used to support the shutdown hook mechanism (java.lang.Runtime.addShutdownHook) when the VM is terminated abnormally. (Optional.)
SIGUSR1 | Used in the implementation of the java.lang.Thread.interrupt method. (Configurable.) Not used starting with Solaris 10 OS. Reserved on Linux.
SIGUSR2 | Used internally. (Configurable.) Not used starting with Solaris 10 OS.
SIGABRT | The HotSpot VM does not handle this signal. Instead it calls the abort function after fatal error handling. If an application uses this signal then it should terminate the process to preserve the expected semantics.
Table stolen wholesale from http://download.oracle.com/javase/7/docs/webnotes/tsg/TSG-VM/html/signals.html
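
To make the SIGTERM/SIGINT/SIGHUP row a bit more concrete, here's a minimal sketch of a shutdown hook. The class name ShutdownHookDemo, the installHook helper and the messages are all made up for illustration; the only real API in play is java.lang.Runtime.addShutdownHook (plus its counterpart removeShutdownHook).

```java
public class ShutdownHookDemo {

    // Hypothetical helper: wraps the cleanup code in a thread and registers
    // it with the JVM. The JVM starts this thread when it begins shutting
    // down, including when that shutdown was triggered by SIGTERM, SIGINT
    // or SIGHUP.
    static Thread installHook(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "cleanup-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) throws InterruptedException {
        // Optional first argument: how many seconds to wait before exiting.
        long seconds = args.length > 0 ? Long.parseLong(args[0]) : 0;

        installHook(() -> System.out.println("shutdown hook ran, cleaning up"));

        System.out.println("running for " + seconds + " second(s)");
        Thread.sleep(seconds * 1000);
    }
}
```

If you run it with an argument such as 30 and hit control-C (or send the process a plain SIGTERM) during the wait, the hook should still get its chance to run. The hook also runs on a normal exit, and it won't run if the process is killed with SIGKILL, since that signal can't be caught.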

Basically the JVM is using SIGSEGV to catch null pointer dereferences.

Every time you do something like

try {
...
}
catch ( Exception e ) {
...
}
and the catch block executes because of a NullPointerException, what's actually happening under the covers is typically that a SIGSEGV shows up and the JVM handles it, turning the fault into the exception.

And that's perfectly normal and completely safe.
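
Here's a small sketch of what that looks like from the Java side. The ImplicitNullCheckDemo class, its Node class and the describe method are hypothetical names for illustration. Whether a given null dereference is detected through a SIGSEGV or through an ordinary compare-and-branch is up to the JVM (and can change as the code gets compiled), so treat this as something to experiment with under strace rather than a guaranteed way to produce the signal.

```java
public class ImplicitNullCheckDemo {

    // Hypothetical type with a single field to dereference.
    static final class Node {
        int value;
    }

    // Reads node.value. There is no explicit "if (node == null)" here;
    // when node is null the JVM detects it (possibly via SIGSEGV) and
    // raises a NullPointerException, which lands in the catch block.
    static String describe(Node node) {
        try {
            return "value=" + node.value;
        } catch (NullPointerException e) {
            return "caught NullPointerException";
        }
    }

    public static void main(String[] args) {
        Node present = new Node();
        present.value = 42;

        System.out.println(describe(present)); // prints "value=42"
        System.out.println(describe(null));    // prints "caught NullPointerException"
    }
}
```

Either way the program behaves the same: the application sees an ordinary exception and carries on, and any SIGSEGV involved is only visible to a tool like strace.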

Cribbing from that page on the JVM again:

In general there are two categories of situations where signal/traps arise.
  • Situations in which signals are expected and handled. Examples include the implicit null handling cited above. Another example is the safepoint polling mechanism, which protects a page in memory when a safepoint is required. Any thread that accesses that page causes a SIGSEGV, which results in the execution of a stub that brings the thread to a safepoint.
  • Unexpected signals. This includes a SIGSEGV when executing in VM code, JNI code, or native code. In these cases the signal is unexpected, so fatal error handling is invoked to create the error log and terminate the process.

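I don't recommend running strace against a Java process, but if you do, it's almost certainly safe to ignore any signals you see, including SIGSEGV. A SIGSEGV that the JVM wasn't expecting won't be hiding in the strace output anyway: as described above, the JVM writes its error log and terminates the process.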
