Thursday, October 29, 2009

Automatic Restart Script for a Java service

This post describes a pragmatic pattern for writing Unix start and stop scripts for a Java service that restarts automatically.
In many cases, you want a Java service to be up and running 24x7. In theory the garbage collector should deal with everything and, if the service is programmed correctly, the process should never crash. But in practice, things do happen. For example, a web service could encounter user requests that need far more memory than you ever expected.
What is a good maximum heap size anyway?
A pragmatic approach is to face the fact that your JRE could run out of memory and deal with it. There are sophisticated monitoring solutions out there to automatically restart processes (e.g. Nagios/Ganglia), but a poor man's solution is to automatically restart the JRE from the Unix start script.
Please note that you don't want to restart in every case. A bad command line option should simply stop the process rather than send it into the restart loop. Also, there must be a clean way to stop the service manually.
Under these constraints, the best solution I could find is to create a temporary file from the Java code at exactly the point 'of no return'. If the JRE stops before this point, no restart happens. If the JRE stops after this point, automatic restart will kick in.
    ... parse command line options ...

    // register shutdown hook
    Runtime.getRuntime().addShutdownHook(new ShutdownHook(...));

    // register uncaught exception handler (a static method on Thread)
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
      public void uncaughtException(Thread t, Throwable e) {
        if (e instanceof OutOfMemoryError) {
          System.err.println("FATAL: shutting down because of java.lang.OutOfMemoryError ...");
          System.exit(1); // leave the JRE so the start script can restart it
        }
      }
    });

    // point of no return: tell the shell to restart me
    File restart = new File("webservice.restart");
    restart.createNewFile(); // throws IOException if the file cannot be created

    ... start your service ...
The Unix start script below restarts the JRE if, and only if, the temporary file (webservice.restart) exists. It could look like this:

# try to start service once ("$@" keeps arguments with spaces intact)
${JAVA_HOME}/bin/java -DREPLAYWEB ${JOPT} "$@"

# restart again (until webservice.restart file was removed)
while [ -f webservice.restart ]; do
  echo "### RESTART ###"
  /bin/rm -f webservice.restart # let Java decide if we really want to restart
  ${JAVA_HOME}/bin/java -DREPLAYWEB ${JOPT} "$@"
done
So Java decides if the service should be restarted and the shell actually performs the restart.
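To watch this handshake without starting a JVM, the loop can be wrapped in a function and driven by a stand-in command. This is only a sketch: run_with_restart, fake_service and the runs.count file are names made up for the illustration, and fake_service merely imitates a service that dies twice after the point of no return and then shuts down cleanly.

```shell
#!/bin/sh
# Dry run of the file-based restart handshake with a stand-in for the JVM.
# run_with_restart, fake_service and runs.count are hypothetical names.

MARKER=webservice.restart

# Run the given command once, then again for as long as it leaves the marker behind.
run_with_restart() {
    "$@"
    while [ -f "$MARKER" ]; do
        echo "### RESTART ###"
        rm -f "$MARKER"        # the service must recreate it to ask for another restart
        "$@"
    done
}

# Stand-in service: reaches the point of no return on every run, "crashes" on
# runs 1 and 2 (marker stays) and shuts down cleanly on run 3 (marker removed).
fake_service() {
    n=$(( $(cat runs.count 2>/dev/null || echo 0) + 1 ))
    echo "$n" > runs.count
    touch "$MARKER"            # point of no return reached
    if [ "$n" -ge 3 ]; then
        rm -f "$MARKER"        # clean shutdown: no restart wanted
    fi
    echo "run $n finished"
}
```

Calling run_with_restart fake_service in an empty directory runs the stand-in three times with two restarts in between, and leaves no marker file behind.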
An alternative would have been to simply use return codes from System.exit().
But then the question would be: what is the return code for a not-yet-known exception?
If someone else uses kill -9 on the JRE, the shutdown hook is not invoked at all. And to manually stop the restarting, you would have to kill the start script as well as the JRE.
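For comparison, here is roughly what the exit-code alternative would look like on the shell side. It is a sketch of the approach I decided against, not of the script used here; the restart code 42 and the name run_until_done are arbitrary choices for the example, and the Java side would have to call System.exit(42) whenever it wants to be restarted.

```shell
#!/bin/sh
# Exit-code variant of the restart loop (the rejected alternative).
# RESTART_CODE and run_until_done are hypothetical.

RESTART_CODE=42

# Rerun the given command as long as it exits with RESTART_CODE;
# return its status as soon as it exits with anything else.
run_until_done() {
    while :; do
        rc=0
        "$@" || rc=$?
        if [ "$rc" -ne "$RESTART_CODE" ]; then
            return "$rc"       # any other status ends the loop, expected or not
        fi
        echo "### RESTART ###"
    done
}
```

The weakness shows up in the return branch: a JRE that dies from kill -9 is reported by the shell with status 137, not 42, so it would not be restarted unless every such status were added to the check by hand.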
With this file-based approach, the stop script is pretty simple: remove the temporary file and kill the JRE. Please note that finding the proper Java process is not as simple as it seems: the classpath is usually very long, and 'ps -f' may no longer show the class name because the line gets too long. On Linux you can use the --col option to see a longer output, but that doesn't work on Solaris :-(
A little trick around this is to use a dummy JRE property, -DREPLAYWEB in this case. This dummy property has no meaning except that it shows up in the ps output before the classpath, and you can make it unique enough to identify only this instance of the JRE.
The stop script would then perform these steps:
  1. get the 'ps line' that contains the dummy JRE property (REPLAYWEB)
  2. get the process ID (pid) of that process (awk is good enough)
  3. remove the temporary restart file so the start script won't restart automatically
  4. kill the process
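#!/bin/sh -f

# steps 1 and 2: find the 'ps line' with the dummy property and extract the pid
psline=`/bin/ps -aef | /bin/grep "REPLAYWEB" | /bin/grep -v grep`
echo $psline
pid=`echo $psline | /bin/awk '{ print $2 }'`
if [ -n "$pid" ]
then
  # step 3 must come before step 4, or the start script restarts the JRE
  /bin/rm -f webservice.restart
  echo "stop_webservice: killing Web Service with pid=$pid"
  kill $pid
else
  echo "stop_webservice: Web Service was not running"
fi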
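One limitation of the script above: echo $psline joins all matching lines into one, so only the first pid is picked up if several JREs carry the marker. A possible variation, sketched here with a hypothetical helper called pids_for_marker, prints one pid per matching line. It reads the ps output from stdin so it can be tried on canned input.

```shell
#!/bin/sh
# Print the pid (column 2 of 'ps -ef' style output) of every line on stdin
# that contains the marker. pids_for_marker is a hypothetical helper name.
pids_for_marker() {
    marker=$1
    grep -- "$marker" | grep -v grep | awk '{ print $2 }'
}

# Typical use (not executed here):
#   /bin/ps -aef | pids_for_marker REPLAYWEB
```

As in the original script, the grep process that searches for the marker is filtered out with grep -v grep.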
