Waking up a server

by Thomas

For one of our web sites we programmed a server that provides us with information about the whole friendship network on this site: Who is friends with who, six degrees of separation information, etc. Back in 2007, we wrote that server in C and since we started using it, it was running with absolute reliability.

Suddenly it stopped working, without any prior warning. It did not crash, the process was still alive. It just did not response to any requests that we sent to it. All requests timed out. And even worse, there was no helpful log information that pointed us to any specific direction we could look for the error.

Check list

The good thing was, since we programmed the server, we already had several points to think about first. On top of a C programmer’s check list is the segmentation fault, that happens when trying to access a memory area that the process does not own. But we could almost exclude this point, since the server would have logged any segmentation fault it did. This point got crossed out.

This left a second point on our check list: Blocking. Our server creates worker threads that deal with each client’s requests. For responding to the requests, all threads are sharing the same information about the friendship network, or, in a mathematical term, the graph. Here we are using mutexes in order to keep the information consistent (by preventing to have more than one writing operation at a time or to have a read request while there was some data written).

Mutual exclusion? Dining philosophers? Race condition? Deadlock?

Slowing things down

Yes, that was a lead. The next question was how to reproduce a race condition (that we only assumed was happening), in order to get a deadlock (which we didn’t know where it was happening). Also, we are only using one mutex at a certain time, how could there be a deadlock anyway?

Race conditions are hard to find and they only appear once in a while, depending on how the operating system schedules the threads. However, we tried to slow things down by leaving in debug information and by using additional functions that were stopping our server from being fast.

Great! We were in fact able to reproduce it! Problem not solved yet, though.

While our server was running the tests, we used strace to trace all system calls it made. Additionally, once it blocked, we checked the function stack using gdb, to get information about the location where it blocked.

Beware of asynchronous signals and syslog()

Both tools proved our assumption — our server’s threads were waiting for a mutex to get unlocked. It never got, apparently. The application was stuck in this function (output from strace):

$ strace -p 19888
Process 19888 attached - interrupt to quit
futex(0x2b412b57c514, FUTEX_WAIT, 2, NULL

After checking the whole output of strace we were able to localize the problem: Each time our server called syslog() to log some information and our server at the same time received an asynchronous signal (a SIGALRM in our case), not much later it would block. We now understood the problem much better. syslog() (and all the other three functions of the system logger) are not async signal safe (see the list of all functions that should be async signal safe). They must not be interrupted by a signal while they are called. They use the mutex syslog_lock to prevent several information, that is sent by several processes at the same time, being written to the same file (simplified):

void syslog(int pri, const char* fmt, ...)
{
    __libc_lock_lock(syslog_lock);
    /*
     * write the log information
     */
    __libc_lock_unlock(syslog_lock);
}

In our case, our server received a SIGALRM which asynchronously interrupted this function before it had had the time to unlock the mutex — syslog_lock was still locked, although we had already left the function for the signal handler. This alone doesn’t cause any harm, unless… unless you use the same syslog() call again in the same process in your signal handler.

Well, we did. This caused the signal handler to wait for syslog_lock to get unlocked, which it never got, because the same process locked it. This process, however, was now stuck in the signal handler function and could never again unlock the mutex in the syslog function.

Since our main thread was handling the signals and thus got stuck, it could no longer receive any requests from any of our clients. Although it was still alive as a process, it was busy waiting for the unlocking of the mutex. This was the reason why we could not reach our server at that time.

Solution

For each problem there is a solution… or at least there should be a solution. Anyway, the first one is quite obvious: Don’t use such kind of function in your signal handler. Yes, easy.

The second one is to block signals before you call such a function, in our case syslog(). Using a wrapper function for calling syslog() could look like this:

sigset_t __sigset;

void snsd_log(int priority, const char* message, ...)
{
    va_list args;
    pthread_sigmask(SIG_BLOCK, &__sigset, (sigset_t*)NULL);
    va_start(args, message);
    vsyslog(priority, message, args);
    va_end(args);
    pthread_sigmask(SIG_UNBLOCK, &__sigset, (sigset_t*)NULL);
}

Blocking certain signals by using pthread_sigmask() will make the operating system queue up incoming signals for us. The OS will deliver these signals not before we explicitly unblock them (in the code above calling once again pthread_sigmask()). Using this approach, we can even use the service provided by syslog in our signal handler.

Conclusion

We are now more aware of what might happen when there are incoming asynchronous signals while having locked a mutex. We also learned more about both thread handling and thread cancellation (even though this was not related to this topic, it has been very useful). Besides this, we experienced once more, how powerful and helpful strace can be.

Posted in Development and tagged . Bookmark the permalink.


Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Join Our Team in Shanghai

Now hiring PHP Developer in Shanghai