Wednesday, March 10, 2010

How to write more reliable servers: dealing with failures

Given enough time, most Internet servers will crash. It could be a memory leak, some unexpected behavior from a new browser, or a deliberate denial-of-service attack. It could even be a hardware problem, such as flaky memory chips.

The question is: how do you deal with this?

If you take your basic C/C++ TCP/IP server application, when it dies, it dies.
Many people set up some mechanism to monitor it and page a technician to restart it. This has many drawbacks and can lead to prolonged downtime if alerts are missed or crashes become frequent.

Web servers only became reliable when NCSA 1.4.1 came out. It was very well thought out and well written, and most web servers since have copied the mechanisms it used. NCSA used a parent process that spawned child processes to actually do the work. When a child process died, the parent woke up and immediately restarted it.

In my former company IBS (Internet Broadcast Systems), and later at DVBS (Digital Video Broadcast Systems), we called this a keep_alive.


The simplest example is a shell script:
while true
do
    server
    echo "server crashed" >> log   # append, so earlier crashes aren't overwritten
    sleep 1
done
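A slightly hardened sketch of the same loop. Here `false` stands in for the real server binary (an assumption, so the demo exits immediately), and the loop gives up after a few crashes instead of spinning forever on a binary that dies on startup:

```shell
#!/bin/sh
# keep_alive.sh - restart the server until it has crashed MAX_FAILS times
MAX_FAILS=5
fails=0
log=keep_alive.log

while [ "$fails" -lt "$MAX_FAILS" ]
do
    false                                     # stand-in for the real ./server
    echo "$(date): server crashed" >> "$log"  # append, don't truncate
    fails=$((fails + 1))
    sleep 1
done
echo "$(date): giving up after $MAX_FAILS crashes" >> "$log"
```

In production you would replace `false` with the server's path and probably drop the crash limit, but the limit is handy while debugging a server that crashes immediately.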


Within the C server code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    pid_t pid;

    printf("Started...\n");

...

  Refork:
    switch (pid = fork()) {
    case -1:
        printf("\nCan't fork!\n");
        FatalError(3);          /* application's fatal-error handler */
    case 0:
        break;                  /* child: fall through and do the real work */
    default:
        wait(NULL);             /* parent: block until the child exits */
        printf("\nChild died!\n");
        goto Refork;            /* immediately fork a replacement */
    }

    printf("Forked...\n");