Watchdog Timer for servers

Overview

Linux admins often need to reboot remote machines to load a new kernel or apply system updates. Unfortunately, there is always a chance that the system will not come back up, for example because of a mistake in /etc/fstab or a kernel panic with the new kernel image. This is especially annoying on machines that are expected to be available all the time.

While recovering from some errors requires the admin to physically access the machine and repair the system using a Live CD, recovering from a kernel panic can be accomplished easily using two simple things:

  1. the LILO fallback=... option, plus
  2. a hardware watchdog.

I've designed a simple yet powerful hardware watchdog especially for servers. It connects via an RS232 interface and requires no drivers at all: it runs its own platform-independent firmware, which allows control via scripts.

Usage

The watchdog is connected via an RS232 (COM) port. It also needs to be connected to the computer's +5V power line on the well-known 'molex' connector. Resetting is achieved through the Reset button connector on the mainboard. An optocoupler isolates the mainboard from the watchdog, which makes operation much safer.

To use the watchdog, one needs to activate it first. The watchdog has two operation modes. In 'heartbeat' mode, it resets the computer when the computer fails to confirm that it is alive within the specified amount of time. In 'delayed blast' mode, the timer resets the computer after a specified amount of time unless it is "disarmed" properly before the time runs out. This mode is useful for performing kernel upgrades.
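The difference between the two modes can be sketched as a small timer model. This is only an illustration of the logic described above, not the firmware: the names WatchdogModel, Mode, arm, heartbeat, disarm and should_reset are all made up for this sketch, and time is passed in explicitly so the model can be exercised without hardware.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    IDLE = "idle"                    # not activated, never resets
    HEARTBEAT = "heartbeat"          # host must keep confirming it is alive
    DELAYED_BLAST = "delayed blast"  # host must disarm before the deadline


@dataclass
class WatchdogModel:
    """Hypothetical model of the timer logic; all names are assumptions."""
    mode: Mode = Mode.IDLE
    timeout: float = 0.0
    deadline: float = 0.0

    def arm(self, mode: Mode, timeout: float, now: float) -> None:
        # Activate the watchdog and start the countdown.
        self.mode = mode
        self.timeout = timeout
        self.deadline = now + timeout

    def heartbeat(self, now: float) -> None:
        # Only heartbeat mode restarts the countdown; in delayed blast
        # mode a heartbeat does not postpone the reset.
        if self.mode is Mode.HEARTBEAT:
            self.deadline = now + self.timeout

    def disarm(self) -> None:
        # Cancel the pending reset in either mode.
        self.mode = Mode.IDLE

    def should_reset(self, now: float) -> bool:
        # The reset line would be pulsed once the deadline has passed.
        return self.mode is not Mode.IDLE and now >= self.deadline
```

In this model, arming in delayed blast mode with a 300 second timeout and then never calling disarm() makes should_reset() true from second 300 onwards, which is the behaviour one would rely on when rebooting into an untested kernel.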

Technical specification

The full technical specification, along with prices, is as follows:

The downloadable resources include:

Management

The watchdog has not been built yet, so there is no firmware written, but the interface will probably be command-line driven. The commands would include

The typical usage would be:

One could possibly safely change the kernel to a new version with:

There would also be a 'heartbeat' option to detect lockups, like this
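As a rough illustration of the host side of heartbeat mode, the following sketch periodically writes a confirmation to the watchdog's serial line. Since no firmware exists yet, the command string HEARTBEAT and the function name send_heartbeats are pure assumptions; the port argument is any binary file-like object, such as an already opened and configured serial device.

```python
import time

# Hypothetical command; the real firmware command set is not defined yet.
HEARTBEAT_COMMAND = b"HEARTBEAT\n"


def send_heartbeats(port, interval, count, sleep=time.sleep):
    """Write `count` heartbeat confirmations to `port`, `interval` seconds apart.

    `port` is any binary file-like object with write() and flush().
    `sleep` is injectable so the loop can be tested without waiting.
    """
    for _ in range(count):
        port.write(HEARTBEAT_COMMAND)
        port.flush()  # make sure the confirmation leaves the host immediately
        sleep(interval)
```

The interval has to be shorter than the timeout configured on the watchdog. If the machine locks up, the loop stops with it, the confirmations stop arriving, and the watchdog resets the computer.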