Watchdog Timer for servers
Overview
Linux admins often need to reboot remote machines in order to load a new kernel or apply system updates. Unfortunately, there is always a chance that the machine will not come back up, for example because of a mistake in /etc/fstab or a kernel panic with the new kernel image. This is especially annoying on machines that are expected to work all the time.
While recovering from some mistakes requires the admin to physically access the machine and repair the system using a Live CD, recovering from a kernel panic can be accomplished easily using two simple things:
- The LILO fallback=... option, plus
- a hardware watchdog.
I've designed a simple yet powerful hardware watchdog especially for servers. It connects via an RS232 interface and requires no drivers at all: it runs its own platform-independent firmware, which allows control via scripts.
Usage
The watchdog is connected to an RS232 (COM) port. It takes its power from the computer's +5V line on the well-known 'molex' connector. Resetting is achieved through the Reset button connector on the mainboard. An optocoupler isolates the mainboard from the watchdog, which makes operation much safer.
In order to use the watchdog, one needs to activate it first. The watchdog has two operation modes. In 'heartbeat' mode, it resets the computer when the computer doesn't confirm it's alive within the specified amount of time. In 'delayed blast' mode, the timer resets the computer after a specified amount of time unless it is "disarmed" before the time runs out. This mode is useful for performing kernel upgrades.
Technical specification
The full technical specification, along with prices, is as follows:
- Main microcontroller: AT89C2051 - 5.5zł = $1.5
- EEPROM for settings: ST24C02 (256 bytes is more than enough) - 1.5zł = $0.5
- RS232 interface: MAX232 - 3.5zł = $1
- Optocoupler (for interfacing different mainboards safely): CNY17 - 3.5zł = $1
- Crystal: 11.0592MHz (the device sleeps most of the time anyway) - 1.5zł = $0.5
- LCD: OPTIONAL HD44780-compatible display for server monitoring purposes - about 10-50zł = $3-$15, depending on model
- Two general-purpose push-buttons
The downloadable resources include:
- schematic [24k]
Management
The watchdog has not been built yet, so no firmware has been written, but the interface will probably be command-line driven. The commands would include:
- IMPULSELENGTH [seconds]
- ARM [time in seconds]
- DISARM
- HEARTBEAT [timeout in seconds]
- BEAT
- REBOOT
- STATUS
- LCD PRINT [string]
- LCD CLEAR
- LCD GOTO [line] [position]
- LCD SIZE [lines] [columns]
- LCD BLINK|NOBLINK|CURSOR|NOCURSOR|...
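Since no firmware exists yet, the exact behaviour of the timer commands is open. The following is only a sketch of how ARM, DISARM, HEARTBEAT and BEAT might interact, written as a small shell simulation. The function name wd_simulate, the "TIME COMMAND [ARG]" input format, and the rules that BEAT is ignored outside heartbeat mode and that DISARM stops either mode are all assumptions made for illustration.

```shell
# wd_simulate (hypothetical): read "TIME COMMAND [ARG]" lines from stdin,
# with TIME in whole seconds in ascending order, and print when the
# watchdog would reset the machine, or "no reset".
# Input is assumed to be well formed (ARM and HEARTBEAT carry a number).
wd_simulate() {
    deadline=""   # empty: no timer is running
    timeout=""    # non-empty only in heartbeat mode
    while read -r now cmd arg; do
        # If the deadline passed before this command arrived, the reset
        # has already happened and the command never gets processed.
        if [ -n "$deadline" ] && [ "$now" -ge "$deadline" ]; then
            echo "reset at $deadline"
            return 0
        fi
        case "$cmd" in
            ARM)       deadline=$((now + arg)); timeout="" ;;
            HEARTBEAT) deadline=$((now + arg)); timeout="$arg" ;;
            BEAT)      if [ -n "$timeout" ]; then
                           deadline=$((now + timeout))
                       fi ;;
            DISARM)    deadline=""; timeout="" ;;
        esac
    done
    # No more commands: a timer that is still running will eventually fire.
    if [ -n "$deadline" ]; then
        echo "reset at $deadline"
    else
        echo "no reset"
    fi
}
```

For example, printf '0 ARM 600\n300 DISARM\n' | wd_simulate prints "no reset", while printf '0 HEARTBEAT 60\n20 BEAT\n90 BEAT\n' | wd_simulate prints "reset at 80", because the second BEAT arrives after the 60-second window that started at 20.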
The typical usage would be:
- stty 57600 < /dev/ttyS0
- echo 'COMMAND ...' > /dev/ttyS0
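The two steps above could be wrapped in a pair of small shell functions. This is a minimal sketch: the names WD_PORT, wd_init and wd_send are made up here, and terminating each command with a newline is an assumption about firmware that does not exist yet.

```shell
# WD_PORT, wd_init and wd_send are hypothetical names for this sketch.
WD_PORT="${WD_PORT:-/dev/ttyS0}"

# Set the line speed once. Skipped when WD_PORT is not a character
# device, so a plain file can stand in for the port during a dry run.
wd_init() {
    if [ -c "$WD_PORT" ]; then
        stty 57600 < "$WD_PORT"
    fi
}

# Send one command line; "wd_send ARM 600" writes "ARM 600".
# Assumption: the firmware accepts newline-terminated commands.
wd_send() {
    if [ "$#" -eq 0 ]; then
        echo "usage: wd_send COMMAND [ARGS...]" >&2
        return 2
    fi
    # "$*" joins all arguments with single spaces.
    printf '%s\n' "$*" > "$WD_PORT"
}
```

Setting WD_PORT to a temporary file before calling wd_send makes it possible to see exactly what would be written without touching the serial port.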
One could then switch to a new kernel version safely with:
- [configure and install the new kernel]
- [configure lilo.conf to load the kernel and specify a fallback option]
- echo 'ARM 600' > /dev/ttyS0
- reboot
- [re-ssh to the server after reboot and ensure everything is OK]
- echo 'DISARM' > /dev/ttyS0
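The "ensure everything is OK" step could also be scripted, so that DISARM is only sent when a check of your choosing passes. A rough sketch follows; wd_disarm_if_ok is a hypothetical helper, and the check command is whatever the admin considers proof of a healthy boot.

```shell
# wd_disarm_if_ok (hypothetical): run a check command and send DISARM
# only if it succeeds. On failure the watchdog is left armed, so the
# pending reset (and the LILO fallback) still takes place.
# Usage: wd_disarm_if_ok PORT CHECK [ARGS...]
wd_disarm_if_ok() {
    if [ "$#" -lt 2 ]; then
        echo "usage: wd_disarm_if_ok PORT CHECK [ARGS...]" >&2
        return 2
    fi
    port="$1"
    shift
    # The remaining arguments are executed as the health check.
    if "$@"; then
        printf 'DISARM\n' > "$port"
    else
        echo "health check failed; leaving the watchdog armed" >&2
        return 1
    fi
}
```

A possible call would be wd_disarm_if_ok /dev/ttyS0 test "$(uname -r)" = "2.6.30", where the release string is only an example value for the newly installed kernel.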
There would also be a 'heartbeat' option to detect lockups, used like this:
- [start a screen session]
- while sleep 20; do echo 'BEAT' > /dev/ttyS0; done
- [detach]
- echo 'HEARTBEAT 60' > /dev/ttyS0
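Instead of a loop typed into a screen session, the heartbeat could live in one function. This is a sketch under assumptions: wd_heartbeat and its parameters are invented here, and unlike the steps above it enables heartbeat mode first and then starts beating, which is safe only as long as the interval is shorter than the timeout.

```shell
# wd_heartbeat (hypothetical): enable heartbeat mode, then send BEAT
# at a fixed interval.
# Usage: wd_heartbeat [PORT] [INTERVAL] [TIMEOUT] [COUNT]
wd_heartbeat() {
    port="${1:-/dev/ttyS0}"
    interval="${2:-20}"   # seconds between beats
    timeout="${3:-60}"    # watchdog timeout in seconds
    count="${4:-0}"       # number of beats to send; 0 means run forever

    # A beat must always arrive before the watchdog's timeout expires.
    if [ "$timeout" -le "$interval" ]; then
        echo "timeout must be greater than interval" >&2
        return 2
    fi

    printf 'HEARTBEAT %s\n' "$timeout" > "$port"
    sent=0
    while [ "$count" -eq 0 ] || [ "$sent" -lt "$count" ]; do
        sleep "$interval"
        printf 'BEAT\n' >> "$port"
        sent=$((sent + 1))
    done
}
```

Running wd_heartbeat /dev/ttyS0 20 60 in the background matches the values used above. The COUNT argument exists mainly so the loop can be tried against a plain file and stopped after a few beats.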