MOSS Farm Monitoring – Everybody online?

I'm working on a pre-deployment "health check" to determine whether some common gotchas in a farm are going to be a problem. One problem in particular that I needed to solve is determining whether a MOSS farm member server is actually online and functioning - specifically, whether its Timer Service is running.

(Slight detour - Solution Deployment in a farm environment totally relies on Timer Jobs - when you deploy a solution, MOSS creates a Solution Deployment Timer Job that every server in the farm executes. If a server is offline, that solution deployment job will hang indefinitely until the offline server comes back - by design, so that you don't consider a solution deployment "successful" without all servers receiving the deployment.)

There are obviously the normal "Windows" ways to figure this out - can you ping each member, access a file share, etc. But I actually wanted to know that the member was not only running, but actively communicating with MOSS.

My goal was to come up with, for lack of a better term, a "Farm Ping": I would kick off some kind of "dummy" timer job that every farm member would have to run before the "ping" counted as complete. If I queued this job and all member servers ran it fine, I would know everybody was online. If it didn't finish, I'd know I had a problem.

The only issue with this technique is that I would need to create some type of custom timer job, package it up in a solution, and deploy it whenever I wanted to run the test. Really, that just adds another moving piece to my equation. Looking around, though, I found a great alternative that MOSS already provides - the Config Refresh Timer Job.

After watching my Timer Job Monitoring Utility for a couple of minutes, I noticed a pattern. Every farm member that runs Timer Jobs runs the "Config Refresh" job every 30 seconds or so. If I stop the Timer service on a member server, that server's Config Refresh job stops running, so its last run just keeps getting older. Start the Timer service again, and within a minute or so the Config Refresh job runs and is back on its 30-second schedule.

What does this mean? Well, we can simply get SPFarm.Local.TimerService.RunningJobs and take a look at the "Config Refresh" jobs in that collection. If one of them is noticeably more than 30 seconds old (leave a little slack, since the schedule is only approximate), we know that the server it belongs to is not online, or at least that its Timer service isn't running. Simple, easy, and surprisingly accurate.
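To make the age check concrete, here is a minimal sketch of just the comparison logic, written in Python so it can run without a farm. It does not call the SharePoint object model. The function name, the server names, and the 30 seconds of extra slack are all made up for illustration. It assumes you have already collected, per server, the time of its most recent Config Refresh run.

```python
from datetime import datetime, timedelta

# Assumption: Config Refresh runs roughly every 30 seconds, as observed above.
EXPECTED_INTERVAL = timedelta(seconds=30)
# Assumption: a hypothetical safety margin so a slightly late run is not flagged.
SLACK = timedelta(seconds=30)


def find_stale_servers(last_runs, now, max_age=EXPECTED_INTERVAL + SLACK):
    """Return the sorted names of servers whose last Config Refresh is older than max_age.

    last_runs maps a server name to the datetime of that server's most
    recent Config Refresh run (hypothetical input, collected elsewhere).
    """
    return sorted(
        name for name, ran_at in last_runs.items() if now - ran_at > max_age
    )


# Made-up sample data: two healthy members and one whose Timer service is stopped.
now = datetime(2008, 1, 1, 12, 0, 0)
last_runs = {
    "WEB01": now - timedelta(seconds=12),
    "WEB02": now - timedelta(seconds=41),
    "APP01": now - timedelta(minutes=7),
}

print(find_stale_servers(last_runs, now))  # ['APP01']
```

WEB02 is 41 seconds old here and still passes, which is the point of the slack: with a strict 30-second cutoff, a healthy server that is merely a few seconds late would be reported as offline.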