Wednesday, January 02, 2013

SUPERVISOR: detecting faults and fixing them automatically

I've added a new protocol SUPERVISOR [1] to master, which can periodically check for certain conditions and correct them if necessary. This will be in the next release (3.3) of JGroups.

You can think of SUPERVISOR as an automated rule-based system admin.

SUPERVISOR was born out of a discussion on the mailing list [2] where a bug in FD caused the failure detection task in FD to be stopped, so members would not get suspected and excluded anymore. This is bad if the suspected member was the coordinator itself, as new members would not be able to join anymore !

Of course, fixing the bug [3] was the first priority, but I felt that it would be good to also have a second line of defense that detected problems in a running stack. Even if a rule doesn't fix the problem, it can still be used to detect it and alert the system admin, so that the issue can be fixed manually.

The documentation for SUPERVISOR is here: [4].





1 comment:

  1. From the title I almost thought it would evolve the JGroups bytecode itself!
    Interesting concept.