Belas Blog: Confessions of a serial protocol designer

I have a confession to make.

I'm utterly disgusted by my implementation of FD_ALL, and thanks to David Forget for pointing this out !

What's bad about FD_ALL ? It will not scale at all ! After having written several dozen protocols, I thought an amateurish mistake like the one I'm about to show would certainly not happen to me anymore. Boy, was I wrong !

FD_ALL is about detecting crashed nodes in a cluster, and the protocol then lets GMS know so that the crashed node(s) can be excluded from the view.

Let's take a look at the design.

Every node periodically multicasts a HEARTBEAT
This message is received by everyone in the cluster and a hashmap of nodes and timestamps is updated; for a node P, P's timestamp is set to the current time
Another task run at every node periodcially iterates through the timestamps and checks if any timestamps haven't been updated for a given time. If that's the case, the members with outdated timestamps are suspected
A suspicion of P results in a SUSPECT(P) multicast
On reception of SUSPECT(P), every node generates a SUSPECT(P) event and passes it up the stack
VERIFY_SUSPECT catches SUSPECT(P) and sends an ARE_YOU_DEAD message to P
If P is still alive, it'll respond with a I_AM_NOT_DEAD message
If the sender doesn't get this message for a certain time, it'll pass the SUSPECT(P) event further up the stack (otherwise it'll drop it), and GMS will exclude P from the view, but if and only if that given node is the coordinator (first in the view)

Can anyone see the flaw in this design ? Hint: it has to do with the number of messages generated...

OK, so let's see what happens if we have a cluster of 100 nodes:

Say node P is temporarily slow; it doesn't send HEARTBEATs because a big garbage collection is going on, or the CPU is crunching at 90%
99 nodes multicast a SUSPECT(P) message
Every node Q therefore receives 99 SUSPECT(P) messages

Q (via VERIFY_SUSPECT) sends a ARE_YOU_DEAD message to P
P (if it can) responds with an I_AM_NOT_DEAD back to Q
So the total number of messages generated by a single node is 99 * 2

This is done on every node, so the total number of messages is 99 * 99 * 2 = 19'602 messages !

Can you imagine what happens to P, which is a bit overloaded and cannot send out HEARTBEATs in time when it receives 19'602 messages ?

It it aint dead yet, it will die !

Isn't it ironic: by asking a node if it is still alive, we actually kill it !

This is an example of where the effects of using IP multicasts were not taken into account: if we multicast M, and everybody who receives M sends 2 messages, I neglected to see that the number of messages sent is a function of the cluster size !

So what's the solution ? Simple, elegant and outlined in [1].

Everybody sends a HEARTBEAT multicast periodically
Every member maintains a suspect list
This list is adjusted on view changes
Reception of a SUSPECT(P) message adds P to the list
When we suspect P because we haven't received a HEARTBEAT (or traffic if enabled):

The set of eligible members is computed as: members - suspected members
If we are the coordinator (first in the list):

Pass a SUSPECT(P) event up the stack, this runs the VERIFY_SUSPECT protocol and eventually passes the SUSPECT(P) up to GMS, which will exclude P from the view

The cost of running the suspicion protocol is (excluding the periodic heartbeat multicasts):

1 ARE_YOU_DEAD unicast to P
A potential response (I_AM_NOT_DEAD) from P to the coordinator

TOTAL COST in a cluster of 100: 2 messages (this is always constant), compared to 19'602 messages before !

This is way better than the previous implementation !

[1] https://jira.jboss.org/browse/JGRP-1241

5 comments:

Unknown4:47 PM
Good job. Big thanks to David!
Anonymous8:04 PM
I am wondering if it makes sense to have each node send HEARTBEAT to only a limited (user configurable) number of other members?
Perhaps the problem with that approach would be that instead of sending a single multicast HEARTBEAT packet, node would need to send a number of unicast ones, effectively killing all the benefits...
Bela Ban9:03 AM
To which nodes would you send the heartbeats ? Remember, if a node doesn't get heartbeats from everyone, it will suspect those from which it doesn't get them !
A few unicasts are more efficient if the transport is not UDP. But FD_ALL was designed with UDP as transport in mind...
Anonymous12:48 PM
So what happens when the coordinator is the suspect node?
Bela Ban12:58 PM
Say we have {A,B,C,D}. A is the current coordinator. When it is suspected, B (as next-in-line coordinator) runs a double-check (via VERIFY_SUSPECT) to see if A is really dead.
If that's still the case, B becomes coordinator and installs the new view {B,C,D} in the cluster.
Should A have been falsely suspected, and it is still alive, it will merge back into the cluster via MERGE2.

Belas Blog

Friday, October 01, 2010

Confessions of a serial protocol designer

5 comments:

Contributors

Blog Archive