I'm utterly disgusted by my implementation of FD_ALL, and thanks to David Forget for pointing this out !
What's bad about FD_ALL ? It will not scale at all ! After having written several dozen protocols, I thought an amateurish mistake like the one I'm about to show would certainly not happen to me anymore. Boy, was I wrong !
FD_ALL is about detecting crashed nodes in a cluster, and the protocol then lets GMS know so that the crashed node(s) can be excluded from the view.
Let's take a look at the design.
- Every node periodically multicasts a HEARTBEAT
- This message is received by everyone in the cluster and a hashmap of nodes and timestamps is updated; for a node P, P's timestamp is set to the current time
- Another task run at every node periodcially iterates through the timestamps and checks if any timestamps haven't been updated for a given time. If that's the case, the members with outdated timestamps are suspected
- A suspicion of P results in a SUSPECT(P) multicast
- On reception of SUSPECT(P), every node generates a SUSPECT(P) event and passes it up the stack
- VERIFY_SUSPECT catches SUSPECT(P) and sends an ARE_YOU_DEAD message to P
- If P is still alive, it'll respond with a I_AM_NOT_DEAD message
- If the sender doesn't get this message for a certain time, it'll pass the SUSPECT(P) event further up the stack (otherwise it'll drop it), and GMS will exclude P from the view, but if and only if that given node is the coordinator (first in the view)
OK, so let's see what happens if we have a cluster of 100 nodes:
- Say node P is temporarily slow; it doesn't send HEARTBEATs because a big garbage collection is going on, or the CPU is crunching at 90%
- 99 nodes multicast a SUSPECT(P) message
- Every node Q therefore receives 99 SUSPECT(P) messages
- Q (via VERIFY_SUSPECT) sends a ARE_YOU_DEAD message to P
- P (if it can) responds with an I_AM_NOT_DEAD back to Q
- So the total number of messages generated by a single node is 99 * 2
- This is done on every node, so the total number of messages is 99 * 99 * 2 = 19'602 messages !
Can you imagine what happens to P, which is a bit overloaded and cannot send out HEARTBEATs in time when it receives 19'602 messages ?
It it aint dead yet, it will die !
Isn't it ironic: by asking a node if it is still alive, we actually kill it !
This is an example of where the effects of using IP multicasts were not taken into account: if we multicast M, and everybody who receives M sends 2 messages, I neglected to see that the number of messages sent is a function of the cluster size !
So what's the solution ? Simple, elegant and outlined in [1].
- Everybody sends a HEARTBEAT multicast periodically
- Every member maintains a suspect list
- This list is adjusted on view changes
- Reception of a SUSPECT(P) message adds P to the list
- When we suspect P because we haven't received a HEARTBEAT (or traffic if enabled):
- The set of eligible members is computed as: members - suspected members
- If we are the coordinator (first in the list):
- Pass a SUSPECT(P) event up the stack, this runs the VERIFY_SUSPECT protocol and eventually passes the SUSPECT(P) up to GMS, which will exclude P from the view
The cost of running the suspicion protocol is (excluding the periodic heartbeat multicasts):
- 1 ARE_YOU_DEAD unicast to P
- A potential response (I_AM_NOT_DEAD) from P to the coordinator
This is way better than the previous implementation !
[1] https://jira.jboss.org/browse/JGRP-1241
Good job. Big thanks to David!
ReplyDeleteI am wondering if it makes sense to have each node send HEARTBEAT to only a limited (user configurable) number of other members?
ReplyDeletePerhaps the problem with that approach would be that instead of sending a single multicast HEARTBEAT packet, node would need to send a number of unicast ones, effectively killing all the benefits...
To which nodes would you send the heartbeats ? Remember, if a node doesn't get heartbeats from everyone, it will suspect those from which it doesn't get them !
ReplyDeleteA few unicasts are more efficient if the transport is not UDP. But FD_ALL was designed with UDP as transport in mind...
So what happens when the coordinator is the suspect node?
ReplyDeleteSay we have {A,B,C,D}. A is the current coordinator. When it is suspected, B (as next-in-line coordinator) runs a double-check (via VERIFY_SUSPECT) to see if A is really dead.
ReplyDeleteIf that's still the case, B becomes coordinator and installs the new view {B,C,D} in the cluster.
Should A have been falsely suspected, and it is still alive, it will merge back into the cluster via MERGE2.