Friday, October 01, 2010

Confessions of a serial protocol designer

I have a confession to make.

I'm utterly disgusted by my implementation of FD_ALL, and thanks to David Forget for pointing this out !

What's bad about FD_ALL ? It will not scale at all ! After having written several dozen protocols, I thought an amateurish mistake like the one I'm about to show would certainly not happen to me anymore. Boy, was I wrong !

FD_ALL is about detecting crashed nodes in a cluster, and the protocol then lets GMS know so that the crashed node(s) can be excluded from the view.

Let's take a look at the design.
  • Every node periodically multicasts a HEARTBEAT
  • This message is received by everyone in the cluster and a hashmap of nodes and timestamps is updated; for a node P, P's timestamp is set to the current time
  • Another task run at every node periodcially iterates through the timestamps and checks if any timestamps haven't been updated for a given time. If that's the case, the members with outdated timestamps are suspected
  • A suspicion of P results in a SUSPECT(P) multicast
  • On reception of SUSPECT(P), every node generates a SUSPECT(P) event and passes it up the stack
  • VERIFY_SUSPECT catches SUSPECT(P) and sends an ARE_YOU_DEAD message to P
  • If P is still alive, it'll respond with a I_AM_NOT_DEAD message
  • If the sender doesn't get this message for a certain time, it'll pass the SUSPECT(P) event further up the stack (otherwise it'll drop it), and GMS will exclude P from the view, but if and only if that given node is the coordinator (first in the view)
Can anyone see the flaw in this design ? Hint: it has to do with the number of messages generated...

OK, so let's see what happens if we have a cluster of 100 nodes:
  • Say node P is temporarily slow; it doesn't send HEARTBEATs because a big garbage collection is going on, or the CPU is crunching at 90%
  • 99 nodes multicast a SUSPECT(P) message
  • Every node Q therefore receives 99 SUSPECT(P) messages
    • Q (via VERIFY_SUSPECT) sends a ARE_YOU_DEAD message to P
    • P (if it can) responds with an I_AM_NOT_DEAD back to Q
    • So the total number of messages generated by a single node is 99 * 2
  • This is done on every node, so the total number of messages is 99 * 99 * 2 = 19'602 messages !

Can you imagine what happens to P, which is a bit overloaded and cannot send out HEARTBEATs in time when it receives 19'602 messages ?

It it aint dead yet, it will die !

Isn't it ironic: by asking a node if it is still alive, we actually kill it !

This is an example of where the effects of using IP multicasts were not taken into account: if we multicast M, and everybody who receives M sends 2 messages, I neglected to see that the number of messages sent is a function of the cluster size !

So what's the solution ? Simple, elegant and outlined in [1].
  • Everybody sends a HEARTBEAT multicast periodically
  • Every member maintains a suspect list 
  • This list is adjusted on view changes 
  • Reception of a SUSPECT(P) message adds P to the list 
  • When we suspect P because we haven't received a HEARTBEAT (or traffic if enabled): 
    • The set of eligible members is computed as: members - suspected members 
    • If we are the coordinator (first in the list): 
      • Pass a SUSPECT(P) event up the stack, this runs the VERIFY_SUSPECT protocol and eventually passes the SUSPECT(P) up to GMS, which will exclude P from the view

The cost of running the suspicion protocol is (excluding the periodic heartbeat multicasts):
  • 1 ARE_YOU_DEAD unicast to P
  • A potential response (I_AM_NOT_DEAD) from P to the coordinator
TOTAL COST in a cluster of 100: 2 messages (this is always constant), compared to 19'602 messages before !

This is way better than the previous implementation !



  1. Good job. Big thanks to David!

  2. Anonymous8:04 PM

    I am wondering if it makes sense to have each node send HEARTBEAT to only a limited (user configurable) number of other members?
    Perhaps the problem with that approach would be that instead of sending a single multicast HEARTBEAT packet, node would need to send a number of unicast ones, effectively killing all the benefits...

  3. To which nodes would you send the heartbeats ? Remember, if a node doesn't get heartbeats from everyone, it will suspect those from which it doesn't get them !
    A few unicasts are more efficient if the transport is not UDP. But FD_ALL was designed with UDP as transport in mind...

  4. Anonymous12:48 PM

    So what happens when the coordinator is the suspect node?

  5. Say we have {A,B,C,D}. A is the current coordinator. When it is suspected, B (as next-in-line coordinator) runs a double-check (via VERIFY_SUSPECT) to see if A is really dead.
    If that's still the case, B becomes coordinator and installs the new view {B,C,D} in the cluster.
    Should A have been falsely suspected, and it is still alive, it will merge back into the cluster via MERGE2.