Wednesday, April 01, 2009

Those damn edge cases !

While JGroups is over 10 years old and very mature, sometimes I still run into cases that aren't handled. While the average user won't run into edge cases because we test the normal cases very well, if you do run into one, in the best case, you have 'undefined' behavior (whatever that means !), in the worst case, you're hosed.

Here's one.

The other week I was at a (european) army, for a week of JGroups consulting. They have a system which runs JGroups nodes over flappy links, radio and satellite networks. Sometimes, a link between 2 nodes A and B can even turn asymmetric, meaning A can send to B, but B not to A !

It turns out that they have a lot of partitions (e.g. when a satellite link goes down), followed by subsequent remerging when the link is restored. Sometimes, members would not be able to communicate with each other after the merge.

This was caused by an edge case in UNICAST which doesn't handle overlapping partitions.

A non-overlapping partition is a partition where a cluster of {A,B,C,D} falls apart into 2 (or more) subclusters of {A,B} and {C,D}, or {A}, {B}, {C}, {D}. The latter case can easily be reproduced when you kill a switch connecting the 4 nodes.

An overlapping partition is when the cluster falls apart into subclusters that overlap, e.g. {A,B,C} and {C,D}. This can happen with asymmetrical links (which never happens with a regular switch !), or FD and many nodes being killed at the same time and a merge occuring before all dead nodes have been removed from the cluster.

If this sounds obsure, it actually is !

But anyway, here's what happens at the UNICAST level.

Warning: rough road ahead...

UNICAST keeps state for each connection. E.g. if A sends a unicast message to B, A maintains the last sequence number (seqno) sent to B (e.g. #25) and B maintains the highest seqno received from A (#25). The same holds for message from B to A, let's say B's last message to A was #7.

Now we have a network partition, which creates a new view {A,B} at A and {B} at B. So, in other words, B unilaterally excluded A from its view, but A didn't exclude B. The reason is that A can communicate with B, but B cannot communicate with A.

Now, you might ask, wait a minute ! If A can communicate with B, why can't B communicate with A ?

This doesn't happen with switches, but here we're talking about separate up and down links over radios, and if a radio up-link goes down, that just means we cannot send, but still receive (through the down-link) !

Let's now look at what happens:

When B receives the new view {B}, it removes the entry for A from its connection table. It therefore loses the memory that its last message to A was #7.

On the other side, A doesn't remove its connection entry for B, which is still at #25.

When the partition heals and a merge ensues, A sends a message to B. The message's seqno is #25, the next message to B will be #26 and so on.

On the receiver side, B creates a new connection table entry for A with seqno #1. When A#25 and A#26 are received, they're stored in the table, but not passed up to the application because we expect messages #1-#24 from A first.

This is terrible because A will never send messages #1-#24 ! Because B will simply store all messages from A, it will run out of memory at some point, unless there's another view excluding A !

Doom also lurks in the reverse direction: when B sends #1 to A, A expects #7 and therefore discards messages #1-#6 from B !

This is bad and caused me to enhance the design for UNICAST. The new design includes connection IDs, so we'll reset an old connection when a new connection ID is received, and receivers asking senders for the initial seqnos if they have no entry for a given sender.

This change will not affect current users, running systems which are connected via switches/VLANs etc. But it will remove a showstopper for running JGroups in rough environments like the one described above.

The design for the new enhanced UNICAST protocol can be found at [1].



  1. By the way, this is NOT an April's Fools joke ! :-)

  2. In the case where A can communicate with B, but B can't communicate with A, wouldn't FD eventually fix the view on A to just be A since B couldn't respond, or is it that the network links in question go up/down in less time than it takes for FD/VERIFY_SUSPECT to shun a member?

  3. We're talking about separate links here: B would suspect A, and become {B}, whereas A (on a different link) would still be able to communicate with B, and therefore not suspect B ({A,B}).
    In OverlappingMergeTest, I've recreated this scenario with A,B,C. I had to pull VERIFY_SUSPECT out of the stack and disable shunning, and inject SUSPECT and MERGE events, to reproduce it.
    The test ends up with {A,B,C} on A and {B,C} on B and C, and then a merge gets them back together again. Actually, this doesn't work until I've implemented the enhanced design I described in the blog...

  4. Asymmetrical communication happens also when network security is not the same. A simple case that took the sys admin a few days to work out ...
    A ping B : OK
    B ping A : not OK
    As you guessed, A was configured in a very defensive mode and therefore would not respond to unexpected ping requests, while it would process the response to its own ping packet.
    Similar things may happen when traffic is route through another router (fail-over) which would also have asymmetrical rules ... and there goes the fun to track them down :)

  5. Yes, excellent example of another case where we can end up with asymmetrical links !