Wednesday, June 17, 2009

Shunning has been shunned

I finally completed the MERGE4 functionality, which now handles asymmetrical merges and greatly improves the usefulness of JGroups in mobile networks. I've blogged about this earlier this year.

The new merge functionality also allowed me to trash shunning, which is great, because I've always had problems explaining the difference between shunning and merging. Merging would usually be needed when we had real network partitions, whereas shunning would be needed when only a single member was expelled from the group (e.g. because it failed to respond to heartbeats, but hadn't really crashed).

However, with FD_ALL, there could be a scenario where everybody shunned everybody else (a shun-fest :-)), and so all the cluster nodes would leave and re-join the cluster, possibly even multiple times. Clearly not a desirable scenario, even though it didn't lead to incorrect results!

The new model is now much simpler: we have members join, leave and merge. The latter happens on a network partition, for example. In the old model, when a member was unresponsive, it was shunned and subsequently rejoined. In the new model, there's simply going to be a merge between the group which found that member unresponsive and the (now newly responsive) member.
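In application code, the new model surfaces as the type of the installed view: plain joins and leaves deliver a regular view, while a healed partition delivers a merge view. The sketch below uses simplified stand-in classes that merely mirror the names of the real org.jgroups.View and org.jgroups.MergeView, just to show the dispatch an application would do in its viewAccepted() callback:

```java
import java.util.List;

// Simplified stand-ins for org.jgroups.View and org.jgroups.MergeView,
// only to illustrate the dispatch; the real classes carry more state.
class View {
    final List<String> members;
    View(List<String> members) { this.members = members; }
}

class MergeView extends View {
    final List<View> subgroups; // the partitions being merged back together
    MergeView(List<String> members, List<View> subgroups) {
        super(members);
        this.subgroups = subgroups;
    }
}

public class ViewHandler {
    // Mirrors MembershipListener.viewAccepted(): joins/leaves arrive as a
    // plain View, a healed partition arrives as a MergeView.
    static String viewAccepted(View v) {
        if (v instanceof MergeView) {
            MergeView mv = (MergeView) v;
            return "merge of " + mv.subgroups.size() + " partitions into " + mv.members;
        }
        return "regular view " + v.members;
    }

    public static void main(String[] args) {
        View v1 = new View(List.of("A", "B"));
        View v2 = new View(List.of("C"));
        MergeView mv = new MergeView(List.of("A", "B", "C"), List.of(v1, v2));
        System.out.println(viewAccepted(v1));
        System.out.println(viewAccepted(mv));
    }
}
```

An application that only cares about membership changes can ignore the distinction; one that keeps replicated state will typically branch on the MergeView case, as discussed below.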

Since I also improved merging speed and correctness (wrt concurrent merges), I suggest downloading 2.8.0.Beta2 (which I'll upload to SourceForge shortly) and giving it a try.

One thing that I'll have to talk about (in my next post) is what to do with merging. For example, if we have shared state and it diverged during a network partition, how can the application make sure that the merge doesn't cause inconsistent state?

More on this later, enjoy,


  1. OK, 2.8.0.Beta2 has now been uploaded to SourceForge and can be downloaded here:

  2. Bela, I do not see a MERGE4 class in 2.8 beta, but I do see MERGE2 and MERGE3.
    Where can I get it? Are there examples of stack configuration (TCP/UDP) with MERGE4?

  3. Actually, the changes are in GMS. I didn't have to add a new MERGE4 protocol, but simply changed MERGE2, MERGE3 and MERGEFAST.
    Your existing config will work, but 'shun' has been deprecated, that's all.

  4. I'm very keen to see how to handle the shared state merging. Shunning and ensuring our state was correct has been our biggest problem using JGroups.

  5. I'm afraid there is no simple one-size-fits-all solution to state merging; it always depends on the state at hand.

    I discuss some ways to handle state merging in section 5.6 of the manual:

    It all depends on whether the different partitions were allowed to progress, or whether (as described in 5.6.3) only the primary partition was allowed to make progress, and the others had to shut down or become read-only. In the latter case, on a merge everyone in a *non-primary* partition simply fetches the state from the primary partition (via Channel.getState()) and overwrites its own state and everything is fine.

    If we don't have primary partitions, and all partitions can have changes to the *same* data, then we could use timestamps. A timestamp is updated on a write, and on a merge, the data with the latest timestamp wins. This requires time synchronization, e.g. via NTP.

    Another mechanism (used, e.g., by Amazon Dynamo; Google for the paper describing it) is to maintain version numbers per data element. A version is updated on a write. On a merge, data can often be merged without conflict. If there are conflicts, however, they have to be handled by the application programmer! The developer will get multiple data items back, each with a version history, and then has to merge them into one.
    I hope this helps somewhat. Again, I cannot implement data merging in JGroups because it is highly dependent on your application.
    However, one thing we want to do in JGroups/Infinispan/JBossAS is to provide a few selectable data merge strategies that cover common application scenarios.
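As an illustration only (not JGroups API; the names are made up), the timestamp rule boils down to a compare on merge, and the version-number rule to a comparison that can surface a conflict to the application:

```java
// Illustration only, not part of JGroups: last-write-wins by timestamp,
// and a version-number compare that surfaces conflicts to the app.
public class MergeRules {
    static class Entry {
        final String value;
        final long timestamp; // write time, assumed NTP-synchronized
        final int version;    // incremented on every write
        Entry(String value, long timestamp, int version) {
            this.value = value; this.timestamp = timestamp; this.version = version;
        }
    }

    // Timestamp rule: on a merge, the entry with the latest timestamp wins.
    static Entry mergeByTimestamp(Entry a, Entry b) {
        return a.timestamp >= b.timestamp ? a : b;
    }

    // Version rule: the higher version wins; equal versions with different
    // values are a conflict the application has to resolve itself.
    static Entry mergeByVersion(Entry a, Entry b) {
        if (a.version > b.version) return a;
        if (b.version > a.version) return b;
        if (a.value.equals(b.value)) return a;
        throw new IllegalStateException("conflict: resolve in the application");
    }

    public static void main(String[] args) {
        Entry x = new Entry("foo", 1000, 1);
        Entry y = new Entry("bar", 2000, 2);
        System.out.println(mergeByTimestamp(x, y).value); // bar (later write)
        System.out.println(mergeByVersion(x, y).value);   // bar (higher version)
    }
}
```

Real systems like Dynamo track a version history (vector clocks) per element rather than a single counter, so concurrent writes on different partitions can be detected rather than silently overwritten.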

  6. Bela,

    I'm using JGroups to create ReplicatedHashMaps in several bundles in an OSGi application. They all use the udp.xml configuration file. I assume that these will have issues with using the same mcast_addr value. If so, what is the best approach to setting values for the mcast_addr at run-time so that each ReplicatedHashMap has its own mcast_addr?

  7. Yes, every cluster will see traffic from other clusters running on the same mcast_addr and mcast_port. They will discard it, but you'll still see warnings in the logs...
    Can't you preassign mcast addresses in your app?
    Other than that, your app could simply create random mcast addresses; the chances of picking the same mcast address are small.
    We do something similar for our unit tests; take a look at ResourceManager, which is part of JGroups, to see how we allocate a range of mcast addresses.
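As a sketch of the "random mcast address" idea (illustration only, not the actual ResourceManager code; it draws from the administratively scoped IPv4 multicast range 239.0.0.0/8, where collisions between independently chosen addresses are unlikely):

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustration of the "random mcast address" idea: draw an address from the
// administratively scoped IPv4 multicast range 239.0.0.0/8, so two apps
// picking independently are unlikely to collide.
public class McastPicker {
    static String randomMcastAddress() {
        ThreadLocalRandom r = ThreadLocalRandom.current();
        return "239." + r.nextInt(256) + "." + r.nextInt(256) + "." + (r.nextInt(254) + 1);
    }

    static int randomMcastPort() {
        // pick an even port in a high range
        return 10000 + 2 * ThreadLocalRandom.current().nextInt(20000);
    }

    public static void main(String[] args) {
        // e.g. feed these into the mcast_addr/mcast_port attributes of UDP
        System.out.println(randomMcastAddress() + ":" + randomMcastPort());
    }
}
```

The generated values could then be substituted into each ReplicatedHashMap's UDP configuration at startup, giving each map its own cluster.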