Wednesday, July 31, 2013

BYOC: Bring Your Own Coordinator

I've finally resolved one of the older issues in JGroups, created 6 years ago: pluggable policy for picking the coordinator [1].

This adds the ability to let application code create the new View on a join, leave or crash-leave, or the MergeView on a merge.

Why is this important ?

There are 2 scenarios that benefit from this:
  1. The new default merge policy always picks a previous coordinator if available, so that moving around of the coordinatorship is minimized. Because singleton services (services which are instantiated only on one node in the cluster, usually on the coordinator) need to shutdown services and start them every time the coordinatorship moves, minimizing this means less work for singletons.
  2. A node with a coordinator role has more work to do, e.g. join/leave/merge handling, plus it may be running some singleton services. Some systems will therefore prefer bulkier machines (faster CPU, more memory) to run the coordinator. But so far, this hasn't been possible, as JGroups always preferred every node equally. This changes now, and the puggable policy allows for pinning the coordinatorship to a number of machines (a subset of the cluster nodes).
The documentation is at [2]. For questions and feedback, please use the mailing list. 



Monday, July 29, 2013

First JGroups release under Apache License 2.0

I'm happy to announce that I just released 3.4.0.Alpha1, which is the first release under the Apache License 2.0 !

In order to change the license, I had to track down all contributors who still have code in the code base and get their permissions [1], which was time consuming.

Given that JGroups came into existence in 1998, even though a lot of code has since been added/removed/replaced, there are still quite a number of people who still have contributions in the current master.

Finding some of those folks turned out to be hard. I ran into a few cases of abandoned email addresses, and sometimes googling for "firstname lastname" led me to people with the same name but no connection to JGroups whatsoever...

Anyway, the change is done now and I wanted to release a first alpha under AL 2.0 as soon as possible, so other projects can start using JGroups with the new license.

Although 'alpha' suggests bad code quality, the opposite is actually true: this release is very stable, at least as stable as the latest 3.3.x. The major changes are listed below.
Enjoy !

RELAY2: allow for more than one site master

If we have a lot of traffic between sites, having more than 1 site master 
increases performance and reduces stress on the single site master

Kerberos based authentication

New AUTH plugin contributed by Martin Swales. Experimental, needs more work

Probe now works with TCP too

If multicasting is not enabled, can be started as follows: -addr -port 12345
, where is the physical address:port of a node.
Probe will ask that node for the addresses of all other members and 
then send the request to all members.


UNICAST3: ack messages sooner

A message would get acked after delivery, not reception. This was changed, 
so that long running app code would not delay acking the message, which 
could lead to unneeded retransmission by the sender.

Bug fixes

FRAG/FRAG2: incorrect ordering with message batches

Reassembled messages would get reinserted into the batch at the end instead of 
at their original position

RSVP: incorrect ordering with message batches

RSVP-tagged messages in a batch would get delivered immediately,
instead of waiting for their turn

Memory leak in STABLE

Occurred when send_stable_msg_to_coord_only was enabled.

NAKACK2/UNICAST3: problems with flow control

If an incoming message sent out other messages before returning, it could 
block forever as new credits would not be processed.
Caused by a regression (removal of ignore_sync_thread) in FlowControl.

AUTH: nodes without AUTH protocol can join cluster

If a member didn't ship an AuthHeader, the message would get passed up
instead of rejected.

LockService issues

Bug fix for concurrent tryLock() blocks and various optimizations.

Logical name cache is cleared, affecting other channels

If we have multiple channels in the same JVM, the a new view in one channel
triggers removal of the entries of all other caches


The manual is at

The complete list of features and bug fixes can be found at

Bela Ban, Kreuzlingen, Switzerland
Aug 2013