Monday, February 16, 2009

What's cool about logical addresses ?

Finally, logical addresses (https://jira.jboss.org/jira/browse/JGRP-129) will get implemented (in 2.8) !

For those of you who've used JGroups, you'll know that the identity of a node was always its IP address and the port on which the receiver thread was listening, e.g. 192.168.1.5:7800.

While this gives you a relatively compact and readable address (you can deduce from the address which host it resides on), there's also a problem: this type of address is not unique over space and time.

Let's look at an example.

Say we have a cluster of {A,B,C}. C's address is 192.168.1.5:7800. Let's assume A has sent 25 messages to C and C has multicast 104 messages. We're using sequence numbers (seqnos) to order messages; each message carries its seqno in a header.

So the next message that C will multicast is #105 and the next message it expects from A is #26.

This is state that is maintained by the respective protocols in a JGroups stack.

Now let's assume C is killed and restarted. Or C is shunned, therefore leaves the channel and then automatically (if configured) reconnects. Let's also assume that the failure detection protocol has not yet kicked in and therefore A and B will not have received a view {A,B} which excludes C.

Now C rejoins the cluster. Because this is a reincarnation of C, it creates a new protocol stack, and all the state mentioned above is gone. The reincarnated C now sends #1 as next seqno and expects #1 from A as well.

Two things happen now:
  1. When C multicasts its next message with seqno #1, both A and B will drop it. A drops it because it expects C's next message to be #105, not #1. As a matter of fact A will drop the first 104 messages from C !
  2. A multicasts a message with seqno #26. However, C expects #1 from A and therefore buffers message #26. As a matter of fact, C will buffer all messages from A until it receives #1 which will not happen ! Consequence: C will run out of memory at some point. Even worse: C will prevent stability messages from purging messages seen by all cluster nodes, so in the worst case, all cluster nodes will run out of memory !

OK, while this is somewhat of an edge case, and can be remedied by (a) waiting some time before restarting a node and/or (b) not pinning down ports, the fact remains that when it does happen, it wreaks havoc.
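
To make both failure modes concrete, here is a minimal sketch of the per-sender receive logic a NAKACK-style protocol applies. This is illustrative only, not JGroups code, and all names are made up:

// Simplified per-sender receive logic; illustrative only, not JGroups code.
import java.util.SortedMap;
import java.util.TreeMap;

class SenderWindow {
    private long nextExpected = 1; // A and B keep this across C's restart; a reincarnated C starts over at 1
    private final SortedMap<Long, byte[]> buffered = new TreeMap<Long, byte[]>();

    void receive(long seqno, byte[] payload) {
        if (seqno < nextExpected)         // e.g. reincarnated C sends #1 while we expect #105:
            return;                       // dropped as a duplicate
        if (seqno > nextExpected) {       // e.g. C expects #1 but receives #26 from A:
            buffered.put(seqno, payload); // buffered forever, since the gap is never filled
            return;
        }
        deliver(payload);
        nextExpected++;
        // deliver any consecutive messages that were buffered
        while (!buffered.isEmpty() && buffered.firstKey() == nextExpected) {
            deliver(buffered.remove(buffered.firstKey()));
            nextExpected++;
        }
    }

    void deliver(byte[] payload) { /* pass up to the application */ }
}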

So how are logical addresses going to change this ?

A logical address consists of
  • an org.jgroups.util.UUID (copied from java.util.UUID and relieved of some useless fields) and
  • a logical name
The logical name is given to a channel when the channel is created, e.g.
JChannel channel=new JChannel("node-4", "/home/bela/udp.xml");

This means that the channel's address will always get printed as "node-4". Under the covers, however, we use a UUID (for equals() and hashCode()), which is unique over space and time. The UUID is recreated on channel connect, so the above reincarnation issue will not happen.
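
Schematically, such an address amounts to the sketch below (the real org.jgroups.util.UUID and address classes differ; names here are illustrative):

// Illustrative only: identity comes from the UUID, the logical name is just for display.
import java.util.UUID;

final class LogicalAddress {
    private final UUID uuid;          // unique over space and time; recreated on each connect
    private final String logicalName; // e.g. "node-4"; NOT part of equals()/hashCode()

    LogicalAddress(String logicalName) {
        this.uuid = UUID.randomUUID();
        this.logicalName = logicalName;
    }

    public boolean equals(Object obj) {
        return obj instanceof LogicalAddress && uuid.equals(((LogicalAddress)obj).uuid);
    }

    public int hashCode() {
        return uuid.hashCode();
    }

    public String toString() {
        return logicalName; // this is what shows up in views
    }
}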

The logical name is syntactic sugar: views consisting of raw UUIDs (16 bytes each) are not a pretty sight, so views like {"node-1", "node-2", "node-3", "node-4"} look much better.

Note that the user will be able to pick whether to see UUIDs or logical names.

Also, if null is passed as logical name, JGroups will create a logical name (e.g. using the host name and a counter).

A UUID will get mapped to one or more physical addresses. The mapping is maintained by the transport, and there will be an ARP-like protocol (handled by Discovery) to fetch the initial mappings, and to resolve a mapping on demand if it is not yet known.
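
Conceptually, the cache in the transport might look like the following sketch (hypothetical names; the real implementation and the Discovery messages differ):

// Sketch of a transport-side UUID -> physical address cache; hypothetical API.
import java.net.InetSocketAddress;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class LogicalAddressCache {
    private final ConcurrentMap<UUID, InetSocketAddress> cache =
        new ConcurrentHashMap<UUID, InetSocketAddress>();

    // Called by the transport before sending a unicast message
    InetSocketAddress resolve(UUID uuid) {
        InetSocketAddress addr = cache.get(uuid);
        if (addr == null) {
            addr = whoHas(uuid); // ARP-like request, e.g. multicast via Discovery
            if (addr != null)
                cache.putIfAbsent(uuid, addr);
        }
        return addr;
    }

    void add(UUID uuid, InetSocketAddress addr) { cache.put(uuid, addr); }
    void remove(UUID uuid)                      { cache.remove(uuid); }

    private InetSocketAddress whoHas(UUID uuid) {
        // Placeholder: would broadcast "who has <uuid>?" and block briefly for a reply
        return null;
    }
}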

The detailed design is described in http://javagroups.cvs.sourceforge.net/viewvc/javagroups/JGroups/doc/design/LogicalAddresses.txt?revision=1.12&view=markup.

So the most important aspect of logical addresses is that they decouple the identity of a JGroups node from its physical address.

This opens up interesting possibilities.

We might for example associate multiple physical addresses with a UUID, and load balance over the physical addresses. We could open multiple sockets, and associate each (receiver) socket's physical address with the UUID. We could even change this at runtime: e.g. if a NIC is down and we get exceptions on the socket, simply create another socket, remove the old association across the cluster (there's a call for this) and associate the new physical address with the UUID.
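
As a sketch (again with made-up names), the mapping could hold a list of physical addresses per UUID, rotate through them for load balancing, and swap out a dead address at runtime:

// Sketch: several physical addresses per UUID, round-robin pick, runtime replacement.
import java.net.InetSocketAddress;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

class MultiHomedCache {
    private final ConcurrentMap<UUID, List<InetSocketAddress>> cache =
        new ConcurrentHashMap<UUID, List<InetSocketAddress>>();
    private final AtomicInteger counter = new AtomicInteger();

    InetSocketAddress pick(UUID uuid) {
        List<InetSocketAddress> addrs = cache.get(uuid);
        if (addrs == null || addrs.isEmpty())
            return null;
        int i = counter.getAndIncrement() & Integer.MAX_VALUE; // keep non-negative
        return addrs.get(i % addrs.size()); // simple load balancing over the addresses
    }

    void add(UUID uuid, InetSocketAddress addr) {
        List<InetSocketAddress> addrs = cache.get(uuid);
        if (addrs == null) {
            List<InetSocketAddress> tmp = new CopyOnWriteArrayList<InetSocketAddress>();
            addrs = cache.putIfAbsent(uuid, tmp);
            if (addrs == null)
                addrs = tmp;
        }
        addrs.add(addr);
    }

    // e.g. a NIC died: remove the old association, add the new socket's address
    void replace(UUID uuid, InetSocketAddress dead, InetSocketAddress fresh) {
        List<InetSocketAddress> addrs = cache.get(uuid);
        if (addrs != null)
            addrs.remove(dead);
        add(uuid, fresh);
    }
}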

Another possibility is to implement NATting, firewalling or STUNning this way !

I'll probably make the picking of a physical address given a UUID pluggable, so developers can even provide their own address translation in the transport !
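
Such a plug-in point could look roughly like the sketch below; this is a purely hypothetical interface, since the API did not exist at the time of writing:

// Hypothetical plug-in for picking a physical address for a given UUID.
import java.net.InetSocketAddress;
import java.util.List;
import java.util.UUID;

interface PhysicalAddressPicker {
    InetSocketAddress pick(UUID target, List<InetSocketAddress> candidates);
}

// Example: prefer site-local addresses, e.g. when some members sit behind a NAT
class PreferSiteLocalPicker implements PhysicalAddressPicker {
    public InetSocketAddress pick(UUID target, List<InetSocketAddress> candidates) {
        for (InetSocketAddress addr : candidates) {
            if (addr.getAddress() != null && addr.getAddress().isSiteLocalAddress())
                return addr;
        }
        return candidates.isEmpty() ? null : candidates.get(0);
    }
}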

This change is overdue and I'm happy that work has finally started on this. If you want to follow this, the branch is Branch_JGroups_2_8_LogicalAddresses.

15 comments:

  1. I forgot to mention that UUIDs are smaller in a marshalled format (16 bytes) and take up less space in memory.
    So, performance might also benefit slightly. Best of both worlds !
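
    For illustration, a UUID boils down to two longs, so marshalling it can be as simple as this sketch:

    // Sketch: a UUID marshals as two longs, i.e. 16 bytes on the wire
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.UUID;

    class UuidMarshaller {
        static void writeUuid(UUID uuid, DataOutput out) throws IOException {
            out.writeLong(uuid.getMostSignificantBits());
            out.writeLong(uuid.getLeastSignificantBits());
        }
    }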

  2. Looks amazing! I'll need some time to understand all the new possibilities. I understand this could make life easier in clouds, where "reincarnations" usually have different IP and MAC addresses?

  3. Assuming that a cloud is a superset of a cluster, with WAN being the differentiator, then yes !
    I haven't thought of all new possibilities, but bridging (separate) subclusters together might just be simpler with logical addresses.
    I think (but haven't tried it out yet !) that it might be easier to have a multicast-capable cluster in NYC, and another one in SFO, and bridge them via a TCP based cluster, which relays traffic between NYC and SFO.
    I think the relaying might be simpler if logical addresses are used.
    More on this topic (replication between data centers) later this year on my blog !

  4. About reincarnation - isn't it possible to exclude a member from a view when this member tries to join again (and consider it as a new incoming member)?

    Victor N

  5. In theory, yes. But if we have view {A,B,C} and never received {A,B}, then neither A nor B will exclude C !
    So, since we never suspected C and excluded it, we will not be able to reject its re-entry into the cluster. A and B still think this is the existing C.

  6. It's a great idea to have logical names. I had already thought about implementing such a feature myself, since we recently implemented some kind of administration interface for our distributed caches (JGroups used for communication), and seeing a node as its socket is less cool than having logical names like 'node1_instance2' or similar.

  7. Bela,
    ok, let's assume that we cannot rejoin the old node this way. But why does this old node, which did not become a new member, reply to "keep alive" messages? That is, when you try to reincarnate, you receive "join timed out", so you are not a member, but you will never be excluded from the view (until you stop join attempts). Is this a bug?

  8. Vatel: the problem is that C cleared its state while A and B didn't. So {A,B} think C is the same old member, whereas C thinks (and rightly so) that it left and rejoined the cluster.

    This unilateral decision to keep or clear the state leads to the problems described above. If everyone agreed to NOT clear the state, or everyone agreed to *clear* the state, then the issue would not exist.

    With logical addresses, C will NEVER come back as C, but it will always be a new, different node (let's call it C').

    Since this is C', which has not yet been seen by ANYONE in the cluster, EVERYONE will create a fresh state space for C', i.e. start sequence numbers at #1.

  9. Stefan: thanks ! Yes, I agree, this has been long overdue.

    I hope you guys try out logical addresses with the 2.8 alpha and give me feedback, so I can mold this while the code is still warm... :-)

  10. Bela, thanks for your comments,
    I feel I am starting to understand something in JGroups ;)

    Sorry, maybe I misunderstand something, but it seems there is something strange in this logic:

    a) node B says "join attempt timed out"
    b) (at the same time!) node B becomes a member and replies to "keep alive" from node A - how did node B become a member if it logged the error "join timed out"?

    Must a node be a member to reply to "keep alive" or not? If it is not a member, ok, but why does it respond to keep-alives in such a case?

  11. Maybe B failed to join the first time around, and succeeded the second time.
    The JOIN logic loops until the member joins successfully, until the JOIN fails due to a security exception, or until no existing members are found, in which case the new member becomes a singleton cluster.

  12. Would the use of logical addresses provide a better solution for "flapping" links? We experience issues when views get separated (due to a WAN link-loss), and during this separation process the link is restored.
    Are there configuration settings for the current stack to improve this behavior in this situation?

  13. Failure detection (FD / FD_ALL / FD_SOCK) causes nodes to get excluded if connectivity is lost. By the same token, when it is restored, merge (MERGE2) takes care of merging subgroups back into the cluster.
    How fast these 2 things happen is largely a function of their configuration, e.g. heartbeat intervals in FD and max number of missed heartbeats before a node is declared dead.
    For merging to work, 'loopback' has to be set to true in the transport (e.g. UDP or TCP) on certain operating systems.
    Note that logical addresses will help to eliminate some of the issues that arise when nodes disappear and then reappear shortly afterwards, for example with configs that use fixed ports.
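
    For example, the relevant knobs look like this (attribute names as in typical 2.x configs; check them against your version):

    <!-- excerpt of a stack config: tune how fast failures and merges are detected -->
    <FD timeout="2500" max_tries="2"/>                  <!-- suspect after max_tries missed heartbeats -->
    <FD_SOCK/>                                          <!-- immediate suspicion when the peer's socket closes -->
    <MERGE2 min_interval="10000" max_interval="30000"/> <!-- how often merge attempts run -->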

  14. Hi Bela. Am new to JGroups (have just finished reading the manual) and it looks like a great fit for my application. I had a couple of questions - have you done any integration/implemented any usage patterns with/for Google Protocol Buffers? I'm keen to understand what the best method of transmitting GPB messages might be: send()-ing the Google Message object, or converting it to a byte[] and send()-ing the array instead. Also, I am looking to implement a distributed model whereby each node is only able to send a new Request once it has received at least one Response. What are your thoughts about the best way of implementing such a model? I do want to use multicast, so all members can see all responses, but I don't want the Requester to issue new Requests until it is certain that the previous request has been at least Acked.
    Cheers.

  15. Google protocol buffers look interesting, essentially IDL with a new look :-)
    I've thought before about generating marshalling and unmarshalling code from an XML description of my data: parse the description, then have a backend generate the code. We could for instance generate Java code used by JGroups itself, and C code used by the JGroups wireshark plugin !
    Regarding transmission of GPB data, the best way is always to generate a byte[] representation and pass it to JGroups.
    For the request/response model you described, I suggest taking a look at RpcDispatcher, which provides synchronous and asynchronous requests, and can block until all responses have been received, or until a certain number of responses has been received.
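
    For example, a blocking group RPC could look roughly like this sketch (based on the 2.x-era RpcDispatcher API; signatures may differ slightly in your version, and the cluster name and config path are placeholders):

    // Sketch: multicast a request and block until all members have responded (GET_ALL).
    import org.jgroups.JChannel;
    import org.jgroups.blocks.GroupRequest;
    import org.jgroups.blocks.RpcDispatcher;
    import org.jgroups.util.RspList;

    public class RequestDemo {
        // Invoked on every member when the request arrives
        public String handleRequest(String request) {
            return "ack: " + request;
        }

        public static void main(String[] args) throws Exception {
            JChannel channel = new JChannel("/home/bela/udp.xml");
            RpcDispatcher disp = new RpcDispatcher(channel, null, null, new RequestDemo());
            channel.connect("demo-cluster");

            // GET_ALL blocks until every member has replied, so the next request
            // can only be sent once the previous one has been acked by everyone
            RspList rsps = disp.callRemoteMethods(null,            // null = all members
                                                  "handleRequest",
                                                  new Object[]{"req-1"},
                                                  new Class[]{String.class},
                                                  GroupRequest.GET_ALL,
                                                  5000);           // timeout in ms
            System.out.println("responses:\n" + rsps);
            channel.close();
        }
    }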
