Friday, April 29, 2011

Largest JGroups cluster ever: 536 nodes !

I just returned from a trip to a customer who's working on creating a large-scale JGroups cluster. The largest cluster I've ever created myself is 32 nodes, simply because I don't have access to a larger lab...

I've heard of a customer who's running a 420-node cluster, but I haven't seen it with my own eyes.

However, this record was surpassed on Thursday, April 28 2011: we managed to run a 536-node cluster !

The setup was 130 Celeron-based blades with 1GB of memory, each running 4 JVMs with 96MB of heap, plus 4 embedded devices also running 4 JVMs each (130 × 4 + 4 × 4 = 536 nodes). Each blade had two 1Gb NICs set up with IP bonding. Note that the 4 JVMs on each blade compete for CPU time and network I/O, so with more blades or more physical memory available, I'm convinced we could go to 1000+ nodes !

The configuration used was udp-largecluster.xml (with some modifications), which was recently created and ships with JGroups 2.12.
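For reference, joining a cluster with this configuration boils down to creating a channel from the XML file and connecting it. A minimal sketch (the cluster name "large-cluster" is just a placeholder, not the name used in the test):

    import org.jgroups.JChannel;

    // Minimal sketch: join a cluster using the shipped large-cluster configuration.
    public class JoinLargeCluster {
        public static void main(String[] args) throws Exception {
            JChannel channel = new JChannel("udp-largecluster.xml"); // load the protocol stack from the XML file
            channel.connect("large-cluster");                        // join (or create) the cluster
            System.out.println("View: " + channel.getView());        // print the current membership
            // ... application work ...
            channel.close();
        }
    }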

We started the processes in batches of 130, waited for 20 seconds, launched the next batch, and so on. We staggered the startup to reduce the number of merges, which would have increased the startup time.
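The staggering logic itself is trivial. A hypothetical single-host driver (class names, heap size and the member class from the sketch above are illustrative; in the real test the batches were spread across the 130 blades):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical launcher: start a batch of member JVMs, wait 20 seconds so the
    // view can settle without merges, then start the next batch.
    public class StaggeredStartup {
        static final int TOTAL = 536, BATCH_SIZE = 130;
        static final long STAGGER_MS = 20000;

        public static void main(String[] args) throws Exception {
            List<Process> members = new ArrayList<Process>();
            for (int started = 0; started < TOTAL; started += BATCH_SIZE) {
                int batch = Math.min(BATCH_SIZE, TOTAL - started);
                for (int i = 0; i < batch; i++) {
                    members.add(new ProcessBuilder("java", "-Xmx96m",
                        "-cp", System.getProperty("java.class.path"),
                        "JoinLargeCluster").start()); // member class from the previous sketch
                }
                Thread.sleep(STAGGER_MS); // stagger batches to avoid concurrent coordinators and merges
            }
            System.out.println("Started " + members.size() + " members");
        }
    }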

We ran this a couple of times (plus 50+ times overnight); the cluster always formed fine, and most of the time we didn't have any merges at all.

It took around 150-200 seconds (including the 5 sleeps of 20 seconds each, i.e. 100 seconds of pure waiting) to start the cluster; in the picture at the bottom we see a run that took 176 seconds.

Changes to JGroups

This large-scale setup revealed that certain protocols need slight modifications to optimally support large clusters. A few of these changes are:
  • Discovery: the current view is sent back with every discovery response. This is not normally an issue, but with a 500+ member view, the size of a discovery response becomes huge. We'll fix this by returning only the coordinator's address and not the view. For discovery requests triggered by MERGE2, we'll return the ViewId instead of the entire view.
  • We're thinking about canonicalizing UUIDs into short IDs, so nodes will be assigned unique (short) IDs instead of UUIDs. This would shrink each address from 17 bytes (a UUID) to 2 bytes (a short); a rough sketch of the idea follows this list.
  • STABLE messages: here, we return an array of members plus a digest (containing 3 longs) for *each* member. This also generates large messages: roughly 260 × (17-byte UUID + 3 × 8-byte longs) ≈ 11K for 260 nodes.
  • The general fix for these problems is to reduce the data sent, e.g. by compressing the view, or not sending it at all, if possible. For digests, we can also reduce the data by sending only diffs, by sending only 1 long and using shorts for the diffs, by using bitsets representing offsets from a previously sent value, and so on.
Ideas are abundant; we now need to see which one is the most efficient.
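To illustrate the canonicalization idea from the list above, here is a rough sketch (java.util.UUID stands in for the JGroups address type, and a real implementation would need cluster-wide agreement on the mapping, e.g. IDs handed out by the coordinator):

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative only: map each 17-byte UUID address to a 2-byte short, so views,
    // digests and headers can carry the short instead of the full UUID.
    public class AddressCanonicalizer {
        private final Map<UUID, Short> ids = new ConcurrentHashMap<UUID, Short>();
        private final AtomicInteger next = new AtomicInteger(1);

        public short canonicalId(UUID addr) {
            Short id = ids.get(addr);
            if (id == null) {
                Short candidate = (short) next.getAndIncrement(); // plenty of room below 32767 for current cluster sizes
                Short prev = ids.putIfAbsent(addr, candidate);
                id = (prev != null) ? prev : candidate;
            }
            return id; // 2 bytes on the wire instead of 17
        }
    }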

For now, 536 nodes is an excellent number and - remember - we got to this number *without* the changes discussed above ! I'm convinced we can easily go higher, e.g. to 1000 nodes, without any changes. However, to reach 2000 nodes, the above changes will probably be required.

Anyway, I'm very happy to see this new record !

If anyone has created an even larger cluster, I'd be very interested in hearing about it !
Cheers, and happy clustering,



17 comments:

  1. Awesome! Next up, 1k nodes! :)

  2. Yep, that's the goal. With Infinispan running on top !

  3. Mitch Christensen (7:02 PM)

    This is great. Congratulations.

    Can you describe what each node was doing after the group was formed? Was there messaging going on, etc.? How about state transfer? Was there any attempt to randomly kill/restart nodes or otherwise partition the cluster after the group was formed? Did this include FD, FD_SOCK, FD_ALL?

  4. Hi Mitch,

    thx !

    We tested cluster formation, so the Draw demo is what we started.
    There is no messaging going on after startup; instead, all instances are killed as soon as the cluster has formed successfully, and then the next round is started.
    The config is more or less udp-largecluster.xml, which includes FD_SOCK and FD_ALL (the transport is UDP). I was a bit concerned about FD_SOCK and the traffic generated to discover the neighbor's server socket port, but this turned out not to be an issue. We nevertheless commented out FD_SOCK.
    The next step for this particular customer will be to send 5K messages; ca. 30% of all nodes will do that. This is to mimic the application that'll run on the devices.
    Randomly killing nodes was not yet on the menu...
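    For anyone who wants to reproduce a small-scale version: the Draw demo ships with JGroups and can be pointed at the config roughly like this (the wrapper class is just illustrative; the demo is normally started straight from the command line, and flags may differ slightly between versions):

        // Illustrative wrapper: start the JGroups Draw demo with the large-cluster config.
        public class StartDraw {
            public static void main(String[] args) throws Exception {
                org.jgroups.demos.Draw.main(new String[] {"-props", "udp-largecluster.xml"});
            }
        }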

  5. Congratulations!
    Can you post the exact configuration? It will help us a lot.

  6. The config is udp-largecluster.xml (shipped with JGroups) with minor modifications (I think FD_SOCK was removed).

  7. Let's continue the discussion on the JGroups users mailing list if more questions/problems come up. Large clusters are something I'm very interested in.

  8. Hi,
    Can you tell me what the "collector" GUI at the end of the post is ? Is it part of JGroups ?

  9. No, the GUI is not part of JGroups. However, it might be open sourced by the company which wrote it, stay tuned...
    Bela

  10. Anonymous (4:25 PM)

    Awesome!
    Congratulations.

  11. I am late but congratulations!
    One year has passed... did the company happen to open-source the "collector" tool?
    By the way, is there any documentation for the scripts in the "bin" directory? If not, I can fork the repo on GitHub and insert in each file a small header describing what the script does (provided I understand them).
    Cheers, keep up the great work!

    1. Thanks Raoul !
      No, the tool hasn't been open sourced yet; one issue was that they were using Eclipse RCP, which (IMO) is a PITA...
      The scripts in bin are only partly documented in the manual, for instance probe.sh.
      Yes, if you'd like to add a small description to each script, be my guest: I'd be happy to apply a pull request.
      Cheers,

  12. Has anyone tried going beyond a 256-node cluster with more than 50G of memory?

    1. You mean 50GB of memory per node ?

      The above cluster had 536 nodes, and there were 4 nodes per physical box. IIRC, each node had ca. 100MB of heap.

    2. We have 141 nodes with 18G heap each.

  13. Quick update: as of Sept 2013, the record for the biggest cluster is 1115 nodes !
    See the mailing list archive [1] for details.

    [1] https://sourceforge.net/p/javagroups/mailman/message/31439524
