Friday, April 29, 2011

Largest JGroups cluster ever: 536 nodes !

I just returned from a trip to a customer who's working on creating a large scale JGroups cluster. The largest cluster I've ever created is 32 nodes, due to the fact that I don't have access to a larger lab...

I've heard of a customer who's running a 420 node cluster, but I haven't seen it with my own eyes.

However, this record was surpassed on Thursday April 28 2011: we managed to run a 536 node cluster !

The setup was 130 celeron based blades with 1GB of memory, each running 4 JVMs with 96MB of heap, plus 4 embedded devices with 4 JVMs running on each. Each blade had 2 1GB NICs setup with IP Bonding. Note that the 4 processes are competing for CPU time and network IO, so with more blades or more physical memory available, I'm convinced we could go to 1000+ nodes !

The configuration used was udp-largecluster.xml (with some modifications), recently created and shipped with JGroups 2.12.

We started the processes in batches of 130, then waited for 20 seconds, then launched the second batch and so on. The reason we staggered the startup was to reduce the number of merges, which would have increased the startup time.

Running this a couple of times (plus 50+ times over night), the cluster always formed fine, and most of the time we didn't have any merges at all.

It took around 150-200 seconds (including the 5 sleeps of 20 seconds each) to start the cluster; in the picture at the bottom we see a run that took 176 seconds.

Changes to JGroups

This large scale setup revealed that certain protocols need slight modifications to optimally support large clusters, a few of these changes are:
  • Discovery: the current view is sent back with every discovery response. This is not normally an issue, but if you have a 500+ view, then the size of a discovery response becomes huge. We'll fix this by returning only the coordinator's address and not the view. For discovery requests triggered by MERGE2, we'll return the ViewId instead of the entire view.
  • We're thinking about canonicalizing UUIDs with IDs, so nodes will be assigned unique (short) IDs instead of UUIDs. This means reducing the size for having 17 bytes (UUID) in memory in favor of 2 bytes (short).
  • STABLE messages: here, we return an array of members plus a digest (containing 3 longs) for *each* member. This also generates large messages (11K for 260 nodes).
  • The fix in general for these problems is to reduce the data sent, e.g. by compressing the view, or not sending it at all, if possible. For digests, we can also reduce the data sent by sending only diffs, by sending only 1 long and using shorts for diffs, by using bitsets representing offsets to a previously sent value, and so on. 
Ideas are abundant, we now need to see which one is the most efficient.

For now, 536 nodes is an excellent number and - remember - we got to this number *without* the changes discussed above ! I'm convinced we can easily go higher, e.g. to 1000 nodes, without any changes. However, to reach 2000 nodes, the above changes will probably be required.

Anyway, I'm very happy to see this new record !

If anyone has created an even larger cluster, I'd be very interested in hearing about it !
Cheers, and happy clustering,

Friday, April 01, 2011

JBossWorld 2011 around the corner

Wanted to let you know that I've got 2 talks at JBW (Boston, May 3-6).

The first talk [1] is about geographic failover of JBoss clusters. I'll show 2 clusters, one in NYC, the other one in ZRH. Both are completely independent and don't know about each other. However, they're bridged with a JGroups RELAY and therefore appear as if they were one big virtual cluster.

This can be used for geographic failover, but it could also be used for example to extend a private cloud with an external, public cloud without having to use a hardware VPN device.

As always with my talks, this will be demo'ed, so you know this isn't just vapor ware !

The second talk [2] discusses 5 different ways of running a JBoss cluster on EC2. I'll show 2 demos, one of which works only on EC2, the other works on all clouds.

This will be a fun week, followed by a week of biking in the Bay Area ! YEAH !!

Hope to see and meet many of you in Boston !