I've heard of a customer who's running a 420-node cluster, but I haven't seen it with my own eyes.
However, that record was surpassed on Thursday, April 28 2011: we managed to run a 536-node cluster!
The setup was 130 Celeron-based blades with 1GB of memory, each running 4 JVMs with 96MB of heap, plus 4 embedded devices with 4 JVMs running on each (130 * 4 + 4 * 4 = 536 members). Each blade had two 1Gb NICs set up with IP bonding. Note that the 4 processes on a blade compete for CPU time and network IO, so with more blades or more physical memory available, I'm convinced we could go to 1000+ nodes!
The configuration used was udp-largecluster.xml (with some modifications), recently created and shipped with JGroups 2.12.
We started the processes in batches of 130, waited 20 seconds, launched the next batch, and so on. We staggered the startup to reduce the number of merges, which would have increased the startup time.
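To give an idea of what such a staggered startup looks like with the JGroups API, here's a minimal sketch. In the real test every member was a separate JVM spread across the blades, so this single-process loop (and the made-up cluster name "large-cluster") only illustrates the batching and the 20 second pauses, not the actual test harness:

```java
import java.util.ArrayList;
import java.util.List;
import org.jgroups.JChannel;

// Sketch only: starts members in batches, pausing between batches so that
// concurrent joins don't trigger merges. In the real test, each member ran
// in its own JVM on one of the 130 blades.
public class StaggeredStartup {
    public static void main(String[] args) throws Exception {
        final int  BATCH_SIZE  = 130;    // members started per batch
        final int  NUM_BATCHES = 5;      // adjust to reach the desired cluster size
        final long PAUSE_MS    = 20000;  // 20 second pause between batches

        List<JChannel> channels = new ArrayList<JChannel>();
        for (int batch = 0; batch < NUM_BATCHES; batch++) {
            for (int i = 0; i < BATCH_SIZE; i++) {
                JChannel ch = new JChannel("udp-largecluster.xml"); // config shipped with 2.12
                ch.connect("large-cluster"); // hypothetical cluster name
                channels.add(ch);
            }
            if (batch < NUM_BATCHES - 1)
                Thread.sleep(PAUSE_MS);  // stagger the next batch
        }
        System.out.println("started " + channels.size() + " members");
    }
}
```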
Running this a couple of times (plus 50+ times overnight), the cluster always formed fine, and most of the time we didn't have any merges at all.
It took around 150-200 seconds (including the 5 sleeps of 20 seconds each) to start the cluster; in the picture at the bottom we see a run that took 176 seconds.
Changes to JGroups
This large-scale setup revealed that certain protocols need slight modifications to optimally support large clusters. A few of these changes are:
- Discovery: the current view is sent back with every discovery response. This is normally not an issue, but with a view of 500+ members, discovery responses become huge. We'll fix this by returning only the coordinator's address and not the view. For discovery requests triggered by MERGE2, we'll return the ViewId instead of the entire view.
- We're thinking about canonicalizing UUIDs into short IDs, so that nodes are assigned unique (short) IDs instead of carrying full UUIDs. This reduces the in-memory size per address from 17 bytes (UUID) to 2 bytes (short); a rough sketch of this idea follows the list below.
- STABLE messages: here, we return an array of members plus a digest (containing 3 longs) for *each* member. This also generates large messages: with a 17-byte UUID plus 3 * 8 bytes of longs per member, that's around 11K for 260 nodes.
- The general fix for these problems is to reduce the data sent, e.g. by compressing the view or, if possible, not sending it at all. For digests, we can also reduce the data by sending only diffs: e.g. 1 full long plus shorts for the diffs, or bitsets representing offsets from a previously sent value, and so on. The diff idea is also sketched below.
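To make the canonicalization idea a bit more concrete, here is a rough sketch of mapping UUIDs to short IDs. This is not JGroups code; the class name and the assumption that a single node (e.g. the coordinator) hands out the IDs are purely illustrative:

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: assigns each UUID a canonical 2-byte ID. In a real
// cluster, one node (e.g. the coordinator) would have to assign the IDs
// so they stay unique; overflow beyond 65536 members is ignored here.
public class AddressCanonicalizer {
    private final ConcurrentMap<UUID,Short> ids = new ConcurrentHashMap<UUID,Short>();
    private final AtomicInteger next = new AtomicInteger(0);

    /** Returns the canonical short ID for the given UUID, assigning a new one if needed */
    public short canonicalId(UUID addr) {
        Short id = ids.get(addr);
        if (id == null) {
            Short candidate = (short) next.getAndIncrement();
            Short prev = ids.putIfAbsent(addr, candidate);
            id = (prev != null) ? prev : candidate;
        }
        return id;
    }
}
```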
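The "1 long plus shorts for diffs" idea for digests could look something like the sketch below; again, this is only an illustration of the encoding, not how JGroups actually serializes digests:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustration only: instead of writing N full longs (8 bytes each), write
// one base long plus N unsigned-short offsets from that base (2 bytes each),
// falling back to full longs if any offset doesn't fit into 16 bits.
public class DigestDiffEncoder {

    public static byte[] encode(long[] seqnos) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        if (seqnos.length == 0)
            return baos.toByteArray();

        long base = Long.MAX_VALUE;
        for (long s : seqnos)
            base = Math.min(base, s);

        boolean fits = true;
        for (long s : seqnos) {
            if (s - base > 0xFFFF) { fits = false; break; }
        }

        out.writeBoolean(fits);                   // which encoding was used
        if (fits) {
            out.writeLong(base);                  // 1 full long as the base
            for (long s : seqnos)
                out.writeShort((int) (s - base)); // 2-byte diffs
        }
        else {
            for (long s : seqnos)
                out.writeLong(s);                 // fallback: full 8-byte values
        }
        out.flush();
        return baos.toByteArray();
    }
}
```

With 3 seqnos for each of 260 members, this would shrink roughly 780 * 8 bytes of longs to one base long plus 780 * 2 bytes of shorts, assuming the diffs fit.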
For now, 536 nodes is an excellent number and, remember, we got to this number *without* the changes discussed above! I'm convinced we can easily go higher, e.g. to 1000 nodes, without any changes. However, to reach 2000 nodes, the above changes will probably be required.
Anyway, I'm very happy to see this new record!
If anyone has created an even larger cluster, I'd be very interested in hearing about it!
Cheers, and happy clustering,