Wednesday, January 21, 2009

ReplCache: storing your data in the cloud with variable replication

Some time ago, I wrote a prototype of a cache (PartitionedHashMap) which distributes its elements (key-value pairs) across all cluster nodes. This is done by computing the consistent hash of a key K and picking a cluster node based on the hash mod N, where N is the cluster size. So any given element will only ever be stored once in the cluster.

This is great because it maximizes use of the aggregated memory of the 'cloud' (a.k.a. all cluster nodes). For example, if we have 10 nodes, and each node has 1 GB of memory, then the aggregated (cloud) memory is 10 GB. This is similar to a logical volume manager (e.g. LVM in Linux), where we 'see' a virtual volume, the size of which can grow or shrink, and which hides the mapping to physical disks.

So, with a good consistent hash algorithm, we can assume that for 1'000 elements in a cluster of 10 nodes, each node stores on average 100 elements. A good hash function also keeps rehashing on view changes (nodes joining or leaving) to a minimum.
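As a rough sketch of the node-picking logic (illustrative only: the class and method names below are made up, and the real code uses a proper consistent hash function rather than a plain hashCode()), mapping a key to a cluster node could look like this:

import java.util.List;
import org.jgroups.Address;

// Illustrative sketch: map a key to a cluster node via hash mod N.
// Not the actual PartitionedHashMap code, which uses a consistent hash.
public class KeyToNodeMapper {

    /** Returns the member responsible for storing the given key */
    public static Address pickNode(Object key, List<Address> members) {
        int index=Math.abs(key.hashCode() % members.size()); // hash mod N
        return members.get(index);
    }
}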

Now, the question is what we do when a node crashes. All elements stored by that node are gone, and have to be re-read from somewhere, usually a database.

To provide highly available data and minimize access to the database, a common technique is to replicate data. For example, if we replicate K to all 10 nodes, then we can tolerate 9 nodes going down and will still have K available.

However, this comes at a cost: if every node replicates all of its elements to all cluster nodes, then we can effectively only use 1/N of the 'cloud memory' (10 GB), which is 1 GB... So we trade access to the large cloud memory for availability.

This is like RAID: if we have 2 disks of 500 GB each, then we can use them as RAID 0 or JBOD (Just a Bunch of Disks) and have 1 TB available for our data. If one of the disks crashes, we lose data that resides on that disk. If we happen to have a file F with 10 blocks, and 5 were stored on the crashed disk, then F is gone.

If we use RAID 1, then the contents of disk-1 are mirrored onto disk-2 and vice versa. This is great, because we can now lose 1 disk and still have all of our data available. However, we now have only 500 GB of disk space available for our data !

Enter ReplCache. This is a prototype I've been working on for the last 2 weeks.

ReplCache allows for variable replication, which means we can tell it on a put(key, value, K) how many copies (the replication count) of that element should be stored in the cloud. The replication count K can be one of the following (a short usage sketch follows the list):
  • K == 1: the element is stored only once. This is the same as what PartitionedHashMap does
  • K == -1: the element is stored on all nodes in the cluster
  • K > 1: the element is stored on K nodes only. ReplCache makes sure to always have K instances of an element available, and if the count drops because a node leaves or crashes, ReplCache might copy or move the element to bring it back up to K
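To make the three cases concrete, here's a small usage sketch. It follows the put(key, value, K) API described above, but the class name, constructor and method signatures are illustrative and may differ from the actual prototype:

import org.jgroups.blocks.ReplCache;

public class ReplCacheExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical constructor arguments: a JGroups config and a cluster name
        ReplCache<String,String> cache=
                new ReplCache<String,String>("/home/bela/udp.xml", "replcache-cluster");
        cache.start();

        cache.put("id",         "322649", 1);  // K == 1: stored on exactly one node
        cache.put("name",       "Bela",   1);  // K == 1: stored on exactly one node
        cache.put("everywhere", "xyz",   -1);  // K == -1: replicated to every cluster node
        cache.put("two",        "abc",    2);  // K == 2: kept on 2 nodes at all times

        // get() is transparent: the value is fetched from whichever node stores it
        System.out.println("two=" + cache.get("two"));

        cache.stop();
    }
}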
So why is ReplCache better than PartitionedHashMap ?

ReplCache is a superset of PartitionedHashMap, which means it can be used as a PartitionedHashMap: just use K == 1 for all elements to be inserted !

The more important feature, however, is that ReplCache can use more of the available cloud memory and that it allows a user to define availability as a quality of service per data element ! Data that can be re-read from the DB can be stored with K == 1. Data that should be highly available should use K == -1, and data which should be more or less highly available, but can still be read from the DB (but maybe that's costly), should be stored with K > 1.

Compare this to RAID: once we've configured RAID 1, then all data written to disk-1 will always be mirrored to disk-2, even data that could be trashed on a crash, for example data in /tmp.

With ReplCache, the user (who knows his/her data best) takes control and defines QoS for each element !

Below is a screenshot of 2 ReplCache instances (started with java org.jgroups.demos.ReplCacheDemo -props /home/bela/udp.xml), showing that we've added some data:


It shows that both instances have key "everywhere", because it is replicated to all cluster nodes due to K == -1. The same goes for key "two": because K == 2, it is stored on both instances, as we only have 2 cluster nodes.
There are 2 keys with K == 1: "id" and "name". Both are stored on instance 2, but that's coincidence: for M keys stored with K == 1 and N cluster nodes, every node should store approximately M/N keys.

ReplCache is experimental, and serves as a prototype to play with data partitioning/striping for JBossCache.
ReplCache is in the JGroups CVS (head) and the code can be downloaded here. To run the demo, execute:
java -jar replcachedemo.jar

For the technical details, the design is here.

There is a nice 5 minute demo at http://www.jgroups.org/demos.html.

Feedback is appreciated, use the JGroups mailing lists !

Enjoy !

Monday, January 05, 2009

JGroups 2.7 released

Finally, after almost a year of development, I released 2.7.0.GA this morning. It can be downloaded from http://sourceforge.net/project/showfiles.php?group_id=6081&package_id=94868&release_id=651542.

Although 2.7 has 211 JIRA issues (bugfixes, tasks or features), most of the bug fixes have been backported to the 2.6 branch. Why ? Because 2.6.7 is the version that ships with JBoss 5, and we made sure JGroups works optimally with it.

So what's new ?

There are almost no new features ! (Can you tell I'm not a marketing person ? :-))

Most work (besides bug fixes) went into refactoring, e.g. we converted our test suite from JUnit to TestNG, allowing for parallel test execution and thus reducing our testing time from 2.5 hours to 15 minutes !

Another change was that all properties are now set using JSR 175 annotations, so we could remove a lot of boilerplate code from protocol implementations. In my opinion, the more code I can remove (without impacting functionality), the better !

Using annotations for properties also allows us to automatically generate documentation for the properties of all protocols.
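For illustration, a protocol property might now be declared roughly like this (a minimal sketch; the exact attributes of the @Property annotation in org.jgroups.annotations may differ):

import org.jgroups.annotations.Property;
import org.jgroups.stack.Protocol;

// Sketch of annotation-based properties: the annotations replace hand-written
// property-parsing boilerplate and feed the generated documentation.
public class EXAMPLE extends Protocol {

    @Property(description="Time (in ms) to wait for a response before retrying")
    protected long timeout=3000;

    @Property(description="Maximum number of retransmission attempts")
    protected int max_retries=2;

    public String getName() {
        return "EXAMPLE";
    }
}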

I also marked unsupported or experimental classes/methods with @Unsupported or @Experimental annotations.

We were able to increase performance a bit, compared to 2.6, but 2.6 is already quite fast, so unless you need those additional 5-10%, go for 2.6.7.

In a nutshell, 2.7 serves as the groundwork for 2.8, which will have many new features.