This is great because it maximizes use of the aggregated memory of the 'cloud' (a.k.a. all cluster nodes). For example, if we have 10 nodes, and each node has 1 GB of memory, then the aggregated (cloud) memory is 10 GB. This is similar to a logical volume manager (e.g. LVM in Linux), where we 'see' a virtual volume, the size of which can grow or shrink, and which hides the mapping to physical disks.
So, if we pick a good consistent hash algorithm, then for 1'000 elements in a cluster of 10 nodes, we can assume that each node stores on average 100 elements. Also, with consistent hashing, rehashing on view changes (nodes joining, leaving or crashing) is minimal.
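To illustrate the idea, here is a simplified sketch of a consistent hash ring (not the actual hash function JGroups uses): nodes are placed on a ring by hashing their address, and a key goes to the first node at or after hash(key), so a view change only moves the keys in the affected ring segment.

import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified consistent hash ring (illustration only)
public class SimpleConsistentHash {
    private final TreeMap<Integer,String> ring = new TreeMap<Integer,String>();

    public SimpleConsistentHash(List<String> nodes) {
        for(String node: nodes)
            ring.put(hash(node), node); // place each node on the ring
    }

    // Returns the node responsible for storing the given key
    public String nodeFor(String key) {
        SortedMap<Integer,String> tail = ring.tailMap(hash(key));
        Integer slot = tail.isEmpty() ? ring.firstKey() : tail.firstKey(); // wrap around the ring
        return ring.get(slot);
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // non-negative; a real implementation uses a better-spreading hash
    }

    public static void main(String[] args) {
        SimpleConsistentHash ch = new SimpleConsistentHash(java.util.Arrays.asList("node-1", "node-2", "node-3"));
        System.out.println("\"id\" is stored on " + ch.nodeFor("id"));
    }
}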
Now, the question is what we do when a node crashes. All elements stored by that node are gone, and have to be re-read from somewhere, usually a database.
To provide highly available data and minimize access to the database, a common technique is to replicate data. For example, if we replicate an element K to all 10 nodes, then we can tolerate 9 nodes going down and still have K available.
However, this comes at a cost: if every node replicates all of its elements to all cluster nodes, then we can effectively use only 1/N of the 'cloud memory' (10 GB), which is 1 GB... So we trade access to the large cloud memory for availability.
This is like RAID: if we have 2 disks of 500 GB each, then we can use them as RAID 0 or JBOD (Just a Bunch of Disks) and have 1 TB available for our data. If one of the disks crashes, we lose data that resides on that disk. If we happen to have a file F with 10 blocks, and 5 were stored on the crashed disk, then F is gone.
If we use RAID 1, then the contents of disk-1 are mirrored onto disk-2 and vice versa. This is great, because we can now lose 1 disk and still have all of our data available. However, we now have only 500 GB of disk space available for our data !
Enter ReplCache. This is a prototype I've been working on for the last 2 weeks.
ReplCache allows for variable replication, which means we can tell it on a put(key, value, K) how many copies (replication count) of that element should be stored in the cloud (see the usage sketch after this list). A replication count K can be:
- K == 1: the element is stored only once. This is the same as what PartitionedHashMap does
- K == -1: the element is stored on all nodes in the cluster
- K > 1: the element is stored on K nodes only. ReplCache makes sure to always have K instances of an element available, and if K drops because a node leaves or crashes, ReplCache might copy or move the element to bring K back up to the original value
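For illustration, here is roughly how this might look from code. The constructor, the put() variant taking a replication count and a timeout, the cluster name and the values below are my assumptions based on the demo; check the design document and the source for the exact API.

import org.jgroups.blocks.ReplCache;

// Sketch only: signatures, cluster name and values are assumptions, not necessarily the exact ReplCache API
public class ReplCacheExample {
    public static void main(String[] args) throws Exception {
        ReplCache<String,String> cache =
                new ReplCache<String,String>("/home/bela/udp.xml", "replcache-cluster");
        cache.start();

        cache.put("id", "322649", (short)1, 0);          // K == 1: one copy in the cloud
        cache.put("name", "Bela", (short)1, 0);          // K == 1: one copy in the cloud
        cache.put("everywhere", "hello", (short)-1, 0);  // K == -1: replicated to all nodes
        cache.put("two", "world", (short)2, 0);          // K == 2: stored on 2 nodes
        // timeout of 0 assumed to mean the element never expires

        System.out.println(cache.get("everywhere"));     // fetched from whichever node(s) store it

        cache.stop();
    }
}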
ReplCache is a superset of PartitionedHashMap, which means it can be used as a PartitionedHashMap: just use K == 1 for all elements to be inserted !
The more important feature, however, is that ReplCache can use more of the available cloud memory and that it allows a user to define availability as a quality of service per data element ! Data that can be re-read from the DB can be stored with K == 1. Data that should be highly available should use K == -1, and data that should be somewhat highly available, but can still be re-read from the DB (even if that's costly), should be stored with K > 1.
Compare this to RAID: once we've configured RAID 1, then all data written to disk-1 will always be mirrored to disk-2, even data that could be trashed on a crash, for example data in /tmp.
With ReplCache, the user (who knows his/her data best) takes control and defines QoS for each element !
Below is a screenshot of 2 ReplCache instances (started with java org.jgroups.demos.ReplCacheDemo -props /home/bela/udp.xml) which shows that we've added some data:
It shows that both instances have key "everywhere", because it is replicated to all cluster nodes due to K == -1. The same goes for key "two": because K == 2, it is stored on both instances, as we only have 2 cluster nodes.
There are 2 keys with K == 1: "id" and "name". Both happen to be stored on instance 2, but that's coincidence. For M keys with K == 1 and N cluster nodes, every node should store approximately M/N keys.
ReplCache is experimental, and serves as a prototype to play with data partitioning/striping for JBossCache.
ReplCache is in the JGroups CVS (head) and the code can be downloaded here. To run the demo, execute:
java -jar replcachedemo.jar
For the technical details, the design is here.
There is a nice 5 minute demo at http://www.jgroups.org/demos.html.
Feedback is appreciated, use the JGroups mailing lists !
Enjoy !