Wednesday, January 21, 2009

ReplCache: storing your data in the cloud with variable replication

Some time ago, I wrote a prototype of a cache which distributes its elements (key-value pairs) across all cluster nodes. This is done by computing the consistent hash of a key K and picking a cluster node based on the hash mod N where N is the cluster size. So any given element will only ever be stored once in the cluster.

This is great because it maximizes use of the aggregated memory of the 'cloud' (a.k.a. all cluster nodes). For example, if we have 10 nodes, and each node has 1 GB of memory, then the aggregated (cloud) memory is 10 GB. This is similar to a logical volume manager (e.g. LVM in Linux), where we 'see' a virtual volume, the size of which can grow or shrink, and which hides the mapping to physical disks.

So, if we pick a good consistent hash algorithm, then for 1,000 elements in a cluster of 10 nodes, we can assume that each node stores 100 elements on average. Consistent hashing also keeps rehashing to a minimum when the cluster view changes.
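The node-selection idea above can be sketched in a few lines. This is a hypothetical illustration, not the actual JGroups code: `pickNode` and the `List<String>` cluster stand in for the real hash function and the JGroups view.

```java
import java.util.List;

// Hypothetical sketch of hash-based node selection as described above:
// hash the key, then map the hash onto the cluster membership of size N.
public class NodeSelector {
    static <T> T pickNode(Object key, List<T> clusterNodes) {
        int hash = Math.abs(key.hashCode() % clusterNodes.size());
        return clusterNodes.get(hash);
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("A", "B", "C");
        System.out.println(pickNode("name", nodes));
    }
}
```

The same key always maps to the same node as long as the membership stays unchanged, which is what lets every node compute an element's location locally, without any lookup service.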

Now, the question is what we do when a node crashes. All elements stored by that node are gone, and have to be re-read from somewhere, usually a database.

To provide highly available data and minimize access to the database, a common technique is to replicate data. For example, if we replicate K to all 10 nodes, then we can tolerate 9 nodes going down and will still have K available.

However, this comes at a cost: if every node replicates all of its elements to all cluster nodes, then we can effectively use only 1/N of the 'cloud memory' (10 GB), which is 1 GB... So we trade access to the large cloud memory for availability.

This is like RAID: if we have 2 disks of 500 GB each, then we can use them as RAID 0 or JBOD (Just a Bunch of Disks) and have 1 TB available for our data. If one of the disks crashes, we lose data that resides on that disk. If we happen to have a file F with 10 blocks, and 5 were stored on the crashed disk, then F is gone.

If we use RAID 1, then the contents of disk-1 are mirrored onto disk-2 and vice versa. This is great, because we can now lose 1 disk and still have all of our data available. However, we now have only 500 GB of disk space available for our data !

Enter ReplCache. This is a prototype I've been working on for the last 2 weeks.

ReplCache allows for variable replication, which means we can tell it on a put(key, value, K) how many copies (replication count) of that element should be stored in the cloud. A replication count K can be:
  • K == 1: the element is stored only once. This is the same as what PartitionedHashMap does
  • K == -1: the element is stored on all nodes in the cluster
  • K > 1: the element is stored on exactly K nodes. ReplCache makes sure to always have K instances of an element available, and if the count drops because a node leaves or crashes, ReplCache copies or moves the element to bring the count back up to K
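The three cases above can be illustrated with a small toy model. This is not the real ReplCache code; `storageNodes` and the string cluster are hypothetical stand-ins, showing only how a replication count selects the set of nodes that hold an element.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of how a replication count K picks storage nodes:
// -1 means all nodes, otherwise min(K, N) nodes starting at the key's
// hash position in the membership list.
public class ReplCountDemo {
    static List<String> storageNodes(Object key, int replCount, List<String> cluster) {
        if (replCount == -1)
            return new ArrayList<>(cluster);
        int start = Math.abs(key.hashCode() % cluster.size());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(replCount, cluster.size()); i++)
            result.add(cluster.get((start + i) % cluster.size()));
        return result;
    }

    public static void main(String[] args) {
        List<String> cluster = List.of("A", "B", "C", "D");
        System.out.println(storageNodes("id", 1, cluster));          // one node
        System.out.println(storageNodes("two", 2, cluster));         // two nodes
        System.out.println(storageNodes("everywhere", -1, cluster)); // all four nodes
    }
}
```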
So why is ReplCache better than PartitionedHashMap ?

ReplCache is a superset of PartitionedHashMap, which means it can be used as a PartitionedHashMap: just use K == 1 for all elements to be inserted !

The more important feature, however, is that ReplCache can use more of the available cloud memory, and that it lets a user define availability as a per-element quality of service ! Data that can cheaply be re-read from the DB can be stored with K == 1. Data that must be highly available should use K == -1. And data which should be more or less highly available, but can still be re-read from the DB (perhaps at some cost), should be stored with K > 1.

Compare this to RAID: once we've configured RAID 1, then all data written to disk-1 will always be mirrored to disk-2, even data that could be trashed on a crash, for example data in /tmp.

With ReplCache, the user (who knows his/her data best) takes control and defines QoS for each element !

Below is a screenshot of 2 ReplCache instances (started with java org.jgroups.demos.ReplCacheDemo -props /home/bela/udp.xml) which shows that we've added some data:

[screenshot of the two ReplCacheDemo instances]

It shows that both instances have key "everywhere" because it is replicated to all cluster nodes due to K == -1. The same goes for key "two": because K == 2, it is stored on both instances, as we only have 2 cluster nodes.
There are 2 keys with K == 1: "id" and "name". Both are stored on instance 2, but that's a coincidence. For M keys and N cluster nodes, every node should store approximately M/N keys.

ReplCache is experimental, and serves as a prototype to play with data partitioning/striping for JBossCache.
ReplCache is in the JGroups CVS (head) and the code can be downloaded here. To run the demo, execute:
java -jar replcachedemo.jar

For the technical details, the design is here.

There is a nice 5 minute demo at http://www.jgroups.org/demos.html.

Feedback is appreciated, use the JGroups mailing lists !

Enjoy !

13 comments:

  1. Nice.

    What kind of consistency models are you thinking about?

    -Dave

    ReplyDelete
  2. Are you referring to Werner's blog about eventual consistency [1] and CAP (consistency - availability - partition tolerance) ?

    The prototype is close to an eventually consistent model, because it disseminates updates (puts) asynchronously. JGroups will (eventually) deliver the updates to all non-faulty nodes. Meaning that if we send update U, and node N crashes before receiving U, N will be inconsistent.

    This is only relevant though when N stores all updates in a persistent store and reinitializes itself from that store.

    So, following the CAP theory, ReplCache supports AP, but not C; C is only eventually consistent.



    [1] http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

    ReplyDelete
  3. I forgot to say that we could easily switch this model to support CA (consistency and availability): the put() method could be changed to disseminate updates synchronously and atomically, e.g. updates are applied at all chosen nodes, or not at all. This is called uniform delivery in group comm theory.
    The downside is that partitions prevent progress...

    ReplyDelete
  4. The fact that each ReplCache opens its own JChannel is not very flexible: if I want to create a few hundred caches (each needs to be an instance of ReplCache, since I need to be able to clear a specific cache), opening hundreds of JChannels seems a bit suboptimal.
    Any advice on how to do this using ReplCache, or if you have any plans to make it a bit more flexible?

    ReplyDelete
  5. Hmm, one possibility would be to inject a channel/RpcDispatcher into an instance of a ReplCache. This way, multiple ReplCaches could use the same channel.
    However, you'd have to multiplex/demultiplex requests between ReplCaches and the channel.
    Handling of view changes also gets a bit more complex.
    Another, simpler, solution would be to maintain regions and add all keys added into ReplCache into a specific region's set.
    When you want to clear all keys in a given region, you iterate over the region's set S and call ReplCache.remove(K) for each K which is an element of S.
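    The simpler, region-based solution described above could look roughly like this. This is a hypothetical sketch: `RegionedCache` and its plain `HashMap` backing store are illustrative stand-ins for a shared ReplCache instance.

    ```java
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch of region-based clearing: track the keys added per region,
    // then clear a region by removing each of its keys from the cache.
    public class RegionedCache<K, V> {
        private final Map<K, V> cache = new HashMap<>();        // stands in for ReplCache
        private final Map<String, Set<K>> regions = new HashMap<>();

        public void put(String region, K key, V value) {
            cache.put(key, value);
            regions.computeIfAbsent(region, r -> new HashSet<>()).add(key);
        }

        public V get(K key) { return cache.get(key); }

        // Clear one region: iterate over its key set S and remove each K in S.
        public void clearRegion(String region) {
            Set<K> keys = regions.remove(region);
            if (keys != null)
                keys.forEach(cache::remove);
        }
    }
    ```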

    In my article, I mentioned the concept of regions, but in my definition regions were sets of elements which would be co-located, so the consistent hash function would take the region (rather than the individual keys) as input.
    ReplCache.remove() could then be overloaded to take the name of a region...

    ReplyDelete
  6. Thanks for the help.
    While waiting for your comment, I implemented something along the lines of your first option, the injection of a channel/RpcDispatcher into ReplCache. I'm now testing whether all the multiplexing/demultiplexing works as expected.
    Would you consider this kind of approach as something to go into jgroups release somewhere in the future?

    ReplyDelete
  7. Yes, sure, if it works.

    I was a bit hesitant in my reply, because we've run into quite a bit of trouble trying to multiplex the JChannel itself. This was the org.jgroups.mux.Multiplexer/MuxChannel, which has since been deprecated...

    Do you envisage symmetric setups ? E.g.

    ReplCaches A,B,C on channel C1 and
    ReplCaches A,B,C on channel C2 ?

    Or do you accept scenarios like this ?

    ReplCaches A,B,C on channel C1 and
    ReplCaches A,C on channel C2 (no B on C2) ?

    The issue with B missing on C2 is that the 'view' for A is {C1,C2}, for B it is {C1} and for C it is {C1,C2}. This can bring with it unpleasant surprises, such as service views versus real views. Please google for "JGroups Multiplexer" and read my (negative) comments, and see whether your impl avoids the pits we've fallen into !
    Cheers,

    ReplyDelete
  8. Are C1 & C2 on the same node?

    If so, then (at least in my use case) I assume cache A will only need 1 channel for its communication needs.

    If they are on separate nodes, then how is handling a node failure (which would cause the asymmetric situation) different from handling it in the situation where each cache creates its own channel?

    I don't have detailed knowledge of jgroups implementation for handling this so I might be missing something quite obvious, sorry if that is the case.

    ReplyDelete
  9. C1 and C2 could be different processes on the same host, or they could be on different hosts.

    Well, if you have A, B and C sitting on C1 and A and C sitting on C2, then you have different 'service views' for {A,C} and {B}.

    The problem with this is that consistent hashing for a key K added to C2's B ReplCache instance might get hashed to C1, but there is no B created on C1 !

    That's why - if we go for a multiplexed model - we'd have to require *symmetric* setups, which might be a little too restrictive.
    I'd also have to look into all the other issues we ran into with the Multiplexer.

    ReplyDelete
  10. Even if symmetric setups are restrictive, I think there are a few use cases which would benefit from the possibility of using the same channel for several caches. Will you consider refactoring it a bit to allow for that?

    ReplyDelete
  11. Yes, I think at least being able to inject the RpcDispatcher/Channel into a ReplCache instance makes sense.

    ReplyDelete
  12. Bela, does this ReplCache store elements in memory only (and not on disk)?
    I am looking for something like a distributed EhCache (which can use a DiskStore) with variable replication like ReplCache. In our case we need to store ~5 GB in the cache, so I would rather not replicate all data to all nodes :)

    One more problem with EhCache is that its disk store becomes invalid if JVM crashes (and the cache becomes empty).

    ReplyDelete
  13. No, ReplCache doesn't currently use a disk as backup storage. It is only a prototype to experiment with variable caching using replication counts.
    However, Infinispan (www.infinispan.org) provides cache storing and loading to and from disk, DBs, other caches etc.

    ReplyDelete