Tuesday, January 13, 2015

RAFT consensus in JGroups

I'm happy to announce the first alpha release of jgroups-raft, which is an implementation of RAFT [2,3] in JGroups. The jgroups-raft project is currently a separate project on GitHub [1], but may be integrated into JGroups at a later stage.

The functionality includes leader election (section 5.2 in [3]), log replication (5.3), snapshotting and log compaction (7). Cluster membership changes (6) has not yet been implemented; the system currently requires a static membership.

The persistent log is implemented using LevelDB (MapDB support is not complete yet). Also, leader election based on the log commit status (and length) (5.4.1) has not been implemented.

The code quality is alpha at best, and the functionality hasn't been tested with unit tests. Use at your own risk.


So what can jgroups-raft currently be used for ?


Mainly to experiment with RAFT consensus in JGroups. The system comes with a demo of a replicated state machine (replicated hashmap) which can be used to update state in a fixed-size cluster with consensus. The majority (RAFT.majority) is 2, so nore more than 3 instances should be started.
Start the 3 instances like this:

bin/demo.sh -name B -follower
bin/demo.sh -name C -follower
bin/demo.sh -name A

The -follower flag is optional, but it skips leader election for a quick startup (and issues with the missing implementation of 5.4.1).

Note that the -name flag is used as both the logical name of a member and the name of the log. So, after starting the 3 instances, the temp directory will contain logs A.log, B.log and C.log (using LevelDB).

If we kill B and start it again as B, then B.log will be used again. If we start a member D, then this is considered a new member and a log D.log will be created.

Here's the output at C after adding a new entry foo=bar and printing the log:
[1] add [2] get [3] remove [4] show all [5] dump log [6] snapshot [x] exit
first-applied=1, last-applied=3, commit-index=3, log size=55b

1
key: foo
value: bar
[1] add [2] get [3] remove [4] show all [5] dump log [6] snapshot [x] exit
first-applied=1, last-applied=4, commit-index=3, log size=70b

-- put(foo, bar) -> null
5

index (term): command
---------------------
1 (1): put(name, Bela)
2 (1): put(id, 322649)
3 (1): put(name, Bela Ban)
4 (7): put(foo, bar)

[1] add [2] get [3] remove [4] show all [5] dump log [6] snapshot [x] exit
first-applied=1, last-applied=4, commit-index=4, log size=70b

4
{foo=bar, name=Bela Ban, id=322649}
[1] add [2] get [3] remove [4] show all [5] dump log [6] snapshot [x] exit
first-applied=1, last-applied=4, commit-index=4, log size=70b


We can see that the state consists of 3 entries and the log has 4 elements (name was changed twice).

When a node is killed and restarted, the state machine is reinitialized from the log:

[mac] /Users/bela/jgroups-raft$ bin/demo.sh -name C -follower
LOG is existent, must not be initialized
777 [DEBUG] RAFT: set last_applied=4, commit_index=4, current_term=7
778 [DEBUG] RAFT: snapshot /tmp/C.snapshot not found, initializing state machine from persistent log
781 [DEBUG] RAFT: applied 3 log entries (2 - 4) to the state machine

-------------------------------------------------------------------
GMS: address=C, cluster=rsm, physical address=192.168.1.3:54886
-------------------------------------------------------------------
-- view change: [B|4] (3) [B, A, C]
[1] add [2] get [3] remove [4] show all [5] dump log [6] snapshot [x] exit
first-applied=1, last-applied=4, commit-index=4, log size=70b

4
{foo=bar, name=Bela Ban, id=322649}


We can see that the state machine was initialized from the persistent log.

If a member is down for a considerable amount of time, and then started again, it may be out of sync, and - if a snapshot was taken at the leader - the first log entry of the leader might be higher than the last commited log entry at the member. In this case, the leader will transfer its snapshot to the restarted member first, and then the usual algorithm is used to bring the restarted member up to date.


What's next ?

We're currently experimenting with an implementation of etcd [5] over jgroups-raft. Also, we're looking into how to use RAFT consensus in Infinispan [6].

I'm currently putting the finishing touches on a JGroups workshop (more on this soon), and will return to work on jgroups-raft after that. The next work items include
  • unit tests and code reviews
  • leader election comparing logs (5.4.1)
  • alternative ELECTION protocol using the JGroups built-in features (reduces code)
  • cluster membership changes
  • consistent reads; reads are currently dirty (section 8 has not yet been implemented)


Please use the mailing list [4] for feedback, questions and discussions.

Cheers,
Bela

[1] https://github.com/belaban/jgroups-raft
[2] http://raftconsensus.github.io/
[3] http://ramcloud.stanford.edu/raft.pdf
[4] https://groups.google.com/forum/#!forum/jgroups-raft
[5] https://github.com/redhat-italy/jgroups-etcd
[6] http://www.infinispan.org



No comments:

Post a Comment