« Anyone using Macs for Java development? | Main | Infiniband on Linux, not having much luck »

May 10, 2007

Are software only based quorum mechanisms possible?

This topic keeps coming up again and again. Is it possible for a cluster to hosts a software singleton (that needs exclusive access to state in a repository) to implement quorums absolutely? Yes, if you have a state device that can act as a tiebreaker. This is possible when the state repository itself can act as a tie breaker if a network partition splits the cluster and some singleton service in the cluster needs exclusive access to that state. The repository can grant access to the singleton in one partition or the other and this prevents split brain syndrome where both partitions host singletons that independantly update and therefore corrupt the state respository. WebSphere 6.x uses this approach for transaction logs and its messaging. We use an NFS v4 leased lock on the transaction log file as a tie breaker for the transaction manager and a database for the tie breaker on the messaging engine or an NFS v4 lock when the messages are stored in a file. The external tie breaker that manages the state also means we don't need quorums to do it. Each partition can elect a member to host the service and that service can either access the state or not depending on whether the tie breaker will allow it or not.

The tie breaker needs to be absolute and perfect for this approach to work.

What happens when you don't have a tie breaker? You can't do it. When the network partition forms, both partitions have no connectivity with each other, this is why it's partitioned. Each partition forms independantly on different time lines, for example, a cluster of 5 servers might partition in to two clusters of 3 and 2. But, 2 might know before 3 that the membership changed or vice versa or maybe it's at the same time. They are disconnected or partitioned so there is no way to order the membership changes. Therefore, each partition will attempt to start the singleton service and the service will try to gain access to the state. An external tie breaker can arbitrate the state access between the two partitions absolutely and resolve the partition in favor of one or the other. This is what WebSphere does when it uses the NFS v4 lock or the database lock/table entries. If you don't have an absolute tie breaker, a tyrant basically, then there is no way to guarantee only one partition is hosting the singleton service that was running on the original cluster before it partitioned.

WebSphere has a way to use blade centers as a tie breaker also. We use the blade center service processor as a tie breaker. When the cluster partitions, the majority partition will tell the service processor to power down all the servers in the minority cluster before recovering any singletons. This guarantees they are down before we start recovery in the majority partition and this then lets us guarantee the singleton only runs in one place at a time. If the service processor cannot be contacted or the service processor can't power down the blade then the majority cluster disables the group and we require manual intervention to sort it out.

Can an ObjectGrid be a tiebreaker?

Yes... so long as the ObjectGrid cluster itself stored the state and also used a tie breaker for persisting the state :), so no really, we'd still need a tiebreaker. We could store transaction logs in an ObjectGrid and so long as we can guarantee the ObjectGrid will only allow one party to access the log even when partitions happen then it would be fine. We could do this by using the service processor option. This works because this would only allow a single partition to exist at any one time. An NFS lock wouldn't work because the acquisition and loss of the lock isn't intimately tied to the modification of data within the memory ObjectGrid. Now, we could store the data only while we a partition has the lock. We would have to recheck the lock on every request but even here unless we stored the data in the file protected by that lock then it would be hard to close the window completely, maybe we lost the lock just after looking at it and then we committed the data. The ObjectGrid would really need to store the data in a file synchronously. The lock on the file would then arbitrate access and ensure that only one ObjectGrid cluster partition could modify the file/state at the same time. But, this again uses an absolute tiebreaker (the NFS lock) so nothing changes, you still need a tie breaker that can arbitrate state access perfectly.


As far as I know, there is no way that a singleton software service can be guaranteed to run in a cluster in exactly one place without a tie breaker to arbitrate network partitions (i.e. usually some hardware device) that is infallible. There are a variety of forms of tie breaker as I discussed but the tie breaker must be able to arbitrate access to the state required by the service. This is why using an NFS lock to arbitrate a service that uses a database doesn't work, the NFS lock doesn't stop the service from accessing state in the database. It works for our transaction manager because the lock is required to be able to edit the transaction log file. This is really important. The tie breaker needs to be able to isolate the parties that are refused access from interacting with the state where ever it is stored. Programming logic can allow a database to do this using locking or other approaches. NFS V4 locks clearly do this for files and blade service processors do this for any state by basically ensuring the other parties are physically unable to interact with the state by switching them off, kind of extreme but hey, it works.

You may be able to come up with heuristics for this but there will always be a window when both partitions can simultaeneously access the shared state without this kind of tie breaker.

Discovering everyday mechanisms that could be used as tie breakers was essential for the HA work in WebSphere 6.0 and the arrival of NFS v4 was critical to the success of that strategy as it allowed us to implement guaranteed HA for the transaction logs etc.

May 10, 2007 | Permalink


Post a comment