April 27, 2006
WebSphere group membership overview
WebSphere ND and XD include the new HAManager component, which provides group services, cluster membership, and related facilities for various components in the runtime.
What does it do?
It determines the current set of live JVMs from the set of all possible members and tells every live JVM the current membership using total ordering. This means that every member sees all membership lists in exactly the same order. We use this for elections, making things fault tolerant, and so on.
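The total-ordering idea can be sketched as follows. This is purely illustrative, not WebSphere's actual API: each membership "view" carries a monotonically increasing id, and every JVM receives views in the same id order, so all members agree on the membership history.

```java
import java.util.*;

// Hypothetical sketch (not the real HAManager API): a totally ordered
// membership view. Every member receives the same sequence of views,
// identified by a monotonically increasing view id.
public class MembershipView {
    final long viewId;                 // total order: view N+1 follows view N everywhere
    final SortedSet<String> members;   // current set of live JVMs

    MembershipView(long viewId, Collection<String> members) {
        this.viewId = viewId;
        this.members = Collections.unmodifiableSortedSet(new TreeSet<>(members));
    }

    // A successor view: same total order, one higher id, new member set.
    MembershipView next(Collection<String> newMembers) {
        return new MembershipView(viewId + 1, newMembers);
    }

    public static void main(String[] args) {
        MembershipView v1 = new MembershipView(1, List.of("jvmA", "jvmB", "jvmC"));
        MembershipView v2 = v1.next(List.of("jvmA", "jvmC")); // jvmB failed
        System.out.println("view " + v2.viewId + ": " + v2.members);
    }
}
```

Because every JVM applies views in the same order, a leader election can be as simple as "lowest-named member of the current view wins": everybody computes the same answer without extra messages.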
How does it detect failure?
We keep sockets open between peer JVMs. We heart beat over some of these sockets periodically. If a socket is closed or N heart beats are missed, then the JVM that notices will tell everybody in the current view that the other JVM is suspect. We then try to install a new membership view on the surviving members.
If a JVM exits/panics/crashes, then typically the socket is closed immediately, which gives a very fast failure detector for what's likely the most common case. If the box hosting the JVM or its OS dies, then heart beating is typically what detects the failure. The default detection time for heart beats is 200 seconds (approx). It can be tuned to much smaller amounts of time, as low as single digit seconds.
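The heartbeat side of this detector can be sketched as below. The interval and miss count are illustrative tunables, not WebSphere's real configuration names: a peer becomes suspect once it has been silent for more than N heartbeat intervals, while a closed socket would mark it suspect immediately.

```java
import java.util.*;

// Hypothetical sketch of a heartbeat-based failure detector (illustrative,
// not WebSphere's implementation). A peer is suspected once it misses
// maxMissed consecutive heartbeat intervals.
public class HeartbeatDetector {
    private final long intervalMillis;              // how often peers heart beat
    private final int maxMissed;                    // missed beats before suspicion
    private final Map<String, Long> lastHeard = new HashMap<>();

    HeartbeatDetector(long intervalMillis, int maxMissed) {
        this.intervalMillis = intervalMillis;
        this.maxMissed = maxMissed;
    }

    void onHeartbeat(String peer, long nowMillis) {
        lastHeard.put(peer, nowMillis);
    }

    // Slow path only: a closed socket gives instant detection instead.
    boolean isSuspect(String peer, long nowMillis) {
        Long last = lastHeard.get(peer);
        if (last == null) return false;             // never heard from: not yet a member
        return (nowMillis - last) > intervalMillis * maxMissed;
    }

    public static void main(String[] args) {
        HeartbeatDetector d = new HeartbeatDetector(1000, 2);
        d.onHeartbeat("jvmB", 0);
        System.out.println("suspect after 3s of silence: " + d.isSuspect("jvmB", 3000));
    }
}
```

This shows the trade-off mentioned above: shrinking the interval or miss count shortens detection time but raises the risk of falsely suspecting a JVM that is merely slow, e.g. during a long garbage-collection pause.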
If a box dies, then typically multiple JVMs disappear together. This is handled because one of the dead JVMs is detected first; the membership change protocol will then detect the other failed JVMs when it attempts to install the new view. So, typically, the view change isn't delayed by much when multiple failures occur, because the membership protocol isn't interrupted when further failures are detected, although it clearly takes longer than when a single JVM fails.
If a cluster gets split by a network failure, then each partition carries on independently. When the network fault is fixed, we need them to merge. The merge is triggered because each partition is periodically trying to contact non-members of its group. If a group contacts a member of another group, then a merge is initiated once all members of both partitions verify that they can communicate with each other.
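The merge precondition, "all members of both partitions verify that they can communicate with each other", amounts to a full pairwise reachability check. A minimal sketch, with an assumed `canReach` probe standing in for the real connectivity test:

```java
import java.util.*;
import java.util.function.BiPredicate;

// Hypothetical sketch of the merge precondition (illustrative, not the real
// protocol): two partitions merge only after every member of each partition
// can reach every member of the other, in both directions.
public class MergeCheck {
    static boolean canMerge(Set<String> partition1, Set<String> partition2,
                            BiPredicate<String, String> canReach) {
        for (String a : partition1)
            for (String b : partition2)
                if (!canReach.test(a, b) || !canReach.test(b, a))
                    return false;   // one unreachable pair blocks the merge
        return true;
    }

    public static void main(String[] args) {
        Set<String> p1 = Set.of("jvmA", "jvmB");
        Set<String> p2 = Set.of("jvmC");
        System.out.println("can merge: " + canMerge(p1, p2, (a, b) -> true));
    }
}
```

Requiring every pair to verify connectivity before merging avoids installing a combined view that would immediately fall apart because some links are still broken.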
Adding new members
Remember, a cluster is periodically probing 'dead' JVMs regardless. When a JVM starts, it also tries to contact the members. Either of these mechanisms will detect the new JVM and then trigger what's basically a merge operation to add the new member. However, when a new member is detected, we don't install a view immediately. First, we wait for all members of the current membership to verify that they can also see the new JVM; once this happens, the new JVM is 'connected'. Then, we add a further delay in the 30 second range before triggering a membership change. This helps with the situation when a box starts and multiple JVMs start concurrently. The first JVM is spotted first, but the others will likely also be detected within the 30 second window. This means we get one new membership event rather than N, though it does add latency, slowing the inclusion of new members. It helps avoid HAManager policy thrashing and saves MIPS across the cluster. It's a trade-off.
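The batching described above can be sketched like this. The settle window length and method names are assumptions for illustration; the point is that several JVMs connecting within one window produce a single membership change instead of N.

```java
import java.util.*;

// Hypothetical sketch of join batching (illustrative, not the real code):
// newly verified JVMs are held for a settle window (the text says roughly
// 30 seconds; configurable here) before one combined view change fires.
public class JoinBatcher {
    private final long settleMillis;
    private final Set<String> pending = new TreeSet<>();
    private long windowStart = -1;

    JoinBatcher(long settleMillis) {
        this.settleMillis = settleMillis;
    }

    // Called once all current members have verified they can see the new JVM.
    void onConnected(String jvm, long nowMillis) {
        if (pending.isEmpty()) windowStart = nowMillis;   // open a new window
        pending.add(jvm);
    }

    // Returns the batched joiners once the window elapses, else an empty set.
    Set<String> drainIfReady(long nowMillis) {
        if (pending.isEmpty() || nowMillis - windowStart < settleMillis)
            return Collections.emptySet();
        Set<String> batch = new TreeSet<>(pending);
        pending.clear();
        return batch;
    }

    public static void main(String[] args) {
        JoinBatcher b = new JoinBatcher(30_000);
        b.onConnected("jvmA", 0);        // box boots, several JVMs start together
        b.onConnected("jvmB", 5_000);
        System.out.println("one batched event: " + b.drainIfReady(30_000));
    }
}
```

This is the latency-versus-churn trade-off in miniature: a longer window coalesces more joins into one event, a shorter one admits new members faster.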
So, the membership changes
Once the membership changes, the HAManager coordinator figures out which JVMs can host which singletons and then figures out which policy controls which singleton. Next, it attempts to make sure the policy for each singleton is being met. This usually means the singleton is running on the 'best' JVM in the current membership.
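One simple form of "best JVM" selection is a preference-ordered policy: run the singleton on the most preferred JVM that is currently alive. A minimal sketch, assuming such an ordered-preference policy (real HAManager policies are richer than this):

```java
import java.util.*;

// Hypothetical sketch of the coordinator's placement step (illustrative
// only): for each singleton, pick the first JVM in the policy's preference
// order that is present in the current membership.
public class SingletonPlacement {
    static Optional<String> place(List<String> preferredOrder, Set<String> liveMembers) {
        return preferredOrder.stream()
                .filter(liveMembers::contains)   // skip JVMs not in the current view
                .findFirst();                    // empty if no eligible host survives
    }

    public static void main(String[] args) {
        List<String> policy = List.of("jvmA", "jvmB", "jvmC");
        Set<String> live = Set.of("jvmB", "jvmC");   // jvmA has failed
        System.out.println("singleton runs on: " + place(policy, live));
    }
}
```

Because every member sees the same membership view, every member computes the same placement, so the singleton fails over deterministically when its current host drops out of the view.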
The HAManager can also handle critical singletons where network partitions must be handled in a guaranteed fashion, but this requires integration with the hardware platform (e.g. an xSeries BladeCenter) through scripts. WebSphere doesn't normally require this, as it's coded to handle partitions using a non-quorum approach.
Nobody should come to IBM support and claim that this description is gospel or that it explains exactly how our group services work. It would likely take a book to explain the various nuances of the HAManager algorithms; we have many man-years invested in it, and a few paragraphs don't describe how it works in detail. This blog entry merely gives a very high-level taste of how our membership algorithms work in the 6.x releases. The description here is not sufficient to diagnose problems in detail, but it can help with figuring out what is going on.