April 06, 2005
WebSphere 6.0 failure detection tuning
The 6.0.1 infocenter has a section on tuning the heart beating for the high availability manager. See here. For now it's only documented in the z/OS documentation; I'm getting that fixed so it also appears in the distributed/Unix section.
The two properties let you tune both the period at which heart beats are sent and the number of missed heart beats that indicates a failure. A side effect (about to be documented in the infocenter for 6.0.2) is that when a server JVM starts (DMgr, NodeAgent, cluster member, etc.), the HAManager uses a protocol that waits for two heart beat intervals before installing the initial view. You can see when a view is installed from the HMGR0218I messages in SystemOut. These indicate how many servers are visible to a particular JVM, which should always be the number of JVMs running. The default heart beat interval is 10 seconds, so once the HAManager is initialized at server start, the view is installed after a minimum of 20 seconds.
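As a rough sketch of the arithmetic above, here is a bit of Python that works out the earliest point at which the initial view can be installed, plus the time needed to declare a peer failed. The function and parameter names are hypothetical and are not the actual WebSphere property names. The detection-time formula (interval multiplied by the number of missed heart beats) is my reading of how the two settings combine, and the missed-heart-beat count of 3 in the usage is purely illustrative, not a product default.

```python
def initial_view_delay_secs(heartbeat_interval_secs):
    """Minimum wait before the initial view is installed.

    The post says the HAManager waits for two heart beat intervals.
    """
    return 2 * heartbeat_interval_secs


def failure_detection_secs(heartbeat_interval_secs, missed_heartbeats):
    """Approximate time to declare a peer failed.

    Assumption: a failure is flagged once `missed_heartbeats` consecutive
    heart beats, each `heartbeat_interval_secs` apart, have been missed.
    """
    return heartbeat_interval_secs * missed_heartbeats


if __name__ == "__main__":
    # Default interval from the post: 10 seconds.
    print(initial_view_delay_secs(10))
    # Illustrative only: 3 missed heart beats is not a documented default.
    print(failure_detection_secs(10, 3))
```

With the default 10-second interval, the first function gives the 20-second minimum mentioned above.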
This isn't normally an issue because there is easily 20 seconds' worth of startup left once the HAManager is started, and this 20-second wait runs in parallel with the rest of server start. Clearly, if a customer set the interval to 10 minutes, the server wouldn't become ready for 20 minutes, which is a big problem. So don't set it to such a high value. The overhead of heart beating, even in large cells, is almost noise, so increasing the interval beyond 10 or 20 seconds doesn't really buy you anything.
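If you want to sanity-check a proposed interval before applying it, a guard along these lines might help. It is only a sketch: the function name is hypothetical, and the 20-second startup budget is taken from the rough figure above rather than from any documented limit.

```python
def check_heartbeat_interval(interval_secs, startup_budget_secs=20):
    """Return (view_delay_secs, fits_in_budget) for a proposed interval.

    view_delay_secs is two heart beat intervals, per the post.
    startup_budget_secs is an assumed figure for how much other startup
    work overlaps with the wait; adjust it for your own servers.
    """
    view_delay_secs = 2 * interval_secs
    return view_delay_secs, view_delay_secs <= startup_budget_secs


if __name__ == "__main__":
    for interval in (10, 600):  # the default, and the 10-minute example
        delay, ok = check_heartbeat_interval(interval)
        status = "fine" if ok else "delays server readiness"
        print(f"interval={interval}s -> view after {delay}s ({status})")
```

The 10-minute case comes out at 1200 seconds, which is the 20-minute wait described above.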
Anyway, just a heads up while this makes its way into the infocenter. Also, always check the infocenter after reading my stuff here: what you're reading here may be out of date, but the infocenter is always current in the end.