December 14, 2005
JGroups slowly improving
Saw Bela's latest JGroups release on TSS. He's slowly improving it. The bug fixes look familiar from what I remember of the HAManager development, but the solutions he discussed are kind of strange, though they probably work. Group services is complex stuff. We probably have at least 20 man-years in our group services support (HAManager) at this point, much more if you count all the testing hours it's receiving in the company.
The performance numbers Bela posted were pretty low though; I'm wondering if he was using 100Mbit Ethernet or something. He should be able to easily saturate a gigabit network connection even with small messages. We can send messages at a much higher rate than that through HAM/DCS/RMM in WAS. Our problem typically is that we can't generate messages fast enough to push RMM. It's an unusual web app that can generate 35-40MB of replication traffic per second per box. Even serializing that kind of message rate typically takes a lot more CPU than sending it.
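As a rough sanity check on the link speeds mentioned here, this sketch converts raw link rates to MB/sec and shows how much of a gigabit link 35-40MB/sec of replication traffic would use. It ignores protocol overhead, and the function name is made up for illustration.

```python
def link_capacity_mb_per_sec(link_mbit_per_sec):
    """Raw link capacity in MB/sec: 8 bits per byte, protocol overhead ignored."""
    return link_mbit_per_sec / 8.0

fast_ethernet = link_capacity_mb_per_sec(100)   # 12.5 MB/sec
gigabit = link_capacity_mb_per_sec(1000)        # 125.0 MB/sec

# Share of a gigabit link used by the replication rates quoted above.
for app_rate in (35, 40):
    share = app_rate / gigabit
    print(f"{app_rate}MB/sec uses about {share:.0%} of a gigabit link")
```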
High reliable message rates are also interesting from a memory point of view. Suppose we're sending messages at a rate of 50MB/sec. That means if the receiver GCs for 500ms, the publisher needs 25MB of extra memory to buffer the new messages during that pause. We detuned HAManager in WAS 6.0 to a 10MB max buffer before you see the congestion messages on the console. We chose a 10MB default because we had to reduce JVM heap sizes. Even so, an app that generates a sustained 10MB/sec of replication traffic is a rare one. But customers can and should tune this number higher if they see congestion-type messages. In testing, we never really used a buffer larger than 30MB even when sending at very fast message rates (50MB/sec). It depends on a lot of factors though: network congestion, GC pauses, etc.
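The buffer arithmetic is simple enough to sketch. This is a simplified model, not HAManager code: it assumes the publisher keeps sending at full rate while the receiver accepts nothing during the pause, and the function names are hypothetical.

```python
def buffer_needed_mb(send_rate_mb_per_sec, pause_ms):
    """Memory the publisher needs to hold messages sent during a receiver pause."""
    return send_rate_mb_per_sec * pause_ms / 1000.0

def max_pause_ms(buffer_mb, send_rate_mb_per_sec):
    """Longest receiver pause a given buffer can absorb at a given send rate."""
    return buffer_mb * 1000.0 / send_rate_mb_per_sec

# 50MB/sec with a 500ms GC pause on the receiver.
print(buffer_needed_mb(50, 500))   # 25.0

# How long a pause the 10MB default and a 30MB buffer cover at 50MB/sec.
print(max_pause_ms(10, 50))        # 200.0
print(max_pause_ms(30, 50))        # 600.0
```

Under these assumptions the 10MB default only covers a 200ms pause at 50MB/sec, but a full second at the 10MB/sec that few apps sustain.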
My guess is the test was CPU bound _not_ network bound since he was testing with both processes on the same box.
Posted by: Rob Misek | Dec 14, 2005 1:00:28 PM
Re-read my blog, dude, and search for 100MBps switch. So 105MB/sec is excellent considering the max of 12.5MB; this is not a GigE switch.
Other numbers for TCP-NIO are 13MB/sec...
I'll post a more detailed analysis in Feb
Posted by: Bela Ban | Dec 15, 2005 7:53:48 AM
Again, Bela, what does the switch have to do with this test? Doesn't the OS shortcut multicast directly to the other process co-located on your machine?
Posted by: Rob Misek | Dec 15, 2005 2:33:57 PM