April 11, 2006
Weirdness with loopback on Linux
We're running some tests for ObjectGrid: simple benchmarks exercising the client/server function. The scenario is simple: a client running one thread hits an ObjectGrid server that has one replica and uses synchronous replication. So we're looping, updating some objects and pushing the changes to the first server, which then replicates them to the replica synchronously.
On my laptop, a run post-JIT takes around 3.8 seconds with all 3 JVMs running on it. When we run all 3 JVMs on one Linux box with a 2.6 kernel, the answer is 42 seconds!!! We then moved the client to another Linux box, so that's the client on Linux box A and the servers on Linux box B. The answer is then 2.3 seconds. If I run the client on a Windows laptop with the servers still on Linux, it's around 3.8 seconds.
So the problem is related to running the client on Linux colocated with the servers. We think something's wrong with the networking support when both ends of the socket are on the same Linux box.
We tried other Unixes: our p5 ML16 (a 16-way 1.9GHz p5 running AIX) runs the test in 1.8 seconds. It's the fastest box here right now and shows no same-box performance issue.
So, for now, we're still trying to figure out what's up with running the client on the same box as the server on Linux. Windows had a problem like this, but they fixed it, and on Windows using the box's IP instead of localhost also fixed it. No such luck on Linux. We've tried running the same scenario on Solaris, AIX, Linux and Windows, and so far only Linux shows the big performance hit. Anyway, the hunt continues.
It may be doing reverse DNS lookups.
Try traceroute 127.0.0.1 and see if it resolves the name.
Posted by: Don Brady | Apr 12, 2006 12:07:34 PM
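For anyone who wants to check the reverse lookup idea from inside a JVM rather than with traceroute, here is a minimal sketch. It only times a reverse lookup of an IP literal; the class and method names are made up for illustration and it has nothing to do with ObjectGrid itself. A lookup that takes seconds rather than milliseconds would hint at a resolver problem.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of ObjectGrid: times a reverse lookup of an IP literal.
public class ReverseLookupTimer {

    /** Returns the elapsed nanoseconds for resolving the canonical host name of the given IP literal. */
    static long timeReverseLookupNanos(String ip) {
        final InetAddress addr;
        try {
            // For an IP literal this only validates the format; no lookup happens here.
            addr = InetAddress.getByName(ip);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not a valid IP literal: " + ip, e);
        }
        long start = System.nanoTime();
        // This call is what may go out to the name service for a reverse lookup.
        String name = addr.getCanonicalHostName();
        long elapsed = System.nanoTime() - start;
        System.out.println(ip + " -> " + name + " in " + (elapsed / 1_000_000) + " ms");
        return elapsed;
    }

    public static void main(String[] args) {
        // Defaults to loopback; pass the box's real IP as the first argument to compare.
        timeReverseLookupNanos(args.length > 0 ? args[0] : "127.0.0.1");
    }
}
```

Running it once with 127.0.0.1 and once with the box's real IP gives a rough comparison of the two cases.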
Did you try it with both 127.0.0.1 and with the real IP of the device? I'd be curious if there is a difference. Also the DNS issue that Don mentions does sound like a strong contender.
Posted by: Scott Carlson | Apr 12, 2006 12:47:56 PM
Nah, tried the easy stuff :) We tried using a proper IP and 127.0.0.1, and it's the same. ping 127.0.0.1 just prints the same IP address.
Current theory is delayed ACKs on Linux, but we're still guessing.
Posted by: Billy | Apr 12, 2006 3:06:20 PM
This is a possibly related Java bug. It seems a change in Linux 2.6.15 triggers bad performance related to TCP_NODELAY.
Posted by: Yuri | Apr 13, 2006 4:55:19 PM
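The delayed ACK theory and the TCP_NODELAY pointer in the comments above can be poked at with a small loopback probe. The sketch below is an assumption-laden stand-in, not the ObjectGrid wire protocol: a hypothetical class that sends each request as two small writes and waits for a one-byte reply, which is the write-write-read pattern where Nagle's algorithm and delayed ACKs tend to interact badly. It runs the same loop with TCP_NODELAY off and on so the two timings can be compared on the box in question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical probe, not ObjectGrid code: times small request/reply round trips over loopback.
public class LoopbackNoDelayProbe {

    /** Runs `rounds` two-write requests against a local echo thread and returns elapsed nanoseconds. */
    static long timeRoundTripsNanos(boolean noDelay, int rounds) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Thread echo = new Thread(() -> {
                try (Socket s = server.accept()) {
                    s.setTcpNoDelay(noDelay);
                    InputStream in = s.getInputStream();
                    OutputStream out = s.getOutputStream();
                    int first;
                    // Reply only after both bytes of a request have arrived.
                    while ((first = in.read()) != -1) {
                        if (in.read() == -1) break;
                        out.write(first);
                        out.flush();
                    }
                } catch (IOException ignored) {
                    // Client went away or the listener was closed; nothing to clean up.
                }
            });
            echo.setDaemon(true);
            echo.start();

            try (Socket client = new Socket(loopback, server.getLocalPort())) {
                client.setTcpNoDelay(noDelay);
                InputStream in = client.getInputStream();
                OutputStream out = client.getOutputStream();
                long start = System.nanoTime();
                for (int i = 0; i < rounds; i++) {
                    out.write(1); // first small write goes out immediately
                    out.write(2); // with Nagle on, this one may wait for the ACK of the first
                    out.flush();
                    if (in.read() == -1) throw new IOException("echo side closed early");
                }
                return System.nanoTime() - start;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int rounds = 100; // arbitrary; small enough to finish quickly even if every round stalls
        long nagle = timeRoundTripsNanos(false, rounds);
        long noDelay = timeRoundTripsNanos(true, rounds);
        System.out.println("TCP_NODELAY off: " + (nagle / 1_000_000) + " ms");
        System.out.println("TCP_NODELAY on:  " + (noDelay / 1_000_000) + " ms");
    }
}
```

If the off case is dramatically slower than the on case on the Linux box but not elsewhere, that would be consistent with the delayed ACK theory, though it would not prove that the same thing is happening in the real benchmark.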