February 28, 2007
Very large heaps and caches
I was playing around with an ObjectGrid scenario today. A customer wants to store 26GB of data in a single JVM managed by ObjectGrid. The first problem was finding a box with enough memory. I used a 16-way p5 with 64GB of memory running AIX. Each CPU has two hardware threads, so the box is basically a 32-way. I used the IBM Java 5 JVM and the current internal build of ObjectGrid.
The customer scenario is read-only: load 26GB of data into the ObjectGrid and then query it when processing incoming requests. We forced full GCs to see how long one takes. A normal full GC takes 8 seconds. A GC with a compact takes 17 seconds. This is using all 32 CPUs. If I limit the GC to 4 threads (-Xgcthreads4), the answer is 55 seconds without a compact, or roughly 7x longer on one eighth of the threads, which is about what we'd expect. IBM's JVM has a fully parallel garbage collector, so GC time scales almost linearly as the number of processors rises. The actual time a GC takes depends on how many object references you have and the complexity of the objects. Fewer, larger objects are obviously better, as that means fewer references to follow, but compacts will still be pretty brutal because the amount of memory is the same or maybe a little bigger.
The Java 5 GC uses two heaps, a nursery and a tenured heap. The nursery is used for all new object allocations, and once objects have survived N nursery GCs they are moved to the tenured heap. The trick with this scenario is that the cached data needs to be in the tenured heap, and we want as little leakage from the nursery (short-lived request objects getting promoted to tenured) as possible once the application starts running.
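For anyone who wants to repeat the forced-GC timing, here is a minimal sketch of one way to do it from inside the JVM. It is not the harness used for the numbers above; the class name `ForcedGcTimer` is made up for illustration, and it assumes `System.gc()` actually triggers a collection, which JVM options can disable.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of ObjectGrid: times one explicit GC request.
class ForcedGcTimer {

    /** Wall-clock milliseconds spent inside one System.gc() call. */
    static long timeForcedGcMillis() {
        long start = System.nanoTime();
        System.gc(); // only a request; the JVM may be configured to ignore it
        return (System.nanoTime() - start) / 1_000_000L;
    }

    /** Accumulated collection time reported by all collectors, in milliseconds. */
    static long totalReportedGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalReportedGcMillis();
        long wall = timeForcedGcMillis();
        long after = totalReportedGcMillis();
        System.out.println("System.gc() wall time: " + wall + " ms");
        System.out.println("Collector-reported time delta: " + (after - before) + " ms");
    }
}
```

Run it after the cache is loaded, once with all GC threads and once with the thread count limited, and compare the two wall times. Verbose GC logs remain the more reliable source for pause times.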
Why? If we have leakage then the tenured heap will start to fill and once it is exhausted then the JVM will do a full GC on 26GB of data. That will result in a large pause as timed above. A box with few CPUs will see a pretty huge pause. If this results in a compact then double it.
So, basically, we want to avoid that. We can avoid leakage through careful tuning of the JVM GC settings.
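When tuning, it helps to confirm that the settings actually took effect. The sketch below prints the JVM's startup arguments and its heap memory pools through the standard `java.lang.management` API. The class name `HeapLayout` is invented for this example, and pool names differ between JVMs, so the output is only a sanity check.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists heap pools so GC sizing flags can be checked at runtime.
class HeapLayout {

    /** One line per heap pool, with used and max sizes in MB. */
    static List<String> describeHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip class storage, JIT caches and other non-heap pools
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            long usedMb = usage.getUsed() / (1024 * 1024);
            // getMax() returns -1 when the pool has no defined maximum
            String max = usage.getMax() < 0 ? "undefined" : (usage.getMax() / (1024 * 1024)) + "MB";
            lines.add(pool.getName() + ": used=" + usedMb + "MB max=" + max);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println("JVM arguments: " + ManagementFactory.getRuntimeMXBean().getInputArguments());
        for (String line : describeHeapPools()) {
            System.out.println(line);
        }
    }
}
```

With a generational policy enabled you should see separate nursery-style and tenured-style pools in the output. If there is only one heap pool, the generational settings probably were not applied.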
The nursery heap size lets us tune normal pauses when the generational GC is working. We need to set the nursery size of the JVM so that pauses during normal processing are acceptably short. The nursery heap will fill during normal processing, and every time it fills we do a generational GC. The trick for a good response time is to keep the nursery small enough that GCs on it are fast. The default was around 40MB today, and that resulted in small but frequent pauses. We ended up using a 400MB nursery, which resulted in 400ms pauses in 32-way mode and 800ms pauses on the 4-way setup. If 400ms is too long then just use a smaller nursery and you'll see more frequent but shorter GC cycles. If the interval between nursery GCs is shorter than your response time then you may get leakage, as objects in a transaction will survive multiple nursery GCs and may be moved to tenure by accident. So, set the nursery big enough that the interval between nursery GCs is comfortably longer than your typical response time for requests.
We need the JVM to almost never move objects from the nursery to the tenured heap unless the objects are part of the 26GB of data in the ObjectGrid. IBM's GC automatically figures out what the best tenure age is before promoting an object to tenured space.
If you don't have leakage when in steady state then it works well. You won't see a full GC during normal operation. The generational GC algorithms will take care of the garbage generated during normal request processing. This kind of setup is basically limited to read-only scenarios or scenarios with rare writes. The pauses incurred by full GCs when the tenured heap fills are the cause of this limitation, unless you can live with the pauses. A heavily multi-cored box or a big SMP like our p5 will greatly speed up GC. A p5 box also has the advantage of very, very fast single-core throughput compared with alternative multi-core type architectures.
If your scenario is a huge heap with rapidly changing data then the GC will be a problem, with large pauses even with a generational collector. The best architecture for now (Feb 07) is smaller JVMs coupled with partitioning, so we'd use JVMs with 400MB heaps, which would probably see < 500ms pauses, and just use 50 of them. If you can't partition the data then there is a problem for now, unless you can live with the pauses. There are tricks we can do to try to minimize GC with very large heaps, but heap fragmentation etc. will likely get you eventually no matter what. GC doesn't seem to work well with in-memory cache/database type scenarios. The bottom line is that the data manager knows the data is strongly referenced and shouldn't be GCed at all anyway, but there is no way to say that right now except by using direct mapped byte buffers. That means writing your own memory management routines, which is a major hassle and a CPU drain for normal processing (converting from byte format to objects and vice versa). Bring back malloc and free :) These problems would go away.
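The sizing rule can be written as a quick back-of-the-envelope check. In this sketch only the 40MB and 400MB nursery sizes come from the test above. The allocation rate, response time, safety margin and the class name `NurserySizing` are all assumptions for illustration. It also treats the whole nursery as allocatable, which overstates the interval on collectors that reserve part of the nursery as survivor space.

```java
// Hypothetical helper: rough estimate of how often the nursery fills.
class NurserySizing {

    /** Seconds between nursery GCs if the nursery fills at a steady allocation rate. */
    static double secondsBetweenNurseryGcs(double nurseryMb, double allocMbPerSec) {
        if (nurseryMb <= 0 || allocMbPerSec <= 0) {
            throw new IllegalArgumentException("sizes and rates must be positive");
        }
        return nurseryMb / allocMbPerSec;
    }

    /** True when the GC interval exceeds the response time by the given safety margin. */
    static boolean comfortablyLongerThan(double gcIntervalSec, double responseTimeSec, double margin) {
        return gcIntervalSec >= responseTimeSec * margin;
    }

    public static void main(String[] args) {
        double allocMbPerSec = 100.0; // assumed allocation rate, measure your own
        double responseTimeSec = 0.5; // assumed typical request time
        double margin = 4.0;          // assumed safety factor

        for (double nurseryMb : new double[] {40.0, 400.0}) {
            double interval = secondsBetweenNurseryGcs(nurseryMb, allocMbPerSec);
            System.out.println(nurseryMb + "MB nursery: GC about every " + interval
                    + "s, comfortable=" + comfortablyLongerThan(interval, responseTimeSec, margin));
        }
    }
}
```

Under these assumed numbers the 40MB nursery would be collected more often than a request completes, which is exactly the case where in-flight objects risk being tenured.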
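To make the byte buffer trade-off concrete, here is a minimal sketch of keeping fixed-size records outside the garbage-collected heap with `ByteBuffer.allocateDirect`. This is not how ObjectGrid stores data. The class `OffHeapRecordStore` and its two-field record layout are invented for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical store: fixed-size (long id, double balance) records in a direct buffer.
class OffHeapRecordStore {
    private static final int RECORD_BYTES = Long.BYTES + Double.BYTES; // 16 bytes per record

    private final ByteBuffer buffer;
    private final int capacity;

    OffHeapRecordStore(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        // Direct buffers live outside the Java heap, so the GC never scans their contents.
        this.buffer = ByteBuffer.allocateDirect(Math.multiplyExact(capacity, RECORD_BYTES));
    }

    /** Converts the fields to bytes and writes them into the given slot. */
    void put(int slot, long accountId, double balance) {
        int offset = offsetOf(slot);
        buffer.putLong(offset, accountId);
        buffer.putDouble(offset + Long.BYTES, balance);
    }

    /** Converts bytes back to a value on every read; this is the CPU cost mentioned above. */
    long accountId(int slot) {
        return buffer.getLong(offsetOf(slot));
    }

    double balance(int slot) {
        return buffer.getDouble(offsetOf(slot) + Long.BYTES);
    }

    private int offsetOf(int slot) {
        if (slot < 0 || slot >= capacity) {
            throw new IndexOutOfBoundsException("slot " + slot);
        }
        return slot * RECORD_BYTES;
    }
}
```

Even this toy version pushes slot allocation onto the caller. Variable-length records, free lists and fragmentation handling are the memory management routines you would have to write yourself.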
Wow. 26Gb of data. Wow.
So, one thing you don't talk about is how the data moves, from presumably object form (since you are trying to avoid byte formats), between the 'server' and the 'client'. If the server really is storing 'objects', then you're doing some kind of byte-oriented serialization, no? Over a pipe. So I'd have to ask, why is the server storing objects in the first place instead of byte arrays ready for serialization to the client? And if you're on a single box, shmem may be a better way to handle this than pipes anyway (avoid copying data at the expense of managing your shmem).
Yes, I'm an ObjectGrid n00b, more or less.
Posted by: Patrick Mueller | Feb 28, 2007 9:23:40 AM
It stores objects because a lot of ObjectGrid type applications run business logic against the data directly, as in this case. If it was like a client/server setup then bytes would work well but we have the equivalent of stored procedures also which would then pay a big cost and for some applications, e.g. a 1000 server grid hosting 500GB of data, all data would only be accessed locally using the data grid apis.
Posted by: Billy | Feb 28, 2007 9:42:24 AM
Something is wrong. I've had heaps of nearly that size on Solaris with fewer CPUs and achieved far better performance. It could just be the VM or AIX or the memory architecture or something. I've never been able to get as good of performance out of the AIX JVM as the Sun or BEA VMs. I've never tried this big of a heap on AIX...
If the stuff is staying in memory what about a concurrent collector?
Posted by: Andy | Feb 28, 2007 11:57:30 AM
26GB is staying in memory. We're keeping what amounts to an in-memory database occupying about 26GB of memory (7 million objects). It's always there. I think you'd see the same issue on Solaris; it just takes time to scan the heap.
Posted by: Billy | Feb 28, 2007 12:46:04 PM
Day 2 results posted.
Posted by: Billy | Feb 28, 2007 8:12:07 PM
Just to clarify, when you say "it takes 8 seconds for a normal GC", I take it you mean that the mark and remark phases stop the world for 8 seconds? I presume you are using a CMS collector?
Does IBM have a parallel remark phase in their Java 5 JVM?
Posted by: Robert Greig | Mar 2, 2007 5:17:48 PM
I'm using a generational garbage collector. IBM has concurrent mark. The mark phase runs in parallel with existing work as a GC nears; once the GC hits, it's parallelized as best it can be. I've tested GC time versus # CPUs, and the IBM Java 5 GC time scales basically linearly with the number of CPUs.
Posted by: Billy | Mar 2, 2007 9:38:51 PM
Sorry I was not being entirely precise, I meant "initial mark" and "re-mark" phases. The Sun JVM also has concurrent and parallel mark.
I presume the IBM JVM also stops the world in the initial mark and re-mark phases?
What I was getting at was that with boxes with multiple cores/CPUs you will reduce the "stop the world" time if you can have a parallel remark phase - Sun 1.6 introduces a parallel remark phase whereas with 1.5 it was serial. Does IBM 1.5 have parallel re-mark?
I think it is important to differentiate between the stop the world pause time and the overall GC time which includes concurrent time. On a small heap the initial mark and remark phases may not be worth worrying about but with larger heaps they may well be.
Posted by: Robert Greig | Mar 3, 2007 11:37:30 AM