September 13, 2007
Tuning for multicore, now the hard part: processors are going NUMA
The easy stuff I've already covered: critical sections, and planning for dramatically lower core speeds over the next 3-5 years whilst trying to run more and more threads to fully utilize all the cores. That's comparatively easy next to what's coming. The next thing is hard: these processors will be organized in a NUMA fashion. What's NUMA? It means that, unlike today's processors, all memory is not equal. The processors will arrange cores into M blocks of N cores apiece, giving M×N cores per processor. A block of cores will have a common cache and its own memory controller.
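As a rough sketch of that layout (the class and the M/N values are purely illustrative, not any real chip's numbers), the core-to-block mapping looks like this:

```java
// Sketch of a NUMA-style core layout: M blocks of N cores each,
// where each block shares a cache and a memory controller.
// The parameters are hypothetical, not a real processor's values.
public class NumaTopology {
    final int blocks;        // M: number of core blocks
    final int coresPerBlock; // N: cores per block

    NumaTopology(int blocks, int coresPerBlock) {
        this.blocks = blocks;
        this.coresPerBlock = coresPerBlock;
    }

    int totalCores() {
        return blocks * coresPerBlock; // M*N cores per processor
    }

    // Which block (and hence which memory controller) a core belongs to.
    int blockOf(int core) {
        return core / coresPerBlock;
    }

    // Two cores share a cache only if they sit in the same block.
    boolean shareCache(int a, int b) {
        return blockOf(a) == blockOf(b);
    }
}
```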
You can see this with AMD's current designs pretty easily. They already use HyperTransport to connect two separate processors into a multicore processor complex. If a core on processor A needs memory that's attached to processor B then it asks B for that data over the HyperTransport link, and vice versa. This happens transparently from a programming point of view: memory is just memory. But it's far from transparent from a performance point of view. Imagine processor A executing code that is too big to fit in its cache and is stored in the memory attached to B. Pretty painful. Now imagine this scaled up on a single chip a few times.
This is likely what's coming. It means the address space is partitioned across core blocks. Your core is no longer directly attached to all of memory as it is now. There is a hidden network of sorts between core blocks, allowing a core in one block to request memory from another block. Accessing this 'non local' memory adds more latency than fetching from the core's own local memory controller.
The processor is becoming like a distributed system. We have cache on the core blocks, which is fastest; local memory, which is next; and then foreign memory, which is memory attached to a different core block. Anyone who has seen NUMA before will recognize this and the tuning issues that come along with it. Clearly, for maximum performance, we want a core to access only its local memory. All memory is equal from a programming model point of view, but fetching memory that is not local to a core will cost you.
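The three tiers can be sketched as a simple cost model (the cycle counts below are made-up illustrative numbers, not measurements; only the ordering matters):

```java
// Sketch of the three-tier memory hierarchy described above.
// The cycle counts are invented for illustration; the point is
// that each tier is substantially slower than the one before it.
public class MemoryCost {
    enum Location { CACHE, LOCAL_MEMORY, FOREIGN_MEMORY }

    static int latencyCycles(Location loc) {
        switch (loc) {
            case CACHE:          return 10;   // shared cache on the core block
            case LOCAL_MEMORY:   return 100;  // the block's own memory controller
            case FOREIGN_MEMORY: return 300;  // a hop across the on-chip network
            default: throw new IllegalArgumentException();
        }
    }
}
```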
Java programmers, unlike C/C++ programmers, have long forgotten about this type of stuff: it's the heap, simple and uniform. That uniformity is the enemy of performance here, because the memory simply isn't uniform any more. The JVMs are going to have to jump through hoops, and the middleware on top will need to cooperate, to make sure that when a thread runs, it runs on a core whose attached memory holds most of the data the thread needs. Clearly, JVM vendors will want to try to hide this, but that's not going to be easy. Just as we need partitioned architectures to scale linearly on grids of servers, the same is coming to software within a JVM so that we can exploit these architectures to their fullest.
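A minimal sketch of that partitioned style, using only standard java.util.concurrent (the class name and structure are hypothetical, not any middleware's API): each worker owns one slice of the data and never reads another worker's slice, which is exactly the access pattern a NUMA-aware JVM or OS could keep local to one core block.

```java
import java.util.concurrent.*;

// Sketch: partition the data so each worker thread touches only its
// own slice. No cross-partition sharing means a NUMA-aware runtime
// could, in principle, keep each thread and its data on one block.
public class PartitionedSum {
    static long sum(long[] data, int partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        Future<Long>[] parts = new Future[partitions];
        int chunk = data.length / partitions;
        for (int p = 0; p < partitions; p++) {
            final int lo = p * chunk;
            final int hi = (p == partitions - 1) ? data.length : lo + chunk;
            // Each task reads only [lo, hi): its private partition.
            parts[p] = pool.submit(() -> {
                long s = 0;
                for (int i = lo; i < hi; i++) s += data[i];
                return s;
            });
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get();
        pool.shutdown();
        return total;
    }
}
```

The design point is the ownership rule, not the sum itself: the only cross-partition traffic is the final reduction of one long per worker.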
Clearly, code will still run if applications don't do this, but it's also clear that it will not run quickly on this hardware. We have a lot of work and innovation ahead of us, in JVM technologies as well as in the middleware on top, to exploit the new features coming in the JVMs and allow high performing applications to be written on these new architectures.
September 13, 2007 | Permalink
But multi-processor Opteron systems (as well as pretty much any other multi-processor system with more than 2 or 4 processors) have always been NUMA. So that's not really new, it's just that this fact has mostly been ignored because multi-processor systems weren't that common.
Posted by: Christof Meerwald | Sep 14, 2007 11:57:55 AM
You're exactly right, but multi-processor machines still present one memory image, and it was only the rare large ones, like the bigger p5 boxes, that were NUMA based. This brings the issues of those older boxes to much more commonly available hardware.
Posted by: Billy | Sep 14, 2007 12:05:15 PM