November 28, 2006
Cache size and multi-core
The trend to multi-core seems to be accelerating as chip vendors try to figure out what to do with all those transistors. Current multi-processor boxes have about 2MB or more of cache per CPU, and a box might have 4-8 CPUs. At 8 single-core CPUs, that gives us 8 cores and 16MB of cache.
Current software is written with the expectation that there is a lot of cache memory. Our path lengths are not short. We inline methods as a common optimization, for example, and this adds to path length.
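As a concrete sketch of the trade-off (class and method names below are my own, purely illustrative): the two methods compute the same result, but the second is roughly what a JIT's inlining produces at a hot call site, trading away call overhead in exchange for duplicated code that must fit in the instruction cache.

```java
public class InlineSketch {
    static int distance(int a, int b) {
        return Math.abs(a - b);
    }

    // Call-based version: one copy of distance() exists in code memory,
    // shared by every call site.
    static int sumViaCalls(int[] xs) {
        int total = 0;
        for (int x : xs) {
            total += distance(x, 0);
        }
        return total;
    }

    // Manually inlined version: the body of distance(x, 0) is expanded
    // in place, as a JIT would do automatically at a hot call site.
    // Faster per call, but the code is duplicated wherever it's inlined.
    static int sumInlined(int[] xs) {
        int total = 0;
        for (int x : xs) {
            total += (x < 0) ? -x : x;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = {3, -1, 4, -1, 5};
        // Same answer either way; only the code footprint differs.
        System.out.println(sumViaCalls(data)); // 14
        System.out.println(sumInlined(data));  // 14
    }
}
```

One or two inlined call sites are harmless; inline everywhere, as JITs like to do on hot paths, and the compiled code footprint grows accordingly.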
This game is finished. CPU designers are using the transistors for cores, not cache. You will likely not see cache memory increase in proportion to the number of cores, because cache is a more expensive way to use the transistors. The makers seem to be using performance per watt as the metric. This means there is less cache to go around.
If the box runs a single process then this is best, as the limited cache can be used to cache that process's instructions. But if multiple processes are run then the cache becomes less effective, as it's shared between multiple sets of instructions. This can have a dramatic impact on performance once you start getting cache misses. You won't see this issue in benchmarks as they don't typically run multiple JVMs per box, but I think moving forward you will see one active process per box in these benchmarks.
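The cost of a miss-heavy access pattern is easiest to demonstrate on the data side. In this illustrative sketch (class names and the array size are my own assumptions), two traversals do the same arithmetic over the same array, but the strided one lands on a different cache line with nearly every access once the array outgrows the cache:

```java
public class CacheWalk {
    static final int N = 1 << 22; // 4M ints (16MB), well beyond a 2MB cache

    // Cache-friendly: consecutive accesses share cache lines, so most
    // reads are hits.
    static long sequentialSum(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Cache-hostile: visits every element exactly once, but jumps by
    // `stride` so consecutive accesses fall on different cache lines.
    static long stridedSum(int[] a, int stride) {
        long s = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < a.length; i += stride) {
                s += a[i];
            }
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = new int[N];
        java.util.Arrays.fill(a, 1);
        // Identical totals, very different miss rates.
        System.out.println(sequentialSum(a)); // 4194304
        System.out.println(stridedSum(a, 16)); // 4194304
    }
}
```

Two processes fighting over one shared cache produce the same kind of slowdown, except neither process can code its way around it: the misses come from the neighbour's working set, not its own access pattern.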
Today, some customers pin processes to certain CPUs. This basically dedicates that CPU's resources, including its cache, to that process and can help with performance. If a multi-core chip has 4 or 8 cores and a single shared cache then you have lost this flexibility to a point, as you can't dedicate cache like you could before.
In a way, CPU makers that are slower to the multi-core game may have a short-term advantage here. For example, Intel is at quad core today: two sockets, quad core. AMD is late and could offer a 4-socket dual-core answer instead. The 4-socket answer has twice the cache of the 2-socket answer, and this may help a lot of today's applications perform better than the two-socket alternative.
The move towards multi-core will lower the cost of computing as customers will be able to have more cores for a lower price than was possible with conventional multi-processor boxes before. However, they will also have less cache memory.
Software that needs to run fast on multi-core will therefore be optimized very differently than software today. A small path length is a must. A small path means fewer clock cycles to execute and more effective use of the instruction cache.
How does a small path influence software design? JITs may have more limited optimization options. Inlining will be frowned on except for the most critical of paths. At the extreme, you could see JITs being turned off, as this would allow the interpreter to be fully cached and may offer better performance. At the same time, JITs will have an advantage over compiled code in that the JIT can compile for the platform it finds itself on, whereas a static compiler doesn't have that capability.
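For what it's worth, you can ask a running JVM about its compiler through the standard java.lang.management API; a small probe along these lines (the JitProbe class name is mine) reports which compiler, if any, is active:

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class JitProbe {
    // Returns a one-line description of the JVM's compilation setup.
    // A JVM running interpreter-only may report no compiler at all.
    static String describe() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null) {
            return "no JIT compiler present";
        }
        return "compiler: " + jit.getName();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```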
Componentization and general-purpose frameworks may become a problem, as they usually add to path length when compared with a custom framework or no components. Frameworks may be able to use aspect technology to reduce the path length to almost custom lengths by eliminating the usual loops invoking callbacks and so on typically seen in frameworks. We may end up having component architectures and frameworks that are compiled rather than executed at runtime.
The program may use a single chip and all the cores within it, so that it can get exclusive use of the chip's cache. The problem here is whether the software can make use of all those cores. Economically, will it be allowed to? All those cores won't be busy all the time. We can see 32-thread and 64-thread chips coming soon. How many software products can really scale to that number of threads? It takes a lot of engineering to do it. So while some will do it for maximum performance, others won't have software that can scale to 64 threads or can't afford idle cores.
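A minimal sketch of what using all the cores means, assuming an embarrassingly parallel workload (class and method names are mine): fan the work out over however many cores the box reports. Whether real software decomposes this cleanly at 32 or 64 threads is exactly the open question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FanOut {
    // Sums the array using one worker thread per reported core.
    static long parallelSum(long[] data) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (data.length + threads - 1) / threads;

        List<Future<Long>> parts = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int lo = t * chunk;
            final int hi = Math.min(data.length, lo + chunk);
            parts.add(pool.submit(() -> {
                long s = 0;
                for (int i = lo; i < hi; i++) {
                    s += data[i];
                }
                return s;
            }));
        }

        long total = 0;
        for (Future<Long> f : parts) {
            total += f.get(); // blocks until that worker finishes
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 2L);
        System.out.println(parallelSum(data)); // 2000000
    }
}
```

A sum partitions trivially; most business logic doesn't, and that engineering gap is what will leave cores idle.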
If the chip gets shared between multiple processes then performance may drop due to cache pollution. That's a lot of threads, and potentially a lot of processes, running on that chip, meaning less cache to go around.
Now if you add in the virtualization trend it gets ugly pretty quickly. People are going to want to virtualize these boxes, but the reality is the older non-multi-core boxes are faster because of the tighter coupling of cache to core that they offered.
Anyway, these constraints are going to influence software design moving forward. Multi-core is here to stay; there doesn't seem to be any alternative to it. This means that adapting today's software to run on these new boxes will be interesting. It's not just about multi-threading the application, it's more complex than that. Let's see what happens... We may see makers offering multiple lines of CPUs: 2- or 4-core high-clock-speed CPUs with large caches, and 8-core or higher chips with less per-core cache. Two lines of chips for different markets.
You mention AMD being late to the game and that they could match 8 cores by having 4-socket machines. Well, AMD has had 4-socket, 8-core machines available since 2005. They are called Opterons. Also, even within each socket the cores have separate caches. Intel's architecture puts 4 MB of cache on a 2-core chip, so a 2P/4C box has 8 MB of cache, all shared, while an AMD 4P/8C Opteron also has 8 MB of cache, but it's 1 MB dedicated per core.
Given these corrections, I do agree with you that for NUMA architectures with core-dedicated caches it would make sense to do less aggressive inlining, so as to be able to fit the working set of code in the smaller per-core cache.
Posted by: Prabuddha | Jan 5, 2007 3:12:24 PM