September 02, 2007
Tuning for multi-core: First, the easy stuff
Multi-core processors have arrived and are becoming the normal processor architecture now. Sun is the most extreme with 8 core processors, Intel has quad core and AMD dual core. This is going to impact tuning application servers in some pretty basic ways. It’s like going back to 2002 and tuning a JVM to run on a biggish SMP from the time. Lets compare a JVM running on a two way box and then on an 8 core processor. First, lets level set using published benchmarks on relative performance of these wide multi-core processors and the now normal dual core. IBM has published WebSphere 6.1 results for both Sun T2000 with the new T2 (8 cores chip) and the power 5+ (two cores per chip). The T2 number is 616 and the p5+ 404. So T2 is faster than p5+ processor for processor. But, lets look at core speeds. Lets divide by the number of cores. We get 616/8 or 77 for T2 versus 404/2 or 202 for p5+. That means the per core, p5 is almost 3x faster than a T2 running the exact same Java code. If I'd had a p6 number (the new IBM processor) then it's worse as it's much faster than a p5+. In terms of instructions per second overall, the T2 is a fast processor but for the context of this article, it’s the core speed that we’ll pay attention to. This core speed is expected to further slow as processors get wider, 16 cores and beyond. You can see these numbers on the spec web site here. Programmers and vendors have had it easy up to now as core speeds kept increasing as path lengths increases. Those days are gone as processor makers adopt wide multi-core processor architectures.
More threads are needed
This is kind of obvious but if you have a thread pool of 10 threads on a two way, you’ll likely need 40 on an 8 core processor. What is the impact of more threads on the application besides the potential of more parallelism? Locking between threads will be a bigger problem. This is simply because there are more threads competing for resources. What resources? Things like access to thread pools, connection pools, database locks. Many programmers use synchronized blocks to serialize access to critical sections. This may not perform well as the number of cores/application threads scales up. Usually, people try to get clever using more specific locks such as MultipleReaderSingleWriter style locks that allow several threads to read the structure concurrently but only a single writer. This helps but even this has issues on a multi-core processor. Why? The core speed is lower. This impacts lock performance simply because the locking code itself takes longer to run. This means the critical section around the locking code of even an RWLock will start to hold up the threads as they start to stack up and compete around that resource to acquire read locks even. More efficient critical sections will help here and the good news is Doug Lea is ahead of this and Java 6 uses rewritten locking code to speed these paths. There are also other tricks such as striping locks which I’m using but haven’t seen anywhere else yet. This means instead of having a single RWLock, lets have say five. A thread can obtain a read lock by acquiring a read lock on any of the five. A thread can obtain a write lock only when it gets a write lock on all 5. This penalizes the writer but allows much more concurrency (five fold) for the common case that is reading. I’ve tested this on an AZUL box with over a 100 cores with good results as well as 32 way p5 systems.
Impact of slower cores on concurrency
Slower cores means code takes longer to execute. As we've already seen, a wide multi-core processor is around 3x slower than a dual core processor. This means code runs this much slower and this includes locking code, code within critical sections and time spent while holding locks on database rows etc. Clearly, holding locks or staying in critical sections longer means more concurrency issues. Applications will need to be modified to address these issues.
Given there are more threads being used and the individual threads can be running as much as 3x slower than in the past, each thread will hold connections for that much longer also. So, pools need to be bigger for two reasons. The first is simply there are more threads. The second is that connections are borrowed from the pool for a longer time because the cores are slower so you need more connections available in the pool.
This is kind of interesting in a way because it’s like going back five years. Back then; our SMP boxes had much slower cores than today. The more extreme multi-core processors today are like moving back a few years in terms of core speeds. Applications will need careful tuning to be able to fully exploit those processors as well as optimize access to shared lock resources such as in memory locks as well as rows in databases. Java faces some unique challenges because its developers typically use a lot of frameworks. Examples of common frameworks would be commons-logging, JavaEE, Hibernate, OpenJPA, Spring and the list goes on. These frameworks help greatly with productivity but do nothing for path length. This extra path length is going to start to work against the parallelism needed to exploit these new processor architectures as longer path length around locking code means locks are held for longer and that does not improve concurrency, it restricts it especially with that code running 3x slower on wide multi-core processors than we are used to. Clearly, vendors will need to pay attention to concurrency and path length like never before as well as application designers may need take a step back to look at what they are doing also, maybe use optimistic locking more on databases rather than pessimistic locks or other such approaches. Anyway, interesting times ahead.
Well, I know plenty of projects that uses stripped locking. EDU.oswego.cs.dl.util.concurrent.ConcurrentHashMap , for example, is what 6-7 years old?
Posted by: Kasper Nielsen | Sep 3, 2007 3:51:55 AM
I meant, described for everyone to see in articles etc. It's all obvious when you think about it. Doug has done great work for everyone over the years.
Posted by: Billy | Sep 3, 2007 8:22:34 AM
Although, when I was tuning ObjectGrid on the AZUL box I remember phoning Doug and asking him about why the RWLock wasn't using striping so I don't think we're talking about the same things here. Yes, a concurrent hash map probably uses an rw lock per bucket but I'm talking about using a set of rw locks per bucket which is different.
Posted by: Billy | Sep 3, 2007 10:47:02 AM
I'm not sure I get the point
This means instead of having a single RWLock, lets have say five. A thread can obtain a read lock by acquiring a read lock on any of the five. A thread can obtain a write lock only when it gets a write lock on all 5.
How is this different from a rw lock that allow multiple concurrent readers but only one exclusive writer, such as for example j.u.c.locks.ReentrantReadWriteLock
Posted by: Kasper Nielsen | Sep 3, 2007 3:16:40 PM
A single RW lock has a single critical section that is entered for granting either read or write locks. This single lock is a choke point in a wide processor/SMP. Using five RW locks provides five critical sections allowing five x the lock grant rate of a single lock.
Posted by: Billy | Sep 3, 2007 3:42:32 PM
That is not really true. A RW lock implementation does not necessarily need to have a critical section. j.u.c.locks.ReentrantReadWriteLock, for example, does not have a critical section.
Posted by: Kasper Nielsen | Sep 3, 2007 4:46:44 PM