September 02, 2007
Tuning for multi-core: First, the easy stuff
Multi-core processors have arrived and are becoming the normal processor architecture. Sun is the most extreme with 8-core processors; Intel has quad-core and AMD dual-core. This is going to impact tuning application servers in some pretty basic ways. It's like going back to 2002 and tuning a JVM to run on a biggish SMP of that era. Let's compare a JVM running on a two-way box and then on an 8-core processor.

First, let's level set using published benchmarks on the relative performance of these wide multi-core processors and the now-normal dual core. IBM has published WebSphere 6.1 results for both the Sun T2000 with the new T2 (an 8-core chip) and the POWER5+ (two cores per chip). The T2 number is 616 and the p5+ number is 404, so the T2 is faster than the p5+ processor for processor. But let's look at core speeds by dividing by the number of cores: 616/8 = 77 for the T2 versus 404/2 = 202 for the p5+. Per core, the p5+ is almost 3x faster than a T2 running the exact same Java code. If I had a p6 number (the new IBM processor) it would be worse still, as the p6 is much faster than a p5+. In terms of total instructions per second, the T2 is a fast processor, but for the context of this article it's the core speed that we'll pay attention to. Core speed is expected to drop further as processors get wider, to 16 cores and beyond. You can see these numbers on the SPEC web site. Programmers and vendors have had it easy up to now because core speeds kept increasing as path lengths increased. Those days are gone as processor makers adopt wide multi-core processor architectures.
More threads are needed
This is kind of obvious, but if you have a thread pool of 10 threads on a two-way, you'll likely need 40 on an 8-core processor. What is the impact of more threads on the application, besides the potential for more parallelism? Locking between threads will be a bigger problem, simply because there are more threads competing for resources. What resources? Things like access to thread pools, connection pools and database locks. Many programmers use synchronized blocks to serialize access to critical sections. This may not perform well as the number of cores and application threads scales up. Usually, people try to get clever with more specific locks such as multiple-reader/single-writer locks, which allow several threads to read a structure concurrently but only a single writer at a time. This helps, but even this has issues on a multi-core processor. Why? The core speed is lower. This hurts lock performance simply because the locking code itself takes longer to run. The critical section around the locking code of even an RWLock will start to hold up threads as they stack up and compete just to acquire read locks. More efficient critical sections will help here, and the good news is that Doug Lea is ahead of this: Java 6 uses rewritten locking code to speed these paths. There are also other tricks such as striping locks, which I'm using but haven't seen anywhere else yet. Instead of having a single RWLock, have, say, five. A thread can obtain a read lock by acquiring a read lock on any of the five. A thread can obtain a write lock only when it gets a write lock on all five. This penalizes the writer but allows much more concurrency (five-fold) for the common case, which is reading. I've tested this on an Azul box with over 100 cores with good results, as well as on 32-way p5 systems.
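The striped lock idea above can be sketched in plain java.util.concurrent terms. This is a minimal illustration, not ObjectGrid code; the stripe count and the way a reader picks a stripe (by thread id here) are arbitrary choices for the sketch.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Striped read/write lock: N RWLocks instead of one.
 * A reader locks any single stripe; a writer must lock all stripes.
 */
public class StripedRWLock {
    private final ReadWriteLock[] stripes;

    public StripedRWLock(int numStripes) {
        stripes = new ReadWriteLock[numStripes];
        for (int i = 0; i < numStripes; i++) {
            stripes[i] = new ReentrantReadWriteLock();
        }
    }

    /** Acquire a read lock on one stripe; returns the stripe index for release. */
    public int lockRead() {
        // Spread readers across stripes; thread id is one cheap way to pick.
        int i = (int) (Thread.currentThread().getId() % stripes.length);
        stripes[i].readLock().lock();
        return i;
    }

    public void unlockRead(int i) {
        stripes[i].readLock().unlock();
    }

    /**
     * A writer must hold the write lock on every stripe. Always acquire
     * in index order so competing writers cannot deadlock each other.
     */
    public void lockWrite() {
        for (ReadWriteLock s : stripes) {
            s.writeLock().lock();
        }
    }

    public void unlockWrite() {
        // Release in reverse acquisition order.
        for (int i = stripes.length - 1; i >= 0; i--) {
            stripes[i].writeLock().unlock();
        }
    }
}
```

Readers only ever touch one of the five lock words, so they contend a fifth as often; the writer pays for this by taking all five locks, which matches the read-mostly workloads described above.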
Impact of slower cores on concurrency
Slower cores mean code takes longer to execute. As we've already seen, a wide multi-core processor can be around 3x slower per core than a dual-core processor. Code runs that much slower, and this includes locking code, code within critical sections and time spent while holding locks on database rows. Clearly, holding locks or staying in critical sections longer means more concurrency issues. Applications will need to be modified to address these issues.
Given there are more threads in use and each individual thread can be running as much as 3x slower than in the past, each thread will also hold connections for that much longer. So pools need to be bigger for two reasons. The first is simply that there are more threads. The second is that connections are borrowed from the pool for a longer time because the cores are slower, so more connections must be available in the pool.
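The two effects above can be put into back-of-envelope numbers. This is just illustrative arithmetic, assuming each worker thread borrows at most one connection at a time and using Little's law (demand = arrival rate x hold time) for the connection pool; none of this is a product-specific formula.

```java
/** Back-of-envelope sizing for thread and connection pools. */
public class PoolSizing {

    /** Keep the threads-per-core ratio as the core count grows. */
    public static int threadPoolSize(int threadsPerCore, int cores) {
        return threadsPerCore * cores;
    }

    /**
     * Little's law: concurrent connections needed =
     * request arrival rate x average connection hold time.
     */
    public static int connectionPoolSize(double requestsPerSecond,
                                         double holdSeconds) {
        return (int) Math.ceil(requestsPerSecond * holdSeconds);
    }
}
```

With 5 threads per core, a two-way needs 10 threads and an 8-core needs 40, matching the example above. For connections: at 200 requests/second, a 50 ms hold time needs 10 connections, but if 3x slower cores stretch the hold time to 150 ms, the same load needs 30.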
This is kind of interesting because it's like going back five years. Back then, our SMP boxes had much slower cores than today, and the more extreme multi-core processors of today are like moving back a few years in terms of core speed. Applications will need careful tuning to fully exploit these processors and to optimize access to shared lock resources, whether in-memory locks or rows in databases. Java faces some unique challenges because its developers typically use a lot of frameworks: commons-logging, Java EE, Hibernate, OpenJPA, Spring, and the list goes on. These frameworks help greatly with productivity but do nothing for path length. That extra path length is going to work against the parallelism needed to exploit these new processor architectures: a longer path through locking code means locks are held for longer, which restricts concurrency rather than improving it, especially with that code running 3x slower on wide multi-core processors than we are used to. Clearly, vendors will need to pay attention to concurrency and path length like never before, and application designers may need to take a step back and look at what they are doing as well, perhaps using optimistic locking on databases more often rather than pessimistic locks, or other such approaches. Anyway, interesting times ahead.
August 25, 2007
XTP: Pat Helland, Square Architectures and Distributed Transactions evolve
It's becoming more and more clear that conventional tiered architectures have a lot of trouble scaling. Conventional tiering has a triangular shape. If requests come in from the left, you tend to see multiple tiers moving right until you hit the data source on the right. The tiers are widest, in terms of number of processors, on the left where the web servers are, and the number drops steadily as we move right toward the data source. This is why these architectures won't scale indefinitely. People try to help scalability by adding data-source offloading technologies such as caching in the left-hand tiers, offloading the data source and allowing it to support more processors in front of it. These triangular architectures will not support XTP (extreme transaction processing).
XTP needs square architectures. A square architecture is one where the right hand side gets wider as the left hand side gets wider. The whole architecture scales from left to right. This is necessary to maintain growing transaction rates with constant transaction times. All resources needed by the system must have a way to scale out.
Products such as ObjectGrid can provide a framework in which to host XTP applications. The applications run using a partitioned model. There is no two-phase commit; transactions span only a single partition. A partition provides a unit of scalable state management for the application. This state includes both relational data and messaging-based data. ObjectGrid unifies both types into a single data model and API. Messages are data and vice versa; there is no distinction.
If a transaction needs to change data across partitions then it's done in multiple steps. The first step is a transaction on partition A. Partition A sends a message to partition B. This message is sent from A to B using single phase commit. As a result, B may receive the same message multiple times and B must be able to handle this. B processes the message asynchronously.
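Because single-phase commit can redeliver, partition B has to make its message processing idempotent. A minimal sketch of one common way to do that, deduplicating on a message id, is below; the class and method names are my own for illustration, not an ObjectGrid API, and a real partition would persist the seen-id set with its state rather than keep it in memory.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch: a receiving partition deduplicates incoming messages by id,
 * so a message redelivered under single-phase commit is applied once.
 */
public class IdempotentReceiver {

    // In a real system this set would be persisted transactionally
    // alongside the partition's state, not held in memory.
    private final Set<String> processedIds = new HashSet<String>();

    /** Returns true if the message was applied, false if it was a duplicate. */
    public synchronized boolean process(String messageId, Runnable businessLogic) {
        if (!processedIds.add(messageId)) {
            return false; // already seen: silently drop the duplicate
        }
        businessLogic.run();
        return true;
    }
}
```

Recording the id and applying the business logic must commit together; otherwise a crash between the two steps would either lose the message or apply it twice.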
ObjectGrid's job is to provide the infrastructure for hosting and placing the partitions, to provide fault tolerance and data integrity, and to support applications using the above application architecture. The application developer's role is to architect the system as a set of partitions and write the business logic using a one-transaction-per-partition model, with messaging hooking up multiple partitions. The developer doesn't need to worry about the infrastructure needed to host the application; that is ObjectGrid's job.
Pat Helland from Amazon wrote an excellent paper on this type of architecture, and ObjectGrid is designed around and embraces it when hosting XTP applications. Pat describes keyed entities. Transactions cannot span entities; logical transactions spanning entities happen instead as a series of independent entity-scoped transactions connected using idempotent messaging. Expressed in ObjectGrid terms, an entity corresponds to a partition. An entity is an object graph with a hashable root object. Applications route messages to the root object using a hashing approach. ObjectGrid provides relational management for the entities and a messaging mechanism for communication between partitions.
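The hash-based routing mentioned above can be sketched simply: the root object's key hashes to a partition index, and everything for that entity lands in that partition. This is a generic illustration of the technique, not ObjectGrid's actual routing code, and it assumes a fixed partition count.

```java
/** Sketch: route a keyed entity to a partition by hashing its root key. */
public class PartitionRouter {

    private final int numPartitions;

    public PartitionRouter(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** Deterministically map a root key to a partition index. */
    public int partitionFor(Object rootKey) {
        // Mask off the sign bit so the index is always non-negative,
        // even when hashCode() returns a negative value.
        return (rootKey.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

Because the mapping is deterministic, every message for a given root key arrives at the same partition, which is what lets a transaction stay scoped to a single partition.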
As companies move from a batch to a real-time enterprise architecture, we'll see XTP start to spread from the telco and financial companies using it today to more and more enterprises as they try to make their business models real time. This is necessary for companies to stay competitive. Waiting for batch jobs to complete was fine several years ago but won't keep them competitive moving forward. No one wants to wait; everybody wants everything immediately. Conventional triangular architectures will not cope as companies move to this model. The front end of these applications will use square architectures and XTP patterns and middleware.
There will still be conventional back-ends for reporting, hosted on data warehouses, at least until XTP data stores evolve to support distributed reporting. These back-ends will basically be asynchronous replicas of a subset of the state in the XTP system.
To summarize: as we move towards real-time enterprises, XTP will start to take hold, applications will need to be designed around new architectures like those described here in order to scale effectively, and middleware will need to evolve to support them. These types of architecture are exactly why we're building ObjectGrid.