October 23, 2006
Leveled provisioning architectures, keys to success
WebSphere XD is an example of a provisioner. It takes a set of machines, a set of applications, and performance service level agreements (SLAs) for those applications. It then tries to meet those SLAs by provisioning the applications on the fixed set of resources (the boxes) assigned to it. What's interesting here is the pattern it uses, which all products like it also use. The basic pattern is to measure the current performance and then make a change if it isn't meeting the SLAs. The levels at which this is done, however, are a good way to characterise these products.
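The measure-then-act pattern can be sketched as a single control step. This is a minimal illustration, not XD's actual API. `RequestClass`, `violations` and `control_step` are hypothetical names. The only assumption is that each request class has a response time goal and a measured response time. The `act` callback stands in for whatever the level can change.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RequestClass:
    name: str
    sla_ms: float       # response time goal taken from the SLA
    measured_ms: float  # most recently measured response time


def violations(classes: List[RequestClass]) -> List[str]:
    """Names of the request classes whose measured response time misses the goal."""
    return [c.name for c in classes if c.measured_ms > c.sla_ms]


def control_step(classes: List[RequestClass], act: Callable[[List[str]], None]) -> List[str]:
    """One pass of measure, compare, act: only change something when a goal is missed."""
    missed = violations(classes)
    if missed:
        act(missed)  # hypothetical hook: adjust weights, move applications, or ask for machines
    return missed
```

The same step runs at every level. What differs is what `act` is allowed to change and how long that change takes to help.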
Level 1, thread/request provisioning
Here, a router, a proxy, or even a thread pool within an application server can make changes. This is the least costly way to meet goals and it has the fastest reaction time. The proxy has a queue of outstanding requests and a set of targets that can accept those requests. It first classifies the requests into something like high, medium and low priority. It then does something like a weighted round robin over the requests to ensure that the response time goals are met for the high priority requests. This could starve lower priority requests if need be. But if a class of requests is not meeting its goal, then improving its response time by starving the lesser goals is a really cheap way to meet it. The cost is simply adjusting weights on the dispatcher, which is a very cheap action to take. What happens when you still can't meet the SLA and you have completely starved the lesser classes? Time to move up the provisioning tiers.
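A level 1 dispatcher of this kind can be sketched with one queue per priority class, served by a weighted round robin. This is a hypothetical illustration, not the XD proxy's implementation. It uses the "smooth" weighted round robin variant, an assumption on my part, so that classes are interleaved rather than served in bursts. The class and method names are made up. The point is that `set_weight` is the entire provisioning action at this level: an in-memory update. A weight of 0 starves a class completely.

```python
from collections import deque


class WeightedDispatcher:
    """Hypothetical level 1 dispatcher: per-class queues served by smooth weighted round robin."""

    def __init__(self, weights):
        self.weights = dict(weights)                   # class name -> weight
        self.queues = {name: deque() for name in weights}
        self.current = {name: 0 for name in weights}   # running round-robin counters

    def submit(self, cls, request):
        self.queues[cls].append(request)

    def set_weight(self, cls, weight):
        # The cheap action: change a number in memory, no processes started or stopped.
        self.weights[cls] = weight

    def next_request(self):
        """Return the next request to send to a free target, or None if nothing is eligible."""
        eligible = [c for c, q in self.queues.items() if q and self.weights[c] > 0]
        if not eligible:
            return None
        total = sum(self.weights[c] for c in eligible)
        for c in eligible:
            self.current[c] += self.weights[c]
        chosen = max(eligible, key=lambda c: self.current[c])
        self.current[chosen] -= total
        return self.queues[chosen].popleft()
```

With weights `{"high": 3, "low": 1}`, three out of every four dispatched requests come from the high class. Calling `set_weight("low", 0)` starves the low class entirely, which is the last resort before escalating to level 2.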
Level 2, application placement
Here, we know that an application (AppA) which services a particular request class cannot meet its goals. The placement controller looks at the set of nodes currently available and changes the mix of applications so that more instances of AppA are running in the node group. This gives the application more CPUs, which should allow it to better meet its SLA. If that isn't enough, we may add more than one extra CPU to the mix. Obviously, this is more expensive than level 1. We're stopping and starting JVMs here. It can take one to two minutes for a JVM to start, then the JIT starts compiling, and only after perhaps a few thousand requests have been processed by the new JVMs will they be up to speed. The difference in reaction time between level 2 and level 1 is pretty significant. Changing weights in memory is millions of times cheaper than starting a JVM.
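A single placement step can be sketched as moving one node from an application that is meeting its goals to the one that isn't. This is a simplified, hypothetical model, not the XD placement controller's algorithm. It assumes each node runs exactly one application JVM. A donor application must keep at least one instance. `rebalance` and its parameters are illustrative names.

```python
from collections import Counter


def rebalance(placement, failing_app, meeting_goals):
    """Hypothetical level 2 step.

    placement: dict of node -> application running there (one JVM per node, an assumption).
    Repurposes one node from an application that meets its goals and runs on more
    than one node. Returns the node, or None if there is no donor (escalate to level 3).
    """
    counts = Counter(placement.values())
    donors = [n for n, app in sorted(placement.items())
              if app != failing_app and app in meeting_goals and counts[app] > 1]
    if not donors:
        return None
    # Take from the application with the most instances so no one is left too thin.
    node = max(donors, key=lambda n: counts[placement[n]])
    # In reality this is the slow part: stop one JVM, start another, wait for JIT warm-up.
    placement[node] = failing_app
    return node
```

A `None` result corresponds to the situation in the next paragraph: every usable node already runs AppA and the goal is still missed.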
But, again, what happens when we have used all the machines to run AppA and we still can't meet the SLAs? Time to kick up to the next provisioning tier.
Level 3, machine provisioning
We get here when the set of machines given to XD cannot meet the goals. We now need to integrate XD with products such as Tivoli Intelligent Orchestrator (TIO) or Tivoli Provisioning Manager (TPM). These products can manage multiple farms of machines. Let's say farm A runs a Data Synapse compute farm and farm B runs XD. Suppose TIO was told that farm B is more important than farm A. So, when XD cannot meet its goals, it tells TIO about the problem. TIO then decides to pull a machine from farm A and give it to farm B. This can be quite complicated. The farm A Intel box might be running Solaris, while XD may be required to run on SUSE 10. TIO has to tear down the box, install SUSE and XD, and then add the box to the XD node group. Obviously, this can take a while, maybe 30-45 minutes before the action has any impact on the performance of farm B. Virtualized hardware can reduce this latency significantly. If the box was running VMware or Xen, then we could tell Xen/VMware to simply stop the Solaris Data Synapse virtual machine and start a pre-built XD virtual machine. The operating system would take maybe five minutes to boot and XD maybe another five minutes to start, so about ten minutes in total. Virtual machines would also significantly simplify configuring TIO/TPM because the number of configurations is greatly reduced. We're running on virtual hardware now, so we only need one operating system image.
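The farm arbitration above can be sketched as follows. This is a hypothetical model of the decision, not how TIO/TPM are actually configured. Farms have a priority. A farm that misses its goals takes a machine from the lowest-priority, less important farm that can spare one. The delay before the machine helps uses the rough figures from the text: up to 45 minutes for a bare-metal reinstall, about ten minutes for a virtual machine swap. `Farm`, `handle_shortfall` and the `min_machines` floor are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Rough delays from the text, in minutes.
REIMAGE_MINUTES = 45      # tear down, install SUSE and XD, join the node group
VM_SWAP_MINUTES = 5 + 5   # boot a pre-built XD virtual machine, then start XD


@dataclass
class Farm:
    name: str
    priority: int                                  # higher means more important
    machines: list = field(default_factory=list)
    min_machines: int = 1                          # hypothetical floor each farm keeps


def handle_shortfall(farms, needy_name, virtualized):
    """Hypothetical orchestrator reaction when one farm reports it cannot meet its goals.

    Returns (machine, minutes until it helps) or None if no farm can donate.
    """
    needy = next(f for f in farms if f.name == needy_name)
    donors = [f for f in farms
              if f.priority < needy.priority and len(f.machines) > f.min_machines]
    if not donors:
        return None
    donor = min(donors, key=lambda f: f.priority)  # take from the least important farm
    machine = donor.machines.pop()
    needy.machines.append(machine)
    delay = VM_SWAP_MINUTES if virtualized else REIMAGE_MINUTES
    return machine, delay
```

For example, with farm A at priority 1 holding two machines and farm B at priority 2, a shortfall in B moves one of A's machines to B. The sketch reports 45 minutes for a bare-metal reinstall or 10 for a VM swap. The sketch never moves a machine in the other direction.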
Obviously, provisioning is a fractal-type architecture. At a high level the same sort of thing goes on at each level, but the details are different. What should be clear is that as we move up the levels, the system's ability to react slows down. At level 1 we can react very quickly: update a weight in memory and response times improve. At level 2 it takes two or more minutes before the change starts to help. At level 3 it could take as much as 45 minutes.
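The escalation chain can be sketched as a list of tiers tried fastest first. A slower tier is used only when every faster one has nothing left to do. `Tier`, `escalate` and the `try_fix` callbacks are hypothetical. In a real deployment each callback would wrap the request dispatcher, the placement controller, or the request to TIO/TPM.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class Tier(NamedTuple):
    name: str
    reaction_seconds: float       # rough time before the tier's action starts to help
    try_fix: Callable[[], bool]   # returns True if the tier still had an action left to take


def escalate(tiers: Sequence[Tier]) -> Optional[Tier]:
    """Try the fastest (cheapest) tier first; move up only when it has nothing left to do."""
    for tier in sorted(tiers, key=lambda t: t.reaction_seconds):
        if tier.try_fix():
            return tier
    return None  # every tier is exhausted: the goals cannot be met with these resources
```

For example, if the level 1 callback reports that the lesser classes are already fully starved (returns `False`), `escalate` falls through to level 2. It reaches level 3 only if placement has also run out of options.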
Customer success with this technology depends on recognizing these kinds of scenarios and understanding the chain of events. If you have applications that need to respond to 10x load increases within a second, then you had better stick with level 1. Levels 2 and 3 will not be able to respond fast enough for that kind of goal. Conversely, if you have applications where load slowly ramps up over 15 minutes, then a level 2 system will probably work well for you. Finally, if you know that your load doubles on a certain day, then a level 3 system can automate that nicely by changing the farm machine ratios according to a simple calendar.
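The rule of thumb above can be expressed as a quick feasibility check. A level is usable if it can react within the time the load takes to ramp up, plus any advance notice (for example, from a calendar). The reaction times are the rough figures from this post: level 1 treated as immediate, level 2 about two minutes (a lower bound), level 3 up to 45 minutes. `usable_levels` is a hypothetical helper, not part of any product.

```python
# Rough reaction times from the text, in seconds.
REACTION_SECONDS = {
    "level 1": 0.0,        # adjust dispatcher weights in memory
    "level 2": 2 * 60,     # start JVMs and let them warm up
    "level 3": 45 * 60,    # reprovision a physical machine
}


def usable_levels(ramp_seconds, advance_notice_seconds=0.0):
    """Levels that can act before a load increase fully arrives.

    ramp_seconds: how long the load takes to ramp up.
    advance_notice_seconds: lead time when the increase is known ahead of time.
    """
    budget = ramp_seconds + advance_notice_seconds
    return [level for level, t in REACTION_SECONDS.items() if t <= budget]
```

A one-second spike leaves only level 1. A 15-minute ramp adds level 2. A known busy day with a day's notice brings level 3 into play as well.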