I've had several conversations with individuals that centered around capacity planning and whether we needed to buy more hardware. A lot of the information I've collected on this subject is directly from Craig Shallahamer at OraPub. I also recommend reading Neil Gunther Guerrilla Capacity Planning , although Craig deals more directly with how to apply capacity planning, and forecasting to Oracle, Neil Gunther applies to all systems. This post is my interpretation of their material (not only websites but books as well), as well as some of the research I've done in demonstrating their material on my own systems.
I've collected stats at not only the OS level, but within Oracle as well. I've use SAR and IOSTAT from an operating system view, and then within Oracle the wait interface as well as the time model views that are collect valuable information about the OS. I can say where the performance problems reside and why hardware is not the cause. In other words we haven't reached capacity within the CPUs or the IO subsystem. However, I've heard -- "... things are running slow we need more hardware."
Hardware can mask the performance tuning needs, but its only a band-aid not the cure. Admittedly ordering additionally hardware and bringing it online can be done in a shorter time frame then tuning the application, especially if waste is prevalent in the application. But you have to know what hardware will give you the performance increase desired. And understanding CPU utilization will provide you the knowledge required to make an informed decision.
What is CPU Utilization? Its a percentage that is shown in the upper portion of a TOP command in Linux, as well is displayed by SAR in Linux.
Cpu(s): 13.3% us, 6.0% sy, 0.0% ni, 78.0% id, 2.0% wa
We know how to find it, but what does it really mean? CPU utilization equals the service time multiplied by the arrival rate divided by number of queues. That's a little better picture, but that leads to additional questions like where do I find the arrival rate or service times and how many queues exist on my system.
For CPUs the number of queues equal the number of CPUS. If I have a 4 CPU dual core box, that's 8 CPUs -- or 8 queues. A CPU is one queue because it can service one request at a time. If I have 8 CPUs I can service 8 request simultaneously.
The Arrival rate is how fast the work load is arriving at the CPU. To figure this out you have to decide on what is the workload. With Oracle there are many different stats that can be used to make this determination. I haven't found any hard and fast rules. On the current database I work with I usually choose IO, both logical and physical, as it is a DSS so there's a lot of IO occurring. For other systems block changes or user calls may characterize the workload. To calculate the workload's arrival rate, simply take random samples of the stats over a given period of time. Within Oracle you utilize the v$sysstat view keying in on the interested stats. Since this view is cumulative from startup of the instance, you run the query twice and then subtract the first numbers from the second query to get the deltas.
Once you have your Delta you divide by the sample time.
So now I have the arrival rate, the number of transactions (number of IOs) occurring over time. Next I need to find my Service time.
The service time I'm going to derive from the other values. What I've found is that when calculating this value for Oracle it is actually the response time and not the service time, as it will include the wait time within Oracle. The formula for service time is Utilization multiplied by Queues divided by Arrival Rate:
S = U*M/Arrival Rate
Now that I have all these numbers what does it mean to the system?
There is an interesting relationship between CPU utilization and response time. By increasing the CPU utilization the response time will actually increase. An increase in response time means jobs are processing longer because transactions are taking longer. Why? Transactions will take longer because the CPU utilization increases, the arrival rate has increased which means the same number of CPUs are trying to service an increased number of transactions. If the arrival rate has increased significantly enough, queuing begins to occur which can be seen from the OS by using TOP in Linux -- The run queue.
load average: 1.50, 0.86, 0.51
This situation is simply I have more transactions entering the system that require the CPU to perform some work, but they have to stand in line and wait to be serviced.
Not all increases in arrival rate will impact the response time. Each system has a point in which queuing begins to occur and you have to determine where that point lies on the spectrum. Some of the systems that I work with I can see queuing begin at around the 40 - 43%, but its not until it begins to push over 50% that users will actually begin to felt by the users. I believe this is because of the type of database, I venture to guess if it was OLTP their tolerance level would be less.
I'm only using half of my CPU resources, I should be able to use 100% CPU. And you can, however, if you do it will impact performance, so there's a trade off. Having a CPU utilization at 90% and better is not a metric for how well you are performing, its a sign that you have reached capacity and your performance is being impacted.
So capacity is reached you buy more hardware. Actually the first step is to ensure that your application and OS are tuned sufficiently to support the workload. If it is, then yes hardware would be the next step.
How do you decide whether you need more CPUs or faster CPUs. If my response time is within the SLA for the current workload, but additional workload will put it over the top then the answer is more CPUs. This increases the number of queues and allows for more transactions to be serviced within a given time frame. However, if I want a faster response time (think make it go faster) then I want faster CPUs. Faster CPUs will allow the transaction to be serviced faster (and reduce response time).
Example 1, I have a process that runs a number of jobs that takes 3 days to complete. The users would like this process to be finished in 1 day instead. Looking at the process there is a lot of tuning that can be done but there's no one available to perform all the tuning and testing required as it will take an overhaul of the application and database. I've removed all the serialization that I can at this time. The solution to make this go faster is faster CPUs.
Example 2, I have several processes that run for 1 day but I'm doubling the number of these processes that will be running simultaneously. The users are happy with the 1 day turn around on the current process. However, my CPU utilization runs at about 45% during this processing. If I double the number the CPUs I can add the additional workload without impacting the current response time.
Example 3, I commute back and forth between Tampa and Orlando 4 days a week. I make the trip in 80 minutes one way. If I wanted to reduce my travel time to 60 minutes (imagine no speed limits on I-4) I would increase my speed. And if my car couldn't handle going faster I would get a new car that went faster. However, if there were 6 additional people that wanted to make the trip i would get an additional car to carry the passengers (or a bigger car if you wanted to think of the seats as the actual queue).
To wrap up, CPU utilization is tied to arrival rate, service time and the number of queues. An increase in CPU utilization will at a point impact response time. To understand how CPU utilization impacts the performance its important to characterize the workload -- determine what a transaction means. This is only a small slice of the information on the subject and I still have more to learn. The challenge is communicating this information to the people who make the decisions and in a way that is meaningful. I find it to be a great challenge as they are always trying to look for just one magical number that will show how the databases are performing that will also mean something to the business.