Saturday, July 3, 2010

Key To Troubleshooting Global Cache Waits

I returned from vacation a few weeks ago to a serious performance issue that had slowed overall processing by at least 75%, although some processing wasn't impacted.

I read through all the email threads theorizing that the problem was the redo switching.

I pulled up gv$session, keying on the minimum information needed to see what Oracle was telling us through the wait interface. I was floored: the Global Cache waits were significantly high, and processes weren't moving through the global cache. I confirmed this by running AWR reports over different time periods since the issue was first reported.
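
A query along these lines (a rough sketch, not the exact statement) shows which non-idle events active sessions are waiting on across the cluster:

select inst_id, event, count(*) as sessions
  from gv$session
 where status = 'ACTIVE'
   and wait_class <> 'Idle'
 group by inst_id, event
 order by sessions desc;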

I sent an email explaining my findings using the wait interface, and discounted the redo switching since we were not seeing any waits on the Archiver, the Log Writer, or any other background process for log-related waits. The background processes that were waiting were waiting on GC waits.

I explained to the team that Oracle will tell you what is slowing processing down if you just listen to the wait interface. It won't give you the root cause but it will tell you where to start looking. And even though we were switching frequently, we weren't waiting on the switch to occur and therefore that was not the area to concentrate efforts.

Another theory was sent out that the Interconnect was probably suffering because some of the reference tables were cached on one of the old nodes that was no longer being used by the application (we are actually preparing to remove it from the cluster). The theory was that the blocks were remaining on that instance and all the other instances always had to go to it to retrieve the blocks. A perfect example of a misunderstanding of how the global cache actually works. I explained how the buffer cache worked and that the blocks would actually be pulled over to the requesting instance and the touch count incremented.

I also explained that, judging by the amount of traffic being pushed across, it didn't appear we were saturating the Interconnect, but according to the wait interface we were experiencing lost blocks.

I took the additional step of disproving the theory by caching data on one instance and then pulling the data over to another instance. Checking the waits -- Global Cache Waits.
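
The demonstration boiled down to something like this (the table name is just a placeholder; the point is to force a cross-instance block transfer and then look at the waits):

-- on instance 1: pull the reference table into that instance's buffer cache
select /*+ full(t) */ count(*) from some_ref_table t;

-- on instance 2: request the same blocks, then check what this session waited on
select /*+ full(t) */ count(*) from some_ref_table t;

select event, total_waits, time_waited
  from v$session_event
 where sid = sys_context('userenv', 'sid')
   and event like 'gc%';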

Another theory was raised that because we did a lot of full table scans (FTS) and our multi-block read count was high (16K), this was why we were experiencing the waits -- there was a note on Metalink that OS settings may need to be adjusted under these circumstances, in particular the UDP buffers. But ours were already at 256K, and we have always had a 16K MBR count. So what changed? Did the workload actually start soaring? Were we saturating the bandwidth of the Interconnect?

The network and Linux administrators were instructed to start checking the OS and hardware. I also requested a list of all the changes that went into the database, OS, or hardware prior to the onset of the waits.

In the meantime, the theory about the instance buffer cache kept surfacing from individuals who didn't really have an understanding of how the buffer cache or global Cache Fusion worked. So I decided the easiest way to demonstrate was to shut down the node. Which I did... and the results were not exactly what I wanted.

The entire cluster started having issues as nodes were evicted.

Things proceeded to get worse. As we tried to bring up the individual nodes they experienced reconfiguration issues and kept rebooting. After about an hour of the new DBA on the block struggling, I realized the issue..... THE SWITCH WAS BAD! Then I started admonishing myself for not realizing it from the onset of the issue.

I immediately began explaining my ah-ha moment. The nodes were having problems reconfiguring and staying in the cluster because they were having trouble communicating across the Interconnect. The Interconnect was nothing more than 8 Cat 5 cables plugged into a 48-port Netgear switch (yes, I know, not exactly a stellar setup, but you learn to make do with what you have at the moment). The chances of all 8 cables going bad were slim, but the switch -- that could definitely happen. Everyone thought I was insane and said switches don't go bad. I differed with them, based on my husband managing a team of network engineers who dealt with switches going bad all the time. They are, after all, nothing more than electronics, and they too can overheat or have cards go bad just like any other electronics.

We decided to bring up one node at a time and let it come all the way up with the database before we started the next node. My boss wanted us to make sure that we didn't bring up the old nodes at all. We obliged; after all, by this point I was fairly certain what the outcome would demonstrate. After node one was up we brought up Grid so that we could have an instant picture for assessment. The second node came up and there were the cluster waits. The only activity that was occurring was the opening of the database and then just the background processes. The cluster waits were noticeably high to those of us who had a mental baseline of what they should be at this time under normal circumstances. We introduced the third node and the waits shot up more. We then started getting Inbound Connection Timeouts on the third node. We made a collective decision to bring down that node.

Not having a spare switch on hand (long story), we decided to grab a little 6-port 100Mb switch. We made the swap with the two active nodes, and the Global Cache Waits pretty much disappeared. Since only half the bandwidth was being used, we decided to experiment and see what would happen if we brought up the third node.

Unfortunately, since we only had a 100Mb switch and the traffic pushed to just over 90Mb with the third node, we were saturating the bandwidth. We needed to back off of using the third node until we could procure a Gig switch. Which we did the next day.

With the new switch in place the Global Cache Waits returned to normal. Lost blocks were a thing of the past.

The one thing that I came away with from this firefighting activity is to not make troubleshooting more difficult than it needs to be. Think through what comprises the waits in question and remember to include the hardware in the assessment. For Global Cache Waits, if you have been cruising right along without any issues and then all of a sudden the waits soar, chances are that the issue isn't with the database or RAC, but with the Interconnect.

Thursday, April 29, 2010

Executing Linux Shell Script Through DBMS_SCHEDULER

We receive some of our reference data through third party vendors. This usually involves us loading the data from CD or another source, running sqlldr, and then some SQL to sync tables across databases, as well as denormalize a table or two.

In order to move this out of my hands and into the business's hands I had to make the process error free and easy to do -- push buttons and make magic happen. I did this through multiple steps:

1. The Linux script that uploads data into a QA schema on Development was already in place. I just needed a way to execute it via the push of a button in APEX. DBMS_SCHEDULER was the simplest approach.

2. Next I created a refresh_ref_tables package on each of the databases, which would be called from a procedure on the APEX database. Doing this allows me to call the procedures via database links and to include more than 4 databases in the process.

3. Created an APEX application that allowed for buttons to push and the magic to happen. I even created all the QA reports within the application so the business side could do away with their spreadsheets.

The first stumbling block was the DBMS_SCHEDULER job to execute the Linux shell script. I created the job with /usr/bin/sh as the job_action, and the actual script as the argument. So the create_job statement looked like this:

begin
  dbms_scheduler.create_job(
    job_name            => 'USED_FOR_SHELL_SCRIPTS',
    job_type            => 'EXECUTABLE',
    job_action          => '/usr/bin/sh',
    number_of_arguments => 1);
end;
/

This can be used for any shell script execution. We made sure that only the DBA group had the ability to schedule this job. I tried several different ways to schedule the actual script, but I kept receiving a file or directory not found error. I probably could now just schedule the actual shell script, but for now we will leave it this way, as I can see several different shell scripts being scheduled through this one job.

Within APEX when the page is displayed the user can select a button that will then begin the shell script by executing the following:

begin
  dbms_scheduler.set_job_argument_value@orl1(
    job_name          => 'USED_FOR_SHELL_SCRIPTS',
    argument_position => 1,
    argument_value    => '/home/oracle/script/path/refresh.sh');
end;
/

Then to run the job:

begin
  dbms_scheduler.run_job@orl1(job_name => 'USED_FOR_SHELL_SCRIPTS');
end;
/

The shell script kicks off and executes. One issue I have found with running shell scripts in this manner is that I can't use pipes or redirects for logfiles. However, I'm using sqlldr and have the log file option there. For the SQL statements, I log the row counts and actions to an audit log.
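
For illustration only, a sketch of the kind of statement-plus-audit-logging I mean (the table and column names here are hypothetical, not the actual schema):

declare
  v_rows pls_integer;
begin
  -- hypothetical sync of one reference table from the QA schema
  insert into ref_desc (code, description)
    select s.code, s.description
      from qa_schema.ref_desc s
     where not exists (select 1 from ref_desc d where d.code = s.code);
  v_rows := sql%rowcount;

  -- record the action and row count so the refresh can be audited later
  insert into ref_refresh_audit (step_name, row_count, run_date)
  values ('SYNC_REF_DESC', v_rows, sysdate);

  commit;
end;
/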

If errors do occur when executing the script, the error is logged to the dba_scheduler_job_run_details view. Some errors appear to be very generic, but with a closer look in this view the actual issue will be discovered.

select log_id, log_date, job_name, status, error#, additional_info
  from dba_scheduler_job_run_details
 where job_name = 'USED_FOR_SHELL_SCRIPTS';

One of the errors I was repeatedly receiving was ORA-27369: job of type EXECUTABLE failed with exit code: Key has expired.

This usually indicates a permissions issue, but where? The additional info column provided the answer:

STANDARD_ERROR="/home/oracle/scripts/path/refresh.sh: line 84: //XXXXX_DESC.dat: Permission denied
cat: write error: Broken pipe
/home/oracle/scripts/path/refresh.sh: "

This was due to the unzip command failing when the file already existed. I added the -u option and it eliminated this error.

Once the development script refresh is finished, the QA staff handle the data QA to understand the impact of the data changes. Upon their approval we then move the changes into the primary schema that holds these tables on all the databases.

This is done through a package on each of the databases that is executed via APEX. The package on each database has a different procedure for each type of reference data being handled. They are kicked off via APEX as required.

The actual process kicked off from APEX is a procedure that resides on the APEX database and allows for multiple databases to be selected for the refresh. I used this same concept for user management, as it works well.

The multiple-select list is passed in as the pDB parameter. The list will look as follows: orcl1:orcl2 or orcl1:orcl4:orcl5.

The last entry doesn't have a colon, so I add one. This allows the IF statements to be built. The vDB string is then checked for the existence of each of the different databases and, if found, the action is performed. Then it moves on to the next one in the list, until all have been satisfied.

The procedure on APEX is as follows:

create or replace procedure refresh(pDB IN VARCHAR2)
IS
  vDB      VARCHAR2(100);
  vJobTime DATE := trunc(sysdate + 1) + 5/24;
BEGIN

  -- the multiselect list separates each selection with a colon, but there is
  -- no trailing colon after the last entry, so append one
  vDB := pDB || ':';

  IF instr(vDB, 'orcl1:') != 0 THEN
    -- what I want to execute
    EXECUTE IMMEDIATE 'BEGIN refresh_tables@orl1; END;';
  END IF;

  -- additional statements to cover each of the databases

  IF instr(vDB, 'orcl5:') != 0 THEN
    -- in this case I need to ensure it happens after hours
    dbms_scheduler.create_job@orcl5(
      job_name   => 'TEST',
      job_type   => 'PLSQL_BLOCK',
      job_action => 'BEGIN REFRESH; END;',
      start_date => vJobTime,
      enabled    => TRUE);
  END IF;

END;
/

Now we only get involved if there are failures or the requirements change in how the reference data is being handled.

As always, the names and paths have all been changed.

Sunday, April 18, 2010

Slow Performance On Node -- Root Filesystem May Be 100%

When SSHing over to one of the nodes in our RAC cluster the response was horrible: in the neighborhood of 3 minutes plus before being prompted for a username, and then another 3 minutes plus to be prompted for a password.

First I pull up TOP. The first thing I notice is the run queue:

load average: 23.09,22.48,21.73 -- Ouch

I check the CPU:

Cpu(s): 0.3%us, 0.6%sy, 0.0%ni, 74.4%id, 24.7%wa, 0.0%hi, 0.0%si

So it's not a process hogging the CPU; however, I have a high I/O wait at 24.7%.

I check the MEM:

Mem: 65742800k total, 22487544k used, 43255256k free, 680408k buffers

That is definitely not right as I should have around 57700140k used. This looks like the instance may not be up.

I check the instance:

ps -ef|grep pmon

and only the asm1_pmon is running

I check the filesystem space

df -k

And the root filesystem shows / 100%

That's the answer. I'm using ASM, and the root filesystem filling up can cause the instance to go belly up. It will also cause the run queue to jump, as processes that need the root filesystem will not be able to proceed.

The resolution is to find the largest files in the root filesystem and remove them. In our case the trace file cleanup routine is not functioning properly and I find trace files dating back to 2009, including the cdmp and core dumps. I remove all the trace files. I also zip a few large files that have appeared recently that I know are not being actively utilized and will look to remove later.

I try to restart the instance

srvctl start instance -i orl1 -d oral

But I receive a dependency error on the resource, so I opt to bounce the node, which will also restart the CRS and both instances.

If your CPU is low but your run queue is high, check the root filesystem. It's the most common issue we see in this scenario.

Monday, April 12, 2010

User Management Package For Business Users to Access VIA APEX

Application Express has allowed us (DBA types) to add a GUI front end to some of the day-to-day maintenance tasks that we do not only for administration, but also for handling tasks for business users that don't fit cleanly into our current applications. Although developers may have an issue with building a GUI application with a GUI application -- as a DBA I find it convenient and efficient, and I don't have to wait on anyone to create an app for me. It makes me independent, and I like being independent, but it also keeps the developers from being tasked to spend time creating an in-house application for the DBAs.

I'm not the one that installed Application Express, nor am I the one that created the original applications. I have, however, taught myself how to modify the applications, as well as create new ones. An application that I'm currently working on putting together is User Management. The preferred solution would be a purchased product with single sign-on, but this is the next best thing, as creating users, unlocking accounts and changing passwords is not how I like to spend my days.

The User Management application is very simple: across all databases it allows the cloning of business users, the changing of passwords even if users do not know their current passwords, and the unlocking of accounts. All of these tasks have security wrapped around them, so only certain users can clone accounts, and they are further restricted to the type of account they can clone (basically only in their department). To change a password, users must know their password or respond to the verification email which is sent to the email assigned to the employee account (this is a business email and not a personal email account). The unlock account function works the same way: they must respond to an email to verify it's them, and only certain accounts will be unlocked -- those that have the business profiles assigned. With all of these procedures, no DBA accounts or system-privileged accounts can be cloned, have their password changed, or be unlocked.

In its simplest terms, it's a self-service application. The following is the code behind the package that is executed through Application Express:
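
A minimal sketch of the shape such a package can take (all names, profiles, and checks here are hypothetical placeholders, not the production code):

create or replace package user_mgmt as
  procedure unlock_account (p_username in varchar2);
  procedure change_password (p_username in varchar2, p_new_password in varchar2);
end user_mgmt;
/

create or replace package body user_mgmt as

  -- refuse to touch anything that is not a plain business account
  procedure check_business_user (p_username in varchar2) is
    v_count pls_integer;
  begin
    select count(*)
      into v_count
      from dba_users
     where username = upper(p_username)
       and profile  = 'BUSINESS_USER';   -- hypothetical business profile
    if v_count = 0 then
      raise_application_error(-20001, 'Not a business account: ' || p_username);
    end if;
  end check_business_user;

  procedure unlock_account (p_username in varchar2) is
  begin
    check_business_user(p_username);
    execute immediate 'alter user ' || dbms_assert.enquote_name(p_username) ||
                      ' account unlock';
  end unlock_account;

  procedure change_password (p_username in varchar2, p_new_password in varchar2) is
  begin
    check_business_user(p_username);
    execute immediate 'alter user ' || dbms_assert.enquote_name(p_username) ||
                      ' identified by "' || p_new_password || '"';
  end change_password;

end user_mgmt;
/

The clone procedure and the email-verification hooks layer on top of the same kind of check.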



There is additional setup within the application that allows for certain users to clone accounts, as well as the verification email for changing unknown passwords. I'll save that information for a future blog.

ASM Issues When Adding A New Node

It seems with every new node that is added, a different issue slaps us in the face.  It would be so nice if just one node went in smoothly and without incident... of course, it would be even better if resources were available to test adding the nodes, but those aren't the issues I want to document in this post.

There have been a few issues with ASM during this exercise of adding nodes: the ASM instance number being incorrect, and disk partitions not shared across all nodes -- which has actually happened twice, but for 2 different reasons.

Issue 1:  ASM Instance Number Incorrect.  The ASM instances are using an spfile

$ORACLE_HOME/dbs/spfile+ASMx.ora

The x is replaced with the number of the ASM instance.

Once the configuration is set in DBCA, the utility asks: this is an ASM database, would you like to extend the ASM instance?  Of course, the answer is yes, or there will be problems when the database instance is extended -- disks will not be available.

However, this bombs with an error that the instance number is incorrect.  It seems that the spfile that DBCA decides to use is a generic version:

$ORACLE_HOME/dbs/spfile+ASM.ora

That's an spfile without the instance number attached.  Since you can't edit the spfile directly, go back to the original install host and create a pfile from the spfile while logged into the ASM instance:

create pfile from spfile;

I'm good with the location, but if you want to specify the directory and filename that can be done with

create pfile='/oracle/products/dbs/init+ASM1.ora' from spfile;

Now edit the pfile; I add the new instance number as well as any instance numbers that may be missing.  Then recreate the spfile:

create spfile from pfile;

Then I copy this file over to the new node's spfile+ASM.ora generic file.

Rerun DBCA and the ASM instance is extended, as is the database.

Issue 2a:  Disk Partitions not shared across all nodes

The first time the message was received only one of the partitions wasn't being seen.  The fix:
1.  Ensure that Oracle was in the disk group (this sets the permissions to access the disks).
2.  Check the raw devices in /dev/raw on the new node to ensure they are correctly specified and not duplicated.

And in case anyone is wondering how I did 1 and 2, I'll have to defer to my boss as he's actually the one that handles the raw devices and ensuring the powerpath ids all match up.

Issue 2b:  Disk Partitions not shared across all nodes and all of the diskgroups are listed

A Metalink note describes this as occurring when a symbolic link is used on the instance the initial install was performed from; the spfile needs to be re-created without the symbolic link and the ASM instance bounced.

Since this is my fourth node and it is the only one I have had this issue with, I'm inclined to believe it has something to do with how I created the spfile and the pfile that points to the spfile.  So tomorrow I get to experiment on how to correct this situation.  My goal is to correct it without having to bounce any of the ASM instances.

So tomorrow should be an interesting day.

I still don't have a solution to this problem. I did open an SR with Oracle, for which they haven't provided any useful information to try, only asked for information that was already provided and files that were already uploaded. I often wonder if the support analysts read the actual information or just make assumptions so that they can push the issue back to the customer and get the clock off of them.

Sunday, April 11, 2010

Adding A Node To A RAC Environment Gotcha

Over the last month I have had the opportunity to add 4 additional nodes to a RAC environment.  The current environment was 4 nodes running Suse 9.  The decision was made that instead of replacing the existing nodes, we would add to them and make it an 8-node cluster.  If you had the opportunity to read my post on CPU Utilization, you will understand that what we really did was increase our capacity by increasing the availability of CPU time for more transactions.  Our arrival rate can increase without impacting our response time -- provided, of course, we can balance our workload across the nodes appropriately.   On the horizon is an upgrade to 11gR2, so the new nodes are running Suse 10.  The older nodes will be removed once we have the new ones stable and upgraded to Suse 10.

The decision to run with a mixed OS for a short period of time was not without at least one quirk.  If you have an OCFS2 mount point you will not be able to mount it on both versions; it's either 9 or 10.  This is unfortunate even in the short term.  The use of external tables, or even jobs that still use UTL_FILE, will fail if they happen to start on one of the nodes the filesystem is not mounted on.   A workaround would be service groups; however, that may require some code changes that might not make sense in the short term.

A workaround that has been effective in the short term is to set the local_listener on the nodes that do not have the OCFS2 filesystem to a listener on a node where the filesystem is mounted.  Let's say I have hosts ORL1 and ORL2 that have the filesystem mounted, but I have ORL3 where the filesystem cannot be mounted at this time.

Ensure that the TNSNAMES.ORA on ORL3 has the Listener for one of the two nodes where the filesystem is mounted:

LISTENER_ORL1 =
  (address = (protocol = tcp)(host = oral.medai.com)(port = 1521))


Then I change the parameter local_listener to point to one of the 2 hosts' listeners:

alter system set local_listener=LISTENER_ORL1 scope=memory sid='orl3';

This changes the parameter only in memory and only for instance ORL3.

Now if any session connects to the listener on ORL3 it will be redirected to ORL1.  Of course, this means no jobs will be directed to ORL3, which defeats the purpose of adding the additional nodes.  This is only a temporary solution for adding the new nodes before removing the older nodes for the OS upgrade.

Manipulating the local_listener parameter is also a way to redirect jobs away from a node where you may be preparing to perform maintenance.  It allows jobs that have already started to continue to process, but prevents additional jobs from starting on the instance, and lets you bounce the instance or node in order to perform maintenance without having to kill processes.  It doesn't solve the issue if you are using inter-instance parallelism and you want to take down an instance or node for maintenance, as that is a whole other discussion.
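
When the maintenance is done, the same trick in reverse puts the node back into rotation (assuming a LISTENER_ORL3 entry exists in that node's TNSNAMES.ORA):

alter system set local_listener=LISTENER_ORL3 scope=memory sid='orl3';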

Although it's not supported and Oracle strongly suggests not running with mixed OS versions, it does work and will allow you the opportunity to upgrade your OS one host at a time.   But it's not without quirks, and you need to be ready to either handle the quirks or research them.  The OCFS2 issue is just one quirk; I'm sure there are more, I just haven't been able to pinpoint whether the actual issues we have experienced are due to the different OS versions.  Although I will say that with each node I have added, different problems have appeared, but I think that's just Oracle and the utilities they have created to make things easier.

If anyone has some other quirks or gotchas please feel free to list them out.

Introduction -- CPU Utilization Impact on Response Time

I've had several conversations with individuals that centered around capacity planning and whether we needed to buy more hardware.  A lot of the information I've collected on this subject is directly from Craig Shallahamer at OraPub.  I also recommend reading Neil Gunther's Guerrilla Capacity Planning; although Craig deals more directly with how to apply capacity planning and forecasting to Oracle, Neil Gunther applies it to all systems.  This post is my interpretation of their material (not only websites but books as well), as well as some of the research I've done in demonstrating their material on my own systems.

I've collected stats not only at the OS level, but within Oracle as well.  I've used SAR and IOSTAT for an operating system view, and then within Oracle the wait interface as well as the time model views, which collect valuable information about the OS.  I can say where the performance problems reside and why hardware is not the cause.  In other words, we haven't reached capacity within the CPUs or the IO subsystem.  However, I've heard  -- "... things are running slow we need more hardware."

Hardware can mask the performance tuning needs, but it's only a band-aid, not the cure.  Admittedly, ordering additional hardware and bringing it online can be done in a shorter time frame than tuning the application, especially if waste is prevalent in the application.   But you have to know what hardware will give you the performance increase desired.  And understanding CPU utilization will provide you the knowledge required to make an informed decision.

What is CPU Utilization?  It's a percentage that is shown in the upper portion of the TOP command in Linux, and is also displayed by SAR in Linux.

Cpu(s): 13.3% us, 6.0% sy, 0.0% ni, 78.0% id, 2.0% wa

We know how to find it, but what does it really mean?   CPU utilization equals the service time multiplied by the arrival rate, divided by the number of queues.   That's a little better picture, but it leads to additional questions, like where do I find the arrival rate or service times, and how many queues exist on my system.
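
As a quick worked example (the numbers are made up purely for illustration): with 8 CPUs (8 queues), an arrival rate of 1,000 IOs per second, and a service time of 0.004 seconds per IO:

Utilization = service time * arrival rate / queues = 0.004 * 1000 / 8 = 0.5, or 50%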

For CPUs, the number of queues equals the number of CPUs.  If I have a 4-CPU dual-core box, that's 8 CPUs -- or 8 queues.  A CPU is one queue because it can service one request at a time.  If I have 8 CPUs I can service 8 requests simultaneously.

The arrival rate is how fast the workload is arriving at the CPU.   To figure this out you have to decide on what the workload is.  With Oracle there are many different stats that can be used to make this determination; I haven't found any hard and fast rules.  On the current database I work with, I usually choose IO, both logical and physical, as it is a DSS so there's a lot of IO occurring.  For other systems, block changes or user calls may characterize the workload.   To calculate the workload's arrival rate, simply take random samples of the stats over a given period of time.  Within Oracle you utilize the v$sysstat view, keying in on the stats of interest.  Since this view is cumulative from startup of the instance, you run the query twice and then subtract the first numbers from the second query to get the deltas.

Once you have your delta, you divide by the sample time.
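
A minimal sketch of the sampling, assuming IO is the chosen workload metric (run it once, wait a known number of seconds, run it again, and difference the values):

select name, value
  from v$sysstat
 where name in ('physical reads', 'session logical reads');

-- arrival rate = (second sample value - first sample value) / sample seconds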

So now I have the arrival rate: the number of transactions (number of IOs) occurring over time.  Next I need to find my service time.

The service time I'm going to derive from the other values.  What I've found is that when calculating this value for Oracle it is actually the response time and not the service time, as it will include the wait time within Oracle.  The formula for service time is utilization multiplied by the number of queues, divided by the arrival rate:

 S = U*M/Arrival Rate
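
Continuing the made-up numbers from above: if utilization is 50% (0.5), there are 8 queues, and the arrival rate is 1,000 IOs per second, then:

S = U*M/Arrival Rate = 0.5 * 8 / 1000 = 0.004 seconds per IO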

Now that I have all these numbers what does it mean to the system?

There is an interesting relationship between CPU utilization and response time.  As CPU utilization increases, the response time will also increase.  An increase in response time means jobs are processing longer because transactions are taking longer.  Why?  Transactions take longer because when CPU utilization increases it is because the arrival rate has increased, which means the same number of CPUs are trying to service an increased number of transactions.  If the arrival rate has increased significantly enough, queuing begins to occur, which can be seen from the OS by using TOP in Linux -- the run queue.

load average: 1.50, 0.86, 0.51

This situation is simply that I have more transactions entering the system that require the CPU to perform some work, but they have to stand in line and wait to be serviced.
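
This is classic queuing behavior; for a single queue, the textbook relationship (not anything Oracle-specific) is:

R = S / (1 - U)

So at 50% utilization the response time is already double the service time, and at 90% it is ten times the service time -- which is why the waits seem to shoot up all at once.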

Not all increases in arrival rate will impact the response time.  Each system has a point at which queuing begins to occur, and you have to determine where that point lies on the spectrum.  On some of the systems that I work with I can see queuing begin at around 40 - 43% utilization, but it's not until it begins to push over 50% that the impact will actually be felt by the users.  I believe this is because of the type of database; I venture to guess that if it was OLTP their tolerance level would be less.

"I'm only using half of my CPU resources; I should be able to use 100% CPU."   And you can; however, if you do, it will impact performance, so there's a trade-off.  Having a CPU utilization at 90% and better is not a metric for how well you are performing; it's a sign that you have reached capacity and your performance is being impacted.

So when capacity is reached, you buy more hardware?  Actually, the first step is to ensure that your application and OS are tuned sufficiently to support the workload.  If they are, then yes, hardware would be the next step.

How do you decide whether you need more CPUs or faster CPUs?  If my response time is within the SLA for the current workload, but additional workload will put it over the top, then the answer is more CPUs.  This increases the number of queues and allows for more transactions to be serviced within a given time frame.  However, if I want a faster response time (think: make it go faster) then I want faster CPUs.  Faster CPUs will allow each transaction to be serviced faster (and reduce response time).

Example 1, I have a process that runs a number of jobs that takes 3 days to complete.  The users would like this process to be finished in 1 day instead.  Looking at the process there is a lot of tuning that can be done but there's no one available to perform all the tuning and testing required as it will take an overhaul of the application and database.  I've removed all the serialization that I can at this time.  The solution to make this go faster is faster CPUs.

Example 2, I have several processes that run for 1 day but I'm doubling the number of these processes that will be running simultaneously.  The users are happy with the 1 day turnaround on the current process.  However, my CPU utilization runs at about 45% during this processing.  If I double the number of CPUs I can add the additional workload without impacting the current response time.

Example 3, I commute back and forth between Tampa and Orlando 4 days a week.  I make the trip in 80 minutes one way.  If I wanted to reduce my travel time to 60 minutes (imagine no speed limits on I-4) I would increase my speed.  And if my car couldn't handle going faster I would get a new car that went faster.  However, if there were 6 additional people that wanted to make the trip, I would get an additional car to carry the passengers (or a bigger car, if you want to think of the seats as the actual queue).

To wrap up, CPU utilization is tied to the arrival rate, the service time, and the number of queues.  An increase in CPU utilization will, at a point, impact response time.  To understand how CPU utilization impacts performance, it's important to characterize the workload -- determine what a transaction means.   This is only a small slice of the information on the subject and I still have more to learn.   The challenge is communicating this information to the people who make the decisions, in a way that is meaningful.   I find it to be a great challenge, as they are always looking for just one magical number that will show how the databases are performing and also mean something to the business.