When I SSH over to one of the nodes in our RAC cluster, the response is horrible: three minutes plus before being prompted for a username, and then another three minutes plus before being prompted for a password.
First I pull up top. The first thing I notice is the run queue:
load average: 23.09,22.48,21.73 -- Ouch
I check the CPU:
Cpu(s): 0.3%us, 0.6%sy, 0.0% ni, 74.4%id, 24.7% wa, 0.0%hi, 0.0%si
So it's not a process hogging the CPU; however, I/O wait is high at 24.7%.
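On Linux the load average counts tasks stuck in uninterruptible sleep (state D, usually waiting on I/O) as well as runnable ones, which is how the load can sit above 20 while the CPU is mostly idle. A minimal sketch for listing those tasks follows; the function name `blocked_tasks` is my own, and it assumes `ps` output with the state in the first column.

```shell
# List tasks in uninterruptible sleep (state D), which inflate the
# load average without using any CPU.
# Reads `ps -eo stat,pid,comm` style output on stdin (header on line 1).
# blocked_tasks is a hypothetical helper name, not a standard tool.
blocked_tasks() {
    awk 'NR > 1 && $1 ~ /^D/ { print $2, $3 }'
}

# Typical use on a live system:
#   ps -eo stat,pid,comm | blocked_tasks
```

If this prints a long list of PIDs, the next question is what they are all waiting on.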
I check the MEM:
Mem: 65742800k total, 22487544k used, 43255256k free, 680408k buffers
That is definitely not right, as I should have around 57700140k used. It looks like the instance may not be up.
I check the instance:
ps -ef|grep pmon
and only the asm1_pmon is running
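A plain `grep pmon` can also match the grep process itself, so a slightly stricter filter may be handy. The sketch below assumes the background processes follow the usual `ora_pmon_<SID>` and `asm_pmon_<SID>` naming; the function name `pmon_processes` is made up for illustration.

```shell
# Print the command name of every PMON background process.
# Reads `ps -ef` output on stdin; the command is taken from the last field.
# Assumes names of the form ora_pmon_<SID> / asm_pmon_<SID>.
# pmon_processes is a hypothetical helper name.
pmon_processes() {
    awk '$NF ~ /_pmon_/ { print $NF }'
}

# Typical use on a live system:
#   ps -ef | pmon_processes
```

If only the ASM entry comes back, the database instance on that node is down.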
I check the filesystem space
And the root filesystem shows / 100%
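For a quick check that does not rely on eyeballing the output, something like the following could flag any filesystem at or above a threshold. It is only a sketch: `full_filesystems` is a name I made up, the 90% default is arbitrary, and it assumes mount points without spaces.

```shell
# Print "mountpoint usage" for every filesystem at or above a threshold.
# Reads `df -P` output on stdin (header on line 1, capacity in column 5,
# mount point in column 6). Default threshold is 90 percent.
# full_filesystems is a hypothetical helper name.
full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign
        if (use + 0 >= limit) print $6, $5
    }'
}

# Typical use on a live system:
#   df -P | full_filesystems 95
```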
That's the answer. I'm using ASM, and the root filesystem filling up can cause the instance to go belly up. It will also cause the run queue to jump, as processes that need the root filesystem will not be able to proceed.
The resolution is to find the largest files in the root filesystem and remove them. In our case the trace file cleanup routine is not functioning properly, and I find trace files dating back to 2009, including the cdmp and core dumps. I remove all the trace files. I also zip a few large files that have appeared recently and that I know are not being actively used, and will look to remove them later.
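The hunt for space hogs can be sketched roughly as below. Both function names are hypothetical, the `*.trc` and `*.trm` patterns are an assumption about how the trace files are named, and the second function only lists candidates rather than deleting anything. The `-xdev` option keeps `find` on the filesystem it starts on, so other mounts are not scanned.

```shell
# Show the largest files under a directory without crossing into other
# mounted filesystems. Output is "size_in_KB path", biggest first.
# largest_files is a hypothetical helper name.
largest_files() {
    dir="${1:-/}"
    count="${2:-20}"
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null |
        sort -rn | head -n "$count"
}

# List trace files older than a number of days (default 30).
# Listing only: review the output before removing anything.
# old_trace_files is a hypothetical helper name.
old_trace_files() {
    dir="$1"
    days="${2:-30}"
    find "$dir" -xdev -type f \( -name '*.trc' -o -name '*.trm' \) \
        -mtime +"$days" 2>/dev/null
}

# Typical use on a live system (as root):
#   largest_files / 20
```

Listing first and deleting second is deliberate here; on a node that is already in trouble I would rather read the list than trust a one-liner with `rm` in it.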
I try to restart the instance
srvctl start instance -i orl1 -d oral
But I receive a dependency error on the resource, so I opt to bounce the node, which will also restart the CRS and both instances.
If your CPU usage is low but your run queue is high, check the root filesystem. It's the most common issue we see in this scenario.