I returned from vacation a few weeks ago to a serious performance issue that had hampered processing by at least 75% overall. Some processing wasn't impacted.
I read through all the email threads theorizing that the problem was the redo switching.
I pulled up the gv$session looking keying on the minimum information to see what Oracle was telling us through the wait interface. I was floored the Global Cache Waits were significantly high, processes weren't moving through the GC. I confirmed by running some AWR reports over different time periods since the issue was first reported.
I sent an email explaining my findings using the wait interface, and then discounted the redo switching since we were not seeing any waits for Archiver, Log Writer or any other background process on log waits. There were background processes that waited on GC waits.
I explained to the team that Oracle will tell you what is slowing processing down if you just listen to the wait interface. It won't give you the root cause but it will tell you where to start looking. And even though we were switching frequently, we weren't waiting on the switch to occur and therefore that was not the area to concentrate efforts.
Another theory was sent out that the Interconnect was probably suffering because some of reference tables were cached on one of the old nodes that was no longer being used by the application (we are actually preparing to remove it from the cluster). The theory was that the blocks were remaining on the instance and all the other instances had to always go to that instance to retrieve the blocks. A perfect example of how global cache does not work. I explained how the buffer cache worked and that the blocks would actually be pulled over to the requesting instance and the touch count incremented.
I also explained that it didn't appear we were saturating the Interconnect by the amount of traffic that was trying to be pushed across, but we were experiencing block lost, according to the wait interface.
I took the additional step in disproving the theory by caching data on one of the other instances, and then pulling the data over to one of the other instances. Checking the waits -- Global Cache Waits.
Another theory was raised that because we did a lot of full table scans (FTS) and our multi-block read count was high (16K), that this was why we were experiencing the waits. Because on Metalink there was a note that OS settings may need to be adjust under these circumstance. In particular the UDP buffers, but ours were already at 256K. We have always had a 16K MBR count. So what changed? Did the workload actually start soaring? Were we saturating the bandwidth of the Interconnect?
The network and linux administrator were instructed to start checking the OS and hardware. As well as, I requested all of the changes that went into the database, OS or hardware prior to the waits issue.
In the meantime, the theory of the instance buffer cache kept surfacing from individuals that really didn't have an understanding of how buffer cache or global cache fusion worked. So I decided the easiest way to demonstrate was to shutdown the node. Which I did... and the results were not exactly what I wanted.
The entire cluster started having issues as nodes were evicted.
Things proceeded to get worse. As we tried to bring up the individual nodes they experienced reconfiguration issues and would keep rebooting. After about an hour of the new DBA on the block struggling I realized the issue..... THE SWITCH WAS BAD! Then I started with admonishing myself for not realizing it from the onset of the issue.
I immediately began explaining my ah ha moment. The nodes were having problem reconfiguring and staying in the cluster because they were having trouble communicating across the Interconnect. The Interconnect being nothing more than 8 Cat 5 wires plugged into a 48 port Netgear switch (yes I know not exactly a stellar setup but you learn to make due with what you have at the moment). The chances of all 8 wires going bad was slim, but the switch a definite could happen. Everyone thought I was insane and said switches don't go bad. I differed with them based on my husband managing a team of network engineers who dealt with switches going bad all the time. They are after all nothing more than electronics and they too can over heat, have cards go bad just like any other electronics.
We decided to bring up one node at time and let it come all the way up with the database, before we started the next node. And my boss wanted us to make sure that we didn't bring up the old nodes at all. We obliged, after all by this point I was fairly certain what the outcome would demonstrate. After node one was up we brought up Grid so that we could have an instant picture for assessment. The second node came up and there were the cluster waits. The only activity that was occurring was the opening of the database and then just the background processes. The cluster waits were noticeably high, to those of us who had a mental baseline of what they should be at this time under normal circumstances. We introduced the third node and the waits shot up more. We then started getting Inbound Connection Timeouts on the third node. We made a collective decision to bring down the node.
Not having a spare switch on hand (long story), we decided to grab a little 6 port 100mb switch. Made the swap with the two active node, and the Global Cache Waits pretty much disappeared. Since only half the bandwidth was being used, we decided to experiment and see what would happen if we brought up the third node.
Unfortunately, since we only had 100mb switch and the traffic pushed to just over 90mb with the third node we were saturating the bandwidth. We needed to back off of using the third node until we could be procure a Gig switch. Which we did the next day.
With the new switch in place the Global Cache Waits returned to normal. Block lost were a thing of the past.
The one thing that I came out of this fire fighting activity is to not make troubleshooting more difficult. Think through what comprises the waits in question and remember to include the hardware in the assessment. For Global Cache Waits if you have been cruising right along without any issues and then all of sudden the waits soar chances are that the issue isn't with the database or RAC, but with the Interconnect.
No comments:
Post a Comment