Problem Solving with AWR report

AWR report is by  far one of the best and superior Oracle diagnostic tool. No other database diagnostic tool comes even close to AWR reports. I will discuss one of our recent experiences of  using AWR to resolve a complex issue.  I consider this as an complex issue as the issue was identified with AWR even though it was not an database issue.

.Note: Our SLA’s are in milliseconds and every change to the database has to be performance tested. You will surprised how sensitive the database/OS/App becomes when you have millisecond SLA’s

.Problem : Response times are really very bad in  our performance environment.

.

What could be the most likely culprits?.

  1. Newly applied AIX patch
  2. Application Code
  3. Unusual Storage  activity/SRDF
  4. Network connectivity issues.
  5. Database itself
.
.

Top wait events from Oracle.

..

.

“log file sync” was the top wait event  with average wait of 17ms. Normally this wait event is around 8-9ms. We also noticed unusual number of SQL*net break/reset to client. I think important most part of troubleshooting is to try and relate wait events.  Since “log file sync” doubled , the obvious culprit becomes our EMC storage which was not the actual case.

But the key to resolve our issue was load  profile

From the load profile, the redo size increased from  169 to 1394, the number of block changes increased from 1207 to 15437 which lead to conclusion that some additional unusual activity is occurring in the database. This became more intriguing and complex since there was no change to apps/database. With this clue, we drilled down further into AWR reports and identified few more  more major variations.

..

.

Let me summarize the above data.

  1. “transaction rollbacks”  increased from  281 to 322,439.
  2. “undo change vector size” increased from 62.2 to 330.7
  3. “user commits” increase from 93K to 679K.
  4. “user rollbacks” remained the same.

Here are some of the key  definitions for wait events discussed in this blog

  1. SQL*Net break/reset : A wait event indicating an error/unhandled exception during execution.
  2. User Rollbacks: Rollback issued by application/user
  3. Transaction Rollbacks: Rollbacks performed by Oracle like a transaction could not be completed because of some constraint violation or so.
Because we see an increase in “transaction rollbacks” and “redo size” ,the evidence is  pointing towards some data issues causing execution  of additional business logic or so but were unsuccessful.  A further analysis proved this theory correct and issue was resolved. We had  inadvertently deleted some test data that caused different flow of business logic from our regular tests.
.
So AWR to rescue again

Tagged: , , , ,

%d bloggers like this: