Problem Solving with AWR report

The AWR report is by far one of the best Oracle diagnostic tools; no other database diagnostic tool comes close. I will discuss one of our recent experiences of using AWR to resolve a complex issue. I consider it complex because the issue was identified with AWR even though it was not a database issue.

Note: Our SLAs are in milliseconds and every change to the database has to be performance tested. You would be surprised how sensitive the database/OS/application becomes when you have millisecond SLAs.

Problem: Response times are very bad in our performance environment.


What could be the most likely culprits?

  1. Newly applied AIX patch
  2. Application Code
  3. Unusual Storage  activity/SRDF
  4. Network connectivity issues.
  5. Database itself

Top wait events from Oracle.


“log file sync” was the top wait event, with an average wait of 17ms; normally this wait event is around 8-9ms. We also noticed an unusual number of “SQL*Net break/reset to client” waits. I think the most important part of troubleshooting is to try and relate wait events. Since “log file sync” doubled, the obvious suspect became our EMC storage, which turned out not to be the actual cause.

But the key to resolving our issue was the load profile.

From the load profile, the redo size increased from 169 to 1394 and the number of block changes increased from 1207 to 15437, which led to the conclusion that some additional, unusual activity was occurring in the database. This became more intriguing and complex since there had been no change to the application or database. With this clue, we drilled down further into the AWR report and identified a few more major variations.


Let me summarize the key variations (a query sketch for pulling these statistics directly from the AWR repository follows the list).

  1. “transaction rollbacks”  increased from  281 to 322,439.
  2. “undo change vector size” increased from 62.2 to 330.7
  3. “user commits” increased from 93K to 679K.
  4. “user rollbacks” remained the same.
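
If you want to confirm this kind of jump outside the formatted report, the AWR repository can be queried directly. Below is a minimal sketch against DBA_HIST_SYSSTAT; the snapshot IDs are placeholders for your own baseline and problem periods.

-- delta of selected statistics between two snapshots (snap IDs are placeholders)
SELECT e.stat_name,
       e.value - b.value AS delta_in_period
FROM   dba_hist_sysstat b,
       dba_hist_sysstat e
WHERE  b.snap_id = 1000                -- begin snapshot of the period
AND    e.snap_id = 1001                -- end snapshot of the period
AND    b.dbid = e.dbid
AND    b.instance_number = e.instance_number
AND    b.stat_name = e.stat_name
AND    e.stat_name IN ('redo size', 'transaction rollbacks', 'user commits', 'user rollbacks')
ORDER  BY e.stat_name;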

Here are some key definitions for the wait events and statistics discussed in this blog:

  1. SQL*Net break/reset to client: A wait event indicating an error or unhandled exception occurred during execution.
  2. User rollbacks: Rollbacks issued by the application/user.
  3. Transaction rollbacks: Rollbacks performed by Oracle itself, for example when a transaction could not be completed because of a constraint violation.
Because we saw an increase in “transaction rollbacks” and “redo size”, the evidence pointed towards some data issue causing additional business logic to execute and then fail. Further analysis proved this theory correct and the issue was resolved: we had inadvertently deleted some test data, which caused a different flow of business logic from our regular tests.

So, AWR to the rescue again.

ORA-14128: FOREIGN KEY constraint mismatch in ALTER TABLE EXCHANGE PARTITION : The solution

I am writing this blog to provide a simple way to resolve ORA-14128. I did not find any reasonably documented solution to fix this issue without disabling the constraints.

Here goes my story. Upgrading from Oracle 10.2.0.3 to 11.2.0.2 introduced a few problems, mainly due to our high-concurrency, high-availability OLTP system. In 10.2.0.3, our partition maintenance was done online without disabling constraints, using the approach described below with an example.

Let's assume two tables, transaction and sales, with the following attributes:

Transaction (parent/child) table with transid as the primary key and an index on salesid.

Sales (parent) table with salesid as the primary key.

Foreign key definition: transaction.salesid references sales.salesid. Also, a lot of other tables reference transaction.transid, so the transaction table is both a parent and a child.

Partition Maintenance Approach: Our ultimate goal for partition maintenance is to drop old partitions, so all I have to do is issue “alter table transaction drop partition transaction_part_2011”. This command will only succeed if there are no foreign key constraints referencing this parent table; otherwise it fails with “ORA-02266: unique/primary keys in table referenced by enabled foreign keys”. Since partition maintenance is done online, we don’t have the luxury of disabling constraints.

Therefore we had to do all the additional steps involved in a partition exchange to accommodate online partition maintenance. One of the key requirements of partition exchange is that the tables being exchanged must be identical in structure: same indexes, constraints, etc. You also have the option of disabling constraints to meet this requirement.

Note: With steps 1-4 we are trying to make the transaction_duplicate table identical to transaction.

  1. create table transaction_duplicate as select * from transaction where 1 = 0;
  2. alter table transaction_duplicate add constraint transaction_dup_pk primary key (transid);
  3. create index salesid_idx on transaction_duplicate (salesid);
  4. alter table transaction_duplicate add constraint fk_salesid foreign key (salesid) references sales (salesid);
  5. alter table transaction exchange partition transaction_part_2011 with table transaction_duplicate including indexes without validation;
  6. alter table transaction drop partition transaction_part_2011;

VERY IMPORTANT: Step 4 is optional if one of the tables used in the exchange is empty, which is true in our case as we are creating an empty duplicate table.

Our problems started with Step 4 after upgrading to 11.2.0.2. We started getting “ORA-00054: resource busy and acquire with NOWAIT specified” because Oracle changed the locking behavior in 11g.

We opened an SR with Oracle, and as per Oracle, the code fix for bug 5909305 introduces an intentional change in locking behavior that is effective from 11.1.0.6 onwards, i.e., from 11g onwards it is correct and expected that DML on the child table will take an SX lock on the parent.

For customers that cannot live with the changed behavior, the fix in bug 6117274 allows the locking change to be reverted to pre-11g behavior by setting "_fix_control" to '5909305:OFF'.


In our case, since Step 4 was optional, we removed it, but our problems did not stop there. Due to the nature of our business, every change that goes into the database requires a rollback plan just in case it is needed. In other words, we wanted to be able to roll back the partition exchange by performing another exchange. Now Step 4 became a must for the rollback because neither table is empty after the first exchange (one of the partitions is empty, not the table).

We were back to square one with a small difference: initially we were trying to fix ORA-00054 and now we were trying to fix ORA-14128. After a lot of reading, trials and prototypes, I was able to fix ORA-14128. The solution is very simple: all you have to do is put the referential integrity constraints in ENABLE NOVALIDATE state. This was acceptable for our database. So if you can put your constraints in ENABLE NOVALIDATE, then you have a simple fix for ORA-14128.
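
For illustration with the example tables above, a minimal sketch of that change might look like the following; fk_salesid is the constraint created in Step 4, while fk_trans_salesid stands in for whatever the corresponding foreign key on the partitioned table is called in your schema:

-- put the matching FK constraints on both tables into ENABLE NOVALIDATE so the structures match
ALTER TABLE transaction           MODIFY CONSTRAINT fk_trans_salesid ENABLE NOVALIDATE;
ALTER TABLE transaction_duplicate MODIFY CONSTRAINT fk_salesid       ENABLE NOVALIDATE;

-- the rollback exchange should no longer raise ORA-14128
ALTER TABLE transaction EXCHANGE PARTITION transaction_part_2011
  WITH TABLE transaction_duplicate INCLUDING INDEXES WITHOUT VALIDATION;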

Thanks for reading.

Find if SPMs are being used

Oracle provides multiple ways to implement plan stability, starting with outlines (8i) followed by SQL profiles (10g). As of 11g, plan stability is implemented with SPMs, or SQL Plan Management. Of course, you can also use hints to implement plan stability. Some DBAs swear by hints whereas others are opposed to them, as they do not let the optimizer choose the most effective execution plan. Without going into the merits of one feature over the other, let me get to the main purpose of this blog.

If you implement outlines, SQL profiles or SPMs, you really want a way to ensure that the optimizer is behaving as it is supposed to. Unfortunately, with 11.2.0.2 there is no easy way to find out if SPMs are being used or not. The LAST_EXECUTED column of the DBA_SQL_PLAN_BASELINES view does not always get updated as documented. As per the documentation, the LAST_EXECUTED column provides the timestamp when the plan baseline was last executed.

Here is a specific scenario where the LAST_ACTIVE_TIME column of V$SQL shows 11-AUG-2011 14:36:09 whereas the LAST_EXECUTED column of DBA_SQL_PLAN_BASELINES shows 08-AUG-11 02.39.58.000000 PM.

SQL> select sql_id, SQL_PLAN_BASELINE, LAST_LOAD_TIME, LAST_ACTIVE_TIME from v$sql where sql_id = 'fmg67gm9j2skh';

SQL_ID SQL_PLAN_BASELINE LAST_LOAD_TIME LAST_ACTIVE_TIME
------------- ------------------------------ ------------------- --------------------
fmg67gm9j2skh SQL_PLAN_7xaxz3w5hd4fbd6e420b9 2011-08-11/14:32:18 11-AUG-2011 14:36:09
fmg67gm9j2skh SQL_PLAN_7xaxz3w5hd4fbd6e420b9 2011-08-11/14:36:09 11-AUG-2011 14:36:09
fmg67gm9j2skh SQL_PLAN_7xaxz3w5hd4fbd6e420b9 2011-08-11/14:36:09 11-AUG-2011 15:17:40

SQL> select LAST_EXECUTED, plan_name from dba_sql_plan_baselines where plan_name = 'SQL_PLAN_7xaxz3w5hd4fbd6e420b9';

LAST_EXECUTED PLAN_NAME
--------------------------------------------------------------------------- ------------------------------
08-AUG-11 02.39.58.000000 PM SQL_PLAN_7xaxz3w5hd4fbd6e420b9

We opened an SR with Oracle. Oracle got back to us stating that we are hitting Bug 8864741 and provided a workaround to query SQLOBJ$. However, even this table was not getting updated. So until this bug is fixed, the only foolproof way is tracing.
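
If you do fall back on tracing, one option is an optimizer trace on a test session, since the 10053 trace records how the statement was matched against the SQL Management Base. This is only a sketch; the trace is written on a hard parse only and the exact wording of the SPM entries varies by version, so treat it as a starting point.

-- enable an optimizer trace for the current session, run the statement, then disable it
ALTER SESSION SET events '10053 trace name context forever, level 1';
-- ...execute the SQL statement you are verifying here (vary the text or flush it to force a hard parse)...
ALTER SESSION SET events '10053 trace name context off';
-- then search the generated trace file in the diagnostic trace directory for the "SPM" entries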

Thinking outside of the patch: Resolving “cursor: pin s”

Simple Solutions to “cursor: pin s” wait events

There are hundreds of articles about the “cursor: pin s” wait event, but very few discuss real, simple and implementable solutions to it and the associated CPU spike (Bug 6904068). The objective of this article is to discuss solutions, so not much emphasis is given to detecting “cursor: pin s” waits. Whether you are a PL/SQL shop or a Java shop, the options discussed here to reduce “cursor: pin s” focus on solutions that are simple, manageable and do not require extensive QA testing.

Now the million dollar question: “Why would someone not apply the patch and resolve this issue?” There may be many reasons; I will list just a few of them:

1.    Millisecond SLAs
2.    Use of underscore parameters, especially spare parameters
3.    Patch conflicts
4.    Extensive QA testing

Before we discuss some easy ways to resolve the “cursor: pin s” issue, let us recap some of the existing solutions. Please contact Oracle Support if you are planning to use any of the following.

1.    Apply the patch for bug 6904068 and then adjust _FIRST_SPARE_PARAMETER. After applying the patch, setting _FIRST_SPARE_PARAMETER to zero will retain the current behavior.

2.    Set the underscore parameter "_CURSOR_FEATURES_ENABLED" to 10.

3.    Disable mutexes and revert to library cache pins. Mutexes can be disabled by setting the underscore parameter "_KKS_USE_MUTEX_PIN" to FALSE.


Of the undocumented solutions documented above, the first is the most widely used. If you don't care much about your SLAs, just apply the patch and take a few milliseconds hit; but when your SLAs are in microseconds or milliseconds, applying the patch may not be feasible because the lowest value that can be specified for "_first_spare_parameter" is 1 centisecond (10 milliseconds). That said, even though the minimum is 1 centisecond, the actual impact to response time is usually much smaller.

Now let's start with the basics to understand the root cause of this problem.

What are Mutexes?
Oracle introduced mutexes in version 10.2, replacing library cache pins; the name stands for mutual exclusion (the functionality is similar to mutexes in C). Mutexes are used to protect data or other resources from concurrent access. Don't jump to any conclusion based on the previous statement; please read ahead.


How do Oracle mutexes work?
Oracle uses counters, called reference (ref) counters, to implement mutexes:
1. Every time a particular mutex is obtained, Oracle increments its value.
2. Every time a particular mutex is released, Oracle decrements its value.


Finally, what is “cursor: pin s”?
It is a wait event that occurs when a session wants a mutex in shared mode on a cursor. As mentioned in the previous section, Oracle has to update the ref counter to get the mutex. However, it is very important to understand that access to these counters is not concurrent: if there are concurrent sessions trying to obtain the mutex, only one session can actually increment or decrement the reference counter at a time, so the other sessions must wait.

Four key points to avoid confusion:
1.    Oracle is requesting the mutex in shared mode.
2.    No session is holding the mutex in exclusive mode.
3.    The wait event occurs because Oracle was not able to update the reference counter.
4.    Sessions will retry over and over until the mutex is obtained, which may cause a CPU spike.


The simplest way to detect “cursor: pin s” is to run an AWR report, and the queries causing the wait event can be obtained from an ADDM report.
The simplest analogy for “cursor: pin s” is a traffic bottleneck caused by a toll booth: there may be no real traffic problem due to road capacity or a lane merger; instead, the backup is caused by the toll booth itself.
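
If you want a quick live look rather than waiting for the next AWR snapshot, a simple sketch is to check which cursors sessions are currently waiting on; note that the event name in V$SESSION is case sensitive ("cursor: pin S"):

SELECT sql_id, COUNT(*) AS waiters
FROM   v$session
WHERE  event = 'cursor: pin S'
AND    state = 'WAITING'
GROUP  BY sql_id
ORDER  BY waiters DESC;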


Now that we have understood the problem, let us look at a couple of solutions that do not require underscore parameters.

Assumptions

Connections are made to the database via application servers.

Outline Approach: Outlines are used to trick Oracle into creating multiple child cursors for the high-concurrency SQL statement.


Step 1.    Create four categories of outlines for the same high-concurrency SQL.

CREATE OUTLINE APP1_SQL1 FOR CATEGORY CAT_APP1
ON select ename from emp where ename = 'SCOTT' ;

CREATE OUTLINE APP2_SQL1 FOR CATEGORY CAT_APP2
ON select ename from emp where ename = 'SCOTT' ;

CREATE OUTLINE APP3_SQL1 FOR CATEGORY CAT_APP3
ON select ename from emp where ename = 'SCOTT' ;

CREATE OUTLINE APP4_SQL1 FOR CATEGORY CAT_APP4
ON select ename from emp where ename = 'SCOTT' ;


Step 2.    Issue ALTER SESSION to enable the appropriate outline category from each application server before executing the high-concurrency SQL. Logon triggers or any other approach can be used (a logon trigger sketch follows the ALTER SESSION examples).

ALTER SESSION SET USE_STORED_OUTLINES = CAT_APP1 ;
ALTER SESSION SET USE_STORED_OUTLINES = CAT_APP2 ;
ALTER SESSION SET USE_STORED_OUTLINES = CAT_APP3 ;
ALTER SESSION SET USE_STORED_OUTLINES = CAT_APP4 ;
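
As a sketch of the logon-trigger route, something along these lines could set the category automatically; the trigger name and the host names (appsrv1, appsrv2) are purely illustrative, and you would match on however your application servers can be distinguished:

CREATE OR REPLACE TRIGGER set_outline_category
AFTER LOGON ON DATABASE
BEGIN
  -- pick the outline category based on which application server opened the session
  IF SYS_CONTEXT('USERENV', 'HOST') = 'appsrv1' THEN
    EXECUTE IMMEDIATE 'ALTER SESSION SET USE_STORED_OUTLINES = CAT_APP1';
  ELSIF SYS_CONTEXT('USERENV', 'HOST') = 'appsrv2' THEN
    EXECUTE IMMEDIATE 'ALTER SESSION SET USE_STORED_OUTLINES = CAT_APP2';
  END IF;
END;
/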


In the example below, I executed the same SQL statement with different outline categories. In the first session I executed the query once, whereas in the second session I executed it seven times. You can see the output by querying V$SQL; please note that the CHILD_ADDRESS (child cursor) is different for the two entries.


SQL>SELECT ADDRESS, CHILD_ADDRESS, EXECUTIONS, OUTLINE_CATEGORY, SQL_TEXT
2  FROM V$SQL WHERE SQL_ID = '7xhzp0kt9yt6j';

ADDRESS          CHILD_ADDRESS    EXECUTIONS OUTLINE_CATEGOR SQL_TEXT
---------------- ---------------- ---------- --------------- --------------------------------------------------
070000013B771C88 070000013B770D58          1 CAT_APP1        select ename from emp where ename = 'SCOTT'
070000013B771C88 070000013B75C4A0          7 CAT_APP2        select ename from emp where ename = 'SCOTT'

Java PreparedStatement Approach: This approach is good when the SQL statement is prepared using Java. We modify the SQL statement so as to have multiple SQL areas for the same logical statement, by adding a table alias to the SQL text. The simplest way to generate the table alias is to use the hostname.


"select ename from emp myapp1 where myapp1.ename = :1"
Instead of
"select ename from emp where ename = :1"

"select ename from emp myapp2 where myapp2.ename = :1"
Instead of
"select ename from emp where ename = :1"

Disable Optimizer Features

Every version of the Oracle optimizer has new features and bugs. As the requirements of every Oracle environment vary, there may be a need to enable or disable these features or bug fixes. From Oracle version 10.2.0.2, Oracle introduced the underscore parameter "_FIX_CONTROL" to enable and disable certain features or bug fixes, mainly related to the optimizer. This parameter can be specified at the session or system level, or in the init.ora file.

The syntax is "_FIX_CONTROL"='BUG#:OFF' or "_FIX_CONTROL"='BUG#:ON'. Incorrect syntax may prevent the database from starting, or cause hangs if issued with an ALTER SYSTEM command.

Any fix that you want to disable or enable must be installed and visible in V$SYSTEM_FIX_CONTROL. This view provides a brief description of the fix and the Oracle version in which it was introduced. At the session level, you can query V$SESSION_FIX_CONTROL.
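
As a quick sketch, using the locking change from bug 5909305 discussed earlier as an example, you might first confirm the fix is known to your release and then toggle it at session level:

-- confirm the fix exists in this release and check its current value
SELECT bugno, value, description
FROM   v$system_fix_control
WHERE  bugno = 5909305;

-- disable it for the current session only
ALTER SESSION SET "_fix_control" = '5909305:OFF';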

Note: Please consult with Oracle Support before setting this parameter.

Resolving performance issue with GoldenGate Extract

In general, extract does not require any tuning. Tuning is mainly required for replicat, as it must reproduce the source operations by constructing SQL statements, resulting in random I/O activity. We were therefore surprised when we noticed slow extract performance after a couple of hours, and extract hanging issues, after upgrading from 10.4 to 11.1.1.1.0.

Slow performance of Extract

The slow performance was traced to the new feature where the "ADD EXTRACT" command by default registers the SCN with RMAN, to prevent RMAN from deleting archive log files that extract may need for restart and recovery. It is a very good feature, but it looks like there is a bug that causes the slowness. As per Oracle, extract is supposed to retry the registration every 10ms for up to 4 hours, but after 4 hours the wait time overflows, which causes extract to issue the SQL to register the SCN thousands of times a second.

This new feature is disabled with the extract parameter "TRANLOGOPTIONS LOGRETENTION DISABLED".
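
For context, here is a minimal sketch of where that line might sit in an extract parameter file; the extract name, credentials, trail path and table specification are placeholders for your own configuration:

EXTRACT ext1
USERID ggadmin, PASSWORD <password>
TRANLOGOPTIONS LOGRETENTION DISABLED
EXTTRAIL ./dirdat/aa
TABLE myschema.*;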

Extract Hanging

We occasionally noticed that extract stopped extracting data from the current redo log until the log became full and an archived log file was created. Once the archived log file was created, extract caught up on the backlog and started processing normally.

This issue was traced to the new ReadAhead feature, implemented to improve the queue-reading logic; it lacks some synchronization with regard to checking the queue statuses (empty, full, etc.) and causes intermittent hangs.

This new feature is disabled with the extract parameter "TRANLOGOPTIONS _NOREADAHEAD ANY". This is an underscore parameter, similar to underscore initialization parameters in Oracle.

11g Compression Uncompressed

In 11g, there is compression for everyone. I am really impressed with Oracle's direction with respect to compression. The main problem with Advanced Compression is that it requires additional option licensing. Not a bad deal if you are trying to save on storage costs; just share the savings with Oracle.

This blog will discuss the key features and implementation of compression. To start with, compression is supported at the following levels.


1. Storage

  • OLTP compression for DML (Structured Data)
  • SecureFiles Compression (Unstructured Data)

2. Network

  • Data Guard redo Compression

3. Backup

  • RMAN compression
  • Data pump compression


OLTP Compression

  • New compression algorithm uses deferred or batched approach
  • Data is inserted as-is, without compression, until the PCTFREE threshold is reached
  • Compression of the block's data starts once the PCTFREE threshold is reached
  • Can be enabled at the table, partition or tablespace level
  • No need to decompress the data during reads
  • Recommended for tables with low update activity (a quick enablement sketch follows)
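
As a minimal sketch (table names are illustrative), enabling OLTP compression in 11.2 looks like this; remember it requires the Advanced Compression option:

-- new table compressed for OLTP workloads
CREATE TABLE orders_hist
( order_id   NUMBER,
  order_date DATE,
  status     VARCHAR2(20)
) COMPRESS FOR OLTP;

-- or enable it on an existing table; only blocks written after this point are compressed
ALTER TABLE existing_orders COMPRESS FOR OLTP;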


Data guard Compression

  • Redo is compressed as it is transmitted over a network.
  • Helps utilize network bandwidth efficiently when Data Guard spans data centers
  • Faster resynchronization of the standby during gap resolution
  • Recommended for low-bandwidth networks
  • Implemented with the COMPRESSION attribute of the LOG_ARCHIVE_DEST_n initialization parameter (see the sketch below)
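
A minimal sketch of turning it on for a remote destination; the service name standby_db and destination number 2 are placeholders, and redo transport compression also requires the Advanced Compression option:

ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=standby_db ASYNC COMPRESSION=ENABLE' SCOPE=BOTH;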


RMAN Compression

  • Supports compression of backups using the ZLIB algorithm
  • Faster compression and lower CPU utilization compared to the 10g default, BZIP2
  • Lower compression ratio compared to BZIP2
  • Implemented with the CONFIGURE COMPRESSION ALGORITHM 'value' command, where value can be HIGH, MEDIUM (ZLIB) or LOW (LZO); see the example below
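
For example, from the RMAN prompt (anything other than BASIC compression requires the Advanced Compression option):

RMAN> CONFIGURE COMPRESSION ALGORITHM 'MEDIUM';
RMAN> BACKUP AS COMPRESSED BACKUPSET DATABASE;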


Data Pump Compression

  • Compression of metadata introduced in 10g
  • Compression of data introduced in 11g.
  • Both are inline operations
  • Saves on storage allocation
  • No need to uncompress before import
  • Implemented with the COMPRESSION parameter; supported values are ALL, DATA_ONLY and METADATA_ONLY (an expdp example follows)
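
A minimal expdp sketch; the schema, directory object and dump file names are placeholders, and COMPRESSION=ALL requires the Advanced Compression option:

expdp system/<password> schemas=myschema directory=DATA_PUMP_DIR dumpfile=myschema_compressed.dmp compression=ALL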


Multisection Backups

Prior to Oracle 11g, the lowest unit of RMAN parallelism was the datafile, which meant you could not back up a single large datafile in parallel. Starting with 11g, Oracle supports backing up a single datafile in parallel using the concept of sections. A section is a contiguous range of blocks in a datafile, and as of 11g up to 256 sections (channels) per datafile are supported. This is very useful if you are using bigfile tablespaces.
Please be aware that multiplexed backups will not contain datafiles or sections from other datafiles if section backups are used.
Benefits
  1. Improved performance for backup and restore.
  2. Restart only the failed section backup instead of the whole datafile.
In our testing, backups using sections took 30-40% less time than a conventional RMAN backup; a syntax sketch follows.
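
A minimal sketch of a section-based backup from the RMAN prompt; the 25G section size and the tablespace name big_ts are placeholders you would size to your own datafiles, and multiple channels must be configured or allocated for the sections to actually run in parallel:

RMAN> BACKUP SECTION SIZE 25G TABLESPACE big_ts;
RMAN> BACKUP SECTION SIZE 25G DATABASE;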

11g PL/SQL native compilation

PL/SQL native compilation is not something that is really new, but it is something that became practical in Oracle 11g.

What does the Oracle world look like without PL/SQL native compilation?
  • PL/SQL is compiled to an intermediate form called "mcode" (machine-readable code).
  • The mcode is stored in the database and interpreted at run time.
  • This is the default mode, set via the init.ora parameter PLSQL_CODE_TYPE=INTERPRETED.
PL/SQL native compilation in 9i/10g
  • Convert PL/SQL code to C, then compile it with a C compiler and dynamically link it into the Oracle processes.
  • Configure the initialization parameters PLSQL_CODE_TYPE, PLSQL_NATIVE_LIBRARY_DIR and PLSQL_NATIVE_LIBRARY_SUBDIR_COUNT. One more reason to hire an Oracle DBA.
PL/SQL native compilation in 11g
  • No need for a C compiler; PL/SQL is compiled to machine code (DLL) and stored in the SYSTEM tablespace instead of on the file system.
  • Just set the init.ora parameter PLSQL_CODE_TYPE to NATIVE instead of the default value of INTERPRETED. The other native-compilation init.ora parameters have been deprecated.
How to set it
At Session Level
  • At session level, before creating the PL/SQL procedure:
                          ALTER SESSION SET PLSQL_CODE_TYPE = NATIVE;
At Object Level
  •   Recompile the stored procedure or package:

                          ALTER PACKAGE mypackage COMPILE PLSQL_CODE_TYPE = NATIVE;


At Database Level
  • Start the database in UPGRADE mode.
  • Execute $ORACLE_HOME/rdbms/admin/dbmsupgnv.sql
  • Shut down (shutdown immediate), restart normally, and recompile all objects; a sketch of the full sequence follows.
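
A hedged sketch of the database-wide switch, assuming you also want the parameter to persist in the spfile; verify the exact sequence against the documentation for your patch level before running it on a real system:

-- 1. make NATIVE the database-wide default
ALTER SYSTEM SET plsql_code_type = 'NATIVE' SCOPE=SPFILE;
-- 2. restart in UPGRADE mode and recompile existing PL/SQL to native
SHUTDOWN IMMEDIATE
STARTUP UPGRADE
@?/rdbms/admin/dbmsupgnv.sql
-- 3. restart normally and recompile any invalid objects
SHUTDOWN IMMEDIATE
STARTUP
@?/rdbms/admin/utlrp.sql
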
Benefits?
  • Improved performance for computation intensive code.
  • PL/SQL procedures that mostly execute SQL will not see any significant benefits.
  • The SIMPLE_INTEGER data type provides significant performance improvements with native compilation vs. interpreted.

DBMS_PARALLEL_EXECUTE summarized

What is  DBMS_PARALLEL_EXECUTE?

DBMS_PARALLEL_EXECUTE permits updating a large set of data incrementally. Basically, you create smaller chunks of data by segregating the rows based on ROWID, a numeric column, or user-defined SQL, and then update the chunks in parallel. Other major advantages are that the chunks are committed individually and that performance is better when chunks are based on ROWID.

Why is DBMS_PARALLEL_EXECUTE important ?

  1. A large single transaction can be broken into multiple chunks (transactions)
  2. Less impact on dependent systems like GoldenGate or Streams
  3. Better performance compared to parallel DML when chunks are based on ROWID and blocks
  4. Less undo space required, minimizing the chance of ORA-01555
  5. The impact of a rollback due to an error is not severe
  6. The duration of locks is reduced
  7. Failed chunks (transactions) can be restarted
  8. Unlike other parallel features, DBMS_PARALLEL_EXECUTE does not require Enterprise Edition (at least for now)
  9. Eliminates much of the manual coding and gives a better, more uniform distribution of work across parallel processes

Steps Involved

1. Create a task with CREATE_TASK.

2. Create chunks (specifying the chunk size) using one of the following:

  • CREATE_CHUNKS_BY_ROWID
  • CREATE_CHUNKS_BY_NUMBER_COL
  • CREATE_CHUNKS_BY_SQL

3. Define the update statement, using :start_id and :end_id as bind placeholders for the chunk boundaries.

4. Execute the task in parallel with RUN_TASK.

5. Perform error handling and rerun failed chunks with RESUME_TASK. A minimal end-to-end sketch of these steps follows.
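
Putting the steps together, here is a minimal hedged sketch; the schema, table and column names (MYSCHEMA.BIG_TABLE, STATUS) plus the chunk size and parallel level are placeholders to adapt to your own data:

DECLARE
  l_task VARCHAR2(30) := 'bulk_status_update';
  l_sql  VARCHAR2(4000);
BEGIN
  -- Step 1: create the task
  DBMS_PARALLEL_EXECUTE.CREATE_TASK(task_name => l_task);

  -- Step 2: chunk the table into ROWID ranges of roughly 10,000 rows each
  DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
      task_name   => l_task,
      table_owner => 'MYSCHEMA',
      table_name  => 'BIG_TABLE',
      by_row      => TRUE,
      chunk_size  => 10000);

  -- Step 3: the update statement with the required chunk-boundary binds
  l_sql := 'UPDATE myschema.big_table
               SET status = ''PROCESSED''
             WHERE rowid BETWEEN :start_id AND :end_id';

  -- Step 4: run the chunks with 4 parallel jobs; each chunk commits individually
  DBMS_PARALLEL_EXECUTE.RUN_TASK(
      task_name      => l_task,
      sql_stmt       => l_sql,
      language_flag  => DBMS_SQL.NATIVE,
      parallel_level => 4);

  -- Step 5: rerun any chunks that failed, then clean up the task
  IF DBMS_PARALLEL_EXECUTE.TASK_STATUS(l_task) <> DBMS_PARALLEL_EXECUTE.FINISHED THEN
    DBMS_PARALLEL_EXECUTE.RESUME_TASK(l_task);
  END IF;
  DBMS_PARALLEL_EXECUTE.DROP_TASK(l_task);
END;
/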