星期四, 5月 03, 2012

[轉載] How to Repair Pending Distributed Transactions in Oracle


http://bbs.newitfarmer.com/read.php?tid-12084-ds-1-toread-1.html

摘要: 
在访问某些表的特定行时报ORA-01591错误 
select * from BF_INCOME_EXPENSES_T 
where account_id = 36816153 
and user_id = 39964213 
and city_code = '185' 
ORA-01591: 锁定已被有问题的分配事务处理72.0.1608712挂起 
SQL> select count(*) from UNITELE.BI_MQSYNC_SOURCE_CONTROL_T1; 
ORA-01591: 锁定已被有问题的分配事务处理72.0.1608712挂起 
由于该表是业务关键表,部分前台业务受到影响。 
关键词:ORA-01591 DBA_2PC_PENDING 分布式事务 
1.故障分析 
首先,在遇到ORA错误时,我们不可能知道每个ORA错误都是什么意思,所以通过oracle的联机文档查错误的cause和action可以让我们初步了解该错误。 
01591, 00000, "lock held by in-doubt distributed transaction %s" 
// *Cause: Trying to access resource that is locked by a dead two-phase commit 
// transaction that is in prepared state. 
// *Action: DBA should query the pending_trans$ and related tables, and attempt 
// to repair network connection(s) to coordinator and commit point. 
// If timely repair is not possible, DBA should contact DBA at commit 
// point if known or end user for correct outcome, or use heuristic 
// default if given to issue a heuristic commit or abort command to 
// finalize the local portion of the distributed transaction. 
Oracle对ORA-01591错误的描述是"lock held by in-doubt distributed transaction %s,由分布式事务持有锁造成的。通过错误的cause可以看到’Trying to access resource that is locked by a dead two-phase commit transaction that is in prepared state’该错误是由访问一个处于prepared状态的二阶段事务所持有锁的资源造成的。 
下面简单介绍一下分布式事务。 
分布式事务,简单来说,是指一个事务在本地和远程执行,本地需要等待确认远程的事务结束后,进行下一步本地的操作。如通过dblink update远程数据库的一行记录,如果在执行过程中网络异常,或者其他事件导致本地数据库无法得知远程数据库的执行情况,此时就会发生in doublt的报错。此时需要dba介入,且需要分多种情况进行处理。 
分布式事务的Two-Phase Commit机制,会经历3个阶段: 
1.PREPARE PHASE: 
1.1 决定哪个数据库为commit point site。(注,参数文件中commit_point_strength值高的那个数据库为commit point site) 
1.2 全局协调者(Global Coordinator)要求所有的点(除commit point site外)做好commit或者rollback的准备。此时,对分布式事务的表加锁。 
1.3 所有分布式事务的节点将它的scn告知全局协调者。 
1.4 全局协调者取各个点的最大的scn作为分布式事务的scn。 
至此,所有的点都完成了准备工作,我们开始进入COMMIT PHASE阶段,此时除commit point site点外所有点的事务均为in doubt状态,直到COMMIT PHASE阶段结束。 
2.COMMIT PHASE: 
2.1 Global Coordinator将最大scn传到commit point site,要求其commit。 
2.2 commit point尝试commit或者rollback。分布式事务锁释放。 
2.3 commit point通知Global Coordinator已经commit。 
2.4 Global Coordinator通知分布式事务的所有点进行commit。 
3.FORGET PHASE: 
3.1 参与的点通知commit point site他们已经完成commit,commit point site就能忘记(forget)这个事务。 
3.2 commit point site在远程数据库上清除分布式事务信息。 
3.3 commit point site通知Global Coordinator可以清除本地的分布式事务信息。 
3.4 Global Coordinator清除分布式事务信息。
有关分布式事务的详细信息请参阅oracle联机文档.
当前的分布式事务处于Two-Phase Commit机制中的prepared阶段,这个阶段事务已经在表上加锁了,现在我们要访问这些表,但事务没有结束,一直持有锁,导致访问资源失败报ORA-01591。(在这里需要指出:分布式事务所持有的锁之所以堵塞读操作,是因为oralce不知道该显示哪个版本的数据) 如果结束这个事务,那相应的锁也会释放,这样就能解决这个问题。我们知道要结束一个事务有两种办法:commit和rollback。现在我们尝试结束这个事务: 
commit force '72.0.1608712'; 
ORA-02058: no prepared transaction found with ID 72.0.1608712 
报错并没有发现prepared状态的事务,由于该事务是分布式事务,我们首先想到的是dba_2pc_pending这个试图 
SQL> select * from dba_2pc_pending; 
no rows selected 
该试图并没有查到信息,所以我们无法用commit force结束这个分布式事务,那么现在我们查看是否存在该事务,通过实际报错,我们可以清晰的看到事务号为72.0.1608712,该事务在72号回滚段的0号事务槽上并且序列号是1608712,这时查询一个基表x$ktuxe,看看72号回滚段上是否有该事务。 
SQL> SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */ 
2 KTUXESTA Status, 
3 KTUXECFL Flags 
4 FROM x$ktuxe 
5 WHERE ktuxesta!='INACTIVE' 
6 AND ktuxeusn= 72; 
KTUXEUSN KTUXESLT KTUXESQN STATUS FLAGS 
---------- ---------- ---------- ---------------- ------------------------ 
72 0 1608712 PREPARED SCO|COL|REV|DEAD 
通过x$ktuxe这个基表,我们看到确实存在这个事务,而且是prepared状态。 
此时,我们基本清楚了这个问题的原因:当一个分布式事务死掉时,由于该事务没有正常结束,导致事务持有的锁一直没有释放,所以在访问这个事务涉及的资源时,申请不到锁资源,所以报ORA-01591。由于是分布式事务,当在dba_2pc_pending中查询不到事务信息时,我们是无法通过commit或者rollback结束该事务。 
所以,我们目前的任务是模拟出这个分布式事务。由于dba_2pc_pending试图是依赖于pending_trans$这个表,同时事务是与session关联在一起的,所以我们需要手工往pending_trans$和pending_sessions$两个表中插入数据。 
2.故障处理 
SQL> alter system disable distributed recovery; 
系统已更改。 
SQL> insert into pending_trans$ ( 
2 LOCAL_TRAN_ID, 
3 GLOBAL_TRAN_FMT, 
4 GLOBAL_ORACLE_ID, 
5 STATE, 
6 STATUS, 
7 SESSION_VECTOR, 
8 RECO_VECTOR, 
9 TYPE#, 
10 FAIL_TIME, 
11 RECO_TIME) 
12 values( '72.0.1608712',  
13 306206,  
14 'XXXXXXX.12345.1.2.3',  
15 'prepared','P',  
16 hextoraw( '00000001' ),  
17 hextoraw( '00000000' ),  
18 0, sysdate, sysdate ); 
已创建 1 行。 
SQL> insert into pending_sessions$ 
2 values( '72.0.1608712', 
3 1, hextoraw('05004F003A1500000104'), 
4 'C', 0, 30258592, '', 
5 146 
6 ); 
已创建 1 行。 
SQL> commit; 
提交完成。 
SQL> alter system enable distributed recovery; 
系统已更改。 
此时,查询dba_2pc_pending发现已有该事务,并且状态是我们模拟出的prepared状态 
SQL> select * from dba_2pc_pending; 
LOCAL_TRAN_ID GLOBAL_TRAN_ID STATE MIX A TRAN_COMMENTFAIL_TIME FORCE_TIME RETRY_TIME OS_USER  
OS_TERMINAL HOST DB_USER 
COMMIT# 
---------------- 
72.0.1608712 XXXXXXX.12345.1.2.3 prepared no12-11月-08 12-11月-08 
此时我们结束这个事务  
SQL> COMMIT FORCE '72.0.1608712'; 
提交完成。 
再次查询dba_2pc_pending,发现事务是forced commit状态,该事务已经结束。 
SQL> select * from dba_2pc_pending; 
LOCAL_TRAN_ID GLOBAL_TRAN_ID STATE MIX A TRAN_COMMENTFAIL_TIME FORCE_TIME RETRY_TIME OS_USER  
OS_TERMINAL HOST DB_USER COMMIT# 
---------------- 
72.0.1608712 XXXXXXX.12345.1.2.3 forced commit no12-11月-08 12-11月-08 12-11月-08 
通过x$kutxe 查询事务信息,发现事务释放了回滚段,事务已经结束。 
SQL> SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */ 
2 KTUXESTA Status, 
3 KTUXECFL Flags 
4 FROM x$ktuxe 
5 WHERE ktuxesta!='INACTIVE' 
6 AND ktuxeusn= 72; 
未选定行 
此时,我们需要清除dba_2pc_pending中分布式事务的残余信息 
SQL> alter session set "_smu_debug_mode"=4; ―― 在session级别设置回滚段处于手工管理模式,如果不设置这个参数,在回滚段自动管理模式下,清除事务信息会报错 
会话已更改。 
SQL> execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('72.0.1608712'); ――用dbms包清除事务信息 
PL/SQL 过程已成功完成。 
SQL> select * from dba_2pc_pending; 
未选定行 
测试访问业务表 
SQL> select count(*) from UNITELE.BI_MQSYNC_SOURCE_CONTROL_T1; 
COUNT(*) 
---------- 
367 
问题解决。 
其实,我在另外一个客户处也碰到过类似问题,当时也是报ORA-01591,但是在dba_2pc_pending中可以查到prepared状态的事务,此时只需要commit force结束这个事务,并清除事务信息就可以了。对于上面的案例,我怀疑开发商直接清除了分布式事务信息,但是事务并没有结束,导致锁资源得不到释放报ORA-01591。 
总结:ORA-01591错误一般是由于分布式事务造成的,造成分布式事务失败的原因主要是库之间的网络突然中断,造成两个库中的事务信息不一致,所以会有残余的分布式事务信息。此时,要针对不同的事务状态做不同的处理。同时在遇到棘手的问题时,可以查询metalink,该案例参考metalink文档:NOTE:401302.1


-----------------------------


Applies to:

Oracle Server - Enterprise Edition - Version: 8.1.5.0 to 11.1.0.6 - Release: 8.1.5 to 11.1
Information in this document applies to any platform.

Purpose

The purpose of this bulletin is to assist support analysts in understanding and 
resolving the stranded dba_2pc_entries.

Scope and Application

The intended audience are support analysts having good experience on distributed
databases.

How To Resolve Stranded DBA_2PC_PENDING Entries


Contents
1. Problem Description
2. Solutions
2.1 Dba_2pc entries without a corresponding transaction
2.2 Distributed transaction without corresponding dba_2pc entries
2.3 How to PURGE the DISTRIBUTED transaction in PREPARED state, when COMMIT
or ROLLBACK FORCE hangs ?, where we have entries for both Distributed
transaction and dba_2pc entries.

1. Problem Description:
=======================

As a result of a failed commit of a distributed transaction, some entries can
be left in dba_2pc views, i.e. dba_2pc_pending and dba_2pc_neighbors. The RECO
process checks these views to recover the failed txn. However, in some cases
RECO cannot perform. the recovery. One cause is that all sites involved in the
transaction not being  accessible at the same  time. Another cause is dba_2pc
views being  inconsistent with the  transaction table, which  is the topic of
this article. This cause can further be classified as follows:

1. dba_2pc views have entries for a non-existent distributed transaction

2. There is a distributed transaction for which there are no entries in
dba_2pc views

3. How to PURGE the DISTRIBUTED transaction in PREPARED state, when COMMIT
or ROLLBACK FORCE hangs ?, where we have entries for both Distributed
transaction and dba_2pc entries.

Solutions to each subclass is provided in the rest of the article.


2. Solutions:


2.1 Dba_2pc entries without a corresponding transaction


In this case dba_2pc views show distributed transactions but there are no txns
in reality. Ifthe stateof the  transaction is committed,  rollback forced or
commit forced then this is normal and it can be cleaned up using

dbms_transaction.purge_lost_db_entry

However, ifthe stateof the transaction is PREPARED and there is no entry in
the transaction table for it then this entry can be cleaned up manually as
follows:


set transaction use rollback segment SYSTEM; delete from sys.pending_trans$ where local_tran_id =; delete from sys.pending_sessions$ where local_tran_id =; delete from sys.pending_sub_sessions$ where local_tran_id =; commit; Example: -------- The following query reports a dist. txn. in prepared state select local_tran_id, state from dba_2pc_pending; LOCAL_TRAN_ID STATE ---------------------- ---------------- 1.92.66874 prepared Given that a transaction id is composed oftriple, '1.92.66874' is located in rollback segment# 1. To find out the list of active transactions in that rollback segment, use: SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */ KTUXESTA Status, KTUXECFL Flags FROM x$ktuxe WHERE ktuxesta!='INACTIVE' AND ktuxeusn= 1; <== this is the rollback segment# no rows selected
It is not possible to rollback force or commit force this transaction.
rollback force '1.92.66874'; ORA-02058: no prepared transaction found with ID 1.92.66874
Hence, we have to manually cleanup that transaction:
set transaction use rollback segment SYSTEM; delete from sys.pending_trans$ where local_tran_id = '1.92.66874'; delete from sys.pending_sessions$ where local_tran_id = '1.92.66874'; delete from sys.pending_sub_sessions$ where local_tran_id = '1.92.66874'; commit;
2.2 Distributed transaction without corresponding dba_2pc entries In this case dba_2pc views are empty but users are receiving distributed txn related errors, e.g. ORA-2054, ORA-1591. Normally such a case should not appear and if it is reproducible a bug should be filed. Here is the list of several alternative solutions that can be used in this case: a. Perform. incomplete recovery b. Truncate the objects referenced by that transaction and import them c. Use _corrupted_rollback_segments parameter to drop that rollback segment d. Insert dummy entries into dba_2pc views and either commit or rollback force the distributed transaction The first three solutions are discussed in Backup and Recovery manuals and in the notes referred above. In the 4th solution a dummy entry is inserted into the dictionary so that the transaction can be manually committed or rolled back. Note that RECO will not be able to process this txn and distributed txn recovery should be disabled before using this method. Furthermore, please take a BACKUP of your database before using this method. The stranded entries is the cause of ORA-01591 so we need to clear the stranded entries by purging them using
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('transanction_id');
The following example describes how to diagnose and resolve this case. Suppose that users are receiving
ORA-1591: lock held by in-doubt distributed transaction 1.92.66874
and the following query returns no rows:
select local_tran_id, state from dba_2pc_pending where local_tran_id='1.92.66874'; no rows selected
Furthermore querying the rollback segment shows that 1.92.66874 remains in prepared state
SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */ KTUXESTA Status, KTUXECFL Flags FROM x$ktuxe WHERE ktuxesta!='INACTIVE' AND ktuxeusn= 1; /* <== Replace this value with your txn undo seg# Which is displayed in the first part of the transaction ID */ KTUXEUSN KTUXESLT KTUXESQN STATUS FLAGS ---------- ---------- ---------- ---------------- ------------------------ 1 92 66874 PREPARED SCO|COL|REV|DEAD
Trying to manually commit or rollback this transaction
commit force '1.92.66874'; ORA-02058: no prepared transaction found with ID 1.92.66874
raises ORA-02058 since dba_2pc views are empty. In order to use commit force or rollback force a dummy record should be inserted into pending_trans$ as follows:
alter system disable distributed recovery; insert into pending_trans$ ( LOCAL_TRAN_ID, GLOBAL_TRAN_FMT, GLOBAL_ORACLE_ID, STATE, STATUS, SESSION_VECTOR, RECO_VECTOR, TYPE#, FAIL_TIME, RECO_TIME) values( '1.92.66874', /* <== Replace this with your local tran id */ 306206, /* */ 'XXXXXXX.12345.1.2.3', /* These values can be used without any */ 'prepared','P', /* modification. Most of the values are */ hextoraw( '00000001' ), /* constant. */ hextoraw( '00000000' ), /* */ 0, sysdate, sysdate ); insert into pending_sessions$ values( '1.92.66874',/* <==Replace only this with your local tran id */ 1, hextoraw('05004F003A1500000104'), 'C', 0, 30258592, '', 146 ); commit; commit force '1.92.66874';
If commit force raises an error then note the errormessage and execute the following:
delete from pending_trans$ where local_tran_id='1.92.66874'; delete from pending_sessions$ where local_tran_id='1.92.66874'; commit; alter system enable distributed recovery;
Otherwise run purge the dummy entry from the dictionary, using
alter system enable distributed recovery; connect / as sysdba COMMIT; Use following query to retrieve the value for such _smu_debug_mod parameter: col Parameter for a20 col "Session Value" for a20 col "Instance Value" for a20 SELECT a.ksppinm "Parameter",b.ksppstvl "Session Value",c.ksppstvl "Instance Value" FROM x$ksppi a, x$ksppcv b, x$ksppsv c WHERE a.indx = b.indx AND a.indx = c.indx AND a.ksppinm = '_smu_debug_mode' / -- set it temporarily to 4: alter system set "_smu_debug_mode" = 4; /* if automatic undo management is being used */ -- in 9.2x alter session can be used instead. commit; /* this is to prevent the ORA-01453 in purge_lost_db_entry call */ exec dbms_transaction.purge_lost_db_entry( '1.92.66874' ); SQL> commit; SQL> alter system set "_smu_debug_mode" = ; SQL> commit;
2.3 How to PURGE the DISTRIBUTED transaction in PREPARED state, when COMMIT or ROLLBACK FORCE hangs ?, where we have entries for both Distributed transaction and dba_2pc entries. ORA-01591: lock held by in-doubt distributed transaction 44.88.85589 The row exist from dba_2pc_pending & Rollback segment
SQL> SELECT LOCAL_TRAN_ID,STATE FROM DBA_2PC_PENDING; LOCAL_TRAN_ID STATE ----------------- ----------- 44.88.85589 prepared SQL> SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */ KTUXESTA Status, KTUXECFL Flags FROM x$ktuxe WHERE ktuxesta!='INACTIVE' AND ktuxeusn= 44; /* <== Replace this value with your txn undo seg# Which is displayed in the first part of the transaction ID */ KTUXEUSN KTUXESLT KTUXESQN STATUS FLAGS ---------- ---------- ---------- ---------------- ------------------------ 44 88 85589 PREPARED SCO|COL|REV|DEAD SQL> Commit force 44.88.85589; SQL> rollback force 44.88.85589;
Executing COMMIT or ROLLBACK FORCE hangs The wait event is ""free global transaction table entry" Purging the transaction should fail with below error:
EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('44.88.85589'); BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('44.88.85589'); END; * ERROR at line 1: ORA-06510: PL/SQL: unhandled user-defined exception ORA-06512: at "SYS.DBMS_TRANSACTION", line 94 ORA-06512: at line 1
Solution: -------- You have to implement both the solution : 2.1 Dba_2pc entries without a corresponding transaction 2.2 Distributed transaction without corresponding dba_2pc entries 1.
delete from sys.pending_trans$ where local_tran_id = '44.88.85589'; delete from sys.pending_sessions$ where local_tran_id = '44.88.85589'; delete from sys.pending_sub_sessions$ where local_tran_id ='44.88.85589'; commit;
2. Now insert the dummy record as explained in section: 2.2 Distributed transaction without corresponding dba_2pc entries commit; 3. Commit force '44.88.85589' 4. Purge the transaction: exec dbms_transaction.purge_lost_db_entry('44.88.85589'); Note:126069.1 Manually Resolving In-Doubt Transactions: Different Scenario's


Still have questions ?

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion via My Oracle Support Streams and Distributed Database Community


References

NOTE:100664.1 - Master Note for Troubleshooting Oracle Managed Distributed Transactions
NOTE:126069.1 - Manually Resolving In-Doubt Transactions: Different Scenarios

显示相关信息 相关内容

产品

  • Oracle Database Products > Oracle Database > Oracle Database > Oracle Server - Enterprise Edition
关键字

DBA_2PC_NEIGHBORS; DBA_2PC_PENDING; DBMS_TRANSACTION; DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY; DISTRIBUTED; IN-DOUBT; PURGE; TRANSACTION
错误

ORA-2054; ORA-2058; ORA-1591; ORA-6512; ORA-1453; ORA-6510

沒有留言:

LinkWithin-相關文件

Related Posts Plugin for WordPress, Blogger...