Reposted

Technical Problem Analysis 14 (6.15)

This post continues the analysis of technical problems found while tracing and monitoring the JMS servers.

JMS threads grow quickly and are not released (many threads in TIMED_WAITING)

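Recently we have again found that the number of active threads on the JMS Server often reaches around 400, even though there is essentially no JMS message distribution going on at those moments. The only activity is the subscriber listener threads of the business systems attached to the server, and with only a dozen or so subscribing systems there should not be this many threads. An earlier analysis showed that enabling Cluster and failover migration settings increases the thread count. We expected the count to drop and stabilize once those settings were turned off, but in practice the problem remains.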

In the thread dump we can see a large number of threads in the TIMED_WAITING state, for example:

"[ACTIVE] ExecuteThread: '404' for queue: 'weblogic.kernel.Default (self-tuning)'" #497 daemon prio=5 os_prio=0 tid=0x00007fc10463a800 nid=0x7d83 in Object.wait() [0x00007fc00afb2000]

java.lang.Thread.State: TIMED_WAITING (on object monitor)

at weblogic.transaction.internal.CoordinatorFactory.getCachedCoordinator(

- locked <0x00000000c0391478> (a
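To judge how widespread this pattern is, it helps to count threads by state and by topmost stack frame rather than reading the dump by eye. The following is a minimal sketch, assuming the dump is in the usual HotSpot text format shown above (a `java.lang.Thread.State:` line followed by `at ...` frames). The function name `summarize` is made up for illustration and is not part of any WebLogic or JDK tool.

```python
import re
from collections import Counter

# Matches the state line of a HotSpot thread dump entry.
STATE_RE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")
# Matches a stack frame line and captures the fully qualified method name.
FRAME_RE = re.compile(r"^\s*at\s+([^\s(]+)\(")


def summarize(dump_text):
    """Count threads per state and per topmost stack frame.

    Returns two Counters: (states, top_frames).
    """
    states = Counter()
    top_frames = Counter()
    expect_frame = False  # True right after a state line, until a frame is seen
    for line in dump_text.splitlines():
        state = STATE_RE.search(line)
        if state:
            states[state.group(1)] += 1
            expect_frame = True
            continue
        if expect_frame:
            frame = FRAME_RE.match(line)
            if frame:
                top_frames[frame.group(1)] += 1
                expect_frame = False
    return states, top_frames
```

Calling `summarize` on the text of a saved dump and looking at `states.most_common()` and `top_frames.most_common(10)` shows whether most of the roughly 400 threads really sit in `CoordinatorFactory.getCachedCoordinator` or whether that frame is only a small share of them.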

This entry is not an Error-level log message, and for now we cannot tell whether it is normal behavior. An article on Oracle's support site describes the symptom:

https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=438956680164704&id=1496472.1&displayIndex=1&_afrWindowMode=0&_adf.ctrl-state=46y7jxfq7_309#SYMPTOM

The description given there is as follows:

In case of two phase commit, when a cached coordinator is required e.g. in order to commit a transaction involving another WLS node resource within a cluster, the initialization of the cached remote coordinator stub makes threads always wait for the whole timeout value (20s+). This sometimes causes entire business transactions to roll-back because of transaction time-out.

The time lost in processing is 20s+ each time, which precisely is the default value of the weblogic.JTA.ContactCoordinatorWaitSeconds parameter.

The article mentions a bug, but it is one reported against WebLogic 10. Since we are now running WebLogic 12c, it should no longer be present. The workaround given is: set -Dweblogic.JTA.ContactCoordinatorWaitSeconds to some value in seconds no less than 3secs in the server start-up.

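In other words, a startup parameter has to be added that sets ContactCoordinatorWaitSeconds to a value of at least 3 seconds. This change has not been applied yet: I have not worked out whether the thread state shown in the dump is a problem at all, or how that symptom relates to this setting.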
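If the setting is tried later, the flag has to end up in the JVM options the server is started with. The sketch below only builds the option string. It assumes the options are kept as a single whitespace-separated string (for example in a `JAVA_OPTIONS` variable of the start script, which is an assumption about the local setup), and the function name is made up for illustration. The lower bound of 3 seconds comes from the Oracle note quoted above.

```python
# System property named in the Oracle support note.
PROPERTY = "weblogic.JTA.ContactCoordinatorWaitSeconds"


def with_contact_coordinator_wait(java_options, wait_seconds):
    """Return java_options with the wait-seconds flag set exactly once.

    java_options is a whitespace-separated option string. Options that
    contain quoted spaces are not handled by this simple split.
    """
    if wait_seconds < 3:
        # The quoted workaround asks for a value no less than 3 seconds.
        raise ValueError("wait_seconds must be at least 3")
    prefix = "-D" + PROPERTY + "="
    # Drop any earlier occurrence so the new value is the only one.
    kept = [tok for tok in java_options.split() if not tok.startswith(prefix)]
    kept.append(prefix + str(wait_seconds))
    return " ".join(kept)
```

For example, `with_contact_coordinator_wait("-Xms1g -Xmx2g", 5)` yields a string that can be assigned back to the variable the start script reads. Whether 5 is a sensible value for this environment is still an open question.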

JTA timeout settings

The JTA configuration has two timeout settings: one applies to the first phase of an XA two-phase commit, the other to the second phase. Oracle's documentation describes them as follows:

Timeout Seconds: Specifies the maximum amount of time, in seconds, an active transaction is allowed to be in the first phase of a two-phase commit transaction. If the specified amount of time expires, the transaction is automatically rolled back.

Abandon Timeout Seconds: Specifies the maximum amount of time, in seconds, a transaction manager persists in attempting to complete the second phase of a two-phase commit transaction.

During the second phase of a two-phase commit transaction, the transaction manager continues to try to complete the transaction until all resource managers indicate that the transaction is completed. After the abandon transaction timer expires, no further attempt is made to resolve the transaction. If the transaction is in a prepared state before being abandoned, the transaction manager rolls back the transaction to release any locks held on behalf of the abandoned transaction.

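The default for Abandon Timeout Seconds is 86400, i.e. one day. In our logs we often see many distributed transactions that keep retrying the second phase for a full day and are only abandoned once that timeout expires. The data volume during JMS message distribution is small, so nothing close to that much time should be needed. One point remains unclear: whether "abandon" here means the transaction is dropped completely with no further retries, or only that the transaction is rolled back.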
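The effect of the default can be illustrated with a toy model of the behavior the Oracle text describes: the second phase is retried until it completes or the abandon timer expires. This is only a sketch and not WebLogic's implementation. The fixed `retry_interval_seconds` is a made-up simplification, since the real retry schedule is not stated in the quoted documentation.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    attempts: int
    elapsed_seconds: int
    result: str  # "committed" or "abandoned"


def second_phase_retry(succeeds_on_attempt, retry_interval_seconds,
                       abandon_timeout_seconds):
    """Toy model of second-phase retries with an abandon timer.

    succeeds_on_attempt: 1-based attempt at which every resource manager
    reports completion, or None if that never happens.
    """
    elapsed = 0
    attempts = 0
    while elapsed < abandon_timeout_seconds:
        attempts += 1
        if succeeds_on_attempt is not None and attempts >= succeeds_on_attempt:
            return Outcome(attempts, elapsed, "committed")
        # Hypothetical fixed wait before the next retry.
        elapsed += retry_interval_seconds
    # Timer expired: no further attempt is made.
    return Outcome(attempts, elapsed, "abandoned")
```

With a resource that never answers, `second_phase_retry(None, 60, 86400)` keeps retrying for the whole day before giving up, while a timeout of 600 gives up after ten minutes. That is the trade-off to weigh before lowering the setting, once the meaning of "abandon" has been confirmed.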

Detailed explanation of how distributed transactions work

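Oracle's official documentation also has a detailed article on distributed transactions, linked below. It still needs a careful read in order to fully understand the internal logic of XA two-phase commit and distributed transactions, which will help with tracking and analyzing later problems.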

https://docs.oracle.com/cd/E13222_01/wls/docs100/jta/jtatxexp.html

Original: http://blog.sina.com.cn/s/blog_493a84550102xhjs.html
