客户的环境是两台IBM X3850,安装Oracle Linux 6.x x86_64bit的操作系统部署的Oracle 11.2.0.4.0 RAC Database,共享存储是EMC,使用了EMC vplex虚拟化软件对存储做了镜像保护,操作系统安装了EMC原生的多路径软件。故障的现象是当vplex内部发生切换时,RAC其中一个节点的OCR和Votedisk所在的磁盘组变得不可访问,导致ora.crsd服务脱机,Grid Infrastrasture集群堆栈宕掉,但是该节点的数据库实例不受影响,但不再接受外部新的连接,在这个过程中另外一个节点完全不受影响。下面是相关的日志信息:
  1.操作系统日志: Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 4 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 2 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 3 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 1 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 0 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 11 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 12 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 10 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 9 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 8 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 7 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 5 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 6 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Bus 3 to VPLEX CKM00142000957   port CL2-00 is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 1 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 12 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 11 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 10 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 7 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 4 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 8 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 9 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 5 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 3 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 6 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 2 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 0 to CKM00142000957   is dead. Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Bus 3 to VPLEX CKM00142000957   port CL2-04 is dead.     从操作系统日志可以看出,Mar 18 08:25:48的时候port CL2-00和port CL2-04两个链路dead了。  2.ASM日志: Fri Mar 18 08:25:59 2016 WARNING: Waited 15 secs for write IO to PST disk 0 in group 1. WARNING: Waited 15 secs for write IO to PST disk 0 in group 1. WARNING: Waited 15 secs for write IO to PST disk 0 in group 2. WARNING: Waited 15 secs for write IO to PST disk 0 in group 2. WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.     <<<< 几乎在和操作系统报错的相同时间,ASM开始检查所有磁盘的PST(partnership state table),ASM的等待时间为15秒。 WARNING: Waited 15 secs for write IO to PST disk 1 in group 3. WARNING: Waited 15 secs for write IO to PST disk 2 in group 3. WARNING: Waited 15 secs for write IO to PST disk 0 in group 3. WARNING: Waited 15 secs for write IO to PST disk 1 in group 3. WARNING: Waited 15 secs for write IO to PST disk 2 in group 3. Fri Mar 18 08:25:59 2016 NOTE: process _b000_+asm1 (66994) initiating offline of disk 0.3190888900 (OCRVDISK_0000) with mask 0x7e in group 3    <<<< group 3是OCR和Votedisk所在的磁盘组。 NOTE: process _b000_+asm1 (66994) initiating offline of disk 1.3190888899 (OCRVDISK_0001) with mask 0x7e in group 3 NOTE: process _b000_+asm1 (66994) initiating offline of disk 2.3190888898 (OCRVDISK_0002) with mask 0x7e in group 3 NOTE: checking PST: grp = 3 GMON checking disk modes for group 3 at 10 for pid 48, osid 66994 ERROR: no read quorum in group: required 2, found 0 disks    <<<< 由于OCR和Votedisk所在的磁盘组是Normal冗余级别,3个ASM磁盘,要求2个可访问,但是实际是0个可访问。 NOTE: checking PST for grp 3 done. NOTE: initiating PST update: grp = 3, dsk = 0/0xbe3119c4, mask = 0x6a, op = clear NOTE: initiating PST update: grp = 3, dsk = 1/0xbe3119c3, mask = 0x6a, op = clear NOTE: initiating PST update: grp = 3, dsk = 2/0xbe3119c2, mask = 0x6a, op = clear GMON updating disk modes for group 3 at 11 for pid 48, osid 66994 ERROR: no read quorum in group: required 2, found 0 disks    <<<< 0个磁盘可访问。 Fri Mar 18 08:25:59 2016 NOTE: cache dismounting (not clean) group 3/0x3D81E95D (OCRVDISK)  WARNING: Offline for disk OCRVDISK_0000 in mode 0x7f failed.    <<<< OCR和Votedisk所在的磁盘组对应的所有磁盘都脱机。 WARNING: Offline for disk OCRVDISK_0001 in mode 0x7f failed. WARNING: Offline for disk OCRVDISK_0002 in mode 0x7f failed. NOTE: messaging CKPT to quiesce pins Unix process pid: 66996, image: oracle@dzqddb01 (B001) Fri Mar 18 08:25:59 2016 NOTE: halting all I/Os to diskgroup 3 (OCRVDISK)    <<<< OCRVDISK磁盘组下面的所有I/O都不可用。 Fri Mar 18 08:25:59 2016 NOTE: LGWR doing non-clean dismount of group 3 (OCRVDISK) NOTE: LGWR sync ABA=11.69 last written ABA 11.69 Fri Mar 18 08:25:59 2016 kjbdomdet send to inst 2 detach from dom 3, sending detach message to inst 2 Fri Mar 18 08:25:59 2016 List of instances:  1 2 Dirty detach reconfiguration started (new ddet inc 1, cluster inc 96)  Global Resource Directory partially frozen for dirty detach * dirty detach - domain 3 invalid = TRUE  Fri Mar 18 08:25:59 2016 NOTE: No asm libraries found in the system  2 GCS resources traversed, 0 cancelled Dirty Detach Reconfiguration complete Fri Mar 18 08:25:59 2016 WARNING: dirty detached from domain 3 NOTE: cache dismounted group 3/0x3D81E95D (OCRVDISK)  SQL> alter diskgroup OCRVDISK dismount force /* ASM SERVER:1031924061 */     <<<< dismount OCRVDISK磁盘组。 Fri Mar 18 08:25:59 2016 NOTE: cache deleting context for group OCRVDISK 3/0x3d81e95d GMON dismounting group 3 at 12 for pid 51, osid 66996 NOTE: Disk OCRVDISK_0000 in mode 0x7f marked for de-assignment NOTE: Disk OCRVDISK_0001 in mode 0x7f marked for de-assignment NOTE: Disk OCRVDISK_0002 in mode 0x7f marked for de-assignment NOTE:Waiting for all pending writes to complete before de-registering: grpnum 3 ASM Health Checker found 1 new failures  3.Clusterware告警日志: 2016-03-18 11:53:19.394:  [crsd(47973)]CRS-1006:The OCR location +OCRVDISK is inaccessible. Details in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log.    <<<< 时间上要比OCRVDISK被dismount的时间晚很多。 2016-03-18 11:53:38.437:  [/u01/app/11.2.0/grid/bin/oraagent.bin(48283)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:7:121} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/oraagent_oracle/oraagent_oracle.log. 2016-03-18 11:53:38.437:  [/u01/app/11.2.0/grid/bin/scriptagent.bin(80385)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/scriptagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:9:7} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/scriptagent_grid/scriptagent_grid.log. 2016-03-18 11:53:38.437:  [/u01/app/11.2.0/grid/bin/orarootagent.bin(48177)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:3303} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/orarootagent_root/orarootagent_root.log. 2016-03-18 11:53:38.437:  [/u01/app/11.2.0/grid/bin/oraagent.bin(48168)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:7} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/oraagent_grid/oraagent_grid.log. 2016-03-18 11:53:38.442:  [ohasd(47343)]CRS-2765:Resource 'ora.crsd' has failed on server 'dzqddb01'.    <<<< ora.crsd 已经OFFLINE。 2016-03-18 11:53:39.773:  [crsd(45323)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log. 2016-03-18 11:53:39.779:  [crsd(45323)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log.    <<<< 物理设备不可访问。 2016-03-18 11:53:40.470:  [ohasd(47343)]CRS-2765:Resource 'ora.crsd' has failed on server 'dzqddb01'.      这里我们会产生一个疑问,为什么ora.crsd挂掉,但是ora.cssd没有OFFLINE(通过crsctl stat res -t -init可以确认ora.cssd没有挂掉,数据库实例还正常运行,节点并没有被踢出去),原因在于OCRVDISK对应的磁盘只是短暂的不可访问,cssd进程是直接访问OCRVDISK对应的3个ASM磁盘,并不依赖于OCRVDISK磁盘组是MOUNT状态,并且Clusterware默认的磁盘心跳超时时间为200秒,所以cssd进程没有出现问题。      由此我们又会有更多的疑问,为什么RAC的另外一个节点没有出现故障?为什么只有OCRVDISK磁盘组dismount,其他的磁盘组都正常?      在出现问题后重启has服务之后该节点即可恢复正常,加上其他磁盘组,其他节点并没有出现故障,所以可以简单的判断共享存储没有太大的问题,只是链路断掉之后有短时间的不可访问,寻找问题的关键是ASM实例日志中的这个信息:WARNING: Waited 15 secs for write IO to PST disk,15秒的时间是否过短影响了OCRVDISK的脱机,下面是MOS上的解释:
     Generally this kind messages comes in ASM alertlog file on below situations,  
   Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,   <<<< 在normal或high冗余度的磁盘组上的ASM磁盘被执行延迟ASM PST心跳检查。
  thus the ASM instance dismount the diskgroup.By default, it is 15 seconds.    <<<< 检查失败,ASM实例会dismount磁盘组,默认的超时时间为15秒。  
      
   By the way the heart beat delays are sort of ignored for external redundancy diskgroup.    <<<< PST heartbeat检查会忽略外部冗余的磁盘组。
 ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
 but the heart beat delays do not dismount external redundancy diskgroup directly.    <<<< PST heartbeat检查即使超过了15秒也不会dismount外部冗余的磁盘组。  
   The ASM disk could go into unresponsiveness, normally in the following scenarios:    <<< ASM磁盘出现无反应的情况通常是由于以下几个原因:  
   +    Some of the paths of the physical paths of the multipath device are offline or lost    <<<< 1.聚合设备下的一些物理路径offline或丢失。
 +    During path 'failover' in a multipath set up                                                         <<<< 2.具有设备下的物理路径发生failover。
 +    Server load, or any sort of storage/multipath/OS maintenance                                <<<< 3.系统或设备的维护操作。  
      通过上面的这段描述,能大概的解释出现问题的原因,由于存储链路断掉了2条(可能发生failover),导致聚合后的共享存储设备短暂的不可访问,OCRVDISK是Normal冗余度的磁盘组,ASM会执行PST heartbeat检查,由于超过15秒OCRVDISK对应的磁盘组不可访问导致ASM将OCRVDISK直接dismount,进而导致OCR文件不可访问,导致crs服务OFFLINE,由于cssd的磁盘心跳超时时间为200秒,且是直接访问ASM磁盘,不经过ASM磁盘组,所以css服务没有受影响,hasd高可用堆栈依然正常工作,集群节点未被踢出,数据库实例正常工作。      Oracle给出了在数据库层面解决这个问题的办法:     If  you can not keep the disk unresponsiveness to below 15 seconds, then  the below parameter can be set in the ASM instance ( on all the Nodes of  RAC ):
 
     _asm_hbeatiowait    <<<< 该参数指定了PST heartbeat超时时间。
     
 As per internal bug 17274537 , based on internal testing the value should be increased to 120 secs, which is fixed in 12.1.0.2    <<<< 从12.1.0.2开始,该参数默认值被增加到了120秒。  
      
   Run below in asm instance to set desired value for _asm_hbeatiowait  
   alter system set "_asm_hbeatiowait"= scope=spfile sid='*';    <<<< 运行这条命令修改ASM实例的该参数,之后重启ASM实例,CRS。  
   And then restart asm instance / crs, to take new parameter value in effect.  
  为了避免类似的问题,可以将OCR镜像到不同的ASM磁盘组,这样将进一步的提高ora.crsd服务的可用性。  更详细的内容请参考文章:《ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (文档 ID 1581684.1)》  --end--