Step by step: resolving problems when starting Oracle Grid Infrastructure

Troubleshooting Grid Infrastructure startup

Today turned into an interesting story when one of my blades decided to reboot after an EXT3 journal error. The hard facts first:

  • Oracle Linux 5.5 with kernel 2.6.18-194.11.4.0.1.el5
  • Oracle 11.2.0.2 RAC
  • Bonded NICs for private and public networks
  • BL685-G6 with 128G RAM

I first noticed the node had problems when I tried to list all databases configured on the cluster and got the dreaded “cannot communicate with CRSD”:

[oracle@node1.example.com] $ srvctl config database
PRCR-1119 : Failed to look up CRS resources of database type
PRCR-1115 : Failed to find entities of type resource that match filters (TYPE ==ora.database.type) and contain attributes DB_UNIQUE_NAME,ORACLE_HOME,VERSION
Cannot communicate with crsd

Not too great, especially since everything worked when I left yesterday. What could have gone wrong? An obvious reason for this could be a reboot, and fair enough, there has been one:

[grid@node1.example.com] $ uptime
09:09:22 up  2:40,  1 user,  load average: 1.47, 1.46, 1.42

The next step was to check if the local CRS stack was up, or better, to check what was down. Sometimes it’s only crsd which has a problem. In my case everything was down:

[grid@node1.example.com] $ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
[grid@node1.example.com] $ crsctl check cluster -all
**************************************************************
node1:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
CRS-4404: The following nodes did not reply within the allotted time:
node2, node3, node4, node5, node6, node7, node8

The CRS-4404 message was slightly misleading: it made me assume all cluster nodes were down after a cluster-wide reboot, and sometimes a single node reboot does trigger worse things. However, after logging on to node 2 I could see that all nodes except the first one were fine.
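Before digging any deeper, it is worth knowing exactly which of the lower-stack daemons are down on the local node. This wasn't part of my original session, but the following command (run from the Grid home) lists the state of every locally managed "init" resource, and anything from ora.cssd upwards showing OFFLINE narrows the search considerably:

[grid@node1.example.com] $ crsctl stat res -t -init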

CRSD really needs CSSD to be up and running, and CSSD requires the OCR to be there. I wanted to know if the OCR was impacted in any way:

[grid@node1.example.com] $ ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
ORA-29701: unable to connect to Cluster Synchronization Service

Well, it seemed that the OCR location was unavailable. I know that on this cluster the OCR is stored on ASM. Common reasons for the PROC-26 error are listed below (quick checks for each follow the list):

  • Unix admin upgrades the kernel but forgets to upgrade the ASMLib kernel module (common grief with ASMLib!)
  • Storage is not visible on the host, i.e. SAN connectivity broken/taken away (happens quite frequently with storage/sys admins unaware of ASM)
  • Permissions not set correctly on the block devices (not an issue when using ASMLib)
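Before pointing fingers at the SAN, a few quick commands cover all three bullet points. These weren't captured in my original session, so take them as a generic sketch; RPM and device names vary per installation:

# 1. Does the installed ASMLib kernel module match the running kernel?
[root@node1 ~]# uname -r
[root@node1 ~]# rpm -qa | grep oracleasm

# 2. Are the ASM disks still visible/stamped?
[root@node1 ~]# /etc/init.d/oracleasm listdisks

# 3. Do the underlying block devices have sane permissions?
[root@node1 ~]# ls -l /dev/oracleasm/disks/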

I checked ASMLib and it reported a working status:

[oracle@node1.example.com] $ /etc/init.d/oracleasm status
Checking if ASM is loaded: yes
Checking if /dev/oracleasm is mounted: yes

That was promising: /dev/oracleasm/ was populated and the matching kernel modules were loaded. /etc/init.d/oracleasm listdisks listed all my disks as well. Physical storage not being accessible (PROC-26) seemed rather unlikely now.

I could rule out permission problems since ASMLib was working fine, and I also ruled out the kernel-upgrade/missing-module problem by comparing the RPM with the kernel version: they matched. So maybe it was storage related after all?
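If storage had still been a suspect, reading an ASM disk header straight off the device is a quick sanity check. This is not something I ran at the time; the disk name below is a placeholder and $GRID_HOME is assumed to point to the Grid Infrastructure home:

# A readable ASM disk header rules out a broken path to that particular disk
[grid@node1.example.com] $ $GRID_HOME/bin/kfed read /dev/oracleasm/disks/OCRVOTE1 | head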

Why did the node go down?

Good question, and one usually directed at the Unix administration team. Luckily I have a good contact placed right inside that team and could get the following excerpt from /var/log/messages around the time of the crash (06:31 this morning):

Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192116, count = 1
Mar 17 06:26:06 node1 kernel: Aborting journal on device dm-2.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192216, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192166, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192122, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192140, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192174, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 10 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_truncate: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_orphan_del: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_delete_inode: Journal has aborted
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 kernel: ext3_abort called.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Mar 17 06:26:06 node1 kernel: Remounting filesystem read-only
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:41076
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:41076
Mar 17 06:26:09 node1 kernel: SysRq : Resetting
Mar 17 06:31:15 node1 syslogd 1.4.1: restart.

So it looks like a file system error triggered the reboot; I'm glad the box came back up OK on its own. The Clusterware alert log, $GRID_HOME/log/<hostname>/alert<hostname>.log, didn't show anything specific to storage. Normally you would see it start counting the node down if it lost contact with the voting disks (in this case the OCR and the voting disks share the same disk group).
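If you want to double-check, grepping the alert log for voting-file messages is a quick way to confirm whether Clusterware ever lost sight of them. A sketch, assuming $GRID_HOME points to the Grid home:

[grid@node1.example.com] $ grep -i voting $GRID_HOME/log/$(hostname -s)/alert$(hostname -s).log | tail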

And why did Clusterware not start?

After some more investigation it seemed there was no underlying problem with the storage, so I tried to start the cluster manually, tailing the ocssd.log file for possible clues.
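For reference, the CSSD trace file in 11.2 lives under the Grid home; something along these lines lets you follow it (again assuming $GRID_HOME is set for the grid user):

[grid@node1.example.com] $ tail -f $GRID_HOME/log/$(hostname -s)/cssd/ocssd.log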

[root@node1 ~]# crsctl start cluster
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2674: Start of 'ora.cssd' on 'node1' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'node1'
CRS-2681: Clean of 'ora.cssd' on 'node1' succeeded
CRS-5804: Communication error with agent process
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'

… the command eventually failed. The ocssd.log file showed this:

...
2011-03-17 09:47:49.073: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996354 ms, node 0x2aaab008a260 { host 'node4', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-a7491097, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.084: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996364 ms, node 0x2aaab008a630 { host 'node6', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-2f6ece1c, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.113: [    CSSD][1113332032]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846440, LATS 10996434, lastSeqNo 30846437, uniqueness 1300108895, timestamp 1300355268/3605443434
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355257, LATS 10996434, lastSeqNo 31355254, uniqueness 1300344405, timestamp 1300355268/10388584
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372473, LATS 10996434, lastSeqNo 31372470, uniqueness 1297097908, timestamp 1300355268/3605182454
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384686, LATS 10996434, lastSeqNo 31384683, uniqueness 1297098093, timestamp 1300355268/3604696294
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388819, LATS 10996434, lastSeqNo 31388816, uniqueness 1297098327, timestamp 1300355268/3604712934
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612975, LATS 10996434, lastSeqNo 29612972, uniqueness 1297685443, timestamp 1300355268/3603054884
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203293, LATS 10996434, lastSeqNo 31203290, uniqueness 1297156000, timestamp 1300355268/3604855704
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 3, node33, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355258, LATS 10996434, lastSeqNo 31355255, uniqueness 1300344405, timestamp 1300355268/10388624
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372474, LATS 10996434, lastSeqNo 31372471, uniqueness 1297097908, timestamp 1300355268/3605182494
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384687, LATS 10996434, lastSeqNo 31384684, uniqueness 1297098093, timestamp 1300355268/3604696304
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388821, LATS 10996434, lastSeqNo 31388818, uniqueness 1297098327, timestamp 1300355268/3604713224
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612977, LATS 10996434, lastSeqNo 29612974, uniqueness 1297685443, timestamp 1300355268/3603055224
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846441, LATS 10996474, lastSeqNo 30846438, uniqueness 1300108895, timestamp 1300355269/3605443654
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355259, LATS 10996474, lastSeqNo 31355256, uniqueness 1300344405, timestamp 1300355268/10389264
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203294, LATS 10996474, lastSeqNo 31203291, uniqueness 1297156000, timestamp 1300355269/3604855914
2011-03-17 09:47:49.619: [    CSSD][1116485952]clssnmSendingThread: sending join msg to all nodes
...

The interesting bit is the “has a disk HB, but no network HB”, i.e. something must be wrong with the network configuration. I quickly checked the output of ifconfig and found a missing entry for my private interconnect. If you are unsure which interface that should be, it is defined in the GPnP profile:

<gpnp:Network-Profile>
 <gpnp:HostNetwork id="gen" HostName="*">
 <gpnp:Network id="net1" IP="17x.xx.x9.1x0" Adapter="bond1.251" Use="cluster_interconnect" />
 </gpnp:HostNetwork>
</gpnp:Network-Profile>
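If you would rather dump the profile than hunt for the XML file, gpnptool from the Grid home prints the local copy; a sketch, assuming $GRID_HOME is set:

[grid@node1.example.com] $ $GRID_HOME/bin/gpnptool get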

Now that's a starting point! I tried to bring up bond1.251, but that failed:

[root@node1 network-scripts]# ifup bond1.251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
ERROR: could not add vlan 251 as bond1.251 on dev bond1

The “invalid argument” didn't mean too much to me. It turned out that /sbin/ifup is just a shell script, so I ran it through “bash -x” to get more information about which argument was invalid:

[root@node1 network-scripts]# which ifup
/sbin/ifup
[root@node1 network-scripts]# view /sbin/ifup
# turned out it's a shell script! Let's run with debug output enabled
[root@node1 network-scripts]# bash -x /sbin/ifup bond1.251
+ unset WINDOW
...
+ MATCH='^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$'
+ [[ bond1.251 =~ ^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$ ]]
++ echo bond1.251
++ LC_ALL=C
++ sed 's/^[a-z0-9]*\.0*//'
+ VID=251
+ PHYSDEV=bond1
+ [[ bond1.251 =~ ^vlan[0-9]{1,4}? ]]
+ '[' -n 251 ']'
+ '[' '!' -d /proc/net/vlan ']'
+ test -z ''
+ VLAN_NAME_TYPE=DEV_PLUS_VID_NO_PAD
+ /sbin/vconfig set_name_type DEV_PLUS_VID_NO_PAD
+ is_available bond1
+ LC_ALL=
+ LANG=
+ ip -o link
+ grep -q bond1
+ '[' 0 = 1 ']'
+ return 0
+ check_device_down bond1
+ echo bond1
+ grep -q :
+ LC_ALL=C
+ ip -o link
+ grep -q 'bond1[:@].*,UP'
+ return 1
+ '[' '!' -f /proc/net/vlan/bond1.251 ']'
+ /sbin/vconfig add bond1 251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
+ /usr/bin/logger -p daemon.info -t ifup 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
+ echo 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
ERROR: could not add vlan 251 as bond1.251 on dev bond1
+ exit 1

Hmmm, so it seemed that the underlying interface bond1 was missing, which was true: the output of ifconfig didn't show it as configured, and trying to start it manually using ifup bond1 failed as well. It turned out that the ifcfg-bond1 file was missing and had to be recreated from the documentation. All network configuration files on Red Hat based systems live in /etc/sysconfig/network-scripts/ifcfg-<interfaceName>. With the recreated file in place, I was back in the running:

[root@node1 network-scripts]# ll *bond1*
-rw-r--r-- 1 root root 129 Mar 17 10:07 ifcfg-bond1
-rw-r--r-- 1 root root 168 May 19  2010 ifcfg-bond1.251
[root@node1 network-scripts]# ifup bond1
[root@node1 network-scripts]# ifup bond1.251
Added VLAN with VID == 251 to IF -:bond1:-
[root@node1 network-scripts]#
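For reference, a bare-bones ifcfg-bond1 on a RHEL 5 style system looks roughly like the sketch below. The values are purely illustrative; the real file was rebuilt from our own documentation, and note that the IP lives on the VLAN interface bond1.251, not on bond1 itself:

# /etc/sysconfig/network-scripts/ifcfg-bond1 (illustrative values only)
DEVICE=bond1
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=active-backup miimon=100"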

Now I could try to start the lower stack again:

CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'node1'
CRS-2672: Attempting to start 'ora.ctssd' on 'node1'
CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.evmd' on 'node1'
CRS-2676: Start of 'ora.evmd' on 'node1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2679: Attempting to clean 'ora.asm' on 'node1'
CRS-2681: Clean of 'ora.asm' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'node1'
CRS-2676: Start of 'ora.asm' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded
[root@node1 network-scripts]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

Brilliant, problem solved. This is actually the first time that an incorrect network configuration prevented a cluster I looked after from starting. The best indication in this case is in the gipcd log file, but it didn't occur to me to look at it since the error initially seemed so clearly storage related.
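For next time: the gipcd trace under the Grid home is worth checking early on, as complaints about missing interfaces show up there as well. A sketch, assuming $GRID_HOME points to the Grid home:

[grid@node1.example.com] $ grep -i "interface" $GRID_HOME/log/$(hostname -s)/gipcd/gipcd.log | tail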

