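The HP rx3600 server has gone down four times without an obvious cause. Each time, abnormal processes exhausted the disk resources, leaving both the operating system and the application software unusable.
HP engineers first confirmed that the hardware was fine and running normally, so the problem had to be on the software side. I had long assumed that the fault lay in the HP operating system or in Service Guard, but once the HP engineers confirmed that the HP software was also healthy, attention shifted to Oracle, and that finally produced some leads to analyze.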
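2 Problem analysis and handling
1. The HP engineers confirmed that neither the HP hardware nor the HP software showed any abnormality.
2. I then concentrated on looking for the cause on the Oracle side (this was my mistake: I had been fixed on blaming HP).
Handling steps:
Ø Carefully review the alert log: alert_dlyx.log
The following anomalies were found: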
Errors in file /oracle/admin/dlyx/bdump/dlyx_cjq0_16372.trc:
Wed Jun 30 18:41:00 EAT 2010
Process q001 died, see its trace file
Wed Jun 30 18:41:08 EAT 2010
ksvcreate: Process(q001) creation failed
Wed Jun 30 18:44:34 EAT 2010
Process J000 died, see its trace file
Wed Jun 30 18:44:37 EAT 2010
kkjcre1p: unable to spawn jobq slave process
Wed Jun 30 18:44:40 EAT 2010
Errors in file /oracle/admin/dlyx/bdump/dlyx_cjq0_16372.trc:
Wed Jun 30 18:47:29 EAT 2010
Process m000 died, see its trace file
Wed Jun 30 18:47:32 EAT 2010
ksvcreate: Process(m000) creation failed
Wed Jun 30 18:50:25 EAT 2010
Process J000 died, see its trace file
Wed Jun 30 18:50:25 EAT 2010
kkjcre1p: unable to spawn jobq slave process
Wed Jun 30 18:50:27 EAT 2010
Errors in file /oracle/admin/dlyx/bdump/dlyx_cjq0_16372.trc:
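The log shows that a large number of background slave processes such as J000, q001 and m000 died or could not be created.
Ø Open the trace file: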
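To make the pattern in the alert log easier to see, the messages can be tallied per process. The following is a minimal sketch, assuming the log lines look exactly like the excerpt above; the function name `summarize_alert_log` and the `SAMPLE` text are illustrative and not part of any Oracle tool.

```python
import re
from collections import Counter

# Patterns copied from the messages quoted in the alert log excerpt.
DIED = re.compile(r"^Process (\w+) died, see its trace file")
CREATE_FAILED = re.compile(r"^ksvcreate: Process\((\w+)\) creation failed")
SPAWN_FAILED = re.compile(r"^kkjcre1p: unable to spawn jobq slave process")


def summarize_alert_log(lines):
    """Count process-death and spawn-failure messages in alert log lines."""
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        match = DIED.match(line)
        if match:
            counts[("died", match.group(1))] += 1
            continue
        match = CREATE_FAILED.match(line)
        if match:
            counts[("creation_failed", match.group(1))] += 1
            continue
        if SPAWN_FAILED.match(line):
            # This message does not name the process, so use a fixed label.
            counts[("spawn_failed", "jobq slave")] += 1
    return counts


# Sample text taken from the excerpt above (hypothetical variable name).
SAMPLE = """\
Wed Jun 30 18:41:00 EAT 2010
Process q001 died, see its trace file
Wed Jun 30 18:41:08 EAT 2010
ksvcreate: Process(q001) creation failed
Wed Jun 30 18:44:34 EAT 2010
Process J000 died, see its trace file
Wed Jun 30 18:44:37 EAT 2010
kkjcre1p: unable to spawn jobq slave process
Wed Jun 30 18:47:29 EAT 2010
Process m000 died, see its trace file
Wed Jun 30 18:47:32 EAT 2010
ksvcreate: Process(m000) creation failed
Wed Jun 30 18:50:25 EAT 2010
Process J000 died, see its trace file
Wed Jun 30 18:50:25 EAT 2010
kkjcre1p: unable to spawn jobq slave process
"""

counts = summarize_alert_log(SAMPLE.splitlines())
for (kind, process), n in sorted(counts.items()):
    print(kind, process, n)
```

To run it against the real file, pass an open file object instead of `SAMPLE.splitlines()`. The file's location depends on the instance's dump directory settings.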
/oracle/admin/dlyx/bdump/dlyx_cjq0_16372.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/OraHome_1
System name: HP-UX
Node name: lsyxdb
Release: B.11.31
Version: U
Machine: ia64
Instance name: dlyx
Redo thread mounted by this instance: 1
Oracle process number: 10
Unix process pid: 16372, image: (CJQ0)
*** 2010-06-30 18:36:11.770
*** SERVICE NAME:(SYS$BACKGROUND) 2010-06-30 18:35:59.220
*** SESSION ID:(657.1) 2010-06-30 18:35:58.812
Waited for process J000 to initialize for 60 seconds
*** 2010-06-30 18:36:16.660
Process diagnostic dump for J000, OS id=26847
-------------------------------------------------------------------------------
loadavg : 0.20 0.23 0.30
Swapinfo :
Avail = 7761.29Mb Used = 7640.66Mb
Swap free = 120.62Mb Kernel rsvd = 1511.08Mb
Free Mem = 15.14Mb
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD
1401 S oracle 26847 1 0 128 20 e00000018b663700 2291 e000000175f1c100 18:34:55 ? 0:00 ora_j000_dlyx
*** 2010-06-30 18:36:55.052
Stack:
skgpgcmdout: read() for cmd /usr/bin/echo "set pagination off
bt 50
detach" | /opt/langtools/bin/gdb -quiet /oracle/OraHome_1/bin/oracle 26847 2>&1 | /bin/grep "#" timed out after 11.321 seconds
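This section shows that Oracle waited 60 seconds for J000 to initialize, and that only 15.14 MB of physical memory was free at the time.
A plausible explanation is that the memory shortage prevented Oracle from starting its processes normally, which caused the abnormal disk I/O and, in turn, the failure of the whole database server.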
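The memory figures can be pulled out of the trace's Swapinfo section programmatically, which helps when comparing several trace files. This is only a sketch, assuming the `label = valueMb` layout shown in the trace above. The names `parse_swapinfo` and `low_memory_fields` and the two threshold values are hypothetical choices for illustration, not Oracle or HP-UX recommendations.

```python
import re

# Matches "<label> = <number>Mb" pairs as they appear in the Swapinfo section.
FIELD = re.compile(
    r"(Avail|Used|Swap free|Kernel rsvd|Free Mem)\s*=\s*([\d.]+)Mb"
)


def parse_swapinfo(text):
    """Return a dict mapping each Swapinfo label to its value in MB."""
    return {label: float(value) for label, value in FIELD.findall(text)}


def low_memory_fields(stats, free_mem_floor_mb=100.0, swap_free_floor_mb=500.0):
    """List the fields that fall below the given (assumed) floors."""
    floors = {"Free Mem": free_mem_floor_mb, "Swap free": swap_free_floor_mb}
    return [
        label
        for label, floor in floors.items()
        if label in stats and stats[label] < floor
    ]


# Text copied from the trace file excerpt (hypothetical variable name).
TRACE_SNIPPET = """\
Swapinfo :
Avail = 7761.29Mb Used = 7640.66Mb
Swap free = 120.62Mb Kernel rsvd = 1511.08Mb
Free Mem = 15.14Mb
"""

stats = parse_swapinfo(TRACE_SNIPPET)
print(stats)
print("Below floor:", low_memory_fields(stats))
```

With the values from this trace, both free physical memory and free swap fall below the assumed floors, which fits the memory-shortage hypothesis.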
This is consistent with the errors that HP-UX had reported earlier:
Jun 30 17:35:27 lsyxdb cmnetd[16101]:Warning: process was unable to run for the last 6 seconds
Jun 30 17:38:58 lsyxdb telnetd[25347]:getpid: peer died: Error 0
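A full read of the Oracle alert log shows that the times at which this error appears match the times of the crashes fairly closely, and that no other error capable of causing a crash appears.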