cy_88 发表于 2016-11-22 06:57:50

PostreSQL崩溃试验全记录

  磨砺技术珠矶,践行数据之道,追求卓越价值
  回到上一级页面: PostgreSQL基础知识与基本操作索引页   回到顶级页面:PostgreSQL索引页
  [作者 高健@博客园luckyjackgao@gmail.com]
  
  由于客户提到,运行某些大运算量批处理时,PostgreSQL突发性使用大量资源的问题,
  进行了如下的调查,发现确实发生了崩溃。PostgreSQL 需要资源控制方案啊。
  现在正在考虑是否必须要用 os 级别的限制方案:
  过程如下:
  测试环境:
  内存:大约1024MB
  postgresql.conf 设定:
      使用缺省值:checkpoint_segments =3shard_buffers=32MB
      这些是有意为之,就是想看看数据量大、shared_buffer小的状况:
  建立表(一条记录大约1024字节):
      postgres=# create table test01(id integer, val char(1024));
  向表中插入大量数据(总共插入2400MB)
     postgres=# insert into test01 values(generate_series(1,2457600),repeat( chr(int4(random()*26)+65),1024));
  插入操作会花费一些时间,此时用ps 命令看PostgreSQL个进程对内存使用状况:‘
  



# ps aux | grep post
root      31800.00.0 105296   712 pts/1    S    16:31   0:00 su - postgres
postgres31810.00.070304   676 pts/1    S+   16:31   0:00 -bash
postgres32190.00.2 1136442864 pts/1    S    16:32   0:00 /usr/local/pgsql/bin/postgres -D /gao/data
postgres32210.43.0 113724 35252 ?      Ss   16:32   0:01 postgres: writer process                  
postgres32220.20.1 1136441616 ?      Ds   16:32   0:00 postgres: wal writer process            
postgres32230.00.0 1143801148 ?      Ss   16:32   0:00 postgres: autovacuum launcher process   
postgres32240.00.073332   472 ?      Ss   16:32   0:00 postgres: stats collector process         
root      32520.00.0 105296   712 pts/2    S    16:32   0:00 su - postgres
postgres32530.00.070304   676 pts/2    S    16:32   0:00 -bash
postgres32850.00.083488   740 pts/2    S+   16:32   0:00 ./psql
postgres3286 14.8 80.2 2598332 924308 ?      Ds   16:32   0:35 postgres: postgres postgres INSERT
root      33330.00.065424   812 pts/3    S+   16:36   0:00 grep post
  可以发现,INSERT操作正在消耗80%以上的内存。
  再等片刻,发现psql端传来反馈:



WARNING:terminating connection because of crash of another server process
DETAIL:The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:In a moment you should be able to reconnect to the database and repeat your command.
The connection to the server was lost. Attempting reset: Failed.
!>
  此时看看Log,可以看到:background writer (3321) 已经被干掉了,所有连接被重置。



LOG:autovacuum launcher started
LOG:database system is ready to accept connections
LOG:checkpoints are occurring too frequently (2 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (1 second apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (2 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (1 second apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (2 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (1 second apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (2 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (4 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (10 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (9 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (7 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (9 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (7 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (5 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (5 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (10 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (11 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (15 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (23 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (7 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (8 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (13 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (7 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (8 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (8 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (8 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (4 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (5 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:checkpoints are occurring too frequently (9 seconds apart)
HINT:Consider increasing the configuration parameter "checkpoint_segments".
LOG:background writer process (PID 3221) was terminated by signal 9: Killed
LOG:terminating any other active server processes
WARNING:terminating connection because of crash of another server process
DETAIL:The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:In a moment you should be able to reconnect to the database and repeat your command.
WARNING:terminating connection because of crash of another server process
DETAIL:The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:In a moment you should be able to reconnect to the database and repeat your command.
WARNING:terminating connection because of crash of another server process
DETAIL:The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:In a moment you should be able to reconnect to the database and repeat your command.
WARNING:terminating connection because of crash of another server process
DETAIL:The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:In a moment you should be able to reconnect to the database and repeat your command.
LOG:all server processes terminated; reinitializing
FATAL:the database system is in recovery mode
LOG:database system was interrupted; last known up at 2013-08-30 16:36:42 CST
LOG:database system was not properly shut down; automatic recovery in progress
LOG:consistent recovery state reached at 0/B7657BD0
LOG:redo starts at 0/B60FE2B8
LOG:unexpected pageaddr 0/B044C000 in log file 0, segment 184, offset 4505600
LOG:redo done at 0/B844B940
LOG:autovacuum launcher started
LOG:database system is ready to accept connections
  各个进程都重新生成了:



# ps aux | grep post
root      31800.00.0 105296   660 pts/1    S    16:31   0:00 su - postgres
postgres31810.00.070304   628 pts/1    S+   16:31   0:00 -bash
postgres32190.00.5 1136446276 pts/1    S    16:32   0:00 /usr/local/pgsql/bin/postgres -D /gao/data
root      32520.00.0 105296   660 pts/2    S    16:32   0:00 su - postgres
postgres32530.00.070304   628 pts/2    S    16:32   0:00 -bash
postgres32850.00.0834881144 pts/2    S+   16:32   0:00 ./psql
postgres33480.00.0 113644   984 ?      Ss   16:40   0:00 postgres: writer process                  
postgres33490.00.0 113644   732 ?      Ss   16:40   0:00 postgres: wal writer process            
postgres33500.00.1 1143361756 ?      Ss   16:40   0:00 postgres: autovacuum launcher process   
postgres33510.00.073332   580 ?      Ss   16:40   0:00 postgres: stats collector process         
root      33610.00.065420   796 pts/3    R+   16:44   0:00 grep post
#
  回到psql端再看,发现连接已经失效了:



!> \
Invalid command \. Try \? for help.
!> \dt;
You are currently not connected to a database.
!>
  根据向社区提问,据说是因为OS级别的OOM错误发生,所以杀死了Postmaster进程。
  总之,此种情况表明,如果没有对资源消费总量的控制,那么突发性的用户访问会导致崩溃的。
  
  [作者 高健@博客园luckyjackgao@gmail.com]
  回到上一级页面: PostgreSQL基础知识与基本操作索引页   回到顶级页面:PostgreSQL索引页
  磨砺技术珠矶,践行数据之道,追求卓越价值
页: [1]
查看完整版本: PostreSQL崩溃试验全记录