基本的实现过程如下所示:
>> 配置主服务器的相关参数,创建当前数据库集群的归档文件目录,然后需要配置当前的模式,通过修改postgresql.conf的配置:
archive_mode= on
archive_command = 'cp %p /ire_gp/xxxx/datanodearch/%f'
max_wal_senders = 10
wal_level = hot_standby # minimal, archive, or hot_standby
>> 修改主服务器的客户端认证配置,运行流复制, 通过pg_hba.conf进行设置:
# replication privilege.
local replication xxxx trust
host replication xxxx 127.0.0.1/32 trust
>> 关闭服务器进行文件系统的拷贝,并将pg_xlog文件夹中的相关文件删除。
[xxxx@drdb02 datanode11bk]$cp -fr ../datanode11/* .
[xxxx@drdb02 datanode11bk]$ll
total 1632
drwx------ 6 xxxx ire_gp 4096 Jun 8 14:21 base
-rw------- 1 xxxx ire_gp 1573341 Jun 8 14:21 datanode11.log
drwx------ 2 xxxx ire_gp 4096 Jun 8 14:21 global
drwx------ 2 xxxx ire_gp 4096 Jun 8 14:21 pg_clog
-rw------- 1 xxxx ire_gp 4996 Jun 8 14:21 pg_hba.conf
-rw------- 1 xxxx ire_gp 1636 Jun 8 14:21 pg_ident.conf
drwx------ 4 xxxx ire_gp 4096 Jun 8 14:21 pg_multixact
drwx------ 2 xxxx ire_gp 4096 Jun 8 14:21 pg_notify
drwx------ 2 xxxx ire_gp 4096 Jun 8 14:21 pg_serial
drwx------ 2 xxxx ire_gp 4096 Jun 8 14:21 pg_snapshots
drwx------ 2 xxxx ire_gp 4096 Jun 8 14:21 pg_stat_tmp
drwx------ 2 xxxx ire_gp 4096 Jun 8 14:21 pg_subtrans
drwx------ 2 xxxx ire_gp 4096 Jun 8 14:21 pg_tblspc
drwx------ 2 xxxx ire_gp 4096 Jun 8 14:21 pg_twophase
-rw------- 1 xxxx ire_gp 4 Jun 8 14:21 PG_VERSION
drwx------ 3 xxxx ire_gp 4096 Jun 8 14:21 pg_xlog
-rw------- 1 xxxx ire_gp 21733 Jun 8 14:21 postgresql.conf
-rw------- 1 xxxx ire_gp 78 Jun 8 14:21 postmaster.opts
[xxxx@drdb02 pg_xlog]$rm -fr *
[xxxx@drdb02 pg_xlog]$
[xxxx@drdb02 pg_xlog]$ll
total 0
>> 修改备机实例目录下的配置文件,主要修改一些参数用于标识该服务器为备机,同时若在同一台服务器可能需要设置为不同的端口等相关的处理。
port = 11962 # (change requires restart)
pooler_port = 12962 # Pool Manager TCP port
hot_standby = on # "on" allows queries during recovery
>> 拷贝一个recovery.conf文件,并根据对应的需要设置对应的参数,主要是restore_command。
standby_mode = on //当前为备机模式
primary_conninfo = 'host=localhost port=11961' //连接的对端的
trigger_file = 'failover.now' //进行切换的触发文件,即存在该文件时将进行触发
restore_command = 'cp /ire_gp/xxxx/datanodearch/%f %p' //对应的归档文件拷贝到当前路径的方式
>> 启动备机服务器
[xxxx@drdb02 datanode11bk]$pg_ctl start -Z datanode -D /ire_gp/data/datanode11bk/ -l /ire_gp/data/datanode11bk/datanode11_back.log
[xxxx@drdb02 datanode11bk]$vi datanode11_back.log //查看备机的情况,启动异常,但是可以忽略,主机还未启动
cp: cannot stat `/ire_gp/data/archive_wal/000000010000000000000002': No such file or directory
FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "localhost" (::1) and accepting
TCP/IP connections on port 11961?
could not connect to server: Connection refused
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 11961?
>> 启动主机服务器
[xxxx@drdb02 datanode11bk]$pg_ctl start -Z datanode -D /ire_gp/data/datanode11/ -l /ire_gp/data/datanode11/datanode11.log
[xxxx@drdb02 datanode11bk]$vi datanode11_back.log //查看备机端的流处理情况
FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "localhost" (::1) and accepting
TCP/IP connections on port 11961?
could not connect to server: Connection refused
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 11961?
cp: cannot stat `/ire_gp/xxxx/datanodearch/000000010000000000000005': No such file or directory
cp: cannot stat `/ire_gp/xxxx/datanodearch/000000010000000000000005': No such file or directory
LOG: streaming replication successfully connected to primary //执行流复制成功
LOG: redo starts at 0/5000080 //redo根据日志进行事务的重新执行,进行事务的更新操作
[xxxx@drdb02 datanode11bk]$
在测试的过程中出现了如下的问题:
cp: cannot stat `/ire_gp/data/archive_wal/000000010000000000000002': No such file or directory
LOG: streaming replication successfully connected to primary
WARNING: WAL was generated with wal_level=minimal, data may be missing
HINT: This happens if you temporarily set wal_level=minimal without taking a new base backup.
FATAL: hot standby is not possible because wal_level was not set to "hot_standby" on the master server
HINT: Either set wal_level to "hot_standby" on the master, or turn off hot_standby here.
LOG: startup process (PID 12549) exited with exit code 1
LOG: aborting startup due to startup process failure
出现这个原因的问题是当前进行重新执行的WAL文件000000010000000000000002是主机在minimal模式下产生的日志。同时检查配置文件,确认相关的参数正确。通常出现这个原因的问题是在备份文件系统的过程中存在问题,最简单的方式是重新备份一次文件系统,然后备机在最新的文件系统之上进行恢复操作,即在关闭主机的服务器之后再进行备份,然后先启动备机的数据库,最后启动主机服务器。