
[Experience Sharing] Disaster recovery in OpenStack when a physical host fails

Posted on 2015-4-12 10:54:05
  Original article: http://dachary.org/?p=1961
  Disaster recovery on host failure in OpenStack
The host bm0002.the.re becomes unavailable because of a partial disk failure on an Essex-based OpenStack cluster using LVM-based volumes and multi-host nova-network. The host had daily backups taken with rsync of /, and each LV was copied and compressed. Although the disk is failing badly, the host is not down and some reads can still be done. The nova services are shut down, the host is disabled using nova-manage, and an attempt is made to recover from the partially damaged disks and LVs where that gives better results than reverting to yesterday's backup.
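  The backup job itself is not shown in the original; a minimal sketch of what it might look like on the backup host, assuming the /backup/bm0002.the.re layout used below and the nova-volumes volume group (the exclude list and dump destinations are illustrative):
  # nightly copy of / from bm0002.the.re into the backup tree
rsync -aHx --numeric-ids --exclude=/proc --exclude=/sys \
  root@bm0002.the.re:/ /backup/bm0002.the.re/
# copy and compress each LV of the nova-volumes volume group
for lv in $(ssh bm0002.the.re lvs --noheadings -o lv_path nova-volumes); do
  ssh bm0002.the.re "dd if=$lv bs=1M" | gzip > /backup/$(basename $lv).img.gz
done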
  restoring an instance from backup
  The host is marked as unavailable
  nova-manage service disable --host=bm0002.the.re --service=nova-compute
nova-manage service disable --host=bm0002.the.re --service=nova-network
nova-manage service disable --host=bm0002.the.re --service=nova-volume
and shows as such when listed
  # nova-manage service list --host=bm0002.the.re
Binary           Host    Zone Status     State Updated_At
nova-compute     bm0002.the.re  bm0002  disabled   XXX   2013-05-11 09:18:25
nova-network     bm0002.the.re  bm0002  disabled   XXX   2013-05-11 09:18:30
nova-volume      bm0002.the.re  bm0002  disabled   XXX   2013-05-11 09:18:33
It can be removed completely later by modifying the mysql database directly (a sketch follows the instance listing below). The april-ci instance was running on bm0002.the.re:
  # nova list --name april-ci
+--------------------------------------+----------+---------+--------------------------------------+
|                  ID                  |   Name   |  Status |               Networks               |
+--------------------------------------+----------+---------+--------------------------------------+
| 4e8a8126-b27d-4c9e-abeb-4dc574c54254 | april-ci | SHUTOFF | novanetwork=10.145.9.5, 176.31.18.26 |
+--------------------------------------+----------+---------+--------------------------------------+
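  Removing the disabled services for good, as mentioned above, can be done with a direct update; a minimal sketch, assuming the Essex nova schema where the services table carries host and deleted columns:
  mysql -e "update services set deleted = 1 where host = 'bm0002.the.re'" nova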
It is artificially moved to a host that is enabled:
  mysql -e "update instances set host = 'bm0001.the.re', availability_zone = 'bm0001' where hostname = 'april-ci'" nova
and deleted
  nova delete april-ci
  Assuming the content of the failed host was backed up entirely (i.e. with rsync of /), the april-ci disk is located using the id shown above in the output of nova list
  # grep 4dc574c54254 /var/lib/nova/instances/*/*.xml
/var/lib/nova/instances/instance-000001de/libvirt.xml:    4e8a8126-b27d-4c9e-abeb-4dc574c54254
and the corresponding disk is turned into a minimal file system
  chroot /backup/bm0002.the.re                # work inside the rsync copy of the failed host's /
mount -t proc none /proc                      # some tools expect /proc inside the chroot
qemu-nbd --port 20000 /var/lib/nova/instances/instance-000001de/disk &   # export the instance disk over NBD
nbd-client localhost 20000 /dev/nbd0          # attach the export as a local block device
pv /dev/nbd0 > april-ci.april-ci.img          # dump it to a raw image file
fsck -fy $(pwd)/april-ci.april-ci.img         # repair the file system
resize2fs -M april-ci.april-ci.img            # shrink it to its minimal size
exit
and uploaded to glance, using the same kernel and initrd as the original image (their ids are shown by nova image-show original-image-of-april-ci; see the sketch after the upload command):
  glance add name="april-ci-2013-05-11" disk_format=ami container_format=ami \
kernel_id=2e714ea3-45e5-4bb8-ab5d-92bfff64ad28 \
ramdisk_id=6458acca-24ef-4568-bb2b-e52322a5a11c < /backup/bm0002.the.re/april-ci.april-ci.img
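  The kernel_id and ramdisk_id above come from the original image; a quick way to look them up, assuming they are exposed in the image metadata printed by nova image-show:
  # keep only the kernel and ramdisk ids from the image properties
nova image-show original-image-of-april-ci | grep -E 'kernel_id|ramdisk_id'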
The instance is then rebooted from this image using the same flavor
  nova boot --image 'april-ci-2013-05-11' \
  --flavor e.1-cpu.10GB-disk.1GB-ram \
  --key_name loic --availability_zone=bm0001 --poll april-ci
recovering from a partially damaged logical volume
  A 30GB volume contains bad blocks toward the end (after 26GB) but it was not full. An fsck is run on a copy of the disk to check how much the recovery process would lose; it turns out to be less than a hundred files in a non-critical area (a sketch of such a check follows the volume creation below). A new disk of the same size is allocated on another machine with
  # euca-create-volume --zone bm0001 --size 30
VOLUME  vol-0000005b    30      bm0001  creating        2013-05-11T11:22:19.889Z
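  The original does not show how the preliminary copy used for that fsck was made; a minimal sketch, assuming dd is used so that unreadable blocks are zero-padded instead of aborting the copy (the /tmp destination is illustrative):
  # copy as much of the damaged LV as possible, padding bad blocks with zeros
dd if=/dev/nova-volumes/volume-00000143 of=/tmp/volume-00000143.img bs=1M conv=noerror,sync
# read-only fsck on the copy to estimate how much would be lost
fsck -n /tmp/volume-00000143.img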
and the contents of the damaged volume are copied over, until the copy fails with an I/O error.
  ssh -A root@bm0001.the.re
ssh bm0002.the.re pv /dev/nova-volumes/volume-00000143 | \
pv > /dev/nova-volumes/volume-0000005b
and it is repaired
  fsck -fy /dev/nova-volumes/volume-0000005b
The volume residing on the failed host is removed directly from the database
  mysql -e "update volumes set deleted = 1 where id = 30" nova
recovering from a partially damaged instance disk
  An instance disk has a few failed blocks and may be recovered if the remaining blocks are copied over. Because rsync is more resilient to I/O errors than dd or pv, it is used to recover as much as possible with:
  # ssh -A root@bm0002.the.re
# rsync --inplace --progress /var/lib/nova/instances/instance-00000089/disk root@bm0001.the.re:/backup/bm0002.the.re/var/lib/nova/instances/instance-00000089/disk
  1843396608 100%    8.41MB/s    0:03:28 (xfer#1, to-check=0/1)
rsync: read errors mapping "/mnt/var/lib/nova/instances/instance-00000089/disk": Input/output error (5)
WARNING: disk failed verification -- update retained (will try again).
disk
  1843396608 100%   37.37MB/s    0:00:47 (xfer#2, to-check=0/1)
rsync: read errors mapping "/var/lib/nova/instances/instance-00000089/disk": Input/output error (5)
ERROR: disk failed verification -- update retained.
sent 1843836447 bytes  received 858892 bytes  7000741.32 bytes/sec
total size is 1843396608  speedup is 1.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1070) [sender=3.0.9]
It is then turned into a file using nbd as shown above and checked for errors:
  # fsck -fy $(pwd)/openstack.jenkins.img
fsck from util-linux 2.20.1
e2fsck 1.42.5 (29-Jul-2012)
/openstack.jenkins.img: recovering journal
Clearing orphaned inode 117551 (uid=0, gid=0, mode=0100644, size=0)
Clearing orphaned inode 9764 (uid=0, gid=0, mode=0100644, size=1393052)
Clearing orphaned inode 9765 (uid=0, gid=0, mode=0100644, size=302040)
Clearing orphaned inode 7050 (uid=105, gid=109, mode=0100644, size=0)
Clearing orphaned inode 8841 (uid=0, gid=0, mode=0100644, size=81800)
Clearing orphaned inode 10235 (uid=0, gid=0, mode=0100644, size=253328)
Clearing orphaned inode 10240 (uid=0, gid=0, mode=0100644, size=180624)
Clearing orphaned inode 8840 (uid=0, gid=0, mode=0100644, size=874608)
Clearing orphaned inode 6469 (uid=0, gid=0, mode=0100755, size=1245180)
Clearing orphaned inode 10739 (uid=0, gid=0, mode=0100644, size=18192)
Clearing orphaned inode 10927 (uid=0, gid=0, mode=0100644, size=19908)
Clearing orphaned inode 10754 (uid=0, gid=0, mode=0100644, size=100820)
Clearing orphaned inode 10738 (uid=0, gid=0, mode=0100644, size=11468)
Clearing orphaned inode 10926 (uid=0, gid=0, mode=0100644, size=31568)
Clearing orphaned inode 10956 (uid=0, gid=0, mode=0100644, size=18780)
Clearing orphaned inode 10958 (uid=0, gid=0, mode=0100644, size=22312)
Clearing orphaned inode 10723 (uid=0, gid=0, mode=0100644, size=13976)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (2299561, counted=2283092).
Fix? yes
Free inodes count wrong (538192, counted=534536).
Fix? yes
/openstack.jenkins.img: ***** FILE SYSTEM WAS MODIFIED *****
/openstack.jenkins.img: 52984/587520 files (0.3% non-contiguous), 338348/2621440 blocks
If the data loss is smaller than what reverting to yesterday's backup would cause, the instance is rebooted using this copy.
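  Rebooting from the repaired copy follows the same glance and nova steps as in the first section; a sketch with illustrative names, where the image name, kernel/ramdisk ids and flavor depend on the original instance:
  # upload the repaired image; kernel/ramdisk ids are those of the original image
glance add name="openstack.jenkins-2013-05-11" disk_format=ami container_format=ami \
  kernel_id=<kernel-id-of-original-image> \
  ramdisk_id=<ramdisk-id-of-original-image> < openstack.jenkins.img
# boot a new instance from it on an enabled host
nova boot --image 'openstack.jenkins-2013-05-11' \
  --flavor <flavor-of-original-instance> \
  --key_name loic --availability_zone=bm0001 --poll openstack.jenkins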
