
[Experience Sharing] Disaster recovery in OpenStack when a physical host fails

Posted on 2015-4-12 10:54:05
  Original article: http://dachary.org/?p=1961
  Disaster recovery on host failure in OpenStack
The host bm0002.the.re becomes unavailable because of a partial disk failure on an Essex-based OpenStack cluster using LVM-based volumes and multi-host nova-network. The host had daily backups taken with rsync of /, and each LV was copied and compressed. Although the disk is failing badly, the host is not down and some reads can still be done. The nova services are shut down, the host is disabled using nova-manage, and an attempt is made to recover from the partially damaged disks and LVs where that gives better results than reverting to yesterday's backup.
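  The backup job itself is not shown in the original; a minimal sketch of what it might look like on the backup host, assuming the /backup/bm0002.the.re layout used below and the nova-volumes volume group (the exclude list and dump destinations are illustrative):
  # nightly copy of / from bm0002.the.re into the backup tree
rsync -aHx --numeric-ids --exclude=/proc --exclude=/sys \
  root@bm0002.the.re:/ /backup/bm0002.the.re/
# copy and compress each LV of the nova-volumes volume group
for lv in $(ssh bm0002.the.re lvs --noheadings -o lv_path nova-volumes); do
  ssh bm0002.the.re "dd if=$lv bs=1M" | gzip > /backup/$(basename $lv).img.gz
done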
  restoring an instance from backup
  The host is marked as unavailable
  nova-manage service disable --host=bm0002.the.re --service=nova-compute
nova-manage service disable --host=bm0002.the.re --service=nova-network
nova-manage service disable --host=bm0002.the.re --service=nova-volume
and shows as such when listed
  # nova-manage service list --host=bm0002.the.re
Binary           Host    Zone Status     State Updated_At
nova-compute     bm0002.the.re  bm0002  disabled   XXX   2013-05-11 09:18:25
nova-network     bm0002.the.re  bm0002  disabled   XXX   2013-05-11 09:18:30
nova-volume      bm0002.the.re  bm0002  disabled   XXX   2013-05-11 09:18:33
It can be removed completely later by modifying the mysql database directly (a sketch follows the instance listing below). The april-ci instance was running on bm0002.the.re:
  # nova list --name april-ci
+--------------------------------------+----------+---------+--------------------------------------+
|                  ID                  |   Name   |  Status |               Networks               |
+--------------------------------------+----------+---------+--------------------------------------+
| 4e8a8126-b27d-4c9e-abeb-4dc574c54254 | april-ci | SHUTOFF | novanetwork=10.145.9.5, 176.31.18.26 |
+--------------------------------------+----------+---------+--------------------------------------+
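  Removing the disabled services for good, as mentioned above, can be done with a direct update; a minimal sketch, assuming the Essex nova schema where the services table carries host and deleted columns:
  mysql -e "update services set deleted = 1 where host = 'bm0002.the.re'" nova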
It is artificially moved to a host that is enabled:
  mysql -e "update instances set host = 'bm0001.the.re', availability_zone = 'bm0001' where hostname = 'april-ci'" nova
and deleted
  nova delete april-ci
  Assuming the content of the failed host was backed up entirely (i.e. with rsync of /), the april-ci disk is located using the id shown above in the output of nova list
  # grep 4dc574c54254 /var/lib/nova/instances/*/*.xml
/var/lib/nova/instances/instance-000001de/libvirt.xml:    4e8a8126-b27d-4c9e-abeb-4dc574c54254
and the corresponding disk is turned into a minimal file system
  chroot /backup/bm0002.the.re                # work inside the rsync copy of the failed host's /
mount -t proc none /proc                      # some tools expect /proc inside the chroot
qemu-nbd --port 20000 /var/lib/nova/instances/instance-000001de/disk &   # export the instance disk over NBD
nbd-client localhost 20000 /dev/nbd0          # attach the export as a local block device
pv /dev/nbd0 > april-ci.april-ci.img          # dump it to a raw image file
fsck -fy $(pwd)/april-ci.april-ci.img         # repair the file system
resize2fs -M april-ci.april-ci.img            # shrink it to its minimal size
exit
and uploaded to glance, using the same kernel and initrd as the original image (their ids are shown by nova image-show original-image-of-april-ci; see the sketch after the upload command):
  glance add name="april-ci-2013-05-11" disk_format=ami container_format=ami \
kernel_id=2e714ea3-45e5-4bb8-ab5d-92bfff64ad28 \
ramdisk_id=6458acca-24ef-4568-bb2b-e52322a5a11c < /backup/bm0002.the.re/april-ci.april-ci.img
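  The kernel_id and ramdisk_id above come from the original image; a quick way to look them up, assuming they are exposed in the image metadata printed by nova image-show:
  # keep only the kernel and ramdisk ids from the image properties
nova image-show original-image-of-april-ci | grep -E 'kernel_id|ramdisk_id'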
The instance is then rebooted from this image using the same flavor
  nova boot --image 'april-ci-2013-05-11' \
  --flavor e.1-cpu.10GB-disk.1GB-ram \
  --key_name loic --availability_zone=bm0001 --poll april-ci
recovering from a partially damaged logical volume
  A 30GB volume contains bad blocks toward the end (after 26GB) but it was not full. An fsck is run on a copy of the disk to check how much the recovery process would lose; it turns out to be less than a hundred files in a non-critical area (a sketch of such a check follows the volume creation below). A new disk of the same size is allocated on another machine with
  # euca-create-volume --zone bm0001 --size 30
VOLUME  vol-0000005b    30      bm0001  creating        2013-05-11T11:22:19.889Z
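  The original does not show how the preliminary copy used for that fsck was made; a minimal sketch, assuming dd is used so that unreadable blocks are zero-padded instead of aborting the copy (the /tmp destination is illustrative):
  # copy as much of the damaged LV as possible, padding bad blocks with zeros
dd if=/dev/nova-volumes/volume-00000143 of=/tmp/volume-00000143.img bs=1M conv=noerror,sync
# read-only fsck on the copy to estimate how much would be lost
fsck -n /tmp/volume-00000143.img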
and the contents of the damaged volume are copied over, until the copy fails with an I/O error.
  ssh -A root@bm0001.the.re
ssh bm0002.the.re pv /dev/nova-volumes/volume-00000143 | \
pv > /dev/nova-volumes/volume-0000005b
and it is repaired
  fsck -fy /dev/nova-volumes/volume-0000005b
The volume residing on the failed host is removed directly from the database
  mysql -e "update volumes set deleted = 1 where id = 30" nova
recovering from a partially damaged instance disk
  An instance disk has a few failed blocks and may be recovered if the remaining blocks are copied over. Because rsync is more resilient to I/O errors than dd or pv, it is used to recover as much as possible with:
  # ssh -A root@bm0002.the.re
# rsync --inplace --progress /var/lib/nova/instances/instance-00000089/disk root@bm0001.the.re:/backup/bm0002.the.re/var/lib/nova/instances/instance-00000089/disk
  1843396608 100%    8.41MB/s    0:03:28 (xfer#1, to-check=0/1)
rsync: read errors mapping "/mnt/var/lib/nova/instances/instance-00000089/disk": Input/output error (5)
WARNING: disk failed verification -- update retained (will try again).
disk
  1843396608 100%   37.37MB/s    0:00:47 (xfer#2, to-check=0/1)
rsync: read errors mapping "/var/lib/nova/instances/instance-00000089/disk": Input/output error (5)
ERROR: disk failed verification -- update retained.
sent 1843836447 bytes  received 858892 bytes  7000741.32 bytes/sec
total size is 1843396608  speedup is 1.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1070) [sender=3.0.9]
It is then turned into a file using nbd as shown above and checked for errors:
  # fsck -fy $(pwd)/openstack.jenkins.img
fsck from util-linux 2.20.1
e2fsck 1.42.5 (29-Jul-2012)
/openstack.jenkins.img: recovering journal
Clearing orphaned inode 117551 (uid=0, gid=0, mode=0100644, size=0)
Clearing orphaned inode 9764 (uid=0, gid=0, mode=0100644, size=1393052)
Clearing orphaned inode 9765 (uid=0, gid=0, mode=0100644, size=302040)
Clearing orphaned inode 7050 (uid=105, gid=109, mode=0100644, size=0)
Clearing orphaned inode 8841 (uid=0, gid=0, mode=0100644, size=81800)
Clearing orphaned inode 10235 (uid=0, gid=0, mode=0100644, size=253328)
Clearing orphaned inode 10240 (uid=0, gid=0, mode=0100644, size=180624)
Clearing orphaned inode 8840 (uid=0, gid=0, mode=0100644, size=874608)
Clearing orphaned inode 6469 (uid=0, gid=0, mode=0100755, size=1245180)
Clearing orphaned inode 10739 (uid=0, gid=0, mode=0100644, size=18192)
Clearing orphaned inode 10927 (uid=0, gid=0, mode=0100644, size=19908)
Clearing orphaned inode 10754 (uid=0, gid=0, mode=0100644, size=100820)
Clearing orphaned inode 10738 (uid=0, gid=0, mode=0100644, size=11468)
Clearing orphaned inode 10926 (uid=0, gid=0, mode=0100644, size=31568)
Clearing orphaned inode 10956 (uid=0, gid=0, mode=0100644, size=18780)
Clearing orphaned inode 10958 (uid=0, gid=0, mode=0100644, size=22312)
Clearing orphaned inode 10723 (uid=0, gid=0, mode=0100644, size=13976)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (2299561, counted=2283092).
Fix? yes
Free inodes count wrong (538192, counted=534536).
Fix? yes
/openstack.jenkins.img: ***** FILE SYSTEM WAS MODIFIED *****
/openstack.jenkins.img: 52984/587520 files (0.3% non-contiguous), 338348/2621440 blocks
If the data loss is smaller than what reverting to yesterday's backup would cause, the instance is rebooted using this copy.
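  Rebooting from the repaired copy follows the same glance and nova steps as in the first section; a sketch with illustrative names, where the image name, kernel/ramdisk ids and flavor depend on the original instance:
  # upload the repaired image; kernel/ramdisk ids are those of the original image
glance add name="openstack.jenkins-2013-05-11" disk_format=ami container_format=ami \
  kernel_id=<kernel-id-of-original-image> \
  ramdisk_id=<ramdisk-id-of-original-image> < openstack.jenkins.img
# boot a new instance from it on an enabled host
nova boot --image 'openstack.jenkins-2013-05-11' \
  --flavor <flavor-of-original-instance> \
  --key_name loic --availability_zone=bm0001 --poll openstack.jenkins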
