TcaplusDB Data Backup: Gluster Node Maintenance
Gluster node scale-out
First install gluster-server on the new gluster node, assemble the data disks into an array and mount it, then add the new node as a gluster peer (see the example below).
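For example, from any existing gluster node (the IP below is a placeholder for the new node's address):
gluster peer probe 9.xx.xx.xx        # add the new node to the trusted pool
gluster peer status                  # the new node should show "State: Peer in Cluster (Connected)"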
Add bricks:
18:59:32 | Gluster Controller 9.xxx.xxx.16 | PWD: /var/log/glusterfs
[root@xxx /var/log/glusterfs]# gluster volume add-brick glusterfs_na_qcloud_20xxxx21 9.xx.xx.xx:/glusterfs_disk_md0 9.xx.xx.xx:/glusterfs_disk_md0 9.xx.xx.xx:/glusterfs_disk_md0 100.xx.xx.xx:/glusterfs_disk_md0 9.xx.xx.56:/glusterfs_disk_md0 9.xx.xx.xx:/glusterfs_disk_md0 9.xx.xx.xx:/glusterfs_disk_md0 9.xx.xx.xx:/glusterfs_disk_md0 9.xx.xx.221:/glusterfs_disk_md0 9.xx.xx.xx:/glusterfs_disk_md0 force
volume add-brick: success
Rebalance:
[root@xxx /var/log/glusterfs]# gluster volume rebalance glusterfs_na_qcloud_20xxxx21 start force
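The rebalance runs in the background; its progress can be checked with the standard status sub-command:
gluster volume rebalance glusterfs_na_qcloud_20xxxx21 status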
Repairing a failed gluster node
If it is a one-for-one machine replacement (same machine model, different IP), refer to this article: https://icicimov.github.io/blog/high-availability/Replacing-GlusterFS-failed-node/
Note: if the data disks are intact (i.e. lsblk still shows them forming the RAID 0 array) and only the system disk failed and was reinstalled, some of the steps below can be skipped (2, 3, 4, 14, 15, 16). If the system disk is fine but the data disks lost their array metadata during repair, rebuild the mdadm array and then run steps 14-17. A quick check is shown below.
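To check whether the data-disk array survived (standard commands, run on the repaired machine):
lsblk                  # the data disks should show up as members of md0
cat /proc/mdstat       # md0 should be listed as an active raid0 array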
1. Install the GlusterFS server packages
cd gluster3.8.15
echo "glusterfs-libs-3.8.15-1.el6.x86_64.rpm
glusterfs-3.8.15-1.el6.x86_64.rpm
glusterfs-client-xlators-3.8.15-1.el6.x86_64.rpm
glusterfs-fuse-3.8.15-1.el6.x86_64.rpm
glusterfs-cli-3.8.15-1.el6.x86_64.rpm
glusterfs-api-3.8.15-1.el6.x86_64.rpm
python-argparse-1.2.1-2.1.el6.noarch.rpm
pyxattr-0.5.0-1.el6.x86_64.rpm
userspace-rcu-0.7.3-1.el6.x86_64.rpm" | xargs -i rpm -ivh {}
yum install -y lvm2.x86_64 nfs-utils rpcbind
yum -y install libudev-devel.x86_64
echo "device-mapper-devel-1.02.66-6.el6.x86_64.rpm
device-mapper-event-devel-1.02.66-6.el6.x86_64.rpm
glusterfs-server-3.8.15-1.el6.x86_64.rpm" | xargs -i rpm -ivh {}
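A quick sanity check that the packages landed (version string assumes the 3.8.15 RPMs above):
rpm -qa | grep glusterfs    # server, fuse, cli and api packages should all be at 3.8.15
gluster --version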
2. If the array is a software RAID 0 built with mdadm, it cannot recover on its own; tear down md0
umount /dev/md0
mdadm --stop /dev/md0
3. Recreate md0
echo "y" | mdadm -C /dev/md0 -l 0 -n 12 /dev/sd[b,c,d,e,f,g,h,i,j,k,l,m]
4. Format the array
mkfs.ext4 /dev/md0
5. Write the mdadm config file and mount
mdadm --detail --scan >/etc/mdadm.conf
mdadm -As /dev/md0
mount -o noatime,acl,user_xattr /dev/md0 /glusterfs_disk_md0
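Optionally confirm the mount before continuing:
df -h /glusterfs_disk_md0
mount | grep md0            # /dev/md0 should be mounted on /glusterfs_disk_md0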
6. On one of the surviving gluster nodes, look up the UUID of the failed machine
[root@xxx ~]# grep -rn XXX.XXX.211.43 /var/lib/glusterd/peers/
/var/lib/glusterd/peers/afd8de7e-6daf-4861-9b4e-07a6103e756b:3:hostname1=XXX.XXX.211.43
7. Stop the glusterd service
systemctl stop glusterd.service
8. Write the UUID obtained in step 6 into the glusterd.info config file, /var/lib/glusterd/glusterd.info, as sketched below.
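A sketch of the edit, using the example UUID from step 6 (substitute the value actually found there):
sed -i 's/^UUID=.*/UUID=afd8de7e-6daf-4861-9b4e-07a6103e756b/' /var/lib/glusterd/glusterd.info
cat /var/lib/glusterd/glusterd.info    # the UUID= line should now carry the failed node's original UUID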
9. Start glusterd
systemctl start glusterd.service
10. Check whether the gluster peers are still there; if any are missing, add them back manually (see the example after the command)
gluster peer status | grep Hostname
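If a peer is missing, probe it again from this node, e.g. the surviving controller used in step 13 (IP is the example value from this doc):
gluster peer probe 9.XXX.XXX.16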
11. Restart glusterd once more and check the peer status again: systemctl restart glusterd.service
12. Check the volume status of the node itself
gluster volume status | grep 211.43 (grep the local IP; while the node is faulty this prints a single line, two lines when healthy)
Brick XXX.XXX.211.43:/glusterfs_disk_md0 N/A N/A N N/A
13. Pull the volume information again, filling in the IP of another surviving gluster node
gluster volume sync 9.XXX.XXX.16 all
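A quick check that the volume definition is back locally (volume name as used elsewhere in this doc):
gluster volume info glusterfs_na_qcloud_20XXXX21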
14. Install attr on any one surviving gluster node and on the failed gluster node: yum install -y attr
15. On the surviving gluster node, query the volume ID of the brick mount point
[root@xxx ~]# getfattr -n trusted.glusterfs.volume-id /glusterfs_disk_md0
getfattr: Removing leading '/' from absolute path names
# file: glusterfs_disk_md0
trusted.glusterfs.volume-id=0s5zO2a7DvS7XXXXXX+XlVww==
16. On the failed gluster node, set the volume ID on the brick mount point to the value queried in step 15
setfattr -n trusted.glusterfs.volume-id -v '0s5zO2a7DvS7XXXXXX+XlVww==' /glusterfs_disk_md0
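To confirm the attribute took effect, rerun the query from step 15 on the failed node; it should print the same value as on the surviving node:
getfattr -n trusted.glusterfs.volume-id /glusterfs_disk_md0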
17. Restart the glusterd service and kick off data healing
systemctl restart glusterd.service
gluster volume status
gluster volume heal glusterfs_na_qcloud_20XXXX21 full
gluster volume heal glusterfs_na_qcloud_20XXXX21 info
18. Watch the network traffic: sar -n DEV 1
If the array was software RAID 5, the replacement takes just two commands after the failed disk is swapped:
mdadm --zero-superblock /dev/sdh
mdadm /dev/md0 --add /dev/sdh
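Once the disk is added back, the RAID 5 rebuild can be watched with standard mdadm/procfs tools:
cat /proc/mdstat             # shows the recovery progress of md0
mdadm --detail /dev/md0      # State should move from "clean, degraded, recovering" to "clean"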