XenLvmDrbd
From OptionC
LVM (v2) + DRBD (v 0.7) + Heartbeat (v1.2.3) on Debian/Sarge
For some reason I decided that I wanted to be able to fail over some of my domU machines (web sites) to a remote location. I looked at various distributed file sharing solutions (GFS/GNBD, XFS, iSCSI, ATA over Ethernet), and even considered just doing old-school mirroring with rsync, but opted for this combination because of the hardware at hand, and the fact that this will be running over a WAN.
There are a lot of good howtos and guides out there already, but I still managed to spend a lot of time with the fiddly bits, so I'm writing down the instructions so that I can remember where I got stuck. Also, keep in mind that this is not an optimal, or even recommended, configuration. For example, normally you'd want a dedicated NIC on each node and a crossover cable, plus some redundancy (perhaps a serial connection) for the heartbeat connection. However, since this is over a WAN, none of that really works.
From the official pages... "Drbd as LVM2 "physical" volumes does work, but you should know what you are doing ..." (http://linux-ha.org/DataRedundancyByDrbd#head-25da1eb596fac86b9b39073f3a9566007cc6a505)
drbd
1. Install the necessary packages (from Sarge/Stable 2005-08-03). In my case the kernels are the same, so I build on one machine and only need the utils on both nodes:
milo1:/ # apt-get install drbd0.7-module-source module-assistant drbd0.7-utils
milo2:/ # apt-get install drbd0.7-utils
drbd0.7-utils contains the administrative utilities; it's a dependency of the module you are about to build, so you might as well install it at the beginning.
2. Build the modules - make sure the source is the correct source for the Xen kernel you are running (as in, already patched and such)
milo1:/ # ARCH=xen module-assistant --kernel-dir=/usr/src/kernel-source-VERSION build drbd0.7-module
You should end up with something like drbd0.7-module-*.deb in the directory above your kernel source directory.
3. Install and test the module
milo1:/ # dpkg -i /usr/src/drbd0.7-module-*.deb
milo1:/ # update-modules
milo1:/ # modprobe drbd
milo1:/ # lsmod
Module                  Size  Used by
drbd                  147312  2
Copy drbd0.7-module*.deb to the second node and repeat.
4. Prep your file system
- You will want physical partitions of the same size on both machines
- It's probably best to start with clean partitions (I didn't and it was okay, but I'll explain that later)
- drbd needs about 128M for its meta-data; you can dedicate a partition to that, but it's simpler to allow it to be stored internally. It will grab the space it needs from the end of the partition
5. Configure drbd
You must be able to resolve the remote node by its hostname (this goes for heartbeat as well), so you'll need entries in the hosts files for all the nodes.
# /etc/hosts
127.0.0.1 localhost
192.168.0.1 milo1
192.168.0.2 milo2
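Since both drbd and heartbeat depend on these entries, a quick check that every node name is actually listed can save some head-scratching later. This is only a sketch: hosts_missing is a made-up helper name, and it only looks at the hosts file you point it at (it knows nothing about DNS).

```shell
#!/bin/sh
# hosts_missing is a hypothetical helper, not part of drbd or heartbeat.
# Usage: hosts_missing HOSTS_FILE NAME...
# Prints each NAME that has no entry in HOSTS_FILE; succeeds when none are missing.
hosts_missing() {
    file=$1
    shift
    missing=0
    for name in "$@"; do
        # skip comment lines, then look for the name among the
        # hostnames/aliases (fields 2 onwards) of each remaining line
        if ! awk -v n="$name" '
                /^[ \t]*#/ { next }
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }
             ' "$file"; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}

# Example (run on each node):
#   hosts_missing /etc/hosts milo1 milo2 && echo "all nodes listed"
```

Run it on both nodes; any name it prints still needs a line in that machine's hosts file.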
Similar (or identical) hardware is best if you have it, but I didn't. I'm posting my configuration as it is - not to be confusing, but to show that you can still get this to work on such a wacky collection of hardware, as long as you are careful about the obvious stuff (such as making sure the partitions are the same size). If you install from Debian there should be a well-annotated sample in /etc/drbd.conf that you will probably want to take at least a quick look at.
#/etc/drbd.conf
resource r1 {
protocol C;
incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
startup {
degr-wfc-timeout 90;
}
disk {
on-io-error detach;
}
syncer {
rate 10M;
group 2;
al-extents 257;
}
on milo1 {
device /dev/drbd1;
disk /dev/sda3;
address 192.168.0.1:7789;
meta-disk internal;
}
on milo2 {
device /dev/drbd1;
disk /dev/hdd3;
address 192.168.0.2:7789;
meta-disk internal;
}
}
"Device" is the drbd block device - this should be the same on each machine for a given resource. "Disk" is the underlying hardware.
The drbd.conf file should be the same on both nodes, so once you've created it, copy it over to the other machine.
6. Start drbd
Drbd isn't started automatically (make sure the physical partitions that you are about to create your drbd devices upon are not mounted before starting); you'll need to start it on both nodes.
milo1:/ # /etc/init.d/drbd start
milo2:/ # /etc/init.d/drbd start
drbd devices have two modes - primary and secondary (well, technically three - if a node can't reach the other node, the mode for it is "unknown"). The node needs to be in primary mode in order to mount/read the drbd device. drbd will come up on both nodes in "secondary" mode - this way neither node tries to synchronize with the other until it gets further direction from you.
If you had data on a partition on one machine and want to keep it, make sure that you run this command on that machine. If you are starting fresh, it doesn't really matter.
milo1:/ # drbdsetup /dev/drbd1 primary --do-what-I-say
You can check the drbd status with:
milo1:/ # cat /proc/drbd
>version: 0.7.10 (api:77/proto:74)
>SVN Revision: 1743 build by phil@mescal, 2005-01-31 12:22:07
>1: cs:Connected st:Secondary/Secondary ld:Consistent
> ns:1353580 nr:136292 dw:438352 dr:2256514 al:785 bm:1491 lo:0 pe:0 ua:0 ap:0
If you had data on the primary node, you will probably need to wait for the synchronization to complete before moving to the next step.
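If you'd rather not eyeball /proc/drbd while waiting, something like the following can pull out the cs/st/ld fields from the status line shown above (without the leading ">" markers). It is a rough sketch: drbd_field and drbd_in_sync are names I made up, and treating "cs:Connected" plus "ld:Consistent" as "synchronization finished" is my assumption about the 0.7 status format, not something taken from the drbd documentation.

```shell
#!/bin/sh
# drbd_field and drbd_in_sync are hypothetical helpers, not drbd tools.

# Usage: drbd_field MINOR KEY < /proc/drbd
# Prints the value of KEY (cs, st or ld) on the status line for device MINOR.
drbd_field() {
    awk -v minor="$1" -v key="$2" '
        # the status line starts with the minor number and a colon, e.g. "1: cs:..."
        $1 == (minor ":") {
            for (i = 2; i <= NF; i++) {
                n = index($i, ":")
                if (n > 0 && substr($i, 1, n - 1) == key) {
                    print substr($i, n + 1)
                    exit
                }
            }
        }
    '
}

# Usage: drbd_in_sync MINOR < /proc/drbd
# Succeeds only when the device reports cs:Connected and ld:Consistent
# (assumed here to mean that no synchronization is in progress).
drbd_in_sync() {
    status=$(cat)
    [ "$(printf '%s\n' "$status" | drbd_field "$1" cs)" = "Connected" ] &&
    [ "$(printf '%s\n' "$status" | drbd_field "$1" ld)" = "Consistent" ]
}

# Example: poll every 10 seconds until /dev/drbd1 looks synchronized
#   until drbd_in_sync 1 < /proc/drbd; do sleep 10; done
```

The minor number is the "1" in /dev/drbd1; adjust it if your device is numbered differently.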
lvm
1. Install lvm on both nodes
milo1:/# apt-get install lvm2
milo2:/# apt-get install lvm2
If you aren't at all familiar with lvm, you might want to look at the information about LVM on this wiki before moving on. Also, I believe that lvm is active on install, but if not you'll need to do something like "/etc/init.d/lvm start"
2. Edit /etc/lvm/lvm.conf
Normally all physical partitions are valid for lvm. However, because some of the tools can "see past" the /dev/drbd device to the underlying hardware (a bad thing) we want to change the default configuration so that we look for /dev/drbd devices and not the hardware. Make alterations such as the following to /etc/lvm/lvm.conf (this is based on the previous configurations).
# by default only the cdrom drive is excluded - exclude the underlying partition, add the drbd device
filter = [ "r|/dev/cdrom|", "r|/dev/sda3|", "a|/dev/drbd1|" ]
# List of pairs of additional acceptable block device types found
# in /proc/devices with maximum (non-zero) number of partitions.
types = [ "drbd", 16 ]
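To convince yourself what a filter line does to a given device, here is a small emulation of the matching rule as I understand it: patterns are tried in order, the first one that matches decides, and a device that matches nothing is accepted. lvm_filter_verdict is a made-up helper; it uses grep's extended regular expressions as a stand-in for lvm's own regex engine, which is fine for plain paths like these but is not guaranteed to agree on fancier patterns.

```shell
#!/bin/sh
# lvm_filter_verdict is a hypothetical helper that imitates lvm.conf filter
# matching; it does not call lvm itself.
# Usage: lvm_filter_verdict DEVICE PATTERN...
# Each PATTERN has the lvm.conf form a|regex| (accept) or r|regex| (reject).
lvm_filter_verdict() {
    dev=$1
    shift
    for pat in "$@"; do
        action=${pat%%"|"*}     # leading "a" or "r"
        regex=${pat#*"|"}       # drop the action and the first delimiter
        regex=${regex%"|"}      # drop the trailing delimiter
        if printf '%s\n' "$dev" | grep -Eq -- "$regex"; then
            case $action in
                a) echo accept ;;
                r) echo reject ;;
            esac
            return 0
        fi
    done
    echo accept                 # nothing matched: accepted by default
}

# Example with the filter from this page:
#   lvm_filter_verdict /dev/hdd3 'r|/dev/cdrom|' 'r|/dev/sda3|' 'a|/dev/drbd1|'
```

Note what the example prints: with the milo1 filter, /dev/hdd3 is accepted. Going by the drbd.conf on this page, /dev/hdd3 is the underlying partition on milo2, so presumably the filter on that node needs to reject /dev/hdd3 rather than /dev/sda3.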
3. Set up lvm on the primary node (you won't be able to on the secondary, as you don't have write access).
milo1:/# pvcreate /dev/drbd1
milo1:/# vgcreate my_vg /dev/drbd1
milo1:/# vgchange -a y my_vg
And create a logical volume to test with:
milo1:/# lvcreate -L300M -ntest_lv my_vg
milo1:/# mkfs.ext3 /dev/my_vg/test_lv
milo1:/# mkdir /mnt/lvm1
milo1:/# mount /dev/my_vg/test_lv /mnt/lvm1
milo1:/# echo a > /mnt/lvm1/b
milo1:/# umount /mnt/lvm1
Pause to enjoy your progress until now
If everything is going well, there should be two nodes, both with drbd and lvm. One of the nodes is primary, the other secondary. We created a logical volume on the primary and put some token data there. You're probably curious to see if this replication thing is working.
There is lvm meta-data backed up in /etc/lvm - however that's just a backup. You don't need to synchronize that directory, just scan for the data that has been synchronized to the drbd partition.
To test, on the primary node, deactivate the volume group (you cannot make a primary a secondary while a process has rw access to it):
milo1:/# vgchange -a n my_vg
Make this node the secondary
milo1:/# drbdadm secondary r1
On the secondary node:
milo2:/# drbdadm primary r1
milo2:/# vgscan
milo2:/# vgchange -a y my_vg
milo2:/# mkdir /mnt/lvm1
milo2:/# mount /dev/my_vg/test_lv /mnt/lvm1
milo2:/# ls /mnt/lvm1
b lost+found
milo2:/# echo c > /mnt/lvm1/d
I think you get the idea. Feel free to play with things. The important part is to get the order correct. To make a primary node secondary, deactivate the volume group first, then change drbd from primary to secondary. To make a secondary node primary, first make the old primary secondary (as just described), then make the secondary primary, scan, and activate the volume group. (That made more sense in my head than it probably does reading it.)
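Since the order is the whole point, here it is written down as two shell functions. This is a sketch rather than a tested failover script: run, release_node, acquire_node and the DRY_RUN switch are names I invented for illustration, and by default the functions only print the commands (the same ones used in the manual test above) instead of running them.

```shell
#!/bin/sh
# Hypothetical helpers; RES and VG default to the names used on this page.
RES=${RES:-r1}
VG=${VG:-my_vg}
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands, anything else = run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on the node that is currently primary (unmount file systems first).
release_node() {
    run vgchange -a n "$VG" &&        # 1. deactivate the volume group
    run drbdadm secondary "$RES"      # 2. then make drbd secondary
}

# Run on the other node, only after release_node has finished on the old primary.
acquire_node() {
    run drbdadm primary "$RES" &&     # 1. make drbd primary
    run vgscan &&                     # 2. scan for the replicated lvm meta-data
    run vgchange -a y "$VG"           # 3. activate the volume group
}
```

With DRY_RUN left at 1, calling release_node or acquire_node just shows you the sequence, which is a cheap way to double-check the order before doing it for real.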
heartbeat
1. Install heartbeat on both nodes
milo1:/# apt-get install heartbeat
milo2:/# apt-get install heartbeat
2. Configure heartbeat
In Debian, the default configuration files can be found in /usr/share/doc/heartbeat.
Here are my configurations - they are not the most basic (for example, all the default configurations use "bcast eth0"), but they are close - I've made a few comments.
ha.cf - defines the cluster nodes
#
# begin /etc/heartbeat/ha.cf
#
logfacility local0
# node names must be resolvable to IP address,
# and also match the output of "uname -n" for a given machine
node milo1 milo2
# nodes can also be on separate lines - if you see it this way, it works as well
#node milo1
#node milo2
# my keepalive is longer than most - this is how often (in seconds) the heart beats
keepalive 5
# length of time, in seconds, until the peer node is declared dead
deadtime 90
# typical default configurations use bcast, not ucast
#bcast xen-br0
# local ip is ignored; both are listed so that /etc/ha.d/ha.cf is identical on both machines
ucast xen-br0 192.168.0.1
ucast xen-br0 192.168.0.2
# if one node is primary and should take resources back after it fails and comes back, set to on
auto_failback on
# this is because of a problem with permissions for cl_status - explanation comes later
apiauth cl_status uid=root
#
# end /etc/heartbeat/ha.cf
#
The haresources file - what resources are being shared. In this I'm saying that milo1 is "primary", and the shared resources are drbd/lvm (which we've already set up) and the domUs that we define.
# /etc/heartbeat/haresources
#
milo1 drbddisk::r1 LVM::my_vg xenHA
I made a mess of the "xendomains" script to start things. The original xendomains script:
- normally lives in "/etc/init.d/xendomains"
- on "xendomains start" it starts the domains with configurations in /etc/xen/auto
- on "xendomains stop" it stops all running domains
- if there are already domains running, it exits and won't start domains
I copied it to "/etc/ha.d/resource.d/xenHA". My version:
- on "xenHA start" it starts the domains with configurations in /etc/xen/ha.d
- on "xenHA stop" it stops all running domains
- if there are already domains running, it stops them and continues
You'll no doubt want to decide what behaviour works best in your circumstances, so I won't post my version here. Given that we are mirroring, a clean shutdown after failure of a node is less critical (since the other node already has "control" of the file system), but a clean shutdown is still desirable since (hopefully) most failovers will be intentional. Start-up behaviour is most important, and what you want to do really depends on what the backup machine was doing while it was waiting to be called into service (for example, ours has a storage-intensive but non-critical application that it runs to keep itself busy, and which should be shut down in a failover situation so that the critical apps have all cpu/memory resources).
authkeys (keys used to communicate)
# /etc/ha.d/authkeys
# this is very insecure, but it is best for the first install, as it is one less point of failure
auth 1
1 crc
authkeys must be readable and writable by root only:
chmod 600 /etc/ha.d/authkeys
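A one-line check of the mode is handy to run on both nodes before starting heartbeat. authkeys_mode_ok is a made-up helper name, and the sketch assumes GNU stat (the "-c" option), which is what Debian ships; it only checks the mode bits, not the owner.

```shell
#!/bin/sh
# authkeys_mode_ok is a hypothetical helper, not part of heartbeat.
# Usage: authkeys_mode_ok FILE
# Succeeds when FILE has exactly mode 600 (rw for the owner, nothing else).
authkeys_mode_ok() {
    [ "$(stat -c %a "$1")" = "600" ]
}

# Example:
#   authkeys_mode_ok /etc/ha.d/authkeys || echo "fix the permissions on authkeys"
```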
cl_status - this is an odd one. cl_status is an administrative tool that can tell you all sorts of useful information about heartbeat status. However, there seem to be some permissions issues. Although this topic has been mentioned in the lists, and it looks like patches have been submitted, if you try to use it and get _nothing_ (or in the logs get "client can't authenticate" errors) then...
- make sure the /etc/ha.d/ha.cf has the following line
apiauth cl_status uid=root
- change the group of cl_status
chgrp root /usr/bin/cl_status
Other posts suggest various things (make sure root is a member of haclient, change it to suid) - all I know is that the chgrp step above worked for me and the others did not.
Test the heartbeat resource scripts to ensure they are able to bring up/down all the resources. First make sure they are down:
milo1:/# vgchange -a n my_vg
milo1:/# drbdadm secondary r1
Then test the heartbeat-specific scripts:
milo1:/# /etc/ha.d/resource.d/drbddisk r1 start
milo1:/# /etc/ha.d/resource.d/LVM my_vg start
If you've got scripts for starting/stopping domUs, test them here:
milo1:/# /etc/ha.d/resource.d/xenHA start
milo1:/# /etc/ha.d/resource.d/xenHA stop
Otherwise, just stop the volume group and make the drbd device secondary
milo1:/# /etc/ha.d/resource.d/LVM my_vg stop
milo1:/# /etc/ha.d/resource.d/drbddisk r1 stop
Bring up heartbeat on both machines
milo1:/# /etc/init.d/heartbeat start
milo2:/# /etc/init.d/heartbeat start
Check status on both nodes
milo1:/# /etc/init.d/heartbeat status
heartbeat OK [pid 2959 et al] is running on milo1 [milo1]...
milo2:/# /etc/init.d/heartbeat status
heartbeat OK [pid 2959 et al] is running on milo2 [milo2]...
Honestly, at this point there will probably be lots of log monitoring and stuff to do. As with most things, I've just spent a long time getting up to speed, and although I would test a domU configuration from scratch, dom0 testing requires the physical hardware. So, if you're here and things still aren't working, sorry. The "cl_status" tool I mentioned earlier is very helpful at this point. Most of the logs go to "/var/log/syslog". Another pair of helpful tools are hb_standby and hb_takeover (usually in /usr/lib/heartbeat), which allow you to gracefully/manually give up resources or take them over.
Gotchas (well, they got me)
It's getting late, and this wiki descends into madness. However, here are a few more points to remind myself of issues...
- cl_status permissions
- xen-br0 instead of eth0
- "mandatory" ip address (it is not actually mandatory, and it doesn't play well with the Xen scripts)
- lvm.conf (don't scan the underlying hardware)
- already have lvm data? then you might just get lucky. make the changes (as I did) and the meta-data from lvm is read as part of the /dev/drbd* device just like it was the physical device
- firewall - I didn't have this problem, but according to the lists, 9 out of 10 issues with "the nodes won't talk to each other" are because of some sort of filtering that stops packets getting through on udp port 694
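For the firewall point, the rules you need are roughly the ones printed by the sketch below. It assumes an iptables firewall that filters the INPUT chain, which may not match your setup at all; cluster_fw_rules is a made-up helper, and it only prints the rules so you can review them before applying anything. The udp port is the heartbeat one mentioned above; the tcp port is the 7789 from the address lines in the drbd.conf on this page.

```shell
#!/bin/sh
# cluster_fw_rules is a hypothetical helper; it prints rules, it does not apply them.
# Usage: cluster_fw_rules PEER_IP
cluster_fw_rules() {
    peer=$1
    # heartbeat traffic from the peer: udp port 694
    echo "iptables -A INPUT -p udp -s $peer --dport 694 -j ACCEPT"
    # drbd replication traffic from the peer: tcp port 7789 (from drbd.conf)
    echo "iptables -A INPUT -p tcp -s $peer --dport 7789 -j ACCEPT"
}

# Example, on milo1 (the peer is milo2):
#   cluster_fw_rules 192.168.0.2
```

Run it on each node with the other node's address, and only paste the output into a root shell once you are happy it fits in with your existing rules.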
Recommended Reading
- DRBD quickstart (http://www.drbd.org/quickstart.html) - from drbd.org
- Getting Started with Heartbeat (http://wiki.linux-ha.org/GettingStarted) - from linux-ha.org
- Data Redundancy By DRBD (http://linux-ha.org/DataRedundancyByDrbd) - by Lars Ellenberg
- Debian, Xen and DRBD: Enabling true server redundancy (http://lists.xensource.com/archives/html/xen-devel/2005-06/msg00544.html) - from the xen-devel list

