XenLvmDrbd

From OptionC

LVM (v2) + DRBD (v 0.7) + Heartbeat (v1.2.3) on Debian/Sarge

For some reason I decided that I wanted to be able to fail over some of my domU machines (web sites) to a remote location. I looked at various shared-storage options (GFS/GNBD, XFS, iSCSI, ATA over Ethernet), and even considered just doing old-school mirroring with rsync, but opted for this combination because of the hardware at hand and the fact that this will be running over a WAN.

There are a lot of good howtos and guides out there already, but I still managed to spend a lot of time on the fiddly bits, so I'm writing down the instructions so that I can remember where I got stuck. Also, keep in mind that this is not an optimal, or even recommended, configuration. For example, normally you'd want a dedicated NIC on each node and a crossover cable, plus some redundancy (perhaps a serial connection) for the heartbeat connection. However, since this is over a WAN, none of that really works.

From the official pages... "Drbd as LVM2 "physical" volumes does work, but you should know what you are doing ..." (http://linux-ha.org/DataRedundancyByDrbd#head-25da1eb596fac86b9b39073f3a9566007cc6a505)

drbd

1. Install the necessary packages (from Sarge/Stable 2005-08-03). In my case the kernels are the same, so I build on one machine and only need the utils on both nodes:

milo1:/ # apt-get install drbd0.7-module-source module-assistant drbd0.7-utils
milo2:/ # apt-get install drbd0.7-utils

drbd0.7-utils contains the administrative utilities. It's a dependency of the module package you are about to build, so you might as well install it at the beginning.

2. Build the modules - make sure the source is the correct source for the Xen kernel you are running (as in, already patched and such)

 milo1:/ # ARCH=xen module-assistant --kernel-dir=/usr/src/kernel-source-VERSION build drbd0.7-module

You should end up with something like drbd0.7-module-*.deb in the directory above your kernel source directory.

3. Install and test the module

milo1:/ # dpkg -i /usr/src/drbd0.7-module-*.deb
milo1:/ # update-modules
milo1:/ # modprobe drbd
milo1:/ # lsmod
Module                  Size  Used by
drbd                  147312  2 

Copy drbd0.7-module*.deb to the second node and repeat.

4. Prep your file system

  • You will want physical partitions of the same size on both machines
  • It's probably best to start with clean partitions (I didn't and it was okay, but I'll explain that later)
  • drbd needs about 128M for its meta-data; you can dedicate a partition to that, but it's simpler to allow it to be stored internally. It will grab the space it needs from the end of the partition
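
One way to check the first point is to compare the raw partition sizes before configuring anything. This is only a sketch: the helper names are made up for illustration, it assumes `blockdev` (from util-linux) is installed and that the peer is reachable over ssh as root, and the 128M figure is just the rough estimate given above rather than an exact value.

```shell
#!/bin/sh
# Rough meta-data reservation, using the "about 128M" estimate from the text.
META_BYTES=$((128 * 1024 * 1024))

# usable_bytes SIZE: bytes left for data once internal meta-data is set aside.
usable_bytes() {
    echo $(( $1 - META_BYTES ))
}

# same_size A B: succeed only if the two byte counts are identical.
same_size() {
    [ "$1" -eq "$2" ]
}

# Example (not run here), using the partitions that appear in drbd.conf below:
#   local_size=$(blockdev --getsize64 /dev/sda3)
#   peer_size=$(ssh root@milo2 blockdev --getsize64 /dev/hdd3)
#   same_size "$local_size" "$peer_size" || echo "partition sizes differ"
```

The two helpers are kept free of device access so they can be tried with plain numbers first.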

5. Configure drbd

You must be able to resolve the remote node by its hostname (this goes for heartbeat as well), so you'll need entries in the hosts files for all the nodes.

# /etc/hosts
127.0.0.1 localhost
192.168.0.1 milo1
192.168.0.2 milo2
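
Before going further it can be worth confirming that every node name really is listed. The following is a small sketch, not part of drbd or heartbeat; `hosts_lookup` and `check_nodes` are hypothetical helper names, and it only reads a hosts-format file (it does not query DNS).

```shell
#!/bin/sh
# hosts_lookup NAME [FILE]: print the IP address listed for NAME in a
# hosts-format file (default /etc/hosts); print nothing if NAME is absent.
hosts_lookup() {
    awk -v n="$1" '
        { sub(/#.*/, "") }                      # drop comments
        { for (i = 2; i <= NF; i++)
              if ($i == n) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# check_nodes FILE NAME...: report every node name missing from FILE.
# Returns non-zero if at least one name is missing.
check_nodes() {
    file="$1"; shift
    missing=0
    for node in "$@"; do
        if [ -z "$(hosts_lookup "$node" "$file")" ]; then
            echo "no entry for $node in $file"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_nodes /etc/hosts milo1 milo2` on each node should print nothing when both entries are present.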

Similar (or identical) hardware is best if you have it, but I didn't. I'm posting my configuration as it is - not to be confusing, but to show that you can still get this to work on such a wacky collection of hardware, as long as you are careful about the obvious stuff (such as making sure the partitions are the same size). If you install from Debian there should be a well-annotated sample in /etc/drbd.conf that you will probably want to take at least a quick look at.

 #/etc/drbd.conf
 resource r1 {
   protocol C;
   incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

   startup {
     degr-wfc-timeout 90;
   }

   disk {
     on-io-error   detach;
   }

   syncer {
     rate 10M;
     group 2;
     al-extents 257;
   }

   on milo1 {
     device     /dev/drbd1;
     disk       /dev/sda3;
     address    192.168.0.1:7789;
     meta-disk  internal;
   }

   on milo2 {
     device     /dev/drbd1;
     disk       /dev/hdd3;
     address    192.168.0.2:7789;
     meta-disk  internal;
   }
 }

"Device" is the drbd block device - this should be the same on each machine for a given resource. "Disk" is the underlying hardware.

The drbd.conf file should be the same on both nodes, so once you've created it, copy it over to the other machine.

6. Start drbd

Drbd isn't started automatically, so you'll need to start it on both nodes. Before starting, make sure the physical partitions that will back your drbd devices are not mounted.

milo1:/ #  /etc/init.d/drbd start
milo2:/ #  /etc/init.d/drbd start

drbd devices have two modes - primary and secondary (well, technically three - if a node can't reach the other node, the mode for it is "unknown"). A node needs to be in primary mode in order to mount/read the drbd device. drbd will come up on both nodes in "secondary" mode - this way neither node tries to synchronize with the other until it gets further direction from you.

If you had data on a partition on one machine and want to keep it, make sure that you run the following command on that machine. If you are starting fresh, it doesn't really matter which node you pick.

milo1:/ #  drbdsetup /dev/drbd1 primary --do-what-I-say

You can check the drbd status with:

milo1:/ #  cat /proc/drbd
>version: 0.7.10 (api:77/proto:74)
>SVN Revision: 1743 build by phil@mescal, 2005-01-31 12:22:07
>1: cs:Connected st:Secondary/Secondary ld:Consistent
>  ns:1353580 nr:136292 dw:438352 dr:2256514 al:785 bm:1491 lo:0 pe:0 ua:0 ap:0

If you had data on the primary node, you will probably need to wait for the synchronization to complete before moving to the next step.
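
Rather than re-running cat by hand, the status line can be parsed and polled. This is a sketch under assumptions: `drbd_field` and `wait_for_sync` are hypothetical helper names, the parsing relies on the /proc/drbd layout shown above, and it assumes that in 0.7 the cs: field reads SyncSource or SyncTarget during a resync and returns to Connected afterwards.

```shell
#!/bin/sh
# drbd_field MINOR KEY: read /proc/drbd-style text on stdin and print the
# value of KEY (cs, st or ld) from the status line for device number MINOR.
drbd_field() {
    awk -v m="$1:" -v k="$2:" '
        $1 == m {
            for (i = 2; i <= NF; i++)
                if (index($i, k) == 1) { print substr($i, length(k) + 1); exit }
        }'
}

# wait_for_sync MINOR: poll every 10 seconds until the device reports
# cs:Connected (assumed here to mean that no resync is in progress).
wait_for_sync() {
    until [ "$(drbd_field "$1" cs < /proc/drbd)" = "Connected" ]; do
        sleep 10
    done
}
```

For the device on this page, `drbd_field 1 st < /proc/drbd` would print the local/peer roles, and `wait_for_sync 1` would block until the initial synchronization is over.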

lvm

1. Install lvm on both nodes

milo1:/# apt-get install lvm2
milo2:/# apt-get install lvm2

If you aren't at all familiar with lvm, you might want to look at the information about LVM on this wiki before moving on. Also, I believe that lvm is active on install, but if not you'll need to do something like "/etc/init.d/lvm start"

2. Edit /etc/lvm/lvm.conf

Normally all physical partitions are valid for lvm. However, because some of the tools can "see past" the /dev/drbd device to the underlying hardware (a bad thing), we want to change the default configuration so that lvm looks at the /dev/drbd devices and not the hardware. Make alterations such as the following to /etc/lvm/lvm.conf (this is based on the previous configurations).

# by default only the cdrom drive is excluded - exclude the underlying partition, add the drbd device
filter = [ "r|/dev/cdrom|", "r|/dev/sda3|", "a|/dev/drbd1|" ]
# List of pairs of additional acceptable block device types found
# in /proc/devices with maximum (non-zero) number of partitions.
types = [ "drbd", 16 ]

3. Set up lvm on the primary node (you won't be able to on the secondary, as you don't have write access).

milo1:/# pvcreate /dev/drbd1
milo1:/# vgcreate my_vg /dev/drbd1
milo1:/# vgchange -a y my_vg

And create a logical volume to test with:

milo1:/# lvcreate -L300M -ntest_lv my_vg
milo1:/# mkfs.ext3 /dev/my_vg/test_lv
milo1:/# mkdir /mnt/lvm1
milo1:/# mount /dev/my_vg/test_lv /mnt/lvm1
milo1:/# echo a > /mnt/lvm1/b
milo1:/# umount /mnt/lvm1

Pause to enjoy your progress until now

If everything is going well, there should be two nodes, both with drbd and lvm. One of the nodes is primary, the other secondary. We created a logical volume on the primary and put some token data there. You're probably curious to see if this replication thing is working.

There is lvm meta-data backed up in /etc/lvm - however that's just a backup. You don't need to synchronize that directory, just scan for the data that has been synchronized to the drbd partition.

To test, on the primary node, deactivate the volume group (you cannot make a primary a secondary while a process has rw access to it):

milo1:/# vgchange -a n my_vg

Make this node the secondary

milo1:/# drbdadm secondary r1

On the secondary node:

milo2:/# drbdadm primary r1
milo2:/# vgscan
milo2:/# vgchange -a y my_vg
milo2:/# mkdir /mnt/lvm1
milo2:/# mount /dev/my_vg/test_lv /mnt/lvm1
milo2:/# ls /mnt/lvm1
 b
 lost+found
milo2:/# echo c > /mnt/lvm1/d

I think you get the idea. Feel free to play with things. The important part is to get the order correct. To make a primary node secondary, first deactivate the volume group, then switch drbd from primary to secondary. To make a secondary node primary, first make sure the old primary has been made secondary, then make this node primary, scan for the volume group and activate it. (That made more sense in my head than it probably does reading it.)
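
The ordering can be written down as two small functions so that it is harder to get wrong. This is just a sketch wrapping the commands already used above; the function names and the RUN dry-run switch are my own additions for illustration, and the resource and volume group names are the ones from this page.

```shell
#!/bin/sh
# Resource and volume group names from the examples on this page.
RES="${RES:-r1}"
VG="${VG:-my_vg}"
# Set RUN=echo to print the commands instead of executing them (dry run).
RUN="${RUN:-}"

# Give up the storage: release LVM first, then demote drbd.
demote() {
    $RUN vgchange -a n "$VG" &&
    $RUN drbdadm secondary "$RES"
}

# Take over the storage: promote drbd first, then find and activate the VG.
# Run this only after the other node has been demoted.
promote() {
    $RUN drbdadm primary "$RES" &&
    $RUN vgscan &&
    $RUN vgchange -a y "$VG"
}
```

With `RUN=echo` set, calling `demote` or `promote` only prints the commands, which is a cheap way to check the order before touching real devices. Any mounted file systems on the volume group still have to be unmounted before `demote`.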

heartbeat

1. Install heartbeat on both nodes

milo1:/# apt-get install heartbeat
milo2:/# apt-get install heartbeat

2. Configure heartbeat

In Debian, the default configuration files can be found in /usr/share/doc/heartbeat.

Here are my configurations. They are not the most basic (for example, all the default configurations use "bcast eth0"), but they are close, and I've added a few comments.

ha.cf - defines the cluster nodes

 #
 # begin /etc/heartbeat/ha.cf
 #

 logfacility     local0

 # node names must be resolvable to IP address, 
 # and also match the output of "uname -n" for a given machine
 node milo1 milo2         

 # nodes can also be on separate lines - if you see it written this way, it works as well
 #node milo1 
 #node milo2 

 # my keepalive is longer than most - this is how often (in seconds) that the heart beats..
 keepalive 5   

 # length of time, in seconds, until the peer node is declared dead
 deadtime 90

 # typical default configurations used bcast, not unicast
 #bcast xen-br0

 # local ip is ignored; both are listed so that /etc/ha.d/ha.cf is identical on both machines
 ucast xen-br0 192.168.0.1
 ucast xen-br0 192.168.0.2

 # if one node is primary, and should take resources back after it fails and comes back, set to on
 auto_failback on

 # this is because of a problem with permissions for cl_status - explanation follows later
 apiauth cl_status uid=root
 #
 # end /etc/heartbeat/ha.cf
 #

The haresources file - what resources are being shared. In this I'm saying that milo1 is "primary", and the shared resources are drbd/lvm (which we've already set up) and the domUs that we define.

# /etc/heartbeat/haresources
#
milo1 drbddisk::r1 LVM::my_vg xenHA

I made a mess of the "xendomains" script to start things. The original xendomains script:

  • normally lives in "/etc/init.d/xendomains"
  • on "xendomains start" it starts the domains with configurations in /etc/xen/auto
  • on "xendomains stop" it stops all running domains
  • if there are already domains running, it exits and won't start domains

I copied it to "/etc/ha.d/resource.d/xenHA". My version:

  • on "xenHA start" it starts the domains with configurations in /etc/xen/ha.d
  • on "xenHA stop" it stops all running domains
  • if there are already domains running, it stops them and continues

You'll no doubt want to decide what behaviour works best in your circumstances, so I won't post my version here. Given that we are mirroring, a clean shutdown after failure of a node is less critical (since the other node already has "control" of the file system), but a clean shutdown is still desirable since (hopefully) most failovers will be intentional. Start-up behaviour is most important, and what you want to do really depends on what the backup machine was doing while it was waiting to be called into service (for example, ours has a storage-intensive but non-critical application that it runs to keep itself busy, and that should be shut down in a failover situation so that the critical apps have all the cpu/memory resources).
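
For orientation only, here is a minimal skeleton of what such a resource script could look like. It is not my actual script and it leaves out the "stop whatever is already running" behaviour. The assumptions: each file in /etc/xen/ha.d is a domU config containing a name = "..." line, the installed xm accepts a config file path for "xm create" and a domain name for "xm shutdown", and RUN is a dry-run switch added purely for illustration.

```shell
#!/bin/sh
# Directory holding the domU config files, as described above.
CONF_DIR="${CONF_DIR:-/etc/xen/ha.d}"
# Set RUN=echo to print the xm commands instead of running them.
RUN="${RUN:-}"

# domain_name FILE: print the value of the first name = "..." line in FILE.
domain_name() {
    sed -n "s/^[[:space:]]*name[[:space:]]*=[[:space:]]*[\"']\([^\"']*\)[\"'].*/\1/p" "$1" |
        head -n 1
}

# xenha start|stop: create or shut down every domain configured in CONF_DIR.
xenha() {
    case "$1" in
        start)
            for cfg in "$CONF_DIR"/*; do
                [ -f "$cfg" ] || continue
                $RUN xm create "$cfg"
            done
            ;;
        stop)
            for cfg in "$CONF_DIR"/*; do
                [ -f "$cfg" ] || continue
                name=$(domain_name "$cfg")
                [ -n "$name" ] || continue
                $RUN xm shutdown "$name"
            done
            ;;
        *)
            echo "usage: xenHA {start|stop}" >&2
            return 1
            ;;
    esac
}
```

Saved as a script in /etc/ha.d/resource.d, it would end with a line such as `xenha "$1"` so that heartbeat can call it with start or stop.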

authkeys (keys used to communicate)

# /etc/ha.d/authkeys
# this is very insecure, but it is best for the first install, as it is one less point of failure
auth 1
1 crc

authkeys must be readable and writable by root only:

chmod 600 /etc/ha.d/authkeys
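
Once the cluster is working, crc can be swapped for a keyed method. The sketch below writes an authkeys file using heartbeat's sha1 method with a random key; `write_authkeys` is a hypothetical helper name, and deriving the key from /dev/urandom via md5sum is just one convenient way to get a random hex string.

```shell
#!/bin/sh
# write_authkeys FILE: create FILE with a random sha1 key. The umask in the
# subshell makes a newly created file readable and writable by its owner only.
write_authkeys() {
    key=$(dd if=/dev/urandom bs=512 count=1 2>/dev/null | md5sum | cut -d' ' -f1)
    (
        umask 077
        printf 'auth 1\n1 sha1 %s\n' "$key" > "$1"
    )
}
```

The file has to be identical on both nodes, so generate it once (for example `write_authkeys /etc/ha.d/authkeys` on milo1) and copy it to the other node rather than generating it twice. If the file already exists, its old permissions are kept, so the chmod above still applies.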

cl_status - this is an odd one. cl_status is an administrative tool that can tell you all sorts of useful information about heartbeat status. However, there seem to be some permissions issues. Although this topic has been mentioned on the lists, and it looks like patches have been submitted, if you try to use it and get _nothing_ (or get "client can't authenticate" errors in the logs) then...

  • make sure the /etc/ha.d/ha.cf has the following line
apiauth cl_status uid=root
  • change the group of cl_status
chgrp root /usr/bin/cl_status

Other posts suggest various things (make sure root is a member of haclient, change it to suid). All I know is that the last step above worked for me and the others did not.

Test the heartbeat resource scripts to ensure they are able to bring up/down all the resources. First make sure they are down:

milo1:/# vgchange -a n my_vg
milo1:/# drbdadm secondary r1

Then test the heartbeat-specific scripts:

milo1:/# /etc/ha.d/resource.d/drbddisk r1 start 
milo1:/# /etc/ha.d/resource.d/LVM my_vg start

If you've got scripts for starting/stopping domUs, test them here:

milo1:/# /etc/ha.d/resource.d/xenHA start 
milo1:/# /etc/ha.d/resource.d/xenHA stop 

Otherwise, just stop the volume group and make the drbd device secondary

milo1:/# /etc/ha.d/resource.d/LVM my_vg stop 
milo1:/# /etc/ha.d/resource.d/drbddisk r1 stop 

Bring up heartbeat on both machines

milo1:/# /etc/init.d/heartbeat start
milo2:/# /etc/init.d/heartbeat start

Check status on both nodes

milo1:/# /etc/init.d/heartbeat status
 heartbeat OK [pid 2959 et al] is running on milo1 [milo1]...
milo2:/# /etc/init.d/heartbeat status
 heartbeat OK [pid 2959 et al] is running on milo2 [milo2]...

Honestly, at this point there will probably be lots of log monitoring and so on to do. As with most things, I've just spent a long time getting up to speed, and although I would test everything from scratch if this were domU configuration, dom0 testing requires the physical hardware. So, if you're here and things still aren't working, sorry. The "cl_status" tool I mentioned earlier is very helpful at this point. Most of the logs go to "/var/log/syslog". Another pair of helpful tools are hb_standby and hb_takeover (usually in /usr/lib/heartbeat), which allow you to manually and gracefully give up resources or take them over.

Gotchas (well, they got me)

It's getting late, and this wiki descends into madness. However, here's a few more points to remind myself of issues...

  • cl_status permissions
  • xen-br0 instead of eth0
  • "mandatory" ip address in haresources (it is not actually mandatory, and it doesn't play well with the Xen scripts)
  • lvm.conf (don't scan the underlying hardware)
  • already have lvm data? then you might just get lucky. make the changes (as I did) and the meta-data from lvm is read as part of the /dev/drbd* device just like it was the physical device
  • firewall - I didn't have this problem, but according to the lists, 9 out of 10 cases of "the nodes won't talk to each other" are caused by some sort of filtering that keeps packets for udp port 694 from getting through
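
For the firewall point, a sketch of the rules that would be needed if iptables is filtering on the nodes. The function name and the RUN dry-run switch are made up for illustration; udp 694 is the heartbeat port mentioned above and tcp 7789 is the drbd port from the drbd.conf on this page.

```shell
#!/bin/sh
# Set RUN=echo to print the rules instead of applying them.
RUN="${RUN:-}"

# allow_peer PEER_IP: accept heartbeat (udp 694) and drbd (tcp 7789)
# traffic arriving from the other node.
allow_peer() {
    $RUN iptables -A INPUT -p udp -s "$1" --dport 694 -j ACCEPT &&
    $RUN iptables -A INPUT -p tcp -s "$1" --dport 7789 -j ACCEPT
}
```

On milo1 this would be called as `allow_peer 192.168.0.2`, and on milo2 with 192.168.0.1. Note that -A appends to the end of the chain, so if an earlier rule already drops the traffic, the rules would have to be inserted higher up instead.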

Recommended Reading