stoney conductor: VM Backup

From stoney cloud
Revision as of 13:43, 23 October 2013 by Pat (Talk | contribs)



Overview

This page describes how VMs and VM-Templates are backed up and restored inside the stoney cloud.

Backup

Basic idea

The main idea behind backing up a VM or a VM-Template is to divide the task into three subtasks:

  • Snapshot: Save the machine's state (CPU, memory and disk)
  • Merge: Merge the disk-image snapshot with the live image
  • Retain: Export the snapshot files

A more detailed and technical description for these three sub-processes can be found here.

Furthermore there is a control instance which can independently call these three sub-processes for a given machine. This allows the stoney cloud to handle different cases:

Backup a single machine

The procedure for backing up a single machine is very simple: call the three sub-processes (snapshot, merge and retain) one after the other. The control instance therefore only needs some very basic logic:

object machine = args[0];

if ( snapshot( machine ) )
{
    if ( merge( machine ) )
    {
        if ( retain( machine ) )
        {
            printf("Successfully backed up machine %s\n", machine);
        } else
        {
            printf("Error while retaining machine %s: %s\n", machine, error);
        }
    } else
    {
        printf("Error while merging machine %s: %s\n", machine, error);
    }
} else
{
    printf("Error while snapshotting machine %s: %s\n", machine, error);
}

Backup multiple machines at the same time

When backing up multiple machines at the same time, we need to make sure that the downtimes of the machines are as close together as possible. The control instance should therefore first call the snapshot process for all machines. After every machine has been snapshotted, it can call the merge and retain processes for each machine. The most important part is that the control instance remembers whether the snapshot for a given machine was successful, because if the snapshot failed, it must not call the merge and retain processes. So the control instance needs a bit more logic:

object machines[] = args[0];
object successful_snapshots[];

# Snapshot all machines
for ( int i = 0; i < sizeof(machines) / sizeof(object); i++ )
{
    # If the snapshot was successful, put the machine into the
    # successful_snapshots array (at the same index)
    if ( snapshot( machines[i] ) )
    {
        successful_snapshots[i] = machines[i];
    } else
    {
        printf("Error while snapshotting machine %s: %s\n", machines[i], error);
    }
}

# Merge and retain all successfully snapshotted machines
for ( int i = 0; i < sizeof(successful_snapshots) / sizeof(object); i++ )
{
    # If the element at this position is not null, the snapshot
    # for this machine was successful
    if ( successful_snapshots[i] )
    {
        if ( merge( successful_snapshots[i] ) )
        {
            if ( retain( successful_snapshots[i] ) )
            {
                printf("Successfully backed up machine %s\n", successful_snapshots[i]);
            } else
            {
                printf("Error while retaining machine %s: %s\n", successful_snapshots[i], error);
            }
        } else
        {
            printf("Error while merging machine %s: %s\n", successful_snapshots[i], error);
        }
    }
}

Sub-Processes

Snapshot

  1. Create a snapshot with state:
    • If the VM vm-001 is running:
      • Save the state of VM vm-001 to the file vm-001.state (This file can either be created on a RAM-Disk or directly in the retain location. This example however saves the file to a RAM-Disk):
        virsh save vm-001 /path/to/ram-disk/vm-001.state
      • After this command, the VM's CPU and memory state is represented by the file /path/to/ram-disk/vm-001.state and the VM vm-001 is shut down.
    • If the VM vm-001 is shut down:
      • Create a fake state file for the VM:
        echo "Machine is not running, no state file" > /path/to/ram-disk/vm-001.state
  2. Move the disk image /path/to/images/vm-001.qcow2 to the retain location:
    mv /path/to/images/vm-001.qcow2 /path/to/retain/vm-001.qcow2
    • Please note: The retain directory (/path/to/retain/) has to be on the same partition as the images directory (/path/to/images/). This makes the mv operation very fast (only the directory entry is renamed), so the downtime (remember, the VM vm-001 is shut down) is as short as possible.
    • Please also note: If the VM vm-001 has more than one disk image, repeat this step for every disk image.
  3. Create the new (empty) disk image with the old as backing store file:
    qemu-img create -f qcow2 -b /path/to/retain/vm-001.qcow2 /path/to/images/vm-001.qcow2
    • Please note: If the VM vm-001 has more than just one disk-image, repeat this step for every disk-image
  4. Set correct ownership and permission to the newly created image:
    • chmod 660 /path/to/images/vm-001.qcow2
    • chown root:vm-storage /path/to/images/vm-001.qcow2
    • Please note: If the VM vm-001 has more than just one disk-image, repeat these steps for every disk-image
  5. Save the VM's XML description
    • Save the current XML description of VM vm-001 to a file at the retain location:
      virsh dumpxml vm-001 > /path/to/retain/vm-001.xml
  6. Save the backend entry
    • There is no generic command to save the backend entry (since the command depends on the backend). The important point is that the backend entry of the VM vm-001 is saved to the retain location: /path/to/retain/vm-001.backend
  7. Restore the VM vm-001 from its saved state (this will also start the VM):
    virsh restore /path/to/ram-disk/vm-001.state
    • Please note: After this operation the VM vm-001 is running again (it continues where we stopped it), and we have a consistent backup for the VM vm-001:
      • The file /path/to/ram-disk/vm-001.state contains the CPU and memory state of VM vm-001 at time T1
      • The file /path/to/retain/vm-001.qcow2 contains the disk state of VM vm-001 at time T1
        • Important: The live-disk-image /path/to/images/vm-001.qcow2 still references this file as its backing store, so you must not delete or move it!
      • The file /path/to/retain/vm-001.xml contains the XML description of VM vm-001 at time T1
      • The file /path/to/retain/vm-001.backend contains the backend entry of VM vm-001 at time T1
  8. Move the state file from the RAM-Disk to the retain location (if you used the RAM-Disk to save the VM's state)
    • mv /path/to/ram-disk/vm-001.state /path/to/retain/vm-001.state
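The snapshot steps above can be sketched as a small command generator. This is only an illustrative sketch: the helper name snapshot_commands, the default paths and the single-disk assumption are not part of the stoney cloud code, and the backend-specific step 6 is omitted.

```python
def snapshot_commands(vm, running,
                      images_dir="/path/to/images",
                      retain_dir="/path/to/retain",
                      ram_disk="/path/to/ram-disk"):
    """Build the ordered shell commands for the snapshot sub-process of
    one VM with a single disk image (hypothetical helper)."""
    state = "%s/%s.state" % (ram_disk, vm)
    live = "%s/%s.qcow2" % (images_dir, vm)
    kept = "%s/%s.qcow2" % (retain_dir, vm)
    cmds = []
    if running:
        # 1. Save CPU/memory state; this shuts the VM down
        cmds.append("virsh save %s %s" % (vm, state))
    else:
        # 1. Fake state file for a VM that is already shut down
        cmds.append('echo "Machine is not running, no state file" > %s' % state)
    # 2. Move the disk image to the retain location (same partition -> fast)
    cmds.append("mv %s %s" % (live, kept))
    # 3. Re-create the live image with the retained one as backing store
    cmds.append("qemu-img create -f qcow2 -b %s %s" % (kept, live))
    # 4. Ownership and permissions of the new live image
    cmds.append("chmod 660 %s" % live)
    cmds.append("chown root:vm-storage %s" % live)
    # 5. Dump the XML description (step 6, the backend entry, is omitted)
    cmds.append("virsh dumpxml %s > %s/%s.xml" % (vm, retain_dir, vm))
    # 7. Restore (and thereby restart) the VM from its saved state
    if running:
        cmds.append("virsh restore %s" % state)
    # 8. Move the state file from the RAM-Disk to the retain location
    cmds.append("mv %s %s/%s.state" % (state, retain_dir, vm))
    return cmds
```

For a running VM with one disk image this reproduces exactly the command sequence listed in steps 1 to 8 above.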


See also: Snapshot workflow

Merge

  1. Check if the VM vm-001 is running
    • If not, start the VM in paused state:
      virsh start --paused vm-001
  2. Merge the live-disk-image (/path/to/images/vm-001.qcow2) with its backing store file (/path/to/retain/vm-001.qcow2):
    virsh qemu-monitor-command vm-001 --hmp "block_stream drive-virtio-disk0"
    • Please note: If a VM has more than one disk image, repeat this step for every image. Just increase the number at the end of the command. So the command to merge the second disk image would be:
      virsh qemu-monitor-command vm-001 --hmp "block_stream drive-virtio-disk1"
  3. If the machine is running in paused state (meaning we started it in step 1 because it was not running), stop it again:
    • virsh shutdown vm-001

Please note: After these steps, the live-disk-image /path/to/images/vm-001.qcow2 no longer contains a reference to the image at the retain location (/path/to/retain/vm-001.qcow2). This is important for the retain process.
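The per-disk merge commands can be generated mechanically; a minimal sketch (the helper name merge_commands and the disk_count parameter are assumptions):

```python
def merge_commands(vm, disk_count=1):
    """Build the HMP block_stream command for every virtio disk of a VM
    (hypothetical helper; the device index starts at 0)."""
    return ['virsh qemu-monitor-command %s --hmp "block_stream drive-virtio-disk%d"'
            % (vm, i) for i in range(disk_count)]
```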


See also: Merge workflow

Retain

  1. Move all the files from the retain directory (/path/to/retain/) to the backup directory (/path/to/backup/)
    1. Move the VM's state file to the backup directory
      • mv /path/to/retain/vm-001.state /path/to/backup/vm-001.state
    2. Move the VM's disk image to the backup directory
      • mv /path/to/retain/vm-001.qcow2 /path/to/backup/vm-001.qcow2
        • Please note: If the VM vm-001 has more than one disk image, repeat this step for each disk image
    3. Move the VM's XML description file to the backup directory
      • mv /path/to/retain/vm-001.xml /path/to/backup/vm-001.xml
    4. Move the VM's backend entry file to the backup directory
      • mv /path/to/retain/vm-001.backend /path/to/backup/vm-001.backend
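The retain step then reduces to one mv per retained file; a minimal sketch, assuming a single disk image and the hypothetical helper name retain_commands:

```python
def retain_commands(vm, retain_dir="/path/to/retain", backup_dir="/path/to/backup"):
    """Build the mv commands that move all retained files of one VM
    (state, disk image, XML description, backend entry) to the backup
    directory (hypothetical helper, single disk image)."""
    suffixes = ("state", "qcow2", "xml", "backend")
    return ["mv %s/%s.%s %s/%s.%s" % (retain_dir, vm, s, backup_dir, vm, s)
            for s in suffixes]
```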


See also: Retain workflow

Communication through backend

Since the stoney cloud is (as the name already says) a cloud solution, it makes sense to have a backend (in our case OpenLDAP) involved in the whole process. That way it is possible to run the backup jobs decentralized on every vm-node. The control instance can then modify the backend, and these changes are seen by the different backup daemons on the vm-nodes. The communication could look as shown in the following picture (Figure 1):

Figure 1: Communication between the control instance and the prov-backup-kvm daemon through the LDAP backend

Control-Instance Daemon Interaction for creating a Backup with LDIF Examples

The step numbers correspond to the graphical overview above.

Step 00: Backup Configuration for a virtual machine
# The following backup configuration says that the backup should be done daily at 03:00 hours (localtime).
# * * * * * command to be executed
# - - - - -
# | | | | |
# | | | | +----- day of week (0 - 6) (Sunday=0)
# | | | +------- month (1 - 12)
# | | +--------- day of month (1 - 31)
# | +----------- hour (0 - 23)
# +------------- min (0 - 59)
# localtime in the crontab entry
dn: ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
objectclass: top
objectclass: organizationalUnit
objectclass: sstVirtualizationBackupObjectClass
objectclass: sstCronObjectClass
ou: backup
description: This sub tree contains the backup plan for the virtual machine kvm-005.
sstCronMinute: 0
sstCronHour: 3
sstCronDay: *
sstCronMonth: *
sstCronDayOfWeek: *
sstCronActive: TRUE
sstBackupRootDirectory: file:///var/backup/virtualization
sstBackupRetainDirectory: file:///var/virtualization/retain
sstBackupRamDiskLocation: file:///mnt/ramdisk-test
sstVirtualizationDiskImageFormat: qcow2
sstVirtualizationDiskImageOwner: root
sstVirtualizationDiskImageGroup: vm-storage
sstVirtualizationDiskImagePermission: 0660
sstBackupNumberOfIterations: 1
sstVirtualizationVirtualMachineForceStart: FALSE
sstVirtualizationBandwidthMerge: 0
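The sstCron* attributes follow the crontab semantics explained in the comment above. A minimal sketch of how a control instance could decide whether such an entry fires at a given local time (the helper name cron_matches is an assumption; only '*' and plain numbers are handled, not ranges or lists):

```python
from datetime import datetime  # used in the example below


def cron_matches(entry, when):
    """Check whether a backup cron entry (sstCron* values as strings)
    fires at the given local datetime (hypothetical helper)."""
    checks = [
        (entry["sstCronMinute"], when.minute),
        (entry["sstCronHour"], when.hour),
        (entry["sstCronDay"], when.day),
        (entry["sstCronMonth"], when.month),
        # crontab counts Sunday=0; Python's weekday() counts Monday=0
        (entry["sstCronDayOfWeek"], (when.weekday() + 1) % 7),
    ]
    return all(field == "*" or int(field) == value for field, value in checks)
```

With the entry above (minute 0, hour 3, everything else '*'), the check succeeds exactly at 03:00 local time on any day.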
Step 01: Initialize Backup Sub Tree (Control instance daemon)

The sub tree ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch reflects the time when the backup is planned, in the form [YYYY][MM][DD]T[hh][mm][ss]Z (ISO 8601). It is written at the time when the backup is planned and should be executed. The section 20121002T010000Z means the following:

  • Year: 2012
  • Month: 10
  • Day of Month: 02
  • Hour of Day: 01
  • Minutes: 00
  • Seconds: 00

Please be aware that the time is to be written in UTC (see also the comment in the LDIF example below).
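The local-time-to-UTC conversion behind the ou value can be sketched as follows (the helper name backup_ou and the explicit offset parameter are assumptions; production code would use proper timezone handling instead of a fixed offset):

```python
from datetime import datetime, timedelta


def backup_ou(planned_local, utc_offset_hours):
    """Format a planned local backup time as the UTC ISO 8601 basic
    timestamp used for the ou (hypothetical helper).
    utc_offset_hours: local offset from UTC, e.g. 2 for CEST."""
    utc = planned_local - timedelta(hours=utc_offset_hours)
    return utc.strftime("%Y%m%dT%H%M%SZ")
```

A backup planned for 03:00 on 2012-10-02 with daylight-saving time (UTC+2) thus yields the ou value 20121002T010000Z used in the examples.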

# This entry is the place holder for the backup, which is to be executed at 03:00 hours (localtime with daylight-saving). This
# leads to the 20121002T010000Z timestamp (which is written in UTC).
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
objectclass: top
objectclass: sstProvisioning
objectclass: organizationalUnit
ou: 20121002T010000Z
sstProvisioningExecutionDate: 0
sstProvisioningMode: initialize
sstProvisioningReturnValue: 0
sstProvisioningState: 20121002T014513Z
Step 02: Finalize the Initialization (Control instance daemon)
# The attribute sstProvisioningState is updated with the current time by the fc-brokerd when sstProvisioningMode is modified.
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
changetype: modify
replace: sstProvisioningState
sstProvisioningState: 20121002T010001Z
-
replace: sstProvisioningMode
sstProvisioningMode: initialized
Step 03: Start the Snapshot Process (Control instance daemon)

With the setting of the sstProvisioningMode to snapshot, the actual backup process is kicked off by the Control instance daemon.

# The attribute sstProvisioningState is set to zero by the fc-brokerd when sstProvisioningMode is modified to
# snapshot (this way the Provisioning-Backup-KVM daemon knows that it must start the snapshotting process).
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
changetype: modify
replace: sstProvisioningState
sstProvisioningState: 0
-
replace: sstProvisioningMode
sstProvisioningMode: snapshot
Step 04: Starting the Snapshot Process (Provisioning-Backup-KVM daemon)

As soon as the Provisioning-Backup-KVM daemon receives the snapshot command, it sets the sstProvisioningMode to snapshotting to tell the Control instance daemon and other interested parties that it is snapshotting the virtual machine or virtual machine template.

# The attribute sstProvisioningMode is set to snapshotting by the Provisioning-Backup-KVM daemon.
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
changetype: modify
replace: sstProvisioningMode
sstProvisioningMode: snapshotting
Step 05: Finalizing the Snapshot Process (Provisioning-Backup-KVM daemon)

As soon as the Provisioning-Backup-KVM daemon has executed the snapshot command, it sets the sstProvisioningMode to snapshotted, the sstProvisioningState to the current timestamp (UTC) and sstProvisioningReturnValue to zero to tell the Control instance daemon and other interested parties that the snapshot of the virtual machine or virtual machine template is finished.

# The attribute sstProvisioningState is set with the current timestamp by the Provisioning-Backup-KVM daemon when
# the attributes sstProvisioningReturnValue and sstProvisioningMode are set.
# With this combination, the fc-brokerd knows that it can proceed.
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
changetype: modify
replace: sstProvisioningState
sstProvisioningState: 20121002T010011Z
-
replace: sstProvisioningReturnValue
sstProvisioningReturnValue: 0
-
replace: sstProvisioningMode
sstProvisioningMode: snapshotted
Step 06: Start the Merge Process (Control instance daemon)

With the setting of the sstProvisioningMode to merge, the Control instance daemon tells the Provisioning-Backup-KVM daemon to merge the backing file disk image back into the current disk image.

# The attribute sstProvisioningState is set to zero by the fc-brokerd when sstProvisioningMode is modified to
# merge (this way the Provisioning-Backup-KVM daemon knows that it must start the merging process).
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
changetype: modify
replace: sstProvisioningState
sstProvisioningState: 0
-
replace: sstProvisioningMode
sstProvisioningMode: merge
Step 07: Starting the Merge Process (Provisioning-Backup-KVM daemon)

As soon as the Provisioning-Backup-KVM daemon receives the merge command, it sets the sstProvisioningMode to merging to tell the Control instance daemon and other interested parties that it is merging the virtual machine or virtual machine template.

# The attribute sstProvisioningMode is set to merging by the Provisioning-Backup-KVM daemon.
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
changetype: modify
replace: sstProvisioningMode
sstProvisioningMode: merging
Step 08: Finalizing the Merging Process (Provisioning-Backup-KVM daemon)

As soon as the Provisioning-Backup-KVM daemon has executed the merge command, it sets the sstProvisioningMode to merged, the sstProvisioningState to the current timestamp (UTC) and sstProvisioningReturnValue to zero to tell the Control instance daemon and other interested parties that the merging of the virtual machine or virtual machine template is finished.

# The attribute sstProvisioningState is set with the current timestamp by the Provisioning-Backup-KVM daemon when
# the attributes sstProvisioningReturnValue and sstProvisioningMode are set.
# With this combination, the fc-brokerd knows that it can proceed.
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
changetype: modify
replace: sstProvisioningState
sstProvisioningState: 20121002T010500Z
-
replace: sstProvisioningReturnValue
sstProvisioningReturnValue: 0
-
replace: sstProvisioningMode
sstProvisioningMode: merged
Step 09: Start the Retain Process (Control instance daemon)

With the setting of the sstProvisioningMode to retain, the Control instance daemon tells the Provisioning-Backup-KVM daemon to retain (copy and then delete) all the necessary files to the configured backup location.

# The attribute sstProvisioningState is set to zero by the fc-brokerd when sstProvisioningMode is modified to
# retain (this way the Provisioning-Backup-KVM daemon knows that it must start the retaining process).
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
changetype: modify
replace: sstProvisioningState
sstProvisioningState: 0
-
replace: sstProvisioningMode
sstProvisioningMode: retain
Step 10: Starting the Retain Process (Provisioning-Backup-KVM daemon)

As soon as the Provisioning-Backup-KVM daemon receives the retain command, it sets the sstProvisioningMode to retaining to tell the Control instance daemon and other interested parties that it is retaining the necessary files to the configured backup location.

# The attribute sstProvisioningMode is set to retaining by the Provisioning-Backup-KVM daemon.
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
changetype: modify
replace: sstProvisioningMode
sstProvisioningMode: retaining
Step 11: Finalizing the Retain Process (Provisioning-Backup-KVM daemon)

As soon as the Provisioning-Backup-KVM daemon has executed the retain command, it sets the sstProvisioningMode to retained, the sstProvisioningState to the current timestamp (UTC) and sstProvisioningReturnValue to zero to tell the Control instance daemon and other interested parties that the retaining of all the necessary files to the configured backup location is finished.

# The attribute sstProvisioningState is set with the current timestamp by the Provisioning-Backup-KVM daemon when
# the attributes sstProvisioningReturnValue and sstProvisioningMode are set.
# With this combination, the fc-brokerd knows that it can proceed.
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
changetype: modify
replace: sstProvisioningState
sstProvisioningState: 20121002T012000Z
-
replace: sstProvisioningReturnValue
sstProvisioningReturnValue: 0
-
replace: sstProvisioningMode
sstProvisioningMode: retained
Step 12: Finalizing the Backup Process (Control instance daemon)

As soon as the Control instance daemon notices that the attribute sstProvisioningMode is set to retained, it sets the sstProvisioningMode to finished and the sstProvisioningState to the current timestamp (UTC). All interested parties now know that the backup process is finished, and therefore a new backup process can be started.

# The attribute sstProvisioningState is updated with the current time by the fc-brokerd when sstProvisioningMode is
# set to finished.
# All interested parties now know that the backup process is finished, and therefore a new backup process can be started.
dn: ou=20121002T010000Z,ou=backup,sstVirtualMachine=kvm-005,ou=virtual machines,ou=virtualization,ou=services,o=stepping-stone,c=ch
changetype: modify
replace: sstProvisioningState
sstProvisioningState: 20121002T012001Z
-
replace: sstProvisioningMode
sstProvisioningMode: finished
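The whole hand-shake in steps 01 to 12 is a fixed sequence of sstProvisioningMode values, alternating between the control instance (fc-brokerd) and the Provisioning-Backup-KVM daemon. A minimal sketch of that sequence (the helper name next_mode is an assumption; the modes and the party that sets each one are taken from the steps above):

```python
# (mode, party that sets it) in order, as in steps 01-12 above
BACKUP_SEQUENCE = [
    ("initialize",   "control"),
    ("initialized",  "control"),
    ("snapshot",     "control"),
    ("snapshotting", "backup-daemon"),
    ("snapshotted",  "backup-daemon"),
    ("merge",        "control"),
    ("merging",      "backup-daemon"),
    ("merged",       "backup-daemon"),
    ("retain",       "control"),
    ("retaining",    "backup-daemon"),
    ("retained",     "backup-daemon"),
    ("finished",     "control"),
]


def next_mode(mode):
    """Return the sstProvisioningMode that follows the given one,
    or None once the backup run is finished (hypothetical helper)."""
    modes = [m for m, _ in BACKUP_SEQUENCE]
    i = modes.index(mode)
    return modes[i + 1] if i + 1 < len(modes) else None
```

Either party only ever advances the mode to the next value in this sequence; a mode outside the sequence (or a non-zero sstProvisioningReturnValue) indicates an error.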

State of the art

Since we do not yet have a working control instance, we need a workaround for backing up the machines:

  • We already have a BackupKVMWrapper.pl script (File-Backend) which executes the three sub-processes in the correct order for a given list of machines (see #Backup multiple machines at the same time).
  • We already have the implementation for the whole backup with the LDAP-Backend (see stoney conductor: prov backup kvm).
  • We can now combine these two existing scripts and create a wrapper (let's call it KVMBackup) which adds some logic to the BackupKVMWrapper.pl. In fact, the KVMBackup wrapper will generate the list of machines which need a backup.

The behaviour on our servers is as follows (cf. Figure 2):

  1. The (decentralized) KVMBackup wrapper generates a list of all machines running on the current host.
    • For each of these machines:
      • Check if the machine is excluded from the backup; if so, remove the machine from the list
      • Check if the last backup was successful; if not, remove the machine from the list
  2. Update the backup subtree for each machine in the list
    • Remove the old backup leaf (the "yesterday-leaf"), and add a new one (the "today-leaf")
    • After this step, the machines are ready to be backed up
  3. Call the BackupKVMWrapper.pl script with the machines list as a parameter
  4. Wait for the BackupKVMWrapper.pl script to finish
  5. Go through all machines again and update the backup subtree one last time
    • Check if the backup was successful; if so, set sstProvisioningMode = finished (see also TBD)
Figure 2: How the two wrappers interact with the LDAP backend
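The machine-selection logic in step 1 can be sketched as a simple filter (the helper name machines_to_backup and its parameters are assumptions, not part of the KVMBackup wrapper):

```python
def machines_to_backup(running_machines, excluded, last_backup_ok):
    """Select the machines the KVMBackup wrapper would hand over to
    BackupKVMWrapper.pl: running on this host, not excluded from the
    backup, and with a successful previous backup (hypothetical helper).
    last_backup_ok maps machine name -> bool; unknown machines are
    treated as having a successful previous backup."""
    return [m for m in running_machines
            if m not in excluded and last_backup_ok.get(m, True)]
```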

Next steps