The XEN hypervisor uses big files (a couple of gigabytes) as filesystem
images for virtual machines. Unlike other virtualisation solutions, XEN
does not impose its own internal structure on the image file. The big
file simply has to contain an ordinary ext3 filesystem and, optionally,
a partition table just as if it were a real hard disk.
The ability to use big files as hard disks comes in handy if you are
running short of space on your main hard disk. With an external hard
disk you should be well prepared to run a number of virtual machines
as big files.
However, having the filesystem of a virtual machine in a big file
raises the question of how to boot the virtual machine.
Essentially there are two options to do that:
- Provide the VM's kernel and the init-ramdisk, which are usually stored
inside the filesystem (in the /boot directory), as separate files
together with the big file, and modify the VM's configuration to use
them.
- Leave the kernel and the init-ramdisk in the big file and provide a working boot sector that accesses the kernel inside the big file, using the native XEN pygrub bootloader to start the virtual machine.
For the rest of this article we will focus on the second option, which is much more appealing as everything is kept inside the big file. I will show you exactly how the big file is turned into a virtual hard disk and how you can access and modify the information stored in the virtual machine's own filesystem.
Getting Partitions And Filesystem Sizes Sorted
Our journey through the big file's internal structure naturally begins with the creation of the big file.
dd if=/dev/zero of=bigfile bs=1M count=3950
As a second step we use this chunk of 4141875200 bytes to act as a hard disk and try to partition the bigfile as usual:
losetup /dev/loop0 bigfile
fdisk /dev/loop0
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help):
As expected, the fdisk program greets us with a number of warnings, because the big file is still blank and carries no partition table yet. But let's see in detail how the fdisk program recognizes our new hard disk:
Disk /dev/loop0: 4141 MB, 4141875200 bytes
255 heads, 63 sectors/track, 503 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
Obviously there is no partition table yet, but the program assumes that the big file represents a hard disk with 255 heads and 63 sectors per track, each sector holding 512 bytes. Every cylinder of our virtual hard disk therefore consists of 255 x 63 x 512 bytes = 8225280 bytes, which is the unit in which we can now chop the hard disk space into partitions. All in all there are 503 cylinders in our virtual hard disk, which makes a total of 503 x 8225280 bytes = 4137315840 bytes to spend on partitions.
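If you'd like to recheck fdisk's figures, the same arithmetic can be done in the shell. This is only a sketch: the variable names are my own, and the geometry values are simply copied from the fdisk output above.

```shell
#!/bin/sh
# Geometry that fdisk assumed for the big file (copied from its output).
heads=255
sectors_per_track=63
sector_bytes=512

# Size of the big file as written by dd (bs=1M count=3950).
file_bytes=$((3950 * 1024 * 1024))

# One cylinder is heads x sectors per track x bytes per sector.
cylinder_bytes=$((heads * sectors_per_track * sector_bytes))

# Only whole cylinders count, so integer division is what we want.
cylinders=$((file_bytes / cylinder_bytes))

# Space that can actually be handed out to partitions.
usable_bytes=$((cylinders * cylinder_bytes))

echo "file size:     $file_bytes bytes"
echo "cylinder size: $cylinder_bytes bytes"
echo "cylinders:     $cylinders"
echo "usable space:  $usable_bytes bytes"
```

Changing the count given to dd and rerunning the script shows how many whole cylinders a differently sized big file would yield under the same assumed geometry.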
But wait, didn't we create 4141875200 bytes in the first place? That's 4559360 bytes less than what we had originally. Well, this loss is due to the fact that for the 504th cylinder we'd need 8225280 bytes which we don't have, so this loss is inevitable. But the important consequence of this reduction of space is that we cannot create a filesystem on the whole bunch of data we supplied. At the moment the size of our filesystem is not determined at all. The next step is to create a new primary partition inside our big file using all the space we have:
Disk /dev/loop0: 4141 MB, 4141875200 bytes
255 heads, 63 sectors/track, 503 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/loop0p1 1 503 4040316 83 Linux
After having written the partition table to the big file, have you checked for the new device file /dev/loop0p1? Don't worry, it does not exist! Appending p1 to the device name is fdisk's way of denoting partitions; it does not mean that you'll find such a thing in the /dev directory.
Poking Inside The Big File
From the partition table you can see that 4040316 blocks have been allocated for the new partition. With each block storing 1024 bytes we now know the size of our first partition: 4040316 x 1024 bytes = 4137283584 bytes. This is another number we have never seen before! After having written off some 4.5 megabytes because we cannot use half a cylinder, we now face another loss of exactly 4137315840 - 4137283584 = 32256 bytes.
Of course these 32256 bytes at the beginning of the big file are there for a purpose, which is to store the partition table. Our first partition begins right after this amount of data, at an offset of 32256 bytes inside the big file. The amount of 32256 bytes results from the fact that one track (63 sectors of 512 bytes for one head) is set aside for the partition table. Now it's time to use a second loop device (/dev/loop1) to poke inside the big file at exactly the point where our first partition begins and create a new filesystem there:
losetup -o 32256 /dev/loop1 bigfile
mkfs -t ext3 -c /dev/loop1 4040316
It's essential that we supply the number of blocks as a parameter to the mkfs command to ensure that our new filesystem fits exactly into the space we have allocated for the first partition. Without this parameter the filesystem would become too big, as the 4.5 megabytes after the first partition would be used for the filesystem too, and when the virtual machine uses the filesystem, its actual size would conflict with the numbers in the partition table. Either the partition table or the filesystem's superblock would be lying, which causes distress for a virtual machine that expects a consistent filesystem to operate.
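The offset given to losetup and the block count given to mkfs both follow from the geometry fdisk reported, so they can be derived rather than typed in by hand. The following is a minimal sketch under that assumption; the variable names are my own, and the script only prints the two commands instead of running them.

```shell
#!/bin/sh
# Geometry and cylinder count as reported by fdisk in this article.
heads=255
sectors_per_track=63
sector_bytes=512
cylinders=503

cylinder_bytes=$((heads * sectors_per_track * sector_bytes))

# One track at the start of the file is set aside for the partition
# table, so the first partition starts one track into the file.
offset_bytes=$((sectors_per_track * sector_bytes))

# The partition runs from that offset to the end of the last whole cylinder.
partition_bytes=$((cylinders * cylinder_bytes - offset_bytes))

# fdisk lists partition sizes in blocks of 1024 bytes; the article
# passes that same number to mkfs.
blocks=$((partition_bytes / 1024))

# Print the commands only; running them requires root privileges.
echo "losetup -o $offset_bytes /dev/loop1 bigfile"
echo "mkfs -t ext3 -c /dev/loop1 $blocks"
```

With the values from this article, the script prints the same losetup and mkfs commands that were used above.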
Writing The Master Boot Record
You can fill up the filesystem with whatever carefully selected quality open source software you can find on the planet, but in the end we need to write the new virtual disk's master boot record to boot the jewel. There is one step of preparation to be done before we can use the grub shell to write the MBR. We have to make a symbolic link named /dev/loop to the device that points to the master boot record, that is, to the beginning of the big file, which is /dev/loop0 in the example above.
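The preparation step can be sketched as a small shell function. Note that make_mbr_link is a hypothetical helper of my own, not part of grub or losetup; all it does is wrap ln -s so that the link target and link name are explicit.

```shell
#!/bin/sh
# make_mbr_link is a hypothetical helper, introduced for illustration only.
# It creates a symbolic link that points at the loop device covering the
# whole big file (the device whose first sector is the master boot record).
make_mbr_link() {
    target=$1   # device holding the MBR, /dev/loop0 in this article
    link=$2     # name handed to the grub shell, /dev/loop in this article
    ln -s "$target" "$link"
}

# For the example in this article, run as root:
#   make_mbr_link /dev/loop0 /dev/loop
```

Once the link exists, the grub shell can be pointed at /dev/loop as shown next.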
grub> device (hd0) /dev/loop
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
Now your spick-and-span virtual hard disk is ready to boot.


Alternatively, you can just do the dd command you mention at the beginning and then boot your Xen instance with the initial install initrd.img and vmlinuz of your chosen OS (RHEL5 for example) and use the GUI installation procedure of that to do your 'disk' partitioning.
Your method is an excellent way of teaching people what's going on beneath the surface but it's probably not necessary every time you want to install a bespoke DomU (-:
Of course, if you're cloning DomUs you don't even need to do that, just do a bitwise copy of the file, boot it in isolation and change the necessary settings inside. But you knew that (-:
Note also that the Xen Hypervisor can use big files for its DomU filesystems. But it can also use a whole raw disk and a partition of a disk too.
Here's an interesting technical thought: Create two identically-sized blank files with dd. Have one on a local disk and one on an external disk. Assign them both to a Xen instance by specifying them in (for example) /etc/xen/image:
[...]
disk = [ 'tap:aio:/image_disk0,xvda,w', ]
disk = [ 'tap:aio:/mnt/USB/image_disk1,xvda,w', ]
[...]
You could now boot using your OS of choice's installation vmlinuz and initrd.img by adding the lines:
kernel = "/etc/xen/vmlinuz"
ramdisk = "/etc/xen/initrd.img"
and doing the usual "xm create -c image" and then at the partitioning stage create a RAID1 set of /boot / and swap on both image_disk0 and image_disk1 and do your installation. What you now have is a redundantly-protected OS which you can break the mirror on whenever you like to keep a snapshot of the DomU at any point you like (of course, you'll need to install GRUB on both sides of the mirror if you ever want to boot the side that didn't get GRUB at install time).
Sorry, went off at a bit of a tangent there. Great blog, love reading it.
Thanks Ben for your comment,
I haven't tried your proposal for a software RAID-1 setup yet, but I can imagine it will work, if the USB disk is assigned as xvdb to make it distinguishable from the local image file.
While the main benefit of such a setup is reliability and to a lesser degree performance, you'll have to plug in your USB disk all the time when you use your Xen instance, to ensure that the software RAID is in sync.
But in case you are planning to plug in your USB disk only when you like to snapshot the system, I'd fear that running the software RAID "on one leg" most of the time may cause problems. Anyway, after connecting your external disk you'll have to face a long time waiting until the two disks have synchronized. Although the sync will take place in the background it will surely slow down the machine.
I suppose it would be easier to copy the local image file to create a snapshot?
Heh, yes it would take a fair while I guess. Even with USB2. Definitely not the best idea in the world!
However I don't think running RAID1 in permanently degraded mode would be any kind of an issue, it's basically what you do 'normally' when you don't have any RAID.
I guess I was musing out loud as a thought experiment rather than something I'd ever dream of putting into a live environment. Still, amusing enough in its own way.
You're right of course in that it would be much easier just to copy the local image file to create a snapshot, except in the cases I mentioned at the top of my original comment when I reminded you that Xen can use a whole raw disk or simply an unassigned partition of a disk as its virtual hard disk. In that case making a USB drive the other half of a RAID1 would actually be simpler than working out a simple way of cloning a partition or whole disk containing a 'raw' Xen partition layout.
Elaborating upon your thought experiment a little bit more and given that the USB disk is used as a raw partition, would it be possible to revert to the old snapshot taken a while ago simply by connecting the USB disk? That would be great.
I'm not sure if booting from the USB is required to set the USB as a master from which the array is rebuilt subsequently. In any case, the local disk partition would surely have to be removed first to prevent the valuable snapshot being "updated" inadvertently. Whether that can be done using mdadm or by removing the respective tap entry isn't entirely clear to me at the moment. I suspect that the local partition has to be assigned a completely new device file, say xvdc, to be able to be added as a spare disk to the array while the array is running on the USB part.
I think it's worth an experiment.