Using Big Files As Hard Disks

The XEN hypervisor uses big files (a couple of gigabytes each) as filesystem images for virtual machines. Unlike other virtualisation solutions, XEN does not impose its own internal structure on the image file. The big file simply has to contain an ordinary ext3 filesystem and, optionally, a partition table, just as if it were a real hard disk. The ability to use big files as hard disks comes in handy if you are running short of space on your main hard disk: with an external hard disk you should be well prepared to run a number of virtual machines as big files.

However, keeping the filesystem of a virtual machine in a big file raises the question of how to boot the virtual machine. Essentially there are two ways to do that:

  1. Provide the VM's kernel and the init-ramdisk, which are usually stored inside the filesystem (in the /boot directory), as separate files together with the big file, and modify the VM's configuration to use them.

  2. Leave the kernel and the init-ramdisk in the big file and provide a working boot sector that accesses the kernel inside the big file, using the native XEN pygrub bootloader to start the virtual machine.

Both options require the big file to be associated with a special device file (e.g. /dev/loop0) before a filesystem can be created on it. For the first option it is sufficient to connect the big file to the loop device with the "losetup /dev/loop0 bigfile" command. The second option is more complex, as the big file has to be partitioned like an ordinary hard disk before the filesystem can be created.

For the rest of this article we will focus on the second option, which is much more appealing as everything is kept inside the big file. I will show you exactly how the big file is turned into a virtual hard disk and how you can access and modify the information stored in the virtual machine's own filesystem.

Getting Partitions And Filesystem Sizes Sorted

Our journey through the big file's internal structure naturally begins with the creation of the big file.

dd if=/dev/zero of=bigfile bs=1M count=3950
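
The size of the resulting file follows directly from the dd parameters. As a minimal sketch, the script below computes it and then repeats the experiment with only 4 blocks in a temporary file, so it runs in a moment and needs no root rights. The variable names and the 4-block demo are assumptions made for illustration, not part of the original procedure.

```shell
#!/bin/sh
# Predict the size of the file that "dd bs=1M count=3950" produces.
block_bytes=$((1024 * 1024))          # bs=1M means 1048576 bytes per block
count=3950
expected_bytes=$((block_bytes * count))
echo "full-size image: $expected_bytes bytes"

# Scaled-down check: write 4 blocks to a temporary file and measure it.
demo_file=$(mktemp)
dd if=/dev/zero of="$demo_file" bs=1M count=4 2>/dev/null
demo_bytes=$(($(wc -c < "$demo_file")))
rm -f "$demo_file"
echo "4-block demo file: $demo_bytes bytes"
```

Note that bs=1M is the spelling GNU dd understands; other dd implementations may expect a different suffix.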

As a second step we use this chunk of 4141875200 bytes to act as a hard disk and try to partition the bigfile as usual:


losetup /dev/loop0 bigfile
fdisk /dev/loop0

Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help):

As expected, fdisk throws a number of warnings at us, because we have handed it a freshly zeroed big file instead of a real, already partitioned hard disk. But let's see in detail how fdisk recognizes our new hard disk (the p command prints the partition table):

Disk /dev/loop0: 4141 MB, 4141875200 bytes
255 heads, 63 sectors/track, 503 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System

Obviously there is no partition table yet, but fdisk assumes that the big file represents a hard disk with 255 heads and 63 sectors per track, each sector holding 512 bytes. Every cylinder of our virtual hard disk therefore consists of 255 x 63 x 512 bytes = 8225280 bytes, which is the unit in which we can now chop the hard disk space into partitions. All in all there are 503 cylinders in our virtual hard disk, which makes a total of 503 x 8225280 bytes = 4137315840 bytes to spend on partitions.
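
The geometry arithmetic can be reproduced with plain shell arithmetic. This is only a sketch: the values are copied from the fdisk output above, and the variable names are made up for the example.

```shell
#!/bin/sh
# Geometry that fdisk assumed for the big file (from the output above).
heads=255
sectors_per_track=63
sector_bytes=512
file_bytes=$((3950 * 1024 * 1024))    # size produced by the dd command

# One cylinder is every track that shares a position under all heads.
cylinder_bytes=$((heads * sectors_per_track * sector_bytes))

# Integer division: only complete cylinders count.
cylinders=$((file_bytes / cylinder_bytes))
usable_bytes=$((cylinders * cylinder_bytes))

echo "cylinder size:  $cylinder_bytes bytes"
echo "full cylinders: $cylinders"
echo "usable space:   $usable_bytes bytes"
```

The integer division is the important step, since fdisk can only hand out whole cylinders.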

But wait, didn't we create 4141875200 bytes in the first place? That leaves us 4559360 bytes short of what we had originally. This loss arises because a 504th cylinder would need a full 8225280 bytes, which we don't have, so it is inevitable. The important consequence is that we cannot create a filesystem on the whole amount of data we supplied, and at this point the size of our filesystem is not determined at all. The next step is to create a new primary partition inside our big file, using all the space we have:

Disk /dev/loop0: 4141 MB, 4141875200 bytes
255 heads, 63 sectors/track, 503 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

      Device Boot      Start         End      Blocks   Id  System
/dev/loop0p1               1         503     4040316   83  Linux

After having written the partition table to the big file, have you checked for the new device file /dev/loop0p1? Don't worry, it does not exist! Appending p1 to the device name is merely fdisk's way of denoting partitions; it does not mean that you'll find such a thing in the /dev directory.

Poking Inside The Big File

From the partition table you can see that 4040316 blocks have been allocated for the new partition. With each block storing 1024 bytes, we now know the size of our first partition: 4040316 x 1024 bytes = 4137283584 bytes. This is another number we have never seen before! After having written off some 4.5 megabytes because we cannot use a partial cylinder, we now face another loss of exactly 4137315840 - 4137283584 = 32256 bytes.

Of course these 32256 bytes at the beginning of the big file are there for a purpose: they hold the master boot record with the partition table. Our first partition begins right after them, at an offset of 32256 bytes inside the big file. The figure of 32256 bytes results from the fact that one whole track (63 sectors of 512 bytes for one head) is set aside for this. Now it's time to use a second loop device (/dev/loop1) to poke inside the big file at exactly the point where our first partition begins and create a new filesystem there:

losetup -o 32256 /dev/loop1 bigfile
mkfs -t ext3 -c /dev/loop1 4040316

It's essential that we supply the number of blocks as a parameter to the mkfs command to ensure that our new filesystem fits exactly into the space we have allocated for the first partition. Without this parameter the filesystem would become too big, because the 4.5 megabytes after the first partition would be used for the filesystem too. When the virtual machine later uses the filesystem, its actual size would conflict with the numbers in the partition table: either the partition table or the filesystem's superblock would be lying, which spells trouble for a virtual machine that expects a consistent filesystem.
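
Before running losetup and mkfs it may be worth checking that the offset and the block count are consistent with the size of the big file. The following sketch does this with the numbers from this article; the variable names are assumptions, and the 1024-byte block size is the one used in the "Blocks" column of the fdisk output.

```shell
#!/bin/sh
# Numbers taken from the fdisk output and the commands above.
file_bytes=$((3950 * 1024 * 1024))    # size of the big file
offset=$((63 * 512))                  # first track, skipped with losetup -o
partition_blocks=4040316              # "Blocks" column, 1024 bytes each
partition_bytes=$((partition_blocks * 1024))

partition_end=$((offset + partition_bytes))
tail_bytes=$((file_bytes - partition_end))

echo "partition starts at byte $offset"
echo "partition ends at byte   $partition_end"
echo "unused tail:             $tail_bytes bytes"

if [ "$partition_end" -le "$file_bytes" ]; then
    echo "OK: the partition fits inside the big file"
else
    echo "ERROR: the partition runs past the end of the big file" >&2
fi
```

With these numbers the partition ends exactly at the last full cylinder boundary, and the unused tail is the partial cylinder that was written off earlier.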

Writing The Master Boot Record

You can fill up the filesystem with whatever carefully selected quality open source software you can find on the planet, but in the end we need to write the new virtual disk's master boot record to boot the jewel. There is one step of preparation to be done before we can use the grub shell to write the MBR. We have to make a symbolic link named /dev/loop to the device that points to the master boot record, that is to the beginning of the big file, /dev/loop0 in the example above.
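
The symbolic link itself is a one-liner. The sketch below wraps it in a small helper so that an existing link gets replaced and the result is shown; the function name link_mbr_device is hypothetical, and the -n option of ln is assumed to be available, as it is in GNU coreutils.

```shell
#!/bin/sh
# link_mbr_device TARGET LINK
# Create (or replace) the symbolic link LINK so that it points at TARGET,
# then print where the link points. Hypothetical helper for illustration.
link_mbr_device() {
    target=$1
    link=$2
    ln -sfn "$target" "$link" && readlink "$link"
}
```

Run as root, "link_mbr_device /dev/loop0 /dev/loop" would create the link described above. With the link in place, the grub shell commands that follow write the MBR.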

grub> device (hd0) /dev/loop
grub> root (hd0,0)
grub> setup (hd0)
grub> quit

Now your spick-and-span virtual hard disk is ready to boot.
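
To look at or change the files inside the big file later on, the first partition can be mounted on the host while the virtual machine is shut down. One way, sketched here, is to use the loop and offset options of mount instead of a second loop device set up by hand. The mount point /mnt/bigfile is a made-up name, and the script only prints the commands so that they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical names; adjust them to your own setup.
image=bigfile
mountpoint=/mnt/bigfile
offset=$((63 * 512))    # start of the first partition inside the big file

# Build the command lines as strings so they can be checked first.
mount_cmd="mount -o loop,offset=$offset $image $mountpoint"
umount_cmd="umount $mountpoint"

echo "$mount_cmd"
echo "$umount_cmd"
```

The filesystem should never be mounted on the host while the virtual machine is using it.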

4 Comments

Alternatively, you can just do the dd command you mention at the beginning and then boot your Xen instance with the initial install initrd.img and vmlinuz of your chosen OS (RHEL5 for example) and use the GUI installation procedure of that to do your 'disk' partitioning.

Your method is an excellent way of teaching people what's going on beneath the surface but it's probably not necessary every time you want to install a bespoke DomU (-:

Of course, if you're cloning DomUs you don't even need to do that, just do a bitwise copy of the file, boot it in isolation and change the necessary settings inside. But you knew that (-:

Note also that the Xen Hypervisor can use big files for its DomU filesystems. But it can also use a whole raw disk and a partition of a disk too.


Here's an interesting technical thought: Create two identically-sized blank files with dd. Have one on a local disk and one on an external disk. Assign them both to a Xen instance by specifying them in (for example) /etc/xen/image:

[...]
disk = [ 'tap:aio:/image_disk0,xvda,w', 'tap:aio:/mnt/USB/image_disk1,xvdb,w', ]
[...]

You could now boot using your OS of choice's installation vmlinuz and initrd.img by adding the lines:

kernel = "/etc/xen/vmlinuz"
ramdisk = "/etc/xen/initrd.img"

and doing the usual "xm create -c image". Then, at the partitioning stage, create RAID1 sets for /boot, / and swap across image_disk0 and image_disk1 and do your installation. What you now have is a redundantly protected OS whose mirror you can break whenever you like to keep a snapshot of the DomU at that point. (Of course, you'll need to install GRUB on both sides of the mirror if you ever want to boot the side that didn't get GRUB at install time.)

Sorry, went off at a bit of a tangent there. Great blog, love reading it.

Heh, yes it would take a fair while I guess. Even with USB2. Definitely not the best idea in the world!

However I don't think running RAID1 in permanently degraded mode would be any kind of an issue, it's basically what you do 'normally' when you don't have any RAID.

I guess I was musing out loud as a thought experiment rather than something I'd ever dream of putting into a live environment. Still, amusing enough in its own way.

You're right, of course, that it would be much easier just to copy the local image file to create a snapshot, except in the cases I mentioned at the top of my original comment, where I pointed out that Xen can use a whole raw disk or simply an unassigned partition of a disk as its virtual hard disk. In that case making a USB drive the other half of a RAID1 would actually be simpler than working out a way of cloning a partition or whole disk containing a 'raw' Xen partition layout.
