How to set up RAID on the rootfs

Using filesystems with RAID (Redundant Array of Inexpensive Disks) has many advantages. First, there is speed. RAID combines several disks and reads and writes chunks from them in turn. That way, an array of three disks can approach three times the transfer speed of its slowest disk, and larger arrays can do even better. Second, you can build filesystems bigger than your largest disk (useful for /var/spool/news, /home/ftp/pub etc.). Third, with the right RAID level you can get redundancy, so that a disk failure won't hurt.

For technical information on RAID please refer to <URL:ftp://ftp.infodrom.north.de/pub/doc/tech/raid/>.

To do RAID with Linux you first need a kernel with appropriate support. Linux 2.0.x supports linear and striping modes (the latter is also known as RAID-0). Linux 2.1.63 and later also support RAID-4 and RAID-5. In either case you need special tools installed: the mdutils package for linear and RAID-0, and the raidtools package, together with a kernel version higher than 2.1.62, for RAID-4/5.

With RAID (not linear) you'll get the best results if you use partitions of exactly the same size. The RAID driver works with different sizes too, but less efficiently, as you will see after reading some RAID documentation.
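
To see why equal sizes help, it may be useful to picture how striping lays out data. The following is a simplified sketch, not the md driver's actual code: it assumes three equally sized partitions and 4k chunks handed out in strict rotation, and the function name chunk_location is made up for this illustration.

```shell
#!/bin/sh
# Simplified model of RAID-0 striping over equally sized disks.
# Assumptions (not taken from the md driver): strict round-robin
# placement, NDISKS disks, CHUNK_KB kilobytes per chunk.
NDISKS=3
CHUNK_KB=4

# chunk_location OFFSET_KB
# Prints "disk=<n> chunk_on_disk=<m>" for a logical offset given in KB.
chunk_location() {
    chunk=$(( $1 / CHUNK_KB ))    # logical chunk number
    echo "disk=$(( chunk % NDISKS )) chunk_on_disk=$(( chunk / NDISKS ))"
}

chunk_location 0     # first chunk lands on the first disk
chunk_location 4     # the next chunk moves on to the second disk
chunk_location 12    # the fourth chunk wraps around to the first disk
```

In this model every disk receives the same number of chunks. With partitions of different sizes the rotation can only cover the space that all of them have in common, and whatever is left over has to be handled with fewer disks.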

Setting up RAID

Setting up RAID for normal filesystems such as /var, /home or /usr is quite simple. First you need to partition your disks. After you've done that you need to tell the RAID subsystem how you want to organize the partitions, e.g. with

  mdcreate -c4k raid0 /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
  mdcreate -c8k raid0 /dev/md1 /dev/sdd1 /dev/sde1 /dev/sdf1

This creates two RAIDs, each of them consisting of three partitions. The first one has a chunk size of 4k while the second one uses 8k chunks. These commands will create appropriate entries in /etc/mdtab. The next step is to activate these devices with:

  mdadd -ar

From now on you may refer to /dev/md0 and /dev/md1 as block devices carrying your filesystems. Now you may create your filesystems on the new devices and add them to /etc/fstab just as usual.
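
As an illustration, once mke2fs has been run on /dev/md0 and /dev/md1, entries along these lines would go into /etc/fstab. The mount points /home and /var/spool/news are assumptions chosen for this example only; use whatever fits your system.

```
/dev/md0        /home                   ext2    defaults        0       2
/dev/md1        /var/spool/news         ext2    defaults        0       2
```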

Debian GNU/Linux is configured to initialize and activate any RAID at boot time, so you should not run into problems. Please note that you need to have the RAID drivers compiled into the kernel; as modules they might not work.

Swapping over RAID

The kernel has native support for distributing swap space over several disks. Just add all swap partitions to /etc/fstab and use 'swapon -a' to activate all of them. The kernel stripes across swap areas (as with RAID-0) when they have the same priority, so give them all the same pri= value. Here's a sample setup:

  /dev/sda3               none            swap    sw,pri=1
  /dev/sdb3               none            swap    sw,pri=1
  /dev/sdc3               none            swap    sw,pri=1

Root filesystem on RAID

The use of RAID for the root filesystem is a little bit tricky. The problem is that LILO can't read and boot the kernel if it is not stored linearly on the disk (as it is on ext2 or DOS filesystems). The solution is to put the kernel on a different filesystem that doesn't use RAID.

This way LILO can boot the kernel, but the kernel itself is still unable to mount the root filesystem because its RAID subsystem isn't initialized yet.

For late 2.1.x kernels there's a kernel parameter that can be used to set up a RAID at boot time, so that the root filesystem can be mounted from it. This is

  md=<md device no.>,<raid level>,<chunk size factor>,<fault level>,dev0,dev1,...,devn

This needs to be passed via lilo, using the append="" option or directly at the lilo prompt at boot time. You'll find more information in Documentation/md.txt in the Linux source tree.
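
As a rough illustration, the parameter for a given array could be put together as sketched below. This assumes the field meanings given in Documentation/md.txt of kernels from that era: the md device number, the RAID level (0 for striping), a chunk size factor n meaning a chunk size of 4k << n, and a fault level that is ignored. Please check the md.txt shipped with your kernel before relying on it. The helper name md_param is invented for this example.

```shell
#!/bin/sh
# md_param MD_NUMBER RAID_LEVEL CHUNK_KB DEVICE...
# Prints an md= kernel parameter. Assumes the chunk size factor n
# stands for a chunk size of 4k << n, and that the fault level field
# is ignored by the kernel, so 0 is used for it.
md_param() {
    md_no=$1; level=$2; chunk_kb=$3
    shift 3
    factor=0; size=4
    while [ "$size" -lt "$chunk_kb" ]; do
        size=$(( size * 2 ))
        factor=$(( factor + 1 ))
    done
    devs=$(IFS=,; echo "$*")      # join the remaining arguments with commas
    echo "md=${md_no},${level},${factor},0,${devs}"
}

md_param 0 0 4 /dev/hda2 /dev/hdb2    # prints md=0,0,0,0,/dev/hda2,/dev/hdb2
```

The resulting string would go into the append="" option in /etc/lilo.conf, together with root=/dev/md0.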

For stable kernels (2.0.x) and "not so late" development kernels (2.1.x) you need a mechanism to call some programs, at least mdadd, after the kernel is loaded and before it tries to mount the root file system.

The only way to achieve this is to use the initial ramdisk, also known as initrd. General information about initrd may be found in the Documentation directory inside the kernel source tree. If the Linux kernel uses an initrd, it mounts the ramdisk as root file system and executes /linuxrc if it exists. After this has finished, the kernel continues its boot process and mounts the real root filesystem. The old / (from the initrd) is moved to /initrd if that directory is available, or unmounted otherwise.

The initrd file is a simple rootdisk. It should contain all the files that are needed for processing the /linuxrc file. This includes a working shell if it's a shell script and all tools that are used in this script. This might include a working libc with ld.so and tools, too.

After you have initialized RAID from /linuxrc you need to tell the kernel where its root filesystem resides. As it uses the initrd it might not know. There is an easy interface for this using the proc filesystem. You only need to echo the appropriate device number to /proc/sys/kernel/real-root-dev and the kernel continues with that setting.

As lilo isn't able to boot from a non-linear block device (such as RAID), you need to reserve a small partition with the kernel on it. I've decided to use a 10 MB partition, which I mount as /boot. 10 MB is plenty of space for just one kernel and initrd; currently my system uses only 2.5 MB of it. So /etc/lilo.conf still points to /boot/vmlinuz-2.0.34 in this setup.

Now, decide what needs to be done in the /linuxrc script. You only need to activate RAID and tell the kernel where your root filesystem resides. The following script should do it:

  #! /bin/ash

  if [ -s /etc/mdtab -a -f /sbin/mdadd ]
  then
        echo "Preparing system for rootfs raid."
        /sbin/mdadd -ar
        /bin/mount -t proc /proc /proc
        echo 0x900 > /proc/sys/kernel/real-root-dev
        /bin/umount /proc
  else
        echo "No mdtab or mdadd found."
  fi

You may use any block device as root filesystem. 0x900 stands for major number 9 and minor number 0, which is /dev/md0.
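
For other devices the number can be worked out the same way. The sketch below assumes the classic encoding with eight bits each for major and minor number, which is what 0x900 for major 9, minor 0 implies. The function name rootdev is made up for this example.

```shell
#!/bin/sh
# rootdev MAJOR MINOR
# Prints the value to write to /proc/sys/kernel/real-root-dev,
# assuming the device number is encoded as major * 256 + minor.
rootdev() {
    printf '0x%x\n' $(( $1 * 256 + $2 ))
}

rootdev 9 0    # /dev/md0  -> 0x900
rootdev 9 1    # /dev/md1  -> 0x901
rootdev 3 2    # /dev/hda2 -> 0x302
```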

Now make a list of binaries needed and additional files. Of course you need some device files in /dev/ as well. To get the /linuxrc script working at all, you need to have /dev/tty1. The other devices depend on your /etc/mdtab file. You will at least need /dev/md0.

Binaries: ash, mount, umount, mdadd
Files: mdtab, fstab, mtab and, for safety, passwd
Devices: tty1, depending on /etc/mdtab
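
Which devices a given /etc/mdtab implies can be listed mechanically. The following is only a sketch: it assumes the simple mdtab layout used in this text (comment lines starting with #, device names starting with /dev/), and the function name mdtab_devices is invented here.

```shell
#!/bin/sh
# mdtab_devices FILE
# Lists every /dev/... name mentioned in an mdtab-style file,
# one per line and without duplicates, skipping comment lines.
mdtab_devices() {
    awk '$1 !~ /^#/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^\/dev\//) print $i
         }' "$1" | sort -u
}

# Example run against a throwaway file in the mdtab layout.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# mdtab entry for /dev/md0
/dev/md0        raid0,4k,0,93f5553f     /dev/hda2 /dev/hdb2
EOF
mdtab_devices "$tmp"
rm -f "$tmp"
```

Each name printed needs a matching device file in the initrd's /dev directory, in addition to tty1.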

I use this mdtab:

  # mdtab entry for /dev/md0
  /dev/md0        raid0,4k,0,93f5553f     /dev/hda2 /dev/hdb2
  # mdtab entry for /dev/md1
  /dev/md1        raid0,8k,0,3ffaa1d8     /dev/hda4 /dev/hdb4

Therefore I have created these block devices:

  /dev/hda2
  /dev/hda4
  /dev/hdb2
  /dev/hdb4
  /dev/md0
  /dev/md1
  /dev/md2
  /dev/md3

You can use the mknod program to create the device files, e.g. with the following command for tty1:

  mknod dev/tty1 c 4 1

OK, but how does one create the initrd file? The best thing you can do is to create the directory /tmp/initrd and install everything in it. When you're finished, determine the disk space it uses (du -s) and create the initrd itself. The following commands create a 1 MB initial ramdisk, which is what I use.

  dd if=/dev/zero of=/tmp/initrd.bin bs=1k count=1024
  mke2fs /tmp/initrd.bin
  mount -o loop /tmp/initrd.bin /mnt

As you probably use dynamically linked binaries, you need to make sure that the dynamic linker and the shared libraries are installed, too. You need to copy at least /lib/libc*.so and /lib/ld-linux.so.2 as a link to /lib/ld-2.0.6.so. You also need an appropriate /etc/ld.so.conf file. Appropriate here means that "/lib" should be the only line in it. You need to create a new library cache file, /etc/ld.so.cache, with "ldconfig -r /tmp/initrd". Of course you also have to install the needed binaries in the appropriate directories, /sbin and /bin.
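
One way to find out which libraries a binary needs is to run ldd on it. The helper below is a small sketch that boils ldd output down to a list of files to copy. The name ldd_paths is invented for this example, and the exact ldd output format may differ between systems, so check the result by hand.

```shell
#!/bin/sh
# ldd_paths
# Reads ldd output on stdin and prints the absolute paths of the
# libraries it names, one per line and without duplicates.
ldd_paths() {
    awk '
        /:$/ { next }    # skip the "file:" headers ldd prints for several files
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^\//) { print $i; break }
        }
    ' | sort -u
}

# Typical use, assuming the staging directory /tmp/initrd from the text:
#   ldd /tmp/initrd/bin/ash /tmp/initrd/sbin/mdadd | ldd_paths
```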

Don't forget to create the /proc directory or mount will fail. The fstab and mtab files can be empty. They will only be read, not written to, but they need to exist. For the /etc/passwd file it's sufficient to include the root user.
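
The directory skeleton described so far could be set up as sketched below. This is only an outline under some assumptions: STAGE defaults to a fresh temporary directory so the script can be tried safely (the text uses /tmp/initrd), the passwd line is just one plausible minimal entry, and device files and binaries are left out because mknod needs root and the binaries depend on your system.

```shell
#!/bin/sh
# Build the directory skeleton of the initrd in a staging directory.
# STAGE is an assumption of this sketch; set it to /tmp/initrd to
# follow the text.
STAGE=${STAGE:-$(mktemp -d)}

mkdir -p "$STAGE/bin" "$STAGE/sbin" "$STAGE/lib" "$STAGE/etc" \
         "$STAGE/dev" "$STAGE/proc"

# fstab and mtab only have to exist; they may stay empty.
: > "$STAGE/etc/fstab"
: > "$STAGE/etc/mtab"

# Minimal passwd containing only root (field values are an assumption).
echo 'root:*:0:0:root:/:/bin/ash' > "$STAGE/etc/passwd"

# The dynamic linker should search /lib only.
echo '/lib' > "$STAGE/etc/ld.so.conf"

# Report the space used so far, in kilobytes.
du -sk "$STAGE"
```

Running du again after the binaries, libraries and device files are in place gives the figure needed for sizing the ramdisk image.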

After you have copied everything from /tmp/initrd to the ramdisk, umount it (e.g. with the command "umount /mnt") and move the file to /boot/initrd.bin. Now you need to tell lilo to load the kernel and the ramdisk. That's no problem, just use a record in /etc/lilo.conf similar to the following:

  image=/boot/vmlinuz-2.0.34
    initrd=/boot/initrd.bin
    label=linux
    read-only

Issue the command "lilo" and you're nearly done. As the RAID subsystem is now configured at boot time, before any /etc/init.d scripts are run, you should disable the mdadd call in the /etc/init.d scripts.


© Joey, 11 Jul '98