may 2
RAIDing New Territory
(Jim, I think I deserve an 'A' just for the title.)
First, some Resources
These are the articles and forum posts I found most relevant or useful in my escapades.
Introduction to Disks, Partitions, and Filesystems
To understand RAID we need at least a rudimentary understanding of hard drives and partitions. I assume that most of you are familiar with hard drives, or at least that your computer has one. You may, however, have no idea how they work. Here's a nifty picture!
Without getting too technical, the basic premise is that you have a physical medium (in this case, a magnetic disk) that holds information based on how it's magnetized. There is an "arm" that sweeps over the disk as it's spinning, and the arm can both read and write magnetic information on the disk.
To use a disk, we create what's called a partition. A partition, true to the word's literal meaning, is a way to divide a hard disk into separate containers. If a disk is a long series of slots to store information, say it looked like this:
[000000000000000000000000000000000000]
...then we could divide it into two partitions, and to our operating system they would look like two different sections:
[000000000000000000000000][0000000000]
There are a number of different reasons to divide a disk into multiple partitions. Partitions can make it easier to manipulate information, make backups, or assign completely different purposes to different sections of a disk. They are also useful for RAID, but we'll get to that.
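If you're curious how your own machine is carved up, the linux kernel lists every disk and partition it knows about in /proc/partitions. Here's a minimal sketch that pulls out the partitions belonging to one disk. The name list_partitions is made up for illustration, and the sketch assumes the classic sdX naming (sda, sda1, ...), so it won't match NVMe-style names.

```shell
# list_partitions DISK [TABLE_FILE]
# Print the partitions of DISK (e.g. "sda") found in /proc/partitions.
# Hypothetical helper; the optional second argument only exists so the
# function can be pointed at a saved copy of the table.
list_partitions() {
    disk=$1
    table=${2:-/proc/partitions}
    # Column 4 is the device name; keep names that are DISK plus a number.
    awk -v d="$disk" '$4 ~ ("^" d "[0-9]+$") { print $4 }' "$table"
}
```

Running list_partitions sda on a machine with the common two-partition layout would print sda1 and sda2, one per line.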
Filesystems
The next thing you need to know about is file systems. Think of a file system as a way to map out the large array of information on a partition. When you put a file system on a partition, different "chunks" are grouped together and given names for reference. This makes it much easier to find things on a disk when you're looking for them. Now it looks a little more like this to our computer:
[list_of_addresses:(group1)000000|(group2)000000|(group3)000000|(group4)000000]
Only it's much larger, and this is a very basic abstraction. It's not too important, since filesystems aren't our focus.
Conventional Setup Without RAID
By convention, there are some commonly seen partition setups on linux installations. Here is the one you will see the most:
1 Hard Disk
2 Partitions
[[(holds / "root" directory tree):::::][swap:::::]]
"Swap" is the linux version of virtual memory. This is hard drive space that is used as overflow when your RAM becomes full.
The second most common convention is to have a separate partition for your /home directory:
1 Hard Disk
3 Partitions
[[(holds / "root" directory tree)::][(holds /home directory tree):::][swap::::]]
This second setup is handy if you want to do a complete reinstall of your OS, but want to retain your personal files. Just reinstall on the first partition, and then swap out the new, empty /home for the one still sitting on your second partition. This involves using something called your "fstab" and mounting; not really our topic for today.
So What is RAID?
RAID stands for Redundant Array of Inexpensive Disks. By convention, these disks are hard drives. There are two main things you can do with RAID:
- Striping: This is when you make a computer treat multiple disks (or partitions) as one unit. For example, if the storage on two VERY small hard drives were these two lists:
[::::::::::::::::] and [::::::::::::::::]
then striping causes the computer to think of these as [::::::::::::::::::::::::::::::::], with no divide between them. Data is split into fixed-size chunks that are dealt out to the disks in turn. This is convenient if you don't want to deal with lots of smaller drives and would rather pool all of your resources into one lump sum. It also means that if you're pulling the following text from our striped disks:
[:::::::::::::::::hello,world:::::::::::::::]
and "hello," and "world" are each on separate real disks:
[::::::::::::::hello,] [world:::::::::::::]
then we can read it in up to twice as fast, because both disks work at the same time. Each disk has its own arm to move around, and its own cable connected to the motherboard.
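To make the "hello," / "world" picture concrete, here is a toy model of striping that deals fixed-size chunks of a string out to a number of pretend disks in round-robin order. It's only a sketch: stripe_chunks is a made-up name, the "disks" are just strings, and nothing here touches a real device.

```shell
# stripe_chunks DATA CHUNK_SIZE NDISKS
# Toy model of RAID 0: split DATA into CHUNK_SIZE-character chunks and
# hand them to NDISKS "disks" in turn. Prints one line per disk.
stripe_chunks() {
    data=$1
    chunk=$2
    ndisks=$3
    printf '%s\n' "$data" | awk -v c="$chunk" -v n="$ndisks" '{
        # Chunk i lands on disk (i mod n).
        for (i = 0; i * c < length($0); i++)
            disk[i % n] = disk[i % n] substr($0, i * c + 1, c)
        for (d = 0; d < n; d++)
            printf "disk%d: %s\n", d, disk[d]
    }'
}

stripe_chunks "hello,world" 6 2
```

With a chunk size of 6 the first disk ends up holding "hello," and the second holds "world", which is the situation drawn above. A smaller chunk size interleaves the text more finely across the two disks.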
The second method of RAID is:
- Mirroring: This is where, not surprisingly, you mirror information from one disk (or partition) to another. So if we take our original two disks
[::::::::::::::::] and [::::::::::::::::]
we keep them separate, but as we write information to one, an exact duplicate is written to the other:
[Ars:d::43::564] and [Ars:d::43::564]
where the letters and numbers are chunks of information, and the colons are blank space. The main advantage of redundancy is backup: if one disk dies, the other still holds a complete copy. For reasons I don't entirely understand, the type of RAID we'll be using doesn't benefit in read speed when you mirror the drives. Other types, called "hardware RAID", do.
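The mirroring idea can be sketched the same way, using two ordinary files as stand-ins for the two disks. Every write goes to both, so losing one still leaves a full copy. mirror_write is a hypothetical helper for illustration only; real software RAID does this inside the kernel, not with tee.

```shell
# mirror_write DISK_A DISK_B TEXT...
# Toy model of RAID 1: append the same line to both "disks" (plain files).
mirror_write() {
    disk_a=$1
    disk_b=$2
    shift 2
    printf '%s\n' "$*" | tee -a "$disk_a" >> "$disk_b"
}

tmp=$(mktemp -d)
mirror_write "$tmp/disk_a" "$tmp/disk_b" "Ars"
mirror_write "$tmp/disk_a" "$tmp/disk_b" "43"
cmp -s "$tmp/disk_a" "$tmp/disk_b" && echo "both copies match"
rm "$tmp/disk_a"       # pretend one drive just died
cat "$tmp/disk_b"      # the survivor still has everything
rm -r "$tmp"
```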
Striping disks is known as RAID 0, while a mirror is called RAID 1. You can "layer" these RAID methods to group and mirror disks at the same time. This creates combinations known as RAID 1+0 or RAID 0+1. These are not the same, but we'll get to that later. There are also other configurations, such as RAID 5, which adds parity information to striping, and JBOD, which simply joins disks end to end. I encourage you to check out the Wikipedia page on RAID, linked above in the resources section.
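One practical consequence of picking a level is how much usable space you end up with. Here is a rough sketch of the arithmetic, assuming all disks are the same size and that RAID 10 uses two-way mirrors. raid_capacity is a made-up name, and the sizes are in whatever unit you pass in.

```shell
# raid_capacity LEVEL NDISKS SIZE_PER_DISK
# Usable capacity under the assumptions above (equal disks, 2-way mirrors).
raid_capacity() {
    level=$1
    n=$2
    size=$3
    case $level in
        raid0)  echo $(( n * size )) ;;       # every disk adds space
        raid1)  echo "$size" ;;               # every disk holds the same copy
        raid10) echo $(( n / 2 * size )) ;;   # half the disks are mirrors
        *) echo "unknown level: $level" >&2; return 1 ;;
    esac
}

raid_capacity raid0 2 500
raid_capacity raid1 2 500
raid_capacity raid10 4 500
```

So two 500 GB disks give 1000 GB striped but only 500 GB mirrored, and four of them in a RAID 10 give 1000 GB.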
Hardware vs. Fake vs. Software RAID
Broadly, there are three ways to go about creating a RAID array.
- Hardware: This is where a set of dedicated hardware sits between your operating system and your hard disks. It does all the work reading and writing to your disks, but your computer just thinks it's talking to one hard drive. You can buy special RAID cards to do this; good ones cost about $250.
   ------------------
   | Your operating |
   |     system     |
   ------------------
          |  |
          |  |
   ------------------
   |   Raid Card    |
   ------------------
    ||   ||   ||   ||
   //    ||   ||    \\
----- ----- ----- -----
|HD1| |HD2| |HD3| |HD4|
----- ----- ----- -----
- Your second option, what most Windows users have, is called "Fake RAID". This is where you have some dedicated hardware, but it can't function unless special drivers are installed in your operating system. Since hardware developers tend to ignore linux users, Fake RAID doesn't really exist for us.
- The third option, the kind we would like to use, is Software RAID. Since about three years ago, the linux kernel has been able to recognize RAID devices created with certain software. In many ways this is still an undeveloped section of the kernel. Its implementation is kind of sloppy and is NOT beginner friendly. You need to be comfortable on the command line, or you will be by the time you're done. After playing around with this for the past week, I have decided that its only advantage is price. While you can get some performance increase, if you're committed to using RAID, fork over the money for a card. Software RAID has some serious limitations, which we'll go over as we look at what we want versus what we can get.
What Kind of Setup do We Want? RAID10 !
What Kind Are We Going to Get? Not RAID10 !
This PC is supposed to be a high end media PC, capable of doing video editing and animation rendering in more reasonable amounts of time. A RAID10 can give us both improved read performance and some redundancy for backup. There are actually a few complications regarding the RAID setup we would like. After 20 hours or more of troubleshooting, it turns out the RAID10 setup we want is just not functionally supported yet.
But what would RAID10 look like? Well, that depends on whether you're looking at striping first or mirroring first.
---------------------------------------
|                                     |
|   -------              -------      |
|   |     |              |     |      |
|   | HD1 |              | HD3 |      |
|   |     |              |     |      |
|   -------              -------      |
|   |     | <==mirror==> |     |      |
|   -------              -------      |
|   |     |              |     |      |
|   | HD2 |              | HD4 |      |
|   |     |              |     |      |
|   -------              -------      |
|                                     |
---------------------------------------
Here's stripe first, mirror second. So what do we learn from that? ASCII art is awesome. But basically what we have are four hard drives. We divide them into two pairs, and then stripe together those two pairs (making two large hard drives). Then we want one large hard drive to mirror the other. Conceptually it's very simple.
Here's mirror first, stripe second:
---------------------------------------
|                                     |
|  --------              --------     |
|  |      |              |      |     |
|  | HD1  |              | HD3  |     |
|  |      |              |      |     |
|  --------              --------     |
|  |mirror| <==stripe==> |mirror|     |
|  --------              --------     |
|  |      |              |      |     |
|  | HD2  |              | HD4  |     |
|  |      |              |      |     |
|  --------              --------     |
|                                     |
---------------------------------------
There's only one stripe. At first you may think that you would need two, but remember that when devices are mirrored, they appear to be one drive.
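Earlier I said RAID 1+0 and RAID 0+1 are not the same. One way to see the difference is to ask which two-disk failures each layout survives. The sketch below uses the pairings from the two diagrams (HD1 with HD2, HD3 with HD4) and simply tries all six possible pairs of dead disks. The function names are invented for illustration, and this is a counting exercise, not a statement about how any particular RAID implementation behaves during a rebuild.

```shell
# Disks 1,2 form the left pair and disks 3,4 the right pair.
same_group() {   # succeeds when both disks sit in the same pair
    [ $(( ($1 - 1) / 2 )) -eq $(( ($2 - 1) / 2 )) ]
}

# survives LAYOUT A B: does the array still work after disks A and B die?
survives() {
    case $1 in
        10) ! same_group "$2" "$3" ;;  # mirror first: dead only if one mirror loses both disks
        01) same_group "$2" "$3" ;;    # stripe first: alive only if one whole stripe is untouched
    esac
}

# count_survivable LAYOUT: how many of the 6 two-disk failures are survivable
count_survivable() {
    ok=0
    for pair in "1 2" "1 3" "1 4" "2 3" "2 4" "3 4"; do
        set -- "$1" $pair
        if survives "$1" "$2" "$3"; then ok=$((ok + 1)); fi
    done
    echo "$ok"
}

echo "RAID 1+0 survives $(count_survivable 10) of 6 two-disk failures"
echo "RAID 0+1 survives $(count_survivable 01) of 6 two-disk failures"
```

Mirror first, stripe second comes out ahead: it survives 4 of the 6 cases, while stripe first, mirror second survives only 2.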
Why Not?
The part that complicates everything is the bootloader. A bootloader (in the case of linux, either GRUB or LILO) is a tiny piece of software that gets loaded into memory before your operating system boots up. Fundamentally, the bootloader tells your computer where to find the files it needs to start loading the whole operating system. Bootloaders are very useful if you have multiple operating systems, or multiple linux kernels, or if you want to boot up with certain options (such as verbose, to see what's going on).
The problem is that, while the linux kernel has been updated to include RAID functionality, GRUB and LILO have not. They are unable to reference a RAID device that involves any kind of striping (RAID0). That includes RAID10.
"But wait," you say! "We can use partitions to solve this! Like putting /home on a different partition, we can make two RAIDs. One without striping (RAID1), for just the boot information, and put everything else on a RAID10!" You, like me, are technically correct, but it does not really help us.
This gets a little complicated, but here is what it looks like graphically:
Divide each of the 4 drives into a /boot partition, and a "/" root partition.
--------
| boot |
--------
| root |  X 4
| Par2 |
|      |
--------
Then mirror the four /boot partitions, and create a RAID10 out of the other four.
--------
| boot |  X 4
--------

---------------------------------------
|                                     |
|  --------              --------     |
|  | root |              | root |     |
|  | HD1  |              | HD3  |     |
|  | Par2 |              | Par2 |     |
|  --------              --------     |
|  |mirror| <==stripe==> |mirror|     |
|  --------              --------     |
|  | root |              | root |     |
|  | HD2  |              | HD4  |     |
|  | Par2 |              | Par2 |     |
|  --------              --------     |
|                                     |
---------------------------------------
So what fails? Well, the root directory cannot be on the RAID10. We could make a separate partition on each hard drive for each subdirectory we want on the RAID (i.e. /var, /opt, /home, /tmp), assigning a separate size to each, and repeating it on each hard drive. This gets -really- hard to keep track of. Here is what each HDD would have to look like:
-----------
| / |
-----------
| /var |
-----------
| /opt |
-----------
| /home |
-----------
| /tmp |
-----------
Now not only do you need to repeat that 4 times, but you now need a LOT of RAID arrays. 1 RAID1 for booting. 1 X 4 RAID0 for /var, /opt, /home, /tmp + 1 X 4 RAID1 for /var, /opt, /home, /tmp. You can try to keep it more cohesive using another technique called "Logical Volumes", but this creates an even more confusing setup and gives back some of the performance we gained from the RAID10.
Ultimately, this option was not pursued because I don't know which directories individual applications will use to read and write their rendering data. One application could use /tmp, but another may use /opt, or some other subdirectory I haven't thought of. It would be a lot of work, just to have the programs not gain any read performance because they're stuck using the boot RAID.
What we Can Do Instead
Well, might as well learn how to actually do this process. I've tried to set up bigblue (the media PC) so that three of the hard disks are completely unused. We're going to SSH over and do an interactive walkthrough of creating some RAID arrays on the fly. Let's try making a RAID with 2 disks.
Necessary Tools
- fdisk or other partition editor (this partitions hard drives and sets partition types)
- mdadm (this handles the creation of the actual RAID array)
- mkfs (this creates file systems)
- Peanut Butter Crackers (see picture)
Create Partition Tables for our Drives
(you need root privileges to do most of this stuff, so sudo or root bash is assumed)
fdisk /dev/sdb
Where "/dev/sdb" is the drive we want to work with.
The commands we need to know are:
"p" = Show partition table
"d" = delete a partition
"n" = create new partition
"t" = set partition type
"fd" = hex for raid autodetect partition
"w" = write changes
"q" = exit without making changes
We're going to delete any partitions left over. Then, since we're just doing this to test out the raid, we're only going to make one partition, and set the type to raid autodetect. Repeat this on the second drive.
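Since we're about to delete partitions, it's worth a quick check that the drive isn't currently mounted anywhere. Below is a minimal sketch that looks the device up in /proc/mounts. in_use is a hypothetical helper name, and it only checks mounts, so it would not notice a partition that is in use as swap or as part of another array.

```shell
# in_use DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE, or a numbered partition of it, appears as a mounted
# device in /proc/mounts. The optional second argument exists only so the
# function can be pointed at a saved copy of that file.
in_use() {
    dev=$1
    mounts=${2:-/proc/mounts}
    awk -v d="$dev" '$1 ~ ("^" d "[0-9]*$") { found = 1 }
                     END { exit !found }' "$mounts"
}

# Example (not run here): only open the partition editor on an idle disk.
#   in_use /dev/sdb && echo "/dev/sdb is mounted, pick another disk"
```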
Put filesystem on Partitions
You don't have to do this, but I've found that it actually speeds up the process of creating the RAID device. I'm not exactly sure why, but mdadm somehow takes advantage of drives that already have file systems on them.
mke2fs /dev/sdb1
mke2fs /dev/sdc1
Create RAID Device
This is where "mdadm" comes in. Here's the command we'll use:
mdadm --create /dev/md0 --chunk=64 --level=raid0 --raid-devices=2 /dev/sdb1 /dev/sdc1
- --create /dev/md0 tells mdadm that we are making an array, and the device should be called /dev/md0
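- --chunk=64: This has to do with the mapping of the hard drive. The chunk size is how much data goes to one disk before the array moves on to the next. The value is in kilobytes, not bytes; 64 KB is the default size these days.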
- --level=raid0 It's going to be a stripe array
- --raid-devices=2 We're using two physical hard drives (actually, partitions), and they are /dev/sdb1 and /dev/sdc1
The 1 at the end of /dev/sdb1 and /dev/sdc1 is the number of the partition we're working with.
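To get a feel for what the chunk size does, here is a small sketch of the arithmetic for a plain stripe: which disk a given byte offset lands on. It assumes the simple round-robin layout described earlier with equal-size members, and disk_for_offset is a name invented for this example rather than anything mdadm provides.

```shell
# disk_for_offset OFFSET_BYTES CHUNK_KB NDISKS
# Index (starting at 0) of the disk that holds the given byte offset,
# assuming chunks are handed out to the disks in simple rotation.
disk_for_offset() {
    echo $(( $1 / ($2 * 1024) % $3 ))
}

disk_for_offset 0      64 2   # first chunk, first disk
disk_for_offset 65536  64 2   # second chunk, second disk
disk_for_offset 131072 64 2   # third chunk, back on the first disk
```

With --chunk=64 and two disks, the first 64 KB goes to /dev/sdb1, the next 64 KB to /dev/sdc1, and so on, which is why a large sequential read keeps both drives busy.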
Create Filesystem
Now like any other drive, /dev/md0 needs to be formatted with a filesystem before we can mount it.
mkfs -t ext3 /dev/md0
Make an ext3 type filesystem on /dev/md0. Ext3 is a common/popular filesystem for linux machines.
Mount It
Create a directory to mount the raid device to:
mkdir /media/raid
Now mount the device
mount /dev/md0 /media/raid
Check it out. It should be twice the size of each regular disk. You can run a very basic read throughput test with the following command:
hdparm -t /dev/md0
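Another quick sanity check is /proc/mdstat, where the kernel reports every md array it is running. The sketch below pulls the RAID level for one array out of that file. md_level is a made-up helper, and it assumes the usual "md0 : active raid0 ..." line for an active array; an inactive or read-only array has a different word in that position.

```shell
# md_level ARRAY [MDSTAT_FILE]
# Print the RAID level the kernel reports for ARRAY (e.g. "md0").
# The optional second argument exists only so the function can be pointed
# at a saved copy of /proc/mdstat.
md_level() {
    md=$1
    stat=${2:-/proc/mdstat}
    # Array lines look like: md0 : active raid0 sdc1[1] sdb1[0]
    awk -v m="$md" '$1 == m && $2 == ":" { print $4 }' "$stat"
}
```

After the mdadm command above, md_level md0 should print raid0. For more detail than this, mdadm --detail /dev/md0 prints the full description of the array.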
Some Benchmarks on Performance
I created a bunch of these raid devices and benchmarked their performance. The first set of numbers below comes from hdparm -t. The tables after that come from a program called bonnie++, which is available in the Ubuntu repositories.
Single drive reads
/dev/sda:
Timing buffered disk reads: 214 MB in 3.02 seconds = 70.89 MB/sec
/dev/sdb:
Timing buffered disk reads: 210 MB in 3.02 seconds = 69.64 MB/sec
/dev/sdc:
Timing buffered disk reads: 208 MB in 3.00 seconds = 69.30 MB/sec
2 Disk striped array
/dev/md0:
Timing buffered disk reads: 334 MB in 3.01 seconds = 110.83 MB/sec
/dev/md0:
Timing buffered disk reads: 334 MB in 3.01 seconds = 111.14 MB/sec
3 Disk striped array
/dev/md0:
Timing buffered disk reads: 490 MB in 3.00 seconds = 163.28 MB/sec
/dev/md0:
Timing buffered disk reads: 490 MB in 3.00 seconds = 163.28 MB/sec
2 Disk Mirror Array
/dev/md0:
Timing buffered disk reads: 202 MB in 3.00 seconds = 67.33 MB/sec
3 Disk Mirror Array
/dev/md0:
Timing buffered disk reads: 204 MB in 3.02 seconds = 67.62 MB/sec
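To put the hdparm numbers in perspective, it helps to divide each array's throughput by that of a single drive. A tiny sketch of that calculation, using the figures above with /dev/sda as the baseline (speedup is just a convenience name for this example):

```shell
# speedup ARRAY_MB_PER_SEC SINGLE_DISK_MB_PER_SEC
# Ratio of the two throughputs, rounded to two decimal places.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 110.83 70.89   # 2 disk stripe vs. /dev/sda
speedup 163.28 70.89   # 3 disk stripe vs. /dev/sda
speedup 67.33  70.89   # 2 disk mirror vs. /dev/sda
```

That works out to roughly 1.56 times a single drive for the two-disk stripe and 2.30 times for the three-disk stripe, while the mirror comes in at about 0.95, slightly slower than one drive on its own.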
RAID1 using 2 Disks
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
bigblue       6296M 43881  81 45299  13 24527   7 59694  90 69558  12 210.2   0
RAID0 Using 2 Disks
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
bigblue       6296M 54517  97 90975  29 46067  13 54236  94 112059 16 215.7   0
RAID0 Using 3 Disks
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
bigblue       6296M 46401  87 49958  14 25165   7 54785  83 57663   9 264.2   0
RAID1 Using 3 disks
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
bigblue       6296M 44064  81 44874  14 26783   8 55273  89 68490  12 387.9   1