Comments on building a new RAID5 array

I’ve rescued the following e-mail from Neil Brown about building a new RAID5 array in Linux, and why one of the disks is marked as a spare while the array is being constructed:

When creating a new raid5 array, we need to make sure the parity
blocks are all correct (obviously). There are several ways to do
this.

  1. Write zeros to all drives. This would make the array unusable until the clearing is complete, so isn’t a good option.
  2. Read all the data blocks, compute the parity block, and then write out the parity block. This works, but is not optimal. Remembering that the parity block is on a different drive for each ‘stripe’, think about what the read/write heads are doing. The heads on the ‘reading’ drives will be somewhere ahead of the heads on the ‘writing’ drive. Every time we step to a new stripe and change which is the ‘writing’ head, the other reading heads have to wait for the head that has just changed from ‘writing’ to ‘reading’ to catch up (finish writing, then start reading). Waiting slows things down, so this is uniformly sub-optimal.
  3. Read all data blocks and parity blocks, check the parity block to see if it is correct, and only write out a new block if it wasn’t. This works quite well if most of the parity blocks are correct as all heads are reading in parallel and are pretty-much synchronised. This is how the raid5 ‘resync’ process in md works. It happens after an unclean shutdown if the array was active at crash-time. However if most or even many of the parity blocks are wrong, this process will be quite slow as the parity-block drive will have to read-a-bunch, step-back, write-a-bunch. So it isn’t good for initially setting the parity.
  4. Assume that the parity blocks are all correct, but that one drive is missing (i.e. the array is degraded). This is repaired by reconstructing what should have been on the missing drive, onto a spare. This involves reading all the ‘good’ drives in parallel, calculating the missing block (whether data or parity) and writing it to the ‘spare’ drive. The ‘spare’ will be written to a few (10s or 100s of) blocks behind the blocks being read off the ‘good’ drives, but each drive will run completely sequentially and so at top speed.
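
As an aside to the e-mail, the “calculating the missing block (whether data or parity)” step in option 4 can be sketched in a few lines of Python. This is only an illustration of the XOR arithmetic, not md’s actual code; the helper `xor_blocks`, the 4-byte blocks, and the four-drive layout are all made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-drive array: three data blocks plus parity.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)
stripe = data + [parity]

# Pretend drive 1 is missing and rebuild its block from the survivors.
survivors = [blk for i, blk in enumerate(stripe) if i != 1]
rebuilt = xor_blocks(survivors)

# The same operation rebuilds the parity block if that is the one missing.
rebuilt_parity = xor_blocks(stripe[:3])
```

The calculation is identical whichever block is missing, which is why the rebuild never needs to care whether the spare is receiving data or parity for a given stripe.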

On a new array where most of the parity blocks are probably bad, ‘4’ is clearly the best option. ‘mdadm’ makes sure this happens by creating a raid5 array not with N good drives, but with N-1 good drives and one spare. Reconstruction then happens and you should see exactly what was reported: reads from all but the last drive, writes to that last drive.
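
To see why treating the last drive as a spare leaves the whole array consistent, here is a toy model in Python. It is a sketch under loud assumptions: the drives are in-memory lists of random blocks, and `initial_build` and `stripe_is_consistent` are hypothetical names, not anything from md or mdadm. The point it illustrates is that a stripe is consistent when all of its blocks XOR to zero, so it does not matter which drive holds the parity for that stripe.

```python
import os
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def initial_build(drives, spare_index):
    """Rewrite every block on the spare from the other drives, stripe by stripe."""
    for s in range(len(drives[0])):
        good = [d[s] for i, d in enumerate(drives) if i != spare_index]
        drives[spare_index][s] = xor_blocks(good)  # sequential write to the spare

def stripe_is_consistent(drives, s):
    """A stripe is consistent when the XOR of all its blocks is all zeros."""
    return not any(xor_blocks([d[s] for d in drives]))

# Four drives full of arbitrary leftover contents (assumed sizes, for illustration).
N_DRIVES, N_STRIPES, BLOCK = 4, 8, 16
drives = [[os.urandom(BLOCK) for _ in range(N_STRIPES)] for _ in range(N_DRIVES)]

# Treat the last drive as the spare and rebuild onto it.
initial_build(drives, spare_index=N_DRIVES - 1)
all_ok = all(stripe_is_consistent(drives, s) for s in range(N_STRIPES))
```

In the model, the first N-1 drives are only ever read and the last drive is only ever written, which matches the I/O pattern described above.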
