Replace and re-add a failed drive to a Linux software RAID

Posted by ads' corner on Saturday, 2019-03-30
Posted in [Hardware][Linux]

Most of my systems run on a software RAID 1 configuration (that is, two disks, where each disk is mirrored to the other). This way, one of the disks can fail and all the data is still available.

If a disk fails, it is replaced with a similar disk, which then needs to be partitioned and re-added to the RAID.

Newer systems all use the GUID partition table (GPT), and therefore allow almost unlimited disk sizes. Instructions for re-adding a disk using GPT are a bit different from the days when MBR (up to 2 TB disk space) was used, so I’m writing them down here for future use.

First, obviously, the failed disk must be replaced. If there is a disk error, the software RAID should already have failed the disk. This can be verified by looking into /proc/mdstat:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 sda5[0]
      1943342080 blocks [2/1] [U_]

unused devices: <none>

One out of two disks for /dev/md2 is available. In this example, /dev/sda is healthy, and /dev/sdb failed.
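With several arrays on one machine, scanning /proc/mdstat by eye gets tedious. The following is a minimal sketch, not part of the original procedure, that prints the names of all arrays whose status field contains an underscore (a missing member). The function name degraded_arrays is made up for this example, and the optional file argument exists only so the parsing can be tried on saved output.

```shell
# Sketch: list md arrays that report a missing member.
# Reads /proc/mdstat by default; pass a file path to parse saved output.
degraded_arrays() {
    awk '
        # Remember the name of the array this block of lines belongs to.
        /^md[0-9]+ :/ { name = $1 }
        # A status field such as [U_] or [_U] contains "_" for each missing member.
        /\[[U_]*_[U_]*\]/ && name { print name; name = "" }
    ' "${1:-/proc/mdstat}"
}

# Usage: degraded_arrays
```

For the example above this would print md2. Note that an array that is still rebuilding also shows an underscore and is therefore listed as well.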

The serial number of the failed disk can be identified using the SMART control tools:

smartctl -i /dev/sdb

Device Model:     TOSHIBA DT01ACA300
Serial Number:    84QA172SE

The serial number helps to avoid accidentally removing the wrong disk and losing all data (the other RAID 1 disk is the only working disk right now). Alternatively, if the failed disk is no longer responding at all, this method can be used to identify the healthy disk - and then swap the other one.
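To compare serial numbers of several disks quickly, the relevant line can be cut out of the smartctl output. This is a small sketch; the function name serial_from_smartctl is made up for this example, and it assumes the "Serial Number:" line format shown above.

```shell
# Sketch: print only the serial number from "smartctl -i" output read on stdin.
serial_from_smartctl() {
    # Strip the "Serial Number:" label and the padding after it, print the rest.
    sed -n 's/^Serial Number:[[:space:]]*//p'
}

# Usage (needs root):
#   smartctl -i /dev/sda | serial_from_smartctl
#   smartctl -i /dev/sdb | serial_from_smartctl
```

The printed value can then be matched against the label on the physical disk before pulling it.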

Once the disk is replaced, the new disk needs to be partitioned. It’s important that the new disk has exactly the same partition structure as the old one. For GPT disks, the sgdisk utility can copy the partition table from the healthy to the new disk. On Debian and Ubuntu systems, the utility is in the gdisk package.

For the next step, be super careful, and verify twice that the disk names are in the right order.

sgdisk -R /dev/sdb /dev/sda

This will replicate (-R) the partition table from /dev/sda to /dev/sdb: the device given directly after -R is the target. Tutorials on the Internet sometimes show the same command in a different form:

sgdisk /dev/sda -R /dev/sdb

If you get the order wrong, the partition table of your healthy disk /dev/sda will be overwritten with the empty table from /dev/sdb. This will render your data inaccessible!
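Because a mix-up here is destructive, one possible safety net is to save the healthy disk's partition table to a file before replicating it. The sketch below wraps both steps. The function name replicate_gpt, the DRY_RUN switch and the backup path under /root are assumptions made for this example; sgdisk --backup and sgdisk -R are the real options. By default the commands are only printed, so the order can be checked before anything is written.

```shell
# Sketch: back up the healthy disk's GPT, then replicate it to the new disk.
# With DRY_RUN=1 (the default) the commands are printed instead of executed.
replicate_gpt() {
    src=$1   # healthy disk, e.g. /dev/sda
    dst=$2   # new, empty disk, e.g. /dev/sdb
    if [ -z "$src" ] || [ -z "$dst" ] || [ "$src" = "$dst" ]; then
        echo "usage: replicate_gpt HEALTHY_DISK NEW_DISK" >&2
        return 1
    fi
    run=
    [ "${DRY_RUN:-1}" = 1 ] && run=echo
    # The backup file name is a hypothetical choice for this example.
    $run sgdisk --backup="/root/gpt-$(basename "$src").bak" "$src" &&
    $run sgdisk -R "$dst" "$src"
}

# Usage:
#   replicate_gpt /dev/sda /dev/sdb            # print the commands only
#   DRY_RUN=0 replicate_gpt /dev/sda /dev/sdb  # really run them (needs root)
```

If the table on the healthy disk is damaged anyway, the saved file can be written back with sgdisk --load-backup.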

Copying the GPT also copies the GUIDs, so both disks now carry identical ones. The new disk must be assigned new, random GUIDs:

sgdisk -G /dev/sdb

Both disks should have the same partition table now. Verify it:

sgdisk -p /dev/sda ; sgdisk -p /dev/sdb

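The partition lists must be identical. Only the "Disk identifier (GUID)" line differs, because the new disk received its own GUID in the previous step.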
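Comparing two listings by eye is error-prone on disks with many partitions. A minimal sketch of an automated comparison follows. It assumes the usual sgdisk -p layout, where the partition rows start at a header line beginning with "Number"; the function names partition_rows and same_partitions are made up for this example. The header above that line is skipped on purpose, since it contains the device name and the disk GUID, which differ after sgdisk -G.

```shell
# Sketch: print the partition rows of "sgdisk -p" output read on stdin,
# starting at the "Number ..." column header.
partition_rows() {
    awk '/^Number/ { found = 1 } found'
}

# same_partitions FILE_A FILE_B
# Succeeds only if both saved listings contain identical, non-empty rows.
same_partitions() {
    rows_a=$(partition_rows < "$1")
    rows_b=$(partition_rows < "$2")
    [ -n "$rows_a" ] && [ "$rows_a" = "$rows_b" ]
}

# Usage (needs root):
#   sgdisk -p /dev/sda > /tmp/sda.txt
#   sgdisk -p /dev/sdb > /tmp/sdb.txt
#   same_partitions /tmp/sda.txt /tmp/sdb.txt && echo "partition tables match"
```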

Now it’s time to add the partitions back to the RAID. In the example above I only listed /dev/md2, but usually there are a few more arrays. Repeat the following step for every partition:

mdadm --manage /dev/md2 --add /dev/sdb5
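With more than one array, the step can be repeated in a loop. The following is a sketch under assumptions: only the md2/sdb5 pair comes from the example above, the other pairs in the usage line are hypothetical placeholders that must be replaced with the real layout, and the function name readd_partitions and the DRY_RUN switch are made up for this example. By default the mdadm commands are only printed.

```shell
# Sketch: add each replacement partition to its array.
# Arguments are ARRAY:PARTITION pairs without the /dev/ prefix, e.g. md2:sdb5.
# With DRY_RUN=1 (the default) the commands are printed instead of executed.
readd_partitions() {
    run=
    [ "${DRY_RUN:-1}" = 1 ] && run=echo
    for pair in "$@"; do
        md=${pair%%:*}     # text before the colon, e.g. md2
        part=${pair#*:}    # text after the colon, e.g. sdb5
        $run mdadm --manage "/dev/$md" --add "/dev/$part" || return 1
    done
}

# Usage (md0:sdb1 and md1:sdb2 are hypothetical, only md2:sdb5 is from the example):
#   readd_partitions md0:sdb1 md1:sdb2 md2:sdb5
#   DRY_RUN=0 readd_partitions md2:sdb5    # really run it (needs root)
```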

A look into /proc/mdstat should show that the rebuild process for /dev/md2 has started, or even has already finished if it’s only a small partition. Once the rebuild is finished, the RAID status should look like this:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 sdb5[1] sda5[0]
      1943342080 blocks [2/2] [UU]
