Examining RAID data recovery
RAID (Redundant Array of Independent Disks) systems are commonly used in servers, high-end computers, NAS (Network Attached Storage) and high-capacity external backup devices. RAID data recovery is more complex than recovery from a single hard drive. However, as with all types of recovery, the difficulty depends on the nature of the fault and the damage it has caused.
RAID setups are used by many businesses and organisations to store critical information, usually in a server setup. Most RAID configurations provide data redundancy as a safety measure in the event of disk read errors and/or drive failure. In a redundant configuration, if one of the drives in the RAID becomes unreadable or fails, the integrity of the data remains intact.
Whilst data backups should be an integral part of any IT implementation, a RAID setup is ideal to help maintain an uninterrupted service and help prevent data loss.
The basics of RAID
RAID systems consist of two or more hard drives or SSDs working in unison to present a logical volume of storage. A RAID controller (either as hardware or software) sits between the operating system and the physical drives and provides this virtualisation. This allows the multiple drives in the RAID to be seen as one unit of storage which can then be accessed by the operating system as one or more logical drives as required.
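To make this virtualisation more concrete, the following is a minimal sketch of how a controller might translate a logical block number into a physical location. It assumes a plain striped layout with no parity (striping is described under Types of RAID). The function name `locate_block` and its parameters are hypothetical, chosen for illustration; real controllers use their own, often proprietary, layouts.

```python
def locate_block(logical_block, num_drives, blocks_per_stripe_unit=1):
    """Map a logical block number to (drive index, block offset on that drive).

    Assumes a simple striped layout: consecutive stripe units are placed on
    consecutive drives, wrapping back to the first drive after the last.
    """
    if num_drives < 1 or blocks_per_stripe_unit < 1:
        raise ValueError("num_drives and blocks_per_stripe_unit must be positive")
    # Which stripe unit the block falls in, and its position inside that unit.
    stripe_unit, within_unit = divmod(logical_block, blocks_per_stripe_unit)
    drive = stripe_unit % num_drives   # stripe units rotate across the drives
    row = stripe_unit // num_drives    # how many full rows precede this unit
    return drive, row * blocks_per_stripe_unit + within_unit


# Logical blocks 0..5 on a three-drive array with one block per stripe unit.
for block in range(6):
    print(block, locate_block(block, num_drives=3))
```

The operating system only ever sees the logical block numbers; the mapping to individual drives stays hidden behind the controller.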
RAID systems are designed to be fault tolerant, which means they can continue to run and provide data integrity for read and write processes even when errors occur on one of the disks. If a single drive in a RAID system encounters errors or fails completely it can be ‘hot swapped’, where the drive is removed and replaced without the need to shut down the server or cause any loss of service.
Types of RAID
RAID arrays can be set up in various configurations to favour speed or reliability. These configurations, commonly referred to as RAID levels, determine how data is stored across the multiple drives.
There are three key terms that are used to describe a RAID level:
- Mirrored - The same data is written to two or more drives, so that each holds an exact duplicate.
- Striped - The data is split into blocks that are spread across multiple drives.
- Parity - Parity is a value calculated from the data it protects. It is used to detect errors and to allow missing data to be rebuilt from the remaining drives.
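As an illustration of the parity idea, the sketch below uses XOR parity, which is the scheme single-parity levels such as RAID 5 are generally based on. The helper name `xor_blocks` and the sample data are assumptions made for this example only. Dual-parity levels such as RAID 6 need a second, different calculation that is not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, as they might sit on three different drives.
data = [b"RAID", b"DATA", b"DISK"]

# The parity block is the XOR of all the data blocks.
parity = xor_blocks(data)

# Suppose the drive holding data[1] fails. XORing the surviving data blocks
# with the parity block yields the missing block again.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

This works because XORing a value with itself cancels out, so everything except the missing block disappears from the result.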
There are many different RAID levels, the most common being RAID 0 to 6 and RAID 10. The table below provides a rough comparison of the basic RAID levels.
| RAID Level | Mirroring | Striping | Parity | Read performance | Write performance | Protection | Details |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | No | Yes | No | High | High | None | Fast. Good for home and gaming. No redundancy. |
| 1 | Yes | No | No | Medium | Medium | 1 drive failure | Slow. NAS boxes, OS systems on servers. Good redundancy. |
| 2 | No | Yes | Yes | Medium | Medium | 1 drive failure | Not commonly used. |
| 3 | No | Yes | Yes | Medium | Low | 1 drive failure | Not commonly used. |
| 4 | No | Yes | Yes | High | Low | 1 drive failure | Not commonly used. |
| 5 | No | Yes | Yes | High | Low | 1 drive failure | NAS. Data servers. |
| 6 | No | Yes | Yes | High | Low | 2 drive failures | Good redundancy. Min 4 drives. Business servers. |
| 10 | Yes | Yes | No | High | Medium | 1 drive failure (per array) | Min 4 drives. |
| 50 | No | Yes | Yes | High | Medium | 1 drive failure (per array) | Commonly used. Database and app servers. |
Common faults on RAID systems
As they are arrays of multiple hard disk drives, RAID systems suffer many of the same faults that a single hard drive would experience, such as a head crash, board failure, deleted files or viruses. The nature of a redundant RAID system means that a single drive in the array with physical damage can usually be replaced without any loss of data.
Despite this protective functionality, RAID systems can still fail for other reasons, so RAID data recovery should still be included as part of a disaster recovery plan.
The types of issues that a RAID system can encounter are as follows:
- RAID controller card failure
- Electronic failure on the server rack
- Deleted files
- Software corruption
- Issues when the RAID rebuilds after a drive replacement
- Failure to boot
- Corrupted RAID configuration
- Virus or malware
- Power surge
- Multiple drive failures
- Hard drive head crash
- Error during drive formatting
- RAID drives not synchronising
As the various RAID configurations create a complex relationship between the drives, faults that affect the hardware containing the drives or the software controlling them can easily lead to read/write errors and cause data access problems.
RAID data recovery
The RAID recovery process can be quite involved, so a good understanding of RAID setups and a methodical approach are required.
Evaluation
The first step is to evaluate all member disks from the RAID setup and to isolate any faulty drives. This will help determine if the fault is with the hard drives, with the hardware housing for the drives or with the RAID controller. At this stage we treat each drive as a potential recovery and assess them as such.
A rebuild recovery
Depending on the RAID level, recovery can be performed on the “known good” disks and the recovered volume then mounted virtually using data recovery tools. For example, a 3-drive RAID 5 with 1 faulty member can have all its data retrieved using the 2 healthy members. This is a best-case scenario. Depending on the fault, more than one hard drive may be affected.
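The 3-drive RAID 5 case can be sketched in a few lines. This is an illustration only and not how any particular recovery tool works. It assumes XOR parity, a hypothetical parity rotation (`parity_drive`) and a tiny chunk size; in a real recovery the chunk size, drive order and rotation scheme have to be determined from the array itself. `build_members` exists purely to create demo images.

```python
from functools import reduce


def xor_all(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def parity_drive(row, n):
    """Assumed rotation: parity starts on the last drive and moves left each row."""
    return (n - 1) - (row % n)


def build_members(data, n, chunk):
    """Demo helper: lay data out across n member images with rotating parity."""
    row_bytes = chunk * (n - 1)
    padded_len = -(-len(data) // row_bytes) * row_bytes  # round up to whole rows
    data = data.ljust(padded_len, b"\0")
    members = [bytearray() for _ in range(n)]
    for row, start in enumerate(range(0, len(data), row_bytes)):
        chunks = [data[start + i * chunk:start + (i + 1) * chunk] for i in range(n - 1)]
        p = parity_drive(row, n)
        data_drives = [d for d in range(n) if d != p]
        for d, c in zip(data_drives, chunks):
            members[d] += c
        members[p] += xor_all(chunks)
    return [bytes(m) for m in members]


def recover_volume(members, chunk):
    """Rebuild the logical volume; at most one member may be None (failed)."""
    n = len(members)
    missing = [i for i, m in enumerate(members) if m is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot rebuild more than one missing member")
    if missing:
        survivors = [m for m in members if m is not None]
        members = list(members)
        # Every row XORs to zero across all members, so the missing image
        # is simply the XOR of the surviving images.
        members[missing[0]] = xor_all(survivors)
    volume = bytearray()
    for row in range(len(members[0]) // chunk):
        p = parity_drive(row, n)
        for d in range(n):
            if d != p:  # skip the parity chunk, keep the data chunks in order
                volume += members[d][row * chunk:(row + 1) * chunk]
    return bytes(volume)


original = b"The quick brown fox jumps over the lazy dog"
members = build_members(original, n=3, chunk=4)
members[1] = None  # simulate the faulty member
recovered = recover_volume(members, chunk=4)
assert recovered[:len(original)] == original
```

The recovered volume may carry trailing zero padding from the last row, which is why the comparison is limited to the original length.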
Multiple hard drive failures
If more drives in a RAID array have failed than the RAID level can tolerate, it is necessary to diagnose the faults to determine whether the drives are repairable or suitable for the recovery process.
Once a hard drive has been repaired to the point where data is obtainable, it is generally a good idea to “clone” this drive, making a copy of it onto a known good drive. This ensures a reliable copy of the data to work with. Continuing to work against the original device could see it degrade during the process, potentially reducing the chances of recovery.
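The cloning idea can be sketched as a block-by-block copy that keeps going when a block cannot be read. This is a simplified illustration, not a substitute for dedicated imaging tools, which retry, reorder reads and keep detailed logs. The function `clone_image` and the `FlakySource` class (which simulates an unreadable region) are hypothetical names for this example.

```python
import io


def clone_image(src, dst, total_size, block_size=4096):
    """Copy src to dst block by block.

    Blocks that cannot be read are zero-filled in the copy, and their
    starting offsets are returned so they can be revisited later.
    """
    bad_offsets = []
    for offset in range(0, total_size, block_size):
        want = min(block_size, total_size - offset)
        try:
            src.seek(offset)
            block = src.read(want)
        except OSError:
            block = b""  # treat a read error as "nothing recovered"
        if len(block) < want:
            bad_offsets.append(offset)
            block = block.ljust(want, b"\0")  # keep offsets aligned in the copy
        dst.seek(offset)
        dst.write(block)
    return bad_offsets


class FlakySource(io.BytesIO):
    """Demo only: an in-memory 'drive' that fails when read at offset 8."""

    def read(self, size=-1):
        if self.tell() == 8:
            raise OSError("simulated unreadable sector")
        return super().read(size)


source = FlakySource(bytes(range(16)))
copy = io.BytesIO()
bad = clone_image(source, copy, total_size=16, block_size=4)
print("unreadable blocks at offsets:", bad)
```

Zero-filling rather than skipping keeps every readable block at its original offset, which matters when the clone is later used as a RAID member.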
Rebuilding the RAID
Once data has been recovered from the failed drive(s), the RAID then needs to be rebuilt.
Rebuilding the RAID can sometimes be the most time consuming part of the RAID data recovery process depending on the type of RAID used and the number of drives involved.
When the RAID is rebuilt we can then check the integrity of the data to make sure the recovery is a success.
Summary
Whilst RAID data recovery is more complex than single drive recovery, many of the underlying methods and techniques are the same. Each faulty drive is treated as a single drive recovery, and once that has been addressed the RAID is rebuilt to recover the data.
As with any data recovery (under most circumstances), we would advise against reusing the faulty drive. The faulty drive is only ever fixed to the point of enabling the data recovery. Going forward, that device should not be relied upon.