Check Your RAID Consistency Before A Rebuild
May 8, 2008
Over the years one of the most consistent problems with RAID recovery has been the rebuild. I would estimate that nearly 40 percent of the RAIDs that we cannot recover are unrecoverable solely because a technician executed a rebuild before verifying the following three items.
1. Hardware:
The RAID went down for some reason. Many times it is because the hardware housing the array may have some issues. There may be cabling problems, heat problems, back plane problems, or a hundred and one other hardware issues that can cause the RAID to degrade.
2. Hard drives
A simple surface scan of all drives in the array can give you an indication of the state of the drives. A report outlining any anomalies found for each drive is always critical when diagnosing the array.
3. RAID Consistency
A RAID five bases its integrity on a simple XOR calculation whose result, the parity, is stored on a block by block basis within the array stripe. The firmware of a RAID five uses this parity to ensure that the data stored on the RAID is consistent. It also ensures that if a single drive goes down and the array becomes degraded, the technician has ample time to do a quick backup of critical data, get all users off in a timely manner, and cleanly shut down any database handlers that may be resident and open on the array. In other words, don’t have a dirty shutdown of your Exchange store.
A degraded RAID 5 should NEVER BE RUN IN PRODUCTION!!! However, this advice is not normally followed, which is why RAID recovery is a multi-million dollar business. When a degraded RAID five is run in production for longer than twenty four hours, the offending drive now holds data that is considered stale. If a second drive goes down then the entire array goes down, as RAID five cannot run with two drives out.
When I get a call from a technician that their RAID is down because the array lost two drives, I immediately assume that one of the drives is stale and quickly advise the technician not to do a rebuild. In the entire time I have been recovering RAIDs, I can count on one hand the number of times a client has lost two drives simultaneously.
Although items 1 and 2 are not my bread and butter, I am familiar with the techniques used to do their respective checks. Item 3, however, I am very familiar with, and I can help you ascertain whether there is in fact a stale drive within your array. The following is a set of steps, as well as a free piece of software, that you can use before any rebuild is initiated.
Step 1: Pull all drives that are in the array out. Get the drives that are configured as part of the array away from the hardware. This does not include any hot swap drives, only those drives configured in the array and working at the time the array degraded.
Step 2: Make images of all the drives in the array. This serves several purposes. First, during an imaging session you may find bad sectors on the drives. Secondly, you never want to work on the live data as the drives may be on their last legs and any recovery, rebuild, or diagnostic run on live data may kill the drive. Lastly, if something happens to the drives then you will have the images as a way to recreate the original data set.
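As a rough illustration of the imaging step, the sketch below copies a source to an image chunk by chunk and records unreadable regions instead of aborting. It is only a minimal sketch and not a substitute for a dedicated imaging tool: the function name `image_stream`, the 64K chunk size, and the choice to zero-fill unreadable chunks are all assumptions made for this example.

```python
SECTOR = 512
CHUNK = 128 * SECTOR  # 64K per read; an arbitrary choice for this sketch


def image_stream(src, dst, total_bytes, chunk=CHUNK):
    """Copy total_bytes from src to dst; return offsets of unreadable chunks.

    src and dst are binary file objects (hypothetical names). A chunk that
    raises OSError is zero-filled in the image so later offsets stay aligned.
    """
    bad_offsets = []
    offset = 0
    while offset < total_bytes:
        want = min(chunk, total_bytes - offset)
        try:
            src.seek(offset)
            data = src.read(want)
        except OSError:
            data = b""
            bad_offsets.append(offset)
        # Pad failed or short reads so the image keeps the source geometry.
        dst.write(data.ljust(want, b"\x00"))
        offset += want
    return bad_offsets
```

The returned list of offsets is the kind of per-drive anomaly report mentioned under item 2 above.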
Step 3: Download the RAID Diagnostic Toolkit from our website and install it on a Windows NT type machine. The software is very easy to use and very self explanatory. There are options in the software that are not currently active because I will be introducing them in later posts. For those options, either a little window pops up to let you know that this is a future software enhancement, or the function is grayed out.
Currently the software defaults to a 64K, or 128 sector, stripe size. Although the stripe size has no bearing on this particular test, it is nevertheless used on 95 percent of the RAID fives that I work on and can give us a more real world type map.
The software will run the consistency check on your set of images and give you a report on whether the stripe is corrupt. It will not tell you which drive is the stale drive if the stripe is corrupt, only that a rebuild, using this set of drives would not be advisable.
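To make the idea of a consistency check concrete, here is a minimal sketch of one way such a check could work on a set of images. It is not the RAID Diagnostic Toolkit's code; the function names and the use of file objects are assumptions for illustration. It relies on one property of RAID 5: at any given offset, the XOR of the blocks from every member (data plus parity) is all zeros, no matter which drive holds the parity for that stripe.

```python
STRIPE = 64 * 1024  # 64K, i.e. 128 sectors of 512 bytes


def block_is_consistent(blocks):
    """True when the XOR of all member blocks at one offset is zero."""
    acc = 0
    for block in blocks:
        acc ^= int.from_bytes(block, "big")
    return acc == 0


def count_parity_errors(images, block_size=STRIPE):
    """Return (blocks_checked, blocks_with_errors) for a list of image streams.

    images is a list of binary file objects, one per array member.
    Scanning stops at the first short read, so a partial tail is ignored.
    """
    checked = errors = 0
    while True:
        blocks = [img.read(block_size) for img in images]
        if any(len(block) < block_size for block in blocks):
            break
        checked += 1
        if not block_is_consistent(blocks):
            errors += 1
    return checked, errors
```

Because the XOR runs across all members at the same offset, the result does not depend on the stripe size or the drive order, which is why a check like this can report that a stripe is corrupt without being able to say which drive is stale.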
RAID Configuration and Parity Check
May 8, 2008
The function set for the inaugural offering of RAID Diagnostic Toolkit is very basic. This post will explain how to choose a set of ‘streams’ to build a ‘RAID set’. Initially the software does not have any options for stripe size, RAID type, meta data offsets, and so on. For the ‘parity check’ function which the current version of this software offers, the assumptions will be a RAID 5 with a 64K stripe size and no meta data. In future releases of the software these, and many other options, will be added in order to make a more robust diagnostic tool.
First we must populate the RAID with streams. There are basically two types of streams that we will use. The first is a physical data stream, or ‘hard drive’. The second is an image data stream, or ‘file’. Figure A depicts populating the ‘stream list’ with physical streams. As you can see, the ‘Populate Stream List’ menu item is highlighted. Clicking on this will poll all hard drives on the local machine and display them as shown in Figure B.

Figure A

Figure B
The best way to test an array is to make images of the hard drives and then use the images for testing. From the ‘Configuration’ menu option click on “Add File Stream To List”. A standard Windows file selection dialog box will appear. Go to the proper folder and choose the image that you would like to add to your stream list. Click on the file, then click Open, and the file will be added to your stream list. You are now free to add this item into your RAID Configuration list.
In order to add an item from the stream list into the RAID Configuration simply double-click on the stream list item and it will be added into the RAID Configuration list of items as depicted in Figure C.

Figure C
Next, in order to start the parity test click on the menu item “Diagnostics”. Doing so will reveal the menu item “Raid Five Parity Check”. Click on that menu item and the diagnostic will begin. This function will check the RAID five on a stripe by stripe basis and validate the parity using XOR mathematics.
In the lower left hand corner of the software is a small status/information window that offers real time data on the parity scan. This window contains five items which describe the state of the diagnostic.
Type: The configured RAID/River type
Ident: Identifier given to the RAID/River type
Block: The block currently being scanned by the software
Time: Time remaining until the scan has completed
Errors: The total number of blocks in which a parity error has been found
Two of the five items are most pertinent for this particular function: the “Errors” item and the “Block” item. If the “Errors” count reaches ten to fifteen percent of the blocks in the array, then the array stripe is probably corrupt and you may have a stale drive in the array. For all practical purposes, however, there should be no more than three or four errors for the entire array. A healthy array will have no errors, and if even one appears it could mean either that the hardware is starting to fail or, worse, that the firmware and/or its accompanying memory may be buggy. Either scenario could spell disaster for your array and should be looked at immediately. View Figure D as an example.
Figure D
Finally, if you wish to interrupt the diagnostic just click on the “Configuration” menu item, and then the “Interrupt Processing” item and all processing will stop.
That’s it! Of course you must always bear in mind that even if the RAID does not pass the parity test there may still be data to recover. Alternatively if it does pass, this does not necessarily mean that the RAID is good for a rebuild. There will be other functions added to the software that will help you better determine if a rebuild is advisable.
Dick Correa
RAID Five Steps to recovering your data
February 22, 2008
In one of my articles I tried to define the mathematics of a RAID 5 stripe and how it relates to data recovery. Using the eXclusive ORing truth table we can continue to run the array even when one drive has dropped out of the array. This RAID state is known as degraded and must be considered by the IT professional as a temporary state. Once in a degraded state the prudent technician should try to do the following:
1. Take every user off of the server. Although the RAID is designed to run in a degraded state, it is not a run time solution. Ignore management, ignore the user, and log everyone off.
2. Make a complete and full backup.
3. Check your complete and full backup. Many a time I have heard a tech tell me that he did a full and complete backup, only to find out that some obscure accounting piece of software had some hidden flat file buried 27 folders deep that had the entire company’s payroll for the last 36 years and was not in his “complete backup”.
4. Pull every drive from the array and make a complete sector by sector image of each drive. Take those images and guard them with your life. If something goes amiss when you are trying to bring the array back online, you will have a clean starting point. This method is called the ‘hindsight is definitely 20/20’ school of thought and has saved my derriere on many occasions.
5. Check every cable, every slot, every dust laden chip to make sure that something hasn’t ‘broken’ loose.
6. Put the working drives back in the enclosure and replace the bad drive. Bring the array back online. Go into the RAID BIOS and make sure that any rebuild is pointing to the right drive. Although there may be meta data that tells the RAID card who is what, where, and how, double check anyway.
7. Rebuild the array. If you get a stall, a hang, or a reboot then stop everything. Execute step 5 again, and try the rebuild just one more time. If it fails again, then do a surface check of all the drives in the array, including the new drive. The fact that a drive is new does not necessarily mean that it will work out of the box. Many a time I have pulled a new drive out only to have it fail the ‘smoke test’. A surface check will hopefully expose any flaws on the media during the read tests.
If you have reached this point and still do not have a defined solution then you must weigh time constraints, user complaints, and management breathing down your neck as to whether to spring for a new server and reload, or to continue beating your head against the wall of an older server, using older software, running on an older operating system. Data is almost always exportable in a simple comma delimited format and can then be imported into almost any application. Maybe now is the time to upgrade and you can use this incident as leverage to pry money from management for a new server.
No matter what you decide, if you have followed the above steps, your data will be relatively safe. It is the seasoned IT professional that can think out of the box and bring his company back online with a minimum of aggravation.
SNAP RAID Recovery Part II Drive Set Definition
February 19, 2008
One of the many attributes of a RAID 5 that make it popular is that if a drive goes down in the array the RAID will remain functional. In such a case the following events should occur. An alarm should sound. An alarm that would wake the dead. An alarm that would make raking your fingernails across a chalkboard sound pleasurable by comparison. An alarm that by all known standards would be considered inhumane in most modern cultures. This alarm will sound incessantly, unwavering in its pursuit to be heard until the technician hits it with a keyboard, kicks the server plug out, or kills a chicken and offers a sacrifice to the alarm gods. In other words, you can’t miss this alarm, and if you do, see the reference to ‘wake the dead’.
Secondly, an email will be sent, advising you that the four years worth of data that you thought was being backed up, but you discovered two days ago wasn’t, is now in peril of being lost into the never never land of lost bits and socks that inexplicably disappear from the dryer. Yes, your job, your home, your marriage, all will be lost unless you heed the email warning and immediately shut down the server, to the chagrin of 427 end users who are reading about American Idol. HAH! Welcome to the party pal! (Quote circa 1988: Bruce Willis: Die Hard)
Keep in mind that although there are many RAID cards, as well as on-board RAID interfaces that perform these functions, your particular RAID firmware, as well as its current configuration may not.
The reason that a RAID 5 can have a drive go down and still run is the mathematics of XORing (eXclusive ORing) the data. This method for keeping the data relatively safe in a RAID 5 is called parity. It is a manipulation of bits in each byte of data. For the unwashed a byte is eight bits.
In order to understand XORing it is imperative that you understand the XOR truth table. Figure 1 is the truth table for eXclusive ORing.
Figure 1
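For readers without the figure to hand, the truth table can be reproduced in a few lines of Python. This is only an illustrative sketch (the function name is made up for this example); `^` is Python's bitwise XOR operator.

```python
def xor_truth_table():
    """Return the XOR truth table as (a, b, a XOR b) rows."""
    return [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]


for a, b, result in xor_truth_table():
    print(f"{a} XOR {b} = {result}")
# 0 XOR 0 = 0
# 0 XOR 1 = 1
# 1 XOR 0 = 1
# 1 XOR 1 = 0
```

The result is 1 only when the two inputs differ, which is the property the parity calculation depends on.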
Figure 2 is an example of XORing and how it relates to a four drive RAID 5 and the parity.
Figure 2
The data is arranged thusly:
‘R’ is the ASCII letter
52h is the ASCII hexadecimal value of ‘R’
0101 0010 is the ASCII binary representation of the letter.
As you can see, each letter is set up the same way.
For illustration purposes the following can be assumed. Each line is considered a single byte of a stripe as conveyed by Figure 3. If we take the ‘R’, and the ‘F’, and the ‘T’, and XOR them together, we get the value in D4, where each bit, of each byte is individually XORed across the stripe.
Figure 3
Using Figure 4 as a base, and the truth tables we can see the following:
| Bit | D1 | | D2 | | D1 XOR D2 | | D3 | | D4 (parity) |
|---|---|---|---|---|---|---|---|---|---|
| L1: | 0 | XOR | 0 | = | 0 | XOR | 0 | = | 0 |
| L2: | 1 | XOR | 1 | = | 0 | XOR | 1 | = | 1 |
| L3: | 0 | XOR | 0 | = | 0 | XOR | 0 | = | 0 |
| L4: | 1 | XOR | 0 | = | 1 | XOR | 1 | = | 0 |
| L5: | 0 | XOR | 0 | = | 0 | XOR | 0 | = | 0 |
| L6: | 0 | XOR | 1 | = | 1 | XOR | 1 | = | 0 |
| L7: | 1 | XOR | 1 | = | 0 | XOR | 0 | = | 0 |
| L8: | 0 | XOR | 0 | = | 0 | XOR | 0 | = | 0 |
Figure 4
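The same arithmetic can be checked in code. The sketch below XORs the byte values of ‘R’, ‘F’, and ‘T’ to get the parity byte that would sit on D4; the function name `parity_byte` is simply a label chosen for this example.

```python
def parity_byte(*data_bytes):
    """XOR the given byte values together to produce the parity byte."""
    parity = 0
    for value in data_bytes:
        parity ^= value
    return parity


d1, d2, d3 = ord("R"), ord("F"), ord("T")  # 0x52, 0x46, 0x54
d4 = parity_byte(d1, d2, d3)
print(f"{d4:08b} ({d4:#04x})")  # 01000000 (0x40)
```

Reading the D4 column of the table from L1 to L8 gives the same 0100 0000.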
Now, let’s say we lose D2 (drive two) in the array. The following is how the RAID card firmware handles it.
| Bit | D1 | | D3 | | D1 XOR D3 | | D4 (parity) | | D2 (rebuilt) |
|---|---|---|---|---|---|---|---|---|---|
| L1: | 0 | XOR | 0 | = | 0 | XOR | 0 | = | 0 |
| L2: | 1 | XOR | 1 | = | 0 | XOR | 1 | = | 1 |
| L3: | 0 | XOR | 0 | = | 0 | XOR | 0 | = | 0 |
| L4: | 1 | XOR | 1 | = | 0 | XOR | 0 | = | 0 |
| L5: | 0 | XOR | 0 | = | 0 | XOR | 0 | = | 0 |
| L6: | 0 | XOR | 1 | = | 1 | XOR | 0 | = | 1 |
| L7: | 1 | XOR | 0 | = | 1 | XOR | 0 | = | 1 |
| L8: | 0 | XOR | 0 | = | 0 | XOR | 0 | = | 0 |
Figure 5
We have built drive 2 on the fly. We do not need to know the data, since we can use the XOR truth table and the remaining three drives’ data to calculate the value of drive two. In the above example the process was illustrated for one byte across one stripe on a four drive array. All of these calculations are done in an instant on a stripe by stripe basis. The full stripe is recalculated for every write and, if a drive is out of the array, for every read of the down drive.
With all these calculations you would think it would slow down the processing. To a degree, it does. However, bus I/O is far slower than any XOR math a CPU may have to perform. A way to emphasize this point is to imagine you are standing on a bridge. Below you is a river. Each byte of data is a boat that passes under the bridge. The boat travels from the hard drive, down the river, to memory, and to the CPU. As one boat passes, you wait for the next boat. The next boat will not pass for one hundred years. The CPU is in a perpetual wait state. It is always waiting for data to process. So, if you want to speed up your PC, buy high speed I/O smart boards that can RAID, on a high speed bus.
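The on-the-fly rebuild can be sketched the same way. This is an illustration of the XOR principle only, not actual controller firmware; `rebuild_missing` is a hypothetical helper that takes the blocks from the surviving drives and returns the block the missing drive would have held.

```python
def rebuild_missing(surviving_blocks):
    """Rebuild the missing member's block by XORing all surviving blocks."""
    length = len(surviving_blocks[0])
    acc = 0
    for block in surviving_blocks:
        acc ^= int.from_bytes(block, "big")
    return acc.to_bytes(length, "big")


d1, d3, d4 = b"R", b"T", b"\x40"  # D2 is the drive that dropped out
print(rebuild_missing([d1, d3, d4]))  # b'F'
```

The same function works for a whole 64K block as well as for a single byte, since XOR is applied bit by bit.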
To be continued…
SNAP Server Data Recovery 3 Spanned RAID 5 Arrays
February 8, 2008
Recently, it was my task to take sixteen drives, spanned across three RAID fives, and recover a set of hundreds of AVI files. These files were used for research and although not time sensitive, were critical to the conclusions of the research.
We have been asked to do many similar jobs where the archive of a set of data has been compromised. Many lawyers have databases of all of their scanned briefs as well as all documentation pertaining to a particular case. If that information is lost and the case reopened for appeal, it could be devastating to not be able to review the documentation in a timely manner. I mention this because this task took me over a month to complete and, although interesting, was very tedious.
What made this recovery interesting was that the drives were in two physical devices. The first device was a four drive SNAP array that was used as the head. The other device was a twelve drive SNAP server that was broken up into two RAID fives. The challenge for this recovery was that no one knew which drives were in which array, no one knew the drive order of any array, the configuration given to me by the SNAP server was in error, no one knew the stripe size of the array, and finally, the data recovery company who had the array before me, marked the drives out of order. In other words, I was handed 16 drives and told to figure out a triple spanned RAID five.
So here are the steps I took to solve this data recovery problem for my client.
Step one, I had to find out which drives went with each other. I would have hoped that each RAID was equal in size. In other words, I hoped the RAIDs would all be four drives for the head in one array, and eight drives each for the other two RAIDS, but this was not the case. In order to find which drives went with which array I had to know several things.
First, I had to know the SNAP layout for arrays. Each drive in a SNAP array is basically broken up into two parts, the operating system, and the data area. In order to find the size of each you must look at the master boot record (MBR) of each of the drives. The MBR houses the partition table which is a listing of the active partitions.
SNAP partitions are divided into three basic areas, an operating system partition, a swap partition and a data partition. SNAP Appliance designed their device so that if one of the drives went down the firmware would roll to the next drive to load the operating system, network interface, and RAID handler. The important piece of information is what the standard offset to the data area is. The data area of each drive is used for the RAID 5. I have found the data area sector offset for the Guardian OS series to be LBA sector 2216970. This information may change from version to version, but all the Guardian operating systems I have worked with have been the same.
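As a hedged sketch of the MBR inspection described above, the code below parses the four primary partition entries from a 512-byte sector using the standard MBR layout (partition table at byte 446, 16-byte entries, 0x55 0xAA signature at byte 510). Whether every SNAP drive follows that layout exactly is an assumption, and `parse_mbr` and `GUARDIAN_DATA_LBA` are names invented for this example.

```python
import struct

GUARDIAN_DATA_LBA = 2216970  # data-area offset quoted in this article


def parse_mbr(sector):
    """Return the non-empty primary partition entries from a 512-byte MBR."""
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    partitions = []
    for index in range(4):
        entry = sector[446 + index * 16 : 446 + (index + 1) * 16]
        # status, CHS start (skipped), type, CHS end (skipped), LBA start, count
        status, ptype, lba_start, sectors = struct.unpack("<B3xB3xII", entry)
        if ptype != 0:  # type 0 marks an unused slot
            partitions.append({
                "index": index,
                "active": status == 0x80,
                "type": ptype,
                "lba_start": lba_start,
                "sectors": sectors,
            })
    return partitions
```

Comparing each entry's `lba_start` against `GUARDIAN_DATA_LBA` is one way to confirm which partition is the data area on a given drive image.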
Now that we know the data area offset we can take the next step, which is to determine which drive sets comprise the three RAID sets.
To be continued…
RAID Data Recovery Overview
December 7, 2007
We have been getting a lot of calls about RAID data recovery lately. As more and more computer manufacturers utilize RAID systems in home computers, RAID failures have risen sharply. It used to be that the RAID data recovery calls we received came from large companies that were running massive multi drive arrays. It stands to reason that a large company can afford the costs associated with RAID data recovery.
RAID Data Recovery Costs
To most consumers, the costs for RAID data recovery will seem rather high. The majority of RAID systems that we are seeing are RAID 0, in which 2 drives are “striped”, or combined, to create 1 volume. This increases performance, but is also dangerous. The fact is that if a RAID 0 fails, both drives must be repaired to recover the data. That is why RAID data recovery is expensive to a consumer.
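To see why both drives are needed, consider how a striped volume maps its data. The sketch below is illustrative only: the 64K stripe size and the simple round-robin layout are assumptions, and `raid0_locate` is a made-up name. It converts a byte offset in the volume into a drive number and an offset on that drive.

```python
def raid0_locate(logical_offset, drives=2, stripe_size=64 * 1024):
    """Map a logical byte offset to (drive_index, offset_on_drive) in RAID 0."""
    stripe_number, within = divmod(logical_offset, stripe_size)
    row, drive = divmod(stripe_number, drives)
    return drive, row * stripe_size + within


print(raid0_locate(0))           # (0, 0)
print(raid0_locate(64 * 1024))   # (1, 0)
print(raid0_locate(128 * 1024))  # (0, 65536)
```

Any file larger than one stripe alternates between the drives, so losing either drive leaves holes throughout the volume.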
RAID Data Recovery Variables
September 12, 2007
The first rule of RAID data recovery is “do no harm”. In fact that is DTI’s rule on all types of hard drive recovery, from laptop disk repair to multi drive arrays. Work should never be performed directly on the media that stored the data. Any type of action on the original hard drives can cause more damage. Before any type of hard drive recovery takes place on a RAID array, DTI takes steps to ensure that every precaution is put into place. We will not under any circumstances make the situation worse. When it comes to RAID data recovery we take the importance of your data very seriously.
RAID Data Recovery Variables And Hard Drive System Integrity
The first order of business when a RAID array arrives in our labs for data recovery is to pull every possible bit of data from each hard drive to a clone disk or an image file. In order to accomplish the migration of every binary sector from all of the hard drive elements in the array, clean room hard drive recovery procedures may be necessary. DTI operates a class 100 clean room that is biometrically secured. We take the security of your data very seriously.
After any defective media has been recovered and imaged the data recovery of the RAID array can begin.
Once in the RAID data recovery lab, our engineers will determine the “on disk” structure of the stripe and parity if the array is in a RAID 5 configuration. In many cases a partial rebuild has taken place or hot spares have gone offline on multichannel storage bays adding more complex calculations which may require custom code to be written for the extraction of the data to be successful. DTI has on-site programmers that can create programs on demand for any type of new situation that arises. Over the last decade we have developed custom software and proprietary techniques to recover data and validate “on disk file structure” components to verify data integrity and expedite our RAID data recovery process.
When drives within the array are physically damaged DTI has the capabilities to perform hard drive recovery and prepare the data for the engineers to recover the files. Call Toll Free: 1-866-438-6932 if you have any questions.
RAID Hard Drive Data Recovery
September 10, 2007
A RAID system comprises 2 or more hard disks that are combined to provide the storage capacity of the drives across 1 volume. The exception to this is RAID 1, which is a mirror. In other words, the second drive is a duplicate of the first. If one of the drives in a RAID 1 fails then the other will retain the data. In most cases of RAID data recovery the problem is logical as opposed to physical.
RAID Data Recovery
The process of RAID data recovery involves several steps that depend on the type of RAID and the type of failure. In this series of articles we are going to break down the different types of RAID data recovery scenarios and offer insight into why RAIDs fail, as well as recommendations on which type of RAID is best for your situation.
As stated before, the most common causes of RAID data recovery involve logical problems. These often happen when RAID hard drives go off line temporarily. Most times the hard drive lights are green, but when a drive goes off line its light turns amber. This happens frequently in SATA RAID systems. SCSI back-planes also have a high occurrence of amber drives. The problem is that if a drive goes off line the RAID is operating in a degraded state.
When a RAID is running in a degraded state any further failure is fatal. The worst case scenario is that an engineer will see an amber drive and force a re-build. On a RAID 5 this can damage the parity and increase the need for RAID data recovery.
RAID Data Recovery Services
June 4, 2007
DTI Data Recovery is one of the few companies that actually performs RAID recovery in their own labs. RAID data recovery is the most advanced type of data restoration there is. Not only do the hard drives need to be repaired, but the underlying file system or parity needs to be re-written or restored.
RAID Hard Drive Data Recovery
A RAID system is very complex at every level. There are many things that can and often DO go wrong. The most common problem involves hardware, and specifically the hard drives that make up the RAID. While there are many types of RAID systems, the most common is a RAID 5, which requires at least 3 hard disk drives to function. It will keep working as long as no more than 1 of its drives has failed. If 2 hard disk drives fail then there is no alternative but to restore from backup or look for a company like DTI that does RAID Data Recovery.
If you have read any of the previous articles in this blog you know that hard drives fail and that they fail often. A system that requires at least 3 hard drives is that much more likely to have a hard disk problem. Most companies that employ RAID systems use the RAID itself as backup as well as storage. So how do you back up your backup? The fact is even a highly redundant RAID system like a RAID 10, which is a stripe across mirrored pairs of drives, can fail.
DTI generally only needs the hard drives to perform RAID Data Recovery. If you are shopping around for RAID data recovery and a company is asking for the whole system, don’t send your data to them! At most the RAID card would be the only thing a real data recovery company should need and that only when there is a question about parity or block size or type. Some information can only be gotten from the card. A company like DTI Data Recovery has basically seen all the RAID cards that there are.
If there are physical problems with the RAID then hard drive repair will have to be done prior to the actual data recovery of the RAID system. Once the hard drives have been repaired enough to be copied sector by sector and the sector data has been transferred to similar or the same media then RAID Data Recovery can begin.
RAID Data Recovery With Software
There are a few data recovery programs that can work with damaged RAID systems. In some cases the RAID volume can be created virtually. Images are made of the hard drives and they are put back together again in a virtual environment. One of the problems with this type of recovery is that oftentimes the images don’t align properly and the headers need to be modified with a hex editor. This is serious business and is very difficult unless you have advanced programming knowledge.
DTI Data Recovery has the type of programmers and engineers that are required to put a failed RAID system back together. Once the hard drives have been safely cloned, the work truly begins. Only with the cutting edge technology that is proprietary at DTI Data can RAID arrays be successfully restored. Don’t risk your data, call DTI if you require RAID Data Recovery.
If you are unsure of what to do please call 866-438-6932 for a no cost no obligation RAID Data Recovery evaluation.
24 Hour Hard Drive Recovery & Server/RAID Support Hotline:
Toll Free 1-866-438-6932 or direct 1-727-345-9665.
Extended Software Support:
8 AM to 11 PM EST 7 days a week!
SNAP Data Recovery Through The OS Inode
May 21, 2007
This week is the final offering of our topic Recovering a single file from a SNAP Server Operating System. We have learned what a Super Block is, a cylinder group, and some of the important data elements in those data structures. We have learned how to find these data structures by using the data elements of other structures. Finally, we have learned that the file system is broken into blocks and that these blocks are the storage cornerstone of SNAP OS. Putting all of these facts together we come down to the final data structure the inode. At the bottom of this article there are links to my other posts so you can read or print them all in order.
Recovering a single file from a SNAP OS Part 3
The inode is the final link in the chain of data storage. It holds the map of the blocks where all of the data of each file and directory is stored. Let us dissect the inode and find its most important elements.
The SNAP OS Inode
Figure 1 is a raw hex representation of an inode. There are several data elements within the inode that define the date the file or directory was created, the last time it was updated, and the size of the file. For our purposes however, we are only concerned with one area of data elements and those are the direct and indirect block definitions.

Fig 1
The direct block definitions are defined in the shaded green area, and there are a maximum of twelve direct data blocks. The term direct means that each one of the four byte numbers in the shaded green area points to an actual data storage block. In other words, if we take the first value of 0x14A80E (1353742 decimal) and go view that data block, we will find the first values for our file 2003STEP.PDF. In Figure 2 we can see the first few bytes of data from block 0x14A80E.

Fig 2
There are only twelve direct data blocks, so if your file exceeds 96K the file system will use a method defined as indirect blocks. There are three data elements for these blocks:
- Indirect Block: Points to a block that holds a list of data blocks.
- Double Indirect Block: Points to a block that holds a list of Indirect blocks.
- Triple Indirect Block: Points to a block that holds a list of Double Indirect blocks.
From the above explanation you can see how deciphering a very large file can be extremely complicated. Once understood, this method works well and is very fast. It is also very easy to program using recursion and a set of flags to let the recursive function know what is being processed.
Figure 3 is a listing of the 2003STEP.PDF direct blocks from its only indirect block.
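The recursive approach might look something like the sketch below. It is a minimal illustration under several assumptions, not the code I use: `read_block` is a hypothetical callable that returns the raw bytes of a block, a pointer value of zero is assumed to mean an unused slot, and the byte order of the four byte pointers is left as a parameter because it must be confirmed against the raw inode. The `level` argument plays the role of the flag that tells the function what it is processing.

```python
def collect_data_blocks(pointer, level, read_block, byteorder="big"):
    """Recursively expand a block pointer into the data blocks it covers.

    level 0 = direct data block, 1 = indirect, 2 = double, 3 = triple.
    read_block is a hypothetical callable returning a block's raw bytes.
    """
    if pointer == 0:
        return []  # assumed to mark an unused slot
    if level == 0:
        return [pointer]
    raw = read_block(pointer)
    blocks = []
    for i in range(0, len(raw), 4):  # four byte pointers, as in the inode
        child = int.from_bytes(raw[i:i + 4], byteorder)
        blocks.extend(collect_data_blocks(child, level - 1, read_block, byteorder))
    return blocks


def file_blocks(direct, indirect, double, triple, read_block, byteorder="big"):
    """List a file's data blocks in order: direct first, then each indirect tier."""
    blocks = [p for p in direct if p != 0]
    for level, pointer in enumerate((indirect, double, triple), start=1):
        blocks.extend(collect_data_blocks(pointer, level, read_block, byteorder))
    return blocks
```

As a side note on the arithmetic, twelve direct blocks covering 96K implies a block size of 8K, so the byte offset of a data block within the data area would be its block number multiplied by 8192 under that assumption.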

Fig 3
Well, that’s it! By using the formulas and techniques I have outlined in my last three articles you can easily retrieve any file. I hope this helps those of you that have lost data due to hardware and or software failures on your SNAP Server.
If you have any questions, or if I can be of any help, please feel free to call me, or drop me an email.
(727)345-9665 Ext 203
dickc AT dtidata.com
SNAP Server Data Recovery of a Single File
Here are all the articles about SNAP Server Recovery of a single file:
- SNAP Data Recovery - the first post about the SNAP OS.
- SNAP Server Data Recovery of a Single File - A detailed post about recovering a lost file on the SNAP OS.
- SNAP Server Data Recovery Using The Super Block - the next article about SNAP file recovery.
Our main page for SNAP Server Data Recovery.







