A wide variety of failures cause physical damage to storage media. Compact discs (CDs) can have their metallic reflective layer or dye layer scratched. Hard disks can suffer several types of mechanical failures, such as head crashes and failed motors. Tapes can simply break. Physical damage always causes some data loss, and in many cases, the file system’s logical structures sustain damage as well. This results in logical damage that must be dealt with before any files can be salvaged from the failed media.
End users can’t repair most physical damage. They generally lack the hardware and technical expertise that physical repairs require, and their attempts often make the damage worse. As a rule, don’t attempt to repair physical media yourself. You can try a number of techniques to recover data from damaged media, but only organizations with specialized equipment and facilities, such as clean rooms, should attempt repair or enhanced data recovery.
Recovering data from a hard drive should start with the assumption that, unless the case is visibly damaged, the drive itself is still operable. Today’s hard disks are built to be rugged enough to protect against damage. Thus, when presented with a “failed hard drive,” use the following techniques to evaluate the drive and retrieve needed data:
Remove the drive from the system in which it is installed and connect it to a test system, that is, a compatible system known to be functional. Don’t mount the drive in the test system; connect only the data and power cables.
Boot the test system from its own internal drive. Listen to the failed drive to determine whether the internal disks are spinning. If the disks are spinning, it generally means the disk has not experienced a catastrophic failure. Therefore, you can likely recover the data.
Determine whether the failed drive is recognized and can be installed as an additional disk on the test system. If the drive installs, copy all directories and files to a hard drive on the test system. If a drive fails on one system but installs on another, the drive itself may be usable. The apparent failure may have been caused by a power supply failure, corruption of the operating system, malicious software, or some other problem on the original system. If you can operate the drive, run a virus check on the recovered data and test for directory and file integrity.
If the hard drive is not spinning or the test system does not recognize it, attempt limited repair to get the drive to start and be recognized by the test system. If the repair succeeds, use specialized software to create a bit-for-bit image of the failed drive on a recovery drive, and then use the raw image to reconstruct usable data. Open source tools such as dcfldd (an enhanced version of the dd utility) can recover all data except for data in physically damaged sectors.
If necessary, send the device to data recovery specialists, who may be able to apply extraordinary recovery techniques.
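The imaging step above can be illustrated with a minimal sketch. It assumes the goal is a block-by-block copy that keeps going when a block cannot be read, writing zeros in its place and recording the offset. This is similar in spirit to running dd with `conv=noerror,sync`, but it is not how dcfldd is implemented. The names `FlakyDisk` and `image_device` are hypothetical, and `FlakyDisk` only simulates a drive with bad sectors so the example can run without real hardware.

```python
import io

class FlakyDisk(io.BytesIO):
    """Hypothetical stand-in for a failing drive.

    Any read that starts inside a "bad" sector raises OSError,
    which is how an unreadable sector typically surfaces to a program.
    """
    def __init__(self, data, bad_sectors, sector_size=512):
        super().__init__(data)
        self.bad_sectors = set(bad_sectors)
        self.sector_size = sector_size

    def read(self, size=-1):
        if self.tell() // self.sector_size in self.bad_sectors:
            raise OSError("simulated unreadable sector")
        return super().read(size)

def image_device(src, dst, block_size=512):
    """Copy src to dst one block at a time.

    Unreadable blocks are replaced with zeros so that every later byte
    keeps its original offset in the image. Returns the list of byte
    offsets that could not be read.
    """
    bad_offsets = []
    offset = 0
    while True:
        try:
            src.seek(offset)
            chunk = src.read(block_size)
        except OSError:
            chunk = b"\x00" * block_size   # placeholder for the lost block
            bad_offsets.append(offset)
        if not chunk:                      # end of the source
            break
        dst.write(chunk)
        offset += len(chunk)
    return bad_offsets

# Usage: a three-sector "drive" whose middle sector is unreadable.
data = bytes(range(256)) * 6
disk = FlakyDisk(data, bad_sectors={1})
image = io.BytesIO()
print(image_device(disk, image))
```

Working from an image rather than the failing drive means each sector is read from the damaged hardware only once, and later reconstruction attempts cannot make the original worse.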
If the data has already been deemed “lost,” a failed attempt at local repair cannot increase the loss. In that case, you can try the following:
Remove the printed circuit board and replace it with a matching circuit board from a known healthy drive.
Replace the read/write head assembly with a matching assembly from a known healthy drive.
Remove the hard disk platters from the original drive and install them into a known healthy drive.
Logical damage to a file system is more common than physical damage. Logical damage may prevent the host operating system from mounting or using the file system. Power outages can cause logical damage by preventing the file system from completely writing its structures from memory to the storage medium. Even turning off a machine while it is booting or shutting down can lead to logical damage. Errors in hardware controllers (especially redundant array of inexpensive disks, or RAID, controllers), faulty drivers, and system crashes can have the same effect.
Logical damage can cause a variety of problems, such as system crashes or actual data loss. It can result in intermittent failures. It can also trigger other strange behavior, such as infinitely recursing directories and drives reporting negative free space remaining. Some programs can correct the inconsistencies that result from logical damage. Most operating systems provide a basic repair tool for their native file systems: Microsoft Windows has chkdsk, Linux comes with the fsck utility, and Mac OS X provides Disk Utility. Third-party tools are also available for examining file systems with logical errors, such as the open source Sleuth Kit (http://www.sleuthkit.org). Third-party products may be able to recover data even when the operating system’s repair utility doesn’t recognize the disk. TestDisk (http://www.cgsecurity.org/wiki/TestDisk) is one example; it can recover lost partitions and reconstruct corrupted partition tables.
Journaling file systems, such as NTFS 5.0 and ext3, help to reduce the incidence of logical damage. In the event of system failure, you can roll these file systems back to a consistent or stable state. The information most likely to be lost is whatever was still in the drive’s cache at the time of the system failure.
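To make the idea of returning to a consistent state concrete, here is a toy sketch of journal recovery. It is an illustration under simplifying assumptions, not the on-disk format or recovery procedure of NTFS or ext3. The record layout (`("write", txid, block, data)` and `("commit", txid)`) and the function name `recover` are hypothetical. The key point is that writes from transactions that never reached their commit record are ignored, so a half-finished update cannot leave the structures in a mixed state.

```python
def recover(disk, journal):
    """Bring `disk` to a consistent state after a crash.

    disk:    dict mapping block number -> contents
    journal: list of records, in the order they were logged:
             ("write", txid, block, data) or ("commit", txid)
    Only writes that belong to a committed transaction are applied.
    """
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _, _txid, block, data = rec
            disk[block] = data
    return disk

# Usage: transaction 1 committed; the crash hit before transaction 2 did.
disk = {0: "old-a", 1: "old-b"}
journal = [
    ("write", 1, 0, "new-a"),
    ("commit", 1),
    ("write", 2, 1, "new-b"),   # no commit record follows
]
print(recover(disk, journal))
```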
Using a consistency checker should be a routine part of system maintenance. A consistency checker protects against file system software bugs and storage hardware design incompatibilities. For example, a disk controller may report that file system structures have been saved to disk, but the data is actually still in the write cache. If the computer loses power while this data is in the cache, the file system may be left in an inconsistent or unstable state. To avoid this problem, use hardware that does not report the data as written until it actually is written. Another solution is to use disk controllers with battery backups. When the power is restored after an outage, the pending data is written to disk. For greater protection, use a system battery backup to provide power long enough to shut down the system safely.
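From the software side, an application can at least ask the operating system to push its own cached data to the device before treating a write as complete. The sketch below uses Python's standard `os.fsync` call for this; the function name `durable_write` is hypothetical. Note the limit that the paragraph above describes: if the controller or drive reports success while the data is still in its own write cache, this call cannot detect that.

```python
import os

def durable_write(path, data):
    """Write `data` to `path` and request that it reach the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # move data from Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush its cache to the device
```

A caller would use it as `durable_write("journal.bin", b"...")`, where the file name is only an example.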
Two techniques are common for recovering data after logical damage: consistency checking and zero-knowledge analysis. Use these techniques to either repair or work around most logical damage. However, applying data recovery software doesn’t guarantee that no data loss will occur. For example, when two files both claim the same allocation unit (a condition often called cross-linking), one of the files is almost certain to lose data.
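The case of two files claiming the same allocation unit is easy to detect once each file's list of units is known. The following sketch assumes that list has already been extracted into a plain dictionary; the function name `find_cross_links` and the input layout are hypothetical and not tied to any particular file system.

```python
from collections import defaultdict

def find_cross_links(allocations):
    """Find allocation units claimed by more than one file.

    allocations: dict mapping file name -> list of allocation unit numbers
    Returns a dict mapping each contested unit -> sorted list of claimants.
    """
    owners = defaultdict(set)
    for name, units in allocations.items():
        for unit in units:
            owners[unit].add(name)
    return {unit: sorted(names)
            for unit, names in owners.items()
            if len(names) > 1}

# Usage: a.txt and b.txt both claim unit 4.
table = {"a.txt": [2, 3, 4], "b.txt": [4, 5], "c.txt": [6]}
print(find_cross_links(table))
```

Detection is the easy part. Deciding which file actually owns the contested unit usually cannot be done from the allocation data alone, which is why one of the files tends to lose data.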
Consistency checking involves scanning a disk’s logical structure and ensuring that it is consistent with its specification. For instance, in most file systems, a directory must have at least two entries: a dot (.) entry that points to itself and a dot-dot (..) entry that points to its parent. A file system repair program reads each directory to ensure that these entries exist and point to the correct directories. If they do not, the program displays an error message, and you can correct the problem. Both chkdsk and fsck work in this fashion. However, consistency checking has two major problems:
A consistency check can fail if the file system is severely damaged. In this case, the repair program may crash, or it may conclude that the drive has an invalid file system.
The chkdsk utility might automatically delete data files if they appear out of place or cannot be accounted for. The utility does this to ensure that the operating system can run properly. However, the deleted files may be important and irreplaceable user files.
The same type of problem occurs with system restore disks that restore the operating system by removing the previous installation. Avoid this problem by installing the operating system on a separate partition from the user data.
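The dot and dot-dot check described earlier can be sketched on a small in-memory model. This is an illustration only and does not reflect how chkdsk or fsck are implemented. The model is an assumption: `dirs` maps a directory's number to its entries, each entry maps a name to a target number, and any target that is not itself a key in `dirs` is treated as a regular file. The function name `check_directories` is hypothetical.

```python
def check_directories(dirs, root):
    """Report directories whose "." or ".." entry is missing or wrong.

    dirs: dict mapping directory number -> {entry name: target number}
    root: number of the root directory (its parent is itself)
    Returns a list of (directory number, offending entry name) tuples.
    """
    # Work out each directory's real parent from the entries that link to it.
    expected_parent = {root: root}
    for number, entries in dirs.items():
        for name, target in entries.items():
            if name not in (".", "..") and target in dirs:
                expected_parent[target] = number

    problems = []
    for number in sorted(dirs):
        entries = dirs[number]
        if entries.get(".") != number:
            problems.append((number, "."))
        if number in expected_parent and entries.get("..") != expected_parent[number]:
            problems.append((number, ".."))
    return problems

# Usage: directory 12 lives in directory 11, but its ".." points at the root.
dirs = {
    2:  {".": 2,  "..": 2,  "home": 11},
    11: {".": 11, "..": 2,  "user": 12, "notes.txt": 20},
    12: {".": 12, "..": 2},
}
print(check_directories(dirs, root=2))
```

The sketch only reports problems. Whether to rewrite an entry, and with what value, is the riskier decision, as the two problems listed above suggest.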
Zero-knowledge analysis is the second technique for file system repair. With zero-knowledge analysis, few assumptions are made about the state of the file system. The file system is rebuilt from scratch using knowledge of an undamaged file system structure. In this process, scan the drive of the affected computer, noting all file system structures and possible file boundaries. Then match the results to the specifications of a working file system.
Zero-knowledge analysis is usually much slower than consistency checking. You can use it, however, to recover data even when the logical structures are almost completely destroyed. This technique generally does not repair the damaged file system, but it allows you to extract the data to another storage device.
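One simplified way to find possible file boundaries without trusting any file system structure is to scan the raw image for known file signatures. The sketch below looks for JPEG data, which begins with the bytes FF D8 FF and ends with FF D9. It covers only this one part of the process and rests on strong assumptions: each file is stored contiguously, and the first end marker after a start marker closes that file, which is not always true (for example, when a JPEG embeds a thumbnail). The function name `carve_jpegs` is hypothetical.

```python
JPEG_START = b"\xff\xd8\xff"   # start-of-image marker plus the next marker byte
JPEG_END = b"\xff\xd9"         # end-of-image marker

def carve_jpegs(image):
    """Return (start, end) byte ranges in a raw image that look like JPEG files.

    `end` is exclusive, so image[start:end] is the candidate file.
    """
    found = []
    pos = 0
    while True:
        start = image.find(JPEG_START, pos)
        if start == -1:
            break
        end = image.find(JPEG_END, start + len(JPEG_START))
        if end == -1:
            break                      # start marker with no matching end
        end += len(JPEG_END)
        found.append((start, end))
        pos = end                      # continue scanning after this file
    return found

# Usage: two tiny fake "JPEGs" surrounded by filler bytes.
raw = (b"\x00" * 10
       + b"\xff\xd8\xff\xe0" + b"A" * 5 + b"\xff\xd9"
       + b"\x00" * 4
       + b"\xff\xd8\xff\xe1" + b"BB" + b"\xff\xd9")
for start, end in carve_jpegs(raw):
    print(start, end)
```

Each candidate range would then be written to a separate storage device and checked by opening it, which matches the point that this approach extracts data rather than repairing the damaged file system.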