Thursday, February 23, 2006

Dating tips for single points of failure

A typical office PC that mounts a drive over the network is not very reliable. A temporary failure of either the network or the file server makes the computer unusable. Failure of the local disk will at best require reinstalling the OS and all the applications, and at worst can lead to irrecoverable loss of data. The overall system is less fault-tolerant than either the hard drive or the network on its own, because it depends on both. Is it possible to improve the robustness of today's primary tools?
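To put rough numbers on that claim, here is a minimal sketch. It assumes the parts fail independently and that the PC needs all of them at once; the availability figures are made up for illustration, not measurements.

```python
def series_availability(*parts: float) -> float:
    """Availability of a system that needs every part working at the same time.

    Assumes the parts fail independently of each other.
    """
    result = 1.0
    for availability in parts:
        result *= availability
    return result


# Hypothetical figures, chosen only for illustration.
network = 0.99
file_server = 0.99
local_disk = 0.995

combined = series_availability(network, file_server, local_disk)
print(f"combined availability: {combined:.4f}")
```

Because every factor is below 1, the product is always lower than the weakest individual part.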

Making individual parts more reliable yields slow, incremental progress; the breakthroughs come from putting the parts together in smarter ways. For example, RAID storage introduces a component that writes information redundantly to multiple disks. The system is prepared for the failure of any single disk and can continue functioning without downtime.
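The idea of redundant writes can be sketched in a few lines. This is a toy model of mirroring only, not how any real RAID controller is implemented; the class name `MirroredStore` and its methods are hypothetical, and each "disk" is just a dictionary.

```python
class MirroredStore:
    """Toy mirror: every write goes to all healthy disks (hypothetical model)."""

    def __init__(self, copies: int = 2):
        # Each dict stands in for one physical disk.
        self.disks = [dict() for _ in range(copies)]
        self.failed = set()

    def write(self, block: int, data: bytes) -> None:
        # Write the same data to every disk that is still healthy.
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[block] = data

    def fail(self, index: int) -> None:
        # Simulate losing a disk together with its contents.
        self.failed.add(index)
        self.disks[index].clear()

    def read(self, block: int) -> bytes:
        # Serve the read from the first healthy disk that holds the block.
        for index, disk in enumerate(self.disks):
            if index not in self.failed and block in disk:
                return disk[block]
        raise OSError("no healthy copy of block %d" % block)


store = MirroredStore(copies=2)
store.write(0, b"payroll")
store.fail(0)                      # one disk dies
survivor = store.read(0)           # the data is still readable
```

The reader of the data never learns that a disk failed; only the loss of every copy surfaces as an error.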

A breakdown shouldn't be a catastrophe. The set of data that a computer user accesses is largely predictable; it is easy to copy it to a local disk and keep it up to date. If the network goes down, that information can still be read and edited locally. When the network connection is reestablished, the data can be synchronized automatically. The Coda research project at CMU developed a distributed filesystem that works this way.
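Here is a minimal sketch of that read-locally, sync-later behaviour. It is not Coda's implementation or API: the class `CachedFiles`, its method names, and the plain dictionary standing in for the file server are all assumptions made for illustration, and the sync step simply pushes local edits without checking for conflicts.

```python
class CachedFiles:
    """Toy client that keeps working from a local copy while offline."""

    def __init__(self, server: dict):
        self.server = server   # a dict standing in for the file server
        self.local = {}        # local disk copy
        self.dirty = set()     # names edited while disconnected
        self.online = True

    def prefetch(self, names) -> None:
        # Copy the files the user is expected to need onto the local disk.
        for name in names:
            self.local[name] = self.server[name]

    def read(self, name: str) -> str:
        if self.online:
            # Refresh the local copy whenever the server is reachable.
            self.local[name] = self.server[name]
        return self.local[name]

    def write(self, name: str, data: str) -> None:
        self.local[name] = data
        if self.online:
            self.server[name] = data
        else:
            self.dirty.add(name)   # remember to push this later

    def reconnect(self) -> None:
        # Push everything edited offline back to the server.
        self.online = True
        for name in sorted(self.dirty):
            self.server[name] = self.local[name]
        self.dirty.clear()


server = {"report.txt": "draft 1"}
client = CachedFiles(server)
client.prefetch(["report.txt"])

client.online = False                   # the network goes down
offline_copy = client.read("report.txt")
client.write("report.txt", "draft 2")   # edit locally
client.reconnect()                      # the network comes back
```

After `reconnect()` the server holds "draft 2", and the user never had to stop working.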

With disconnected operation, two users may modify the same information, creating a conflict. Conflicts can be resolved either automatically by the system or by a component that is aware of the structure of the data. Ideally, the code that reconciles versions should be the responsibility of application developers, who in the worst case can ask the user to merge the versions.
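One way to split that work, sketched under assumptions: the system handles the cases where only one side changed, and hands true conflicts to an application-supplied function. The names `reconcile` and `merge_entry_sets` are hypothetical, and the set-based merger assumes the data is an unordered collection of entries, such as an address book.

```python
def reconcile(base, mine, theirs, merge=None):
    """Three-way reconciliation against the last common version `base`."""
    if mine == theirs:
        return mine            # both sides agree, nothing to do
    if mine == base:
        return theirs          # only the other side changed
    if theirs == base:
        return mine            # only this side changed
    if merge is not None:
        return merge(base, mine, theirs)   # application knows the data
    raise ValueError("conflict: ask the user to merge the versions")


def merge_entry_sets(base: set, mine: set, theirs: set) -> set:
    """Hypothetical application merger for data that is a set of entries."""
    added = (mine - base) | (theirs - base)
    removed = (base - mine) | (base - theirs)
    return (base | added) - removed


base = {"alice", "bob"}
mine = {"alice", "bob", "carol"}   # I added carol
theirs = {"alice"}                 # they removed bob
merged = reconcile(base, mine, theirs, merge=merge_entry_sets)
```

Without a merger, the same call raises an error, which is the point at which an application would fall back to asking the user.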

Investment in a more reliable network or a more reliable disk drive doesn't offer nearly as good a return as an investment in fault-tolerant software. Software can utilize local storage to work around network downtime, and use the network to avoid data loss in case of a drive failure.

