Implementing fault tolerance in a 64-bit distributed operating system
Wilkinson, T. J. (1993). Implementing fault tolerance in a 64-bit distributed operating system. (Unpublished Doctoral thesis, City, University of London)
Abstract
This thesis explores the potential of 64-bit processors for providing a different style of distributed operating system. Rather than providing another reworking of the UNIX model, the use of the large address space for unifying volatile memory (virtual memory), persistent memory (file systems) and distributed network access is examined and a novel operating system, Arius, is proposed.
The concepts behind the design of ARIUS are briefly reviewed, and then the reliability of such a system is examined in detail. The unified nature of the architecture makes it possible to use a reliable single address space to provide a completely reliable system without the addition of other mechanisms. Protocols are proposed to provide locally scalable distributed shared memory and these are then augmented to handle machine failures transparently though the use of distributed checkpoints and rollback.
The checkpointing system makes use of the caching mechanism in DSM to provide data duplication for failure recovery. By using distributed memory for checkpoints, recovery from machine faults may be handled seamlessly. To cope with more “complete” failures, persistent storage is also included in the failure mechanism.
These protocols are modelled to show their operability and to determine the cost they incur in various types of parallel and serial programs. Results are presented to demonstrate these costs.
Publication Type: | Thesis (Doctoral) |
---|---|
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
Departments: | School of Science & Technology > Computer Science School of Science & Technology > School of Science & Technology Doctoral Theses Doctoral Theses |
Download (22MB) | Preview
Export
Downloads
Downloads per month over past year