Surviving Internet Catastrophes

Authors: Flavio Junqueira, Ranjita Bhagwan, A.H., Keith Marzullo, and Geoffrey M. Voelker.

Abstract: In this paper, we propose a new approach, called informed replication, for designing distributed systems to survive Internet catastrophes, and demonstrate this approach with the design and evaluation of a cooperative backup system called the Phoenix Recovery Service. Informed replication uses a model of correlated failures to exploit software diversity. The key observation that makes our approach both feasible and practical is that Internet catastrophes result from shared vulnerabilities. By replicating a system service on hosts that do not have the same vulnerabilities, an Internet pathogen that exploits a vulnerability is unlikely to cause all replicas to fail. To characterize software diversity in an Internet setting, we measure the software diversity of host operating systems and network services in a large organization. We then use insights from our measurement study to develop and evaluate heuristics for computing replica sets that have a number of attractive features. Our heuristics provide excellent reliability guarantees, result in a low degree of replication, limit the storage burden on each host in the system, and lend themselves to a fully distributed implementation. We then present the design and prototype implementation of Phoenix, and evaluate it on the PlanetLab testbed.
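
Illustration: the idea of placing replicas on hosts that do not share vulnerabilities can be sketched in a few lines of Python. This is a simplified greedy selection written for this page, not one of the heuristics evaluated in the paper. It assumes each host is described by a set of attribute strings (operating system, network services) and that a pathogen exploits a single attribute; the function name `pick_replicas`, the host names, and the attribute labels are all hypothetical.

```python
from typing import Dict, FrozenSet, List

Host = str
Attrs = FrozenSet[str]

# Hypothetical inventory: each host maps to the software attributes it runs.
HOSTS: Dict[Host, Attrs] = {
    "a": frozenset({"os:windows", "svc:http"}),
    "b": frozenset({"os:linux", "svc:http"}),
    "c": frozenset({"os:windows", "svc:ssh"}),
    "d": frozenset({"os:linux", "svc:ssh"}),
}


def pick_replicas(owner: Host, hosts: Dict[Host, Attrs]) -> List[Host]:
    """Greedily choose replica hosts for `owner`.

    Stops once no attribute of `owner` is shared by every chosen replica,
    so a pathogen exploiting any single attribute of `owner` leaves at
    least one replica unaffected (under the one-attribute assumption).
    """
    # Owner attributes that every replica chosen so far also has.
    shared = set(hosts[owner])
    candidates = {h: a for h, a in hosts.items() if h != owner}
    replicas: List[Host] = []

    while shared:
        # Prefer the candidate that lacks the most still-shared attributes;
        # sorting makes tie-breaking deterministic.
        best = max(
            sorted(candidates),
            key=lambda h: len(shared - candidates[h]),
            default=None,
        )
        if best is None or not (shared - candidates[best]):
            raise ValueError(f"not enough diversity to protect host {owner!r}")
        replicas.append(best)
        shared &= candidates.pop(best)

    return replicas


if __name__ == "__main__":
    print(pick_replicas("a", HOSTS))
```

In this toy inventory, host "d" shares neither attribute with host "a", so a single replica suffices. Without "d", the sketch falls back to two replicas ("b" and "c") that each differ from "a" in one attribute. It ignores the per-host storage limits that the abstract also mentions.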

Ref: Proceedings of the USENIX Annual Technical Conference, Anaheim, CA, USA, April 2005.

Available as Compressed PostScript, PostScript, and PDF.

More info: Flavio Junqueira has more information on the Phoenix project.