Abstract

Online detectors are important for achieving coordination and synchronization in large-scale distributed applications that run in peer-to-peer systems, the Grid, PlanetLab, and large-scale, enterprise-like server farms. Detectors can be used to monitor the up/down status of hosts, malicious behavior among processes, and the availability behavior of hosts, and to estimate the number of hosts in a distributed system. We discuss a variety of online detectors that exist for these different problems, with an emphasis on practical solutions that satisfy two characteristics: they have been implemented and validated in experimental evaluation or practice, and they are based on novel ideas and strong theory. The goal of this article is to enable practitioners to understand these protocols so that they can be easily implemented or adapted for various distributed systems. The article also aims to give the starting researcher an overview of, and a good feel for, the area of detectors, to enable further learning in this interesting field.

Keywords: distributed systems; detection; crash-stop; Byzantine; availability; system size estimation; scalability; fault-tolerance