What is Wrong With the Transmission? — A Comprehensive Study on Message Passing Related Bugs

Abstract

As distributed systems become prevalent, more and more applications need to transfer messages reliably across a network. However, passing messages in a convenient and dependable way is both difficult and error-prone, so existing messaging products usually suffer from numerous software bugs, which are particularly difficult to diagnose or avoid. To improve the methods for handling them, we need a better understanding of their characteristics. This paper provides the first (to the best of our knowledge) comprehensive characteristic study of message passing related bugs (MP-bugs). We have carefully examined the patterns, manifestations, fixes, and other characteristics of 349 randomly selected real-world MP-bugs from 3 representative open-source applications (Open MPI, ZeroMQ, and ActiveMQ). Surprisingly, we found that nearly 60% of the non-latent MP-bugs can be categorised into two simple patterns, message level bugs and connection level bugs, which suggests promising prospects for tools that detect or tolerate MP-bugs. Apart from this finding, our study has also uncovered many new (and sometimes surprising) insights into the development process of message passing systems. The results should be useful for the design of corresponding bug detecting, exposing and tolerating tools.

Publication
44th International Conference on Parallel Processing