CMSC 23310/33310 Advanced Distributed Systems
Spring 2012
Lecturer: Borja Sotomayor
E-mail: borja AT cs DOT uchicago DOT edu
Office: Searle 209-A
Office hours: Open door policy (see Course Syllabus)
Discussion session: Tuesdays 10:30-11:50 in Cobb 304
Lecture: Thursdays 10:30-11:50 in Ryerson 276
Quick links
Course Description
In recent years, large distributed systems have taken a prominent role not just in scientific inquiry, but also in our daily lives. When we perform a search on Google, stream content from Netflix, place an order on Amazon, or catch up on the latest comings-and-goings on Facebook, our seemingly minute requests are processed by complex systems that sometimes include hundreds of thousands of computers, connected by both local and wide area networks.
Recent papers in the field of Distributed Systems have described several solutions (such as MapReduce, BigTable, Dynamo, Cassandra, etc.) for managing large-scale data and computation. However, building and using these systems poses a number of more fundamental challenges: How do we keep the system operating correctly even when individual machines fail? How do we ensure that all the machines have a consistent view of the system's state? (and how do we ensure this in the presence of failures?) How can we determine the order of events in a system where we can't assume a single global clock?
Many of these fundamental problems were identified and solved over the course of several decades, starting in the 1970's. To better appreciate the challenges of recent developments in the field of Distributed Systems, this course will guide students through seminal work in Distributed Systems from the 70's, 80's, and 90's, leading up to a discussion of recent work in the field.
Course Organization
This course is divided into two components:
- Reading and Discussion of Primary Sources: We will cover seminal work in the field, as well as recent work, by going directly to the original papers (listed below). One or two papers will be assigned each week, to be discussed on Tuesdays (on Thursdays, we will present background material that will aid in understanding the paper/s assigned for the following Tuesday). Students will have to submit brief response papers each week, as well as a final paper.
- Distributed Systems Programming Project: Throughout the quarter, students will have to implement a distributed system. Unlike projects that students may have done in previous 200-level CS systems courses, this project is not designed to directly complement all the material covered in the lectures. Instead, the project will be open-ended: students must propose a project themselves (subject to instructor approval), choose the languages and frameworks to develop the project, and meet certain reporting milestones during the quarter.
The final grade will be divided as follows: 20% response papers, 20% final paper, 20% participation in discussions, 40% project (the grade for the project is divided into several milestones; see the syllabus). There will be no midterms or final exam.
Students interested in taking this course should take into account that CMSC 23300 (Networks and Distributed Systems) is a prerequisite for this course. Students can petition to have this requirement waived, as long as they have taken at least one other 200-level CS systems course.
Papers
This is the tentative list of papers we will be covering in this course. We will be discussing some of these in detail, while others will be assigned as optional reading. See the Course Syllabus for the week-by-week reading schedule..
Getting Started
- Leslie Lamport. Solved Problems, Unsolved Problems and Non-problems in Concurrency. SIGOPS Oper. Syst. Rev., 19(4):34–44, October 1985
- Edsger W. Dijkstra. Self-stabilizing systems in spite of distributed control. Commun. ACM, 17(11):643–644, November 1974
Fault Tolerance
- M. Pease, R. Shostak, and L. Lamport. Reaching Agreement in the Presence of Faults. J.ACM, 27(2):228–234, April 1980
- Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine Generals Problem. ACM Trans. Program. Lang. Syst., 4(3):382–401, July 1982
- Miguel Castro and Barbara Liskov. Practical Byzantine Fault Tolerance and Proactive Recovery. ACM Trans. Comput. Syst., 20(4):398–461, November 2002
Distributed Consensus
- Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374–382, April 1985
- Leslie Lamport. The Part-Time Parliament. ACM Trans. Comput. Syst., 16(2):133–169, May 1998
- Leslie Lamport. Paxos Made Simple. ACM SIGACT News, 32(4):18–25, December 2001
Distributed Time
- Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun.ACM, 21(7):558–565, July 1978
- C. J. Fidge.Timestamps in Message-Passing Systems that Preserve the Partial Ordering. Proceedings of the 11th Australian Computer Science Conference, 10(1):5666, 1988
- Friedemann Mattern. Virtual Time and Global States of Distributed Systems. In Parallel and Distributed Algorithms, pages 215–226. North-Holland, 1989
Other Topics and Surveys
- N. Lynch. A Hundred Impossibility Proofs for Distributed Computing. In Proceedings of the eighth annual ACM Symposium on Principles of distributed computing, PODC ’89, pages 1–28, New York, NY, USA, 1989. ACM
- Seth Gilbert and Nancy Lynch. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. SIGACT News, 33(2):51–59, June 2002
Recent Work
- Mike Burrows. The Chubby Lock Service for Loosely-Coupled Distributed Systems. In Proceedings of the 7th symposium on Operating systems design and implementation, OSDI ’06, pages 335–350, Berkeley, CA, USA, 2006. USENIX Association
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7, OSDI ’06, pages 15–15, Berkeley, CA, USA, 2006. USENIX Association
- Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s Highly Available Key-Value Store. In Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles, SOSP ’07, pages 205–220, New York, NY, USA, 2007. ACM
- Tushar D. Chandra, Robert Griesemer, and Joshua Redstone. Paxos Made Live: An Engineering Perspective. In Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, PODC ’07, pages 398–407, New York, NY, USA, 2007. ACM
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51(1):107–113, January 2008
- Avinash Lakshman and Prashant Malik. Cassandra: A Decentralized Structured Storage System. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010