Algorithms for Big-Data Management

CompSci 590.04, Fall 2015

Instructor: Ashwin Machanavajjhala
TA: Xi He
When: 3:05 PM - 4:20 PM Wednesdays and Fridays
Where Social Science 124

Office Hours:
    Ashwin Machanavajjhala: By appointment (send email to
    Xi He: Wednesdays 4:30 - 6 PM @ LSRC D309

Overview: Many real world applications are moving from being computationally-bound to being data-bound -- we are seeing a wide variety of staggeringly large datasets. There are billions of emails and search queries, and millions of tweets and photos posted everyday, in addition to our every action being tracked online (via cookies) and in the physical world (e.g., through video cameras). Analyzing this data, especially by combining multiple datasets from different sources, has revolutionized e-commerce, advertising and health industries, and has facilitated advances in science and public policy (e.g. Google Flu).

This course will provide an introduction to algorithm design on such large datasets. We will cover parallel algorithms (which partition computation across many machines), streaming algorithms (that never store the whole input in memory), and computing over noisy data (with applications to privacy and crowd-sourcing).

Lectures in this course will be based on one or more recent publications. There will be no exams. Students will be graded on (a) class participation (attendance, reading assigned papers, class discussion, paper presentation), (b) scribing notes for one or two lectures, and (c) the completion of a group project. Projects can focus on developing new theory/ algorithms, or on implementing/adapting known algorithms for new application domains.

Prerequisites: The course is open to interested graduate and undergraduate students with sufficient mathematical maturity. Students are expected to have completed Discrete Math (CS 230 or equivalent) and Algorithms (CS 330 or equivalent) or Databases (CS316). Knowledge of basic probability will be assumed.


  Tentative Syllabus:



Wed, 8/26
Reservoir Sampling (slides)
J. Vitter, “Random Sampling with a Reservoir”, ACM Transactions on Mathematical
Software, 1985
Fri, 8/28
Sampling from databases (slides) N. Dalvi, R. Kumar, A. Machanavajjhala, V. Rastogi, "Sampling hidden objects using nearest-neighbor oracles", KDD 2011.
Optional: Frank Olken "Random Sampling from Databases", PhD Thesis
Monte Carlo Estimation, DNF counting (slides) R. Karp, M. Luby, N. Madras, "Monte Carlo Estimation Algorithm for Enumeration Problems", Journal of Algorithms 10(3) 1989
Fri, 9/4
Markov Chain Monte Carlo (slides)

Assignment 1 posted

L. Lovasz "Random Walks on Graphs: A Survey", Combinatorics 1993
Wed, 9/9
Coupling and Mixing Times (slides) (notes)

Streaming Algorithms
Fri, 9/11
Filtering & Estimating the number of distinct objects (slides)

Assignment 1 DUE
Chapter 4: Section 4.3, 4.4,  Ullman & Rajaraman textbook
Wed, 9/16
Frequency Moments (notes)
Chapter 3: Section 4.5,  Ullman & Rajaraman textbook
Alon, Mathias, Szegedy, "The space complexity of approximating the frequency moments", STOC 96
Fri, 9/18
k-wise independence & Count min sketch
(Andrew McGregor's Slides)

Assignment 2 posted
Cormode, Muthukrishnan, "An improved data stream summary: The count min sketch and its applications"
Wed, 9/23
Count min sketch and Heavy Hitters

Fri, 9/25
Online Prediction and Decision Making (slides)

Assignment 2 DUE Monday 9/28

Parallel Architechtures

Wed, 9/30
Map Reduce basics (slides) Chapter 2: Sections 2.1, 2.2, 2.3, Ullman & Rajaram textbook
Fri, 10/2
Map Reduce (continued)

Locality Sensitive Hashing (slides)

Project Proposal is Due

Chapter 2: Sections 2.1, 2.2, 2.3, Ullman & Rajaram textbook

Chapter 3:   Ullman & Rajaraman textbook
Wed, 10/7
Graphs on Map Reduce (slides) Rastogi, Machanavajjhala, Chitnis, Das Sarma, "Finding Connected Components in Map Reduce in Logarithmic Rounds", ICDE 2013
Kang, Tsourakis, Faloutsos, "Pegasus: a Peta Scale Graph Mining System", ICDM 2009
Fri, 10/9
Graph Processing & Iterations (slides)

Malewicz et al, "Pregel: A system for large scale graph processing", SIGMOD 2010

Wed 10/14

Assignment 3 posted

Fri, 10/16
Asynchronous Processing

Low, Gonzales, Kyrola, Bickson, Guestrin, Hellerstein, "Distributed Graphlab", VLDB 2012
Wang, Xie, Demers, Gehrke, "Asynchronous Large Scale Graph Processing Made Easy", CIDR 2013
Wed, 10/21
Resilient Distributed Datasets (slides)
Zaharia et al "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing", NSDI 2012
Fri, 10/23
Feed Following  (slides)

Silberstein, Terrace, Cooper, Ramakrishnan, "Feeding Frency: Selectively Materializing User's Event Feeds", SIGMOD 2010
Silberstein, Machanavajjhala, Ramakrishnan, "Feed Following", DBSocial 2011

Wed 10/28
Basic Join Algorithms (slides)

Assignment 3 DUE

Fri 10/30
Worst Case Optimal Join Algorithms (slides)

Wed 11/4
Record Linkage and Fuzzy Joins

Mid-term report DUE
Chapter 3:   Ullman & Rajaraman textbook

Computing with Error

Fri, 11/6
Motivation: Differential privacy
"Calibrating Noise to Sensitivity in Private Date Analysis"
C. Dwork, F. McSherry, K. Nissim, A.Smith TCC 2006
"Differential Privacy" C. Dwork, ICALP 2006
Wed, 11/11
Answering sets of counting queries

"Boosting the Accuracy of Differentially Private Histograms
Through Consistency
", M. Hay, V. Rastogi, G. Miklau, D. Suciu, PVLDB 2010
"Optimizing Linear Counting Queries Under Differential Privacy"
C. Li, M. Hay, V. Rastogi, G. Miklau, A. McGregor PODS 2010
Fri, 11/13
Computing Histograms

"A Data- and Workload- aware Algorithm for Range Queries under Differential privacy", Li, Hay, Miklau, Wang, VLDB 2014
"A simple and practical algorithm for differentially private data release", F. McSherry, K. Ligett, M. Hardt, NIPS 2012
Wed 11/18
Sorting under Error "Computing with Noisy Information", Fiege, Raghavan, Peleg, Upfal, SIAM Journal of Computing 1994
Fri, 11/20
Project Presentations