From ASCII to Answers: Advanced Topics in Data Processing


3-0-9 (Grad H)


Samuel Madden
Amol Deshpande
Adam Marcus
Eugene Wu


TR1-2:30, room 4-231 (First Meeting Thursday, 9/5)

This class will survey techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Topics will include data cleaning, data integration, scalable systems (relational databases, NoSQL, Hadoop, etc.), analytics (data cubes, scalable statistics and machine learning), and scalable visualization of large data sets. The goal of the class is for students to gain hands-on working experience alongside in-depth discussion of the topics covered. Students should have a background in database or distributed systems (6.814/6.830 or 6.824 or permission of instructor). There will be a semester-long project and paper, and hands-on labs involving real systems. There will be no exams, and grading will be based largely on class participation and completion of assignments.

Grading will be based on class participation, successful completion of labs, and a final project, as follows:

Final Project: 33%

Details about the final project are available here.

For homeworks, you are allowed 5 penalty-free late days to use throughout the semester. One late day equals one 24-hour period after the due date of the assignment. Once you have used your late days, there will be a 20% penalty for each day an assignment is late. You do not need to explicitly declare the use of late days; we will assign them to you in a way that is optimal for your grade when different assignments are worth different numbers of points. Late days may not be used for the final project.

Useful links:
Hand in assignments on Stellar.
Discuss the class on Piazza.
Email the staff.
Check out the course github repository (including lecture notes and slides!).

Class Schedule

(For each lecture below: readings are to be read before class; assignments and notes follow.)

L1: Class Intro; Data Modeling and Representation Overview

Schemas / structure are valuable (even outside the relational world); case studies and class overview

Notes: github
Assigned: Lab 1

L2: Alternative Data Models

Understand different data models, and what they are good for.  Semi-structured storage, arrays, graphs, etc.

What goes around comes around
Section 3 of Unified twitter logging (you may also want to skim the intro to get the motivation)

Notes: github
Due: Lab 1
Assigned: Lab 2

L3: Data Cleaning and Integration #1

Text -> Relations/Structured Data

Play with Data Wrangler. Read the PDF.

Notes: github
Lab 3 (to be posted on Tuesday) will involve sed/awk and Wrangling a few datasets
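As a flavor of the text-to-relations problem, here is a minimal sketch (in Python rather than sed/awk; the sentence pattern and fields are invented for illustration) that pulls structured tuples out of semi-structured text with a regular expression:

```python
import re

# Hypothetical input: free-text sentences with a recurring pattern.
LINES = [
    "Alice Smith was born in 1984 in Boston.",
    "Bob Jones was born in 1990 in Chicago.",
]

PATTERN = re.compile(
    r"^(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) was born in "
    r"(?P<year>\d{4}) in (?P<city>[A-Za-z ]+)\.$"
)

def extract(lines):
    """Turn semi-structured text into relational tuples; skip non-matching lines."""
    rows = []
    for line in lines:
        m = PATTERN.match(line)
        if m:
            rows.append((m.group("name"), int(m.group("year")), m.group("city")))
    return rows

print(extract(LINES))
```

Real wrangling tools generalize this idea: instead of hand-written patterns, they infer transformations from examples.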

L4: Data Cleaning and Integration #2 (Stonebraker)

Overview:  A ?= A’

Data Curation at Scale: The Data Tamer System

Optionally, you may want to read: Finding Related Tables and Crossing the Structure Chasm.

Due: Lab 2
Out: Lab 3

L5: Data Cleaning and Integration #3 (Adam Marcus)

Combining Info From Multiple Sources (Truthfulness, Veracity, etc.)

Read Truth Finding on the Deep Web

L6: Data Cleaning and Integration #4

Data Integration Techniques -- building a simple integration system

Data Integration Tutorial Slides, Getoor and Machanavajjhala, VLDB 2012.

Due: Lab 3
Due: Final Project Teams
Out: Lab 4

L7: MapReduce

Split up larger computations into map and reduce.  Learn to connect multiple MapReduce phases.

Read: MapReduce: A Flexible Data Processing Tool by Dean and Ghemawat, and Possible Hadoop Trajectories by Stonebraker and Kepner.

If you are not familiar with MapReduce, you should read the original MapReduce paper. (We will not cover this paper in detail, so you do not need to read it again if you understand the basic MapReduce model.)

Lab 5 involves implementing some simple tasks with MapReduce.
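To make the model concrete, here is a minimal single-machine sketch of MapReduce (not the Hadoop API; all names are illustrative) that runs a word count and then chains a second phase over the first phase's output:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal local MapReduce: the 'shuffle' groups map output by key."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Phase 1: word count.
def wc_map(line):
    for word in line.split():
        yield word, 1

def wc_reduce(key, values):
    return sum(values)

counts = run_mapreduce(["a b a", "b c"], wc_map, wc_reduce)

# Phase 2: chain a second job over phase 1's output to count
# how many distinct words occur n times.
def inv_map(item):
    word, n = item
    yield n, 1

freq_of_freq = run_mapreduce(counts.items(), inv_map, wc_reduce)
print(counts, freq_of_freq)
```

The chaining in phase 2 mirrors how real pipelines connect multiple MapReduce jobs: the output key-value pairs of one job become the input records of the next.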

L8: Column-oriented Databases

Perform queries like sums, counts, and averages over huge datasets in seconds using column-oriented DBs.

The Vertica Analytic Database: C-Store 7 Years Later

Due: Lab 4
Out: Lab 5
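The intuition behind that speedup can be sketched in a few lines: a column store lays each attribute out contiguously, so an aggregate scans only the columns it needs. This toy example (invented data) illustrates the layout, not a real column-store engine:

```python
# A tiny table, first in row-oriented form.
rows = [("us", 10.0), ("eu", 20.0), ("us", 5.0)]

# Column-oriented layout: one array per attribute.
region = [r[0] for r in rows]
price = [r[1] for r in rows]

# SUM(price) touches only the price column; a row store would have to
# read every full row. This is why column stores excel at aggregates.
total = sum(price)
us_total = sum(p for reg, p in zip(region, price) if reg == "us")
print(total, us_total)
```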

L9: Data Cubes + Materialized views

Frequently, your dataset is so large that asking a question of it in real-time isn’t possible.  Cubes and materialized views allow you to precompute the answers to questions so that you have them when you need them later.

Read the data cube paper; pay attention to the algebraic vs holistic aggregate distinction.

Also read Section 1 and 2.1 of the DBToaster paper for a review of maintaining views. (You may read the rest of the paper but it is quite technical.)
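As a toy illustration of precomputation (not the paper's algorithms; the fact table is invented), materializing a group-by turns a later aggregate query into a dictionary lookup:

```python
from collections import defaultdict

# Base "fact table": (region, product, sales).
facts = [("us", "pen", 3), ("us", "ink", 2), ("eu", "pen", 4), ("us", "pen", 1)]

def materialize(group_keys):
    """Precompute SUM(sales) grouped by the given attribute positions."""
    view = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in group_keys)
        view[key] += row[2]
    return dict(view)

by_region = materialize([0])             # one cuboid of the cube
by_region_product = materialize([0, 1])  # a finer-grained cuboid
print(by_region[("us",)])                # answered without rescanning facts
```

A data cube is essentially the lattice of all such group-bys; view maintenance (as in DBToaster) is about updating these precomputed answers incrementally as facts change.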

L10: Approximate query processing

Sometimes, you don’t actually need a perfect answer, and statistics can make it easy to get very close.  We’ll cover techniques like sampling, sketching, and generating histograms.  

Bloom filters, Reservoir sampling, Count-min sketches

Due: Lab 5
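Of the three techniques, reservoir sampling is the easiest to sketch: it maintains a uniform random sample of k items from a stream of unknown length, in one pass. A minimal version (seeded here for reproducibility):

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Keep a uniform sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # keep item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1000), 5))
```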

L11: Systems for machine learning on big data (Tyson Condie)

What are the challenges of running ML algorithms on big data, what frameworks have been proposed, and how are they better than existing systems (e.g., relational DBs + UDFs/UDAs)? In particular, iterative algorithms require several passes over the data. Tools like MapReduce are not as useful for this because neither their programming model nor their execution model is designed around iteration.

Read the Scalops paper. You may also want to read the Spark paper.

Out: Lab 6


L12: Future of Hadoop-Like Platforms (Carlo Curino)

No reading was assigned, but for those interested, you may want to read the YARN paper.

Due: Lab 6

L13: Graph processing

Lots of data is graph-structured, be it the world wide web or social networks. Graph processing systems help you run algorithms over these graphs.

Read the Pregel paper and check out the Giraph Docs.

Out: Lab 7
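The vertex-centric model of Pregel can be sketched on one machine: in each superstep, every vertex reads its incoming messages, updates its state, and sends messages to its neighbors; computation halts when no messages remain. This toy version (not Pregel's API) computes connected components by propagating minimum vertex labels:

```python
def connected_components(edges, vertices):
    """Label each vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    label = {v: v for v in vertices}
    # Superstep 0: every vertex announces its label to its neighbors.
    messages = {v: [label[u] for u in neighbors[v]] for v in vertices}

    while any(messages.values()):
        outbox = {v: [] for v in vertices}
        for v, inbox in messages.items():
            if inbox:
                best = min(inbox)
                if best < label[v]:          # state changed: notify neighbors
                    label[v] = best
                    for n in neighbors[v]:
                        outbox[n].append(best)
        messages = outbox
    return label

print(connected_components([(1, 2), (2, 3), (5, 6)], [1, 2, 3, 5, 6]))
```

Pregel distributes exactly this loop: vertices are partitioned across machines, messages are batched between supersteps, and vertices that receive no messages stay inactive.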

L14: Student Data Case Studies (Eugene Wu and Students)

L15: Scalable Algos (Tim Kraska)

Database techniques may be able to further speed up machine learning: by embedding algorithms within the DBMS, by declaratively specifying the machine learning algorithms and parameters and letting the DBMS optimize them, or by implementing the algorithms on MapReduce systems.

Read the MLBase Paper

Compare and contrast running in-database ML (MADlib) vs. out-of-database ML.

L16: Stream processing techniques; Naiad

Sometimes your datasets are both large and not static (think Twitter, or the stock market): as data streams in, you have to compute information on it without knowing when the stream will end.

Naiad Paper

Due: Lab 7
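A tiny example of the streaming mindset: maintain an aggregate incrementally, in constant space, without ever materializing the whole stream. This sketch (illustrative only, far simpler than Naiad's model) keeps a running count and mean:

```python
class RunningStats:
    """Maintain count and mean over an unbounded stream in O(1) space."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        self.count += 1
        # Incremental mean: fold each new value in without storing the stream.
        self.mean += (x - self.mean) / self.count

stats = RunningStats()
for x in [2.0, 4.0, 6.0]:
    stats.update(x)
print(stats.count, stats.mean)
```

Stream processing systems generalize this pattern to windowed, keyed, and (in Naiad's case) iterative incremental computations.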

L17: Visualizations

Visualizations map summary statistics of your data into the visual domain.  We will have a crash course on typical summary statistics, and a comparison of declarative vs. painting-based models of visualization.

Polaris Paper
D3 Paper

Assigned: Lab 8

L18: Scalable Visualizations #1

We’ve discussed scaling algorithms for faster data cleaning and processing.  How do we scale visualizations to encode more data (dimensions, hierarchies, # tuples) in the same space?

Generalized Plot Matrix

L19: Amanda Cox -- NYTimes guest lecturer

Due: Lab 8

L20: Analytics in a dangerous (adversarial) world (Stephen Tu)

More and more sensitive data is being collected and processed by analysts. How do we protect user data from privacy breaches? We'll study several adversarial models and talk about a few solutions explored by the research community.

Read Part A and Part B of the lecture notes.

L21: Final Project Work In Class

For the next three weeks, class will be used to work on final projects. Course staff will be on hand to meet and discuss with groups. Brief progress reports are due on each Tuesday. Progress reports should consist of two bulleted lists: "Progress so Far" and "Goals For Next Week"; they don't need to be super long or detailed and should be submitted via Stellar.

Due: Progress Report 1 (Stellar Link)

L22: Final Project Work In Class

L23: Final Project Work In Class

Due: Progress Report 2 (Stellar Link)


L24: Final Project Work In Class

Due: Progress Report 3 (Stellar Link)

L25: Final Project Work In Class

L26: Poster Session