6.830 Final Project Assignment and Ideas (Fall 2015)
- Sep 30: Project Team Due
- Oct 13: Project Proposals Due
- Dec 9: Project Presentations
- Dec 10: Final Reports Due
A large portion (20%) of your grade in 6.830 consists of a final project.
This project is meant to be a substantial independent research or
engineering effort related to material we have studied in class. Your
project may involve a comparison of systems we have read about, an
application of database techniques to a system you are familiar with,
or be a database-related project in your research area.
This document describes what is expected of a final project and
proposes some possible project ideas.
What is Expected
Good class projects can vary dramatically in complexity, scope, and topic.
The only requirement is that they be related to something we have studied
in this class and that they contain some element of research -- e.g., that
you do more than simply engineer a piece of software that someone else
has described or architected. To help you determine if your idea is of
reasonable scope, we will arrange to meet with each group several times
throughout the semester.
What to Hand In
There are two written deliverables: a project proposal, due on October 6th,
and a final report, due on December 10th.
Project Proposal: The proposal should consist of 1-2 pages
describing the problem you plan to solve, outlining how you plan to
solve it, and describing what you will "deliver" for the final
project. We will arrange short meetings with every group before the
project proposal to help you refine your topic and would be happy to
provide feedback on a draft of your proposal before it is due.
Final Report: You should prepare a conference-style report on
your project with maximum length of 15 pages (10 pt font or larger,
one or two columns, 1 inch margins, single or double spaced -- more is not
better.) Your report should introduce and motivate the problem your
project addresses, describe related work in the area, discuss the
elements of your solution, and present results that measure the
behavior, performance, or functionality of your system (with
comparisons to other related systems as appropriate.)
Because this report is the primary deliverable upon which you will
be graded, do not treat it as an afterthought. Plan to leave
at least a week to do the writing, and make sure your proofread and
Please submit a paper copy of your report in class on Wednesday, 12/10.
You will also be expected to give a presentation on your
project in class that will provide an opportunity for you to present
a short demo of your work and show what you have done to other students
in the class. Details about the format of the presentation
will be posted as the date gets closer.
The following is a list of possible project ideas from past years; we will update over the next few weeks.
You are not
required to choose from this list -- in fact, we encourage you
to try to solve a problem of your own choosing! If you are interested
in working on one of these projects, the staff
can put you in touch with students and others around MIT working on these
ideas. Note that these are not meant to be complete project proposals, but
just suggestions for areas to explore -- you will need to flesh them out
into complete projects by talking with your group members, the course staff,
and graduate students working on these projects.
- D3.js is a visualization system for rendering data. It uses a number of database-like constructs internally, some of which are quite primitive. For example, they use nested loop joins and don’t reorder selections. The goal would be to build a simple cost-based query optimizer/rewriter for D3. Contact: Holger Pirk (firstname.lastname@example.org)
- We have access to a large (many terabyte) collection of EKG (brain signal) data from hundreds of patients over many years. We have both the raw data and
the "cooked" version — which has about 100 attributes obtained by an expert reading the
EKG. Experts are assigned randomly from Children’s Hospital doctors.
Some of these attributes are suspect. For example, our contact reports that the percentage of EKGs that a
doctor indicates as having "heart disease" varies between 25 and 45% between doctors. Obviously, it
should be the same, given the way EKGs are randomly assigned.
The goal is to see which of these hundreds attributes actually carry information, when correlated against
real patient outcomes or other metrics, or to investiagte automatically inferring these
attributes from the raw signal data.
- Loading large datasets (> 10 GB) can be very slow, especially when there are many indexes on tables. One simple trick is to sort data before inserting it, or to drop indexes, insert data, and reload indexes. The goal of this project would be to investigate other techniques to improve load performance in an open source database such as Postgres or MySQL/Innodb.
One concrete idea might be "fast bulk load" a la Hbase for MySQL. HBase uses a MapReduce job to build its indexes. One possible project would be to build a prototype that does the same for a single (or multiple) MySQL Boxes.
- Hosted database services such as Amazon RDS, Microsoft SQL Azure
are starting to become popular. It is still unclear what the
performance impact is of running applications on a local
(non-hosted) platform, such as a local enterprise datacenter,
versus having the data hosted "in the cloud". This project would investigate
the performance of such systems on a particular workload or workloads(s),
e.g., OLAP, OLTP, Web.
- Flash storage promises lower latency for random operations than
conventional disks. However, it also has a number of unusual
limitations. An interesting project investigates the performance
impact of using flash memories for DB applications, or for specific DB
tasks such as indexing or recovery.
- Twitter provides a fire hose of data. Automatically filtering,
aggregating, analyzing such data can allow to harness the full vlue of
the data, extracting valuable information. The idea of this project
is to investigate stream processing technology to operate on social
streams, or extract structure from them and store them in a database system.
- Scalable Wrangler. Build a (Data Wrangler) like interface that
scales to larger data sets (millions of rows), or devise a sampling
strategy to automatically load the most relevant records (e.g.,
representative outliers) into wrangler. More generally think about
how you would improve the wrangling interface.
- Databases are "schema first", meaning you have to declare a schema, and ensure your data conforms
to it, before you can load data into the datbase. The goal of this project would be to build a database
system that allows you to immediately start manipulating your data, without declaring a schema, by, e.g.,
auto-generating column names, ignoring improperly formatted tuples in input data, etc. The database would
then allow you to clean up the data, renaming columsn and fixing up broken or missing records.
- Compare the performance of a Hadoop-based database like Hive to a relational system like MySQL or Postgres on some workload, and understand where the differences come from. Alternatively, look at implementing on of the advanced features of relational databases that we have discussed in Hive, or improving the optimizer in Hive in some way.
- More generally the performance of different SQL/NoSQL solutions on different workloads (e.g., MySQL vs Postgres vs Hive vs Pig vs Voldemort vs Mongo vs ...)
- Look at running some collection of statistical or machine learning or other data analysis operations inside of a database, and compare their performance them to running those algorithms by copying data out of the database and running in Matlab or some other programming language. See, for example Mad Lib.
- Build a mobile app to find nearby phones and search the for related data. For example, a user might request: "I want to see the current view of stata after a winter storm" and the system will try to look for photos taken within the last hour or so on phones that it knows about and then stitch them together to generate a 360 view. Challenges are, for example, designing metadata for photos taken on phones, the query language, and how to forward data (send everything to centralized server and then forward to users vs some kind of peer to peer approach).
- Concurrent insert-heavy tree: a lot of people have looked at using batching to improve the performance of insert-heavy workloads (e.g., LSM tress, COLAs, buffer trees). But these are still mostly sequential data structures. The goal of this project is to design a concurrent version of such a tree.
- DB deadlock detector: deadlocks within the DB are nasty to debug. The goal of this project is to build a tool which analyzes code to automatically find deadlocks, either statically or with runtime instrumentation? As a test for the tool, you might try it on the TPC-C implementation in oltpbench (oltpbench), which is *full* of deadlocks, and have it automatically find them.
- Query plan JIT compiler: now that in-memory DBs are becoming more popular, it might make sense to try generating native code for query plan evaluation to reduce the latency of intepreted plans. Some gains might be achieved with runtime type knowledge just as in standard compilers. It would be fun to take SimpleDB and generate (for instance) LLVM bytecode on the fly for query evaluation.
- Benchmark study of different file systems for database performance. This would involve an experimental study of different file systems (e.g., ext4, zfs, btrfs, etc.) where you measure how much the choice of file system makes on performance for non in-memory workloads. You could look at both disk and flash.
- Build an auto-cleaning database: A database that, based on user input, automatically "predicts" missing values in a column using a standard machine learning technique on the remaining columns. A more advanced version of this database can pick the most suited machine learning algorithm that best predicts missing values.
- A graph visualization system: Most graph visualization tools end up displaying a giant mess of links. Build an interactive visualization tool that automatically clusters or samples nodes and edges so that the user is not overwhelmed with the visualization, but the user can interact by zooming in/out in order to get a better sense for the data. Design scalable techniques such that these interactions are seamless.
- Data cleaning experimental study: Do an experimental study of the common data cleaning errors found in practice in log files found online, and develop algorithms for automatically wrangling log files to a more structured form.
- Graph processing language: Most graph processing systems end up focusing on simple algorithms (shortest paths, pagerank) that are not really used by real graph analysts. Study http://snap.stanford.edu/snap/description.html to identify common patterns of accesses made by different network algorithms, and design a wrapper over a relational or graph database that can supports a large fraction of these operations.
- Rapid Histogram Visualizations: Study the online aggregation paper, build a system to use the online aggregation techniques to produce rapid histogram visualizations of aggregates that improve over time. Then, study the impact of various sampling techniques to get to even better visualizations that are accurate but are computed on much less data.
- Auto-admin tools to recommend indices, etc. Design a tool that
recommends a set of indices to build given a particular workload and a
set of statistics in a database. Alternatively investigate the
question of which materialized views to create in a data-warehousing
system, such as C-Store.
- Extend SimpleDB. SimpleDB is very simple. There are a number of ways you might
extend it to explore some of the research ideas we have studied in
this class. For example, you could add support for optimistic
concurrency control and compare its performance to the basic
concurrency control scheme you will implement in Lab 4.
There are a number of other possible projects of this type;
we would be happy to discuss these in more detail.
- Rollback of long-running or committed transactions. Database systems typically only
support UNDO of committed transactions, but there are cases where it might be important
to rollback already committed transactions.
One approach is to use user-supplied compensating actions,
but there may be other models that are possible, or it may be possible to automatically derive such
compensating action for certain classes of transactions.