6.830 Final Project Assignment and Ideas (Spring 2013)
- Feb 25: Project Team Due
- Mar 13: Project Proposals Due
- May 15: Project Presentations
- May 16: Final Reports Due
A large portion (20%) of your grade in 6.830 consists of a final project.
This project is meant to be a substantial independent research or
engineering effort related to material we have studied in class. Your
project may involve a comparison of systems we have read about, an
application of database techniques to a system you are familiar with,
or be a database-related project in your research area.
This document describes what is expected of a final project and
proposes some possible project ideas.
What is Expected
Good class projects can vary dramatically in complexity, scope, and topic.
The only requirement is that they be related to something we have studied
in this class and that they contain some element of research -- e.g., that
you do more than simply engineer a piece of software that someone else
has described or architected. To help you determine if your idea is of
reasonable scope, we will arrange to meet with each group several times
throughout the semester.
What to Hand In
There are two written deliverables: a project proposal, due on March 13th,
and a final report, due on May 16th.
Project Proposal: The proposal should consist of 1-2 pages
describing the problem you plan to solve, outlining how you plan to
solve it, and describing what you will "deliver" for the final
project. We will arrange short meetings with every group before the
project proposal to help you refine your topic and would be happy to
provide feedback on a draft of your proposal before it is due.
Final Report: You should prepare a conference-style report on
your project with maximum length of 15 pages (10 pt font or larger,
one or two columns, 1 inch margins, single or double spaced -- more is not
better.) Your report should introduce and motivate the problem your
project addresses, describe related work in the area, discuss the
elements of your solution, and present results that measure the
behavior, performance, or functionality of your system (with
comparisons to other related systems as appropriate.)
Because this report is the primary deliverable upon which you will
be graded, do not treat it as an afterthought. Plan to leave
at least a week to do the writing, and make sure your proofread and
Please submit a paper copy of your report in class on Thursday, 5/16
You will also be expected to give a presentation on your
project in class that will provide an opportunity for you to present
a short demo of your work and show what you have done to other students
in the class. Details about the format of the presentation
will be posted as the date gets closer.
The following is a list of possible project ideas; you are not
required to choose from this list -- in fact, we encourage you
to try to solve a problem of your own choosing! If you are interested
in working on one of these projects, contact Professor Curino or Professor Stonebraker and we
can put you in touch with students and others around MIT working on these
ideas. Note that these are not meant to be complete project proposals, but
just suggestions for areas to explore -- you will need to flesh them out
into complete projects by talking with your group members, the course staff,
and graduate students working on these projects.
- Loading large datasets (> 10 GB) can be very slow, especially when there are many indexes on tables. One simple trick is to sort data before inserting it, or to drop indexes, insert data, and reload indexes. The goal of this project would be to investigate other techniques to improve load performance in an open source database such as Postgres or MySQL/Innodb.
One concrete idea might be "fast bulk load" a la Hbase for MySQL. HBase uses a MapReduce job to build its indexes. One possible project would be to build a prototype that does the same for a single (or multiple) MySQL Boxes.
- Hosted database services such as Amazon RDS, Microsoft SQL Azure
are starting to become popular. It is still unclear what the
performance impact is of running applications on a local
(non-hosted) platform, such as a local enterprise datacenter,
versus having the data hosted "in the cloud". This project would investigate
the performance of such systems on a particular workload or workloads(s),
e.g., OLAP, OLTP, Web.
- Flash storage promises lower
latency for random operations than conventional disks. However, it also has a number of unusual
limitations. An interesting project investigates the
performance impact of using flash memories for DB applications, or for specific DB tasks such as indexing
- Twitter provides a fire hose of data. Automatically filtering,
aggregating, analyzing such data can allow to harness the full vlue of
the data, extracting valuable information. The idea of this project
is to investigate stream processing technology to operate on social
streams, or extract structure from them and store them in a database system.
use to access a database; the idea is to avoid the painful way in
which current client-side application have to use the XMLHttpRequest
interface to access server-side objects asynchronously. This layer
should cache objects on the client side whenever possible, but be
backed by a shared, server-side database system.
- As a related project, HTML5 browsers
This project would involve investigating how to user such a database to improve client performance, offload work from the server, etc.
- Preventing denial-of-service attacks on database systems.
Databases are a vulnerable point in many web sites, because it is
often possible for attackers to make some simple request that causes
the website to issue queries asking the database to do a lot of work.
By issuing a large number of such requests, an attacker can
effectively issue a denial of service attack against the web site,
disabling the database. The goal of this project would be to develop
a set of techniques to counter
this problem -- for example, one approach might be to modify the database scheduler
so that it doesn't run the same expensive queries over and over.
- Auto-admin tools to recommend indices, etc. Design a tool that
recommends a set of indices to build given a particular workload and a
set of statistics in a database. Alternatively investigate the
question of which materialized views to create in a data-warehousing
system, such as C-Store.
- Asynchronous Database Access. Client software interacts with
standard SQL databases via a blocking interface like ODBC or JDBC; the
client sends SQL, waits for the database to process the query, and
receives an answer. A non-blocking interface would allow a single
client thread to issue many parallel queries from the same thread,
with potential for some impressive performance gains. This project
would investigate how this would work (do the queries have to be in
different transactions? what kind of modification would need to be
made to the database) and would look at the possible performance gains
in some typical database benchmarks or applications.
- Extend SimpleDB. SimpleDB is very simple. There are a number of ways you might
extend it to explore some of the research ideas we have studied in
this class. For example, you could add support for optimistic
concurrency control and compare its performance to the basic
concurrency control scheme you will implement in Problem Set 3.
There are a number of other possible projects of this type;
we would be happy to discuss these in more detail.
- CarTel. In the compensating actions,
but there may be other models that are possible, or it may be possible to automatically derive such
compensating action for certain classes of transactions.
- Databases are "schema first", meaning you have to declare a schema, and ensure your data conforms
to it, before you can load data into the datbase. The goal of this project would be to build a database
system that allows you to immediately start manipulating your data, without declaring a schema, by, e.g.,
auto-generating column names, ignoring improperly formatted tuples in input data, etc. The database would
then allow you to clean up the data, renaming columsn and fixing up broken or missing records.
- Compare the performance of a Hadoop-based database like Hive to a relational system like MySQL or Postgres on some workload, and understand where the differences come from. Alternatively, look at implementing on of the advanced features of relational databases that we have discussed in Hive, or improving the optimizer in Hive in some way.
- More generally the performance of different SQL/NoSQL solutions on different workloads (e.g., MySQL vs Postgres vs Hive vs Pig vs Voldemort vs Mongo vs ...)
- Look at running some collection of statistical or machine learning or other data analysis operations inside of a database, and compare their performance them to running those algorithms by copying data out of the database and running in Matlab or some other programming language. See, for example Mad Lib.
- Build a mobile app to find nearby phones and search the for related data. For example, a user might request: "I want to see the current view of stata after a winter storm" and the system will try to look for photos taken within the last hour or so on phones that it knows about and then stitch them together to generate a 360 view. Challenges are, for example, designing metadata for photos taken on phones, the query language, and how to forward data (send everything to centralized server and then forward to users vs some kind of peer to peer approach).
- Concurrent insert-heavy tree: a lot of people have looked at using batching to improve the performance of insert-heavy workloads (e.g., LSM tress, COLAs, buffer trees). But these are still mostly sequential data structures. The goal of this project is to design a concurrent version of such a tree?
- DB deadlock detector: deadlocks within the DB are nasty to debug. The goal of this project is to build a tool which analyzes code to automatically find deadlocks, either statically or with runtime instrumentation? As a test for the tool, you might try it on the TPC-C implementation in oltpbench (oltpbench), which is *full* of deadlocks, and have it automatically find them.
- Query plan JIT compiler: now that in-memory DBs are becoming more popular, it might make sense to try generating native code for query plan evaluation to reduce the latency of intepreted plans. Some gains might be achieved with runtime type knowledge just as in standard compilers. It would be fun to take SimpleDB and generate (for instance) LLVM bytecode on the fly for query evaluation.
- Benchmark study of different file systems for database performance. This woould involve an experimental study of different file systems (e.g., ext4, zfs, btrfs, etc.) where you measure how much the choice of file system makes on performance for non in-memory workloads. You could look at both disk and flash.