6.830 Final Project Assignment and Ideas (Fall 2009)

Important Dates:

A large portion (20%) of your grade in 6.830 consists of a final project. This project is meant to be a substantial independent research or engineering effort related to material we have studied in class. Your project may involve a comparison of systems we have read about, an application of database techniques to a system you are familiar with, or be a database-related project in your research area.

This document describes what is expected of a final project and proposes some possible project ideas.

What is Expected

Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class and that they contain some element of research -- e.g., that you do more than simply engineer a piece of software that someone else has described or architected. To help you determine if your idea is of reasonable scope, we will arrange to meet with each group several times throughout the semester.

What to Hand In

There are two written deliverables: a project proposal, due on October 15th, and a final report, due on December 10th.

Project Proposal: The proposal should consist of 1-2 pages describing the problem you plan to solve, outlining how you plan to solve it, and describing what you will "deliver" for the final project. We will arrange short meetings with every group before the project proposal to help you refine your topic and would be happy to provide feedback on a draft of your proposal before it is due.

Final Report: You should prepare a conference-style report on your project with maximum length of 15 pages (10 pt font or larger, one or two columns, 1 inch margins, single or double spaced -- more is not better.) Your report should introduce and motivate the problem your project addresses, describe related work in the area, discuss the elements of your solution, and present results that measure the behavior, performance, or functionality of your system (with comparisons to other related systems as appropriate.)

Because this report is the primary deliverable upon which you will be graded, do not treat it as an afterthought. Plan to leave at least a week to do the writing, and make sure your proofread and edit carefully!

Please submit a paper copy of your report in class on Thursday, 12/10. You will also be expected to give a presentation on your project in class that will provide an opportunity for you to present a short demo of your work and show what you have done to other students in the class. Details about the format of the presentation will be posted as the date gets closer.

Project Ideas

The following is a list of possible project ideas; you are not required to choose from this list -- in fact, we encourage you to try to solve a problem of your own choosing! If you are interested in working on one of these projects, contact Professor Madden or Professor Morris and we can put you in touch with students and others around MIT working on these ideas. Note that these are not meant to be complete project proposals, but just suggestions for areas to explore -- you will need to flesh them out into complete projects by talking with your group members, the course staff, and graduate students working on these projects.
  1. Access Methods for in-memory databases.The goal of this project is to study the trade-offs of various data representation options for databases in main memory. For example, in the H-store Project we found that using an array for in-memory tables with "dense" keys is good; one idea for a project would be to characterize the the cases where such arrays are beneficial. Another would be to study in-memory compression, which might improve cache performance. (Contact Evan Jones for more information.)
  2. Partition-aware object-relational mapping. Many programmers seem to prefer object-relational mapping (ORM) layers such as like Ruby on Rails or Hibernate to a traditional ODBC/JDBC interface to a database. In the H-store Project we have been studying performance benefits that can be obtained in a "partitonable" database, where the tables can be cleanly partitioned according to some key attribute (for example, customer-id), and queries are generally run over just one partition. The goal of this project would be to study how to exploit partitioning to improve the performance of a distributed ORM layer.
  3. Specialized storage layers. Design a new database access method optimized for a particular type of data. For example, you could build a SimpleDB or MySQL access method that is especially optimized for storing graphs, or for storing arrays, and compare its performance the basic access methods we have provided.
  4. Client-side database. Build a Javascript library that client-side Web applications can use to access a database; the idea is to avoid the painful way in which current client-side application have to use the XMLHttpRequest interface to access server-side objects asynchronously. This layer should cache objects on the client side whenever possible, but be backed by a shared, server-side database system.
  5. As a related project, HTML5 browsers (including WebKit, used by Safari and Chrome), include a client-side SQL API in JavaScript. This project would involve investigating how to user such a database to improve client performance, offload work from the server, etc.
  6. Preventing denial-of-service attacks on database systems. Databases are a vulnerable point in many web sites, because it is often possible for attackers to make some simple request that causes the website to issue queries asking the database to do a lot of work. By issuing a large number of such requests, and attacker can effectively issue a denial of service attack against the web site by disabling the database. The goal of this project would be to develop a set of techniques to counter this problem -- for example, one approach might be to modify the database scheduler so that it doesn't run the same expensive queries over and over.
  7. Auto-admin tools to recommend indices, etc. Design a tool that recommends a set of indices to build given a particular workload and a set of statistics in a database. Alternatively investigate the question of which materialized views to create in a data-warehousing system, such as C-Store.
  8. A related (more theoretical) project is to look at what properties of transactions would simplify concurrency control and recovery. The simple example of this is that if all transactions are read only, no concurrency control is needed. But there are a variety of other situations -- see a recent paper from the database group on this topic. The goal of this project would be to think of additional properties and to try to deduce algorithms that automatically determine what properties a collection of transactions have.
  9. Asynchronous Database Access. Client software interacts with standard SQL databases via a blocking interface like ODBC or JDBC; the client sends SQL, waits for the database to process the query, and receives an answer. A non-blocking interface would allow a single client thread to issue many parallel queries from the same thread, with potential for some impressive performance gains. This project would investigate how this would work (do the queries have to be in different transactions? what kind of modification would need to be made to the database) and would look at the possible performance gains in some typical database benchmarks or applications.
  10. Heuristic Optimizer. The Selinger Optimizer (see "Access Path Selection in a Relational Database Management System" in the Red Book) is notoriously bad at making query optimization decisions when there are many joins because the statistics it computes over intermediate results are often wrong. Your goal for this project is to devise a scheme for evaluating joins that doesn't require such detailed statistics -- for example, by partitioning the query plan into several smaller groups of joins that can be computed independently, and then using statistics collected during the computation of those joins to join the remaining intermediate relations. Of course, this won't be "optimal", but your goal should be to show that your scheme avoids making very bad decisions and can in-fact outperform Selinger in some situations.
  11. Extend SimpleDB. SimpleDB is very simple. There are a number of ways you might extend it to explore some of the research ideas we have studied in this class. For example, you could add support for optimistic concurrency control and compare its performance to the basic concurrency control scheme you will implement in Problem Set 3. There are a number of other possible projects of this type; we would be happy to discuss these in more detail.
  12. CarTel. In the CarTel project, we are building a system for collecting and managing data from automobiles. There are several possible CarTel related projects:
  13. RDF data browser. Build a tool that visualizes a graph of RDF data and allows browsing and / or filtering of it. This could include efforts to deduce partial integrity constraints ("almost every book has an author") as well as to generate wide (more relational) representations from RDF triples.
  14. SPARQL to SQL conversion. SPARQL is a RDF query language; this project would look at converting it to SQL for execution on RDF data stored in a relational format (see this paper for a discussion as to why storing RDF data in an RDBMS might be beneficial from a performance perspective.) It may be possible to leverage work from the open source Jena tool for this project.
  15. Integrity Constraints in Streaming Systems. Current stream data processing systems (for example, Borealis) do not provide any way for users to specify constraints on the data that they can handle -- for example, a financial analysis application might want to verify that no purchases exceed some available balance. This project would involve defining what such constraints mean, and exploring ways to efficiently check and react to them. For more information, see, for example, StreamClean (PDF).
  16. Use a database to log and recover a file system. This project would involve using a database to implement transactions in a file system. The idea would be to use a user-level file-system package like FUSE to intercept calls to read and write from the file system, and to log these actions to a database to support ideas like transactions and rollback of the file system.
  17. Rollback of long-running or committed transactions. Database systems typically only support UNDO of committed transactions, but there are cases where it might be important to rollback already committed transactions. One approach is to use user-supplied compensating actions, but there may be other models that are possible, or it may be possible to automatically derive such compensating action for certain classes of transactions.
  18. Cloud database on Amazon EC2. Amazon EC2 is a hosted cloud computing service. We have a number of credits available for use of these machines. One possible project would involve building a database system that runs in the cloud, looking at issues such as elasticity (e.g., adding or removing new machines on the fly), fault tolerance, and load balancing.
  19. SQL on Hadoop or Hadoop on SQL. Hadoop is an open source implementation of MapReduce. Possible projects include using Hadoop to schedule SQL tasks, or extending the SimpleDB execution engine to issue Hadoop jobs that run a database query in parallel.