6.830 Final Project Assignment and Ideas (Fall 2017)

Important Dates:

  Project proposal due: October 6
  Final report due (paper copy, in class): December 14

A large portion (20%) of your grade in 6.830 consists of a final project. This project is meant to be a substantial independent research or engineering effort related to material we have studied in class. Your project may involve a comparison of systems we have read about, an application of database techniques to a system you are familiar with, or a database-related project in your research area.

This document describes what is expected of a final project and proposes some possible project ideas.

What is Expected

Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class and that they contain some element of research -- e.g., that you do more than simply engineer a piece of software that someone else has described or architected. To help you determine if your idea is of reasonable scope, we will arrange to meet with each group several times throughout the semester.

What to Hand In

There are two written deliverables: a project proposal, due on October 6th, and a final report, due on December 14th.

Project Proposal: The proposal should consist of 1-2 pages describing the problem you plan to solve, outlining how you plan to solve it, and describing what you will "deliver" for the final project. We will arrange short meetings with every group before the project proposal to help you refine your topic and would be happy to provide feedback on a draft of your proposal before it is due.

Final Report: You should prepare a conference-style report on your project with a maximum length of 15 pages (10 pt font or larger, one or two columns, 1 inch margins, single or double spaced -- more is not better). Your report should introduce and motivate the problem your project addresses, describe related work in the area, discuss the elements of your solution, and present results that measure the behavior, performance, or functionality of your system (with comparisons to other related systems as appropriate).

Because this report is the primary deliverable upon which you will be graded, do not treat it as an afterthought. Plan to leave at least a week to do the writing, and make sure you proofread and edit carefully!

Please submit a paper copy of your report in class on Wednesday, 12/14. You will also be expected to give an in-class presentation on your project, including a short demo, so that other students in the class can see what you have done. Details about the format of the presentation will be posted as the date gets closer.

Sample Projects

Here's an example of a nice project from last year: ImputeDB: Data Imputation as a Query Optimization. Jose Cambronero, John K. Feser, Micah Smith.

Project Ideas

The following is a list of possible project ideas from past years; we will update it over the next few weeks. You are not required to choose from this list -- in fact, we encourage you to try to solve a problem of your own choosing! If you are interested in working on one of these projects, the staff can put you in touch with students and others around MIT working on these ideas. Note that these are not meant to be complete project proposals, but just suggestions for areas to explore -- you will need to flesh them out into complete projects by talking with your group members, the course staff, and graduate students working on these projects.
  1. Build a system that ingests a stream of pictures (say crawled from the web) very quickly and supports fast object recognition queries (find a blue cat).
  2. Collect training data of the form (query, execution time), fit a predictive ML model (e.g., a CNN), and see if you can predict the performance of new queries. Alternatively, see if you can build a query optimizer that uses a CNN or other ML model as its cost model. (A rough starting-point sketch for the prediction task appears after this list.)
  3. Investigate the best way to build a massive database of imagery (e.g., ImageNet, or videos from YouTube ...)
  4. Investigate the best way to build a massive database of sensor data (e.g., high-rate acceleration data from millions of phones, or LIDAR data from thousands of cars). If this is of interest, we can help you get access to data.
  5. Add database operators to Orchestra

    Orchestra is an open source workflow management system that coordinates teams of experts and machines as they collaborate on a project. For the most part, we've designed and built the human task-completion side of the system, with a basic API for Pythonic machine task processing. One interesting project would be to add traditional database operators to Orchestra so that it can support a rich set of data filtering, branching, joining, and aggregation operations. If your team worked on this project, one of the project maintainers (all members of B12, an MIT startup focused on the future of work) could meet with you as frequently as you'd like to mentor you through the process of contributing to an active codebase! Contact marcua@marcua.net (a former 6.830 TA and founder of B12).

  6. D3.js is a visualization system for rendering data. It uses a number of database-like constructs internally, some of which are quite primitive. For example, it uses nested loop joins and doesn't reorder selections. The goal would be to build a simple cost-based query optimizer/rewriter for D3, or for one of the vis systems that runs on top of D3, like crossfilter (http://square.github.io/crossfilter/).
  7. Loading large datasets (> 10 GB) can be very slow, especially when there are many indexes on the tables. One simple trick is to sort the data before inserting it, or to drop the indexes, insert the data, and then rebuild the indexes (a small illustration of this trick appears after this list). The goal of this project would be to investigate other techniques to improve load performance in an open source database such as Postgres or MySQL/InnoDB. One concrete idea might be a "fast bulk load" a la HBase for MySQL: HBase uses a MapReduce job to build its indexes, and one possible project would be to build a prototype that does the same for one or more MySQL servers.
  8. Hosted database services such as Amazon RDS and Microsoft SQL Azure are becoming popular. It is still unclear what the performance impact is of running applications on a local (non-hosted) platform, such as a local enterprise datacenter, versus having the data hosted "in the cloud". This project would investigate the performance of such systems on a particular workload or workloads, e.g., OLAP, OLTP, or Web.
  9. Flash storage promises lower latency for random operations than conventional disks. However, it also has a number of unusual limitations. An interesting project would investigate the performance impact of using flash memory for DB applications, or for specific DB tasks such as indexing or recovery.
  10. Scalable Wrangler. Build a Data Wrangler-like interface that scales to larger data sets (millions of rows), or devise a sampling strategy to automatically load the most relevant records (e.g., representative outliers) into Wrangler. More generally, think about how you would improve the wrangling interface.
  11. Databases are "schema first", meaning you have to declare a schema, and ensure your data conforms to it, before you can load data into the database. The goal of this project would be to build a database system that allows you to immediately start manipulating your data, without declaring a schema, by, e.g., auto-generating column names and ignoring improperly formatted tuples in the input data (a toy sketch of such a loader appears after this list). The database would then allow you to clean up the data, renaming columns and fixing up broken or missing records.
  12. Compare the performance of a Hadoop-based database like Hive to a relational system like MySQL or Postgres on some workload, and understand where the differences come from. Alternatively, look at implementing in Hive one of the advanced features of relational databases that we have discussed, or at improving the Hive optimizer in some way.
  13. More generally, compare the performance of different SQL/NoSQL solutions on different workloads (e.g., MySQL vs Postgres vs Hive vs Pig vs Voldemort vs Mongo vs ...).
  14. Look at running some collection of statistical, machine learning, or other data analysis operations inside a database, and compare their performance to copying the data out of the database and running the same algorithms in MATLAB or some other programming language. See, for example, MADlib.
  15. Build a mobile app to find nearby phones and search them for related data. For example, a user might request: "I want to see the current view of Stata after a winter storm", and the system would look for photos taken within the last hour or so on phones that it knows about and stitch them together to generate a 360-degree view. Challenges include, for example, designing metadata for photos taken on phones, the query language, and how to forward data (send everything to a centralized server and then forward to users, vs. some kind of peer-to-peer approach). Modern vision algorithms in tools like OpenCV are good enough that they can recognize common scenes and objects, which could facilitate building such a database.
  16. Concurrent insert-heavy tree: a lot of people have looked at using batching to improve the performance of insert-heavy workloads (e.g., LSM trees, COLAs, buffer trees). But these are still mostly sequential data structures. The goal of this project is to design a concurrent version of such a tree.
  17. Query plan JIT compiler: now that in-memory DBs are becoming more popular, it might make sense to generate native code for query plan evaluation to reduce the latency of interpreted plans. Some gains might be achieved with runtime type knowledge, just as in standard compilers. It would be fun to take SimpleDB and generate (for instance) LLVM IR on the fly for query evaluation.
  18. Benchmark study of different file systems for database performance. This would involve an experimental study of different file systems (e.g., ext4, ZFS, Btrfs, etc.) measuring how much difference the choice of file system makes to performance for workloads that do not fit in memory. You could look at both disk and flash.
  19. Build an auto-cleaning database: a database that, based on user input, automatically "predicts" missing values in a column by applying a standard machine learning technique to the remaining columns (a minimal imputation sketch appears after this list). A more advanced version could automatically pick the machine learning algorithm that best predicts the missing values.
  20. Data cleaning experimental study: Do an experimental study of the data cleaning errors commonly found in practice in log files available online, and develop algorithms for automatically wrangling log files into a more structured form.
  21. Graph processing language: Most graph processing systems end up focusing on simple algorithms (shortest paths, PageRank) that are not really used by real graph analysts. Study http://snap.stanford.edu/snap/description.html to identify common patterns of accesses made by different network algorithms, and design a wrapper over a relational or graph database that can support a large fraction of these operations.
  22. Rapid Histogram Visualizations: Study the online aggregation paper, and build a system that uses online aggregation techniques to produce rapid histogram visualizations of aggregates that improve over time. Then, study the impact of various sampling techniques on producing visualizations that are accurate but computed on much less data.
  23. Auto-admin tools to recommend indices, etc. Design a tool that recommends a set of indices to build given a particular workload and a set of statistics in a database. Alternatively investigate the question of which materialized views to create in a data-warehousing system, such as C-Store.
  24. Extend SimpleDB. SimpleDB is very simple. There are a number of ways you might extend it to explore some of the research ideas we have studied in this class. For example, you could add support for optimistic concurrency control and compare its performance to the basic concurrency control scheme you will implement in Lab 4. There are a number of other possible projects of this type; we would be happy to discuss these in more detail.
  25. Rollback of long-running or committed transactions. Database systems typically only support UNDO of uncommitted transactions, but there are cases where it might be important to roll back already-committed transactions. One approach is to use user-supplied compensating actions, but there may be other models that are possible, or it may be possible to automatically derive such compensating actions for certain classes of transactions.
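
Illustrative Sketches

The four short sketches below correspond to the forward references in ideas 2, 7, 11, and 19. They are rough starting points only: every class, table, column, and feature name in them is made up for illustration, and the specific libraries used are one possible choice among many.

For idea 2, a minimal non-neural baseline, assuming you have already collected (query, execution time) pairs and can extract a few hand-crafted plan features; the feature set, the QuerySample container, and the use of scikit-learn's RandomForestRegressor are assumptions (the CNN suggested above would replace the regressor):

    # Baseline for idea 2: predict query run time from simple plan features.
    # The features and model choice are illustrative assumptions.
    from dataclasses import dataclass
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    @dataclass
    class QuerySample:
        num_joins: int         # joins in the plan
        num_predicates: int    # selection predicates
        est_input_rows: float  # optimizer's cardinality estimate
        uses_index: int        # 1 if any index scan appears in the plan
        runtime_sec: float     # measured execution time (the label)

    def train_runtime_model(samples):
        """Fit a regressor on (features, runtime) pairs and report held-out error."""
        X = [[s.num_joins, s.num_predicates, s.est_input_rows, s.uses_index]
             for s in samples]
        y = [s.runtime_sec for s in samples]
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X_tr, y_tr)
        print("MAE on held-out queries (sec):",
              mean_absolute_error(y_te, model.predict(X_te)))
        return model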
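
For idea 7, a tiny timing harness that demonstrates the sort-before-insert / build-index-after-load trick. SQLite (from the Python standard library) stands in for Postgres or MySQL purely so the sketch is self-contained; the table, key distribution, and row count are made up:

    # Idea 7: compare loading into an already-indexed table against sorting
    # the data, loading it, and building the index afterwards.
    import random
    import sqlite3
    import time

    def make_rows(n=200_000):
        return [(random.randrange(10_000_000), "payload") for _ in range(n)]

    def load_index_first(rows):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE t (k INTEGER, v TEXT)")
        conn.execute("CREATE INDEX t_k ON t (k)")          # index exists during the load
        t0 = time.perf_counter()
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
        conn.commit()
        return time.perf_counter() - t0

    def load_sorted_index_last(rows):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE t (k INTEGER, v TEXT)")
        t0 = time.perf_counter()
        conn.executemany("INSERT INTO t VALUES (?, ?)", sorted(rows))  # sort by key first
        conn.execute("CREATE INDEX t_k ON t (k)")           # build the index after the load
        conn.commit()
        return time.perf_counter() - t0

    if __name__ == "__main__":
        rows = make_rows()
        print("index first, unsorted load:", load_index_first(rows))
        print("sorted load, index last:  ", load_sorted_index_last(rows))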
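
For idea 11, a toy "schema-last" loader: it auto-generates column names, infers per-column types from the values it sees, and quarantines improperly formatted rows instead of rejecting the file. The CSV input format and the simple int/float/str type lattice are assumptions:

    # Idea 11: ingest a CSV with no declared schema. Column names are
    # auto-generated, types are inferred, and malformed rows are set aside
    # so they can be cleaned up later.
    import csv

    def infer_type(values):
        """Return the narrowest of int -> float -> str that fits every value."""
        for caster, name in ((int, "int"), (float, "float")):
            try:
                for v in values:
                    caster(v)
                return name
            except ValueError:
                continue
        return "str"

    def schema_less_load(path):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        width = max(len(r) for r in rows)                    # widest row defines the table
        columns = [f"col_{i}" for i in range(width)]         # auto-generated column names
        good = [r for r in rows if len(r) == width]
        quarantined = [r for r in rows if len(r) != width]   # kept around for later repair
        types = [infer_type([r[i] for r in good]) for i in range(width)]
        return columns, types, good, quarantined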
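
For idea 19, a minimal imputation step that predicts missing values in one column from the remaining columns, using pandas and scikit-learn; the crude categorical encoding and the classifier choice are assumptions (a regressor would be the analogous choice for numeric targets), and a fuller system would also select among models automatically:

    # Idea 19: fill missing values in `target` with predictions from a model
    # trained on the rows where `target` is present. Column names are hypothetical.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    def impute_column(df: pd.DataFrame, target: str) -> pd.DataFrame:
        features = pd.get_dummies(df.drop(columns=[target])).fillna(0)  # crude feature prep
        known = df[target].notna()
        if known.all():
            return df                      # nothing to impute
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(features[known], df.loc[known, target])
        out = df.copy()
        out.loc[~known, target] = model.predict(features[~known])
        return out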