6.830 Final Project Assignment and Ideas (Fall 2016)

Important Dates:

A large portion (20%) of your grade in 6.830 consists of a final project. This project is meant to be a substantial independent research or engineering effort related to material we have studied in class. Your project may involve a comparison of systems we have read about, an application of database techniques to a system you are familiar with, or be a database-related project in your research area.

This document describes what is expected of a final project and proposes some possible project ideas.

What is Expected

Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class and that they contain some element of research -- e.g., that you do more than simply engineer a piece of software that someone else has described or architected. To help you determine if your idea is of reasonable scope, we will arrange to meet with each group several times throughout the semester.

What to Hand In

There are two written deliverables: a project proposal, due on October 6th, and a final report, due on December 14th.

Project Proposal: The proposal should consist of 1-2 pages describing the problem you plan to solve, outlining how you plan to solve it, and describing what you will "deliver" for the final project. We will arrange short meetings with every group before the project proposal to help you refine your topic and would be happy to provide feedback on a draft of your proposal before it is due.

Final Report: You should prepare a conference-style report on your project with maximum length of 15 pages (10 pt font or larger, one or two columns, 1 inch margins, single or double spaced -- more is not better.) Your report should introduce and motivate the problem your project addresses, describe related work in the area, discuss the elements of your solution, and present results that measure the behavior, performance, or functionality of your system (with comparisons to other related systems as appropriate.)

Because this report is the primary deliverable upon which you will be graded, do not treat it as an afterthought. Plan to leave at least a week to do the writing, and make sure your proofread and edit carefully!

Please submit a paper copy of your report in class on Wednesday, 12/10. You will also be expected to give a presentation on your project in class that will provide an opportunity for you to present a short demo of your work and show what you have done to other students in the class. Details about the format of the presentation will be posted as the date gets closer.

Project Ideas

The following is a list of possible project ideas from past years; we will update over the next few weeks. You are not required to choose from this list -- in fact, we encourage you to try to solve a problem of your own choosing! If you are interested in working on one of these projects, the staff can put you in touch with students and others around MIT working on these ideas. Note that these are not meant to be complete project proposals, but just suggestions for areas to explore -- you will need to flesh them out into complete projects by talking with your group members, the course staff, and graduate students working on these projects.
  1. SQL Autocomplete: Given one or more databases (including their content) and schemas, build a tool that guides users as they write queries, but suggesting field/table names, values for constants, etc. Alternatively, allow users to specify a small amount of information (e.g., some keywords), and suggest possible SQL queries that find results containing those keywords.
  2. Benchmarking is incredibly important, yet incredibly hard to write. Users have to conceptualize, write the data generators and then generate queries for each database vendor dialect. There are no good tools which help developers develop and share a benchmark. The goal of the project would build to fill the gap and build:
    1. a language to specify schema and data generation properties of a benchmark and automatically generate a data generator based on the spec
    2. do automatic translation of queries across different database vendor dialects
    Having a specification language will give at least 3 benefits:
    1. We can generate an efficient data generator for the benchmark directly from the spec for free. This is important as writing a data generator is cumbersome and despite best-effort may not be efficient. For example, the TPC-H benchmark data generator can be made to run 2.5x faster by rewriting the code; the SSBM benchmark data generator does not compile by default on Mac.
    2. Queries in the benchmark can automatically be translated to many different query engines. Again, TPC-H benchmark supports 6 different databases by default. Each has slightly different dialect. The queries specified with one dialect can be parsed and automatically converted to other dialects.
    3. Users can easily modify it for their use case. For example:
  3. D3.js is a visualization system for rendering data. It uses a number of database-like constructs internally, some of which are quite primitive. For example, they use nested loop joins and don’t reorder selections. The goal would be to build a simple cost-based query optimizer/rewriter for D3. Contact: Holger Pirk (holger@csail.mit.edu)
  4. Loading large datasets (> 10 GB) can be very slow, especially when there are many indexes on tables. One simple trick is to sort data before inserting it, or to drop indexes, insert data, and reload indexes. The goal of this project would be to investigate other techniques to improve load performance in an open source database such as Postgres or MySQL/Innodb. One concrete idea might be "fast bulk load" a la Hbase for MySQL. HBase uses a MapReduce job to build its indexes. One possible project would be to build a prototype that does the same for a single (or multiple) MySQL Boxes.
  5. Hosted database services such as Amazon RDS, Microsoft SQL Azure are starting to become popular. It is still unclear what the performance impact is of running applications on a local (non-hosted) platform, such as a local enterprise datacenter, versus having the data hosted "in the cloud". This project would investigate the performance of such systems on a particular workload or workloads(s), e.g., OLAP, OLTP, Web.
  6. Flash storage promises lower latency for random operations than conventional disks. However, it also has a number of unusual limitations. An interesting project investigates the performance impact of using flash memories for DB applications, or for specific DB tasks such as indexing or recovery.
  7. Twitter provides a fire hose of data. Automatically filtering, aggregating, analyzing such data can allow to harness the full vlue of the data, extracting valuable information. The idea of this project is to investigate stream processing technology to operate on social streams, or extract structure from them and store them in a database system.
  8. Scalable Wrangler. Build a (Data Wrangler) like interface that scales to larger data sets (millions of rows), or devise a sampling strategy to automatically load the most relevant records (e.g., representative outliers) into wrangler. More generally think about how you would improve the wrangling interface.
  9. Databases are "schema first", meaning you have to declare a schema, and ensure your data conforms to it, before you can load data into the datbase. The goal of this project would be to build a database system that allows you to immediately start manipulating your data, without declaring a schema, by, e.g., auto-generating column names, ignoring improperly formatted tuples in input data, etc. The database would then allow you to clean up the data, renaming columsn and fixing up broken or missing records.
  10. Compare the performance of a Hadoop-based database like Hive to a relational system like MySQL or Postgres on some workload, and understand where the differences come from. Alternatively, look at implementing on of the advanced features of relational databases that we have discussed in Hive, or improving the optimizer in Hive in some way.
  11. More generally the performance of different SQL/NoSQL solutions on different workloads (e.g., MySQL vs Postgres vs Hive vs Pig vs Voldemort vs Mongo vs ...)
  12. Look at running some collection of statistical or machine learning or other data analysis operations inside of a database, and compare their performance them to running those algorithms by copying data out of the database and running in Matlab or some other programming language. See, for example Mad Lib.
  13. Build a mobile app to find nearby phones and search the for related data. For example, a user might request: "I want to see the current view of stata after a winter storm" and the system will try to look for photos taken within the last hour or so on phones that it knows about and then stitch them together to generate a 360 view. Challenges are, for example, designing metadata for photos taken on phones, the query language, and how to forward data (send everything to centralized server and then forward to users vs some kind of peer to peer approach). Modern vision algorithms in tools like OpenCV are good enough that they can recognize common scenes and objects that could facilitate building such a database.
  14. Concurrent insert-heavy tree: a lot of people have looked at using batching to improve the performance of insert-heavy workloads (e.g., LSM tress, COLAs, buffer trees). But these are still mostly sequential data structures. The goal of this project is to design a concurrent version of such a tree.
  15. Query plan JIT compiler: now that in-memory DBs are becoming more popular, it might make sense to try generating native code for query plan evaluation to reduce the latency of intepreted plans. Some gains might be achieved with runtime type knowledge just as in standard compilers. It would be fun to take SimpleDB and generate (for instance) LLVM bytecode on the fly for query evaluation.
  16. Benchmark study of different file systems for database performance. This would involve an experimental study of different file systems (e.g., ext4, zfs, btrfs, etc.) where you measure how much the choice of file system makes on performance for non in-memory workloads. You could look at both disk and flash.
  17. Build an auto-cleaning database: A database that, based on user input, automatically "predicts" missing values in a column using a standard machine learning technique on the remaining columns. A more advanced version of this database can pick the most suited machine learning algorithm that best predicts missing values.
  18. Data cleaning experimental study: Do an experimental study of the common data cleaning errors found in practice in log files found online, and develop algorithms for automatically wrangling log files to a more structured form.
  19. Graph processing language: Most graph processing systems end up focusing on simple algorithms (shortest paths, pagerank) that are not really used by real graph analysts. Study http://snap.stanford.edu/snap/description.html to identify common patterns of accesses made by different network algorithms, and design a wrapper over a relational or graph database that can supports a large fraction of these operations.
  20. Rapid Histogram Visualizations: Study the online aggregation paper, build a system to use the online aggregation techniques to produce rapid histogram visualizations of aggregates that improve over time. Then, study the impact of various sampling techniques to get to even better visualizations that are accurate but are computed on much less data.
  21. Auto-admin tools to recommend indices, etc. Design a tool that recommends a set of indices to build given a particular workload and a set of statistics in a database. Alternatively investigate the question of which materialized views to create in a data-warehousing system, such as C-Store.
  22. Extend SimpleDB. SimpleDB is very simple. There are a number of ways you might extend it to explore some of the research ideas we have studied in this class. For example, you could add support for optimistic concurrency control and compare its performance to the basic concurrency control scheme you will implement in Lab 4. There are a number of other possible projects of this type; we would be happy to discuss these in more detail.
  23. Rollback of long-running or committed transactions. Database systems typically only support UNDO of committed transactions, but there are cases where it might be important to rollback already committed transactions. One approach is to use user-supplied compensating actions, but there may be other models that are possible, or it may be possible to automatically derive such compensating action for certain classes of transactions.