SummaryRelational database systems are typically built on top a table-oriented storage manager. Tables are usually physically stored in a natural, “row-oriented” format, such that all of the fields of a given row (or tuple) are stored together on disk. Unsurprisingly, the performance of a given database query is highly dependant on the physical arrangement of the tables being accessed on disk, as well as the collection of indices and other access methods available for a given table. For queries that retrieve just a few records, the time to completely scan a table may be dramatically longer than the time to lookup those records in an index. For queries that need to access a significant fraction of a table, however, the additional disk I/Os required to traverse an index to retrieve all of the records may make it dramatically slower than simply scanning the entire table.Database system designers have long understood these physical storage tradeoffs, and modern database systems are optimized to provide very good performance across a range of queries that are typical in online transaction processing environments (e.g., reservation and banking systems) where relational databases were originally developed. Recent years, however, have seen the emergence of several new types of database systems where traditional physical database designs do not perform particularly well. For example, in the C-Store system that we have been building at MIT, we are investigating the use of a “column-oriented” physical database design. By storing data from the same field of a table together, it is possible for queries that access only a small subset of all of the fields in a table to perform substantially less disk I/O than the same queries in a row-oriented system. This is especially true in data warehouses and other read-mostly environments, where the majority of queries require scanning a large portion of the database, and there are few updates, which are expensive in a column-oriented system as they require updating many distinct areas on disk. Our performance results, both in an academic prototype as well as a commercial startup, suggest that this physical reorganization (when combined with new compression techniques) can provide about two orders of magnitude performance increase over “row-oriented” databases on data warehouse benchmarks. Based party on our experience in the C-Store project, we propose to build a new type of data storage system, the ChunkyStore. Our basic purpose is to bridge the gap between the two extremes of purely row-oriented and purely column-oriented databases by building a next generation storage system that adapts smoothly between these extremes. In particular, we are interested exploring hybrid physical database designs where being “partially” column oriented makes sense, such as storing one table as columns and another as rows, as well as exploring more sophisticated ways of “chunking” database tables into small segments of rows and columns that are likely to be accessed together. In addition to morphing smoothly between row and column representations, a goal of ChunkyStore is to support table-like storage that is somewhat different than purely relational data. In particular we are interested in being able to efficiently represent sequence and multi-dimensional data, that is naturally ordered (unlike relations, which are “sets”, and hence naturally unordered). Rather than storing ordered data in unordered relations, ChunkyStore can be told when data is ordered, and can efficiently represent such data via a number of optimizations. For example, in an ordered domain, it may be possible to directly offset into a time series or array to identify a tuple corresponding to a particular location or time. Array like representations also arise in array-based multidimensional OLAP (MOLAP) data-cube stores; we envision that ChunkyStore, with its array-oriented optimizations, will work equally well as a MOLAP system. Hence, in addition to bridging the column/row-store divide, it will bridge the divide between relational OLAP systems and MOLAP systems. We are currently building two data processing applications on top of ChunkyStore: First, ChunkyScientist is a high dimensionality scientific database that provides a MATLAB like interface. It is designed to support large quantities of array-valued data (such as images and audio files). By properly storing dense arrays so that it is possible to directly offset to relevant sub-arrays, substantial performance gains can be had. For sparse arrays, list- based representations can have big performance wins. Second, ChunkyDB is a relational database that offers performance comparable to either a row store or a column store. We expect it will be well suited to applications that a read-mostly, but with a healthly mix of writes, such as in customer-relationship management applications. In these environments, a pure column store will suffer some performance overhead, but a pur PeopleFaculty
Post-Docs
Students
Publications
FundingThis project is funded by the NSF as award IIS-0704424. |