New England Database Summit 2011 Program
|9:00 AM||Welcoming remarks (Yanlei Diao and Samuel Madden)|
Keynote 1: Donald Kossmann (Professor at ETH Zurich). Predictable Performance for Unpredictable Workloads
This talk presents the design of SwissBox. SwissBox is a database appliance designed to process thousands of concurrent queries and updates with bounded query response times and strict data freshness guarantees. The system was designed to aggressively share operations between concurrent queries and updates. This talk shows the design of the storage manager (called Crescando) and the design of the query processor (called SharedDB). Furthermore, the talk presents the results of performance experiments with workloads from an airline reservation system.
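The aggressive sharing the abstract describes can be illustrated with a toy shared scan in the spirit of Crescando's approach (this sketch and its predicate API are illustrative, not SwissBox code):

```python
# Toy sketch of a shared scan: instead of scanning the table once per
# query, one pass over the data answers a whole batch of pending queries,
# which is what makes per-query response times predictable.

def shared_scan(table, queries):
    # queries: dict of query_id -> predicate; a single pass serves them all.
    results = {qid: [] for qid in queries}
    for row in table:                       # one pass over the data
        for qid, pred in queries.items():
            if pred(row):
                results[qid].append(row)
    return results

table = [{"id": 1, "city": "ZRH"}, {"id": 2, "city": "BOS"}]
out = shared_scan(table, {
    "q1": lambda r: r["city"] == "ZRH",
    "q2": lambda r: r["id"] > 0,
})
assert out["q1"] == [{"id": 1, "city": "ZRH"}]
assert len(out["q2"]) == 2
```

Because the scan's cost is one pass regardless of how many queries are queued, response time stays bounded as concurrency grows.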
About the speaker: Donald Kossmann is a professor of Computer Science at ETH Zurich (Switzerland). He received his MS from the University of Karlsruhe and completed his PhD at the University of Aachen. After that, he held positions at the University of Maryland, the IBM Almaden Research Center, the University of Passau, the University of Munich, and the University of Heidelberg. He is an ACM Fellow, a member of the board of trustees of the VLDB Endowment, and was program committee chair of ACM SIGMOD 2009. He is a co-founder of i-TV-T (1998), XQRL Inc. (acquired by BEA in 2002), and 28msec Inc. (2007). His research interests lie in the area of databases and information systems.
|Session 1 : Provenance and Security (Chair: Gerome Miklau)|
Margo Seltzer. Provenance Everywhere.
Digital provenance describes the ancestry or history of a digital object. Computer science research in provenance has addressed issues in provenance capture in operating systems, command shells, languages, workflow systems and applications. However, it's time to begin thinking seriously about provenance interoperability, what it means, and how we can achieve it. We have undertaken several projects that integrate provenance across multiple platforms. Doing so introduces many challenging research opportunities. In this talk, I'll present our Provenance-Aware Storage System, focusing on our experiences integrating provenance across different layers of abstraction. I'll present some of our use cases and discuss important issues for further research.
|10:50-11:10||Raluca Ada Popa, Nickolai Zeldovich, Hari Balakrishnan. CryptDB: A Practical Encrypted Relational DBMS.
CryptDB is a DBMS that provides provable and practical privacy in the face of a compromised database server or curious database administrators. CryptDB works by executing SQL queries over encrypted data. At its core are three novel ideas: an SQL-aware encryption strategy that maps SQL operations to encryption schemes, adjustable query-based encryption which allows CryptDB to adjust the encryption level of each data item based on user queries, and onion encryption to efficiently change data encryption levels. CryptDB empowers the server to execute only queries that the users requested, and achieves maximum privacy given the mix of queries issued by the users. The server fully evaluates queries on encrypted data and sends the results back to the client for final decryption; clients don’t perform any query processing and client-side applications run unchanged. Our evaluation shows that CryptDB has modest overhead: on the TPC-C benchmark on Postgres, CryptDB reduces throughput by 27% compared to regular Postgres. Moreover, CryptDB does not change the innards of existing DBMSs: we implemented CryptDB using client-side query rewriting/encryption, user-defined functions, and server-side tables for public key information.
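The onion-encryption idea can be sketched with a toy cipher (this is purely illustrative: the real system uses proper cryptographic schemes, and the layer names below follow the abstract's description, not CryptDB's actual code):

```python
# Toy sketch of onion encryption: each value is wrapped in layers. The
# outermost randomized layer (RND) hides everything; an inner
# deterministic layer (DET) still supports equality checks. The server
# "peels" the onion down to the weakest layer a query actually needs.
import hashlib

def _xor_stream(data: bytes, key: bytes) -> bytes:
    # Toy stream cipher: XOR with a hash-derived keystream (NOT secure).
    stream = hashlib.sha256(key).digest()
    while len(stream) < len(data):
        stream += hashlib.sha256(stream).digest()
    return bytes(a ^ b for a, b in zip(data, stream))

def det_encrypt(value: str, key: bytes) -> bytes:
    # Deterministic layer: same plaintext -> same ciphertext, so the
    # server can evaluate equality predicates without decrypting.
    return _xor_stream(value.encode(), key + b"det")

def rnd_encrypt(det_ct: bytes, key: bytes, nonce: bytes) -> bytes:
    # Randomized outer layer: an 8-byte nonce makes repeats look different.
    return nonce + _xor_stream(det_ct, key + nonce)

def rnd_decrypt(onion: bytes, key: bytes) -> bytes:
    # Peeling RND exposes the DET layer, as when a query first needs an
    # equality comparison on this column.
    nonce, ct = onion[:8], onion[8:]
    return _xor_stream(ct, key + nonce)

key = b"column-key"
a = rnd_encrypt(det_encrypt("alice", key), key, b"nonce001")
b = rnd_encrypt(det_encrypt("alice", key), key, b"nonce002")
assert a != b                                      # RND hides equality
assert rnd_decrypt(a, key) == rnd_decrypt(b, key)  # DET reveals it
```

Peeling is one-way in terms of privacy: once a column sits at DET, equality patterns are visible to the server, which is why CryptDB adjusts layers lazily, only when a query demands it.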
|11:10-11:30||David Schultz and Barbara Liskov. IFDB: Database Support for Decentralized Information Flow Control.
Information flow control is an attractive way to protect sensitive data because it provides a means to restrict how information is used. Access control, in contrast, restricts what data can be read but not what can be done with it. Decentralized Information Flow Control (DIFC) extends information flow so that individuals and organizations can control the policy for their own information; it provides fine-grained, discretionary control for each such principal’s data. In the past few years, there has been significant interest in DIFC in the programming language and operating system communities, but existing systems don’t comprehensively address databases, which is where most applications store their sensitive, persistent data. This paper describes IFDB, the first system, to our knowledge, to extend DIFC to databases.
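The core flow rule behind DIFC can be sketched in a few lines (a hypothetical rendering of the general idea, not IFDB's actual API):

```python
# Minimal sketch of DIFC label checking: every tuple carries a secrecy
# label (a set of tags), and information may flow from a tuple to a
# process only if the process label dominates (is a superset of) it.

def can_read(process_label: set, tuple_label: set) -> bool:
    # The process must already carry every secrecy tag on the tuple;
    # otherwise reading it would leak the tagged information.
    return tuple_label <= process_label

alice_record = {"alice_secret"}          # tagged with Alice's secrecy tag
assert can_read({"alice_secret"}, alice_record)   # Alice's own process
assert not can_read(set(), alice_record)          # unlabeled process
```

The "decentralized" part is that each principal mints and controls its own tags, so fine-grained policy follows the data rather than being fixed by a central administrator.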
|Session 2 : Databases in the Cloud (Chair: Carlo Curino)|
Emmanuel Cecchet, Rahul Singh, Upendra Sharma, Prashant Shenoy. Dolly: Virtualization-driven Database Provisioning for the Cloud.
The Cloud is an increasingly popular platform for e-commerce applications that can be scaled on demand in a very cost-effective way. Dynamic provisioning is used to autonomously add capacity in multi-tier cloud-based applications that see workload increases. While many solutions exist for provisioning application tiers with little or no state, the database tier remains problematic for dynamic provisioning due to the need to replicate its large disk state.
In this talk, we explore virtual machine (VM) cloning techniques to spawn database replicas and address the challenges of provisioning shared-nothing replicated databases in the cloud. We argue that being able to determine state replication time is crucial for provisioning databases and show that VM cloning provides this property. We propose Dolly, a database provisioning system based on VM cloning and cost models to adapt the provisioning policy to the cloud infrastructure specifics and application requirements. We present an implementation of Dolly in a commercial-grade replication middleware and evaluate database provisioning strategies for a TPC-W workload on a private cloud and on Amazon EC2. By being aware of VM-based state replication cost, Dolly can solve the challenge of automated provisioning for replicated databases on cloud platforms.
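The provisioning argument above can be illustrated with a toy cost model (all method names, rates, and numbers here are invented assumptions, not Dolly's actual model):

```python
# Illustrative sketch of why predictable state-replication time matters:
# estimate how long each replica-spawning method takes, then pick the
# cheapest one that still meets a provisioning deadline.

def vm_clone_time(disk_gb, copy_rate_gbps=1.0):
    # VM cloning copies the whole disk image; the time is predictable
    # because it depends only on image size and copy bandwidth.
    return disk_gb * 8 / copy_rate_gbps  # seconds, at gigabits per second

def backup_restore_time(disk_gb, restore_rate_gbps=0.5, log_replay_s=600):
    # Restore from backup plus replay of the update log since the backup;
    # the replay term is workload-dependent and hard to predict.
    return disk_gb * 8 / restore_rate_gbps + log_replay_s

def choose_strategy(disk_gb, deadline_s):
    options = {
        "vm_clone": vm_clone_time(disk_gb),
        "backup_restore": backup_restore_time(disk_gb),
    }
    feasible = {k: v for k, v in options.items() if v <= deadline_s}
    return min(feasible, key=feasible.get) if feasible else None

# A 100 GB database with a ~33-minute provisioning deadline:
assert choose_strategy(100, 2000) == "vm_clone"
```

The point of the sketch is the shape of the decision, not the numbers: because cloning time is a simple function of image size, the provisioning policy can plan replica creation ahead of predicted load spikes.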
|11:50-12:10||Jim Starkey. Databases in 21st Century: NimbusDB and the Cloud.
(Jim Starkey, CTO and Founder, NimbusDB Inc.)
An old adage advises getting your head out of the clouds and your feet on the ground. The exploding numbers of mobile devices and applications and the exponential growth of social networking make this piece of wisdom a candidate for Adage 2.0: "Keep your head in the clouds, but stretch until your feet are on solid ground".
Three aspects of cloud computing challenge traditional database technology. The first is scalability. No single computer system can support the loads imposed by popular worldwide applications. Worse, loads cannot be predicted accurately and configuring for the greatest possible load is economically catastrophic. The second challenge is multi-tenancy. Amazon, Google, and Rackspace demonstrate that a well-managed multi-tenant cloud delivers excellent performance and high resource utilization. Traditional database systems are single tenant; they support a single database per system or per cluster. The third challenge is administration. Cloud environments require strict separation between physical provisioning, which is the province of the cloud vendor, and database administration, the responsibility of the database owner or administrator.
There are many workarounds for each of these challenges. Some offer scalability without multi-tenancy. Amazon offers a multi-tenant cloud of MySQL databases, each restricted to a single machine instance. NoSQL systems offer scalability and multi-tenancy, but without ACID transactions, high-level semantics, or declarative consistency.
Traditional database architectures have reached their limits. If we, the database community, are to meet the challenges of computing in the clouds, we need new ideas, architectures, and implementations.
NimbusDB has the features needed for cloud databases.
|1:00-2:00||Keynote 2: Renée Miller. On Schema Discovery
Structured data is distinguished from unstructured data by the presence of a schema describing the logical structure and semantics of the data. The schema is the means through which we understand and query the underlying data. Schemas enable data independence. In this talk, I consider a few problems related to the discovery and maintenance of schemas. I'll discuss the changing role of schemas from prescriptive to descriptive. This talk is based on joint work with Fei Chiang, Periklis Andritsos, and Oktie Hassanzadeh.
About the speaker: Renée J. Miller is a professor of computer science and the Bell Canada Chair of Information Systems at the University of Toronto. She received the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received an NSF CAREER Award, the Premier's Research Excellence Award, and an IBM Faculty Award. She is a fellow of the ACM. Her research focuses on data management and information systems. She is President of the VLDB Endowment and is also serving as the PC Chair for SIGMOD 2011. She received her PhD in Computer Science from the University of Wisconsin, Madison and bachelor's degrees in Mathematics and in Cognitive Science from MIT.
|Session 3: New Database Design (Chair: Mirek Riedewald)|
|2:00-2:20||Adam Marcus, Eugene Wu, David R. Karger, Samuel Madden, Robert C. Miller. Qurk: A Query Processor for Human Operators.
Amazon's Mechanical Turk ("MTurk") service allows users to post short tasks ("HITs") that other users can receive a small amount of money for completing. Common tasks on the system include labelling a collection of images, combining two sets of images to identify people who appear in both, or extracting sentiment from a corpus of text snippets. Designing a workflow of various kinds of HITs for filtering, aggregating, sorting, and joining data sources together is common, and comes with a set of challenges in optimizing the cost per HIT, the overall time to task completion, and the accuracy of MTurk results. We propose Qurk, a novel query system for managing these workflows, allowing crowd-powered processing of relational databases. We describe a number of query execution and optimization challenges, and discuss some potential solutions.
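A crowd-powered relational operator of the kind described above might look roughly like this (a hypothetical sketch: `crowd_filter`, `ask_crowd`, and the voting policy are invented for illustration and are not Qurk's API):

```python
# Sketch of a crowd-powered filter: tuples are batched into HITs, each
# HIT is answered by several workers, and a majority vote decides
# whether the tuple passes the predicate. ask_crowd stands in for
# actually posting a HIT to MTurk.
from collections import Counter

def crowd_filter(tuples, question, ask_crowd, workers_per_hit=3):
    kept = []
    for t in tuples:
        # Redundant assignments trade cost per HIT for answer accuracy.
        votes = [ask_crowd(question, t) for _ in range(workers_per_hit)]
        if Counter(votes).most_common(1)[0][0]:  # majority says "yes"
            kept.append(t)
    return kept

# Simulated workers: say a photo "shows a cat" iff its filename says so.
fake_worker = lambda q, t: "cat" in t
photos = ["cat_1.jpg", "dog_1.jpg", "cat_2.jpg"]
assert crowd_filter(photos, "Does this photo show a cat?", fake_worker) \
    == ["cat_1.jpg", "cat_2.jpg"]
```

The optimization challenges in the abstract map directly onto the knobs here: `workers_per_hit` trades money for accuracy, and batching several tuples per HIT trades latency for cost.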
|2:20-2:40||Robert Soule, Martin Hirzel, Robert Grimm, Bugra Gedik. Distributed CQL Made Easy.
This talk is about making it easy to implement a distributed CQL. It argues that the key to simplifying not just the development of distributed CQL, but streaming applications in general, is to approach it as a language problem, not a database problem. At the heart of our solution lies the design of an intermediate language (IL), River. We explore how to provide language- and runtime-independent optimizations and easy integration with distributed runtimes. We have used River to implement a translation of CQL, a generic version of the data parallel optimization, and a mapping to IBM's high-performance streaming middleware, System S. Overall, this work not only significantly reduces the development effort for implementing a distributed CQL, but also provides a lingua franca for mapping streaming languages in general to distributed streaming runtimes, and a common substrate for implementing optimizations.
|Session 4: DB Design and Performance Tuning (Chair: Sam Madden)|
|3:00-3:20||Paul G. Brown. SciDB: Towards a Terabyte Matrix Multiply.
SciDB is an open-source DBMS, oriented toward complex analytic tasks involving very large-scale data sets, such as those found in the science community, financial services, oil and gas, and genomics. SciDB is an array-based DBMS with an array query language borrowing heavily from SQL. It runs on a shared-nothing cluster of Linux machines. At the present time we are shipping Version 0.75, which has been downloaded by around 100 external sites.
SciDB’s design combines data management functionality, such as filters, joins, and aggregates, with linear algebra and statistical operations, including matrix multiply and matrix inversion. Although general SQL scalability has been demonstrated by commercial data warehouse vendors using horizontal partitioning, SciDB must provide for a more complex array data model and must mix data management operations with linear algebra ones. Further, SciDB offers a number of advanced features useful to this kind of analytic processing: provenance, version control, and support for random variables as data types.
In this talk, we will present scalability numbers for SciDB, on both data management and complex analytic commands. We are using the Amazon EC2 environment and show results over hundreds of processing nodes. We also show dramatically lower costs for complex analytics compared to high-end supercomputers, and we will present an estimate of the cost in dollars to perform a 1TB-by-1TB dense matrix multiply on EC2. This will establish a "cost per terabyte" metric by which to judge other solutions.
|3:20-3:40||M. Akdere, J. Duggan, U. Cetintemel, O. Papaemmanouil, E. Upfal, S. Zdonik. Using Prediction.
Accurate query performance prediction (QPP) is central to effective resource management and provisioning, query optimization and user experience management. Analytical cost models, which are commonly used by optimizers to compare candidate plan costs, are poor predictors of execution latency. As a more promising approach to QPP, we study the practicality and utility of sophisticated learning-based models, which have recently been successfully applied to a variety of predictive tasks.
This talk will cover learning-based techniques we developed for QPP in analytical workloads. In the first part of the talk, we focus on QPP in isolated runs. We describe predictive modeling techniques that learn query execution behavior at different granularities, ranging from coarse grained plan-level models to fine-grained operator-level models. We demonstrate that these two extremes offer a tradeoff between high accuracy and generality, respectively, and introduce a hybrid approach that combines their respective strengths by selectively composing them in the process of QPP. We discuss how we can use a training workload to (i) pre-build and materialize such models offline, so that they are readily available for future predictions, and (ii) build new models online as new predictions are needed. All prediction models are built using only static features (available prior to query execution) and the performance values obtained from the offline execution of the training workload.
In the second part of the talk, we focus on concurrent query runs. We describe a QPP solution based on the analysis of query behavior in isolation, pairwise query interactions and sampling techniques. To improve the accuracy of the predictions, our solution builds a model to create a timeline of the query interactions, i.e., a fine-grained estimation of the time segments during which discrete mixes will be executed concurrently.
We quantify the effectiveness of these techniques through experimental evidence obtained from a full implementation on top of PostgreSQL with TPC-H data and queries. The results reveal that our techniques advance the state of the art and that learning-based modeling for QPP is both feasible and effective for a variety of workload and usage scenarios.
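The operator-level modeling described in the first part of the talk can be sketched roughly as follows (a toy sketch: the operators, the fitting method, and all numbers are invented for illustration and are far simpler than the models in the talk):

```python
# Illustrative operator-level QPP: estimate a per-operator cost from
# training runs, then predict a new plan's latency by summing the
# per-operator estimates, using only static (pre-execution) features.

def fit_operator_costs(training_plans):
    # training_plans: list of (operator_counts, observed_latency_seconds).
    # Crude fit: split each plan's latency evenly across its operators,
    # then average per operator across the training workload.
    totals, counts = {}, {}
    for ops, latency in training_plans:
        n = sum(ops.values())
        for op, k in ops.items():
            totals[op] = totals.get(op, 0.0) + latency * k / n
            counts[op] = counts.get(op, 0) + k
    return {op: totals[op] / counts[op] for op in totals}

def predict_latency(plan_ops, op_cost):
    # Plan-level prediction composed from operator-level estimates.
    return sum(op_cost[op] * k for op, k in plan_ops.items())

training = [
    ({"scan": 2, "join": 1}, 30.0),   # offline training executions
    ({"scan": 1}, 10.0),
]
cost = fit_operator_costs(training)
estimate = predict_latency({"scan": 3, "join": 1}, cost)  # unseen plan
```

The hybrid approach in the talk exploits exactly this composability: where a fine-grained operator model is unreliable, a coarser model for that subplan can be swapped in.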
Mingsheng Hong. Automated Physical Database Design in Column Stores.
Physical database design is a crucial performance tuning activity that optimizes the utilization of the CPU, storage, and I/O resources in data warehouses. The column store technology poses unique challenges in physical database design. An automated solution is highly desirable in order for column stores to receive mainstream adoption.
In this talk, we present the Vertica Database Designer, an automated physical database designer tailored for column stores. After a brief introduction to its interface and internals, we highlight a few use cases drawn from a wide spectrum of industry sectors and compare its quality of design with that of manual designs.
|4:00-4:20||Sivaramakrishnan Narayanan and Florian Waas. Dynamic Prioritization of Database Queries.
Databases and BI systems are a staple in modern enterprises and pivotal when it comes to supporting businesses in their decision-making process. With an ever-increasing breadth of data sources integrated in data warehousing scenarios and advances in analytical processing, the classic categorizations of query workloads such as OLTP, OLAP, loading, reporting, or massively concurrent queries have long been blurred. Mixed workloads have become a reality that today’s database management systems have to be able to support. In such workloads, different components compete for resources and, depending on the resource profiles, often impact each other negatively in non-intuitive ways. We present a workload management solution that allows database administrators to assign priorities to workloads and allocates resources according to these priorities. The solution balances CPU resources between workloads by (1) determining the ideal resource consumption rate for each individual query, and (2) employing a back-off technique based on control-theory principles wherein every participating process periodically checks whether it has exceeded its current target rate and adjusts its consumption as necessary. The continuous application of this principle results in rapid convergence between actual and ideal resource consumption rates. The mechanism is highly effective, robust, portable, and adaptive. We present experimental results and real-world experience with this mechanism as implemented in the Greenplum Parallel Database.
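The back-off step described in the abstract can be rendered as a small control loop (a hypothetical sketch of the principle, not Greenplum's actual code; the back-off factor is an invented parameter):

```python
# Sketch of the periodic self-check: each query process compares its
# measured consumption rate against its priority-derived target and
# multiplicatively backs off when it is over budget, so actual rates
# converge toward the ideal rates without central coordination.

def backoff_step(actual_rate, target_rate, factor=0.5):
    if actual_rate > target_rate:
        # Over budget: back off multiplicatively.
        return actual_rate * factor
    # Under budget: cautiously speed back up, capped at the target.
    return min(target_rate, actual_rate / factor)

rate = 100.0                      # initial (unthrottled) consumption
for _ in range(10):               # periodic self-checks
    rate = backoff_step(rate, target_rate=25.0)
assert abs(rate - 25.0) < 1e-6    # converged to the priority-derived share
```

Because every process applies the same local rule, changing a workload's priority simply changes its target rate, and the system re-converges without pausing running queries.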
|4:30 PM||Poster Session and Appetizers / Drinks (Building 32, R&D Area, 4th Floor)|