New England Database Summit Papers for Presentation
List of accepted papers:
- John Hugg, VoltDB Inc. Determinism in High-Throughput Distributed Databases. Click to show abstract.
The original VoltDB academic papers leverage determinism to achieve high performance. Academic papers that promise revolutions in performance often gloss over robustness or assume certain conditions are always true or always false. Supporting a high-throughput distributed system in long-term, real-world deployments quickly demonstrates that many assumptions about systems or failure models are invalid. This talk will focus on determinism. Much of the content will be related to lessons learned while building and supporting the VoltDB database, but content will be framed as generally as possible. Most of the talk will apply to any distributed system focused on high throughput of many small operations, especially transactional applications. This talk will answer the following questions in the following order: - What is determinism in the context of distributed databases? - What are some basic sources of non-determinism and how can they be addressed? - In what awesome ways can determinism be leveraged in distributed systems? - How can non-determinism risks be mitigated and what are some interesting related engineering tradeoffs?
- Tim Kraska, Brown University. MLbase: A Distributed Machine-learning System. Click to show abstract.
Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming—many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. In this talk I will present MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, and (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge.
- Alvin Cheung, MIT CSAIL; Samuel Madden, MIT; Armando Solar-Lezama, MIT CSAIL. Optimizing Database-Backed Applications with Program Synthesis. Click to show abstract.
Object-relational mapping libraries are a popular way for applications to interact with databases because they provide transparent access to the database using the same language as the application. Unfortunately, using such frameworks easily leads to applications with poor performance because modularity concerns encourage developers to implement relational operations in application code, and doing so does not take advantage of the optimized implementations of relational operations, efficient query plans, or push down of predicates that database systems provide. In this paper we present QBS, a system that automatically transforms fragments of application logic into SQL queries. QBS differs from traditional compiler optimizations because it relies on synthesis technology to automatically generate invariants and post-conditions for a code fragment. The post-conditions and invariants are expressed using a theory of ordered relations that allows us to reason precisely about both the contents and order of the records produced even by complex code fragments that compute joins and aggregates. The theory is close in expressiveness to SQL, so the synthesized post-conditions can be readily translated to SQL queries. Using 49 code fragments automatically extracted from over 120k lines of open-source code written using the Java Hibernate ORM, we demonstrate that our approach can convert a variety of imperative constructs into relational specifications and significantly improve application performance asymptotically by orders of magnitude.
- Paul Olsen, State University of NY -- Albany; Alan Labouseur, State University of NY -- Albany; Jeong-Hyon Hwang, State University of NY -- Albany. Quickly Finding the k Most Central Entities in Large Networks. Click to show abstract.
Many of today's applications can benefit from the discovery of the most central entities in real-world networks. Researchers have been developing techniques for finding the k most central entities in a network where the centrality of an entity is defined as the inverse of the average shortest path length from that entity to other entities. These previous techniques compute the centrality of each entity using a traditional single-source shortest path algorithm and then select k entities with the highest centrality values. Given a large network, however, these techniques incur high computational overhead. Our technique overcomes the above limitation. A key principle of our technique is to materialize intermediate results while a vertex's centrality is computed, and then reuse those results to speed up the computation of another vertex's centrality. Since the cost of each centrality computation may vary significantly depending on the choice of the previous computation, our technique schedules centrality computations in a manner that minimizes the estimated running time. Our technique also updates, with negligible overhead, an upper bound on the centrality of every entity. Using this information, entities that cannot belong to the final answer can be safely ignored, thereby further reducing the overall running time. In addition, our technique can quickly find an approximate answer (i.e., k entities with the highest estimated centrality values) and then gradually refine it until the final, correct answer is produced. In our experiments using various real-world data sets, our technique was orders of magnitude faster than the fastest traditional approach (e.g., 0.54 hours vs. 2.8 days) and exhibited higher performance benefits for larger data sets.
- Bryan Lewis, Paradigm4. SciDB-R: A scalable programming environment for R. Click to show abstract.
We will describe a new mechanism to couple R with DBMS facilities. Unlike the previous approaches, which call R routines from inside a DBMS, we propose to leave the R user interface intact, thereby supporting existing R programs. To provide scalability, we define a new n-dimensional sparse array class for R that represents arrays within SciDB. Arithmetic operations, matrix decompositions, and other operations on the new array class are executed in parallel on distributed data by the SciDB array DBMS. We will illustrate SciDB-R with two examples, which we will run both in native R and in SciDB-R. The first of these examples performs a truncated singular value decomposition of a very large, sparse matrix. This operation is widely used as the "guts" of recommendation engines used in many large web properties. The goal is to cluster customers with similar behavior, but the technique has very general utility. It will not be surprising that SciDB-R can complete the calculation, while native R issues a run-time error. The second example is taken from bioinformatics. Given a matrix whose dimensions are a person identifier and their genetic features (SNPs, gene sequence, etc.), we want to simultaneously cluster the rows and columns. That is, we want to find groups of people sharing groups of genetic markers. This bi-clustering operation segments populations and will be demonstrated on a 60K by 60K dense matrix. Again, native R addressing limits are exceeded, while SciDB-R executes the operation within operationally useful time constraints.
- Wyatt Lloyd, Princeton University; Michael J. Freedman, Princeton University; Michael Kaminsky, Intel Labs; David G. Anderson, Carnegie Mellon University. Stronger Semantics for Low-Latency Geo-Replicated Storage. Click to show abstract.
We describe Eiger, a scalable geo-replicated data store that achieves three goals: guaranteed low latency, a rich column-family data model, and strong consitency semantics. Key features include: (i) a low-latency, causally-consistent data store based on a column-family data model, including all the intricacies necessary to offer abstractions such as column families and counter columns, (ii) a novel non-blocking read-only transaction algo- rithm that is both performant and partition tolerant, (iii) and a novel write-only transaction algorithm that atomically writes a set of keys, is lock-free (low latency), and does not block concurrent read transactions.
- Justin DeBrabant, Brown; Leilani Battle, MIT; Ugur Cetintemel, Brown University; Stan Zdonik, Brown University; Michael Stonebraker, MIT. Techniques for Visualizing Massive Data Sets. Click to show abstract.
As many scientists move to a more data-driven analysis pipeline, large multidimensional datasets are becoming more common. In this analysis environment, scientists need to analyze massive amounts of data in an interactive manner and as quickly as possible. Unfortunately, previous methods of interacting and exploring data have not necessarily kept pace with the growth of data. In particular, many scientific visualization applications were designed to be interactive on the assumption that much, if not all, of the data would fit in memory, which is no longer the case. Because of the increased latency of having much of the data reside on disk and potentially across a network, the real-time interactivity of these visualization tools has suffered and scientistsÕ time is wasted trying to wade through massive amounts of data. In this talk we describe our experiments with several techniques that attempt to restore some if not all of the missing interactivity. In particular, we propose a prefetching and caching middleware for fetching data from disk in between user queries, thereby hiding disk latency, the major interactivity bottleneck. We also propose a resolution reduction technique to reduce the data being sent to the frontend visualization application. Combined, we show that these techniques can significantly improve the overall interactivity of the frontend visualization system.
- Alexandra Meliou, UMass Amherst. The Power of How-To Queries. Click to show abstract.
In this talk, I will discuss how-to queries, a new type of reverse data processing. A how-to query computes hypothetical updates to the database that achieve a desired effect on one or several outputs, while satisfying some global constraints. They can model complex business decisions and strategy planning processes. I will present Tiresias, the first how-to query engine, which integrates relational database systems with a linear programming engine.
- Michael Stonebraker, MIT; Justin DeBrabant, Brown; Stan Zdonik, Brown; Andy Pavlo, Brown; Stephen Tu, MIT. The Traditional Wisdom is All Wrong. Click to show abstract.
The traditional wisdom for OLTP applications is to build a disk-based RDBMS by maintaining a main-memory cache of ""hot"" disk blocks. Furthermore, if the RDBMS is not fast enough, then one should employ something like Memcached in front of the RDBMS. This hybrid architecture is implemented by a number of web properties, including Facebook. As an alternative, we have extended the main memory RDBMS H-Store so that it can manage larger-than-main-memory data bases, using a notion we call anti-caching. We show that H-Store extended by anti-caching is wildly faster than MySQL, regardless of the size of the data base on both the YCSB and TPC-C benchmarks. Even when fronted by Memcached, H-Store is still the performance winner. As a result, we assert that traditional OLTP RDBMSs are architected wrong and should be replaced by anti-caching implementations.
- Yuan Yuan, The Ohio State Univerisy; Rubao Li, The Ohio State University; Xiaodong Zhang, The Ohio State University. The Yin and Yang of Processing Data Warehousing Queries on GPU Devices. Click to show abstract.
Database community has made significant research efforts to optimize query processing on GPUs in the past few years. However, we can hardly find that GPUs have been truly adopted in major data warehousing production systems. To understand main reasons behind this fact, we have conducted a comprehensive study to evaluate the performance of processing complex data warehousing queries by varying query characteristics, software optimization techniques, and GPU hardware configurations. Our study focuses on the two fundamental components of query execution on GPUs, PCIe data transfer and kernel execution, aiming at gaining deep insights on how they are affected by various factors from software to hardware. Furthermore, we also propose an analytical model to understand and predict the variations of the two factors. Based on our study, we present our performance insights and prediction for warehousing query execution on GPUs.