New England Database Summit 2010 Program
| Time | Event | |
| 9:00 AM | Welcoming remarks (Samuel Madden, Daniel Abadi, and John Metzger) | |
| 9:10-10:10 | Raghu Ramakrishnan (Chief
Scientist for Audience & Cloud Computing at Yahoo!). Cloud Data
Serving
Slides (PDF)
Data-backed web applications have stringent availability, performance
and partition tolerance requirements that are difficult, sometimes even
impossible, to meet using conventional database management systems. On
the other hand, they typically are able to trade off consistency to
achieve their goals. This has led to the development of specialized
key-value stores, which are now used widely in large-scale web services.
| |
| 10:10-10:35 | Coffee Break | |
| Technical Session 1 | ||
|---|---|---|
| 10:35-10:55 | Carlo Curino, Evan Jones, Yang
Zhang, Eugene Wu, Sam Madden RelationalCloud: The case for a database
service
Slides (PDF)
In this talk, we make the case for "databases as a service" (DaaS), with two target scenarios in mind: (i) consolidation of data management functionalities for large organizations and (ii) outsourcing to a cloud-based service provider for small and medium organizations. We analyze the many challenges to be faced, and discuss the design of a database service we are building, called Relational Cloud. The system has been designed from scratch and combines many recent advances and novel solutions. The prototype we present exploits multiple dedicated storage engines, provides high availability via transparent replication, supports automatic workload partitioning and live data migration, and provides serializable distributed transactions. While the system is still under active development, we are already able to present promising initial results that showcase the key features of our system. The tests are derived from both TPC synthetic benchmarks and real-world applications. | |
| 10:55-11:15 | Mike Dirolf An Introduction to MongoDB
Slides (PDF)
MongoDB is an open-source, high-performance, schema-free,
document-oriented database. The goal of the MongoDB project is to bridge
the functionality gap between a key/value store and a traditional
RDBMS. This talk will introduce MongoDB and discuss some of the reasons
why this project is gaining traction in the open-source community.
| |
| 11:15-11:35 | R. Nehme, Elke Rundensteiner,
and E. Bertino The Query Mesh Project: A
Powerful Multi-Route Query Processing Paradigm
Slides (PPTX)
In real-life applications, different subsets of data may have distinct statistical properties, e.g., various websites may have diverse visitation rates, different categories of stocks may have dissimilar price fluctuation patterns, etc. Unfortunately, in most database systems, traditional and stream systems alike, the optimizer picks just one single query plan for all this data based on the overall statistics of the data. Given real-life datasets with non-uniform distributions, selecting a single execution plan may result in a query execution that is ineffective for possibly large portions of the actual data. In this talk, I'll describe a practical alternative approach to the current state-of-the-art query processing techniques that we are pursuing in the Query Mesh project. | |
| 11:35-11:55 | Andy Pavlo MapReduce and Parallel
DBMSs: Together At Last
Slides (PDF)
The MapReduce (MR) paradigm is heralded as a revolutionary new platform for large-scale, massively parallel data access. Some proponents claim that the extreme scalability of MR will relegate relational database management systems (DBMS) to the status of legacy technology. In this talk, however, I will discuss the results from our recent benchmark study, which suggest that using MR systems to perform tasks that are best suited for DBMSs yields less than satisfactory results [PPR+09]. This leads us to conclude that MR is more akin to an Extract-Transform-Load (ETL) system than a DBMS, as it is quickly able to load and analyze large amounts of data in an ad hoc manner [SAD+10]. As such, it is complementary to DBMS technology, rather than a competitor. Thus, I will also discuss how the DBMS community has embraced MR technologies in the last year, and what features of DBMSs are being incorporated into popular open-source MR implementations. | |
| 11:55-12:15 | Gregory Malecha, Greg
Morrisett, Avraham Shinnar, and Ryan Wisnesky Toward a Verified
Relational Database Management System
Slides (PDF)
We report on our experience implementing a lightweight, fully verified relational database management system (RDBMS). The functional specification of RDBMS behavior, RDBMS implementation, and proof that the implementation meets the specification are all written and verified in the Coq proof assistant. Our contributions include: (1) a complete specification of the relational algebra in Coq; (2) an efficient realization of that model (B+ trees) implemented with the Ynot extension to Coq; and (3) a set of simple query optimizations that are proven to respect both semantics and run-time cost. In addition to describing the design and implementation of these artifacts, we highlight the challenges we encountered formalizing them, including the choice of representation for (finite) relations of typed tuples and the challenges of reasoning about data structures with complex sharing. Our experience shows that though many challenges remain, building fully-verified systems software in Coq is within reach. | |
| 12:15 PM | Lunch | |
| 1:10-2:10 | Curt Monash (President, Monash Research). Database and analytic technology: The state
of the union
The analytic database management industry is a hotbed of innovation, as numerous commercial and/or academic research projects have recently led to practical enterprise adoption. To a lesser but still-laudable extent, the same is true of other analytic and database technology sectors as well. In this talk, I will:
| |
| Technical Session 2 | ||
| 2:10-2:30 | Paul Brown SciDB: Massively
Parallel Array Data Storage, Processing and Analysis
Slides (PDF)
The technical requirements of large-scale scientific data processing constitute an interesting new area of research and development in data management. In this talk we will introduce and review SciDB, a new 'massively parallel' platform for array data storage, processing and analysis. We will review a number of motivating use-cases from both the scientific and commercial spheres, describing how these use-cases have determined the features and functionality of our system: the SciDB data model, extensibility framework, query language, and external interfaces. We then turn to our system's design and implementation, focusing on a number of key design points connected to our data partitioning and query processing strategies. We conclude with a review of the project 'in flight'. | |
| 2:30-2:50 | Coffee Break | |
| 2:50-3:10 | Julia Stoyanovich, William Mee,
Kenneth A. Ross Semantic Ranking and Result Visualization for Life
Sciences Publications
Slides (PDF)
An ever-increasing amount of data and semantic knowledge in the domain of life sciences is bringing about new data management challenges. In this paper we focus on adding the semantic dimension to literature search, a central task in scientific research. We focus our attention on PubMed, the most significant bibliographic source in life sciences, and explore ways to use high-quality semantic annotations from the MeSH vocabulary to rank search results. We start by developing several families of ranking functions that relate a search query to a document’s annotations. We then propose an efficient adaptive ranking mechanism for each of the families. We also describe a two-dimensional skyline-based visualization that can be used in conjunction with the ranking to further improve the user’s interaction with the system, and demonstrate how such skylines can be computed adaptively and efficiently. Finally, we evaluate the effectiveness of our ranking with a user study. | |
| 3:10-3:30 | Mirek Riedewald, Alper Okcan,
Daniel Fink Scalable Search and Ranking for Scientific Data
Slides (PDF)
As the amount and complexity of data in many scientific disciplines increases rapidly, new tools are needed to support exploratory analysis and scientific discovery. Our work is motivated by a major challenge we experienced in collaborations with domain scientists: finding interesting relationships between the attributes of a complex process. Such relationships, which we generally refer to as patterns, form the basis for new hypotheses and hence facilitate discovery. We argue that data management research is essential for all aspects of scalable pattern search and ranking, ranging from an easy-to-use query language and a formal language for representing search preferences to distributed implementation of the search process. In addition to a system vision and research challenges, we will also discuss our current results, including a formal preference language and techniques for efficient generation of model summaries, which are the basis for pattern discovery. | |
| 3:30-4:30 | C. Mohan (IBM Fellow and
Former IBM India Chief Scientist). Implications of Storage Class
Memories (SCMs) on Software Architectures
Slides (PDF)
Flash memories have been in widespread use for a while, but they have had some performance and reliability problems which have made them unsuitable
for long-term storage of traditional database data. A new class of memories
called Storage Class Memories (SCMs) is emerging, built using
different technologies than flash devices. SCMs overcome many of the
shortcomings of flash devices while approaching the cost of flash
memories. SCMs fall in between DRAM and traditional disk storage along many
dimensions (performance, cost, energy usage, ...). As a result, large
SCM-based memory systems will be built. While main memory database
management system (MMDBMS) companies like TimesTen and SolidDB have been
around for a while, those companies have been acquired recently by Oracle
and IBM, respectively. SCMs will permit the sizes of databases managed by
MMDBMSs to be very large while being cheaper than those using only
DRAMs. SCMs may be viewed as disks or as memory from an architectural
perspective. Depending on the viewpoint, the implications for DBMS
architectures will be very different. Some preliminary ideas on the use of a
small amount of non-volatile memory realized by using battery-backed DRAM
were presented in a paper design called Safe RAM in VLDB 1989. Technology
has evolved tremendously in two decades and it is time for us to revisit
system architectures.
| |
| 4:30 PM | Poster Session and Appetizers / Drinks (Building 32, R&D Area, 4th Floor) | |
| 6:00 PM | Adjourn | |