New England Database Day Program
Friday, January 30th.
| Time | Event |
| 9:00 AM | Welcoming remarks |
| 9:30 AM | Michael J. Franklin (UC Berkeley and Truviso). Continuous Analytics: Supercharging Query Performance with Stream Processing.
|
| 10:30 AM | Coffee Break |
| |
| Technical Session 1 |
| 11:00 AM | Sunil Sarin. Dynamic Transaction Serialization in Netezza Performance Server. Although it may appear that transaction management issues have mostly been “solved” in the last few decades by the research and industrial database communities, the onset of new approaches and architectures can reopen opportunities for looking at old problems. The Netezza data warehousing appliance, Netezza Performance Server (NPS), is one such technology. The foundation of NPS is a massively parallel DBMS architecture whose innovations include fast filtering of data as it is read off disk (and before it needs to be examined in memory by software), query optimization, automatic coarse “indexing”, and highly scalable parallel query execution and data communication.
Within the NPS architecture, transaction management consists of the following capabilities:
• Multi-version storage and multi-version concurrency control (MVCC). This is a familiar scheme for ensuring that readers and writers do not block each other, but is extended in NPS to support dynamic serializability and is integrated with fast filtering of disk data.
• Dynamic transaction serializability, based on serialization graph checking, that allows serialization orders not supported by two-phase locking or by timestamp ordering.
• A commit/rollback architecture that optimizes transaction commit and supports asynchronous rollback & recovery.
While the presentation/paper will cover the basics of all three, the focus will primarily be on dynamic transaction serializability. Unlike “extreme optimistic” concurrency control, where the “bad news” (inability to serialize) is found out only when attempting to commit, dynamic transaction serializability is “cautiously optimistic” and detects serialization cycles as early as possible when transactions access (read or modify) data. Like all optimistic schemes, the detection of a serialization cycle requires aborting one or more transactions, which then need to be restarted. In our experience, this has been an acceptable cost in the data warehousing environments for which NPS is designed.
With almost all commercial DBMSs, transaction serializability is typically sacrificed in the name of “performance”. Read Committed is the typical default isolation level, and some DBMSs support “Snapshot Isolation” which is stronger and avoids write-write anomalies. But the cost of not having serializability efficiently supported in the DBMS is considerably greater application complexity (such as carefully placed explicit LOCK statements, or synchronization using data structures outside the DBMS). In NPS we took the “high ground” instead, and designed and implemented a scheme with serializability as the default (and currently only) transaction isolation level. We will discuss briefly some of the tradeoffs and costs involved in making this decision, which we believe are well offset by reduced application complexity.
[PDF] |
| 11:25 AM | Mujde Pamuk, Akash Shah. Deep Web Search with Morpheus. In contrast to the shallow web, which contains primarily textual information that is readily accessible to a conventional crawler, there is a deep web that contains information hidden behind HTML forms and other sorts of data structures. This interface to the deep web is often problematic for shallow web search engines. It has been estimated that there are 1 million deep web sites and that the deep web is 100 times the size of the shallow web. Hence, deep web information is obviously important to web users. In this paper we explore the architecture of Morpheus, a system that provides search capabilities for the deep web.
[PDF] |
| 11:50 AM | Daniel Abadi. Data Management in the Cloud. This New England DB Day submission explores the advantages and disadvantages of deploying
database systems in the cloud. We look at how the typical properties of
commercially available cloud computing platforms affect the choice of data
management applications to deploy in the cloud. Due to the ever-increasing
need for more analysis over more data in today's corporate world, along
with an architectural match in currently available deployment options, we
conclude that read-mostly analytical data management applications are
better suited for deployment in the cloud than transactional data
management applications.
[PDF] |
| |
| 12:25 PM | Lunch |
| |
| Technical Session 2 |
| 1:15 PM | David Karger. Baseless--Databases’ Image Problem and What To Do About It. There seems to be a widespread belief among the database community that
the world "has a database problem": that there are a wide variety of problems
people have that would best be solved by using a database, but that people seem not to recognize that fact. The reasons for this rejection are not entirely clear (I will discuss several below). But besides the occasional outburst (Stonebraker and DeWitt, 2008), the database community seems somewhat resigned to the blindness of the broader community. In the talk I will argue that while databases as they currently stand may deter users, the ideas underlying those databases can be used in tools that are attractive to end users, and might someday lead them to adopt real databases to solve their problems. I will describe our Exhibit framework, which aims to let end-users create rich "database backed" web sites without actually managing a database, and will explain why I think it is more appealing to end-users than actual database backed approaches. I will use it to argue for the development of a class of "starter" databases that can draw people to the benefits of database approaches.
[PDF] |
| 1:40 PM | Devesh Agrawal, Deepak Ganesan, Yanlei Diao. Index Design for Flash-based Embedded Systems. Flash memories are in ubiquitous use in networked embedded systems such as sensors, mobiles and handhelds. Their widespread adoption is due to a myriad of benefits: small size, low cost, low power consumption, and high capacity. The ability to store large amounts of data cheaply enables low-power embedded devices to be employed in a variety of roles that require local storage and indexing on flash.
In this talk, we present the Lazy-Adaptive Tree (LA-Tree), a novel index structure that is designed to minimize access to flash, thereby minimizing energy cost and response time. The LA-Tree achieves this objective by augmenting a traditional tree index with a provably efficient mechanism for "lazily" performing inserts while ensuring an immediate response for lookups. The LA-Tree is fundamentally better suited for flash-based embedded devices because it generates far fewer updates than a traditional tree index and is hence significantly cheaper to maintain on flash, without increasing the lookup cost. It also tunes index parameters according to the flash cost model, and optimizes storage reclamation and memory management to address flash constraints. Our evaluation shows that the LA-Tree outperforms existing solutions over a range of workloads, datasets, and memory constraints, and achieves 2x to 8x gains in most cases.
[PDF] |
| 2:05 PM | Shilpa Lawande. Common Myths about Column Data Bases. In this talk we debunk the top 5 myths about column data bases. These are:
myth 1: Column-oriented data bases load more slowly than row-oriented data bases
myth 2: Row-oriented data bases can neutralize the advantages of column-oriented data bases by using materialized views
myth 3: A large amount of main memory in a row-oriented DBMS implementation can neutralize the advantages of column-oriented data bases
myth 4: Compression works about as well for row stores as column stores
myth 5: Specialized hardware and/or SANs are required to get good performance on analytic workloads
[PDF] |
| |
| 2:30 PM | Coffee Break |
| 3:00 PM | Alon Halevy (Google). Structured data on the Web: where we are and where we can go
Though search on the World-Wide Web has focused mostly on unstructured text, there is an increasing amount of structured data on the Web and growing interest in harnessing such data. I will describe several current projects at Google whose overall goal is to leverage structured data and better expose it to our users. These projects include crawling the deep web, collecting and mining the HTML tables on the Web, and computing aspects for search queries to better organize answers. In each case, I will focus on the lessons learned from the project and the opportunities that lie ahead. I will also discuss the opportunities relating to creating and managing data on the Web.
|
| 4:00 PM | Poster Session and Appetizers / Drinks (Building 32, R&D Area, 4th Floor) |
| 6:00 PM | Adjourn |
|