New England Database Day Posters
The poster session will include drinks and appetizers. It will be held in Building 34/36 on the 3rd floor, as well as
in rooms 34-301 and 34-302.
List of accepted posters:
- Frank Manola. Distributed Transaction Management Using Telegraph Messages: An Historical Note. Click to show abstract.
The database community is generally familiar with the Two Phase Commit (2PC) protocol for coordinating distributed transactions. However, standard texts note that its origins are unclear. With the recent (2006) demise of the telegram, it seems appropriate to reflect on the fact that variants of 2PC were used in procedures to control the operations of railroad trains as early as the 19th century, using telegrams as the communications medium. This presentation describes these railroad operating procedures, which effectively involved maintaining a "distributed database" of track occupancy rights, and their relationship to database 2PC.
- Adam Marcus, David Karger, Sam Madden. BlendDB: Teaching and Old Heapfile New Tricks. Click to show abstract.
The physical implementation of most relational databases follows their logical description, where each relation is stored in its own file or
collection of files on disk. Such an implementation is good for queries that filter or aggregate large portions of a single table, and
provides reasonable performance for queries that join many records from one table to another table. It is much less ideal, however, for join queries that "follow paths" from a small number of tuples in one table to small collections of tuples in other tables to accumulate facts about a a related collection of objects (e.g., co-authors of a
particular author in a publications database), since answering such queries involves one or more random I/Os per table involved in the path. If the primary workload of a database consists of many of such path queries, as is likely to be the case in databases that support "browsing oriented" applications, performance will be quite poor.
To optimize the performance of these kinds of path queries, we have built BlendDB, a relational database which supports on-disk co-location of tuples from different relations. We propose a clustering algorithm that, given knowledge of the database workload, clusters the tuples of multiple relations such that tuples joining along common
paths are co-located. We then show experimentally that BlendDB provides better performance than traditional relational databases on throughput tests of queries against the IMDB movie dataset, even if the traditional database uses denormalization techniques.
- Vassilis Athitsos, Michalis Potamias, Panagiotis Papapetrou, and George Kollios. Nearest Neighbor Retrieval Using Distance-Based Hashing. Click to show abstract.
A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as Locality Sensitive Hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multibit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation, that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
- Devesh Agrawal, Deepak Ganesan, Yanlei Diao. A storage substrate for flash based sensor data management. Click to show abstract.
Recent technology trends in energy-efficient flash storage have made in-network
data archival and querying an inexpensive and energy-efficient option for data
management in sensor networks. In this poster, we present the design of an
energy-efficient storage substrate for a flash based sensor database that
exploits these trends. Our design of a storage substrate for sensor data
management involves several novel features such as partition-based storage
allocation and automated aging for storage reclamation. We also develop
techniques for efficiently managing data and indexes within a partition
including use of a per-partition flash translation layer, and adaptive memory
management to maximize write coalescing. We further explore techniques to
efficiently locate partitions relevant to a given query.
- Ben Vandiver. Tolerating Byzantine Faults in Database Systems. Click to show abstract.
This paper describes the design, implementation, and evaluation of a
replication scheme to handle Byzantine faults in transaction
processing database systems. The scheme compares answers from queries
and updates on multiple replicas which are unmodified, off-the-shelf
systems, to provide a single database that is Byzantine fault
tolerant. The scheme works when the replicas are homogeneous, but it
also allows heterogeneous replication in which replicas come from
different vendors. Heterogeneous replicas reduce the impact of bugs
and security compromises because they are implemented independently
and are thus less likely to suffer correlated failures.
The main challenge in designing a replication scheme for transaction
processing systems is ensuring that the different replicas execute
transactions in equivalent serial orders while allowing a high degree
of concurrency. Our scheme meets this goal using a novel concurrency
control protocol, commit barrier scheduling (CBS). We have
implemented CBS in the context of a replicated SQL database, HRDB
(Heterogeneous Replicated DB), which has been tested with unmodified
production versions of several commercial and open source databases
as replicas. Our experiments show an HRDB configuration that can
tolerate one faulty replica has only a modest performance overhead
(about 17% for the TPC-C benchmark). HRDB successfully masks
several Byzantine faults observed in practice and we have used it to
find a new bug in MySQL.
- Ross Shaull, Liuba Shrira, Hao Xu. Enabling Long-Lived Snapshots of the Long-Lived Past. Click to show abstract.
Decreasing disk costs have made it practical to retain long-lived snapshots, enabling new applications that analyze past states and infer about future states. Current approaches offer no satisfactory way to organize long-lived snapshots because they disrupt the database in either short or long run. Split snapshots are a recent approach that overcomes some of the limitations. An unsolved problem has been how to support efficient application code access to arbitrarily long-lived snapshots. We describe Skippy, a new approach that solves this problem. Performance evaluation of Skippy, based on theoretical analysis and experimental measurements, indicates that the new approach is effective and efficient.
- Philippe Cudre-Mauroux. Self-Organization in the GridVine Peer Data Management System. Click to show abstract.
GridVine is a Peer Data Management System based on a decentralized access structure. Built following the principle of data independence, it separates a logical layer -- where data, schemas and schema mappings are managed -- from a physical layer consisting of a structured Peer-to-Peer network supporting decentralized indexing, key load-balancing and efficient routing. Our system is totally decentralized, yet it fosters semantic interoperability through pairwise schema mappings and query reformulation. GridVine provides a complete solution for building higher-layer information sharing applications through a series a self-organizing mechanisms used to index tuples, maintain sets of related schemas and route queries from one schema to the others.
- Barbara Blaustein, Peter Mork, Len Seligman, Ken Smith, David Allen, Jeff Hoyt, Michael Morse, Arnie Rosenthal, Chris Wolf. Information Sharing Research at MITRE. Click to show abstract.
MITRE’s Innovative Information Engineering group performs research on diverse information
management topics. This poster presents the work of four projects that facilitate information sharing.
- Michael Hay, Gerome Miklau, David Jensen. Protecting Anonymity in Published Social Networks. Click to show abstract.
A networked data set (such as a social or communication network) is a graph representing entities connected by personal relationships, interactions, or flows of information. Dissemination and analysis of networked data has many benefits, but it is often severely constrained by privacy concerns. A distinctive challenge to privacy in the publication of networked data is that an individual's connections (i.e. the graph structure around them) can be distinguishing, and may be used to re-identify an otherwise anonymous individual. In this poster, we present a framework for assessing the privacy risk of sharing anonymized network data. After formalizing models of adversary knowledge, we measure re-identification risk in real social networks drawn from diverse domains. Our results show that common anonymization techniques are inadequate. We propose a novel graph transformation technique in which random edge perturbation is applied to a clustering of the original graph in order to obscure the distinguishing structural features of individuals while preserving global properties of the graph.
- Daniel Gyllstrom, Jagrati Agrawal, Yanlei Diao and Neil Immerman. An Automata-based Approach to Complex Event Processing over Streams. Click to show abstract.
Complex event processing is finding
application in a growing number of stream environments such as stock market analysis and RFID-based inventory management. Complex event processing in the streaming setting has unique features, such as event definition using complex predicates and event selection in various ways from the input stream,that fundamentally distinguish it from the classical problem of regular expression matching. In this poster, we present the design and implementation of an efficient event processing system. In particular, we present a compact language, SASE+, that can be used to define a wide variety of pattern queries, an automata-based approach to evaluate such queries over streams, and optimization techniques that exploit sharing for improved efficiency. We also summarize results of our performance analysis, which offer insights into the various factors on performance and demonstrate the initial benefits of our sharing techniques.
- Eric Prud\'hommeaux. Federated Semantic Query with MySQL. Click to show abstract.
Adding a SPARQL parser to MySQL decouples the database's public interface from the private schema. Further, it encourages the use of shared data models composed of distinct semantic identifiers, avoiding collisions on relation, attribute and view names.
SPARQL queries can be federated to other databases or other resources that support SPARQLfed without scripting, configuration, or unreasonable performance cost. Pharmaceutical investment is yielding semantic, federated query technology that will be useful in many domains.
- Yang Zhang, Bret Hull, Hari Balakrishnan, Sam Madden. ICEDB: Intermittently-Connected Continuous Query Processing. Click to show abstract.
CarTel is a mobile sensor computing system designed to collect, process, deliver, and visualize data from sensors located on mobile units such as automobiles. Due to their mobility, these sensor networks display intermittent and variable network connectivity, and often have to deliver large quantities of data relative to the bandwidth available during periods of connectivity. To deal with these issues, we have developed ICEDB, a continuous query processing system for mobile sensor networks, which provides a declarative query interface to allow users to specify how the mobile nodes should summarize, filter, and dynamically prioritize data. ICEDB incorporates two key ideas: (1) a delay-tolerant continuous query processor, coordinated by a central server and distributed across the mobile nodes, and (2) algorithms for prioritizing certain query results to improve application-defined "utility" metrics.
- Thanh Tran, Charles Sutton, Richard Cocci, Yanming Nie, Yanlei Diao and Prashant Shenoy. A Learning-based Approach to Processing RFID Streams with Mobile Readers. Click to show abstract.
Mobile readers have been recently deployed in retail, healthcare, pharmaceuticals and supply chain management. It is well known that raw RFID streams tend to be highly noisy and incomplete. The advent of mobile readers adds new challenges to RFID stream processing due to the inherent reader mobility and increased noise compared to static readers. In this poster, we present our work on building an inference and cleaning substrate specifically targeted for RFID streams generated by mobile readers. Specifically we propose a probabilistic model based on Dynamic Bayesian Networks to capture the mobility of the reader, object dynamics and noisy readings. Our model can self-calibrate by automatically estimating parameters from observed data using Expectation-Maximization algorithm. We also propose a sampling technique based on particle filtering for probabilistic inference. We employ factorization to enhance accuracy and efficiency of the inference and an indexing method to scale the inference to stream speeds. A detailed performance evaluation of our system shows the high accuracy of our method for real-world traces as well as its efficiency and scalability for large-scale synthetic data.
- Yang Du, Tian Xia, Yufei Tao, Donghui Zhang, Feng Zhu. On Multidimensional k-Anonymity with Local Recoding Generalization. Click to show abstract.
This poster presents the first theoretical study, on using local-recoding generalization (LRG) to compute a k-anonymous table with quality guarantees. First, we justify the importance of LRG, by showing that it permits more accurate aggregate analysis (on the original data) than the
previous generalization schemes. As a second step, we prove that it is NP-hard both to find the table with the maximum quality, and to discover a solution with an approximation ratio at most 5/4. Finally, we develop three algorithms with different tradeoffs between the approximation ratio and time complexity. The effectiveness of our solutions is verified by extensive experiments.
- Di Yang, Elke A. Rundensteiner, Matthew, O. Ward. Nugget Management for Intellegent Visual Exploration. Click to show abstract.
Visualization systems traditionally focus on graphical representation
of information. They tend not to provide integrated analytical
services that could aid users in tackling complex knowledge discovery
tasks. Users’ exploration in such environments is usually
impeded due to several problems: 1) valuable information is hard
to discover when too much data is visualized on the screen; 2) Users
have to manage and organize their discoveries off line, because no
systematic discovery management mechanism exists; 3) their discoveries
based on visual exploration alone may lack accuracy; 4)
and they have no convenient access to the important knowledge
learned by other users. To tackle these problems, it has been recognized
that analytical tools must be introduced into visualization
systems. In this paper, we present a novel analysis-guided exploration
system, called the Nugget Management System (NMS). It
leverages the collaborative effort of human comprehensibility and
machine computations to facilitate users’ visual exploration processes.
Specifically, NMS first extracts the valuable information
(nuggets) hidden in datasets based on the interests of users. Given
that similar nuggets may be re-discovered by different users, NMS
consolidates the nugget candidate set by clustering based on their
semantic similarity. To solve the problem of inaccurate discoveries,
localized data mining techniques are applied to refine the nuggets to
best represent the captured patterns in datasets. Lastly, the resulting
well-organized nugget pool is used to guide users’ exploration. To
evaluate the effectiveness of NMS, we integrated NMS into XmdvTool,
a freeware multivariate visualization system. User studies
were performed to compare the users’ efficiency and accuracy in
finishing tasks on real datasets, with and without the help of NMS.
Our user studies confirmed the effectiveness of NMS.
- Ryan Newton, Lewis Girod, Stan Rost, Sam Madden. WaveScope: Cross Platform Distributed Stream-Processing. Click to show abstract.
The WaveScope project provides a high-level domain specific language for stream processing programs, along with an optimizing, parallelizing compiler and runtime that supports architectures from multicore servers to networks of embedded sensors. This poster summarizes our recent work, including dynamic work scheduling algorithms, memory management techniques, new applications, and also a new effort to implement StreamSQL using WaveScope.
- Mingzhu Wei, Elke A. Rundensteiner and Murali Mani. Utility-driven Load Shedding for XML Stream Processing. Click to show abstract.
Because of the high volume and unpredictable arrival rate, stream
processing systems may not always be able to keep up with the input
data streams--- resulting in buffer overflow and uncontrolled loss
of data. Load shedding, the prevalent strategy for solving this
overflow problem, has so far only been considered for relational
stream processing, but not to XML. Shedding applied to XML stream
processing brings new challenges due to complex nested nature of XML
structures. In this paper, we tackle this unsolved XML shedding
problem using a three-pronged approach. First, we develop an XQuery
preference model that enables users to specify the relative
importance of preserving different subpatterns in the XML result
structure. This transforms shedding into the problem of rewriting
the user query into shedding queries that return approximate query
answers with utility as measured by the given user
preference model. Second, we develop a cost model to compare the
performance of alternate shedding queries. Third, we develop two
shedding algorithms, OptShed and FastShed. OptShed guarantees to
find an optimal solution however at the cost of exponential
complexity. FastShed, as confirmed by our experiments, efficiently
achieves a close-to-optimal result in a wide range of test cases.
Finally we describe the in-automaton shedding mechanism for XQuery
stream engines. The experimental results show that our proposed
utility-driven shedding solutions consistently achieve higher
utility results compared to the existing relational shedding
techniques.
- Alvin Cheung, Sam Madden. An Acquisitional Software Profiler. Click to show abstract.
Software profilers are important tools in software analysis as they provide useful runtime performance
feedback to the software designer. Currently available profilers can be classified into two broad categories. The first type includes profilers acting as a sandbox that, upon invocation, execute the targeted program a number of times within the sandbox and collect performance statistics. Unfortunately, such profilers cannot be used to profile continuously running systems such as webservers and databases. Another class of profilers attach themselves to
a running program, and instrument the running program to invoke profiler routines to collect statistics. While such profilers can be used to profile continuously running programs, the extra overhead involved in instrumenting the targeted program and subsequently running the instrumented program is often substantial (multifold the time needed to run the original program), making the results reported by the profiler to be questionable. In this work, we propose a new software profiler based on the program attachment approach. In particular, we make use of two techniques to mitigate the overhead in collecting statistics. First, we provide a declarative query interface for users to indicate what performance statistics to collect (e.g., running time about specific functions, information about functions under certain situations such as when the targeted program's memory usage is above a certain threshold). To reduce the number of functions to be instrumented, we adopt an acquisitional approach and only instrument functions that
are needed to fulfill the user's request. In addition, when evaluating the user's profiling request, we make use of program analysis techniques, such as control flow graph analysis, and traditional database query optimization techniques, such as predicate pushdown and join optimization, to further reduce the number of instrumented functions. We present some of our preliminary results of our prototype system in this poster.
- Arvind Thiagarajan, Sam Madden. Representing And Querying Regression Models in a DBMS. Click to show abstract.
Curve fitting is a widely employed, useful modeling tool in several financial, scientific/engineering and data mining applications, as well as sensor network applications that need to tolerate uncertain or noisy data. These applications need to both fit functions to their data using regression, and pose relational-style queries over regression models. Existing DBMSs are ill-suited for this task because they do not support creating, representing and querying functional data, short of brute-force discretization of functions into a collection of tuples. This poster will present FunctionDB, a novel DBMS that treats functions output by regression as first-class citizens that can be queried and manipulated like traditional relations. The key contributions of FunctionDB are a compact algebraic representation for regression models in a DBMS as piecewise functions, and a novel algebraic query processor with well-defined semantics over functional data. For the broad class of multivariate polynomial functions, FunctionDB executes relational queries directly on
this representation as combinations of symbolic operations like function inference, zero finding, substitution and integration. When closed form solutions are intractable, FunctionDB leverages symbolic approximation operations to improve performance. The poster will present evaluation results of FunctionDB on two real data sets: measurements from a
temperature sensor network, and a collection of traffic traces from cars driving on roads. Operating directly in the functional domain has substantial advantages in terms of accuracy (15-30%) and up to order of magnitude (10x-100x) performance gains over existing approaches that represent models as discrete collections of points.
- Stratos Papadomanolakis, Debabrata Dash, Anastasia Ailamaki. An Integer Linear Programming based Tool for Automated Database Design. Click to show abstract.
Automated database design is a major challenge in building self-tuning database
management systems. A major difficulty in the development of practical physical
design algorithms is dealing with the huge number of design features, such as
indexes that must be considered. A second difficulty is determining, given a
(pruned) search space of candidate features, a combination that provides
optimal performance for the workload that also satisfies resource constraints
such as available storage.
Existing index selection tools rely on heuristics to efficiently search within
the large space of alternative solutions and to minimize the overhead of using
the query optimizer for cost estimation. Index selection heuristics, despite
being practical, are hard to analyze and formally compute how close they get to
the optimal solution. In this poster we propose a tool for index selection
based on Integer Linear Programming (ILP), in the context of commercial
database systems. Our ILP-based tool offers higher solution quality,
efficiency and scalability without sacrificing any of the precision offered by
existing index selection tools. For instance, on a large TPC-H like workload,
it provides 24% better solution and 34 times speedup, when compared with
commercial solutions. More importantly, our model opens the way for the
application of a huge body of work in combinatorial optimization and operations
research, that is successfully deployed for real-world, large-scale
optimization problems to be applied to databases as well.
- Venkatesh Raghavan and Elke A Rundensteiner. ROM - Robust Ordering of Multi-Join Continuous Queries. Click to show abstract.
Continuous query optimization must aim to generate executions plans for user
queries which are adherent of the system resource requirements. Unlike the
traditional query optimization techniques this is not a minimization problems
but rather a complex constraint satisfaction problem. Besides the system resource
requirements the optimizer must also take into consideration that the input stream
parameters such as input rate, selectivities, etc. could potentially fluctuate there
by making a previous resource adherent query execution plan to now be a blocked query.
In such scenarios, re-optimization is the only option when the current query plan
runs over the available resources. Once the new optimal (or near-optimal) plan for
the new set of parameters is generated the partial results in various queues have
to be handled to ensure that there are no duplicate or missing results. As statistics
change there could be multiple such requests for re-optimization and each such
request will need a migration to a new plan. Thus proving to be potentially
expensive both in time and quality of result cater to the real-time response
requirements. In this poster we present the challenges and our ideas for generating a
robust execution plan for a multi-join continuous query.
- Tanu Malik, Xiaodan Wang, Randal Burns, Debabrata Dash, Anastasia Ailamaki. Automatic Physical Design in Database Caches. Click to show abstract.
Performance of a database is crucially dependent on its physical design. Current techniques, automated or otherwise, for physical design depend on the identification of a representative workload. However, in several emerging applications such techniques are inadequate as workload characteristics change rapidly over time. This is remarkably shown at the proxy cache of SkyQuery, a federation of Astronomy databases, which receives a continuously evolving workload. Using the proxy cache of the Skyquery federation as our case study, we present novel techniques for automated physical design of its database.
We improve physical design of the proxy cache by vertical partitioning. Unlike prior workload-based vertical partitioning techniques that are offline, online partitioning algorithm adapts to workload changes incrementally and balances the performance benefits of physical design decisions with the cost of implementing these decisions. Our formulation includes a competitive algorithm based on task systems and an incremental algorithm based on association rules. The algorithms are general in that they do not make assumptions about the incoming workload or the underlying physical schema. Experiments with SkyQuery workloads show that incorporating online vertical partitioning in the proxy cache of SkyQuery databases provides a significant improvement in cache response time and efficiency. The algorithms incur a low optimization overhead compared with offline vertical partitioning techniques.
- Nikos Hardavellas, Ryan Johnson, Ippokratis Pandis, Anastasia Ailamaki. To Share or Not to Share?. Click to show abstract.
Intuitively, aggressive work sharing among concurrent queries in a database system should always improve performance by eliminating redundant computation or data accesses. We show that, contrary to common intuition, this is not always the case in practice, especially in the highly parallel world of chip multiprocessors. As the number of cores in the system increases, a trade-off appears between exploiting work sharing opportunities and the available parallelism. To resolve the tradeoff, we develop an analytical approach that predicts the effect of work sharing in multi-core systems. Database systems can use the model to determine, statically or at runtime, whether work sharing is beneficial and apply it only when appropriate. We demonstrate that selective work sharing according to the model outperforms never-share static schemes by 20% on average and always-share ones by 2.5x.
- Nikos Hardavellas, Ryan Johnson, Ippokratis Pandis, Anastasia Ailamaki. Database Servers on Chip Multiprocessors: Limitations and Opportunities. Click to show abstract.
Prior research shows that database system performance is dominated by off-chip data stalls, resulting in a concerted effort to bring data into on-chip caches. At the same time, high levels of integration have enabled the advent of chip multiprocessors and increasingly large (and slow) on-chip caches. These two trends pose the imminent technical and research challenge of adapting high-performance data management software to a shifting hardware landscape. In this paper we characterize the performance of a commercial database server running on emerging chip multiprocessor technologies. We find that the major bottleneck of current software is data cache stalls, with L2 hit stalls rising from oblivion to become the dominant execution time component in some cases. We analyze the source of this shift and derive a list of features for future database designs to attain maximum performance.
- Ling Hu, Yang Du, Donghui Zhang. On Monitoring the top-k Unsafe Places. Click to show abstract.
In a city, protecting units like police cars move around and protect places such as banks and residential buildings. Different places may have different requirements in how many protecting units should be nearby. If any place has less protecting units around than it requires, it is an unsafe place. This paper studies the Continuous Top-k Unsafe Place (CTUP) query, which continuously monitors the k least safe places while the protecting units keep sending their location updates to the server. The CTUP query is a novel addition to the family of continuous location-based queries, an emerging area due to the recent advances in dynamic location-aware environments.
Solutions to existing continuous location-based queries and to the traditional top-k queries do not apply. This paper proposes
two solutions to this new query, the BasicCTUP scheme and the OptCTUP scheme. Experiments are conducted to evaluate the proposed solutions.
- Yuan Mei, Samuel Madden. Searching for Optimal Plans in Sequential Composite Event Detection. Click to show abstract.
Sequentiality (temporal ordering) of events is central to the expression of patterns in many pattern matching applications, such as detecting anomalies in stock prices, detecting intrusion in a network (where a specific sequence of malicious activities is detected), catching break points in debugging systems (where a sequence of function calls are made) or tracing a person's movement in a building (where a person moves through a series of rooms or spaces). Some recent work has already noted the importance of sequential event detection in the real world, but the NFA or regular expression based methodology they adopt construct composite events in a fixed order, and hence lacks flexibility that enables a number of optimizations, especially when searching for optimal plans to detect sequential composite events.
In this work, we show that the evaluation of sequences, conjunctions and disjunctions can all be considered different forms of the fundamental database operator join, which suggests that operator ordering and intermediate result materialization issues are also important. Hence our system seeks to adaptively adjust the order in which it detects sequential patterns based on the frequency of the events that constitute the pattern and based on the selectivity of their predicates. Also, it can materialize the intermediate result when necessary.
|