New England Database Day Posters
The poster session will include drinks and appetizers. It will be held on the 4th floor of Building 32 (the same building as the main conference).
List of accepted posters:
- Zeeshan Ahmed. Advance-Product Data Management (A-PDM) with Efficient Data Management, Intelligent IR & Flexible GUI.
Product Data Management (PDM) desktop and web-based systems maintain an organization’s technical and managerial data to increase the quality of products by improving the processes of development, business process flows, change management, product structure management, project tracking and resource planning. Although PDM heavily benefits industry, the PDM community faces some serious unresolved issues in PDM system development, namely unfriendly graphical user interfaces with massive controls and unintelligent ways of searching records. PDM systems offer different services and functionalities for different kinds of users (e.g., businessmen, project managers, engineers and staff members), but the graphical user interfaces of most PDM systems are not designed so that a user, especially a new user, can easily learn and use them. Moreover, PDM systems contain and manage large amounts of data, but the search mechanisms of most systems are not intelligent enough to process a user’s natural-language queries to extract the desired information. The search mechanisms currently available in almost all PDM systems are not very efficient; they rely on the old approach of entering relevant information into the respective fields of search forms to find specific information in the attached repositories.
- Zeeshan Ahmed, Saman Majeed, Wolfgang Eisenreich, Thomas Dandekar. Isotopo: Software towards Quantitative Mass Isotopomers Distribution Analysis, Visualization and Data Management.
Isotopo is an application for performing quantitative mass spectrometry on mixtures of materials labeled with stable isotopes. Isotopo is capable of processing experimental isotopomer data (the output of GC-MS) and provides the results in both textual and visual formats, for better understanding and analysis. The input data consists of the following elements: Metabolite Information, Mass to Charge Ratio (M/Z) Values, Actual Relative Intensity (RI) Values, Standard Relative Intensity Values and Number of Carbon Atom Fragments. Isotopo is capable of analyzing three RI value sets observed during a GC-MS experiment against one M/Z value set with one Standard RI value set. The output consists of the following results: Mass Calculations (Mo, M-1, M maximum), Natural Abundances (NA), Relative Abundances (RA), Fractional Molar Abundances (FRA), Absolute Enrichment, Mean RA, Standard Deviation RA and a Spectrum of calculated RA. It provides the output of each processed RI value set along with the average of the three. Isotopo consists of seven integrated modules: Analyzer, Fragment Viewer, Spectrum Viewer, Results Viewer, and Relative Abundance 1, 2 and 3.
- Alex Rivilis. Poster for a Poster Session.
Composite Software, Inc. is the gold standard in data virtualization.
Global organizations with disparate, complex information environments increase their data agility, cut costs and reduce risk with Composite Software. Included among the hundreds of enterprises that have chosen the industry’s most proven data virtualization platform are:
• Ten of the top 20 banks;
• Six of the top ten pharmaceutical companies;
• Four of the top five energy firms;
• Major media and technology organizations; and
• Government agencies.
The Composite Data Virtualization Platform
Composite 6 is the latest release of the Composite Data Virtualization Platform. Composite 6 integrates data from multiple, disparate sources - anywhere across the extended enterprise - in a unified, logically virtualized manner for on-demand consumption by a wide range of front-end business solutions.
Backed by nearly a decade of pioneering R&D and enterprise scale deployments, the Composite Data Virtualization Platform solves the toughest data virtualization problems for the world’s leading global organizations.
The poster content:
Visual Diagram of Composite Server architecture
- Carlos Ordonez, Zhibo Chen. Optimizing Queries with Horizontal Aggregations.
SQL is limited in its ability to return aggregations as tables with a horizontal layout. A user generally needs to write separate queries and data definition statements to combine transposition with aggregation. With that motivation in mind, we introduce horizontal aggregations, a class of aggregations complementary to traditional (vertical) SQL aggregations. Our proposed SQL syntax extension is minimal, significantly enhances the expressive power and ease of use of SQL, and blurs the boundary between row values and column names. Horizontal aggregations have many applications in ad-hoc querying, OLAP cube processing and transforming data sets for data mining. Query optimization of horizontal aggregations introduces new research challenges.
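To make the idea of combining transposition with aggregation concrete, here is a minimal Python sketch. It does not show the authors' SQL syntax extension, which the abstract does not spell out; the function name `horizontal_sum` and the sample `sales` rows are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def horizontal_sum(rows, group_key, pivot_key, value_key):
    """Sum value_key per group_key, with one output column per pivot_key value."""
    # The distinct pivot values become column names in the result.
    pivot_values = sorted({row[pivot_key] for row in rows})
    totals = defaultdict(lambda: dict.fromkeys(pivot_values, 0))
    for row in rows:
        totals[row[group_key]][row[pivot_key]] += row[value_key]
    return [{group_key: group, **cols} for group, cols in sorted(totals.items())]

# Hypothetical input in the usual vertical layout: one row per sale.
sales = [
    {"store": "A", "quarter": "Q1", "amount": 10},
    {"store": "A", "quarter": "Q2", "amount": 5},
    {"store": "A", "quarter": "Q1", "amount": 7},
    {"store": "B", "quarter": "Q2", "amount": 3},
]
result = horizontal_sum(sales, "store", "quarter", "amount")
```

Each output row holds one store with a separate column per quarter, which is the sense in which row values turn into column names.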
- Carlos Ordonez, Naveen Mohanam, Carlos Garcia-Alvarado. Efficient One-pass Algorithms for Data Mining based on UDFs.
Database research on data mining is extensive, but most work has proposed efficient algorithms, data structures and techniques that work outside a DBMS, mostly on flat files or database system prototypes. In contrast, we present a data mining system that can work on top of existing relational DBMSs, based on a combination of SQL queries and User-Defined Functions (UDFs), debunking the common perception that SQL is inefficient or inadequate for data mining. Our UDF-based algorithms can process a data set in one pass, exhibit linear scalability, and can analyze large data sets significantly faster than external data mining tools (e.g., the R package, Weka). Moreover, our UDF-based algorithms are competitive with MapReduce programs.
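As an illustration of what processing a data set in one pass can look like, the following Python sketch accumulates a count, a linear sum and a sum of squares in a single scan and derives the mean and variance afterwards. This is an assumption about the general style of such algorithms, not the authors' UDF code; all names are hypothetical.

```python
def one_pass_summary(points):
    """Scan the points once, accumulating n, per-dimension sums and sums of squares."""
    n = 0
    linear_sum = None
    square_sum = None
    for point in points:
        if linear_sum is None:
            linear_sum = [0.0] * len(point)
            square_sum = [0.0] * len(point)
        n += 1
        for j, x in enumerate(point):
            linear_sum[j] += x
            square_sum[j] += x * x
    return n, linear_sum, square_sum

def mean_and_variance(n, linear_sum, square_sum):
    """Derive per-dimension mean and population variance from the summary alone."""
    mean = [s / n for s in linear_sum]
    variance = [q / n - m * m for q, m in zip(square_sum, mean)]
    return mean, variance

n, L, Q = one_pass_summary([(1.0, 2.0), (3.0, 6.0)])
mean, variance = mean_and_variance(n, L, Q)
```

Because the summary is a handful of running totals, the cost grows linearly with the number of rows and the data never has to be revisited.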
- William Murakami-Brundage. Developing ‘The Drunken Master’: A Free, Open-Source Case-Based Reasoning System.
Knowledge engineering and systems design are two key fields in the overlap between Information Science and Systems Design. A case-based reasoning system named the ‘Drunken Master’ was developed using free, open-source technology. JColibri Studio, an open-source knowledge engineering application from the GAIA Artificial Intelligence Institute, was utilized to develop a case-based reasoning system for research articles from free PubMed medical research database articles. A simultaneous text/data extraction process was developed using RapidMiner 5.1, an open-source data mining and web extraction tool. Thus, an automated web scraping and self-propagating case-based knowledge system was developed using freely available, no-cost software applications. While the key components were developed using medical research, the system can be extended to any accessible structured database or dataset.
- Yin Huai, Rubao Lee, Simon Zhang, Cathy H. Xia, Xiaodong Zhang. Modelling and Optimizing Software Management for Big Data Analytics in Distributed Systems.
We present DOT, a matrix model for big data analytics software in a scalable and fault-tolerant manner. The name DOT stands for data sets (D), concurrent data processing operations (O), and data transformations (T). With the matrix representation, the DOT model can explicitly represent (1) the concurrent data processing and the movement of data through "matrix multiplication", (2) the fact that there is no dependency among concurrent workers, and (3) the optimization opportunities of big data analytics jobs represented by the DOT model. The goal of the DOT model is to provide a bridge between big data analytics software frameworks and the applications running on top of these frameworks. We show that the DOT model provides a sufficient condition for the scalability and fault tolerance of software frameworks for big data analytics in distributed systems due to its communication-restrictive nature. Also, with the DOT model, we generalize a set of framework-independent and implementation-independent optimization rules for applications on different software frameworks for big data analytics. Finally, we show the effectiveness of the DOT model through case studies.
- Rubao Lee, Tian Luo, Yin Huai, Yuan Yuan, Fusheng Wang, Yongqiang He, Xiaodong Zhang. YSmart: An Open Source SQL-to-MapReduce Translator.
We present YSmart, an open source SQL-to-MapReduce translator. YSmart differs from existing data processing systems built on top of MapReduce with a SQL-like query language (e.g., Hive and Pig) in two aspects. First, YSmart can automatically detect and exploit intra-query correlations and is thus able to generate fewer MapReduce jobs than Hive and Pig for the same complex query. As a result, for complex queries, YSmart can generate MapReduce jobs with less execution time. Second, for a query, YSmart generates a number of files containing the real Java code for executing this query, instead of evaluating the query directly. This approach can help users understand how a query is executed. YSmart can also serve as a useful code generator, so users can avoid writing low-level MapReduce jobs without losing the flexibility to customize them. Finally, YSmart is a general framework for translating high-level queries to low-level MapReduce jobs. Thus, additional query languages can be plugged into YSmart, and more sophisticated low-level code in MapReduce jobs can be added to YSmart easily. YSmart is open source software hosted on the Google Code site at http://code.google.com/ysmart/. A more comprehensive YSmart Web site can be accessed at http://ysmart.cse.ohio-state.edu/, where we post updates on the progress of the software, production results from its user community, and related research work. YSmart has been patched into Hive, an open source data warehousing system developed at Facebook, to ensure its high programming productivity and high execution performance. It will soon be committed into the Hive repository.
- Kaibo Wang, Yin Huai, Rubao Li. Accelerating Algorithm Tuning/Validation for Digital Pathology Image Analysis on CPU/GPU Hybrid Systems.
Algorithm tuning and validation is a time-consuming and frequently invoked operation in digital pathology image analysis. By comparing the Jaccard similarity of different segmentation results, the effectiveness of a segmentation algorithm can be iteratively improved and verified. To support the computation of similarity, pathologists mainly depend on spatial databases to process the large amount of polygons generated during segmentations. Despite their wide application, the performance with spatial databases has not been satisfactory. The main reasons behind this are three-fold: 1) the access pattern of polygon data is streaming-like rather than repetitive; 2) the compute-intensive aggregate operators required in similarity computation are not well optimized in spatial databases; 3) spatial databases are not utilizing the rich resources of low-cost and high-performance computing and storage devices.
In this paper, we propose a hybrid software solution that harnesses the power of both CPU and GPU to significantly improve the performance of algorithm tuning and validation. We design GAPA, an optimized parallel algorithm that computes the areas of the intersection and the union of polygons on GPUs. GAPA eliminates the bottleneck of Jaccard similarity computation, greatly accelerating query execution. We address several technical issues and develop a software processing framework that fully exploits the rich task, data and pipeline parallelism in the workload. Moreover, we reveal a load balancing problem that is prevalent to any CPU/GPU hybrid system. Through dynamic application-level scheduling, the problem can be effectively solved which, combined with the software processing framework, further improves system throughput. The ideas and methods proposed in this paper are verified through intensive experiments with real-world pathology data sets. The performance of the algorithm tuning and validation operation is improved by over one order of magnitude.
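For readers unfamiliar with the measure, the Jaccard similarity of two regions is the area of their intersection divided by the area of their union. The Python sketch below computes it for axis-aligned rectangles, a deliberate simplification: the abstract deals with general polygons on GPUs, and this is not the GAPA algorithm. The rectangle encoding `(x1, y1, x2, y2)` is an assumption made for this example.

```python
def rect_area(rect):
    """Area of an axis-aligned rectangle (x1, y1, x2, y2); zero if it is empty."""
    x1, y1, x2, y2 = rect
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def jaccard(a, b):
    """Intersection area divided by union area for two rectangles."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    intersection = rect_area(overlap)
    union = rect_area(a) + rect_area(b) - intersection
    return intersection / union if union > 0 else 0.0

# Two 2x2 squares overlapping in a 1x1 corner: intersection 1, union 7.
similarity = jaccard((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

A value of 1.0 means the two segmentation results coincide exactly and 0.0 means they do not overlap at all.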
- Ryan Wisnesky. A Monadic Query Language.
We present MQL, a Monadic Query Language whose type system ensures the full correctness and optimality of expressive, higher-order queries over finite and infinite data with integrity constraints. At a practical level, MQL's strong reasoning principles make it suitable as an intermediate form for a variety of query languages, including those in the relational, object-relational, XML, stream, key-value, map-reduce, and explicitly parallel traditions. MQL's categorical semantics formalizes large fragments of these languages in a uniform way by modeling collections as monads, queries as comprehensions, data as products and sums, and computation as folding.
At a theoretical level, our type-directed design of MQL solves a number of open, endemic problems in the functional query language tradition. Polymorphic, extensible, labelled records and choices enable semi-structured collection processing. Qualified, higher-rank types ensure a delicate semantic condition known as parametricity, enabling complete forms of deforestation which exploit the monadic structure of queries. Quotient types ensure the correct implementation of non-free collections such as sets in terms of underlying free datatypes such as lists. Dependent identity types allow equivalences between terms to be reflected into types and manipulated by the user; for example, to achieve compositional, open fold fusion or to reduce the number of loops in a comprehension (tableaux minimization).
MQL's computational semantics is given by translation into the Calculus of Inductive Constructions, the language of the Coq proof assistant. The correctness, but not executability, of an MQL program depends on a set of automatically-generated proof obligations, which may be proved semi-interactively by the programmer.
- Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, Prashant Shenoy. A Platform for Scalable One-Pass Analytics using MapReduce.
Today's one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the data set to be fully loaded into the cluster before running analytical queries. This paper examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely-used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent key based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows that our new platform significantly improves the progress of map tasks, allows the reduce progress to keep up with the map progress, with up to 3 orders of magnitude reduction of internal data spills, and enables results to be returned continuously during the job.
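The contrast between sort-merge and hash-based processing can be illustrated with a small Python sketch: a hash table keeps the running state per key, so partial results are available at any point without first sorting the input. This is a toy, single-machine illustration under that assumption, not the authors' Hadoop-based platform; the function name and the `report_every` parameter are hypothetical.

```python
from collections import Counter

def incremental_counts(stream, report_every):
    """Maintain per-key counts in a hash table and emit a snapshot periodically."""
    counts = Counter()
    for position, key in enumerate(stream, start=1):
        counts[key] += 1  # in-memory update, no sorting or merging of the input
        if position % report_every == 0:
            yield dict(counts)

snapshots = list(incremental_counts(["a", "b", "a", "c", "a", "b"], report_every=2))
```

Each snapshot is a usable answer over the data seen so far, which is what allows results to be returned continuously while a job is still running.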
- Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Ozcan, Rainer Gemulla, Aljoscha Krettek, John McPherson. CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop.
Hadoop has become an attractive platform for large-scale data analytics. In this work, we identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, CoHadoop retains the flexibility of Hadoop in that it does not require users to convert their data to a certain format (e.g., a relational database or a specific file format). Instead, applications give hints to CoHadoop that some set of files are related and may be processed jointly; CoHadoop then tries to colocate these files for improved efficiency. Our approach is designed such that the strong fault tolerance properties of Hadoop are retained. Colocation can be used to improve the efficiency of many operations, including indexing, grouping, aggregation, columnar storage, joins, and sessionization. We conducted a detailed study of joins and sessionization in the context of log processing, a common use case for Hadoop, and propose efficient map-only algorithms that exploit colocated data partitions.
In our experiments, we observed that CoHadoop outperforms both plain Hadoop and previous work. In particular, our approach not only performs better than repartition-based algorithms, but also outperforms map-only algorithms that do exploit data partitioning but not colocation.
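The benefit of colocated partitions for joins can be sketched in a few lines of Python: if two data sets are partitioned on the join key in the same way and matching partitions sit together, each pair of partitions can be joined locally with no shuffle step. The sketch below simulates this in one process; it is not CoHadoop's API, and the partitioning function, record layout and sample data are assumptions for illustration.

```python
def partition_by_key(records, num_partitions):
    """Split (key, value) records into partitions using key modulo num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[key % num_partitions].append((key, value))
    return partitions

def local_join(left_partition, right_partition):
    """Hash join of two partitions that hold the same range of keys."""
    index = {}
    for key, value in left_partition:
        index.setdefault(key, []).append(value)
    return [(key, left_value, right_value)
            for key, right_value in right_partition
            for left_value in index.get(key, [])]

# Hypothetical log-processing data: user records and click records keyed by user id.
users = [(1, "ann"), (2, "bo"), (3, "cy")]
clicks = [(1, "/home"), (3, "/cart"), (1, "/buy"), (4, "/faq")]
user_parts = partition_by_key(users, 2)
click_parts = partition_by_key(clicks, 2)

# Partition i of one data set is joined only with partition i of the other.
joined = [row
          for user_part, click_part in zip(user_parts, click_parts)
          for row in local_join(user_part, click_part)]
```

Each call to `local_join` plays the role of one map task that reads both of its inputs from the same node.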
- Jacob Nikom. Database Change Management with CatalogVersion Tool.
Database Change Management is the process of making a change to an object in a database. This process starts with coding the change, continues with applying, auditing and testing the change, and finishes with rolling out the change to the production environment. This talk describes the CatalogVersion tool, a Java program that performs all those tasks for a MySQL database. With the CatalogVersion tool, you can determine the state of a MySQL server; track changes to database structures and static data; track the version of a database in a specific environment; deploy new versions accurately and reliably; determine who has made changes to database objects; create an audit trail recording the changes and when they occurred; and determine server errors and back out corrupt versions to roll back to a known state of integrity.
- Jennie Duggan, Ugur Cetintemel, Olga Papaemmanouil, Eli Upfal. Learning-based Modeling and Prediction of Concurrent Query Performance for General Analytical Workloads.
This poster proposes and evaluates learning-based modeling and prediction techniques for Concurrent Query-Performance-Prediction (cQPP), the ability to estimate how long individual queries running concurrently on the same hardware will each take to finish. Accurate cQPP has emerged as a key functionality as database systems are increasingly used in (i) interactive settings with user-facing applications, and (ii) cloud-based settings where Quality of Service (QoS) expectations are often expressed and regulated as functions of query execution latency. We leverage sophisticated machine-learning techniques to address cQPP for the general case of heterogeneous, ad hoc analytical workloads, which are common to many important application scenarios (e.g., data mining and exploration) and deployment modes (e.g., multi-tenant cloud systems). We use a combination of semantic feature selection, correlation analysis, dimensionality reduction, and clustering techniques on sample query executions to develop predictive cQPP models for new concurrent workloads that may include queries from previously unseen templates.
- Abhishek Mukherji, Elke A. Rundensteiner, Matthew O. Ward. Nugget-guided Confirmatory and Exploratory Hypothesis Testing.
The main aim of exploratory data mining is to automatically extract interesting patterns, termed nuggets in our work, from large volumes of data using exploratory search. Purely automated data mining often suffers from extraneous nuggets due to limited knowledge of user interest. On the other hand, hypothesis testing is a well-known statistical tool for confirmatory data analysis. In contrast to exploratory data analysis, conventional hypothesis testing is performed in a hypothesis-driven manner. Traditional hypothesis testing suffers from too much dependence on user expertise. While hypothesis testing can be made less user-dependent by borrowing from the exploratory nature of data mining, data mining can benefit from the target-oriented approach of hypothesis testing. In this work, we explore the possibility of integrating these two fields of exploratory data mining and confirmatory hypothesis analysis to develop enhanced mining and statistical tools by overcoming their respective limitations. In particular, we identify certain limitations of the existing evidence-based hypothesis testing techniques, such as lack of support for interdependence between multiple hypotheses or not accounting for a single piece of evidence contributing to multiple hypotheses. Therefore, we explore how multiple hypotheses can be tested iteratively using relationships between multiple pieces of evidence and multiple hypotheses. We propose a Multi-evidence Multi-hypothesis testing system for this purpose. Further, we propose to develop techniques for automatic generation of hypotheses, their testing and further analysis of the hypotheses. The generation of hypotheses is meant to exploit the exploratory nature of data mining. Testing of these hypotheses can be performed over the data records or on the extracted nuggets. The research is focused on three aspects, namely (a) semantics: defining measures for validating hypotheses based on supporting nuggets; (b) computation: developing algorithms for multiple hypothesis testing and exploratory generation-testing-analysis of hypotheses; and (c) visualization: developing techniques for displaying relationships between the hypotheses and the evidence as well as the test scores to help users better comprehend the results.
- Lei Cao, Elke Rundensteiner. Multi-Route Stream Query Processing With Correlation-Aware Partitioning.
A modern query optimizer typically picks a single query plan for all data based on overall data statistics. However, as real-life datasets tend to have non-uniform distributions and be highly correlated, selecting a single query plan may result in ineffective query execution for possibly large portions of the data streams. A promising approach towards tackling this problem is to support multiple routes (i.e., query plans), each designed for a particular subset of the data with distinct statistical properties. In this paper we propose a practical multi-route approach named Multi-Route Stream Query Processing With Correlation-Aware Partitioning (or MacPro), which, by exploring data correlations, efficiently discovers the data partitions and creates multiple plans for a given query. First we divide a single stream into several substreams by exploring the data correlations. Then we decompose the initial user query into multiple partition-level child queries with distinct statistical substream properties. Further, we map the multi-query optimization problem of finding optimal plans for the potentially large number of child queries to a graph problem and solve it. Experimental results confirm that MacPro consistently provides better query execution performance compared to the state-of-the-art solutions.
- Nirmesh Malviya, Michael Stonebraker, Samuel Madden, Ariel Weisberg. Recovery Algorithms for In-Memory OLTP databases.
We examine different algorithms for recovery in a main-memory parallel database system and see how their performance compares on typical OLTP workloads. We compare two approaches for recovery: command logging and traditional ARIES-style logging. Our initial experimental results show that command logging outperforms ARIES in terms of run-time performance while still having competitive recovery times.
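The idea behind command logging is to record which operation ran and with which arguments, rather than the data changes it produced, and to recover by replaying those operations against a snapshot. The Python sketch below shows that idea in miniature. It assumes deterministic procedures and a hypothetical `transfer` procedure; it does not model ARIES, checkpointing, or the system studied in the poster.

```python
def transfer(state, source, target, amount):
    """A deterministic stored procedure: move amount between two accounts."""
    state[source] -= amount
    state[target] += amount

# Hypothetical registry mapping procedure names to their implementations.
PROCEDURES = {"transfer": transfer}

def execute(state, log, name, *args):
    """Append the command (not its effects) to the log, then run it."""
    log.append((name, args))
    PROCEDURES[name](state, *args)

def recover(snapshot, log):
    """Rebuild the state by replaying logged commands in order on a snapshot copy."""
    state = dict(snapshot)
    for name, args in log:
        PROCEDURES[name](state, *args)
    return state

snapshot = {"alice": 100, "bob": 50}
live = dict(snapshot)
command_log = []
execute(live, command_log, "transfer", "alice", "bob", 30)
execute(live, command_log, "transfer", "bob", "alice", 5)
recovered = recover(snapshot, command_log)
```

Each log record here is just a name and a short argument tuple, which hints at why logging commands can be cheaper at run time than logging every modified value, at the price of re-executing the work during recovery.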
- Kajal Claypool, Kelly Moran, Kevin Nam. Knowledge Creation Services.
From identifying people of interest to reacting to technology trends to understanding and devising strategies for dealing with political turmoil in foreign countries, "holistic" intelligence is key to good decision making. While much of the technology emphasis has been on developing sensors to perceive the environment and on processing raw data, the process chain for producing holistic views and actionable intelligence based on the entirety of the information is very much still the purview of human analysts. The Knowledge Creation Services (KCS) project aims to (1) decrease the effort required to aggregate information into holistic views; (2) provide infrastructure to support human-in-the-loop development of actionable intelligence; and (3) support tasks 1 and 2 in a near real-time and interactive environment.
- Sean R. Spillane, Daniel Bokser, Daniel Kemp, Jeong-Hyon Hwang, Jeremey Birnbaum, Zhiting Yang. G*, a Parallel Graph Processing System.
Networks, such as the Internet, and Social Networks—and the graphs that model them—change over time. Therefore, we need to be able to handle sets of large graphs. Despite their generality, it is hard to express graph queries in RDBMSs. Worse—in an RDBMS—traversals over graphs require costly join operations. Current graph processing systems and libraries are not capable of efficiently handling sets of graphs. Our system, G*, overcomes these issues. G* not only provides a concise and declarative query language, it also stores graph data in a de-duplicated fashion, which virtually eliminates redundant computation and storage—even for distributed graphs and graph sets. In G*, queries are transformed into a network of operators, which then process—in parallel—distributed graph data. Our poster will provide a high-level summary of our system, along with a précis of our results.
- Jacek R. Ambroziak. High Performance XML storage in MongoDB.
We present a method for storing, indexing, and retrieving XML documents using MongoDB. The work has been motivated by the needs of Custom Publishing of electronic books. eBook/eJournal content is typically represented as XML (XHTML, DocBook) and so is book metadata (ONIX, RDF). After parsing input XML documents once, we serialize them into binary BLOBs that can be very quickly 'unpacked' upon retrieval. At storage time we use compiled XSLT to extract metadata that accompanies the binary XML BLOBs into Mongo and helps index them. The metadata will be referred to in subsequent MongoDB queries. Binary XML BLOBs returned by these queries are unpacked and transformed by compiled XSLT/XPath in <1 msec. Additionally, since the XML docs represent English text, we optionally full-text index the contents; the search engine we use returns XPaths of the text paragraphs it locates.
- Thanh Tran, Yanlei Diao, Anna Liu. Supporting user-defined functions of uncertain data streams.
Scientific applications, such as astrophysics and weather monitoring, make intensive use of user-defined functions (UDFs) which are often complex and expensive to compute. Since uncertain data streams are native in these applications, the output of a UDF is uncertain and should be characterized by a distribution. In our work, we address a general case when the UDFs are treated as black boxes and provide a framework to compute them using nonparametric Bayesian techniques. More specifically, we employ Gaussian processes to approximate the UDFs and compute output distributions with approximation errors. Given the stream setting, we also explore techniques to improve performance such as approximation using local Gaussian processes. Our initial results show that when applied to many complex functions, the proposed technique is more efficient than the standard Monte Carlo simulation while providing comparable accuracy.
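The standard Monte Carlo simulation mentioned as the point of comparison can be sketched as follows: draw samples from the uncertain input, apply the black-box UDF to each, and summarize the outputs. The Python sketch below shows only this baseline, not the Gaussian-process approximation proposed in the poster; the Gaussian input model, the sample count and `example_udf` are assumptions for illustration.

```python
import random
import statistics

def monte_carlo_output(udf, mean, stddev, samples=20000, seed=0):
    """Estimate the output mean and standard deviation of udf on a Gaussian input."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    outputs = [udf(rng.gauss(mean, stddev)) for _ in range(samples)]
    return statistics.fmean(outputs), statistics.pstdev(outputs)

def example_udf(x):
    """Stand-in for an expensive black-box function (hypothetical)."""
    return 2.0 * x + 1.0

out_mean, out_std = monte_carlo_output(example_udf, mean=0.0, stddev=1.0)
```

Every sample costs one UDF evaluation, so the baseline becomes expensive when the function itself is costly, which is the motivation the abstract gives for approximating the UDF instead.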
- Alper Okcan, Mirek Riedewald. Enabling Scientific Discovery through Scalable Search and Ranking.
Data-intensive science has emerged as a new paradigm that is concerned with collecting, archiving, and analyzing the vast amounts of data being produced and accumulated by modern science. Turning raw data into knowledge will be the key for future scientific discoveries. A major challenge during exploratory analysis is to find interesting patterns indicating possible relationships between the variables of a complex process. Such patterns form the basis for new hypotheses and facilitate discovery. Scientists would like to search for interesting patterns similar to how they search the Web, rather than through trial-and-error.
We propose Scolopax, a novel tool for discovery of interesting patterns in massive high-dimensional data sets. Scolopax is based on an optimization framework for efficient creation of massive pattern collections on multi-processor environments like Clouds. For a popular class of data mining models and a wide variety of pattern workloads, we improve computational cost of pattern creation and achieve speedups of several orders of magnitude compared to the state of the art. We also present novel parallel algorithms for finding groups of related patterns, e.g., to discover correlations between trends.
- Liping Peng, Yanlei Diao, Anna Liu. Efficient Uncertain Data Management under the Array Model.
Continuous uncertain data is prevalent in scientific applications such as computational astrophysics. Motivated by the fact that a lot of scientific data naturally resides in multi-dimensional arrays rather than relations, recent research has proposed an array data model and algebra, and has shown the performance benefits over relational technology. However, existing work on uncertain data management under the array model mainly focuses on the "value uncertainty". In this paper, we examine and classify common array operators on continuous uncertain data and seek to define their formal semantics. We further consider the "position uncertainty" and propose a range of storage schemes including one that is adaptively tunable according to the available storage, query workloads and performance requirements. We also devise evaluation strategies of array operators under each storage scheme.
- Alexander Thomson, Thaddeus Diamond, Philip Shao, Shu-chun Weng, Kun Ren, Daniel Nucleic Abadi. Fast Distributed Transactions for Partitioned Database Systems.
There are many distributed storage systems that achieve high data access throughput via partitioning and replication, each with its own advantages and tradeoffs. In order to achieve high scalability, however, today's systems generally reduce transactional support, disallowing single transactions from spanning multiple partitions. We introduce a practical transaction scheduling and data replication layer that uses a deterministic ordering guarantee to significantly reduce the normally prohibitive contention costs associated with distributed transactions. Unlike recent deterministic database system prototypes, this system supports disk-based storage, scales near-linearly on a cluster of commodity machines, and has no single point of failure. By replicating transaction inputs rather than effects, we are also able to support multiple consistency levels, including Paxos-based strong consistency across geographically distant replicas, with no additional cost to transactional throughput.
- Yingmei Qi, Medhabi Ray, Elke Rundensteiner, Chengcheng. Efficient Aggregation Computation in E-Cube. Click to show abstract.
Traditional online analytical processing (OLAP) systems are not designed for real-time complex pattern extraction, while state-of-the-art Complex Event Processing (CEP) systems designed for sequence detection do not support OLAP operations. The E-Cube system is the first to combine CEP and OLAP techniques to support efficient multi-dimensional event pattern analysis at different abstraction levels. However, the base operation of OLAP, aggregation computation, is not taken into consideration in the current E-Cube model. In this work, we first propose a high-performance approach to COUNT aggregation computation in a CEP environment. Then, we extend this approach to the E-Cube model to support sharing of aggregation computations among queries at different pattern or concept levels. Finally, we design a cost-driven adaptive optimizer to achieve the optimal E-Cube hierarchy execution.
- Dazhi Zhang, Medhabi Ray, Elke Rundensteiner. Nested Event Processing Language and its Optimized Processing. Click to show abstract.
Complex Event Processing (CEP) has become increasingly important for online data analysis in domains ranging from scientific research to equipment monitoring and sensor data analysis. These monitoring applications submit complex event queries to track sequences of events that match a given pattern. While state-of-the-art CEP systems focus on the execution of sequence queries, we propose an efficient execution strategy for a broader class of pattern queries specified by a nested combination of existent and non-existent events. Such nesting enables users to specify expressive information needs more succinctly.
In this poster, we will present a NEsted Event Language (NEEL) and a processing technique for this language. We will discuss several optimization techniques for faster processing of nested CEP queries.
- Bahar Qarabaqi, Mirek Riedewald. Interactive Search Queries for Online Communities. Click to show abstract.
We introduce interactive search queries for online communities (ISQ-problem) as a new research challenge. In online communities, users often try to leverage the combined community expertise to find answers for problems where neither a complete and accurate specification of the question nor the exploration of all possible answers is feasible. Some typical examples are identifying the species of an observed bird, diagnosing a disease in a health-care community, and finding the name of a nice restaurant where one ate several years ago during a vacation. We formally define the ISQ-problem as a problem of minimizing user effort while guaranteeing interactive system response time. To deal with uncertain and missing user responses, we develop a probability-based framework and show that it allows us to optimally rank possible answers. The framework is also the basis for an algorithm that decides whether to ask further questions or present possible answers, such that it can home in on the right answer with minimal user effort. A core component of the algorithm is a scalable technique for estimating the required probabilities while guaranteeing interactive response time even for very large problems.
- Alex Kalinin, Ugur Cetintemel, Stan Zdonik. Region-based Data Exploration. Click to show abstract.
We present a new spatial data exploration approach in which users express "regions" (i.e., parts of the underlying data space) of interest using standard DBMS-style queries enhanced with optimization functions. Our model enables a host of useful exploratory queries (e.g., "find a 3-by-3 region that has the max average brightness") that are difficult to express and optimize using standard DBMS techniques. We study this model in the context of SciDB, an array-oriented DBMS designed to support science applications.
- Stephen Tu. Analytical query processing on encrypted data. Click to show abstract.
This poster will discuss very preliminary work on executing OLAP style queries over encrypted data. This work builds on top of CryptDB, but optimizes techniques for analytical workloads. Hand-tuned performance results on the TPC-H benchmark will be shown, with a discussion of building a trace-based query optimizer as the next research step.
- Ron Hu. MarkLogic Server: A Highly Scalable DBMS for Unstructured Information. Click to show abstract.
MarkLogic Server is built on a simple guideline, namely "use search engine like a database server". We support important database properties such as a rich query language, guaranteed correctness, ACID transaction properties, and high scalability. In this presentation, we focus on how we achieve high scalability.
MarkLogic Server is a purpose-built database management system for unstructured data. Customers load documents of various file formats into MarkLogic Server, with indexes created on the fly. We describe its shared-nothing architecture and its indexes for full-text search. We will also demonstrate MarkMail, which uses MarkLogic Server to run a very large public mailing list archive service that emphasizes high performance and search analytics.
- Barzan Mozafari, Carlo Curino, and Samuel Madden. Resource and Performance Prediction for OLTP Workloads in a Database-as-a-Service. Click to show abstract.
Resource and performance analysis and prediction in a transactional database---answering questions like "What would be the latency of my queries if the requests per second double?" or "Which resource will be the bottleneck when the load increases?"---is both important and difficult. The main challenge stems from the high degree of concurrency and the non-linearity caused by complex interactions between different transactions competing for the same resources. Nonetheless, such analysis is a key component in enabling database administrators to understand how their systems will scale under load and how much of each resource is used by different queries or tenants. Though some prior work has addressed analytical workloads, characterized by fewer (typically read-only) but heavier queries, there are no existing tools designed to answer the above questions for OLTP workloads. In this poster, we present our solution to this problem, which develops statistical models to provide resource and performance analysis and prediction for highly concurrent OLTP workloads.