New England Database Day Program
| Time | Event | |
|---|---|---|
| 9:00 AM | Welcoming remarks | |
| 9:30 AM | Michael J. Franklin (UC Berkeley and Truviso). Continuous Analytics: Supercharging Query Performance with Stream Processing. | |
| 10:30 AM | Coffee Break | |
| Technical Session 1 | ||
| 11:00 AM | Yin Huai, Rubao Lee, Simon Zhang, Cathy H. Xia, Xiaodong Zhang. Modelling and Optimizing Software Management for Big Data Analytics in Distributed Systems. We present DOT, a matrix model for big data analytics software in a scalable and fault-tolerant manner. The DOT model is named for its three components: data sets (D), concurrent data processing operations (O), and data transformations (T). With the matrix representation, the DOT model can explicitly represent (1) the concurrent data processing and the movement of data through "matrix multiplication", (2) the fact that there is no dependency among concurrent workers, and (3) the optimization opportunities of big data analytics jobs represented by the DOT model. The goal of the DOT model is to provide a bridge between big data analytics software frameworks and the applications running on top of these frameworks. We show that, because of its communication-restrictive nature, the DOT model provides a sufficient condition for the scalability and fault tolerance of software frameworks for big data analytics in distributed systems. With the DOT model, we also generalize a set of framework-independent and implementation-independent optimization rules for applications on different software frameworks for big data analytics. Finally, we show the effectiveness of the DOT model through case studies. | |
| 11:25 AM | Carlos Ordonez, Naveen Mohanam, Carlos Garcia-Alvarado. Efficient One-pass Algorithms for Data Mining based on UDFs. Database research on data mining is extensive, | |
| 11:50 AM | Jacek R. Ambroziak. High Performance XML storage in MongoDB. We present a method for storing, indexing, and retrieving XML documents using MongoDB. The work has been motivated by the needs of Custom Publishing of electronic books. eBook/eJournal content is typically represented as XML (XHTML, DocBook) and so is book metadata (ONIX, RDF). After parsing input XML documents once, we serialize them into binary BLOBs that can be very quickly 'unpacked' upon retrieval. At storage time we use compiled XSLT to extract metadata that accompanies the binary XML BLOBs into MongoDB and helps index them. This metadata is then referred to in subsequent MongoDB queries. Binary XML BLOBs returned by these queries are unpacked and transformed by compiled XSLT/XPath in under 1 ms. Additionally, since the XML documents represent English text, we optionally full-text index the contents; the search engine we use returns the XPaths of the text paragraphs it locates. | |
| 12:25 PM | Lunch | |
| Technical Session 2 | ||
| 1:15 PM | Alper Okcan, Mirek Riedewald. Enabling Scientific Discovery through Scalable Search and Ranking. Data-intensive science has emerged as a new paradigm that is concerned with collecting, archiving, and analyzing the vast amounts of data being produced and accumulated by modern science. Turning raw data into knowledge will be the key for future scientific discoveries. A major challenge during exploratory analysis is to find interesting patterns indicating possible relationships between the variables of a complex process. Such patterns form the basis for new hypotheses and facilitate discovery. Scientists would like to search for interesting patterns similar to how they search the Web, rather than through trial-and-error. | |
| 1:40 PM | Nirmesh Malviya, Michael Stonebraker, Samuel Madden, Ariel Weisberg. Recovery Algorithms for In-Memory OLTP databases. We examine different algorithms for recovery in a main-memory parallel database system and see how | |
| 2:05 PM | Yingmei Qi, Medhabi Ray, Elke Rundensteiner, Chengcheng. Efficient Aggregation Computation in E-Cube. Traditional online analytical processing (OLAP) systems are not designed for real-time complex pattern extraction, while state-of-the-art Complex Event Processing (CEP) systems designed for sequence detection do not support OLAP operations. The E-Cube system is the first to combine CEP and OLAP techniques to support efficient multi-dimensional event pattern analysis at different abstraction levels. However, the base operation of OLAP, aggregation computation, is not taken into consideration in the current E-Cube model. In this work, we first propose a high-performance approach to COUNT aggregation computation in a CEP environment. Then, we extend this approach to the E-Cube model to support sharing of aggregation computations among queries at different pattern or concept levels. Finally, we design a cost-driven adaptive optimizer to achieve the optimal E-Cube hierarchy execution. | |
| 2:30 PM | Coffee Break | |
| 3:00 PM | Alon Halevy (Google). Structured data on the Web: where we are and where we can go. Though search on the World-Wide Web has focused mostly on unstructured text, there is an increasing amount of structured data on the Web and growing interest in harnessing such data. I will describe several current projects at Google whose overall goal is to leverage structured data and better expose it to our users. These projects include crawling the deep web, collecting and mining the HTML tables on the Web, and computing aspects for search queries to better organize answers. In each case, I will focus on the lessons learned from the project and the opportunities that lie ahead. I will also discuss the opportunities relating to creating and managing data on the Web. | |
| 4:00 PM | Poster Session and Appetizers / Drinks (Building 32, R&D Area, 4th Floor) | |
| 6:00 PM | Adjourn | |