MACAQUE - Managing Ambiguity and Complexity in Acquisitional Query Environments
[Overview | Projects | People | Publications]
The vision of ubiquitous computing promises to spread information
technology throughout our lives. Though this vision can be compelling,
it also threatens to overwhelm us with a flood of information, much of
which is spurious, irrelevant, or misleading. Thus, the challenge of
realizing this vision is separating the relevant, timely, and useful
information out of this flood of data. The data management community
has made significant progress towards achieving this goal by
providing tools that load and clean the data, languages and systems
that can query the data, and algorithms that mine the data for patterns
and relationships of interest.
These efforts have largely been focused on mitigating data complexity
once it has been captured and stored inside of a traditional computing
infrastructure. In contrast, we propose a set of techniques designed
to take an active role in managing this wealth of data by managing
when, where, and with what frequency data is acquired from distributed
information systems. There are many modern systems where the
capability of local nodes to generate data far outstrips the resources
available to transmit or store that data. Nodes in a sensor network,
for example, typically have processors that run at several megahertz,
with data collection hardware capable of collecting many kilosamples
per second, but radios that can transmit only kilobytes per second
aggregate across all of the nodes in the network. Worse yet, these
nodes are battery powered, and, when sampling at maximum rates, only
have sufficient energy to last for a few days.
In addition to limited resources, data from real world environments is
often noisy, lossy, and hard to interpret. This noise and uncertainty
can be misleading, particularly when the user is summarizing and
aggregating data using a high-level language like SQL.
In the MACAQUE (for "Management of Ambiguity and
Complexity in an Acquisitional QUery Environment") project,
we are developing several systems designed to
focus the resources of the computer system (e.g., network bandwidth or
battery capacity) and the attention of the user on capturing, refining,
and interpreting the portions of the data that are most relevant, while
de-emphasizing and decreasing the captured resolution of less relevant
data. At the same time, we use statistical and probabilistic
techniques to identify data that is spurious, incorrect, or
unreliable, and to infer missing data values.
Our effort on MACAQUE is divided into several related sub-projects,
including:
- Probabilistic models for sensor networks: We have developed an energy-efficient
framework, called the 'sensor acquisition framework' (SAF), for
approximate querying and clustering of nodes in a sensor network. SAF
uses simple time series forecasting models to predict sensor
readings. The idea is to build these local models at each node,
transmit them to the root of the network (the "sink"), and use them to
approximately answer user queries. Our approach dramatically reduces
communication relative to previous approaches for querying sensor
networks by exploiting properties of these local models, since each
sensor communicates with the sink only when its local model varies due
to changes in the underlying data distribution.
- Model-based views: Just as traditional database views provide
logical data independence, model-based views provide independence from
the details of the underlying data generating mechanism and hide the
irregularities of the data by using models to present a consistent
view to the users.
We have developed a new kind of model-based view that represents a
user's data as a collection of functions, fit by regression, that are
stored in the database as an alternative representation to raw data.
Regression is a widely employed and useful modeling tool in several
financial, scientific, engineering and data mining applications, as
well as in applications like sensor networks that need to tolerate
missing and/or noisy data. These applications need to both fit
functions to their data using regression, and pose relational-style
queries over these functional models. Unfortunately, existing DBMSs
are ill suited for this task because they do not include support for
creating, representing and querying this functional data short of
brute-force discretization of the functions into a collection of
tuples. The system we have developed, FunctionDB,
is a novel DBMS that treats functions output by regression as
first-class citizens that can be queried and manipulated like
traditional relations.
- Water pipeline management: US water utilities are faced with
mounting operational and maintenance costs as a result of aging
pipeline infrastructures. Leaks and ruptures in water supply
pipelines and blockages and overflow events in sewer collectors cost
millions of dollars a year, and monitoring and repairing this
underground infrastructure presents a severe challenge. We have
developed a system called PipeNet for collecting hydraulic and
acoustic/vibration data at high sampling rates, as well as modeling
techniques that use this data to detect leaks and imminent ruptures in
pipes. Though the specific analysis and detection techniques are
somewhat different from those in other parts of the MACAQUE project,
the high-level idea of summarizing raw, noisy sensor data into models
of the state of pipes is in many ways similar.
- CarTel: CarTel is a
mobile sensor data collection platform that runs on a network of 9
private cars (belonging to members of the MIT research community) and
(as of this year) 27 taxis. It consists of a small, car-powered,
Linux-based hardware platform that includes a WiFi radio, a GPS, an
accelerometer, (in some cases) a camera, and an interface to the
on-board diagnostics network inside most automobiles.
MACAQUE-relevant applications include techniques that summarize the
collected GPS data as models of traffic delays for route planning and
techniques that detect road-surface anomalies (e.g., potholes) using
on-board accelerometer data.
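To make the model-driven idea behind the sensor network sub-project more concrete, here is a minimal sketch of a node that transmits a new model to the sink only when its current model stops predicting its readings well. It is an illustration under stated assumptions, not the SAF implementation: the linear-trend model, the names `LinearModel`, `fit_linear`, and `simulate_node`, and the `window` and `tolerance` parameters are all hypothetical stand-ins for the "simple time series forecasting models" described above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LinearModel:
    """Hypothetical local model: reading ~ intercept + slope * t."""
    intercept: float
    slope: float

    def predict(self, t: int) -> float:
        return self.intercept + self.slope * t


def fit_linear(times: List[int], values: List[float]) -> LinearModel:
    """Ordinary least-squares fit of a line to (time, value) pairs."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        # A single time point: fall back to a constant model.
        return LinearModel(mean_v, 0.0)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    return LinearModel(mean_v - slope * mean_t, slope)


def simulate_node(readings: List[float], window: int = 5,
                  tolerance: float = 0.5) -> Tuple[int, List[float]]:
    """Simulate one node and the sink's copy of its model.

    Returns the number of model transmissions and the value the sink
    would report at each time step. Assumes len(readings) >= window.
    """
    # The node fits an initial model and sends it to the sink once.
    model = fit_linear(list(range(window)), readings[:window])
    messages = 1
    estimates = []
    for t, value in enumerate(readings):
        # Refit and retransmit only when the shared model has drifted.
        if t >= window and abs(model.predict(t) - value) > tolerance:
            start = t - window + 1
            model = fit_linear(list(range(start, t + 1)),
                               readings[start:t + 1])
            messages += 1
        # The sink answers queries from the model, not from raw data.
        estimates.append(model.predict(t))
    return messages, estimates
```

With a steady signal the node sends its model once and then stays silent; a change in the underlying data (for example, a jump from 20.0 to 25.0) triggers a handful of refits until the model matches the new level. Note that this sketch does not guarantee a hard error bound on the sink's answers, since a refit model can still miss the latest reading by more than `tolerance`.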
People
Faculty
Students
Alumni
Publications
- Arvind Thiagarajan, Samuel Madden.
Representing and Querying Regression Models in a DBMS.
To appear in Proceedings of SIGMOD, 2008. [PDF]
- Yang Zhang, Bret Hull, Hari Balakrishnan, Samuel Madden.
ICEDB: Intermittently Connected Continuous Query Processing.
In Proceedings of ICDE, 2007. [PDF]
- Ivan Stoianov, Lama Nachman, Samuel Madden, Timur Tokmouline.
PIPENET: A Wireless Sensor Network for Pipeline Monitoring.
In Proceedings of IPSN, 2007. [PDF]
- Daniela Tulone, Samuel Madden.
An Energy-efficient Querying Framework for Detecting Node Similarities in Sensor Networks.
In Proceedings of ACM/IEEE International Symposium on Modeling, Analysis and Simulation in Sensor Networks (MSWiM), 2006. [PDF]
- Bret Hull, Vladimir Bychkovskiy, Kevin Chen, Michel Goraczko, Eugene Shih, Yang Zhang, Hari Balakrishnan, Samuel Madden.
CarTel: A Distributed Mobile Sensor Computing System.
In Proceedings of SenSys, 2006. [PDF]
- Vladimir Bychkovskiy, Bret Hull, Allen Miu, Hari Balakrishnan, Samuel Madden.
A Measurement Study of Vehicular Internet Access Using Unplanned 802.11 Networks.
In Proceedings of MOBICOM, 2006. [PDF]
- Amol Deshpande, Samuel Madden.
MauveDB: Supporting Model-Based User Views in Database Systems.
In Proceedings of SIGMOD, 2006. [PDF]
- Daniela Tulone, Samuel Madden.
PAQ: Time series forecasting for approximate query answering in sensor networks.
In Proceedings of EWSN, 2006. [PDF]
- Amol Deshpande, Carlos Guestrin, Samuel Madden, Joseph Hellerstein, Wei Hong.
Model Driven Data Acquisition in Sensor Networks. (Best Paper Award).
In Proceedings of VLDB, 2004. [PDF]
This project is funded by the NSF award IIS-0448124.