FunctionDB: A DBMS With First-Class Support For Regression Functions

Arvind Thiagarajan & Samuel Madden

Motivation

Relational databases have traditionally taken the view that the data they store is a set of discrete observations. This is reasonable when storing individual facts, such as the salary of an employee or the description of a product. However, when representing time- or space-varying data, such as a series of temperature observations, the trajectory of a moving object, or a history of salaries over time, a set of discrete points is often neither the most intuitive nor the most compact representation. For researchers in many fields -- from social sciences to biology to aeronautics to computer science -- a common first step in understanding a set of data points is to model those points with a collection of curves, generated using some form of regression (i.e., curve fitting). Regression yields a compact representation of those points as a few parameters, and provides insight into the data by revealing trends and outliers.

Once in this curve domain, it is natural to ask questions over the fitted curves directly: for example, to look for curves that intersect, that are confined within a certain area, or that have the maximum slope. Unfortunately, existing DBMSs are ill-suited for this task because they do not include support for creating, representing and querying functional data, short of brute-force discretization of functions into collections of tuples.

An alternative to modeling might be to simply run queries over raw data points. Unfortunately, it is usually not desirable or even feasible to directly query raw data, because they are either missing (necessitating interpolation), noisy (necessitating smoothing) or unavailable (necessitating prediction via extrapolation). For example, in a sensor network that monitors an environmental variable like temperature or humidity using a number of sensors, it may be necessary to interpolate the data to predict temperature or humidity at locations where sensors are not physically deployed. Also, sensors may occasionally fail or malfunction, or report garbage values that must be smoothed or eliminated by filtering.

Existing database systems provide some support for fitting models, but do not support regression models as first-class objects. Some commercial DBMSs do provide modeling tools for data mining applications. For example, IBM's Intelligent Miner, which is part of IBM DB2, supports creating models using PMML (Predictive Model Markup Language). However, these tools do not export a relational interface to model data. Rather, models are viewed as standalone black boxes with specialized interfaces for fitting, querying and visualization.

Hence, we are building a system, FunctionDB, that allows users to directly query regression functions inside a database system. By pushing support for regression into the database, rather than requiring the use of an external curve fitting and analysis tool, FunctionDB lets users manage these models just like any other data and gain the benefits of declarative queries, indexability, transactions, fault tolerance, recovery, and integration with existing database data.

Progress and Results

As part of FunctionDB, we have developed a simple, compact, algebraic representation for regression models inside a DBMS as piecewise continuous functions. We are developing an algebraic query processor that executes relational queries directly on this representation as combinations of algebraic operations like function inversion, zero finding and symbolic integration. As a very simple example, consider a sensor whose temperature is given by the equation x(t) = 2t + 3. A selection query that finds the times when this temperature crosses the line x = 5 can be evaluated by solving the equation 2t + 3 = 5, which gives t = 1. Similarly, the symbolic analogues for aggregation and join queries use definite integrals and function inversion respectively.
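The selection and aggregation examples above can be sketched in a few lines of Python. This is only an illustrative sketch and not FunctionDB's actual implementation (which is in C++). The names Segment, crossing_times and time_average are hypothetical, and the time range [0, 4] for the sensor model is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One linear piece x(t) = slope * t + intercept, valid on [t_start, t_end].

    Hypothetical type for illustration; not FunctionDB's representation.
    """
    t_start: float
    t_end: float
    slope: float
    intercept: float


def crossing_times(segments: List[Segment], level: float) -> List[float]:
    """Selection: times where the function equals `level`.

    Each piece is handled by solving slope * t + intercept = level
    and keeping the root only if it lies inside the piece's range.
    """
    times = []
    for seg in segments:
        if seg.slope == 0:
            # A constant piece equals `level` nowhere or everywhere;
            # this sketch simply skips that case.
            continue
        t = (level - seg.intercept) / seg.slope
        if seg.t_start <= t <= seg.t_end:
            times.append(t)
    return times


def time_average(segments: List[Segment]) -> float:
    """Aggregation: time-weighted average via a definite integral per piece."""
    total = 0.0
    duration = 0.0
    for seg in segments:
        a, b = seg.t_start, seg.t_end
        # Integral of slope * t + intercept over [a, b].
        total += 0.5 * seg.slope * (b * b - a * a) + seg.intercept * (b - a)
        duration += b - a
    return total / duration


# The sensor from the text, x(t) = 2t + 3, on an assumed range [0, 4].
model = [Segment(t_start=0.0, t_end=4.0, slope=2.0, intercept=3.0)]

print(crossing_times(model, 5.0))  # [1.0]
print(time_average(model))         # 7.0
```

Note that a crossing that falls exactly on the boundary shared by two adjacent pieces would be reported twice by this sketch; a real query processor would need to deduplicate such results.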

We have implemented FunctionDB as a standalone query processor in C++. The system is currently capable of executing arbitrary relational queries on piecewise linear functions of a single variable. We have also evaluated FunctionDB on two real data sets: measurements from a temperature sensor network deployed at Intel Research, Berkeley, and traffic traces from cars driving on Boston roads. Our results indicate that operating in the function domain is substantially more accurate, and up to an order of magnitude faster in query execution, than existing approaches that brute-force discretize models into points.
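To see why discretization can cost accuracy, consider the toy comparison below. It is a hypothetical illustration, not the evaluation described above: the function x(t) = 2t + 3, the range [0, 4], the grid step of 0.3, and the helper names discretize and grid_crossing are all assumptions chosen for the example.

```python
def x(t: float) -> float:
    """Toy model x(t) = 2t + 3 (assumed for illustration)."""
    return 2.0 * t + 3.0


def discretize(t_start: float, t_end: float, step: float):
    """Brute-force discretization: materialize (t, x(t)) tuples on a grid."""
    n = int(round((t_end - t_start) / step))
    return [(t_start + i * step, x(t_start + i * step)) for i in range(n + 1)]


def grid_crossing(points, level: float) -> float:
    """Approximate the crossing time by the sample whose value is nearest `level`."""
    return min(points, key=lambda p: abs(p[1] - level))[0]


# Function domain: solve 2t + 3 = 5 directly from the two model parameters.
exact = (5.0 - 3.0) / 2.0

# Point domain: scan the materialized tuples.
points = discretize(0.0, 4.0, 0.3)
approx = grid_crossing(points, 5.0)

print(len(points))          # number of tuples the grid query must scan
print(exact, approx)        # exact root vs. nearest grid sample
print(abs(exact - approx))  # error introduced by the grid
```

With this grid, no sample lands exactly on the crossing, so the point-domain answer is off by roughly a third of the grid step, while the function-domain answer is exact and needs only the two parameters of the line. A finer grid reduces the error but increases the number of tuples to store and scan.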

We are currently working on extending our algebraic query processing approach to higher dimensions, to support queries on spatial and other higher-dimensional data. Other planned work involves building a robust update infrastructure that automatically segments models into piecewise functions and updates regression models as new raw data or updates arrive in the system.

References

[1] Amol Deshpande and Samuel Madden. MauveDB: Supporting Model Based User Views in Database Systems. In Proceedings of SIGMOD Conference, 2006. Chicago, USA.

[2] IBM Intelligent Miner. http://www-306.ibm.com/software/data/iminer/.

Acknowledgements

Support for FunctionDB was provided by the NSF CAREER Program under grant number 0448124.