Source author record

Val Tannen

Val Tannen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Databases; Data Structures and Algorithms; Computational Complexity; Logic in Computer Science; Machine Learning; math.LO

Catalog footprint

What is connected

12 works

6 topics

4 close collaborators

preprint, 2026, arXiv

The Complexity of Finding Missing Answer Repairs

We investigate the problem of identifying database repairs for missing tuples in query answers. We show that when the query is part of the input (the combined complexity setting), determining whether or not a repair exists is polynomial-time equivalent to the satisfiability problem for classes of queries admitting a weak form of projection and selection. We then identify the sub-classes of unions of conjunctive queries with negated atoms, defined by the relational algebra operations permitted to appear in the query, for which the minimal repair problem can be solved in polynomial time. In contrast, we show that the problem is NP-hard, as well as set cover-hard to approximate via strict reductions, whenever both projection and join are permitted in the input query. Additionally, we show that finding the size of a minimal repair for unions of conjunctive queries (with negated atoms permitted) is OptP[log(n)]-complete, while computing a minimal repair is possible with $O(n^2)$ queries to an NP oracle. With recursion permitted, the combined complexity of all of these variants increases significantly, with an EXP lower bound. However, from the data complexity perspective, we show that minimal repairs can be identified in polynomial time for all queries expressible as semi-positive datalog programs.

preprint, 2022, arXiv

DBSP: Automatic Incremental View Maintenance for Rich Query Languages

Incremental view maintenance has long been a central problem in database theory. Many solutions have been proposed for restricted classes of database languages, such as the relational algebra or Datalog. These techniques do not naturally generalize to richer languages. In this paper we give a general solution to this problem in 3 steps: (1) we describe a simple but expressive language called DBSP for describing computations over data streams; (2) we give a general algorithm for solving the incremental view maintenance problem for arbitrary DBSP programs; and (3) we show how to model many rich database query languages (including the full relational queries, grouping and aggregation, monotonic and non-monotonic recursion, and streaming aggregation) using DBSP. As a consequence, we obtain efficient incremental view maintenance techniques for all these rich languages.

preprint, 2020, arXiv

Generalized Absorptive Polynomials and Provenance Semantics for Fixed-Point Logic

Semiring provenance is a successful approach to provide detailed information on the combinations of atomic facts that are responsible for the result of a query. In particular, interpretations in general provenance semirings of polynomials or formal power series give precise descriptions of the successful evaluation strategies for the query. While provenance analysis in databases has, for a long time, been largely confined to negation-free query languages, a recent approach extends this to model checking problems for logics with full negation. Algebraically this relies on new quotient semirings of dual-indeterminate polynomials or power series. So far, this approach has been developed mainly for first-order logic and for the positive fragment of least fixed-point logic. What has remained open is an adequate treatment for fixed-point calculi that admit arbitrary interleavings of least and greatest fixed points. We show that an adequate framework for the provenance analysis of full fixed-point logics is provided by semirings that are (1) fully continuous, (2) absorptive, and (3) chain-positive. Full continuity guarantees that provenance values of least and greatest fixed-points are well-defined. Absorptive semirings provide a symmetry between least and greatest fixed-point computations and make sure that provenance values of greatest fixed points are informative. Finally, chain-positivity is responsible for having truth-preserving interpretations, which give non-zero values to all true formulae. We further identify semirings of generalized absorptive polynomials and prove universal properties that make them the most general appropriate semirings for LFP. We illustrate the power of provenance interpretations in these semirings by relating them to provenance values of plays and strategies in the associated model-checking games.

preprint, 2020, arXiv

PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models

The ubiquitous use of machine learning algorithms brings new challenges to traditional database problems such as incremental view update. Much effort is being put into better understanding and debugging machine learning models, as well as into identifying and repairing errors in training datasets. Our focus is on how to assist these activities when the model has to be retrained after removing problematic training samples during cleaning or after selecting different subsets of training data for interpretability. This paper presents an efficient provenance-based approach, PrIU, and its optimized version, PrIU-opt, for incrementally updating model parameters without sacrificing prediction accuracy. We prove the correctness and convergence of the incrementally updated model parameters, and validate them experimentally. Experimental results show that PrIU-opt can achieve speed-ups of up to two orders of magnitude compared to simply retraining the model from scratch, while obtaining highly similar models.

preprint, 2016, arXiv

Incremental View Maintenance For Collection Programming

In the context of incremental view maintenance (IVM), delta query derivation is an essential technique for speeding up the processing of large, dynamic datasets. The goal is to generate delta queries that, given a small change in the input, can update the materialized view more efficiently than via recomputation. In this work we propose the first solution for the efficient incrementalization of positive nested relational calculus (NRC+) on bags (with integer multiplicities). More precisely, we model the cost of NRC+ operators and classify queries as efficiently incrementalizable if their delta has a strictly lower cost than full re-evaluation. Then, we identify IncNRC+, a large fragment of NRC+ that is efficiently incrementalizable, and we provide a semantics-preserving translation that takes any NRC+ query to a collection of IncNRC+ queries. Furthermore, we prove that incremental maintenance for NRC+ is within the complexity class NC0, and we showcase how recursive IVM, a technique that has provided significant speedups over traditional IVM in the case of flat queries [25], can also be applied to IncNRC+.
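To make the idea of a delta query concrete, here is a minimal sketch for the classical flat case only: a binary join over bags with integer multiplicities. It is not the NRC+ construction from the paper. The helper names `bag_add`, `bag_join` and `delta_join` are hypothetical, and bags are modelled as plain dictionaries from tuples to (possibly negative) multiplicities.

```python
from collections import defaultdict

# Assumption: a bag is a dict mapping a tuple to a nonzero integer multiplicity.
# Negative multiplicities in a delta stand for deletions.

def bag_add(x, y):
    """Bag union with integer multiplicities; zero entries are dropped."""
    out = defaultdict(int)
    for bag in (x, y):
        for t, m in bag.items():
            out[t] += m
    return {t: m for t, m in out.items() if m != 0}

def bag_join(r, s):
    """Join r(a, b) with s(b, c) on b; multiplicities multiply."""
    out = defaultdict(int)
    for (a, b), m in r.items():
        for (b2, c), n in s.items():
            if b == b2:
                out[(a, b, c)] += m * n
    return {t: m for t, m in out.items() if m != 0}

def delta_join(r, s, dr, ds):
    """Delta of the join: dR join S + R join dS + dR join dS."""
    return bag_add(bag_add(bag_join(dr, s), bag_join(r, ds)), bag_join(dr, ds))

# Hypothetical data: one tuple of R is replaced and one tuple is added to S.
r = {(1, "x"): 1}
s = {("x", 10): 2}
dr = {(2, "x"): 1, (1, "x"): -1}
ds = {("x", 20): 1}

view = bag_join(r, s)
updated_view = bag_add(view, delta_join(r, s, dr, ds))
```

The delta touches only tuples that join with the changes, which is why it can be cheaper than recomputing the join when `dr` and `ds` are small.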

preprint, 2015, arXiv

Algorithms for Provisioning Queries and Analytics

Provisioning is a technique for avoiding repeated expensive computations in what-if analysis. Given a query, an analyst formulates $k$ hypotheticals, each retaining some of the tuples of a database instance, possibly overlapping, and she wishes to answer the query under scenarios, where a scenario is defined by a subset of the hypotheticals that are "turned on". We say that a query admits compact provisioning if given any database instance and any $k$ hypotheticals, one can create a poly-size (in $k$) sketch that can then be used to answer the query under any of the $2^{k}$ possible scenarios without accessing the original instance. In this paper, we focus on provisioning complex queries that combine relational algebra (the logical component), grouping, and statistics/analytics (the numerical component). We first show that queries that compute quantiles or linear regression (as well as simpler queries that compute count and sum/average of positive values) can be compactly provisioned to provide (multiplicative) approximate answers to an arbitrary precision. In contrast, exact provisioning for each of these statistics requires the sketch size to be exponential in $k$. We then establish that for any complex query whose logical component is a positive relational algebra query, as long as the numerical component can be compactly provisioned, the complex query itself can be compactly provisioned. On the other hand, introducing negation or recursion in the logical component again requires the sketch size to be exponential in $k$. While our positive results use algorithms that do not access the original instance after a scenario is known, we prove our lower bounds even for the case when, knowing the scenario, limited access to the instance is allowed.

preprint, 2015, arXiv

Decidability of Equivalence of Aggregate Count-Distinct Queries

We address the problem of equivalence of count-distinct aggregate queries, prove that the problem is decidable, and show that it can be decided in the third level of the polynomial hierarchy. We introduce the notion of core for conjunctive queries with comparisons as an extension of the classical notion for relational queries, and prove that the existence of an isomorphism among cores of queries is a sufficient and necessary condition for equivalence of conjunctive queries with comparisons, similar to the classical relational setting. However, it is not a necessary condition for equivalence of count-distinct queries. We introduce a relaxation of this condition based on a new notion, which is a potentially new query equivalent to the initial query, introduced to capture the behavior of the count-distinct operator.

preprint, 2015, arXiv

Dynamic Sketching for Graph Optimization Problems with Applications to Cut-Preserving Sketches

In this paper, we introduce a new model for sublinear algorithms called dynamic sketching. In this model, the underlying data is partitioned into a large static part and a small dynamic part, and the goal is to compute a summary of the static part (i.e., a sketch) such that given any update for the dynamic part, one can combine it with the sketch to compute a given function. We say that a sketch is compact if its size is bounded by a polynomial function of the length of the dynamic data, (essentially) independent of the size of the static part. A graph optimization problem $P$ in this model is defined as follows. The input is a graph $G(V,E)$ and a set $T \subseteq V$ of $k$ terminals; the edges between the terminals are the dynamic part and the other edges in $G$ are the static part. The goal is to summarize the graph $G$ into a compact sketch (of size poly$(k)$) such that given any set $Q$ of edges between the terminals, one can answer the problem $P$ for the graph obtained by inserting all edges in $Q$ to $G$, using only the sketch. We study the fundamental problem of computing a maximum matching and prove tight bounds on the sketch size. In particular, we show that there exists a (compact) dynamic sketch of size $O(k^2)$ for the matching problem and any such sketch has to be of size $\Omega(k^2)$. Our sketch for matchings can be further used to derive compact dynamic sketches for other fundamental graph problems involving cuts and connectivities. Interestingly, our sketch for matchings can also be used to give an elementary construction of a cut-preserving vertex sparsifier with space $O(kC^2)$ for $k$-terminal graphs; here $C$ is the total capacity of the edges incident on the terminals. Additionally, we give an improved lower bound (in terms of $C$) of $\Omega(C/\log{C})$ on the size of cut-preserving vertex sparsifiers.

preprint, 2011, arXiv

On the Limitations of Provenance for Queries With Difference

The annotation of the results of database transformations was shown to be very effective for various applications. Until recently, most works in this context focused on positive query languages. The provenance semirings framework is a particular approach that has proven effective for these languages, and it was shown that when propagating provenance with semirings, the expected equivalence axioms of the corresponding query languages are satisfied. There have been several attempts to extend the framework to account for relational algebra queries with difference. We show here that these suggestions fail to satisfy some expected equivalence axioms (that in particular hold for queries on "standard" set and bag databases). Interestingly, we show that this is not a pitfall of these particular attempts, but rather every such attempt is bound to fail in satisfying these axioms, for some semirings. Finally, we show particular semirings for which an extension for supporting difference is (im)possible.

preprint, 2011, arXiv

Provenance for Aggregate Queries

We study in this paper provenance information for queries with aggregation. Provenance information was studied in the context of various query languages that do not allow for aggregation, and recent work has suggested capturing provenance by annotating the different database tuples with elements of a commutative semiring and propagating the annotations through query evaluation. We show that aggregate queries pose novel challenges rendering this approach inapplicable. Consequently, we propose a new approach, where we annotate with provenance information not just tuples but also the individual values within tuples, using provenance to describe the values' computation. We realize this approach in a concrete construction, first for "simple" queries where the aggregation operator is the last one applied, and then for arbitrary (positive) relational algebra queries with aggregation; the latter queries are shown to be more challenging in this context. Finally, we use aggregation to encode queries with difference, and study the semantics obtained for such queries on provenance annotated databases.
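As a rough illustration of the baseline the abstract starts from (tuple annotations from a commutative semiring, propagated through a positive query), the sketch below multiplies annotations in a join and adds them in a projection. It does not cover the paper's construction for aggregates. The names `Semiring`, `COUNTING`, `BOOLEAN`, `join` and `project`, the dictionary encoding of annotated relations, and the sample data are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    """A commutative semiring given by its two constants and two operations."""
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]

# Two standard instances: natural numbers (bag multiplicities) and Booleans (sets).
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)

# Assumption: an annotated relation is a dict from tuples to semiring elements.

def join(r, s, k):
    """Join r(a, b) with s(b, c) on b; annotations of joined tuples multiply."""
    out = {}
    for (a, b), x in r.items():
        for (b2, c), y in s.items():
            if b == b2:
                t = (a, b, c)
                out[t] = k.plus(out.get(t, k.zero), k.times(x, y))
    return out

def project(r, positions, k):
    """Keep the given positions; annotations of merged tuples add up."""
    out = {}
    for t, x in r.items():
        key = tuple(t[i] for i in positions)
        out[key] = k.plus(out.get(key, k.zero), x)
    return out

# Hypothetical annotated relations over the counting semiring.
r = {("a", "b1"): 2, ("a", "b2"): 3}
s = {("b1", "c"): 5, ("b2", "c"): 7}

joined = join(r, s, COUNTING)
result = project(joined, [0, 2], COUNTING)  # 2*5 + 3*7 for the tuple ("a", "c")
```

Swapping `COUNTING` for `BOOLEAN` (with `True` annotations) runs the same query under set semantics, which is the sense in which one evaluation scheme serves several annotation domains.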

preprint, 2011, arXiv

Putting Lipstick on Pig: Enabling Database-style Workflow Provenance

Workflow provenance typically assumes that each module is a "black-box", so that each output depends on all inputs (coarse-grained dependencies). Furthermore, it does not model the internal state of a module, which can change between repeated executions. In practice, however, an output may depend on only a small subset of the inputs (fine-grained dependencies) as well as on the internal state of the module. We present a novel provenance framework that marries database-style and workflow-style provenance, by using Pig Latin to expose the functionality of modules, thus capturing internal state and fine-grained dependencies. A critical ingredient in our solution is the use of a novel form of provenance graph that models module invocations and yields a compact representation of fine-grained workflow provenance. It also enables a number of novel graph transformation operations, allowing users to choose the desired level of granularity in provenance querying (ZoomIn and ZoomOut), and supporting "what-if" workflow analytic queries. We implemented our approach in the Lipstick system and developed a benchmark in support of a systematic performance evaluation. Our results demonstrate the feasibility of tracking and querying fine-grained workflow provenance.

preprint, 2010, arXiv

Faster Query Answering in Probabilistic Databases using Read-Once Functions

A boolean expression is in read-once form if each of its variables appears exactly once. When the variables denote independent events in a probability space, the probability of the event denoted by the whole expression in read-once form can be computed in polynomial time (whereas the general problem for arbitrary expressions is #P-complete). Known approaches to checking the read-once property seem to require putting these expressions in disjunctive normal form. In this paper, we tell a better story for a large subclass of boolean event expressions: those that are generated by conjunctive queries without self-joins and on tuple-independent probabilistic databases. We first show that given a tuple-independent representation and the provenance graph of an SPJ query plan without self-joins, we can, without using the DNF of a result event expression, efficiently compute its co-occurrence graph. From this, the read-once form can already, if it exists, be computed efficiently using existing techniques. Our second and key contribution is a complete, efficient, and simple-to-implement algorithm for computing the read-once forms (whenever they exist) directly, using a new concept, that of co-table graph, which can be significantly smaller than the co-occurrence graph.
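The polynomial-time claim for read-once expressions can be sketched as follows, assuming the expression is built from AND and OR over independent variables. Because no variable is shared between sibling subexpressions, the siblings are independent events, so AND multiplies probabilities and OR multiplies the complements. The classes `Var`, `And`, `Or` and the function `read_once_probability` are hypothetical names for this example; the sketch only evaluates a given read-once form and does not implement the paper's algorithm for finding one.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class And:
    children: Tuple["Expr", ...]

@dataclass(frozen=True)
class Or:
    children: Tuple["Expr", ...]

Expr = Union[Var, And, Or]

def read_once_probability(expr: Expr, prob: Dict[str, float]) -> float:
    """Probability of a read-once expression over independent variables.

    Assumes every variable occurs at most once in expr; the result is
    not meaningful otherwise, since subexpressions would be correlated.
    """
    if isinstance(expr, Var):
        return prob[expr.name]
    child_probs = [read_once_probability(c, prob) for c in expr.children]
    if isinstance(expr, And):
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    # Or: one minus the probability that every child is false.
    none_true = 1.0
    for p in child_probs:
        none_true *= 1.0 - p
    return 1.0 - none_true

# Hypothetical expression (x AND y) OR z with each variable true with probability 0.5.
expr = Or((And((Var("x"), Var("y"))), Var("z")))
p = read_once_probability(expr, {"x": 0.5, "y": 0.5, "z": 0.5})
```

The function visits each node of the expression once, so its running time is linear in the size of the read-once form.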
