Source author record

Benny Kimelfeld

Benny Kimelfeld appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Topics: Databases; Computer Science and Game Theory; Computational Complexity; Artificial Intelligence; Computation and Language; Data Structures and Algorithms; Human-Computer Interaction; Information Retrieval; Machine Learning; Programming Languages; Social and Information Networks

Catalog footprint

15 works · 11 topics · 4 close collaborators

Preprint · 2026 · arXiv

Database Views as Explanations for Relational Deep Learning

In recent years, there has been significant progress in the development of deep learning models over relational databases, including architectures based on heterogeneous graph neural networks (hetero-GNNs) and heterogeneous graph transformers. In effect, such architectures state how the database records and links (e.g., foreign-key references) translate into a large, complex numerical expression, involving numerous learnable parameters. This complexity makes it hard to explain, in human-understandable terms, how a model uses the available data to arrive at a given prediction. We present a novel framework for explaining machine-learning models over relational databases, where explanations are view definitions that highlight focused parts of the database that mostly contribute to the model's prediction. We establish such global abductive explanations by adapting the classic notion of determinacy by Nash, Segoufin, and Vianu (2010). In addition to tuning the tradeoff between determinacy and conciseness, the framework allows controlling the level of granularity by adopting different fragments of view definitions, such as ones highlighting whole columns, foreign keys between tables, relevant groups of tuples, and so on. We investigate the realization of the framework in the case of hetero-GNNs, and develop a model-specific approach via the notion of learnable masks. For comparison, we propose model-agnostic heuristic baselines and show that our approach is both more efficient and achieves better explanation quality in most cases. Our extensive empirical evaluation on the RelBench collection across diverse domains and record-level tasks demonstrates both the usefulness of our explanations and the efficiency of their generation.

Preprint · 2026 · arXiv

The Importance of Parameters in Ranking Functions

How important is the weight of a given column in determining the ranking of tuples in a table? To address such an explanation question about a ranking function, we investigate the computation of SHAP scores for column weights, adopting a recent framework by Grohe et al. [ICDT'24]. The exact definition of this score depends on three key components: (1) the ranking function in use, (2) an effect function that quantifies the impact of using alternative weights on the ranking, and (3) an underlying weight distribution. We analyze the computational complexity of different instantiations of this framework for a range of fundamental ranking and effect functions, focusing on probabilistically independent finite distributions for individual columns. For the ranking functions, we examine lexicographic orders and score-based orders defined by the summation, minimum, and maximum functions. For the effect functions, we consider global, top-k, and local perspectives: global measures quantify the divergence between the perturbed and original rankings, top-k measures inspect the change in the set of top-k answers, and local measures capture the impact on an individual tuple of interest. Although all cases admit an additive fully polynomial-time randomized approximation scheme (FPRAS), we establish the complexity of exact computation, identifying which cases are solvable in polynomial time and which are #P-hard. We further show that all complexity results, lower bounds and upper bounds, extend to a related task of computing the Shapley value of whole columns (regardless of their weight).

Preprint · 2022 · arXiv

Computing the Shapley Value of Facts in Query Answering

The Shapley value is a game-theoretic notion for wealth distribution that is nowadays extensively used to explain complex data-intensive computation, for instance, in network analysis or machine learning. Recent theoretical works show that query evaluation over relational databases fits well in this explanation paradigm. Yet, these works fall short of providing practical solutions to the computational challenge inherent to the Shapley computation. We present in this paper two practically effective solutions for computing Shapley values in query answering. We start by establishing a tight theoretical connection to the extensively studied problem of query evaluation over probabilistic databases, which allows us to obtain a polynomial-time algorithm for the class of queries for which probability computation is tractable. We then propose a first practical solution for computing Shapley values that adopts tools from probabilistic query evaluation. In particular, we capture the dependence of query answers on input database facts using Boolean expressions (data provenance), and then transform it, via Knowledge Compilation, into a particular circuit form for which we devise an algorithm for computing the Shapley values. Our second practical solution is a faster yet inexact approach that transforms the provenance to a Conjunctive Normal Form and uses a heuristic to compute the Shapley values. Our experiments on TPC-H and IMDB demonstrate the practical effectiveness of our solutions.
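
As a point of reference for the abstract above, the following minimal sketch computes the Shapley value of each fact directly from its definition, by averaging marginal contributions over all orderings of the facts. It is an exponential-time illustration only, not the knowledge-compilation or CNF-based methods the paper describes. The function name `shapley_values`, the toy facts, and the Boolean query `q` are assumptions made for this example.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(facts, query):
    """Brute-force Shapley value of every fact for a numeric query.

    `query` maps a set of facts to a number (0/1 for a Boolean query).
    Runs in time factorial in the number of facts; illustration only.
    """
    facts = list(facts)
    totals = {f: Fraction(0) for f in facts}
    n_orders = 0
    for order in permutations(facts):
        n_orders += 1
        present = set()
        before = query(present)
        for f in order:
            present.add(f)
            after = query(present)
            totals[f] += after - before  # marginal contribution of f
            before = after
    return {f: totals[f] / n_orders for f in facts}


# Hypothetical toy database and Boolean query q() :- R(x), S(x, y).
facts = [("R", "a"), ("S", "a", "b"), ("S", "a", "c")]


def q(db):
    return int(any(r[0] == "R" and s[0] == "S" and r[1] == s[1]
                   for r in db for s in db))


values = shapley_values(facts, q)
for fact, value in values.items():
    print(fact, value)
```

In this toy instance the `R` fact is needed by every derivation of the answer while either `S` fact can complete one, so the `R` fact receives the larger share, and the three values sum to the query result on the full database.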

Preprint · 2021 · arXiv

Computing the Extremal Possible Ranks with Incomplete Preferences

Various voting rules are based on ranking the candidates by scores induced by aggregating voter preferences. A winner (respectively, unique winner) is a candidate who receives a score not smaller than (respectively, strictly greater than) the remaining candidates. Examples of such rules include the positional scoring rules and the Bucklin, Copeland, and Maximin rules. When voter preferences are known in an incomplete manner as partial orders, a candidate can be a possible/necessary winner based on the possibilities of completing the partial votes. Past research has studied in depth the computational problems of determining the possible and necessary winners and unique winners. These problems are all special cases of reasoning about the range of possible positions of a candidate under different tiebreakers. We investigate the complexity of determining this range, and particularly the extremal positions. Among our results, we establish that finding each of the minimal and maximal positions is NP-hard for each of the above rules, including all positional scoring rules, pure or not. Hence, none of the tractable variants of necessary/possible winner determination remain tractable for extremal position determination. Tractability can be retained when reasoning about the top-$k$ positions for a fixed $k$. Yet, exceptional is Maximin where it is tractable to decide whether the maximal rank is $k$ for $k=1$ (necessary winning) but it becomes intractable for all $k>1$.

Preprint · 2021 · arXiv

Probabilistic Inference of Winners in Elections by Independent Random Voters

We investigate the problem of computing the probability of winning in an election where voter attendance is uncertain. More precisely, we study the setting where, in addition to a total ordering of the candidates, each voter is associated with a probability of attending the poll, and the attendances of different voters are probabilistically independent. We show that the probability of winning can be computed in polynomial time for the plurality and veto rules. However, it is computationally hard (#P-hard) for various other rules, including $k$-approval and $k$-veto for $k>1$, Borda, Condorcet, and Maximin. For some of these rules, it is even hard to find a multiplicative approximation since it is already hard to determine whether this probability is nonzero. In contrast, we devise a fully polynomial-time randomized approximation scheme (FPRAS) for the complement probability, namely the probability of losing, for every positional scoring rule (with polynomial scores), as well as for the Condorcet rule.
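
To make the tractable plurality case concrete, here is a minimal sketch of one way such a probability can be computed in polynomial time. It is not taken from the paper. It assumes that a winner is a candidate whose score is at least that of every other candidate (so ties, including the case where nobody attends, count as winning), and it relies on the observation that under plurality the scores of different candidates depend on disjoint sets of independent voters. All function names and the input encoding are hypothetical.

```python
def score_distribution(probs):
    """Distribution of how many of the given independent voters attend.

    Returns a list d where d[s] is the probability that exactly s attend.
    """
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1 - p)   # this voter stays home
            new[s + 1] += mass * p     # this voter attends
        dist = new
    return dist


def plurality_win_probability(voters, candidates, target):
    """Probability that `target` has a plurality score >= all others.

    `voters` is a list of (top_choice, attendance_probability) pairs.
    """
    dists = {c: score_distribution([p for top, p in voters if top == c])
             for c in candidates}
    total = 0.0
    for s, mass in enumerate(dists[target]):
        others_at_most_s = 1.0
        for c in candidates:
            if c != target:
                # Scores of distinct candidates are independent here.
                others_at_most_s *= sum(dists[c][: s + 1])
        total += mass * others_at_most_s
    return total


# Two voters, each attending with probability 1/2, with different favorites.
example = plurality_win_probability([("a", 0.5), ("b", 0.5)], ["a", "b"], "a")
print(example)
```

The decomposition by score value works because each voter contributes to exactly one candidate's plurality score; rules such as Borda lack this separation, which is consistent with the hardness results stated in the abstract.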

Preprint · 2020 · arXiv

Algorithmic Techniques for Necessary and Possible Winners

We investigate the practical aspects of computing the necessary and possible winners in elections over incomplete voter preferences. In the case of the necessary winners, we show how to implement and accelerate the polynomial-time algorithm of Xia and Conitzer. In the case of the possible winners, where the problem is NP-hard, we give a natural reduction to Integer Linear Programming (ILP) for all positional scoring rules and implement it in a leading commercial optimization solver. Further, we devise optimization techniques to minimize the number of ILP executions and, oftentimes, avoid them altogether. We conduct a thorough experimental study that includes the construction of a rich benchmark of election data based on real and synthetic data. Our findings suggest that, the worst-case intractability of the possible winners notwithstanding, the algorithmic techniques presented here scale well and can be used to compute the possible winners in realistic scenarios.

Preprint · 2020 · arXiv

Approximate Denial Constraints

The problem of mining integrity constraints from data has been extensively studied over the past two decades for commonly used types of constraints including the classic Functional Dependencies (FDs) and the more general Denial Constraints (DCs). In this paper, we investigate the problem of mining approximate DCs (i.e., DCs that are "almost" satisfied) from data. Considering approximate constraints allows us to discover more accurate constraints in inconsistent databases, detect rules that are generally correct but may have a few exceptions, as well as avoid overfitting and obtain more general and less contrived constraints. We introduce the algorithm ADCMiner for mining approximate DCs. An important feature of this algorithm is that it does not assume any specific definition of an approximate DC, but takes the semantics as input. Since there is more than one way to define an approximate DC and different definitions may produce very different results, we do not focus on one definition, but rather on a general family of approximation functions that satisfies some natural axioms defined in this paper and captures commonly used definitions of approximate constraints. We also show how our algorithm can be combined with sampling to return results with high accuracy while significantly reducing the running time.
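
The idea of an "almost" satisfied denial constraint, with the approximation semantics supplied as a parameter, can be sketched as follows. This is a hedged illustration and not the ADCMiner algorithm: it only checks one given constraint rather than mining constraints, and the measure used here (the fraction of ordered tuple pairs that violate the constraint) is just one possible definition, chosen as an assumption. The table, the predicate, and all function names are hypothetical.

```python
from itertools import permutations


def pair_violation_rate(rows, forbidden):
    """Fraction of ordered pairs of distinct rows that satisfy `forbidden`.

    `forbidden(t1, t2)` is the conjunction of predicates that the denial
    constraint says must never hold for a pair of tuples.
    """
    n = len(rows)
    if n < 2:
        return 0.0
    violations = sum(1 for i, j in permutations(range(n), 2)
                     if forbidden(rows[i], rows[j]))
    return violations / (n * (n - 1))


def holds_approximately(rows, forbidden, epsilon, measure=pair_violation_rate):
    """True if the constraint is violated to degree at most `epsilon`.

    The approximation semantics is passed in as `measure`.
    """
    return measure(rows, forbidden) <= epsilon


# Hypothetical table; the constraint forbids equal zip with different city.
rows = [
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "Boston"},
    {"zip": "02139", "city": "Cambridge"},
]


def same_zip_different_city(t1, t2):
    return t1["zip"] == t2["zip"] and t1["city"] != t2["city"]


rate = pair_violation_rate(rows, same_zip_different_city)
print(rate, holds_approximately(rows, same_zip_different_city, 0.5))
```

Passing `measure` as an argument mirrors the abstract's point that different definitions of approximation can give very different results, so the check itself should not hard-code one.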

Preprint · 2020 · arXiv

Geosocial Location Classification: Associating Type to Places Based on Geotagged Social-Media Posts

Associating type to locations can be used to enrich maps and can serve a plethora of geospatial applications. An automatic method to do so could make the process less expensive in terms of human labor, and faster to react to changes. In this paper we study the problem of Geosocial Location Classification, where the type of a site, e.g., a building, is discovered based on social-media posts. Our goal is to correctly associate a set of messages posted in a small radius around a given location with the corresponding location type, e.g., school, church, restaurant or museum. We explore two approaches to the problem: (a) a pipeline approach, where each message is first classified, and then the location associated with the message set is inferred from the individual message labels; and (b) a joint approach where the individual messages are simultaneously processed to yield the desired location type. We tested the two approaches over a dataset of geotagged tweets. Our results demonstrate the superiority of the joint approach. Moreover, we show that due to the unique structure of the problem, where weakly-related messages are jointly processed to yield a single final label, linear classifiers outperform deep neural network alternatives.

Preprint · 2020 · arXiv

Supporting Hard Queries over Probabilistic Preferences

Preference analysis is widely applied in various domains such as social choice and e-commerce. A recently proposed framework augments the relational database with a preference relation that represents uncertain preferences in the form of statistical ranking models, and provides methods to evaluate Conjunctive Queries (CQs) that express preferences among item attributes. In this paper, we explore the evaluation of queries that are more general and harder to compute. The main focus of this paper is on a class of CQs that cannot be evaluated by previous work. These queries are provably hard since they relate variables that represent items being compared. To overcome this hardness, we instantiate these variables with their domain values, rewrite hard CQs as unions of such instantiated queries, and develop several exact and approximate solvers to evaluate these unions of queries. We demonstrate that exact solvers that target specific common kinds of queries are far more efficient than general solvers. Further, we demonstrate that sophisticated approximate solvers making use of importance sampling can be orders of magnitude more efficient than exact solvers, while showing good accuracy. In addition to supporting provably hard CQs, we also present methods to evaluate an important family of count queries, and of top-k queries.

Preprint · 2020 · arXiv

The Complexity of Determining the Necessary and Possible Top-k Winners in Partial Voting Profiles

When voter preferences are known in an incomplete (partial) manner, winner determination is commonly treated as the identification of the necessary and possible winners; these are the candidates who win in all completions or at least one completion, respectively, of the partial voting profile. In the case of a positional scoring rule, the winners are the candidates who receive the maximal total score from the voters. Yet, the outcome of an election might go beyond the absolute winners to the top-$k$ winners, as in the case of committee selection, primaries of political parties, and ranking in recruiting. We investigate the computational complexity of determining the necessary and possible top-$k$ winners over partial voting profiles. Our results apply to general classes of positional scoring rules and focus on the cases where $k$ is given as part of the input and where $k$ is fixed.

Preprint · 2020 · arXiv

ViS-Á-ViS: Detecting Similar Patterns in Annotated Literary Text

We present a web-based system called ViS-Á-ViS aiming to assist literary scholars in detecting repetitive patterns in an annotated textual corpus. Pattern detection is made possible using distant reading visualizations that highlight potentially interesting patterns. In addition, the system uses time-series alignment algorithms, and in particular, dynamic time warping (DTW), to detect patterns automatically. We present a case study where an ancient Hebrew poetry corpus was manually annotated with figurative language devices such as metaphors and similes and then loaded into the system. Preliminary results confirm the effectiveness of the system in analyzing the annotated data and in detecting literary patterns and similarities.
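
For readers unfamiliar with dynamic time warping, the textbook dynamic program is sketched below. It is a generic illustration, not the system's implementation. The encoding of an annotated text as a numeric series (for example, the number of annotated devices per line) is an assumption made for this example, as is the function name `dtw_distance`.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric series.

    Uses absolute difference as the local cost and allows each element to
    be matched with one or more consecutive elements of the other series.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] repeats a match
                                     cost[i][j - 1],      # b[j-1] repeats a match
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Hypothetical per-line annotation counts for two poems.
poem_x = [1, 2, 3]
poem_y = [1, 2, 2, 3]
print(dtw_distance(poem_x, poem_y))
```

Because warping lets one element align with a repeated run in the other series, the two sequences above are treated as the same pattern despite their different lengths.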

Preprint · 2019 · arXiv

The Impact of Negation on the Complexity of the Shapley Value in Conjunctive Queries

The Shapley value is a conventional and well-studied function for determining the contribution of a player to the coalition in a cooperative game. Among its applications in a plethora of domains, it has recently been proposed to use the Shapley value for quantifying the contribution of a tuple to the result of a database query. In particular, we have a thorough understanding of the tractability frontier for the class of Conjunctive Queries (CQs) and aggregate functions over CQs. It has also been established that a tractable (randomized) multiplicative approximation exists for every union of CQs. Nevertheless, all of these results are based on the monotonicity of CQs. In this work, we investigate the implication of negation on the complexity of Shapley computation, in both the exact and approximate senses. We generalize a known dichotomy to account for negated atoms. We also show that negation fundamentally changes the complexity of approximation. We do so by drawing a connection to the problem of deciding whether a tuple is "relevant" to a query, and by analyzing its complexity.

Preprint · 2016 · arXiv

Flexible Caching in Trie Joins

Traditional algorithms for multiway join computation are based on rewriting the order of joins and combining results of intermediate subqueries. Recently, several approaches have been proposed for algorithms that are "worst-case optimal" wherein all relations are scanned simultaneously. An example is Veldhuizen's Leapfrog Trie Join (LFTJ). An important advantage of LFTJ is its small memory footprint, due to the fact that intermediate results are full tuples that can be dumped immediately. However, since the algorithm does not store intermediate results, recurring joins must be reconstructed from the source relations, resulting in excessive memory traffic. In this paper, we address this problem by incorporating caches into LFTJ. We do so by adopting recent developments on join optimization, tying variable ordering to tree decomposition. While the traditional usage of tree decomposition computes the result for each bag in advance, our proposed approach incorporates caching directly into LFTJ and can dynamically adjust the size of the cache. Consequently, our solution balances memory usage and repeated computation, as confirmed by our experiments over SNAP datasets.

Preprint · 2016 · arXiv

Unambiguous Prioritized Repairing of Databases

In its traditional definition, a repair of an inconsistent database is a consistent database that differs from the inconsistent one in a "minimal way". Often, repairs are not equally legitimate, as it is desired to prefer one over another; for example, one fact is regarded more reliable than another, or a more recent fact should be preferred to an earlier one. Motivated by these considerations, researchers have introduced and investigated the framework of preferred repairs, in the context of denial constraints and subset repairs. There, a priority relation between facts is lifted towards a priority relation between consistent databases, and repairs are restricted to the ones that are optimal in the lifted sense. Three notions of lifting (and optimal repairs) have been proposed: Pareto, global, and completion. In this paper we investigate the complexity of deciding whether the priority relation suffices to clean the database unambiguously, or in other words, whether there is exactly one optimal repair. We show that the different lifting semantics entail highly different complexities. Under Pareto optimality, the problem is coNP-complete, in data complexity, for every set of functional dependencies (FDs), except for the tractable case of (equivalence to) one FD per relation. Under global optimality, one FD per relation is still tractable, but we establish $\Pi^{p}_{2}$-completeness for a relation with two FDs. In contrast, under completion optimality the problem is solvable in polynomial time for every set of FDs. In fact, we present a polynomial-time algorithm for arbitrary conflict hypergraphs. We further show that under a general assumption of transitivity, this algorithm solves the problem even for global optimality. The algorithm is extremely simple, but its proof of correctness is quite intricate.

Preprint · 2015 · arXiv

Declarative Statistical Modeling with Datalog

Formalisms for specifying statistical models, such as probabilistic-programming languages, typically consist of two components: a specification of a stochastic process (the prior), and a specification of observations that restrict the probability space to a conditional subspace (the posterior). Use cases of such formalisms include the development of algorithms in machine learning and artificial intelligence. We propose and investigate a declarative framework for specifying statistical models on top of a database, through an appropriate extension of Datalog. By virtue of extending Datalog, our framework offers a natural integration with the database, and has a robust declarative semantics. Our Datalog extension provides convenient mechanisms to include numerical probability functions; in particular, conclusions of rules may contain values drawn from such functions. The semantics of a program is a probability distribution over the possible outcomes of the input database with respect to the program; these outcomes are minimal solutions with respect to a related program with existentially quantified variables in conclusions. Observations are naturally incorporated by means of integrity constraints over the extensional and intensional relations. We focus on programs that use discrete numerical distributions, but even then the space of possible outcomes may be uncountable (as a solution can be infinite). We define a probability measure over possible outcomes by applying the known concept of cylinder sets to a probabilistic chase procedure. We show that the resulting semantics is robust under different chases. We also identify conditions guaranteeing that all possible outcomes are finite (and then the probability space is discrete). We argue that the framework we propose retains the purely declarative nature of Datalog, and allows for natural specifications of statistical models.
