Source author record

Kuldeep S. Meel

Kuldeep S. Meel appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Logic in Computer Science; Artificial Intelligence; Machine Learning; Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing; Applications; Computational Complexity; Computer Vision; Cryptography and Security; cs.CY; Information Theory; math.IT

Catalog footprint

What is connected

21 works

12 topics

4 close collaborators


preprint 2026 arXiv

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

Decision tree ensembles (DTE) are a popular model for a wide range of AI classification tasks, used in multiple safety critical domains, and hence verifying properties on these models has been an active topic of study over the last decade. One such verification question is the problem of sensitivity, which asks, given a DTE, whether a small change in a subset of features can lead to misclassification of the input. In this work, our focus is to build a quantitative notion of sensitivity, tailored to DTEs, by discretizing the input space of the model and enumerating the regions which are susceptible to sensitivity. We propose a novel algorithmic technique that can perform this computation efficiently, within a certified error and confidence bound. Our approach is based on encoding the problem as an algebraic decision diagram (ADD), and further splitting it into subproblems that can be solved efficiently and make the computation compositional and scalable. We evaluate the performance of our technique over benchmarks of varying size in terms of number of trees and depth, comparing it against the performance of model counters over the same problem encoding. Experimental results show that our tool XCount achieves significant speedup over other approaches and can scale well with the increasing sizes of the ensembles.

preprint 2022 arXiv

A Scalable Shannon Entropy Estimator

We revisit the well-studied problem of estimating the Shannon entropy of a probability distribution, now given access to a probability-revealing conditional sampling oracle. In this model, the oracle takes as input the representation of a set $S$ and returns a sample from the distribution obtained by conditioning on $S$, together with the probability of that sample in the distribution. Our work is motivated by applications of such algorithms in Quantitative Information Flow analysis (QIF) in programming-language-based security. Here, information-theoretic quantities capture the effort required on the part of an adversary to obtain access to confidential information. These applications demand accurate measurements when the entropy is small. Existing algorithms that do not use conditional samples require a number of queries that scales inversely with the entropy, which is unacceptable in this regime, and indeed, a lower bound by Batu et al. (STOC 2002) established that no algorithm using only sampling and evaluation oracles can obtain acceptable performance. On the other hand, prior work in the conditional sampling model by Chakraborty et al. (SICOMP 2016) only obtained a high-order polynomial query complexity, $\mathcal{O}(\frac{m^7}{ε^8}\log\frac{1}{δ})$ queries, to obtain additive $ε$-approximations on a domain of size $\mathcal{O}(2^m)$. We obtain multiplicative $(1+ε)$-approximations using only $\mathcal{O}(\frac{m}{ε^2}\log\frac{1}{δ})$ queries to the probability-revealing conditional sampling oracle. Moreover, we obtain small, explicit constants, and demonstrate that our algorithm obtains a substantial improvement in practice over the previous state-of-the-art methods used for entropy estimation in QIF.
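To illustrate why an oracle that reveals the probability of each sample is useful, recall that the Shannon entropy is the expectation of $-\log_2 D(x)$ under $D$, so averaging that quantity over returned (sample, probability) pairs gives an unbiased estimate. The following is a minimal sketch of that plain Monte Carlo idea only, with no conditioning on a set $S$; it is not the algorithm of the paper, and `make_oracle` and `estimate_entropy` are hypothetical names introduced here.

```python
import math
import random


def make_oracle(dist, rng):
    """Build a toy oracle (assumption: dist is an explicit dict of probabilities).

    Each call returns a sample drawn from dist together with its probability.
    """
    items = list(dist.keys())
    weights = [dist[x] for x in items]

    def oracle():
        x = rng.choices(items, weights=weights, k=1)[0]
        return x, dist[x]

    return oracle


def estimate_entropy(oracle, num_samples):
    """Average -log2(p) over samples; this is an unbiased estimate of the entropy."""
    total = 0.0
    for _ in range(num_samples):
        _, p = oracle()
        total += -math.log2(p)
    return total / num_samples
```

For a uniform distribution every sample contributes the same term, so the estimate is exact; for skewed distributions the variance of $-\log_2 D(x)$ governs how many samples are needed, which is where a more careful algorithm becomes necessary.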

preprint 2022 arXiv

Algorithmic Fairness Verification with Graphical Models

In recent years, machine learning (ML) algorithms have been deployed in safety-critical and high-stake decision-making, where the fairness of algorithms is of paramount importance. Fairness in ML centers on detecting bias towards certain demographic populations induced by an ML classifier and proposes algorithmic solutions to mitigate the bias with respect to different fairness definitions. To this end, several fairness verifiers have been proposed that compute the bias in the prediction of an ML classifier--essentially beyond a finite dataset--given the probability distribution of input features. In the context of verifying linear classifiers, existing fairness verifiers are limited by accuracy due to imprecise modeling of correlations among features and scalability due to restrictive formulations of the classifiers as SSAT/SMT formulas or by sampling. In this paper, we propose an efficient fairness verifier, called FVGM, that encodes the correlations among features as a Bayesian network. In contrast to existing verifiers, FVGM proposes a stochastic subset-sum based approach for verifying linear classifiers. Experimentally, we show that FVGM leads to an accurate and scalable assessment for more diverse families of fairness-enhancing algorithms, fairness attacks, and group/causal fairness metrics than the state-of-the-art fairness verifiers. We also demonstrate that FVGM facilitates the computation of fairness influence functions as a stepping stone to detect the source of bias induced by subsets of features.

preprint 2022 arXiv

Efficient Learning of Interpretable Classification Rules

Machine learning has become omnipresent with applications in various safety-critical domains such as medicine, law, and transportation. In these domains, high-stakes decisions provided by machine learning necessitate researchers to design interpretable models, where the prediction is understandable to a human. In interpretable machine learning, rule-based classifiers are particularly effective in representing the decision boundary through a set of rules comprising input features. The interpretability of rule-based classifiers is in general related to the size of the rules, where smaller rules are considered more interpretable. To learn such a classifier, the brute-force direct approach is to consider an optimization problem that tries to learn the smallest classification rule that has close to maximum accuracy. This optimization problem is computationally intractable due to its combinatorial nature and thus, the problem is not scalable to large datasets. To this end, in this paper we study the triangular relationship among the accuracy, interpretability, and scalability of learning rule-based classifiers. The contribution of this paper is an interpretable learning framework IMLI, that is based on maximum satisfiability (MaxSAT) for synthesizing classification rules expressible in propositional logic. Despite the progress of MaxSAT solving in the last decade, the straightforward MaxSAT-based solution cannot scale. Therefore, we incorporate an efficient incremental learning technique inside the MaxSAT formulation by integrating mini-batch learning and iterative rule-learning. In our experiments, IMLI achieves the best balance among prediction accuracy, interpretability, and scalability. As an application, we deploy IMLI in learning popular interpretable classifiers such as decision lists and decision sets.

preprint 2021 arXiv

Predicting Forest Fire Using Remote Sensing Data And Machine Learning

Over the last few decades, deforestation and climate change have caused an increasing number of forest fires. In Southeast Asia, Indonesia has been the most affected country by tropical peatland forest fires. These fires have a significant impact on the climate resulting in extensive health, social and economic issues. Existing forest fire prediction systems, such as the Canadian Forest Fire Danger Rating System, are based on handcrafted features and require installation and maintenance of expensive instruments on the ground, which can be a challenge for developing countries such as Indonesia. We propose a novel, cost-effective, machine-learning based approach that uses remote sensing data to predict forest fires in Indonesia. Our prediction model achieves more than 0.81 area under the receiver operating characteristic (ROC) curve, performing significantly better than the baseline approach which never exceeds 0.70 area under ROC curve on the same tasks. Our model's performance remained above 0.81 area under ROC curve even when evaluated with reduced data. The results support our claim that machine-learning based approaches can lead to reliable and cost-effective forest fire prediction systems.

preprint 2020 arXiv

Designing New Phase Selection Heuristics

CDCL-based SAT solvers have transformed the field of automated reasoning owing to their demonstrated efficiency at handling problems arising from diverse domains. The success of CDCL solvers is owed to the design of clever heuristics that enable the tight coupling of different components. One of the core components is phase selection, wherein the solver, during branching, decides the polarity of the branch to be explored for a given variable. Most of the state-of-the-art CDCL SAT solvers employ phase-saving as a phase selection heuristic, which was proposed to address the potential inefficiencies arising from far-backtracking. In light of the emergence of chronological backtracking in CDCL solvers, we re-examine the efficiency of phase saving. Our empirical evaluation leads to a surprising conclusion: The usage of phase saving and random selection of polarity during chronological backtracking leads to indistinguishable runtime performance in terms of instances solved and PAR-2 score. We introduce Decaying Polarity Score (DPS) to capture the trend of the polarities attained by the variable, and upon observing a lack of performance improvement due to DPS, we turn to a more sophisticated heuristic seeking to capture the activity of literals and the trend of polarities: Literal State Independent Decaying Sum (LSIDS). We find that the 2019 winning SAT solver, Maple_LCM_Dist_ChronoBTv3, augmented with LSIDS solves 6 more instances while achieving a reduction of over 125 seconds in PAR-2 score, a significant improvement in the context of the SAT competition.

preprint 2020 arXiv

Efficient Distance Approximation for Structured High-Dimensional Distributions via Learning

We design efficient distance approximation algorithms for several classes of structured high-dimensional distributions. Specifically, we show algorithms for the following problems:
- Given sample access to two Bayesian networks $P_1$ and $P_2$ over known directed acyclic graphs $G_1$ and $G_2$ having $n$ nodes and bounded in-degree, approximate $d_{tv}(P_1,P_2)$ to within additive error $ε$ using $poly(n,ε)$ samples and time
- Given sample access to two ferromagnetic Ising models $P_1$ and $P_2$ on $n$ variables with bounded width, approximate $d_{tv}(P_1, P_2)$ to within additive error $ε$ using $poly(n,ε)$ samples and time
- Given sample access to two $n$-dimensional Gaussians $P_1$ and $P_2$, approximate $d_{tv}(P_1, P_2)$ to within additive error $ε$ using $poly(n,ε)$ samples and time
- Given access to observations from two causal models $P$ and $Q$ on $n$ variables that are defined over known causal graphs, approximate $d_{tv}(P_a, Q_a)$ to within additive error $ε$ using $poly(n,ε)$ samples, where $P_a$ and $Q_a$ are the interventional distributions obtained by the intervention $do(A=a)$ on $P$ and $Q$ respectively for a particular variable $A$.
Our results are the first efficient distance approximation algorithms for these well-studied problems. They are derived using a simple and general connection to distribution learning algorithms. The distance approximation algorithms imply new efficient algorithms for tolerant testing of closeness of the above-mentioned structured high-dimensional distributions.
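The connection between learning and distance approximation can be illustrated in the simplest discrete setting: learn each distribution from samples, then compute the total variation distance between the two learned models, $d_{tv}(P,Q) = \frac{1}{2}\sum_x |P(x) - Q(x)|$. The sketch below assumes small finite supports and uses empirical frequencies as the learner; it is not the paper's algorithm for Bayesian networks, Ising models, Gaussians or causal models, and `empirical` and `tv_distance` are hypothetical names.

```python
from collections import Counter


def empirical(samples):
    """Learn a discrete distribution as the empirical frequencies of the samples."""
    counts = Counter(samples)
    n = len(samples)
    return {x: c / n for x, c in counts.items()}


def tv_distance(p, q):
    """Total variation distance: half the L1 distance over the joint support."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def approximate_tv(samples_p, samples_q):
    """Plug-in estimate: the distance between the two learned distributions."""
    return tv_distance(empirical(samples_p), empirical(samples_q))
```

The plug-in approach enumerates the support, which is only feasible for small domains; the structured classes in the abstract are exactly the cases where the support is exponentially large and this naive enumeration does not apply.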

preprint 2020 arXiv

IMLI: An Incremental Framework for MaxSAT-Based Learning of Interpretable Classification Rules

The wide adoption of machine learning in critical domains such as medical diagnosis, law, and education has propelled the need for interpretable techniques, since end users need to understand the reasoning behind decisions made by learning systems. The computational intractability of interpretable learning led practitioners to design heuristic techniques, which fail to provide sound handles to trade off accuracy and interpretability. Motivated by the success of MaxSAT solvers over the past decade, a MaxSAT-based approach, called MLIC, was recently proposed that seeks to reduce the problem of learning interpretable rules expressed in Conjunctive Normal Form (CNF) to a MaxSAT query. While MLIC was shown to achieve accuracy similar to that of other state of the art black-box classifiers while generating small interpretable CNF formulas, the runtime performance of MLIC is significantly lagging and renders the approach unusable in practice. In this context, the authors raised the question: Is it possible to achieve the best of both worlds, i.e., a sound framework for interpretable learning that can take advantage of MaxSAT solvers while scaling to real-world instances? In this paper, we take a step towards answering the above question in the affirmative. We propose IMLI: an incremental approach to the MaxSAT-based framework that achieves scalable runtime performance via a partition-based training methodology. Extensive experiments on benchmarks arising from the UCI repository demonstrate that IMLI achieves up to three orders of magnitude runtime improvement without loss of accuracy and interpretability.

preprint 2020 arXiv

Induction Models on $\mathbb{N}$

Mathematical induction is a fundamental tool in computer science and mathematics. Henkin initiated the study of formalization of mathematical induction restricted to the setting when the base case B is set to the singleton set containing 0 and a unary generating function S. The usage of mathematical induction often involves a wider set of base cases and k-ary generating functions with different structural restrictions. While subsequent studies have shown several Induction Models to be equivalent, there does not exist a precise logical characterization of reduction and equivalence among different Induction Models. In this paper, we generalize the definition of Induction Model and demonstrate the existence and construction of S for given B and vice versa. We then provide a formal characterization of the reduction among different Induction Models that can allow proofs in one Induction Model to be expressed as proofs in another Induction Model. The notion of reduction allows us to capture equivalence among Induction Models.

preprint 2020 arXiv

Manthan: A Data Driven Approach for Boolean Function Synthesis

Boolean functional synthesis is a fundamental problem in computer science with wide-ranging applications and has witnessed a surge of interest resulting in progressively improved techniques over the past decade. Despite intense algorithmic development, a large number of problems remain beyond the reach of the state of the art techniques. Motivated by the progress in machine learning, we propose Manthan, a novel data-driven approach to Boolean functional synthesis. Manthan views functional synthesis as a classification problem, relying on advances in constrained sampling for data generation, and advances in automated reasoning for a novel proof-guided refinement and provable verification. On an extensive and rigorous evaluation over 609 benchmarks, we demonstrate that Manthan significantly improves upon the current state of the art, solving 356 benchmarks in comparison to 280, which is the most solved by a state of the art technique; thereby, we demonstrate an increase of 76 benchmarks over the current state of the art. Furthermore, Manthan solves 60 benchmarks that none of the current state of the art techniques could solve. The significant performance improvements, along with our detailed analysis, highlight several interesting avenues of future work at the intersection of machine learning, constrained sampling, and automated reasoning.

preprint 2020 arXiv

Phase Transition Behavior in Knowledge Compilation

The study of phase transition behaviour in SAT has led to deeper understanding and algorithmic improvements of modern SAT solvers. Motivated by these prior studies of phase transitions in SAT, we seek to study the behaviour of size and compile-time behaviour for random k-CNF formulas in the context of knowledge compilation. We perform a rigorous empirical study and analysis of the size and runtime behavior for different knowledge compilation forms (and their corresponding compilation algorithms): d-DNNFs, SDDs and OBDDs across multiple tools and compilation algorithms. We employ instances generated from the random k-CNF model with varying generation parameters to empirically reason about the expected and median behavior of size and compilation-time for these languages. Our work is similar in spirit to the early work in CSP community on phase transition behavior in SAT/CSP. In a similar spirit, we identify the interesting behavior with respect to different parameters: clause density and solution density, a novel control parameter that we identify for the study of phase transition behavior in the context of knowledge compilation. Furthermore, we summarize our empirical study in terms of two concrete conjectures; a rigorous study of these conjectures will possibly require new theoretical tools.

preprint 2020 arXiv

Sparse Hashing for Scalable Approximate Model Counting: Theory and Practice

Given a CNF formula F on n variables, the problem of model counting or #SAT is to compute the number of satisfying assignments of F. Model counting is a fundamental but hard problem in computer science with varied applications. Recent years have witnessed a surge of effort towards developing efficient algorithmic techniques that combine the classical 2-universal hashing with the remarkable progress in SAT solving over the past decade. These techniques augment the CNF formula F with random XOR constraints and invoke an NP oracle repeatedly on the resultant CNF-XOR formulas. In practice, calls to the NP oracle are replaced by calls to a SAT solver whose runtime performance is adversely affected by the size of XOR constraints. The standard construction of 2-universal hash functions chooses every variable with probability p = 1/2 leading to XOR constraints of size n/2 in expectation. Consequently, the challenge is to design sparse hash functions where variables can be chosen with smaller probability and lead to smaller sized XOR constraints. In this paper, we address this challenge from theoretical and practical perspectives. First, we formalize a relaxation of universal hashing, called concentrated hashing and establish a novel and beautiful connection between concentration measures of these hash functions and isoperimetric inequalities on boolean hypercubes. This allows us to obtain (log m) tight bounds on variance and dispersion index and show that p = O( log(m)/m ) suffices for the design of sparse hash functions from {0, 1}^n to {0, 1}^m. We then use sparse hash functions belonging to this concentrated hash family to develop new approximate counting algorithms. A comprehensive experimental evaluation of our algorithm on 1893 benchmarks demonstrates that usage of sparse hash functions can lead to significant speedups.
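The effect of the selection probability p on constraint size can be seen in a small sketch: each of the n variables joins a random XOR constraint independently with probability p, so the expected size is n·p (n/2 for the standard p = 1/2, smaller for sparse choices). The representation of a constraint as a (variable indices, parity bit) pair and the names `random_xor`, `satisfies` and `mean_xor_size` are assumptions made for this illustration, not the paper's construction.

```python
import random


def random_xor(n, p, rng):
    """Pick each of n variables independently with probability p, plus a random parity bit."""
    variables = [i for i in range(n) if rng.random() < p]
    parity = rng.randrange(2)
    return variables, parity


def satisfies(assignment, xor):
    """Check whether the XOR of the selected variables equals the parity bit."""
    variables, parity = xor
    return sum(assignment[i] for i in variables) % 2 == parity


def mean_xor_size(n, p, trials, rng):
    """Empirical average number of variables per constraint; approaches n * p."""
    return sum(len(random_xor(n, p, rng)[0]) for _ in range(trials)) / trials
```

With n = 100, calling `mean_xor_size` with p = 0.5 gives averages near 50, while p = 0.1 gives averages near 10, which is the kind of reduction in constraint size that sparse hashing aims for.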

preprint 2016 arXiv

Approximate Probabilistic Inference via Word-Level Counting

Hashing-based model counting has emerged as a promising approach for large-scale probabilistic inference on graphical models. A key component of these techniques is the use of xor-based 2-universal hash functions that operate over Boolean domains. Many counting problems arising in probabilistic inference are, however, naturally encoded over finite discrete domains. Techniques based on bit-level (or Boolean) hash functions require these problems to be propositionalized, making it impossible to leverage the remarkable progress made in SMT (Satisfiability Modulo Theory) solvers that can reason directly over words (or bit-vectors). In this work, we present the first approximate model counter that uses word-level hashing functions, and can directly leverage the power of sophisticated SMT solvers. Empirical evaluation over an extensive suite of benchmarks demonstrates the promise of the approach.

preprint 2016 arXiv

Distributed Phasers

A phaser is an expressive synchronization construct that unifies collective and point-to-point coordination with dynamic task parallelism. Each task can participate in a phaser as a signaler, a waiter, or both. The participants in a phaser may change over time as dynamic tasks are added and deleted. In this poster, we present a highly concurrent and scalable design of phasers for a distributed memory environment that is suitable for use with asynchronous partitioned global address space programming models. Our design for a distributed phaser employs a pair of skip lists augmented with the ability to collect and propagate synchronization signals. To enable a high degree of concurrency, addition and deletion of participant tasks are performed in two phases: a "fast single-link-modify" step followed by multiple hand-over-hand "lazy multi-link-modify" steps. We show that the cost of synchronization and structural operations on a distributed phaser scales logarithmically, even in the presence of concurrent structural modifications. To verify the correctness of our design for distributed phasers, we employ the SPIN model checker. To address the issue of state space explosion, we describe how we decompose the state space to separately verify correct handling for different kinds of messages, which enables complete model checking of our phaser design.

preprint 2015 arXiv

Constrained Sampling and Counting: Universal Hashing Meets SAT Solving

Constrained sampling and counting are two fundamental problems in artificial intelligence with a diverse range of applications, spanning probabilistic reasoning and planning to constrained-random verification. While the theory of these problems was thoroughly investigated in the 1980s, prior work either did not scale to industrial size instances or gave up correctness guarantees to achieve scalability. Recently, we proposed a novel approach that combines universal hashing and SAT solving and scales to formulas with hundreds of thousands of variables without giving up correctness guarantees. This paper provides an overview of the key ingredients of the approach and discusses challenges that need to be overcome to handle larger real-world instances.

preprint 2014 arXiv

ADHA: Automatic Data layout framework for Heterogeneous Architectures

Data layouts play a crucial role in determining the performance of a given application running on a given architecture. Existing parallel programming frameworks for both multicore and heterogeneous systems leave the onus of selecting a data layout to the programmer. Therefore, shifting the burden of data layout selection to optimizing compilers can greatly enhance programmer productivity and application performance. In this work, we introduce ADHA: a two-level hierarchical formulation of the data layout problem for modern heterogeneous architectures. We have created a reference implementation of ADHA in the Heterogeneous Habanero-C (H2C) parallel programming system. ADHA shows significant performance benefits of up to 6.92$\times$ compared to manually specified layouts for two benchmark programs running on a CPU+GPU heterogeneous platform.

preprint 2014 arXiv

Balancing Scalability and Uniformity in SAT Witness Generator

Constrained-random simulation is the predominant approach used in the industry for functional verification of complex digital designs. The effectiveness of this approach depends on two key factors: the quality of constraints used to generate test vectors, and the randomness of solutions generated from a given set of constraints. In this paper, we focus on the second problem, and present an algorithm that significantly improves the state-of-the-art of (almost-)uniform generation of solutions of large Boolean constraints. Our algorithm provides strong theoretical guarantees on the uniformity of generated solutions and scales to problems involving hundreds of thousands of variables.

preprint 2014 arXiv

Distribution-Aware Sampling and Weighted Model Counting for SAT

Given a CNF formula and a weight for each assignment of values to variables, two natural problems are weighted model counting and distribution-aware sampling of satisfying assignments. Both problems have a wide variety of important applications. Due to the inherent complexity of the exact versions of the problems, interest has focused on solving them approximately. Prior work in this area scaled only to small problems in practice, or failed to provide strong theoretical guarantees, or employed a computationally-expensive maximum a posteriori probability (MAP) oracle that assumes prior knowledge of a factored representation of the weight distribution. We present a novel approach that works with a black-box oracle for weights of assignments and requires only an NP-oracle (in practice, a SAT-solver) to solve both the counting and sampling problems. Our approach works under mild assumptions on the distribution of weights of satisfying assignments, provides strong theoretical guarantees, and scales to problems involving several thousand variables. We also show that the assumptions can be significantly relaxed while improving computational efficiency if a factored representation of the weights is known.

preprint 2014 arXiv

Sampling Techniques for Boolean Satisfiability

Boolean satisfiability (SAT) has played a key role in diverse areas spanning testing, formal verification, planning, optimization, inferencing and the like. Apart from the classical problem of checking boolean satisfiability, the problems of generating satisfying assignments uniformly at random, and of counting the total number of satisfying assignments have also attracted significant theoretical and practical interest over the years. Prior work offered heuristic approaches with very weak or no guarantee of performance, and theoretical approaches with proven guarantees, but poor performance in practice. We propose a novel approach based on limited-independence hashing that allows us to design algorithms for both problems, with strong theoretical guarantees and scalability extending to thousands of variables. Based on this approach, we present two practical algorithms, UniformWitness: a near uniform generator and ApproxMC: the first scalable approximate model counter, along with reference implementations. Our algorithms work by issuing polynomially many calls to a SAT solver. We demonstrate scalability of our algorithms over a large set of benchmarks arising from different application domains.

preprint 2013 arXiv

A Scalable and Nearly Uniform Generator of SAT Witnesses

Functional verification constitutes one of the most challenging tasks in the development of modern hardware systems, and simulation-based verification techniques dominate the functional verification landscape. A dominant paradigm in simulation-based verification is directed random testing, where a model of the system is simulated with a set of random test stimuli that are uniformly or near-uniformly distributed over the space of all stimuli satisfying a given set of constraints. Uniform or near-uniform generation of solutions for large constraint sets is therefore a problem of theoretical and practical interest. For Boolean constraints, prior work offered heuristic approaches with no guarantee of performance, and theoretical approaches with proven guarantees, but poor performance in practice. We offer here a new approach with theoretical performance guarantees and demonstrate its practical utility on large constraint sets.

preprint 2013 arXiv

A Scalable Approximate Model Counter

Propositional model counting (#SAT), i.e., counting the number of satisfying assignments of a propositional formula, is a problem of significant theoretical and practical interest. Due to the inherent complexity of the problem, approximate model counting, which counts the number of satisfying assignments to within given tolerance and confidence level, was proposed as a practical alternative to exact model counting. Yet, approximate model counting has been studied essentially only theoretically. The only reported implementation of approximate model counting, due to Karp and Luby, worked only for DNF formulas. A few existing tools for CNF formulas are bounding model counters; they can handle realistic problem sizes, but fall short of providing counts within given tolerance and confidence, and, thus, are not approximate model counters. We present here a novel algorithm, as well as a reference implementation, that is the first scalable approximate model counter for CNF formulas. The algorithm works by issuing a polynomial number of calls to a SAT solver. Our tool, ApproxMC, scales to formulas with tens of thousands of variables. Careful experimental comparisons show that ApproxMC reports, with high confidence, bounds that are close to the exact count, and also succeeds in reporting bounds with small tolerance and high confidence in cases that are too large for computing exact model counts.
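The general idea behind hashing-based counting can be sketched on a toy scale: adding m random XOR constraints keeps roughly a 2^-m fraction of the models, so the number of surviving models times 2^m estimates the total count. The sketch below is an illustration under strong simplifying assumptions and is not ApproxMC: brute-force enumeration stands in for the SAT solver, m is fixed by the caller, a single trial is run, and no tolerance or confidence guarantee is claimed. The name `count_with_xors` and the `is_model` predicate interface are hypothetical.

```python
import itertools
import random


def count_with_xors(n, is_model, m, rng):
    """Estimate the number of models of is_model over n Boolean variables.

    Draws m random XOR constraints (each variable included with probability 1/2),
    counts the models that satisfy all of them, and scales the result by 2**m.
    """
    xors = [
        ([i for i in range(n) if rng.random() < 0.5], rng.randrange(2))
        for _ in range(m)
    ]
    surviving = 0
    # Brute-force enumeration; a practical tool would query a SAT solver instead.
    for bits in itertools.product((0, 1), repeat=n):
        if is_model(bits) and all(
            sum(bits[i] for i in variables) % 2 == parity
            for variables, parity in xors
        ):
            surviving += 1
    return surviving * (2 ** m)
```

With m = 0 the function returns the exact count, and larger m trades accuracy of a single trial for a smaller surviving set; a practical counter would also need a principled choice of m and some aggregation across repeated trials, both of which are omitted here.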
