Source author record

Tova Milo

Tova Milo appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Databases Computational Complexity Data Structures and Algorithms Information Retrieval

Catalog footprint

What is connected

11works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

FEDEX: An Explainability Framework for Data Exploration Steps

When exploring a new dataset, Data Scientists often apply analysis queries, look for insights in the resulting dataframe, and repeat to apply further queries. We propose in this paper a novel solution that assists data scientists in this laborious process. In a nutshell, our solution pinpoints the most interesting (sets of) rows in each obtained dataframe. Uniquely, our definition of interest is based on the contribution of each row to the interestingness of different columns of the entire dataframe, which, in turn, is defined using standard measures such as diversity and exceptionality. Intuitively, interesting rows are ones that explain why (some column of) the analysis query result is interesting as a whole. Rows are correlated in their contribution and so the interesting score for a set of rows may not be directly computed based on that of individual rows. We address the resulting computational challenge by restricting attention to semantically-related sets, based on multiple notions of semantic relatedness; these sets serve as more informative explanations. Our experimental study across multiple real-world datasets shows the usefulness of our system in various scenarios.

preprint2022arXiv

Selecting Sub-tables for Data Exploration

We present a framework for creating small, informative sub-tables of large data tables to facilitate the first step of data science: data exploration. Given a large data table table T, the goal is to create a sub-table of small, fixed dimensions, by selecting a subset of T's rows and projecting them over a subset of T's columns. The question is: which rows and columns should be selected to yield an informative sub-table? We formalize the notion of "informativeness" based on two complementary metrics: cell coverage, which measures how well the sub-table captures prominent association rules in T, and diversity. Since computing optimal sub-tables using these metrics is shown to be infeasible, we give an efficient algorithm which indirectly accounts for association rules using table embedding. The resulting framework can be used for visualizing the complete sub-table, as well as for displaying the results of queries over the sub-table, enabling the user to quickly understand the results and determine subsequent queries. Experimental results show that we can efficiently compute high-quality sub-tables as measured by our metrics, as well as by feedback from user-studies.

preprint2020arXiv

Just in Time: Personal Temporal Insights for Altering Model Decisions

The interpretability of complex Machine Learning models is coming to be a critical social concern, as they are increasingly used in human-related decision-making processes such as resume filtering or loan applications. Individuals receiving an undesired classification are likely to call for an explanation -- preferably one that specifies what they should do in order to alter that decision when they reapply in the future. Existing work focuses on a single ML model and a single point in time, whereas in practice, both models and data evolve over time: an explanation for an application rejection in 2018 may be irrelevant in 2019 since in the meantime both the model and the applicant's data can change. To this end, we propose a novel framework that provides users with insights and plans for changing their classification in particular future time points. The solution is based on combining state-of-the-art algorithms for (single) model explanations, ones for predicting future models, and database-style querying of the obtained explanations. We propose to demonstrate the usefulness of our solution in the context of loan applications, and interactively engage the audience in computing and viewing suggestions tailored for applicants based on their unique characteristic.

preprint2014arXiv

Answering Conjunctive Queries with Inequalities

In this paper, we study the complexity of answering conjunctive queries (CQ) with inequalities). In particular, we are interested in comparing the complexity of the query with and without inequalities. The main contribution of our work is a novel combinatorial technique that enables us to use any Select-Project-Join query plan for a given CQ without inequalities in answering the CQ with inequalities, with an additional factor in running time that only depends on the query. The key idea is to define a new projection operator, which keeps a small representation (independent of the size of the database) of the set of input tuples that map to each tuple in the output of the projection; this representation is used to evaluate all the inequalities in the query. Second, we generalize a result by Papadimitriou-Yannakakis [17] and give an alternative algorithm based on the color-coding technique [4] to evaluate a CQ with inequalities by using an algorithm for the CQ without inequalities. Third, we investigate the structure of the query graph, inequality graph, and the augmented query graph with inequalities, and show that even if the query and the inequality graphs have bounded treewidth, the augmented graph not only can have an unbounded treewidth but can also be NP-hard to evaluate. Further, we illustrate classes of queries and inequalities where the augmented graphs have unbounded treewidth, but the CQ with inequalities can be evaluated in poly-time. Finally, we give necessary properties and sufficient properties that allow a class of CQs to have poly-time combined complexity with respect to any inequality pattern. We also illustrate classes of queries where our query-plan-based technique outperforms the alternative approaches discussed in the paper.

preprint2014arXiv

Answering Regular Path Queries on Workflow Provenance

This paper proposes a novel approach for efficiently evaluating regular path queries over provenance graphs of workflows that may include recursion. The approach assumes that an execution g of a workflow G is labeled with query-agnostic reachability labels using an existing technique. At query time, given g, G and a regular path query R, the approach decomposes R into a set of subqueries R1, ..., Rk that are safe for G. For each safe subquery Ri, G is rewritten so that, using the reachability labels of nodes in g, whether or not there is a path which matches Ri between two nodes can be decided in constant time. The results of each safe subquery are then composed, possibly with some small unsafe remainder, to produce an answer to R. The approach results in an algorithm that significantly reduces the number of subqueries k over existing techniques by increasing their size and complexity, and that evaluates each subquery in time bounded by its input and output size. Experimental results demonstrate the benefit of this approach.

preprint2014arXiv

Uncertainty in Crowd Data Sourcing under Structural Constraints

Applications extracting data from crowdsourcing platforms must deal with the uncertainty of crowd answers in two different ways: first, by deriving estimates of the correct value from the answers; second, by choosing crowd questions whose answers are expected to minimize this uncertainty relative to the overall data collection goal. Such problems are already challenging when we assume that questions are unrelated and answers are independent, but they are even more complicated when we assume that the unknown values follow hard structural constraints (such as monotonicity). In this vision paper, we examine how to formally address this issue with an approach inspired by [Amsterdamer et al., 2013]. We describe a generalized setting where we model constraints as linear inequalities, and use them to guide the choice of crowd questions and the processing of answers. We present the main challenges arising in this setting, and propose directions to solve them.

preprint2013arXiv

On the Complexity of Mining Itemsets from the Crowd Using Taxonomies

We study the problem of frequent itemset mining in domains where data is not recorded in a conventional database but only exists in human knowledge. We provide examples of such scenarios, and present a crowdsourcing model for them. The model uses the crowd as an oracle to find out whether an itemset is frequent or not, and relies on a known taxonomy of the item domain to guide the search for frequent itemsets. In the spirit of data mining with oracles, we analyze the complexity of this problem in terms of (i) crowd complexity, that measures the number of crowd questions required to identify the frequent itemsets; and (ii) computational complexity, that measures the computational effort required to choose the questions. We provide lower and upper complexity bounds in terms of the size and structure of the input taxonomy, as well as the size of a concise description of the output itemsets. We also provide constructive algorithms that achieve the upper bounds, and consider more efficient variants for practical situations.

preprint2012arXiv

A Propagation Model for Provenance Views of Public/Private Workflows

We study the problem of concealing functionality of a proprietary or private module when provenance information is shown over repeated executions of a workflow which contains both `public' and `private' modules. Our approach is to use `provenance views' to hide carefully chosen subsets of data over all executions of the workflow to ensure G-privacy: for each private module and each input x, the module's output f(x) is indistinguishable from G -1 other possible values given the visible data in the workflow executions. We show that G-privacy cannot be achieved simply by combining solutions for individual private modules; data hiding must also be `propagated' through public modules. We then examine how much additional data must be hidden and when it is safe to stop propagating data hiding. The answer depends strongly on the workflow topology as well as the behavior of public modules on the visible data. In particular, for a class of workflows (which include the common tree and chain workflows), taking private solutions for each private module, augmented with a `public closure' that is `upstream-downstream safe', ensures G-privacy. We define these notions formally and show that the restrictions are necessary. We also study the related optimization problems of minimizing the amount of hidden data.

preprint2012arXiv

Labeling Workflow Views with Fine-Grained Dependencies

This paper considers the problem of efficiently answering reachability queries over views of provenance graphs, derived from executions of workflows that may include recursion. Such views include composite modules and model fine-grained dependencies between module inputs and outputs. A novel view-adaptive dynamic labeling scheme is developed for efficient query evaluation, in which view specifications are labeled statically (i.e. as they are created) and data items are labeled dynamically as they are produced during a workflow execution. Although the combination of fine-grained dependencies and recursive workflows entail, in general, long (linear-size) data labels, we show that for a large natural class of workflows and views, labels are compact (logarithmic-size) and reachability queries can be evaluated in constant time. Experimental results demonstrate the benefit of this approach over the state-of-the-art technique when applied for labeling multiple views.

preprint2011arXiv

Provenance Views for Module Privacy

Scientific workflow systems increasingly store provenance information about the module executions used to produce a data item, as well as the parameter settings and intermediate data items passed between module executions. However, authors/owners of workflows may wish to keep some of this information confidential. In particular, a module may be proprietary, and users should not be able to infer its behavior by seeing mappings between all data inputs and outputs. The problem we address in this paper is the following: Given a workflow, abstractly modeled by a relation R, a privacy requirement Γand costs associated with data. The owner of the workflow decides which data (attributes) to hide, and provides the user with a view R' which is the projection of R over attributes which have not been hidden. The goal is to minimize the cost of hidden data while guaranteeing that individual modules are Γ-private. We call this the "secureview" problem. We formally define the problem, study its complexity, and offer algorithmic solutions.

preprint2011arXiv

Putting Lipstick on Pig: Enabling Database-style Workflow Provenance

Workflow provenance typically assumes that each module is a "black-box", so that each output depends on all inputs (coarse-grained dependencies). Furthermore, it does not model the internal state of a module, which can change between repeated executions. In practice, however, an output may depend on only a small subset of the inputs (fine-grained dependencies) as well as on the internal state of the module. We present a novel provenance framework that marries database-style and workflow-style provenance, by using Pig Latin to expose the functionality of modules, thus capturing internal state and fine-grained dependencies. A critical ingredient in our solution is the use of a novel form of provenance graph that models module invocations and yields a compact representation of fine-grained workflow provenance. It also enables a number of novel graph transformation operations, allowing to choose the desired level of granularity in provenance querying (ZoomIn and ZoomOut), and supporting "what-if" workflow analytic queries. We implemented our approach in the Lipstick system and developed a benchmark in support of a systematic performance evaluation. Our results demonstrate the feasibility of tracking and querying fine-grained workflow provenance.

Tova Milo

What is connected

Connect this record

See the researcher in context

Building this map preview

11 published item(s)

FEDEX: An Explainability Framework for Data Exploration Steps

Selecting Sub-tables for Data Exploration

Just in Time: Personal Temporal Insights for Altering Model Decisions

Answering Conjunctive Queries with Inequalities

Answering Regular Path Queries on Workflow Provenance

Uncertainty in Crowd Data Sourcing under Structural Constraints

On the Complexity of Mining Itemsets from the Crowd Using Taxonomies

A Propagation Model for Provenance Views of Public/Private Workflows

Labeling Workflow Views with Fine-Grained Dependencies

Provenance Views for Module Privacy

Putting Lipstick on Pig: Enabling Database-style Workflow Provenance