Source author record

Dan Vilenchik

Dan Vilenchik appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning; Data Structures and Algorithms; Information Theory; math.CO; math.IT; math.PR; math.ST; Statistics Theory; Artificial Intelligence; Computational Complexity; Computer Science and Game Theory; Discrete Mathematics; Distributed, Parallel, and Cluster Computing

Catalog footprint

What is connected

10 works

13 topics

4 close collaborators


preprint · 2022 · arXiv

Opinion Spam Detection: A New Approach Using Machine Learning and Network-Based Algorithms

E-commerce is the fastest-growing segment of the economy. Online reviews play a crucial role in helping consumers evaluate and compare products and services. As a result, fake reviews (opinion spam) are becoming more prevalent and negatively impacting customers and service providers. There are many reasons why it is hard to identify opinion spammers automatically, including the absence of reliable labeled data. This limitation precludes an off-the-shelf application of a machine learning pipeline. We propose a new method for classifying reviewers as spammers or benign, combining machine learning with a message-passing algorithm that capitalizes on the users' graph structure to compensate for the possible scarcity of labeled data. We devise a new way of sampling the labels for the training step (active learning), replacing the typical uniform sampling. Experiments on three large real-world datasets from Yelp.com show that our method outperforms state-of-the-art active learning approaches and also machine learning methods that use a much larger set of labeled data for training.

preprint · 2022 · arXiv

Welfare vs. Representation in Participatory Budgeting

Participatory budgeting (PB) is a democratic process for allocating funds to projects based on the votes of members of the community. Different rules have been used to aggregate participants' votes. Lackner and Skowron (2020) studied the trade-off between notions of social welfare and fairness in the multi-winner setting, a special case of participatory budgeting with identical project costs. But there is little understanding of this trade-off in the more general PB setting. This paper provides a theoretical and empirical study of the worst-case guarantees of several common rules to better understand the trade-off between social welfare and representation. We show that many of the guarantees from the multi-winner setting do not generalize to the PB setting, and that the introduction of costs leads to substantially worse guarantees, thereby exacerbating the welfare-representation trade-off. We further study how requiring voting rules to be proportional affects the guarantees on social welfare and representation. We study the latter point also empirically, both on real and synthetic datasets. We show that variants of the recently suggested voting rule Rule-X (which satisfies proportionality) do very well in practice both with respect to social welfare and representation.

preprint · 2020 · arXiv

A greedy anytime algorithm for sparse PCA

The taxing computational effort that is involved in solving some high-dimensional statistical problems, in particular problems involving non-convex optimization, has popularized the development and analysis of algorithms that run efficiently (polynomial-time) but with no general guarantee on statistical consistency. In light of the ever-increasing compute power and decreasing costs, a more useful characterization of algorithms is by their ability to calibrate the invested computational effort with various characteristics of the input at hand and with the available computational resources. For example, one can design an algorithm that always guarantees statistical consistency of its output by increasing the running time as the SNR weakens. We propose a new greedy algorithm for the $\ell_0$-sparse PCA problem which supports the calibration principle. We provide both a rigorous analysis of our algorithm in the spiked covariance model, as well as simulation results and comparison with other existing methods. Our findings show that our algorithm recovers the spike in SNR regimes where all polynomial-time algorithms fail while running in reasonable parallel time on a cluster.

preprint · 2015 · arXiv

Do semidefinite relaxations solve sparse PCA up to the information limit?

Estimating the leading principal components of data, assuming they are sparse, is a central task in modern high-dimensional statistics. Many algorithms were developed for this sparse PCA problem, from simple diagonal thresholding to sophisticated semidefinite programming (SDP) methods. A key theoretical question is under what conditions such algorithms can recover the sparse principal components. We study this question for a single-spike model with an $\ell_0$-sparse eigenvector, in the asymptotic regime as dimension $p$ and sample size $n$ both tend to infinity. Amini and Wainwright [Ann. Statist. 37 (2009) 2877-2921] proved that for sparsity levels $k \geq \Omega(n/\log p)$, no algorithm, efficient or not, can reliably recover the sparse eigenvector. In contrast, for $k \leq O(\sqrt{n/\log p})$, diagonal thresholding is consistent. It was further conjectured that an SDP approach may close this gap between computational and information limits. We prove that when $k \geq \Omega(\sqrt{n})$, the proposed SDP approach, at least in its standard usage, cannot recover the sparse spike. In fact, we conjecture that in the single-spike model, no computationally-efficient algorithm can recover a spike of $\ell_0$-sparsity $k \geq \Omega(\sqrt{n})$. Finally, we present empirical results suggesting that up to sparsity levels $k = O(\sqrt{n})$, recovery is possible by a simple covariance thresholding algorithm.
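As a rough illustration of the diagonal thresholding idea mentioned in this abstract, the sketch below keeps the $k$ coordinates with the largest sample variance as the estimated support of the sparse eigenvector. The function name `diagonal_thresholding_support` and the toy data are hypothetical and not taken from the paper; the procedure analyzed there may differ in its details.

```python
from typing import List


def diagonal_thresholding_support(samples: List[List[float]], k: int) -> List[int]:
    """Return the indices of the k coordinates with the largest sample variance.

    Assumption: `samples` is an n x p list of rows (n observations, p coordinates).
    """
    n = len(samples)
    p = len(samples[0])
    variances = []
    for j in range(p):
        column = [row[j] for row in samples]
        mean = sum(column) / n
        # Diagonal entry j of the (centered) sample covariance matrix.
        variances.append(sum((x - mean) ** 2 for x in column) / n)
    # Rank coordinates by variance and keep the top k as the support estimate.
    ranked = sorted(range(p), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: coordinates 0 and 2 carry most of the variance.
toy_samples = [
    [3.0, 0.1, -2.0, 0.0],
    [-3.0, 0.0, 2.0, 0.1],
    [3.0, -0.1, 2.0, 0.0],
    [-3.0, 0.1, -2.0, -0.1],
]
estimated_support = diagonal_thresholding_support(toy_samples, k=2)
```

In a full pipeline one would typically follow this step with an eigenvector computation restricted to the selected coordinates; that step is omitted here.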

preprint · 2013 · arXiv

Edge distribution in generalized graph products

Given a graph $G=(V,E)$, an integer $k$, and a function $f_G:V^k \times V^k \to \{0,1\}$, the $k^{th}$ graph product of $G$ w.r.t. $f_G$ is the graph with vertex set $V^k$, and an edge between two vertices $x=(x_1,\ldots,x_k)$ and $y=(y_1,\ldots,y_k)$ iff $f_G(x,y)=1$. Graph products are a basic combinatorial object, widely studied and used in different areas such as hardness of approximation, information theory, etc. We study graph products for functions $f_G$ of the form $f_G(x,y)=1$ iff there are at least $t$ indices $i \in [k]$ s.t. $(x_i,y_i)\in E$, where $t \in [k]$ is a fixed parameter in $f_G$. This framework generalizes the well-known graph tensor-product (obtained for $t=k$) and the graph or-product (obtained for $t=1$). The property that interests us is the edge distribution in such graphs. We show that if $G$ has a spectral gap, then the number of edges connecting "large-enough" sets in $G^k$ is "well-behaved", namely, it is close to the expected value, had the sets been random. We extend our results to bi-partite graph products as well. For a bi-partite graph $G=(X,Y,E)$, the $k^{th}$ bi-partite graph product of $G$ w.r.t. $f_G$ is the bi-partite graph with vertex sets $X^k$ and $Y^k$ and edges between $x \in X^k$ and $y \in Y^k$ iff $f_G(x,y)=1$. Finally, for both types of graph products, optimality is asserted using the "Converse to the Expander Mixing Lemma" obtained by Bilu and Linial in 2006. A byproduct of our proof technique is a new explicit construction of a family of co-spectral graphs.
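The threshold product defined in this abstract can be made concrete with a small brute-force sketch: two $k$-tuples are joined when at least $t$ of their coordinate pairs are edges of $G$. The function name `generalized_graph_product` and the example graph are illustrative assumptions, and the enumeration over all pairs in $V^k$ is only practical for tiny inputs.

```python
from itertools import product
from typing import FrozenSet, Hashable, Iterable, Set, Tuple


def generalized_graph_product(
    vertices: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
    k: int,
    t: int,
) -> Set[FrozenSet[Tuple[Hashable, ...]]]:
    """Edges of the k-th product of an undirected graph with threshold t.

    Tuples x and y are adjacent iff at least t indices i have (x_i, y_i) in E.
    """
    adjacent = {frozenset(e) for e in edges}  # undirected edge lookup
    tuples = list(product(list(vertices), repeat=k))
    result = set()
    for x in tuples:
        for y in tuples:
            hits = sum(1 for xi, yi in zip(x, y) if frozenset((xi, yi)) in adjacent)
            if hits >= t:
                result.add(frozenset((x, y)))
    return result


# Example: G is a single edge {0, 1} and k = 2.
tensor_edges = generalized_graph_product([0, 1], [(0, 1)], k=2, t=2)  # t = k
or_edges = generalized_graph_product([0, 1], [(0, 1)], k=2, t=1)      # t = 1
```

For this example the tensor product ($t=k$) has 2 edges and the or-product ($t=1$) has 6, since in the latter any two distinct pairs differ in at least one coordinate.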

preprint · 2013 · arXiv

How Hard is Counting Triangles in the Streaming Model

The problem of (approximately) counting the number of triangles in a graph is one of the basic problems in graph theory. In this paper we study the problem in the streaming model. We study the amount of memory required by a randomized algorithm to solve this problem. In case the algorithm is allowed one pass over the stream, we present a best possible lower bound of $\Omega(m)$ for graphs $G$ with $m$ edges on $n$ vertices. If a constant number of passes is allowed, we show a lower bound of $\Omega(m/T)$, where $T$ is the number of triangles. We match, in some sense, this lower bound with a 2-pass $O(m/T^{1/3})$-memory algorithm that solves the problem of distinguishing graphs with no triangles from graphs with at least $T$ triangles. We present a new graph parameter $\rho(G)$, the triangle density, and conjecture that the space complexity of the triangles problem is $\Omega(m/\rho(G))$. We match this by a second algorithm that solves the distinguishing problem using $O(m/\rho(G))$-memory.
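For context on the one-pass $\Omega(m)$ bound, the following sketch shows the naive baseline that stores every edge it has seen and counts triangles exactly in a single pass. It is not one of the paper's algorithms; the function name `count_triangles_one_pass` is hypothetical, and the point is only that this trivial approach already uses memory linear in $m$.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple


def count_triangles_one_pass(edge_stream: Iterable[Tuple[Hashable, Hashable]]) -> int:
    """Count triangles exactly in one pass, storing all edges (O(m) memory)."""
    neighbors: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    triangles = 0
    for u, v in edge_stream:
        if u == v or v in neighbors[u]:
            continue  # ignore self-loops and repeated edges
        # Each common neighbor of u and v closes one new triangle.
        triangles += len(neighbors[u] & neighbors[v])
        neighbors[u].add(v)
        neighbors[v].add(u)
    return triangles


# Example stream: the complete graph on 4 vertices, which has 4 triangles.
k4_stream = [(0, 1), (1, 2), (0, 2), (0, 3), (1, 3), (2, 3)]
k4_triangles = count_triangles_one_pass(k4_stream)
```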

preprint · 2013 · arXiv

Zero vs. epsilon Error in Interference Channels

Traditional studies of multi-source, multi-terminal interference channels typically allow a vanishing probability of error in communication. Motivated by the study of network coding, this work addresses the task of quantifying the loss in rate when insisting on zero error communication in the context of interference channels.

preprint · 2012 · arXiv

Getting directed Hamilton cycle twice faster

Consider the random graph process where we start with an empty graph on $n$ vertices, and at time $t$, are given an edge $e_t$ chosen uniformly at random among the edges which have not appeared so far. A classical result in random graph theory asserts that w.h.p. the graph becomes Hamiltonian at time $(1/2+o(1))n \log n$. On the contrary, if all the edges were directed randomly, then the graph has a directed Hamilton cycle w.h.p. only at time $(1+o(1))n \log n$. In this paper we further study the directed case, and ask whether it is essential to have twice as many edges compared to the undirected case. More precisely, we ask if at time $t$, instead of a random direction one is allowed to choose the orientation of $e_t$, then whether it is possible or not to make the resulting directed graph Hamiltonian at time earlier than $n \log n$. The main result of our paper answers this question in the strongest possible way, by asserting that one can orient the edges on-line so that w.h.p., the resulting graph has a directed Hamilton cycle exactly at the time at which the underlying graph is Hamiltonian.

preprint · 2010 · arXiv

Complete convergence of message passing algorithms for some satisfiability problems

In this paper we analyze the performance of Warning Propagation, a popular message passing algorithm. We show that for 3CNF formulas drawn from a certain distribution over random satisfiable 3CNF formulas, commonly referred to as the planted-assignment distribution, running Warning Propagation in the standard way (run message passing until convergence, simplify the formula according to the resulting assignment, and satisfy the remaining subformula, if necessary, using a simple "off the shelf" heuristic) results in a satisfying assignment when the clause-variable ratio is a sufficiently large constant.

preprint · 2010 · arXiv

Smoothed Analysis of Balancing Networks

In a balancing network each processor has an initial collection of unit-size jobs (tokens) and in each round, pairs of processors connected by balancers split their load as evenly as possible. An excess token (if any) is placed according to some predefined rule. As it turns out, this rule crucially affects the performance of the network. In this work we propose a model that studies this effect. We suggest a model bridging the uniformly-random assignment rule, and the arbitrary one (in the spirit of smoothed-analysis). We start with an arbitrary assignment of balancer directions and then flip each assignment with probability $\alpha$ independently. For a large class of balancing networks our result implies that after $O(\log n)$ rounds the discrepancy is $O((1/2-\alpha) \log n + \log \log n)$ with high probability. This matches and generalizes known upper bounds for $\alpha=0$ and $\alpha=1/2$. We also show that a natural network matches the upper bound for any $\alpha$.
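The mechanics described in this abstract can be sketched in a few lines: each balancer splits the combined load of its two processors evenly, its direction decides who receives an odd excess token, and the smoothed model flips each initial direction independently with probability $\alpha$. The names `smooth_directions` and `balancing_round`, and the 0/1 encoding of directions, are assumptions made for illustration; this is a simplified single round, not the paper's full model.

```python
import random
from typing import List, Sequence, Tuple


def smooth_directions(directions: Sequence[int], alpha: float, rng: random.Random) -> List[int]:
    """Flip each 0/1 balancer direction independently with probability alpha."""
    return [1 - d if rng.random() < alpha else d for d in directions]


def balancing_round(
    loads: Sequence[int],
    balancers: Sequence[Tuple[int, int]],
    directions: Sequence[int],
) -> List[int]:
    """Apply one round: each balancer (u, v) splits its total load evenly.

    Assumed encoding: direction 0 sends an excess token to u, direction 1 to v.
    """
    new_loads = list(loads)
    for (u, v), d in zip(balancers, directions):
        total = new_loads[u] + new_loads[v]
        half, excess = divmod(total, 2)
        new_loads[u] = half + (excess if d == 0 else 0)
        new_loads[v] = half + (excess if d == 1 else 0)
    return new_loads


# Four processors, two balancers forming a matching, fixed initial directions.
initial_loads = [5, 0, 3, 2]
matching = [(0, 1), (2, 3)]
after_one_round = balancing_round(initial_loads, matching, [0, 1])
```

With $\alpha=0$ the directions stay as given (the arbitrary rule), while $\alpha=1/2$ makes every direction uniformly random, which corresponds to the two extremes the abstract says the model bridges.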
