Source author record

Eli Upfal

Eli Upfal appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Distributed, Parallel, and Cluster Computing Databases Social and Information Networks Machine Learning Discrete Mathematics Applications Artificial Intelligence Cryptography and Security cs.CY eess.SY Methodology Performance Quantitative Methods Systems and Control

Catalog footprint

What is connected

25works

15topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Balanced allocation: considerations from large scale service environments

We study d-way balanced allocation, which assigns each incoming job to the lightest loaded among d randomly chosen servers. While prior work has extensively studied the performance of the basic scheme, there has been less published work on adapting this technique to many aspects of large-scale systems. Based on our experience in building and running planet-scale cloud applications, we extend the understanding of d-way balanced allocation along the following dimensions: (i) Bursts: Events such as breaking news can produce bursts of requests that may temporarily exceed the servicing capacity of the system. Thus, we explore what happens during a burst and how long it takes for the system to recover from such bursts. (ii) Priorities: Production systems need to handle jobs with a mix of priorities (e.g., user facing requests may be high priority while other requests may be low priority). We extend d-way balanced allocation to handle multiple priorities. (iii) Noise: Production systems are often typically distributed and thus d-way balanced allocation must work with stale or incorrect information. Thus we explore the impact of noisy information and their interactions with bursts and priorities. We explore the above using both extensive simulations and analytical arguments. Specifically we show, (i) using simulations, that d-way balanced allocation quickly recovers from bursts and can gracefully handle priorities and noise; and (ii) that analysis of the underlying generative models complements our simulations and provides insight into our simulation results.

preprint2022arXiv

How Inclusive Are Wikipedia's Hyperlinks in Articles Covering Polarizing Topics?

Wikipedia relies on an extensive review process to verify that the content of each individual page is unbiased and presents a neutral point of view. Less attention has been paid to possible biases in the hyperlink structure of Wikipedia, which has a significant influence on the user's exploration process when visiting more than one page. The evaluation of hyperlink bias is challenging because it depends on the global view rather than the text of individual pages. In this paper, we focus on the influence of the interconnect topology between articles describing complementary aspects of polarizing topics. We introduce a novel measure of exposure to diverse information to quantify users' exposure to different aspects of a topic throughout an entire surfing session, rather than just one click ahead. We apply this measure to six polarizing topics (e.g., gun control and gun right), and we identify cases in which the network topology significantly limits the exposure of users to diverse information on the topic, encouraging users to remain in a knowledge bubble. Our findings demonstrate the importance of evaluating Wikipedia's network structure in addition to the extensive review of individual articles.

preprint2021arXiv

RePBubLik: Reducing the Polarized Bubble Radius with Link Insertions

The topology of the hyperlink graph among pages expressing different opinions may influence the exposure of readers to diverse content. Structural bias may trap a reader in a polarized bubble with no access to other opinions. We model readers' behavior as random walks. A node is in a polarized bubble if the expected length of a random walk from it to a page of different opinion is large. The structural bias of a graph is the sum of the radii of highly-polarized bubbles. We study the problem of decreasing the structural bias through edge insertions. Healing all nodes with high polarized bubble radius is hard to approximate within a logarithmic factor, so we focus on finding the best $k$ edges to insert to maximally reduce the structural bias. We present RePBubLik, an algorithm that leverages a variant of the random walk closeness centrality to select the edges to insert. RePBubLik obtains, under mild conditions, a constant-factor approximation. It reduces the structural bias faster than existing edge-recommendation methods, including some designed to reduce the polarization of a graph.

preprint2016arXiv

ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages

We present ABRA, a suite of algorithms that compute and maintain probabilistically-guaranteed, high-quality, approximations of the betweenness centrality of all nodes (or edges) on both static and fully dynamic graphs. Our algorithms rely on random sampling and their analysis leverages on Rademacher averages and pseudodimension, fundamental concepts from statistical learning theory. To our knowledge, this is the first application of these concepts to the field of graph analysis. The results of our experimental evaluation show that our approach is much faster than exact methods, and vastly outperforms, in both speed and number of samples, current state-of-the-art algorithms with the same quality guarantees.

preprint2016arXiv

Controlling False Discoveries During Interactive Data Exploration

Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. The crux is that these tools implicitly allow the user to test a large body of different hypotheses with just a few clicks thus incurring in the issue commonly known in statistics as the multiple hypothesis testing error. In this paper, we propose solutions to integrate multiple hypothesis testing control into interactive data exploration tools. A key insight is that existing methods for controlling the false discovery rate (such as FDR) are not directly applicable for interactive data exploration. We therefore discuss a set of new control procedures that are better suited and integrated them in our system called Aware. By means of extensive experiments using both real-world and synthetic data sets we demonstrate how Aware can help experts and novice users alike to efficiently control false discoveries.

preprint2016arXiv

Scalable Betweenness Centrality Maximization via Sampling

Betweenness centrality is a fundamental centrality measure in social network analysis. Given a large-scale network, how can we find the most central nodes? This question is of key importance to numerous important applications that rely on betweenness centrality, including community detection and understanding graph vulnerability. Despite the large amount of work on designing scalable approximation algorithms for betweenness centrality, estimating it on large-scale networks remains a computational challenge. In this paper, we study the Betweenness Centrality Maximization problem: given a graph $G=(V,E)$ and a positive integer $k$, find a set $S^* \subseteq V$ that maximizes betweenness centrality subject to the cardinality constraint $|S^*| \leq k$. We present an efficient randomized algorithm that provides a $(1-1/e-ε)$-approximation with high probability, where $ε>0$. Our results improve the current state-of-the-art result by Yoshida~\cite{yoshida2014almost}. Furthermore, we provide theoretical evidence for the validity of a crucial assumption in the literature of betweenness centrality estimation, namely that in real-world networks $O(|V|^2)$ shortest paths pass through the top-$k$ central nodes, where $k$ is a constant. On the experimental side, we perform an extensive experimental analysis of our method on real-world networks, demonstrate its accuracy and scalability, and study different properties of central nodes. Finally, we provide three graph mining applications of our method.

preprint2016arXiv

TRIÈST: Counting Local and Global Triangles in Fully-dynamic Streams with Fixed Memory Size

We present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions. Our algorithms use reservoir sampling and its variants to exploit the user-specified memory space at all times. This is in contrast with previous approaches which use hard-to-choose parameters (e.g., a fixed sampling probability) and offer no guarantees on the amount of memory they will use. We show a full analysis of the variance of the estimations and novel concentration bounds for these quantities. Our experimental results on very large graphs show that TRIÈST outperforms state-of-the-art approaches in accuracy and exhibits a small update time.

preprint2015arXiv

A Practical Parallel Algorithm for Diameter Approximation of Massive Weighted Graphs

We present a space and time efficient practical parallel algorithm for approximating the diameter of massive weighted undirected graphs on distributed platforms supporting a MapReduce-like abstraction. The core of the algorithm is a weighted graph decomposition strategy generating disjoint clusters of bounded weighted radius. Theoretically, our algorithm uses linear space and yields a polylogarithmic approximation guarantee; moreover, for important practical classes of graphs, it runs in a number of rounds asymptotically smaller than those required by the natural approximation provided by the state-of-the-art $Δ$-stepping SSSP algorithm, which is its only practical linear-space competitor in the aforementioned computational scenario. We complement our theoretical findings with an extensive experimental analysis on large benchmark graphs, which demonstrates that our algorithm attains substantial improvements on a number of key performance indicators with respect to the aforementioned competitor, while featuring a similar approximation ratio (a small constant less than 1.4, as opposed to the polylogarithmic theoretical bound).

preprint2015arXiv

Fast Distributed PageRank Computation

Over the last decade, PageRank has gained importance in a wide range of applications and domains, ever since it first proved to be effective in determining node importance in large graphs (and was a pioneering idea behind Google's search engine). In distributed computing alone, PageRank vector, or more generally random walk based quantities have been used for several different applications ranging from determining important nodes, load balancing, search, and identifying connectivity structures. Surprisingly, however, there has been little work towards designing provably efficient fully-distributed algorithms for computing PageRank. The difficulty is that traditional matrix-vector multiplication style iterative methods may not always adapt well to the distributed setting owing to communication bandwidth restrictions and convergence rates. In this paper, we present fast random walk-based distributed algorithms for computing PageRanks in general graphs and prove strong bounds on the round complexity. We first present a distributed algorithm that takes $O\big(\log n/\eps \big)$ rounds with high probability on any graph (directed or undirected), where $n$ is the network size and $\eps$ is the reset probability used in the PageRank computation (typically $\eps$ is a fixed constant). We then present a faster algorithm that takes $O\big(\sqrt{\log n}/\eps \big)$ rounds in undirected graphs. Both of the above algorithms are scalable, as each node sends only small ($\polylog n$) number of bits over each edge per round. To the best of our knowledge, these are the first fully distributed algorithms for computing PageRank vector with provably efficient running time.

preprint2015arXiv

Optimizing Static and Adaptive Probing Schedules for Rapid Event Detection

We formulate and study a fundamental search and detection problem, Schedule Optimization, motivated by a variety of real-world applications, ranging from monitoring content changes on the web, social networks, and user activities to detecting failure on large systems with many individual machines. We consider a large system consists of many nodes, where each node has its own rate of generating new events, or items. A monitoring application can probe a small number of nodes at each step, and our goal is to compute a probing schedule that minimizes the expected number of undiscovered items at the system, or equivalently, minimizes the expected time to discover a new item in the system. We study the Schedule Optimization problem both for deterministic and randomized memoryless algorithms. We provide lower bounds on the cost of an optimal schedule and construct close to optimal schedules with rigorous mathematical guarantees. Finally, we present an adaptive algorithm that starts with no prior information on the system and converges to the optimal memoryless algorithms by adapting to observed data.

preprint2015arXiv

Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation

We develop a novel parallel decomposition strategy for unweighted, undirected graphs, based on growing disjoint connected clusters from batches of centers progressively selected from yet uncovered nodes. With respect to similar previous decompositions, our strategy exercises a tighter control on both the number of clusters and their maximum radius. We present two important applications of our parallel graph decomposition: (1) $k$-center clustering approximation; and (2) diameter approximation. In both cases, we obtain algorithms which feature a polylogarithmic approximation factor and are amenable to a distributed implementation that is geared for massive (long-diameter) graphs. The total space needed for the computation is linear in the problem size, and the parallel depth is substantially sublinear in the diameter for graphs with low doubling dimension. To the best of our knowledge, ours are the first parallel approximations for these problems which achieve sub-diameter parallel time, for a relevant class of graphs, using only linear space. Besides the theoretical guarantees, our algorithms allow for a very simple implementation on clustered architectures: we report on extensive experiments which demonstrate their effectiveness and efficiency on large graphs as compared to alternative known approaches.

preprint2015arXiv

Wiggins: Detecting Valuable Information in Dynamic Networks Using Limited Resources

Detecting new information and events in a dynamic network by probing individual nodes has many practical applications: discovering new webpages, analyzing influence properties in network, and detecting failure propagation in electronic circuits or infections in public drinkable water systems. In practice, it is infeasible for anyone but the owner of the network (if existent) to monitor all nodes at all times. In this work we study the constrained setting when the observer can only probe a small set of nodes at each time step to check whether new pieces of information (items) have reached those nodes. We formally define the problem through an infinite time generating process that places new items in subsets of nodes according to an unknown probability distribution. Items have an exponentially decaying novelty, modeling their decreasing value. The observer uses a probing schedule (i.e., a probability distribution over the set of nodes) to choose, at each time step, a small set of nodes to check for new items. The goal is to compute a schedule that minimizes the average novelty of undetected items. We present an algorithm, WIGGINS, to compute the optimal schedule through convex optimization, and then show how it can be adapted when the parameters of the problem must be learned or change over time. We also present a scalable variant of WIGGINS for the MapReduce framework. The results of our experimental evaluation on real social networks demonstrate the practicality of our approach.

preprint2014arXiv

Distributed Agreement in Dynamic Peer-to-Peer Networks

Motivated by the need for robust and fast distributed computation in highly dynamic Peer-to-Peer (P2P) networks, we study algorithms for the fundamental distributed agreement problem. P2P networks are highly dynamic networks that experience heavy node {\em churn}. Our goal is to design fast algorithms (running in a small number of rounds) that guarantee, despite high node churn rate, that almost all nodes reach a stable agreement. Our main contributions are randomized distributed algorithms that guarantee {\em stable almost-everywhere agreement} with high probability even under high adversarial churn in a polylogarithmic number of rounds: 1. An $O(\log^2 n)$-round ($n$ is the stable network size) randomized algorithm that achieves almost-everywhere agreement with high probability under up to {\em linear} churn {\em per round} (i.e., $εn$, for some small constant $ε> 0$), assuming that the churn is controlled by an oblivious adversary (that has complete knowledge and control of what nodes join and leave and at what time and has unlimited computational power, but is oblivious to the random choices made by the algorithm). Our algorithm requires only polylogarithmic in $n$ bits to be processed and sent (per round) by each node. 2. An $O(\log m\log^3 n)$-round randomized algorithm that achieves almost-everywhere agreement with high probability under up to $ε\sqrt{n}$ churn per round (for some small $ε> 0$), where $m$ is the size of the input value domain, that works even under an adaptive adversary (that also knows the past random choices made by the algorithm). This algorithm requires up to polynomial in $n$ bits (and up to $O(\log m)$ bits) to be processed and sent (per round) by each node.

preprint2014arXiv

The Melbourne Shuffle: Improving Oblivious Storage in the Cloud

We present a simple, efficient, and secure data-oblivious randomized shuffle algorithm. This is the first secure data-oblivious shuffle that is not based on sorting. Our method can be used to improve previous oblivious storage solutions for network-based outsourcing of data.

preprint2013arXiv

A Clustering Approach to Solving Large Stochastic Matching Problems

In this work we focus on efficient heuristics for solving a class of stochastic planning problems that arise in a variety of business, investment, and industrial applications. The problem is best described in terms of future buy and sell contracts. By buying less reliable, but less expensive, buy (supply) contracts, a company or a trader can cover a position of more reliable and more expensive sell contracts. The goal is to maximize the expected net gain (profit) by constructing a dose to optimum portfolio out of the available buy and sell contracts. This stochastic planning problem can be formulated as a two-stage stochastic linear programming problem with recourse. However, this formalization leads to solutions that are exponential in the number of possible failure combinations. Thus, this approach is not feasible for large scale problems. In this work we investigate heuristic approximation techniques alleviating the efficiency problem. We primarily focus on the clustering approach and devise heuristics for finding clusterings leading to good approximations. We illustrate the quality and feasibility of the approach through experimental data.

preprint2013arXiv

Accurate Computation of Survival Statistics in Genome-wide Studies

A key challenge in genomics is to identify genetic variants that distinguish patients with different survival time following diagnosis or treatment. While the log-rank test is widely used for this purpose, nearly all implementations of the log-rank test rely on an asymptotic approximation that is not appropriate in many genomics applications. This is because: the two populations determined by a genetic variant may have very different sizes; and the evaluation of many possible variants demands highly accurate computation of very small p-values. We demonstrate this problem for cancer genomics data where the standard log-rank test leads to many false positive associations between somatic mutations and survival time. We develop and analyze a novel algorithm, Exact Log-rank Test (ExaLT), that accurately computes the p-value of the log-rank statistic under an exact distribution that is appropriate for any size populations. We demonstrate the advantages of ExaLT on data from published cancer genomics studies, finding significant differences from the reported p-values. We analyze somatic mutations in six cancer types from The Cancer Genome Atlas (TCGA), finding mutations with known association to survival as well as several novel associations. In contrast, standard implementations of the log-rank test report dozens-hundreds of likely false positive associations as more significant than these known associations.

preprint2013arXiv

Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees

The tasks of extracting (top-$K$) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need of scanning the entire dataset, possibly multiple times. High quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of \emph{Vapnik-Chervonenkis (VC) dimension} to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-$K$) FI's and AR's. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a proof that the VC-dimension of this range space is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call \emph{d-index}, and is the maximum integer $d$ such that the dataset contains at least $d$ transactions of length at least $d$ such that no one of them is a superset of or equal to another. We show that this bound is strict for a large class of datasets.

preprint2013arXiv

Storage and Search in Dynamic Peer-to-Peer Networks

We study robust and efficient distributed algorithms for searching, storing, and maintaining data in dynamic Peer-to-Peer (P2P) networks. P2P networks are highly dynamic networks that experience heavy node churn (i.e., nodes join and leave the network continuously over time). Our goal is to guarantee, despite high node churn rate, that a large number of nodes in the network can store, retrieve, and maintain a large number of data items. Our main contributions are fast randomized distributed algorithms that guarantee the above with high probability (whp) even under high adversarial churn: 1. A randomized distributed search algorithm that (whp) guarantees that searches from as many as $n - o(n)$ nodes ($n$ is the stable network size) succeed in ${O}(\log n)$-rounds despite ${O}(n/\log^{1+δ} n)$ churn, for any small constant $δ> 0$, per round. We assume that the churn is controlled by an oblivious adversary (that has complete knowledge and control of what nodes join and leave and at what time, but is oblivious to the random choices made by the algorithm). 2. A storage and maintenance algorithm that guarantees (whp) data items can be efficiently stored (with only $Θ(\log{n})$ copies of each data item) and maintained in a dynamic P2P network with churn rate up to ${O}(n/\log^{1+δ} n)$ per round. Our search algorithm together with our storage and maintenance algorithm guarantees that as many as $n - o(n)$ nodes can efficiently store, maintain, and search even under ${O}(n/\log^{1+δ} n)$ churn per round. Our algorithms require only polylogarithmic in $n$ bits to be processed and sent (per round) by each node. To the best of our knowledge, our algorithms are the first-known, fully-distributed storage and search algorithms that provably work under highly dynamic settings (i.e., high churn rates per step).

preprint2011arXiv

Infectious Random Walks

We study the dynamics of information (or virus) dissemination by $m$ mobile agents performing independent random walks on an $n$-node grid. We formulate our results in terms of two scenarios: broadcasting and gossiping. In the broadcasting scenario, the mobile agents are initially placed uniformly at random among the grid nodes. At time 0, one agent is informed of a rumor and starts a random walk. When an informed agent meets an uninformed agent, the latter becomes informed and starts a new random walk. We study the broadcasting time of the system, that is, the time it takes for all agents to know the rumor. In the gossiping scenario, each agent is given a distinct rumor at time 0 and all agents start random walks. When two agents meet, they share all rumors they are aware of. We study the gossiping time of the system, that is, the time it takes for all agents to know all rumors. We prove that both the broadcasting and the gossiping times are $\tildeΘ(n/\sqrt{m})$ w.h.p., thus achieving a tight characterization up to logarithmic factors. Previous results for the grid provided bounds which were weaker and only concerned average times. In the context of virus infection, a corollary of our results is that static and dynamically moving agents are infected at about the same speed.

preprint2011arXiv

Space-Round Tradeoffs for MapReduce Computations

This work explores fundamental modeling and algorithmic issues arising in the well-established MapReduce framework. First, we formally specify a computational model for MapReduce which captures the functional flavor of the paradigm by allowing for a flexible use of parallelism. Indeed, the model diverges from a traditional processor-centric view by featuring parameters which embody only global and local memory constraints, thus favoring a more data-centric view. Second, we apply the model to the fundamental computation task of matrix multiplication presenting upper and lower bounds for both dense and sparse matrix multiplication, which highlight interesting tradeoffs between space and round complexity. Finally, building on the matrix multiplication results, we derive further space-round tradeoffs on matrix inversion and matching.

preprint2011arXiv

The VC-Dimension of Queries and Selectivity Estimation Through Sampling

We develop a novel method, based on the statistical concept of the Vapnik-Chervonenkis dimension, to evaluate the selectivity (output cardinality) of SQL queries - a crucial step in optimizing the execution of large scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound to the VC-dimension of a range space defined by all possible outcomes of a collection (class) of queries. We prove that the VC-dimension is a function of the maximum number of Boolean operations in the selection predicate and of the maximum number of select and join operations in any individual query in the collection, but it is neither a function of the number of queries in the collection nor of the size (number of tuples) of the database. We leverage on this result and develop a method that, given a class of queries, builds a concise random sample of a database, such that with high probability the execution of any query in the class on the sample provides an accurate estimate for the selectivity of the query on the original large database. The error probability holds simultaneously for the selectivity estimates of all queries in the collection, thus the same sample can be used to evaluate the selectivity of multiple queries, and the sample needs to be refreshed only following major changes in the database. The sample representation computed by our method is typically sufficiently small to be stored in main memory. We present extensive experimental results, validating our theoretical analysis and demonstrating the advantage of our technique when compared to complex selectivity estimation techniques used in PostgreSQL and the Microsoft SQL Server.

preprint2011arXiv

Tight Bounds on Information Dissemination in Sparse Mobile Networks

Motivated by the growing interest in mobile systems, we study the dynamics of information dissemination between agents moving independently on a plane. Formally, we consider $k$ mobile agents performing independent random walks on an $n$-node grid. At time $0$, each agent is located at a random node of the grid and one agent has a rumor. The spread of the rumor is governed by a dynamic communication graph process ${G_t(r) | t \geq 0}$, where two agents are connected by an edge in $G_t(r)$ iff their distance at time $t$ is within their transmission radius $r$. Modeling the physical reality that the speed of radio transmission is much faster than the motion of the agents, we assume that the rumor can travel throughout a connected component of $G_t$ before the graph is altered by the motion. We study the broadcast time $T_B$ of the system, which is the time it takes for all agents to know the rumor. We focus on the sparse case (below the percolation point $r_c \approx \sqrt{n/k}$) where, with high probability, no connected component in $G_t$ has more than a logarithmic number of agents and the broadcast time is dominated by the time it takes for many independent random walks to meet each other. Quite surprisingly, we show that for a system below the percolation point the broadcast time does not depend on the relation between the mobility speed and the transmission radius. In fact, we prove that $T_B = \tilde{O}(n / \sqrt{k})$ for any $0 \leq r < r_c$, even when the transmission range is significantly larger than the mobility range in one step, giving a tight characterization up to logarithmic factors. Our result complements a recent result of Peres et al. (SODA 2011) who showed that above the percolation point the broadcast time is polylogarithmic in $k$.

preprint2010arXiv

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. We present extensive experimental results to substantiate the effectiveness of our methodology.

preprint2010arXiv

MADMX: A Novel Strategy for Maximal Dense Motif Extraction

We develop, analyze and experiment with a new tool, called MADMX, which extracts frequent motifs, possibly including don't care characters, from biological sequences. We introduce density, a simple and flexible measure for bounding the number of don't cares in a motif, defined as the ratio of solid (i.e., different from don't care) characters to the total length of the motif. By extracting only maximal dense motifs, MADMX reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by MADMX

preprint2010arXiv

Mining Top-K Frequent Itemsets Through Progressive Sampling

We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this purpose, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real bench- mark datasets show that using samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) enable to approximate the actual top-K frequent itemsets with accuracy much higher than what analytically proved.

Eli Upfal

What is connected

Connect this record

See the researcher in context

Building this map preview

25 published item(s)

Balanced allocation: considerations from large scale service environments

How Inclusive Are Wikipedia's Hyperlinks in Articles Covering Polarizing Topics?

RePBubLik: Reducing the Polarized Bubble Radius with Link Insertions

ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages

Controlling False Discoveries During Interactive Data Exploration

Scalable Betweenness Centrality Maximization via Sampling

TRIÈST: Counting Local and Global Triangles in Fully-dynamic Streams with Fixed Memory Size

A Practical Parallel Algorithm for Diameter Approximation of Massive Weighted Graphs

Fast Distributed PageRank Computation

Optimizing Static and Adaptive Probing Schedules for Rapid Event Detection

Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation

Wiggins: Detecting Valuable Information in Dynamic Networks Using Limited Resources

Distributed Agreement in Dynamic Peer-to-Peer Networks

The Melbourne Shuffle: Improving Oblivious Storage in the Cloud

A Clustering Approach to Solving Large Stochastic Matching Problems

Accurate Computation of Survival Statistics in Genome-wide Studies

Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees

Storage and Search in Dynamic Peer-to-Peer Networks

Infectious Random Walks

Space-Round Tradeoffs for MapReduce Computations

The VC-Dimension of Queries and Selectivity Estimation Through Sampling

Tight Bounds on Information Dissemination in Sparse Mobile Networks

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

MADMX: A Novel Strategy for Maximal Dense Motif Extraction

Mining Top-K Frequent Itemsets Through Progressive Sampling